Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Greg Nancarrow
On Fri, Feb 5, 2021 at 5:21 PM Amit Langote  wrote:
>
>
> BTW, the original query's cteList is copied into sub_action query but
> not into rule_action for reasons I haven't looked very closely into,
> even though we'd like to ultimately set the latter's hasModifyingCTE
> to reflect the original query's, right?  So we should do the following
> at some point before returning:
>
> if (sub_action->hasModifyingCTE)
> rule_action->hasModifyingCTE = true;
>

Actually, rule_action will usually point to sub_action (in which case,
no need to copy to rule_action), except if the rule action is an
INSERT...SELECT, which seems to be handled by some "kludge" according
to the following comment (and KLUDGE ALERT comment in the function
that is called):

/*
 * Adjust rule action and qual to offset its varnos, so that we can merge
 * its rtable with the main parsetree's rtable.
 *
 * If the rule action is an INSERT...SELECT, the OLD/NEW rtable entries
 * will be in the SELECT part, and we have to modify that rather than the
 * top-level INSERT (kluge!).
 */
sub_action = getInsertSelectQuery(rule_action, &sub_action_ptr);

So in that case (sub_action_ptr != NULL), within rule_action there is
a pointer to sub_action (the RTE for the subquery), so whenever
sub_action is re-created, this pointer needs to be fixed up.
It looks like I might need to copy hasModifyingCTE back to rule_action
in this case, but I'm not 100% sure yet; I'm still checking. All
tests run so far pass without doing that, though.
This is one reason for my original approach (though I admit, it was
not optimal) because at least it was reliable and detected the
modifyingCTE after all the rewriting and kludgy code had finished.
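
To illustrate the fix-up hazard in isolation, here is a minimal
standalone C sketch (not PostgreSQL code; all names are made up): a
container holds a pointer to an inner node, and when the inner node is
rebuilt, the stored pointer must be updated through the slot the
caller handed in, or the container keeps pointing at the stale node.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the Query nodes: "Container" embeds a
 * pointer to its sub-node, as rule_action does after
 * getInsertSelectQuery(). */
typedef struct Node { int hasModifyingCTE; } Node;
typedef struct Container { Node *sub; } Container;

/* Mimics the rewrite step: a new sub-node is built, and when the caller
 * handed in the embedded pointer slot, it is fixed up to point at the
 * new node.  Without the fix-up, the container never sees flags set on
 * the new node. */
static Node *rebuild_sub_action(Node **sub_ptr, int parse_has_modifying_cte)
{
    Node *new_sub = (Node *) malloc(sizeof(Node));

    new_sub->hasModifyingCTE = parse_has_modifying_cte;
    if (sub_ptr)
        *sub_ptr = new_sub;     /* the fix-up */
    return new_sub;
}
```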

Regards,
Greg Nancarrow
Fujitsu Australia




There doesn't seem to be any case where PQputCopyEnd() returns 0

2021-02-04 Thread Kasahara Tatsuhito
Hi,

The following is written in the comments of PQputCopyEnd().

 (snip)
 * Returns 1 if successful, 0 if data could not be sent (only possible
 * in nonblock mode), or -1 if an error occurs.
 (snip)

The PQputCopyEnd() section of the manual (libpq.sgml) describes the following.

   The result is 1 if the termination message was sent; or in
   nonblocking mode, this may only indicate that the termination
   message was successfully queued.  (In nonblocking mode, to be
   certain that the data has been sent, you should next wait for
   write-ready and call PQflush, repeating until it
   returns zero.)  Zero indicates that the function could not queue
   the termination message because of full buffers; this will only
   happen in nonblocking mode.  (In this case, wait for
   write-ready and try the PQputCopyEnd call
   again.)  If a hard error occurs, -1 is returned; you can use
   PQerrorMessage to retrieve details.


These say that 0 may be returned when nonblocking mode is used, but
there doesn't seem to be any case where 0 is returned in the code of
PQputCopyEnd().

I may have missed something, but is it a mistake in the comments or
documentation?

Or should it return 0 when queueing the COPY end message fails
in nonblocking mode, like this?

@@ -2370,7 +2370,7 @@ PQputCopyEnd(PGconn *conn, const char *errormsg)
/* Send COPY DONE */
if (pqPutMsgStart('c', false, conn) < 0 ||
pqPutMsgEnd(conn) < 0)
-   return -1;
+   return pqIsnonblocking(conn) ? 0 : -1;
}

/*
@@ -2399,7 +2399,7 @@ PQputCopyEnd(PGconn *conn, const char *errormsg)
if (pqPutMsgStart(0, false, conn) < 0 ||
pqPutnchar("\\.\n", 3, conn) < 0 ||
pqPutMsgEnd(conn) < 0)
-   return -1;
+   return pqIsnonblocking(conn) ? 0 : -1;
}
}
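
For what it's worth, here is a minimal standalone sketch of how a
nonblocking caller would consume the proposed 0 return: treat it as
"could not queue, wait for write-ready and retry". This is not libpq
code; put_copy_end_stub() is a made-up stand-in that fakes a full
output buffer for the first two attempts.

```c
#include <assert.h>

/* Stand-in for PQputCopyEnd() under the proposed behavior: 0 while the
 * output buffer is full (nonblocking mode only), 1 once the termination
 * message has been queued, -1 on a hard error (not simulated here). */
static int fake_buffer_full = 2;

static int put_copy_end_stub(void)
{
    if (fake_buffer_full > 0)
    {
        fake_buffer_full--;
        return 0;               /* could not queue the termination message */
    }
    return 1;                   /* termination message queued */
}

/* The caller-side loop the documentation describes: on 0, wait for the
 * socket to become write-ready (elided in this sketch) and try again. */
static int end_copy_retrying(int *attempts)
{
    int rc;

    *attempts = 0;
    do
    {
        rc = put_copy_end_stub();
        (*attempts)++;
        /* real code would wait for write-ready here before retrying */
    } while (rc == 0);
    return rc;
}
```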


Best regards,

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-04 Thread Michael Paquier
On Fri, Feb 05, 2021 at 08:22:17AM +0100, Ronan Dunklau wrote:
> I'm not surprised by this answer, the good news is it's being back-patched. 

Yes, I have no problem with that.  Until this gets fixed, the damage
can be limited with an extra ALTER INDEX, that takes a
ShareUpdateExclusiveLock so there is no impact on the concurrent
activity.

> Looks good to me ! Thank you.

Thanks for looking at it.  Tomas, do you have any comments?
--
Michael


signature.asc
Description: PGP signature


Re: Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-04 Thread Ronan Dunklau
Le vendredi 5 février 2021, 03:17:48 CET Michael Paquier a écrit :
> ConstructTupleDescriptor() does not matter much, but this patch is not
> acceptable to me as it touches the area of the index creation while
> statistics on an index expression can only be changed with a special
> flavor of ALTER INDEX with column numbers.  This would imply an ABI
> breakage, so it cannot be backpatched as-is.

I'm not surprised by this answer; the good news is that it's being back-patched.

> 
> Let's copy this data in index_concurrently_swap() instead.  The
> attached patch does that, and adds a test cheaper than what was
> proposed.  There is a minor release planned for next week, so it may be
> better to wait until after that so we have enough time to agree on a
> solution.

Looks good to me! Thank you.

-- 
Ronan Dunklau






RE: Single transaction in the tablesync worker?

2021-02-04 Thread osumi.takami...@fujitsu.com
Hello



On Friday, February 5, 2021 2:23 PM Amit Kapila 
> On Fri, Feb 5, 2021 at 7:09 AM Peter Smith  wrote:
> >
> > On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila 
> wrote:
> > >
> > ...
> >
> > > Thanks. I have fixed one of the issues reported by me earlier [1]
> > > wherein the tablesync worker can repeatedly fail if after dropping
> > > the slot there is an error while updating the SYNCDONE state in the
> > > database. I have moved the drop of the slot just before commit of
> > > the transaction where we are marking the state as SYNCDONE.
> > > Additionally, I have removed unnecessary includes in tablesync.c,
> > > updated the docs for Alter Subscription, and updated the comments at
> > > various places in the patch. I have also updated the commit message this
> time.
> > >
> >
> > Below are my feedback comments for V17 (nothing functional)
> >
> > ~~
> >
> > 1.
> > V27 Commit message:
> > For the initial table data synchronization in logical replication, we
> > use a single transaction to copy the entire table and then
> > synchronizes the position in the stream with the main apply worker.
> >
> > Typo:
> > "synchronizes" -> "synchronize"
> >
> 
> Fixed and added a note about Alter Sub .. Refresh .. command can't be
> executed in the transaction block.
Thank you for the updates.

We need to add some tests to prove that the new checks in AlterSubscription() work.
I chose TAP tests, as we need to set connect = true for the subscription.
Please make use of this if it can contribute to the development.
I used v28 to check my patch, and it works as we expect.


Best Regards,
Takamichi Osumi


AlterSubscription_with_refresh_tests.patch
Description: AlterSubscription_with_refresh_tests.patch


Re: pg_replication_origin_drop API potential race condition

2021-02-04 Thread Amit Kapila
On Fri, Feb 5, 2021 at 9:46 AM Peter Smith  wrote:
>
> PSA patch updated per above suggestions.
>

Thanks, I have tested your patch and before the patch, I was getting
errors like "tuple concurrently deleted" or "cache lookup failed for
replication origin with oid 1" and after the patch, I am getting
"replication origin "origin-1" does not exist" which is clearly better
and user-friendly.

Before Patch
postgres=# select pg_replication_origin_drop('origin-1');
ERROR:  tuple concurrently deleted
postgres=# select pg_replication_origin_drop('origin-1');
ERROR:  cache lookup failed for replication origin with oid 1

After Patch
postgres=# select pg_replication_origin_drop('origin-1');
ERROR:  replication origin "origin-1" does not exist

I wonder why you haven't changed the usage of the existing
replorigin_drop in the code. I have changed that, added a few
comments, ran pgindent, and updated the commit message in the
attached.

I am not completely sure whether we should retire replorigin_drop or
just keep it for backward compatibility. What do you think? Does
anybody else have an opinion?

For others, the purpose of this patch is to make
pg_replication_origin_drop safe against concurrent drops. Currently,
we get the origin id from the name and then drop the origin by taking
ExclusiveLock on ReplicationOriginRelationId. So, two concurrent
sessions can get the id from the name at the same time, and then when
they try to drop the origin, one of the sessions will get either
"tuple concurrently deleted" or "cache lookup failed for replication
origin ..".

To prevent this race condition we do the entire operation under lock.
This obviates the need for replorigin_drop() API but we have kept it
for backward compatibility.
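
To sketch the idea in isolation (toy C, not the server code): the
name-to-entry lookup and the delete are done as one indivisible step,
so a concurrent caller can only observe "exists" or "does not exist",
never a half-dropped entry. In the real patch the indivisibility comes
from holding the lock on ReplicationOriginRelationId across both
steps.

```c
#include <assert.h>
#include <string.h>

/* A toy "catalog" of replication origins: name plus a valid flag. */
#define NORIGINS 4
static struct
{
    const char *name;
    int         valid;
} origins[NORIGINS] = {
    {"origin-1", 1}, {"origin-2", 1}, {NULL, 0}, {NULL, 0},
};

/* Lookup and delete as one operation.  In the server this whole function
 * would run under an ExclusiveLock, so a second session either finds the
 * entry (and drops it) or gets a clean "does not exist" -- never a stale
 * id that later fails with "tuple concurrently deleted". */
static int drop_origin_by_name(const char *name)
{
    int i;

    for (i = 0; i < NORIGINS; i++)
    {
        if (origins[i].valid && strcmp(origins[i].name, name) == 0)
        {
            origins[i].valid = 0;   /* drop while the "lock" is held */
            return 1;               /* dropped */
        }
    }
    return 0;                       /* replication origin does not exist */
}
```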

-- 
With Regards,
Amit Kapila.


v3-0001-Make-pg_replication_origin_drop-safe-against-conc.patch
Description: Binary data


Re: logical replication worker accesses catalogs in error context callback

2021-02-04 Thread Bharath Rupireddy
On Thu, Feb 4, 2021 at 5:42 PM Amit Kapila  wrote:
> > Thanks Amit. I verified it with gdb. I attached gdb to the logical
> > replication worker. In slot_store_data's for loop, I intentionally set
> > CurrentTransactionState->state = TRANS_DEFAULT,
> >
>
> What happens if you don't change CurrentTransactionState->state?

Just a refresher: the problem we are trying to solve with the two
patches proposed in this thread [1] is that error callbacks shouldn't
be accessing system catalogs, because the transaction might be broken
by then, or another error can be raised in the catalog access paths
(if we have a cache miss and have to read the catalog table from
disk). So the information needed in the error callback has to be
filled in by the function that registers the callback.
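
As a minimal illustration of that rule (toy C, not the actual
slot_store_error_callback), the registering code captures every string
the callback will need, and the callback only formats what was
captured, so it stays safe even when the catalogs are unreachable:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Everything the callback needs, resolved up front by the code that
 * registers it -- no catalog lookups are deferred to error time. */
typedef struct ErrCtx
{
    const char *relname;        /* pre-resolved relation name */
    const char *typname;        /* pre-resolved type name */
    int         attnum;
} ErrCtx;

/* The callback body: formats only what is already in ctx. */
static void format_context(const ErrCtx *ctx, char *buf, size_t len)
{
    snprintf(buf, len, "processing relation \"%s\" column %d, type %s",
             ctx->relname, ctx->attnum, ctx->typname);
}
```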

Now, coming to testing the patches: I tried to use the error callback
when we are outside of the xact. For the 0001 patch, i.e.
slot_store_error_callback, I called it after
apply_handle_commit_internal in apply_handle_commit. The expectation
is that the patches proposed in this thread [1] should be usable
even outside of the xact.

HEAD: The logical replication worker crashes at
Assert(IsTransactionState()) in SearchCatCacheInternal. See [2] for
the testing patch that's used.

Patched: The testing DEBUG message gets printed in the subscriber
server log and no crash happens. See [3] for the testing patch that's
used.
2021-02-05 12:02:32.340 IST [2361724] DEBUG:  calling slot_store_error_callback
2021-02-05 12:02:32.340 IST [2361724] CONTEXT:  processing remote data
for replication target relation "public.t1" column "a1", remote type
remote_test_type, local type local_test_type

Similarly, the 0002 patch in [1], which avoids system catalog access
in postgres_fdw's conversion_error_callback, can also be tested.

Hope this testing is enough for the patches. Please let me know if
anything more is required.

[1] - 
https://www.postgresql.org/message-id/CALj2ACUkvABzq6yeKQZsgng5Rd_NF%2BikhTvL9k4NR0GSRKsSdg%40mail.gmail.com
[2] -
@@ -713,6 +717,26 @@ apply_handle_commit(StringInfo s)
  apply_handle_commit_internal(s, &commit_data);
 +if (rel_for_err_cb_chk)
+{
+/* We are out of xact. */
+Assert(!IsTransactionState());
+
+/* Push callback + info on the error context stack */
+errarg.rel = rel_for_err_cb_chk;
+errarg.local_attnum = 0;
+errarg.remote_attnum = 0;
+errcallback.callback = slot_store_error_callback;
+errcallback.arg = (void *) &errarg;
+errcallback.previous = error_context_stack;
+error_context_stack = &errcallback;
+
+elog(DEBUG1, "calling slot_store_error_callback");
+
+/* Pop the error context stack */
+error_context_stack = errcallback.previous;
+rel_for_err_cb_chk = NULL;
+}
[3]
 @@ -719,6 +724,26 @@ apply_handle_commit(StringInfo s)
  apply_handle_commit_internal(s, &commit_data);
 +if (rel_for_err_cb_chk)
+{
+/* We are out of xact. */
+Assert(!IsTransactionState());
+errarg.rel = rel_for_err_cb_chk;
+errarg.remote_attnum = 0;
+errarg.local_typename = "local_test_type";
+errarg.remote_typename = "remote_test_type";
+errcallback.callback = slot_store_error_callback;
+errcallback.arg = (void *) &errarg;
+errcallback.previous = error_context_stack;
+error_context_stack = &errcallback;
+
+elog(DEBUG1, "calling slot_store_error_callback");
+
+/* Pop the error context stack */
+error_context_stack = errcallback.previous;
+rel_for_err_cb_chk = NULL;
+}

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Support for NSS as a libpq TLS backend

2021-02-04 Thread Michael Paquier
On Thu, Feb 04, 2021 at 06:35:28PM +, Jacob Champion wrote:
> Ah, sorry about that. Is there an extension I can use (or lack thereof)
> that the CF bot will ignore, or does it scan the attachment contents?

The thing is smart, but there are ways to bypass it.  Here is the
code:
https://github.com/macdice/cfbot/

And here are the patterns looked at:
cfbot_commitfest_rpc.py:groups = re.search('',
line)
--
Michael


signature.asc
Description: PGP signature


Re: Correct comment in StartupXLOG().

2021-02-04 Thread Michael Paquier
On Thu, Feb 04, 2021 at 12:58:29PM +0530, Dilip Kumar wrote:
> Looks good to me.

Rather than using the term "recovery state", I would just use
SharedRecoveryState.  This leads me to the attached.
--
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f03bd473e2..8e3b5df7dc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7998,17 +7998,16 @@ StartupXLOG(void)
 	 * All done with end-of-recovery actions.
 	 *
 	 * Now allow backends to write WAL and update the control file status in
-	 * consequence.  The boolean flag allowing backends to write WAL is
-	 * updated while holding ControlFileLock to prevent other backends to look
-	 * at an inconsistent state of the control file in shared memory.  There
-	 * is still a small window during which backends can write WAL and the
-	 * control file is still referring to a system not in DB_IN_PRODUCTION
+	 * consequence.  SharedRecoveryState, that controls if backends can write
+	 * WAL, is updated while holding ControlFileLock to prevent other backends
+	 * to look at an inconsistent state of the control file in shared memory.
+	 * There is still a small window during which backends can write WAL and
+	 * the control file is still referring to a system not in DB_IN_PRODUCTION
 	 * state while looking at the on-disk control file.
 	 *
-	 * Also, although the boolean flag to allow WAL is probably atomic in
-	 * itself, we use the info_lck here to ensure that there are no race
-	 * conditions concerning visibility of other recent updates to shared
-	 * memory.
+	 * Also, we use info_lck to update SharedRecoveryState to ensure that
+	 * there are no race conditions concerning visibility of other recent
+	 * updates to shared memory.
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	ControlFile->state = DB_IN_PRODUCTION;


signature.asc
Description: PGP signature


Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Amit Langote
On Fri, Feb 5, 2021 at 2:55 PM Greg Nancarrow  wrote:
> On Fri, Feb 5, 2021 at 4:25 PM Hou, Zhijie  wrote:
> > > > That is very close to what I was going to suggest, which is this:
> > > >
> > > > diff --git a/src/backend/rewrite/rewriteHandler.c
> > > > b/src/backend/rewrite/rewriteHandler.c
> > > > index 0672f497c6..3c4417af98 100644
> > > > --- a/src/backend/rewrite/rewriteHandler.c
> > > > +++ b/src/backend/rewrite/rewriteHandler.c
> > > > @@ -631,6 +631,8 @@ rewriteRuleAction(Query *parsetree,
> > > > checkExprHasSubLink((Node *)
> > > > rule_action->returningList);
> > > > }
> > > >
> > > > +   rule_action->hasModifyingCTE |= parsetree->hasModifyingCTE;
> > > > +
> > > > return rule_action;
> > > >  }
> > >
> > >
> > >   if (parsetree->cteList != NIL && sub_action->commandType !=
> > > CMD_UTILITY)
> > >   {
> > >   ...
> > >   sub_action->cteList = list_concat(sub_action->cteList,
> > >   }
> > >
> > > Is is possible when sub_action is CMD_UTILITY ?
> > > In this case CTE will be copied to the newone, should we set the set the
> > > flag in this case ?
> >
> > Sorry , a typo in my word.
> > In this case CTE will not be copied to the newone, should we set the set 
> > the flag in this case ?
> >
>
> No, strictly speaking, we probably shouldn't, because the CTE wasn't
> copied in that case.

Right.

> Also, I know the bitwise OR "works" in this case, but I think some
> will frown on use of that for a bool.
> IMHO better to use:
>
>if (parsetree->hasModifyingCTE)
>rule_action->hasModifyingCTE = true;
>
> So patch might be something like:
>
> diff --git a/src/backend/rewrite/rewriteHandler.c
> b/src/backend/rewrite/rewriteHandler.c
> index 0672f497c6..a989e02925 100644
> --- a/src/backend/rewrite/rewriteHandler.c
> +++ b/src/backend/rewrite/rewriteHandler.c
> @@ -557,6 +557,8 @@ rewriteRuleAction(Query *parsetree,
>  /* OK, it's safe to combine the CTE lists */
>  sub_action->cteList = list_concat(sub_action->cteList,
>copyObject(parsetree->cteList));
> +if (parsetree->hasModifyingCTE)
> +sub_action->hasModifyingCTE = true;
>  }
>
>  /*
> @@ -594,6 +596,9 @@ rewriteRuleAction(Query *parsetree,
>  *sub_action_ptr = sub_action;
>  else
>  rule_action = sub_action;
> +
> +if (parsetree->hasModifyingCTE)
> +sub_action->hasModifyingCTE = true;
>  }

That may be better.

BTW, the original query's cteList is copied into sub_action query but
not into rule_action for reasons I haven't looked very closely into,
even though we'd like to ultimately set the latter's hasModifyingCTE
to reflect the original query's, right?  So we should do the following
at some point before returning:

if (sub_action->hasModifyingCTE)
rule_action->hasModifyingCTE = true;

> I'll do some further checks, because the rewriting is recursive and
> tricky, so don't want to miss any cases ...

Always a good idea.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Greg Nancarrow
On Fri, Feb 5, 2021 at 4:25 PM Hou, Zhijie  wrote:
>
> > >
> > > That is very close to what I was going to suggest, which is this:
> > >
> > > diff --git a/src/backend/rewrite/rewriteHandler.c
> > > b/src/backend/rewrite/rewriteHandler.c
> > > index 0672f497c6..3c4417af98 100644
> > > --- a/src/backend/rewrite/rewriteHandler.c
> > > +++ b/src/backend/rewrite/rewriteHandler.c
> > > @@ -631,6 +631,8 @@ rewriteRuleAction(Query *parsetree,
> > > checkExprHasSubLink((Node *)
> > > rule_action->returningList);
> > > }
> > >
> > > +   rule_action->hasModifyingCTE |= parsetree->hasModifyingCTE;
> > > +
> > > return rule_action;
> > >  }
> >
> >
> >   if (parsetree->cteList != NIL && sub_action->commandType !=
> > CMD_UTILITY)
> >   {
> >   ...
> >   sub_action->cteList = list_concat(sub_action->cteList,
> >   }
> >
> > Is is possible when sub_action is CMD_UTILITY ?
> > In this case CTE will be copied to the newone, should we set the set the
> > flag in this case ?
>
> Sorry , a typo in my word.
> In this case CTE will not be copied to the newone, should we set the set the 
> flag in this case ?
>

No, strictly speaking, we probably shouldn't, because the CTE wasn't
copied in that case.
Also, I know the bitwise OR "works" in this case, but I think some
will frown on using that for a bool.
IMHO it's better to use:

   if (parsetree->hasModifyingCTE)
   rule_action->hasModifyingCTE = true;

So patch might be something like:

diff --git a/src/backend/rewrite/rewriteHandler.c
b/src/backend/rewrite/rewriteHandler.c
index 0672f497c6..a989e02925 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -557,6 +557,8 @@ rewriteRuleAction(Query *parsetree,
 /* OK, it's safe to combine the CTE lists */
 sub_action->cteList = list_concat(sub_action->cteList,
   copyObject(parsetree->cteList));
+if (parsetree->hasModifyingCTE)
+sub_action->hasModifyingCTE = true;
 }

 /*
@@ -594,6 +596,9 @@ rewriteRuleAction(Query *parsetree,
 *sub_action_ptr = sub_action;
 else
 rule_action = sub_action;
+
+if (parsetree->hasModifyingCTE)
+sub_action->hasModifyingCTE = true;
 }

 /*

I'll do some further checks, because the rewriting is recursive and
tricky, and I don't want to miss any cases ...

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Transactions involving multiple postgres foreign servers, take 2

2021-02-04 Thread Masahiko Sawada
On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao  wrote:
>
>
>
> On 2021/01/27 14:08, Masahiko Sawada wrote:
> > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
> >  wrote:
> >>
> >>
> >> You fixed some issues. But maybe you forgot to attach the latest patches?
> >
> > Yes, I've attached the updated patches.
>
> Thanks for updating the patch! I tried to review 0001 and 0002 as the 
> self-contained change.
>
> + * An FDW that implements both commit and rollback APIs can request to 
> register
> + * the foreign transaction by FdwXactRegisterXact() to participate it to a
> + * group of distributed tranasction.  The registered foreign transactions are
> + * identified by OIDs of server and user.
>
> I'm afraid that the combination of OIDs of server and user is not unique. 
> IOW, more than one foreign transactions can have the same combination of OIDs 
> of server and user. For example, the following two SELECT queries start the 
> different foreign transactions but their user OID is the same. OID of user 
> mapping should be used instead of OID of user?
>
>  CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw;
>  CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 
> 'postgres');
>  CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres');
>  CREATE TABLE t(i int);
>  CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't');
>  BEGIN;
>  SELECT * FROM ft;
>  DROP USER MAPPING FOR postgres SERVER loopback ;
>  SELECT * FROM ft;
>  COMMIT;

Good catch. I've considered using the user mapping OID, or a pair of
user mapping OID and server OID, as the key for foreign transactions,
but that also has a problem if an FDW caches connections by the pair
of server OID and user OID while the core identifies transactions by
user mapping OID. For instance, mysql_fdw manages connections by the
pair of server OID and user OID.

For example, let's consider the following execution:

BEGIN;
SET ROLE user_A;
INSERT INTO ft1 VALUES (1);
SET ROLE user_B;
INSERT INTO ft1 VALUES (1);
COMMIT;

Suppose that an FDW identifies the connections by {server OID, user
OID} and the core GTM identifies the transactions by user mapping OID,
and user_A and user_B use the public user mapping to connect to
server_X. In the FDW, there are two connections identified by {user_A,
server_X} and {user_B, server_X} respectively, and it therefore opens
a transaction on each connection, while GTM has only one FdwXact entry
because the two connections refer to the same user mapping OID. As a
result, at the end of the transaction, GTM ends only one foreign
transaction, leaving the other one open.

Using the user mapping OID seems natural to me, but I suspect that
changing the role in the middle of a transaction is more likely to
happen than dropping the public user mapping; I'm not sure, though. We
would need to find a better way.
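
The mismatch can be made concrete with a toy sketch (made-up C, not
the patch): with two users sharing one public user mapping to the same
server, keying by {server OID, user OID} yields two entries while
keying by user mapping OID yields one, which is exactly the count
divergence described above.

```c
#include <assert.h>

typedef struct
{
    int server_oid;
    int user_oid;
    int usermapping_oid;
} Conn;

/* Count distinct connections keyed the way a mysql_fdw-style FDW does. */
static int count_by_server_user(const Conn *c, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        int dup = 0;
        for (int j = 0; j < i; j++)
            if (c[j].server_oid == c[i].server_oid &&
                c[j].user_oid == c[i].user_oid)
                dup = 1;
        if (!dup)
            count++;
    }
    return count;
}

/* Count distinct entries keyed by user mapping OID, as the core would. */
static int count_by_mapping(const Conn *c, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        int dup = 0;
        for (int j = 0; j < i; j++)
            if (c[j].usermapping_oid == c[i].usermapping_oid)
                dup = 1;
        if (!dup)
            count++;
    }
    return count;
}
```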

>
> +   /* Commit foreign transactions if any */
> +   AtEOXact_FdwXact(true);
>
> Don't we need to pass XACT_EVENT_PARALLEL_PRE_COMMIT or XACT_EVENT_PRE_COMMIT 
> flag? Probably we don't need to do this if postgres_fdw is only user of this 
> new API. But if we make this new API generic one, such flags seem necessary 
> so that some foreign data wrappers might have different behaviors for those 
> flags.
>
> Because of the same reason as above, AtEOXact_FdwXact() should also be called 
> after CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_COMMIT : 
> XACT_EVENT_COMMIT)?

Agreed.

In AtEOXact_FdwXact() we call either CommitForeignTransaction() or
RollbackForeignTransaction() with the FDWXACT_FLAG_ONEPHASE flag for
each foreign transaction. So, for example, in the commit case we will
call the new FDW APIs in the following order:

1. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_PRE_COMMIT
flag and FDWXACT_FLAG_ONEPHASE flag for each foreign transaction.
2. Commit locally.
3. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_COMMIT
flag and FDWXACT_FLAG_ONEPHASE flag for each foreign transaction.

In the future when we have a new FDW API to prepare foreign
transaction, the sequence will be:

1. Call PrepareForeignTransaction() for each foreign transaction.
2. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_PRE_COMMIT
flag for each foreign transaction.
3. Commit locally.
4. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_COMMIT
flag for each foreign transaction.

So we expect an FDW that wants to support 2PC not to commit the
foreign transaction if CommitForeignTransaction() is called with the
XACT_EVENT_PARALLEL_PRE_COMMIT flag and no FDWXACT_FLAG_ONEPHASE flag.
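
One way to read that contract is the following decision table (a
standalone C sketch with made-up names, not the proposed API): in the
one-phase case the remote commit happens at pre-commit; in the
two-phase case it is deferred until after the local commit.

```c
#include <assert.h>

enum { EVT_PRE_COMMIT, EVT_COMMIT };
#define FLAG_ONEPHASE 0x01

/* Should the FDW commit the remote transaction at this event?  One-phase
 * transactions commit at pre-commit (and do nothing at the later COMMIT
 * event); prepared (2PC) transactions wait until after the local commit. */
static int should_commit_remote(int event, int flags)
{
    if (event == EVT_PRE_COMMIT)
        return (flags & FLAG_ONEPHASE) != 0;
    if (event == EVT_COMMIT)
        return (flags & FLAG_ONEPHASE) == 0;
    return 0;
}
```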

>
> +   /*
> +* Abort foreign transactions if any.  This needs to be done before 
> marking
> +* this transaction as not running since FDW's transaction callbacks 
> might
> +* assume this transaction is still in progress.
> +*/
> +   AtEOXact_FdwXact(false);
>
> Same as above.
>
> +/*
> + * This function is called at PREPARE TRANSACTION.  Since we don't support
> + * 

Re: [HACKERS] Custom compression methods

2021-02-04 Thread Dilip Kumar
On Fri, Feb 5, 2021 at 3:51 AM Robert Haas  wrote:
>
> On Thu, Feb 4, 2021 at 11:39 AM Dilip Kumar  wrote:
> > Yeah, actually, I thought I would avoid calling slot_getallattrs if
> > none of the attributes got decompress.  I agree if we call this before
> > we can avoid calling slot_getattr but slot_getattr
> > is only called for the attribute which has attlen -1. I agree that if
> > we call slot_getattr for attnum n then it will deform all the
> > attributes before that.  But then slot_getallattrs only need to deform
> > the remaining attributes not all.  But maybe we can call the
> > slot_getallattrs as soon as we see the first attribute with attlen -1
> > and then avoid calling subsequent slot_getattr, maybe that is better
> > than compared to what I have because we will avoid calling
> > slot_getattr for many attributes, especially when there are many
> > verlena.
>
> I think that if we need to deform at all, we need to deform all
> attributes, right?

IMHO that is not true, because we might need to deform an attribute
just to check its stored compression. For example, suppose the first
attribute is a varchar and the remaining 100 attributes are integers.
Then we just need to deform the first attribute, and if its
compression method is the same as the target attribute's, we are done:
there is no need to deform the rest, and we can continue with the
original slot and tuple.

I am not saying this is a very practical example or that we have to
do it this way; I am just making the point that it is not true that if
we deform at all then we have to deform all. However, if we decompress
any attribute then we do have to deform all of them, because we need
to materialize the tuple again.
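
The early-exit idea can be sketched in a few lines of standalone C
(toy descriptors, not TupleDesc): walk the attributes, and only when a
varlena with a foreign compression method turns up does the full
deform-and-materialize path become necessary.

```c
#include <assert.h>

/* Toy attribute descriptor: attlen == -1 marks a varlena that carries a
 * compression method; fixed-length attributes are never compressed. */
typedef struct
{
    int attlen;
    int cmethod;
} Att;

/* Returns 1 as soon as any varlena uses a method other than target --
 * only then must the tuple be fully deformed and rebuilt; otherwise the
 * original slot and tuple can be used as-is. */
static int needs_recompress(const Att *atts, int natts, int target)
{
    for (int i = 0; i < natts; i++)
        if (atts[i].attlen == -1 && atts[i].cmethod != target)
            return 1;
    return 0;
}
```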

> So there's no point in considering e.g.
> slot_getsomeattrs(). But just slot_getallattrs() as soon as we know we
> need to do it might be worthwhile. Could even have two loops: one that
> just figures out whether we need to deform; if not, return. Then
> slot_getallattrs(). Then another loop to do the work.
>
> >  I think the supported procedure for this sort of
> > > thing is to have a second slot, set tts_values, tts_isnull etc. and
> > > then materialize the slot. After materializing the new slot, it's
> > > independent of the old slot, which can then be cleared. See for
> > > example tts_virtual_materialize().
> >
> > Okay, so if we take a new slot then we need to set this slot reference
> > in the ScanState also otherwise that might point to the old slot.  I
> > haven't yet analyzed where all we might be keeping the reference to
> > that old slot.  Or I am missing something.
>
> My guess is you want to leave the ScanState alone so that we keep
> fetching into the same slot as before and have an extra slot on the
> side someplace.

Okay, got your point.  Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Hou, Zhijie
> > > 
> > > static Query *
> > > rewriteRuleAction(Query *parsetree,
> > > ...
> > > if (sub_action_ptr)
> > > +   {
> > > *sub_action_ptr = sub_action;
> > > +   rule_action->hasModifyingCTE |=
> > parsetree->hasModifyingCTE;
> > > +   }
> > > 
> > >
> > > And the Basic test passed.
> > > What do you think ?
> >
> > That is very close to what I was going to suggest, which is this:
> >
> > diff --git a/src/backend/rewrite/rewriteHandler.c
> > b/src/backend/rewrite/rewriteHandler.c
> > index 0672f497c6..3c4417af98 100644
> > --- a/src/backend/rewrite/rewriteHandler.c
> > +++ b/src/backend/rewrite/rewriteHandler.c
> > @@ -631,6 +631,8 @@ rewriteRuleAction(Query *parsetree,
> > checkExprHasSubLink((Node *)
> > rule_action->returningList);
> > }
> >
> > +   rule_action->hasModifyingCTE |= parsetree->hasModifyingCTE;
> > +
> > return rule_action;
> >  }
> 
> 
>   if (parsetree->cteList != NIL && sub_action->commandType !=
> CMD_UTILITY)
>   {
>   ...
>   sub_action->cteList = list_concat(sub_action->cteList,
>   }
> 
> Is is possible when sub_action is CMD_UTILITY ?
> In this case CTE will be copied to the newone, should we set the set the
> flag in this case ?

Sorry, a typo in my wording.
In this case the CTE will not be copied to the new one; should we set
the flag in this case?

Best regards,
houzj




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Hou, Zhijie
> > 
> > static Query *
> > rewriteRuleAction(Query *parsetree,
> > ...
> > if (sub_action_ptr)
> > +   {
> > *sub_action_ptr = sub_action;
> > +   rule_action->hasModifyingCTE |=
> parsetree->hasModifyingCTE;
> > +   }
> > 
> >
> > And the Basic test passed.
> > What do you think ?
> 
> That is very close to what I was going to suggest, which is this:
> 
> diff --git a/src/backend/rewrite/rewriteHandler.c
> b/src/backend/rewrite/rewriteHandler.c
> index 0672f497c6..3c4417af98 100644
> --- a/src/backend/rewrite/rewriteHandler.c
> +++ b/src/backend/rewrite/rewriteHandler.c
> @@ -631,6 +631,8 @@ rewriteRuleAction(Query *parsetree,
> checkExprHasSubLink((Node *)
> rule_action->returningList);
> }
> 
> +   rule_action->hasModifyingCTE |= parsetree->hasModifyingCTE;
> +
> return rule_action;
>  }


if (parsetree->cteList != NIL && sub_action->commandType != CMD_UTILITY)
{
...
sub_action->cteList = list_concat(sub_action->cteList,
}

Is it possible when sub_action is CMD_UTILITY ?
In this case CTE will be copied to the newone, should we set the flag
in this case ?

Best regard,
houzj





Re: Single transaction in the tablesync worker?

2021-02-04 Thread Amit Kapila
On Fri, Feb 5, 2021 at 7:09 AM Peter Smith  wrote:
>
> On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila  wrote:
> >
> ...
>
> > Thanks. I have fixed one of the issues reported by me earlier [1]
> > wherein the tablesync worker can repeatedly fail if after dropping the
> > slot there is an error while updating the SYNCDONE state in the
> > database. I have moved the drop of the slot just before commit of the
> > transaction where we are marking the state as SYNCDONE. Additionally,
> > I have removed unnecessary includes in tablesync.c, updated the docs
> > for Alter Subscription, and updated the comments at various places in
> > the patch. I have also updated the commit message this time.
> >
>
> Below are my feedback comments for V17 (nothing functional)
>
> ~~
>
> 1.
> V27 Commit message:
> For the initial table data synchronization in logical replication, we use
> a single transaction to copy the entire table and then synchronizes the
> position in the stream with the main apply worker.
>
> Typo:
> "synchronizes" -> "synchronize"
>

Fixed, and added a note that the Alter Sub .. Refresh .. command
can't be executed inside a transaction block.

> ~~
>
> 2.
> @@ -48,6 +48,23 @@ ALTER SUBSCRIPTION  class="parameter">name RENAME TO <
> (Currently, all subscription owners must be superusers, so the owner 
> checks
> will be bypassed in practice.  But this might change in the future.)
>
> +
> +  
> +   When refreshing a publication we remove the relations that are no longer
> +   part of the publication and we also remove the tablesync slots if there 
> are
> +   any. It is necessary to remove tablesync slots so that the resources
> +   allocated for the subscription on the remote host are released. If due to
> +   network breakdown or some other error, we are not able to remove the 
> slots,
> +   we give WARNING and the user needs to manually remove such slots later as
> +   otherwise, they will continue to reserve WAL and might eventually cause
> +   the disk to fill up. See also  linkend="logical-replication-subscription-slot"/>.
> +  
>
> I think the content is good, but the 1st-person wording seemed strange.
> e.g.
> "we are not able to remove the slots, we give WARNING and the user needs..."
> Maybe it should be like:
> "... PostgreSQL is unable to remove the slots, so a WARNING is
> reported. The user needs... "
>

Changed as per suggestion with a minor tweak.

> ~~
>
> 3.
> @@ -566,107 +569,197 @@ AlterSubscription_refresh(Subscription *sub,
> bool copy_data)
> ...
> + * XXX If there is a network break down while dropping the
>
> "network break down" -> "network breakdown"
>
> ~~
>
> 4.
> @@ -872,7 +970,48 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
>   (errmsg("could not connect to the publisher: %s", err)));
> ...
> + * XXX We could also instead try to drop the slot, last time we failed
> + * but for that, we might need to clean up the copy state as it might
> + * be in the middle of fetching the rows. Also, if there is a network
> + * break down then it wouldn't have succeeded so trying it next time
> + * seems like a better bet.
>
> "network break down" -> "network breakdown"
>

Changed as per suggestion.

> ~~
>
> 5.
> @@ -269,26 +313,47 @@ invalidate_syncing_table_states(Datum arg, int
> cacheid, uint32 hashvalue)
> ...
> +
> + /*
> + * Cleanup the tablesync slot.
> + *
> + * This has to be done after updating the state because otherwise if
> + * there is an error while doing the database operations we won't be
> + * able to rollback dropped slot.
> + */
> + ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
> + MyLogicalRepWorker->relid,
> + syncslotname);
> +
> + ReplicationSlotDropAtPubNode(wrconn, syncslotname, false /* missing_ok */);
> +
>
> Should this comment also describe why the missing_ok is false for this case?
>

Yeah that makes sense, so added a comment.

Additionally, I have changed the errorcode in RemoveSubscriptionRel,
moved the setup of origin before copy_table in
LogicalRepSyncTableStart to avoid doing copy again due to an error in
setting up origin. I have made a few comment changes as well.

-- 
With Regards,
Amit Kapila.


v28-0001-Allow-multiple-xacts-during-table-sync-in-logica.patch
Description: Binary data


Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Amit Langote
Hi,

While reviewing the v14 set of patches (will send my comments
shortly), I too had some reservations on how 0001 decided to go about
setting hasModifyingCTE.

On Fri, Feb 5, 2021 at 1:51 PM Hou, Zhijie  wrote:
> > > I took a look into the hasModifyingCTE bugfix recently, and found a
> > > possible bug case without the parallel insert patch.
> > >
> > > -
> > > drop table if exists test_data1;
> > > create table test_data1(a int, b int) ; insert into test_data1 select
> > > generate_series(1,1000), generate_series(1,1000); set
> > > force_parallel_mode=on;
> > >
> > > CREATE TEMP TABLE bug6051 AS
> > >   select i from generate_series(1,3) as i;
> > >
> > > SELECT * FROM bug6051;
> > > CREATE RULE bug6051_ins AS ON INSERT TO bug6051 DO INSTEAD select a as
> > > i from test_data1;
> > >
> > > WITH t1 AS ( DELETE FROM bug6051 RETURNING * ) INSERT INTO bug6051
> > > SELECT * FROM t1;
> > >
> > > ***
> > > ***ERROR:  cannot assign XIDs during a parallel operation
> > > ***
> > > -
> > >
> > > I debugged it and it did have modifycte in the parsetree after rewrite.
> > > I think if we can properly set the hasModifyingCTE, we can avoid this
> > error by not consider parallel for this.
> > >
> >
> > Thanks. You've identified that the bug exists for SELECT too. I've verified
> > that the issue is fixed by the bugfix included in the Parallel INSERT patch.
> > Are you able to review my bugfix?
> > Since the problem exists for SELECT in the current Postgres code, I'd like
> > to pull that bugfix out and provide it as a separate fix.

+1, a separate patch for this seems better.

> > My concern is that there may well be a better way to fix the issue - for
> > example, during the re-writing, rather than after the query has been
> > re-written.
> Hi,
>
> I took a look at the fix and have some thoughts on it.
> (Please correct me if you have tried this idea and found something is wrong)
>
> IMO, the main reason for the hasModifyingCTE=false is that:
> the Rewriter did not update the  hasModifyingCTE when copying the existsing 
> 'cteList' to the rewrited one.
>
> It seems there is only one place where ctelist will be copied separately.
> ---
> static Query *
> rewriteRuleAction(Query *parsetree,
> ...
> /* OK, it's safe to combine the CTE lists */
> sub_action->cteList = list_concat(sub_action->cteList,
>   
> copyObject(parsetree->cteList));
> +   sub_action->hasModifyingCTE |= parsetree->hasModifyingCTE;
> 
>
> Based on the above, if we update the hasModifyingCTE here, we may solve this 
> problem.
>
> And there is another point here, the sub_action may be not the final 
> parsetree.
> If defined the rule like "DO INSTEAD insert into xx select xx from xx", 
> Rewriter will
> Put the ctelist into subquery in parsetree's rtable.
> In this case, we may also need to update the final parsetree too.
> (I think you know this case, I found same logic in the latest patch)
>
> 
> static Query *
> rewriteRuleAction(Query *parsetree,
> ...
> if (sub_action_ptr)
> +   {
> *sub_action_ptr = sub_action;
> +   rule_action->hasModifyingCTE |= 
> parsetree->hasModifyingCTE;
> +   }
> 
>
> And the Basic test passed.
> What do you think ?

That is very close to what I was going to suggest, which is this:

diff --git a/src/backend/rewrite/rewriteHandler.c
b/src/backend/rewrite/rewriteHandler.c
index 0672f497c6..3c4417af98 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -631,6 +631,8 @@ rewriteRuleAction(Query *parsetree,
checkExprHasSubLink((Node *) rule_action->returningList);
}

+   rule_action->hasModifyingCTE |= parsetree->hasModifyingCTE;
+
return rule_action;
 }

-- 
Amit Langote
EDB: http://www.enterprisedb.com




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Hou, Zhijie
> > I took a look into the hasModifyingCTE bugfix recently, and found a
> > possible bug case without the parallel insert patch.
> >
> > -
> > drop table if exists test_data1;
> > create table test_data1(a int, b int) ; insert into test_data1 select
> > generate_series(1,1000), generate_series(1,1000); set
> > force_parallel_mode=on;
> >
> > CREATE TEMP TABLE bug6051 AS
> >   select i from generate_series(1,3) as i;
> >
> > SELECT * FROM bug6051;
> > CREATE RULE bug6051_ins AS ON INSERT TO bug6051 DO INSTEAD select a as
> > i from test_data1;
> >
> > WITH t1 AS ( DELETE FROM bug6051 RETURNING * ) INSERT INTO bug6051
> > SELECT * FROM t1;
> >
> > ***
> > ***ERROR:  cannot assign XIDs during a parallel operation
> > ***
> > -
> >
> > I debugged it and it did have modifycte in the parsetree after rewrite.
> > I think if we can properly set the hasModifyingCTE, we can avoid this
> error by not consider parallel for this.
> >
> 
> Thanks. You've identified that the bug exists for SELECT too. I've verified
> that the issue is fixed by the bugfix included in the Parallel INSERT patch.
> Are you able to review my bugfix?
> Since the problem exists for SELECT in the current Postgres code, I'd like
> to pull that bugfix out and provide it as a separate fix.
> My concern is that there may well be a better way to fix the issue - for
> example, during the re-writing, rather than after the query has been
> re-written.
Hi,

I took a look at the fix and have some thoughts on it.
(Please correct me if you have tried this idea and found something is wrong)

IMO, the main reason for hasModifyingCTE=false is that the Rewriter did not
update hasModifyingCTE when copying the existing 'cteList' to the rewritten
one.

It seems there is only one place where the cteList is copied separately.
---
static Query *
rewriteRuleAction(Query *parsetree,
...
/* OK, it's safe to combine the CTE lists */
sub_action->cteList = list_concat(sub_action->cteList,

  copyObject(parsetree->cteList));
+   sub_action->hasModifyingCTE |= parsetree->hasModifyingCTE;


Based on the above, if we update the hasModifyingCTE here, we may solve this 
problem.

And there is another point here: sub_action may not be the final parse tree.
If the rule is defined like "DO INSTEAD insert into xx select xx from xx", the
Rewriter will put the cteList into a subquery in the parse tree's rtable.
In this case, we may also need to update the final parse tree too.
(I think you know this case; I found the same logic in the latest patch)


static Query *
rewriteRuleAction(Query *parsetree,
...
if (sub_action_ptr)
+   {
*sub_action_ptr = sub_action;
+   rule_action->hasModifyingCTE |= 
parsetree->hasModifyingCTE;
+   }


And the basic tests passed.
What do you think?

Best regards,
houzj




Re: shared tempfile was not removed on statement_timeout

2021-02-04 Thread Thomas Munro
On Sun, Jan 31, 2021 at 6:07 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > So that gives a very simple back-patchable patch.
>
> Hmm, so is the *rest* of that function perfectly okay with being
> interrupted?

It looks OK to me.  There aren't any CFI()s in there.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Greg Nancarrow
On Fri, Feb 5, 2021 at 2:58 PM Hou, Zhijie  wrote:
>
> Hi,
>
> I took a look into the hasModifyingCTE bugfix recently,
> and found a possible bug case without the parallel insert patch.
>
> -
> drop table if exists test_data1;
> create table test_data1(a int, b int) ;
> insert into test_data1 select generate_series(1,1000), 
> generate_series(1,1000);
> set force_parallel_mode=on;
>
> CREATE TEMP TABLE bug6051 AS
>   select i from generate_series(1,3) as i;
>
> SELECT * FROM bug6051;
> CREATE RULE bug6051_ins AS ON INSERT TO bug6051 DO INSTEAD select a as i from 
> test_data1;
>
> WITH t1 AS ( DELETE FROM bug6051 RETURNING * ) INSERT INTO bug6051 SELECT * 
> FROM t1;
>
> ***
> ***ERROR:  cannot assign XIDs during a parallel operation
> ***
> -
>
> I debugged it and it did have modifycte in the parsetree after rewrite.
> I think if we can properly set the hasModifyingCTE, we can avoid this error 
> by not consider parallel for this.
>

Thanks. You've identified that the bug exists for SELECT too. I've
verified that the issue is fixed by the bugfix included in the
Parallel INSERT patch.
Are you able to review my bugfix?
Since the problem exists for SELECT in the current Postgres code, I'd
like to pull that bugfix out and provide it as a separate fix.
My concern is that there may well be a better way to fix the issue -
for example, during the re-writing, rather than after the query has
been re-written.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Is Recovery actually paused?

2021-02-04 Thread Bharath Rupireddy
On Fri, Feb 5, 2021 at 10:06 AM Dilip Kumar  wrote:
>
> On Fri, Feb 5, 2021 at 6:22 AM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Feb 4, 2021 at 7:20 PM Dilip Kumar  wrote:
> > > > How can we do that this is not a 1 byte flag this is enum so I don't
> > > > think we can read any atomic state without a spin lock here.
> > >
> > > I have fixed the other comments and the updated patch is attached.
> >
> > Can we just do like below so that we could use the existing
> > SetRecoveryPause instead of setting the state outside?
> >
> > /* loop until recoveryPauseState is set to RECOVERY_NOT_PAUSED */
> > while (1)
> > {
> > RecoveryPauseState state;
> >
> > state = GetRecoveryPauseState();
> >
> > if (state == RECOVERY_NOT_PAUSED)
> > break;
> >
> > HandleStartupProcInterrupts();
> >
> > if (CheckForStandbyTrigger())
> > return;
> > pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
> >
> > /*
> >  * If recovery pause is requested then set it paused.  While we are 
> > in
> >  * the loop, user might resume and pause again so set this every 
> > time.
> >  */
> > if (state == RECOVERY_PAUSE_REQUESTED)
> > SetRecoveryPause(RECOVERY_PAUSED)
>
> We can not do that, basically, under one lock we need to check the
> state and set it to pause.  Because by the time you release the lock
> someone might set it to RECOVERY_NOT_PAUSED then you don't want to set
> it to RECOVERY_PAUSED.

Got it. Thanks.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Is Recovery actually paused?

2021-02-04 Thread Dilip Kumar
On Fri, Feb 5, 2021 at 6:22 AM Bharath Rupireddy
 wrote:
>
> On Thu, Feb 4, 2021 at 7:20 PM Dilip Kumar  wrote:
> > > How can we do that this is not a 1 byte flag this is enum so I don't
> > > think we can read any atomic state without a spin lock here.
> >
> > I have fixed the other comments and the updated patch is attached.
>
> Can we just do like below so that we could use the existing
> SetRecoveryPause instead of setting the state outside?
>
> /* loop until recoveryPauseState is set to RECOVERY_NOT_PAUSED */
> while (1)
> {
> RecoveryPauseState state;
>
> state = GetRecoveryPauseState();
>
> if (state == RECOVERY_NOT_PAUSED)
> break;
>
> HandleStartupProcInterrupts();
>
> if (CheckForStandbyTrigger())
> return;
> pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
>
> /*
>  * If recovery pause is requested then set it paused.  While we are in
>  * the loop, user might resume and pause again so set this every time.
>  */
> if (state == RECOVERY_PAUSE_REQUESTED)
> SetRecoveryPause(RECOVERY_PAUSED)

We cannot do that: under a single lock we need to both check the state and
set it to paused. Otherwise, by the time you release the lock someone might
have set it to RECOVERY_NOT_PAUSED, in which case you don't want to set it
to RECOVERY_PAUSED.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Is Recovery actually paused?

2021-02-04 Thread Dilip Kumar
On Thu, Feb 4, 2021 at 10:19 PM Robert Haas  wrote:
>
> On Thu, Feb 4, 2021 at 7:46 AM Dilip Kumar  wrote:
> > How can we do that this is not a 1 byte flag this is enum so I don't
> > think we can read any atomic state without a spin lock here.
>
> I think this discussion of atomics is confused. Let's talk about what
> atomic reads and writes mean. Imagine that you have a 64-bit value
> 0x0101010101010101. Somebody sets it to 0x0202020202020202. Imagine
> that just as they are doing that, someone else reads the value and
> gets 0x0202020201010101, because half of the value has been updated
> and the other half has not yet been updated yet. This kind of thing
> can actually happen on some platforms and what it means is that on
> those platforms 8-byte reads and writes are not atomic. The idea of an
> "atom" is that it can't be split into pieces but these reads and
> writes on some platforms are actually not "atomic" because they are
> split into two 4-byte pieces. But there's no such thing as a 1-byte
> read or write not being atomic. In theory you could imagine a computer
> where when you change 0x01 to 0x23 and read in the middle and see 0x21
> or 0x13 or something, but no actual computers behave that way, or at
> least no mainstream ones that anybody cares about. So the idea that
> you somehow need a lock to prevent this is just wrong.
>
> Concurrent programs also suffer from another problem which is
> reordering of operations, which can happen either as the program is
> compiled or as the program is executed by the CPU. The CPU can see you
> set a->x = 1 and a->y = 2 and decide to update y first and then x even
> though you wrote it the other way around in the program text. To
> prevent this, we have barrier operations; see README.barrier in the
> source tree for a longer explanation. Atomic operations like
> compare-and-exchange are also full barriers, so that they not only
> prevent the torn read/write problem described above, but also enforce
> order of operations more strictly.
>
> Now I don't know whether a lock is needed here or not. Maybe it is;
> perhaps for consistency with other code, perhaps because the lock
> acquire and release is serving the function of a barrier; or perhaps
> to guard against some other hazard. But saying that it's because
> reading or writing a 1-byte value might not be atomic does not sound
> correct.

I never said that reading or writing 1 byte is not atomic; of course it is.
I said that we can only guarantee that a 1-byte read/write is atomic, but
this variable is not a bool or other 1-byte value: the enum can take 32 bits
on a 32-bit platform, so we cannot guarantee atomic reads/writes on some
processors and therefore need a lock.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pg_replication_origin_drop API potential race condition

2021-02-04 Thread Peter Smith
On Thu, Feb 4, 2021 at 9:20 PM Amit Kapila  wrote:
>
> On Thu, Feb 4, 2021 at 1:31 PM Peter Smith  wrote:
> >
> > PSA a patch which I think implements what we are talking about.
> >
>
> This doesn't seem correct to me. Have you tested that the patch
> resolves the problem reported originally? Because the lockmode
> (RowExclusiveLock) you have used in the patch will allow multiple
> callers to acquire at the same time. The other thing I don't like
> about this is that first, it acquires lock in the function
> replorigin_drop_by_name and then again we acquire the same lock in a
> different mode in replorigin_drop.
>
> What I was imagining was to have a code same as replorigin_drop with
> the first parameter as the name instead of id and additionally, it
> will check the existence of origin by replorigin_by_name after
> acquiring the lock. So you can move all the common code from
> replorigin_drop (starting from restart till end leaving table_close)
> to a separate function say replorigin_drop_guts and then call it from
> both replorigin_drop and replorigin_drop_by_name.
>
> Now, I have also thought to directly change replorigin_drop but this
> is an exposed API so let's keep it as it is because some extensions
> might be using it. We can anyway later drop it if required.
>

PSA patch updated per above suggestions.


Kind Regards,
Peter Smith.
Fujitsu Australia


v2-0001-replorigin_drop_by_name.patch
Description: Binary data


RE: POC: postgres_fdw insert batching

2021-02-04 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> It appears I had missed your reply, sorry.
> 
> > +   PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
> > +
> (PgFdwModifyState *) resultRelInfo->ri_FdwState :
> > +   NULL;
> >
> > This can be written as:
> >
> > +   PgFdwModifyState *fmstate = (PgFdwModifyState *)
> > + resultRelInfo->ri_FdwState;
> 
> Facepalm, yes.
> 
> Patch updated.  Thanks for the review.

Thank you for picking this up.


Regards
Takayuki Tsunakawa



RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Hou, Zhijie
Hi,

I took a look into the hasModifyingCTE bugfix recently, 
and found a possible bug case without the parallel insert patch.

-
drop table if exists test_data1;
create table test_data1(a int, b int) ;
insert into test_data1 select generate_series(1,1000), generate_series(1,1000);
set force_parallel_mode=on;

CREATE TEMP TABLE bug6051 AS
  select i from generate_series(1,3) as i;

SELECT * FROM bug6051;
CREATE RULE bug6051_ins AS ON INSERT TO bug6051 DO INSTEAD select a as i from 
test_data1;

WITH t1 AS ( DELETE FROM bug6051 RETURNING * ) INSERT INTO bug6051 SELECT * 
FROM t1;

***
***ERROR:  cannot assign XIDs during a parallel operation
***
-

I debugged it, and it did have a modifying CTE in the parse tree after rewrite.
I think if we can properly set hasModifyingCTE, we can avoid this error by not
considering parallelism for such statements.

Thoughts ?

Best regards,
houzj







Re: SPI: process exit in SPI_exec when table not exist. error code not return.

2021-02-04 Thread sasa su
Thanks Barwick. Is the process crash also expected? And has anyone considered
enhancing this API?

05.02.2021, 10:55, "Ian Lawrence Barwick" :
> 2021年2月5日(金) 11:38 sasa su :
>> Hi.
>>
>> When running a query with a not exist table in SPI_exec. the process exit
>> with -1 in SPI_exec function.
>> the error code SPI_ERROR_REL_NOT_FOUND never return.
>>
>> I made a minimal reproduction code in
>> https://github.com/Sasasu/worker_spi_table_not_exist
>>
>> The core code is:
>>
>>    int spi = SPI_exec("select * from not_exist_table", 0);
>>    // can not reach here
>>    Assert(spi == SPI_ERROR_REL_NOT_FOUND);
>
> The list of return codes returned by SPI_exec is here:
>
>    https://www.postgresql.org/docs/current/spi-spi-execute.html
>
> and doesn't include "SPI_ERROR_REL_NOT_FOUND". The only
> function which does return that is SPI_unregister_relation, see:
>
>   https://www.postgresql.org/docs/current/spi-spi-unregister-relation.html
>
> Regards
>
> Ian Barwick
> --
> EnterpriseDB: https://www.enterprisedb.com

Re: making update/delete of inheritance trees scale better

2021-02-04 Thread Amit Langote
On Thu, Feb 4, 2021 at 10:41 PM Robert Haas  wrote:
> On Thu, Feb 4, 2021 at 4:33 AM Amit Langote  wrote:
> > To be clear, the new refetch in ExecModifyTable() is to fill in the
> > unchanged columns in the new tuple.  If we rejigger the
> > table_tuple_update() API to receive a partial tuple (essentially
> > what's in 'planSlot' passed to ExecUpdate) as opposed to the full
> > tuple, we wouldn't need the refetch.
>
> I don't think we should assume that every AM needs the unmodified
> columns.  Imagine a table AM that's a columnar store. Imagine that each
> column is stored completely separately, so you have to look up the TID
> once per column and then stick in the new values. Well, clearly you
> want to skip this completely for columns that don't need to be
> modified. If someone gives you all the columns it actually sucks,
> because now you have to look them all up again just to figure out
> which ones you need to change, whereas if they gave you only the
> unmodified columns you could just do nothing for those and save a
> bunch of work.

Right, that's the idea in case I wasn't clear.  Currently, a slot
containing the full tuple is passed to the table AM, with or without
the patch.  The API:

static inline TM_Result
table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot, ...

 describes its 'slot' parameter as:

 *  slot - newly constructed tuple data to store

We could, possibly in a follow-on patch, adjust the
table_tuple_update() API to only accept the changed values through the
slot.

> zheap, though, is always going to need to take another look at the
> tuple to do the update, unless you can pass up some values through
> hidden columns. I'm not exactly sure how expensive that really is,
> though.

I guess it would depend on how many of those hidden columns there need
to be (in addition to the existing "ctid" hidden column) and how many
levels of the plan tree they would need to climb through when bubbling
up.
--
Amit Langote
EDB: http://www.enterprisedb.com




Re: psql tab completion for CREATE DATABASE ... LOCALE

2021-02-04 Thread Ian Lawrence Barwick
2021年2月5日(金) 11:51 Thomas Munro :

> On Fri, Feb 5, 2021 at 2:40 PM Ian Lawrence Barwick 
> wrote:
> > Trivialest of trivial patches attached (will add to next CF).
>
> Thanks, pushed.
>

Oh, that was quick, thanks! I hadn't even got round to adding it to the CF.


Regards

Ian Barwick
-- 
EnterpriseDB: https://www.enterprisedb.com


Re: SPI: process exit in SPI_exec when table not exist. error code not return.

2021-02-04 Thread Ian Lawrence Barwick
2021年2月5日(金) 11:38 sasa su :

> Hi.
>
> When running a query with a not exist table in SPI_exec. the process exit
> with -1 in SPI_exec function.
> the error code SPI_ERROR_REL_NOT_FOUND never return.
>
> I made a minimal reproduction code in
> https://github.com/Sasasu/worker_spi_table_not_exist
>
> The core code is:
>
>int spi = SPI_exec("select * from not_exist_table", 0);
>// can not reach here
>Assert(spi == SPI_ERROR_REL_NOT_FOUND);
>


The list of return codes returned by SPI_exec is here:

   https://www.postgresql.org/docs/current/spi-spi-execute.html

and doesn't include "SPI_ERROR_REL_NOT_FOUND". The only
function which does return that is SPI_unregister_relation, see:

  https://www.postgresql.org/docs/current/spi-spi-unregister-relation.html


Regards

Ian Barwick


-- 
EnterpriseDB: https://www.enterprisedb.com


RE: Determine parallel-safety of partition relations for Inserts

2021-02-04 Thread Huang, Qiuyan
Hi,

I notice the comment 5) about is_parallel_possible_for_modify() seems to be 
inaccurate in clauses.c.
*  5) the reloption parallel_dml_enabled is not set for the target table

Because you have set parallel_dml_enabled to 'true' by default.
{
{
"parallel_dml_enabled",
"Enables \"parallel dml\" feature for this table",
RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
ShareUpdateExclusiveLock
},
true
}

So even if the user doesn't set parallel_dml_enabled explicitly, its value is
'on', and parallelism is still possible for this rel (when enable_parallel_dml
is on).

Maybe we can comment like this instead, or a better wording you'd prefer, if
you agree with the above:
*  5) the reloption parallel_dml_enabled is set to off

Regards
Huang

> -Original Message-
> From: Hou, Zhijie 
> Sent: Wednesday, February 3, 2021 9:01 AM
> To: Greg Nancarrow 
> Cc: Amit Kapila ; PostgreSQL Hackers
> ; vignesh C ;
> Amit Langote ; David Rowley
> ; Tom Lane ; Tsunakawa,
> Takayuki/綱川 貴之 
> Subject: RE: Determine parallel-safety of partition relations for Inserts
> 
> Hi,
> 
> Attaching v5 patches with the changes:
>   * rebase the code on the greg's latest parallel insert patch
>   * fix some code style.
> 
> Please consider it for further review.
> 
> Best regards,
> Houzj
> 
> 
> 
> 





Re: POC: postgres_fdw insert batching

2021-02-04 Thread Amit Langote
Tsunakwa-san,

On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.ta...@fujitsu.com
 wrote:
> From: Amit Langote 
> > Yes, it can be simplified by using a local join to prevent the update of 
> > the foreign
> > partition from being pushed to the remote server, for which my example in 
> > the
> > previous email used a local trigger.  Note that the update of the foreign
> > partition to be done locally is a prerequisite for this bug to occur.
>
> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it 
> partway.  Good catch (and my bad miss.)

It appears I had missed your reply, sorry.

> +   PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
> +   (PgFdwModifyState *) 
> resultRelInfo->ri_FdwState :
> +   NULL;
>
> This can be written as:
>
> +   PgFdwModifyState *fmstate = (PgFdwModifyState *) 
> resultRelInfo->ri_FdwState;

Facepalm, yes.

Patch updated.  Thanks for the review.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v3-0001-Prevent-FDW-insert-batching-during-cross-partitio.patch
Description: Binary data


Re: psql tab completion for CREATE DATABASE ... LOCALE

2021-02-04 Thread Thomas Munro
On Fri, Feb 5, 2021 at 2:40 PM Ian Lawrence Barwick  wrote:
> Trivialest of trivial patches attached (will add to next CF).

Thanks, pushed.




SPI: process exit in SPI_exec when table not exist. error code not return.

2021-02-04 Thread sasa su
Hi.

When running a query with a not exist table in SPI_exec, the process exits
with -1 in the SPI_exec function; the error code SPI_ERROR_REL_NOT_FOUND is
never returned.

I made a minimal reproduction code in
https://github.com/Sasasu/worker_spi_table_not_exist

The core code is:

   int spi = SPI_exec("select * from not_exist_table", 0);
   // can not reach here
   Assert(spi == SPI_ERROR_REL_NOT_FOUND);

I think it is a bug; the PG_TRY macro is not mentioned in the SPI document.
The code inside SPI should be wrapped with the PG_TRY macro.




Re: Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-04 Thread Michael Paquier
On Thu, Feb 04, 2021 at 03:52:44PM +0100, Ronan Dunklau wrote:
>> Hmmm, that sure seems like a bug, or at least unexpected behavior (that
>> I don't see mentioned in the docs).

Yeah, per the rule of consistency, this classifies as a bug to me.

> Please find attached a correct patch.

ConstructTupleDescriptor() does not matter much, but this patch is not
acceptable to me as it touches the area of the index creation while
statistics on an index expression can only be changed with a special
flavor of ALTER INDEX with column numbers.  This would imply an ABI
breakage, so it cannot be backpatched as-is.

Let's copy this data in index_concurrently_swap() instead.  The
attached patch does that, and adds a test cheaper than what was
proposed.  There is a minor release planned for next week, so it may be
better to wait until after that, so that we have enough time to agree on
a solution.
--
Michael
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index ae720c1496..77871aaefc 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -90,6 +90,7 @@ extern Oid	get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype,
 			  int16 procnum);
 extern char *get_attname(Oid relid, AttrNumber attnum, bool missing_ok);
 extern AttrNumber get_attnum(Oid relid, const char *attname);
+extern int	get_attstattarget(Oid relid, AttrNumber attnum);
 extern char get_attgenerated(Oid relid, AttrNumber attnum);
 extern Oid	get_atttype(Oid relid, AttrNumber attnum);
 extern void get_atttypetypmodcoll(Oid relid, AttrNumber attnum,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1cb9172a5f..9403ae64f8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1884,6 +1884,63 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 	/* Copy data of pg_statistic from the old index to the new one */
 	CopyStatistics(oldIndexId, newIndexId);
 
+	/* Copy pg_attribute.attstattarget for each index attribute */
+	{
+		HeapTuple	attrTuple;
+		Relation	pg_attribute;
+		SysScanDesc scan;
+		ScanKeyData key[1];
+
+		pg_attribute = table_open(AttributeRelationId, RowExclusiveLock);
+		ScanKeyInit(&key[0],
+					Anum_pg_attribute_attrelid,
+					BTEqualStrategyNumber, F_OIDEQ,
+					ObjectIdGetDatum(newIndexId));
+		scan = systable_beginscan(pg_attribute, AttributeRelidNumIndexId,
+  true, NULL, 1, key);
+
+		while (HeapTupleIsValid((attrTuple = systable_getnext(scan))))
+		{
+			Form_pg_attribute att = (Form_pg_attribute) GETSTRUCT(attrTuple);
+			Datum		repl_val[Natts_pg_attribute];
+			bool		repl_null[Natts_pg_attribute];
+			bool		repl_repl[Natts_pg_attribute];
+			int			attstattarget;
+			HeapTuple	newTuple;
+
+			/* Ignore dropped columns */
+			if (att->attisdropped)
+continue;
+
+			/*
+			 * Get attstattarget from the old index and refresh the new
+			 * value.
+			 */
+			attstattarget = get_attstattarget(oldIndexId, att->attnum);
+
+			/* no need for a refresh if both match */
+			if (attstattarget == att->attstattarget)
+continue;
+
+			memset(repl_null, false, sizeof(repl_null));
+			memset(repl_repl, false, sizeof(repl_repl));
+
+			repl_repl[Anum_pg_attribute_attstattarget - 1] = true;
+			repl_val[Anum_pg_attribute_attstattarget - 1] = Int32GetDatum(attstattarget);
+
+			newTuple = heap_modify_tuple(attrTuple,
+		 RelationGetDescr(pg_attribute),
+		 repl_val, repl_null, repl_repl);
+			CatalogTupleUpdate(pg_attribute, &attrTuple->t_self, newTuple);
+
+			heap_freetuple(newTuple);
+		}
+
+		systable_endscan(scan);
+		table_close(pg_attribute, RowExclusiveLock);
+	}
+
+
 	/* Close relations */
 	table_close(pg_class, RowExclusiveLock);
 	table_close(pg_index, RowExclusiveLock);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 85c458bc46..6bba5f8ec4 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -871,6 +871,33 @@ get_attnum(Oid relid, const char *attname)
 		return InvalidAttrNumber;
 }
 
+/*
+ * get_attstattarget
+ *
+ *		Given the relation id and the attribute number,
+ *		return the "attstattarget" field from the attribute relation.
+ *
+ *		Errors if not found.
+ */
+int
+get_attstattarget(Oid relid, AttrNumber attnum)
+{
+	HeapTuple	tp;
+	Form_pg_attribute att_tup;
+	int			result;
+
+	tp = SearchSysCache2(ATTNUM,
+		 ObjectIdGetDatum(relid),
+		 Int16GetDatum(attnum));
+	if (!HeapTupleIsValid(tp))
+		elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+			 attnum, relid);
+	att_tup = (Form_pg_attribute) GETSTRUCT(tp);
+	result = att_tup->attstattarget;
+	ReleaseSysCache(tp);
+	return result;
+}
+
 /*
  * get_attgenerated
  *
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ce734f7ef3..2c07940b98 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2559,6 +2559,7 @@ CREATE 

Re: POC: postgres_fdw insert batching

2021-02-04 Thread Ian Lawrence Barwick
2021年1月22日(金) 14:50 Ian Lawrence Barwick :

> Hi
>
> 2021年1月21日(木) 8:00 Tomas Vondra :
>
>> OK, pushed after a little bit of additional polishing (mostly comments).
>>
>> Thanks everyone!
>>
>
> There's a minor typo in the doc's version of the ExecForeignBatchInsert()
> declaration;
> is:
>
> TupleTableSlot **
> ExecForeignBatchInsert(EState *estate,
>   ResultRelInfo *rinfo,
>   TupleTableSlot **slots,
>   TupleTableSlot *planSlots,
>   int *numSlots);
>
> should be:
>
> TupleTableSlot **
> ExecForeignBatchInsert(EState *estate,
>   ResultRelInfo *rinfo,
>   TupleTableSlot **slots,
>   TupleTableSlot **planSlots,
>   int *numSlots);
>
> (Trivial patch attached).
>
>
Forgot to mention the relevant doc link:


https://www.postgresql.org/docs/devel/fdw-callbacks.html#FDW-CALLBACKS-UPDATE

Regards

Ian Barwick

-- 
EnterpriseDB: https://www.enterprisedb.com



psql tab completion for CREATE DATABASE ... LOCALE

2021-02-04 Thread Ian Lawrence Barwick
Greetings

This morning I was overcome by an urge to create a database with a specific
locale, and my insufficiently caffeinated brain reminded me there was a
handy
LOCALE option added not so long ago, but was confused by its stubborn
absence
from the list of tab completion options presented by psql on a current HEAD
build, no matter how much I mashed the tab key.

Further investigation confirmed the LOCALE option was added in PostgreSQL 13
(commit 06140c20) but neither commit [1] nor discussion [2] mention psql.
Trivialest of trivial patches attached (will add to next CF).

[1]
https://git.postgresql.org/pg/commitdiff/06140c201b982436974d71e756d7331767a41e57
[2]
https://www.postgresql.org/message-id/flat/d9d5043a-dc70-da8a-0166-1e218e6e34d4%402ndquadrant.com


Regards

Ian Barwick


-- 
EnterpriseDB: https://www.enterprisedb.com
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a75647b1cc..5f0e775fd3 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2429,7 +2429,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("OWNER", "TEMPLATE", "ENCODING", "TABLESPACE",
 	  "IS_TEMPLATE",
 	  "ALLOW_CONNECTIONS", "CONNECTION LIMIT",
-	  "LC_COLLATE", "LC_CTYPE");
+	  "LC_COLLATE", "LC_CTYPE", "LOCALE");
 
 	else if (Matches("CREATE", "DATABASE", MatchAny, "TEMPLATE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_template_databases);


Re: Single transaction in the tablesync worker?

2021-02-04 Thread Peter Smith
On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila  wrote:
>
...

> Thanks. I have fixed one of the issues reported by me earlier [1]
> wherein the tablesync worker can repeatedly fail if after dropping the
> slot there is an error while updating the SYNCDONE state in the
> database. I have moved the drop of the slot just before commit of the
> transaction where we are marking the state as SYNCDONE. Additionally,
> I have removed unnecessary includes in tablesync.c, updated the docs
> for Alter Subscription, and updated the comments at various places in
> the patch. I have also updated the commit message this time.
>

Below are my feedback comments for V17 (nothing functional)

~~

1.
V27 Commit message:
For the initial table data synchronization in logical replication, we use
a single transaction to copy the entire table and then synchronizes the
position in the stream with the main apply worker.

Typo:
"synchronizes" -> "synchronize"

~~

2.
@@ -48,6 +48,23 @@ ALTER SUBSCRIPTION name RENAME TO <
(Currently, all subscription owners must be superusers, so the owner checks
will be bypassed in practice.  But this might change in the future.)
   
+
+  
+   When refreshing a publication we remove the relations that are no longer
+   part of the publication and we also remove the tablesync slots if there are
+   any. It is necessary to remove tablesync slots so that the resources
+   allocated for the subscription on the remote host are released. If due to
+   network breakdown or some other error, we are not able to remove the slots,
+   we give WARNING and the user needs to manually remove such slots later as
+   otherwise, they will continue to reserve WAL and might eventually cause
+   the disk to fill up. See also .
+  

I think the content is good, but the 1st-person wording seemed strange.
e.g.
"we are not able to remove the slots, we give WARNING and the user needs..."
Maybe it should be like:
"... PostgreSQL is unable to remove the slots, so a WARNING is
reported. The user needs... "

~~

3.
@@ -566,107 +569,197 @@ AlterSubscription_refresh(Subscription *sub,
bool copy_data)
...
+ * XXX If there is a network break down while dropping the

"network break down" -> "network breakdown"

~~

4.
@@ -872,7 +970,48 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
  (errmsg("could not connect to the publisher: %s", err)));
...
+ * XXX We could also instead try to drop the slot, last time we failed
+ * but for that, we might need to clean up the copy state as it might
+ * be in the middle of fetching the rows. Also, if there is a network
+ * break down then it wouldn't have succeeded so trying it next time
+ * seems like a better bet.

"network break down" -> "network breakdown"

~~

5.
@@ -269,26 +313,47 @@ invalidate_syncing_table_states(Datum arg, int
cacheid, uint32 hashvalue)
...
+
+ /*
+ * Cleanup the tablesync slot.
+ *
+ * This has to be done after updating the state because otherwise if
+ * there is an error while doing the database operations we won't be
+ * able to rollback dropped slot.
+ */
+ ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
+ MyLogicalRepWorker->relid,
+ syncslotname);
+
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname, false /* missing_ok */);
+

Should this comment also describe why the missing_ok is false for this case?


Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Is Recovery actually paused?

2021-02-04 Thread Bharath Rupireddy
On Thu, Feb 4, 2021 at 7:20 PM Dilip Kumar  wrote:
> > How can we do that this is not a 1 byte flag this is enum so I don't
> > think we can read any atomic state without a spin lock here.
>
> I have fixed the other comments and the updated patch is attached.

Can we just do like below so that we could use the existing
SetRecoveryPause instead of setting the state outside?

/* loop until recoveryPauseState is set to RECOVERY_NOT_PAUSED */
while (1)
{
RecoveryPauseState state;

state = GetRecoveryPauseState();

if (state == RECOVERY_NOT_PAUSED)
break;

HandleStartupProcInterrupts();

if (CheckForStandbyTrigger())
return;
pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);

/*
 * If recovery pause is requested then set it paused.  While we are in
 * the loop, user might resume and pause again so set this every time.
 */
if (state == RECOVERY_PAUSE_REQUESTED)
SetRecoveryPause(RECOVERY_PAUSED)

And a typo - it's "pg_is_wal_replay_paused" not
"pg_is_wal_repay_paused" +
pg_is_wal_repay_paused() returns whether a
request

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-02-04 Thread Zhihong Yu
Hi,
For Array-containselem-gin-v4.patch , one small comment:

+ * array_contains_elem : checks an array for a spefific element

typo: specific

Cheers

On Thu, Feb 4, 2021 at 4:03 PM Mark Rofail  wrote:

> Hello Joel,
>
> No error, even though bigint[] isn't compatible with smallint.
>>
> I added a check to compare the element type of the fkoperand and the type
> of the pkoperand; they should be the same.
> Please try v18 attached below, you should get the following message
> ```
> ERROR:  foreign key constraint "fktableviolating_ftest1_fkey" cannot be
> implemented
> DETAIL:  Specified columns have element types smallint and bigint which
> are not homogenous.
> ```
>
> Changelog (FK arrays):
> - v18 (compatible with current master 2021-02-04, commit
> c34787f910585f82320f78b0afd53a6a170aa229)
> * add check for operand compatibility at constraint creation
>
> Changelog (FK arrays Elem addon)
> - v4 (compatible with FK arrays v18)
>* re-add Composite Type support
>
> I believe we should start merging these two patches as one, due to the
> Elem addon's benefits. such as adding Composite Type support.
>
> /Mark
>
> On Thu, Feb 4, 2021 at 9:00 AM Joel Jacobson  wrote:
>
>> On Tue, Feb 2, 2021, at 13:51, Mark Rofail wrote:
>> >Array-ELEMENT-foreign-key-v17.patch
>>
>> When working on my pit tool, I found another implicit type casts problem.
>>
>> First an example to show a desired error message:
>>
>> CREATE TABLE a (
>> a_id smallint,
>> PRIMARY KEY (a_id)
>> );
>>
>> CREATE TABLE b (
>> b_id bigint,
>> a_ids text[],
>> PRIMARY KEY (b_id)
>> );
>>
>> ALTER TABLE b ADD FOREIGN KEY (EACH ELEMENT OF a_ids) REFERENCES a;
>>
>> The below error message is good:
>>
>> ERROR:  foreign key constraint "b_a_ids_fkey" cannot be implemented
>> DETAIL:  Key column "a_ids" has element type text which does not have a
>> default btree operator class that's compatible with class "int2_ops".
>>
>> But if we instead make a_ids a bigint[], we don't get any error:
>>
>> DROP TABLE b;
>>
>> CREATE TABLE b (
>> b_id bigint,
>> a_ids bigint[],
>> PRIMARY KEY (b_id)
>> );
>>
>> ALTER TABLE b ADD FOREIGN KEY (EACH ELEMENT OF a_ids) REFERENCES a;
>>
>> No error, even though bigint[] isn't compatible with smallint.
>>
>> We do get an error when trying to insert into the table:
>>
>> INSERT INTO a (a_id) VALUES (1);
>> INSERT INTO b (b_id, a_ids) VALUES (2, ARRAY[1]);
>>
>> ERROR:  operator does not exist: smallint[] pg_catalog.<@ bigint[]
>> LINE 1: ..."."a" x WHERE ARRAY ["a_id"]::pg_catalog.anyarray OPERATOR(p...
>>  ^
>> HINT:  No operator matches the given name and argument types. You might
>> need to add explicit type casts.
>> QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM
>> pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*)
>> FROM (SELECT 1 FROM ONLY "public"."a" x WHERE ARRAY
>> ["a_id"]::pg_catalog.anyarray OPERATOR(pg_catalog.<@)
>> $1::pg_catalog.anyarray FOR KEY SHARE OF x) z)
>>
>> I wonder if we can come up with some general way to detect these
>> problems already at constraint creation time,
>> instead of having to wait for data to get the error,
>> similar to why compile time error are preferred over run time errors.
>>
>> /Joel
>>
>


Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-02-04 Thread Masahiro Ikeda

I pgindented the patches.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATIONFrom 47f436d7e423ece33a25adebf4265eac02e575f3 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Fri, 29 Jan 2021 16:41:34 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds following statistics to pg_stat_wal view
to track WAL I/O activity.

- the total number of times writing/syncing WAL data.
- the total amount of time spent writing/syncing WAL data.

Since tracking I/O timing may lead to significant overhead,
a GUC parameter "track_wal_io_timing" is introduced.
The I/O timing is measured only when it is on.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because it doesn't call each
sync method.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75...@oss.nttdata.com
---
 doc/src/sgml/config.sgml  | 21 +++
 doc/src/sgml/monitoring.sgml  | 50 
 doc/src/sgml/wal.sgml | 12 +++-
 src/backend/access/transam/xlog.c | 57 +++
 src/backend/catalog/system_views.sql  |  4 ++
 src/backend/postmaster/checkpointer.c |  2 +-
 src/backend/postmaster/pgstat.c   |  4 ++
 src/backend/postmaster/walwriter.c|  3 +
 src/backend/utils/adt/pgstatfuncs.c   | 24 +++-
 src/backend/utils/misc/guc.c  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h |  1 +
 src/include/catalog/pg_proc.dat   |  6 +-
 src/include/pgstat.h  | 10 
 src/test/regress/expected/rules.out   |  6 +-
 15 files changed, 199 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..4bdc341141 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7418,6 +7418,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
  
 
+ 
+  track_wal_io_timing (boolean)
+  
+   track_wal_io_timing configuration parameter
+  
+  
+  
+   
+Enables timing of WAL I/O calls. This parameter is off by default,
+because it will repeatedly query the operating system for
+the current time, which may cause significant overhead on some
+platforms.  You can use the  tool to
+measure the overhead of timing on your system.
+I/O timing information is
+displayed in 
+pg_stat_wal.  Only superusers can
+change this setting.
+   
+  
+ 
+
  
   track_functions (enum)
   
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..2435f401db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3487,6 +3487,56 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   wal_write bigint
+  
+  
+   Number of times WAL buffers were written out to disk via 
   XLogWrite, which is normally called by an 
   XLogFlush request (see )
+  
+ 
+
+ 
+  
+   wal_write_time double precision
+  
+  
+   Total amount of time spent writing WAL data to disk, excluding sync time unless 
is either open_datasync or 
+   open_sync. Units are in milliseconds with microsecond resolution.
+   This is zero when  is disabled.
+  
+ 
+
+ 
+  
+   wal_sync bigint
+  
+  
+   Number of times WAL files were synced to disk via 
   issue_xlog_fsync, which is normally called by an 
   XLogFlush request (see ),
+   while  was set to one of the 
+   "sync at commit" options (i.e., fdatasync, 
+   fsync, or fsync_writethrough).
+  
+ 
+
+ 
+  
+   wal_sync_time double precision
+  
+  
+   Total amount of time spent syncing WAL files to disk, in milliseconds with microsecond 
+   resolution. This requires setting  to one of 
+   the "sync at commit" options (i.e., fdatasync, fsync,
+   or fsync_writethrough).
+   This is zero when  is disabled.
+  
+ 
+
  
   
stats_reset timestamp with time zone
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 66de1ee2f8..984cb5764c 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -663,7 +663,9 @@
the WAL buffers in shared memory. If there is no
space for the new record, XLogInsertRecord will have
to write (move to kernel cache) a few filled WAL
-   buffers. This is undesirable because XLogInsertRecord
+   buffers. The number of times it happened is counted as 
+   wal_buffers_full in .
+   This is 

Re: [HACKERS] Custom compression methods

2021-02-04 Thread Robert Haas
On Thu, Feb 4, 2021 at 11:39 AM Dilip Kumar  wrote:
> Yeah, actually, I thought I would avoid calling slot_getallattrs if
> none of the attributes got decompress.  I agree if we call this before
> we can avoid calling slot_getattr but slot_getattr
> is only called for the attribute which has attlen -1. I agree that if
> we call slot_getattr for attnum n then it will deform all the
> attributes before that.  But then slot_getallattrs only need to deform
> the remaining attributes not all.  But maybe we can call the
> slot_getallattrs as soon as we see the first attribute with attlen -1
> and then avoid calling subsequent slot_getattr, maybe that is better
> than compared to what I have because we will avoid calling
> slot_getattr for many attributes, especially when there are many
> verlena.

I think that if we need to deform at all, we need to deform all
attributes, right? So there's no point in considering e.g.
slot_getsomeattrs(). But just slot_getallattrs() as soon as we know we
need to do it might be worthwhile. Could even have two loops: one that
just figures out whether we need to deform; if not, return. Then
slot_getallattrs(). Then another loop to do the work.
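In schematic form, the two-loop control flow described above could look like the following. This is only a sketch in plain C: `DemoAttr` and `process_tuple` are hypothetical stand-ins, and the executor's real `slot_getallattrs()` call is only indicated by a comment, since the actual code would operate on a `TupleTableSlot`.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * First loop: cheaply decide whether any varlena attribute
 * (attlen == -1, as in pg_attribute) might need decompression.
 * If none does, return without deforming anything.  Second loop:
 * do the per-attribute work, knowing the tuple is fully deformed.
 */
typedef struct
{
	int		attlen;		/* -1 means varlena */
	bool	processed;	/* set by the work loop */
} DemoAttr;

static bool
process_tuple(DemoAttr *atts, int natts)
{
	bool		need_work = false;

	for (int i = 0; i < natts; i++)
	{
		if (atts[i].attlen == -1)
		{
			need_work = true;
			break;
		}
	}
	if (!need_work)
		return false;			/* no varlena attributes: skip deforming */

	/* the real code would call slot_getallattrs(slot) exactly once here */

	for (int i = 0; i < natts; i++)
		atts[i].processed = true;

	return true;
}
```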

>  I think the supported procedure for this sort of
> > thing is to have a second slot, set tts_values, tts_isnull etc. and
> > then materialize the slot. After materializing the new slot, it's
> > independent of the old slot, which can then be cleared. See for
> > example tts_virtual_materialize().
>
> Okay, so if we take a new slot then we need to set this slot reference
> in the ScanState also otherwise that might point to the old slot.  I
> haven't yet analyzed where all we might be keeping the reference to
> that old slot.  Or I am missing something.

My guess is you want to leave the ScanState alone so that we keep
fetching into the same slot as before and have an extra slot on the
side someplace.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: fdatasync(2) on macOS

2021-02-04 Thread Thomas Munro
On Mon, Jan 18, 2021 at 4:39 PM Thomas Munro  wrote:
> On Sat, Jan 16, 2021 at 6:55 AM Tom Lane  wrote:
> > So it's not a no-op, but on the other hand it's not succeeding in getting
> > bits down to the platter.  I'm not inclined to dike it out, but it does
> > seem problematic that we're defaulting to open_datasync, which is also
> > not getting bits down to the platter.
>
> Hmm, OK, from these times it does appear that O_SYNC and O_DSYNC
> actually do something then.  It's baffling that they are undocumented.

I was digging through Apple sources again trying to learn something
about bug report #16827, and I spotted one extra detail that I wanted
to share in this thread about these undocumented system interfaces,
just for the record.  It appears that as of macOS 11.2/XNU 7195.83.3,
their vn_write() doesn't treat O_DSYNC any differently than O_SYNC:

/*
 * Treat synchronous mounts and O_FSYNC on the fd as equivalent.
 *
 * XXX We treat O_DSYNC as O_FSYNC for now, since we can not delay
 * XXX the non-essential metadata without some additional VFS work;
 * XXX the intent at this point is to plumb the interface for it.
 */
if ((fp->fp_glob->fg_flag & (O_FSYNC | O_DSYNC)) ||
(vp->v_mount && (vp->v_mount->mnt_flag & MNT_SYNCHRONOUS))) {
ioflag |= IO_SYNC;
}
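For reference, the POSIX-level distinction the kernel comment is about can be exercised from a small test program. This is a sketch under the usual POSIX semantics: with O_DSYNC, each write() returns only once the data has reached stable storage, while O_SYNC would additionally wait for non-essential file metadata; per the vn_write() code quoted above, macOS currently treats both flags identically.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write a buffer to "path" with O_DSYNC set.  Returns 0 on success,
 * -1 on failure.  (Demonstration only; real WAL code keeps the file
 * descriptor open across writes.)
 */
static int
write_with_dsync(const char *path, const void *data, size_t len)
{
	int			fd;
	ssize_t		written;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0600);
	if (fd < 0)
		return -1;

	written = write(fd, data, len);
	if (written < 0 || (size_t) written != len)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}
```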




Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-04 Thread John Naylor
On Mon, Feb 1, 2021 at 2:01 PM Heikki Linnakangas  wrote:
>
> On 01/02/2021 19:32, John Naylor wrote:
> > It makes sense to start with the ascii subset of UTF-8 for a couple
> > reasons. First, ascii is very widespread in database content,
> > particularly in bulk loads. Second, ascii can be validated using the
> > simple SSE2 intrinsics that come with (I believe) any x64-64 chip, and
> > I'm guessing we can detect that at compile time and not mess with
> > runtime checks. The examples above using SSE for the general case are
> > much more complicated and involve SSE 4.2 or AVX.
>
> I wonder how using SSE compares with dealing with 64 or 32-bit words at
> a time, using regular instructions? That would be more portable.

I gave that a shot, and it's actually pretty good. According to this paper
[1], 16 bytes was best and gives a good apples-to-apples comparison to SSE
registers, so I tried both 16 and 8 bytes.
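As a concrete illustration of the word-at-a-time variant (a sketch, not the patch's actual code): every ASCII byte has the high bit clear, so an 8-byte chunk can be tested with a single AND against a mask of 0x80 bytes, falling back to a byte loop for the tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Check whether the next "len" bytes are pure ASCII, processing the
 * input one 8-byte word at a time.  Any byte with the high bit set is
 * non-ASCII, so testing the 0x80 bit of each byte is sufficient.
 */
static bool
is_ascii_word_at_a_time(const unsigned char *s, size_t len)
{
	const uint64_t highbits = UINT64_C(0x8080808080808080);

	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, s, 8);	/* safe unaligned load */
		if (chunk & highbits)
			return false;
		s += 8;
		len -= 8;
	}
	while (len > 0)
	{
		if (*s & 0x80)
			return false;
		s++;
		len--;
	}
	return true;
}
```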

> All supported encodings are ASCII subsets. Might be best to putt the
> ASCII-check into a static inline function and use it in all the verify
> functions. I presume it's only a few instructions, and these functions
> can be pretty performance sensitive.

I tried both the static inline function and also putting the whole
optimized utf-8 loop in a separate function to which the caller passes a
pointer to the appropriate pg_*_verifychar().

In the table below, "inline" refers to coding directly inside
pg_utf8_verifystr(). Both C and SSE are in the same patch, with an #ifdef.
I didn't bother splitting them out because for other encodings, we want one
of the other approaches above. For those, "C retail" refers to a static
inline function to code the contents of the inner loop, if I understood
your suggestion correctly. This needs more boilerplate in each function, so
I don't prefer this. "C func pointer" refers to the pointer approach I just
mentioned. That is the cleanest looking way to generalize it, so I only
tested that version with different strides -- 8 and 16 bytes.

This is the same test I used earlier, which is the test in [2] but adding
an almost-pure multibyte Chinese text of about the same size.

x64-64 Linux gcc 8.4.0:

  build   | chinese | mixed | ascii
--+-+---+---
 master   |1480 |   848 |   428
 inline SSE   |1617 |   634 |63
 inline C |1481 |   843 |50
 C retail |1493 |   838 |49
 C func pointer   |1467 |   851 |49
 C func pointer 8 |1518 |   757 |56

x64-64 MacOS clang 10.0.0:

  build   | chinese | mixed | ascii
--+-+---+---
 master   |1086 |   760 |   374
 inline SSE   |1081 |   529 |70
 inline C |1093 |   649 |49
 C retail |1132 |   695 |   152
 C func pointer   |1085 |   609 |59
 C func pointer 8 |1099 |   571 |71

PowerPC-LE Linux gcc 4.8.5:

  build   | chinese | mixed | ascii
--+-+---+---
 master   |2961 |  1525 |   871
 inline SSE   |   (n/a) | (n/a) | (n/a)
 inline C |2911 |  1329 |80
 C retail |2838 |  1311 |   102
 C func pointer   |2828 |  1314 |80
 C func pointer 8 |3143 |  1249 |   133

Looking at the results, the main advantage of SSE here is it's more robust
for mixed inputs. If a 16-byte chunk is not ascii-only but contains a block
of ascii at the front, we can skip those with a single CPU instruction, but
in C, we have to verify the whole chunk using the slow path.

The "C func pointer approach" seems to win out over the "C retail" approach
(static inline function).

Using an 8-byte stride is slightly better for mixed inputs on all platforms
tested, but regresses on pure ascii and also seems to regress on pure
multibyte. The difference in the multibyte case is small enough that it
could be random, but it happens on two platforms, so I'd say it's real. On
the other hand, pure multibyte is not as common as mixed text.

Overall, I think the function pointer approach with an 8-byte stride is the
best balance. If that's agreeable, next I plan to test with short inputs,
because I think we'll want a guard if-statement to only loop through the
fast path if the string is long enough to justify that.

> > I also gave a shot at doing full UTF-8 recognition using a DFA, but so
> > far that has made performance worse. If I ever have more success with
> > that, I'll add that in the mix.
>
> That's disappointing. Perhaps the SIMD algorithms have higher startup
> costs, so that you need longer inputs to benefit? In that case, it might
> make sense to check the length of the input and only use the SIMD
> algorithm if the input is long enough.

I changed topics a bit quickly, but here I'm talking about using a
table-driven state machine to verify the multibyte case. It's possible I
did something wrong, since my model implementation decodes, and having to
keep track of how 
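For reference, one table-free way to write such a state machine is an explicit-state DFA over the byte classes of RFC 3629. The sketch below is not the patch's code and validates only, without decoding code points, which sidesteps the bookkeeping a decoding implementation needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Validate a byte string as UTF-8 per RFC 3629: rejects overlong
 * forms, surrogates (U+D800..U+DFFF), values above U+10FFFF, and
 * truncated sequences.
 */
typedef enum
{
	S_START,	/* expecting a lead byte */
	S_CONT1,	/* one continuation byte (0x80-0xBF) to go */
	S_CONT2,	/* two continuation bytes to go */
	S_CONT3,	/* three continuation bytes to go */
	S_E0,		/* after 0xE0: next must be 0xA0-0xBF */
	S_ED,		/* after 0xED: next must be 0x80-0x9F */
	S_F0,		/* after 0xF0: next must be 0x90-0xBF */
	S_F4,		/* after 0xF4: next must be 0x80-0x8F */
	S_ERR
} Utf8State;

static bool
utf8_is_valid(const unsigned char *s, size_t len)
{
	Utf8State	state = S_START;

	for (size_t i = 0; i < len; i++)
	{
		unsigned char b = s[i];

		switch (state)
		{
			case S_START:
				if (b <= 0x7F)
					 /* ASCII: stay in S_START */ ;
				else if (b >= 0xC2 && b <= 0xDF) state = S_CONT1;
				else if (b == 0xE0) state = S_E0;
				else if (b == 0xED) state = S_ED;
				else if (b >= 0xE1 && b <= 0xEF) state = S_CONT2;
				else if (b == 0xF0) state = S_F0;
				else if (b >= 0xF1 && b <= 0xF3) state = S_CONT3;
				else if (b == 0xF4) state = S_F4;
				else state = S_ERR;
				break;
			case S_CONT1:
				state = (b >= 0x80 && b <= 0xBF) ? S_START : S_ERR;
				break;
			case S_CONT2:
				state = (b >= 0x80 && b <= 0xBF) ? S_CONT1 : S_ERR;
				break;
			case S_CONT3:
				state = (b >= 0x80 && b <= 0xBF) ? S_CONT2 : S_ERR;
				break;
			case S_E0:
				state = (b >= 0xA0 && b <= 0xBF) ? S_CONT1 : S_ERR;
				break;
			case S_ED:
				state = (b >= 0x80 && b <= 0x9F) ? S_CONT1 : S_ERR;
				break;
			case S_F0:
				state = (b >= 0x90 && b <= 0xBF) ? S_CONT2 : S_ERR;
				break;
			case S_F4:
				state = (b >= 0x80 && b <= 0x8F) ? S_CONT2 : S_ERR;
				break;
			default:
				return false;
		}
		if (state == S_ERR)
			return false;
	}
	return state == S_START;	/* reject truncated sequences */
}
```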

Re: new heapcheck contrib module

2021-02-04 Thread Robert Haas
On Thu, Feb 4, 2021 at 11:10 AM Mark Dilger
 wrote:
> I also made changes to clean up 0003 (formerly numbered 0004)

"deduplice" is a typo.

I'm not sure that I agree with check_each_database()'s commentary
about why it doesn't make sense to optimize the resolve-the-databases
step. Like, suppose I type 'pg_amcheck sasquatch'. I think the way you
have it coded it's going to tell me that there are no databases to
check, which might make me think I used the wrong syntax or something.
I want it to tell me that sasquatch does not exist. If I happen to be
a cryptid believer, I may reject that explanation as inaccurate, but
at least there's no question about what pg_amcheck thinks the problem
is.

Why does check_each_database() go out of its way to run the main query
without the always-secure search path? If there's a good reason, I
think it deserves a comment saying what the reason is. If there's not
a good reason, then I think it should use the always-secure search
path for 100% of everything. Same question applies to
check_one_database().

ParallelSlotSetHandler(free_slot, VerifyHeapamSlotHandler, sql.data)
could stand to be split over two lines, like you do for the nearly
run_command() call, so that it doesn't go past 80 columns.

I suggest having two variables instead of one for amcheck_schema.
Using the same variable to store the unescaped value and then later
the escaped value is, IMHO, confusing. Whatever you call the escaped
version, I'd rename the function parameters elsewhere to match.

"status = PQsendQuery(conn, sql) == 1" seems a bit uptight to me. Why
not just make status an int and then just "status = PQsendQuery(conn,
sql)" and then test for status != 0? I don't really care if you don't
change this, it's not actually important. But personally I'd rather
code it as if any non-zero value meant success.

I think the pg_log_error() in run_command() could be worded a bit
better. I don't think it's a good idea to try to include the type of
object in there like this, because of the translatability guidelines
around assembling messages from fragments. And I don't think it's good
to say that the check failed because the reality is that we weren't
able to ask for the check to be run in the first place. I would rather
log this as something like "unable to send query: %s". I would also
assume we need to bail out entirely if that happens. I'm not totally
sure what sorts of things can make PQsendQuery() fail but I bet it
boils down to having lost the server connection. Should that occur,
trying to send queries for all of the remaining objects is going to
result in repeating the same error many times, which isn't going to be
what anybody wants. It's unclear to me whether we should give up on
the whole operation but I think we have to at least give up on that
connection... unless I'm confused about what the failure mode is
likely to be here.

It looks to me like the user won't be able to tell by the exit code
what happened. What I did with pg_verifybackup, and what I suggest we
do here, is exit(1) if anything went wrong, either in terms of failing
to execute queries or in terms of those queries returning problem
reports. With pg_verifybackup, I thought about trying to make it like
0 => backup OK, 1 => backup not OK, 2 => trouble, but I found it too
hard to distinguish what should be exit(1) and what should be exit(2)
and the coding wasn't trivial either, so I went with the simpler
scheme.
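Stated as a tiny policy function, the simpler scheme amounts to this (a sketch of the policy, not pg_amcheck's code): exit 0 only when every check could be run and none reported a problem; exit 1 both for query failures and for corruption reports.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Collapse the two failure modes (couldn't run a check, check found
 * corruption) into a single nonzero exit code, mirroring the scheme
 * pg_verifybackup settled on.
 */
static int
compute_exit_code(bool all_queries_ran, bool problems_reported)
{
	if (all_queries_ran && !problems_reported)
		return 0;
	return 1;
}
```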

The opening line of appendDatabaseSelect() could be adjusted to put
the regexps parameter on the next line, avoiding awkward wrapping.

If they are being run with a safe search path, the queries in
appendDatabaseSelect(), appendSchemaSelect(), etc. could be run
without all the paranoia. If not, maybe they should be. The casts to
text don't include the paranoia: with an unsafe search path, we need
pg_catalog.text here. Or no cast at all, which seems like it ought to
be fine too. Not quite sure why you are doing all that casting to
text; the datatype is presumably 'name' and ought to collate like
collate "C" which is probably fine.

It would probably be a better idea for appendSchemaSelect to declare a
PQExpBuffer and call initPQExpBuffer just once, and then
resetPQExpBuffer after each use, and finally termPQExpBuffer just
once. The way you have it is not expensive enough to really matter,
but avoiding repeated allocate/free cycles is probably best.

I wonder if a pattern like .foo.bar ends up meaning the same thing as
a pattern like foo.bar, with the empty database name being treated the
same as if nothing were specified.

From the way appendTableCTE() is coded, it seems to me that if I ask
for tables named j* excluding tables named jam* I still might get
toast tables for my jam, which seems wrong.

There does not seem to be any clear benefit to defining CT_TABLE = 0
in this case, so I would let the compiler deal with it. We should not
be depending on that to have any particular numeric value.

Why does pg_amcheck.c have a 

Re: logical replication worker accesses catalogs in error context callback

2021-02-04 Thread Andres Freund
Hi,

On 2021-02-04 15:08:42 +0530, Bharath Rupireddy wrote:
> On Thu, Feb 4, 2021 at 5:16 AM Andres Freund  wrote:
> > > > > About 0001, have we tried to reproduce the actual bug here which means
> > > > > when the error_callback is called we should face some problem? I feel
> > > > > with the correct testcase we should hit the Assert
> > > > > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > > > > there we expect the transaction to be in a valid state. I understand
> > > > > that the transaction is in a broken state at that time but having a
> > > > > testcase to hit the actual bug makes it easy to test the fix.
> > > >
> > > > I think it's easy to hit - add a trivial DEBUG message to XLogWrite(),
> > > > and run anything involving logical rep using a low enough log
> > > > level. There might even be enough messages with a low enough level
> > > > already somewhere in the relevant paths.
> > >
> > > I'm not sure how adding a DEBUG message to XLogWrite() would ensure
> > > the logical replication worker on the subscriber system enters
> > > slot_store_error_callback and hits the Assert(IsTransactionState());?
> > > Could you please elaborate? Or I may be missing something here to
> > > understand.
> >
> > XLogWrite() is in a critical section, the DEBUG messages triggers error
> > context callbacks to be called, the callbacks allocate memory, which
> > triggers an assertion.
>
> I see that XLogWrite() can be called from logical replication
> worker(apply_handle_commit->apply_handle_commit_internal->CommitTransactionCommand->CommitTransaction->RecordTransactionCommit->XLogFlush->XLogWrite
> --> by now the txn state is not TRANS_INPROGRESS, so
> IsTransactionState() can return false. Adding a DEBUG message there
> can call the error context callback.
>
> But the error context callback slot_store_error_callback, gets used in
> slot_store_data and slot_modify_data. In both places we just register
> the callback before an attribute processing for loop and deregister
> after it. So, when we are in those for loops, we expect to be in
> TRANS_INPROGRESS, see ensure_transaction before slot_store_data and
> the same is true for slot_modify_data. So, to really hit the assert
> failure in the catalogue access from slot_store_error_callback, first,
> we shouldn't be in TRANS_INPROGRESS in those for loops and have a
> DEBUG message. I'm not exactly sure how we achieve this.

I think you're focussing too much on IsTransactionState(). It's easier
to hit the critical section issues.

What makes the issue a bit hard to hit is that to hit it you need a
syscache miss while slot_store_error_callback is set up, and that cache
miss then needs to trigger a read from disk (to potentially trigger a
WAL flush), and then the slot_store_error_callback() needs to actually
do something.

With the below patch

$ cat /tmp/extra_config.conf
log_min_messages=debug5
shared_buffers=1MB

make TEMP_CONFIG=/tmp/extra_config.conf -C src/test/subscription/ check

triggers an assertion most of the time. The catcache.c changes aren't
necessary for this test, but we should have them...

Regards,

Andres

diff --git c/src/backend/access/transam/xlog.c i/src/backend/access/transam/xlog.c
index f03bd473e2b..7f422b4f5f2 100644
--- c/src/backend/access/transam/xlog.c
+++ i/src/backend/access/transam/xlog.c
@@ -2432,6 +2432,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 int startidx;
 uint32  startoffset;
 
+elog(DEBUG1, "writing!");
+
 /* We should always be inside a critical section here */
 Assert(CritSectionCount > 0);
 
@@ -2885,6 +2887,8 @@ XLogFlush(XLogRecPtr record)
 
 START_CRIT_SECTION();
 
+elog(DEBUG1, "checking for flushing");
+
 /*
  * Since fsync is usually a horribly expensive operation, we try to
  * piggyback as much data as we can on each fsync: if we see any more data
diff --git c/src/backend/replication/logical/worker.c i/src/backend/replication/logical/worker.c
index eb7db89cef7..19aa76af889 100644
--- c/src/backend/replication/logical/worker.c
+++ i/src/backend/replication/logical/worker.c
@@ -435,6 +435,8 @@ slot_store_error_callback(void *arg)
 Oid remotetypoid,
 localtypoid;
 
+Assert(IsTransactionState() && CritSectionCount == 0);
+
 /* Nothing to do if remote attribute number is not set */
 if (errarg->remote_attnum < 0)
 return;
diff --git c/src/backend/utils/cache/catcache.c i/src/backend/utils/cache/catcache.c
index fa2b49c676e..c7335c08570 100644
--- c/src/backend/utils/cache/catcache.c
+++ i/src/backend/utils/cache/catcache.c
@@ -1210,7 +1210,7 @@ SearchCatCacheInternal(CatCache *cache,
 CatCTup*ct;
 
 /* Make sure we're in an xact, even if this ends up being a cache hit */
-Assert(IsTransactionState());
+Assert(IsTransactionState() && CritSectionCount == 0);
 
 Assert(cache->cc_nkeys == nkeys);
 
@@ -1523,6 +1523,9 @@ SearchCatCacheList(CatCache *cache,
 MemoryContext 

Re: Bug in COPY FROM backslash escaping multi-byte chars

2021-02-04 Thread Heikki Linnakangas

On 04/02/2021 03:50, Kyotaro Horiguchi wrote:

At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas  wrote in

On 03/02/2021 15:38, John Naylor wrote:

On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
  >
  > Hi,
  >
  > While playing with COPY FROM refactorings in another thread, I noticed
  > corner case where I think backslash escaping doesn't work correctly.
  > Consider the following input:
  >
  > \么.foo
I've seen multibyte delimiters in the wild, so it's not as outlandish
as it seems.


We don't actually support multi-byte characters as delimiters or quote
or escape characters:

postgres=# copy copytest from 'foo' with (delimiter '么');
ERROR:  COPY delimiter must be a single one-byte character


The fix is simple enough, so +1.


Thanks, I'll commit and backpatch shortly.


I'm not sure the assumption in the second hunk always holds, but
that's fine at least with Shift-JIS and -2004 since they are two-byte
encodings.


The assumption is that a multi-byte character cannot have a special 
meaning, as far as the loop in CopyReadLineText is concerned. The 
characters with special meaning are '\\', '\n' and '\r'. That holds 
regardless of encoding.


Thinking about this a bit more, I think the attached patch is slightly 
better. Normally in the loop, raw_buf_ptr points to the next byte to 
consume, and 'c' is the last consumed byte. At the end of the loop, we 
check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd 
and 4th byte if necessary. The crux of the bug is that after the 
"raw_buf_ptr++;" to skip the character after the backslash, we left c set 
to '\\', even though we already consumed the first byte of the next 
character. Because of that, the end-of-the-loop check didn't correctly 
treat it as a multi-byte character. So a more straightforward fix is to 
set 'c' to the byte we skipped over.


- Heikki
>From a9411516dcdcf4c91a9f88dd05741a1479dec7a3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 4 Feb 2021 21:36:46 +0200
Subject: [PATCH v2 1/1] Fix a corner-case in COPY FROM backslash processing.

If a multi-byte character is escaped with a backslash in TEXT mode input,
we didn't always treat the escape correctly. If:

- a multi-byte character is escaped with a backslash, and
- the second byte of the character is 0x5C, i.e. the ASCII code of a
  backslash (\), and
- the next character is a dot (.), then

CopyReadLineText function would incorrectly interpret the sequence as an
end-of-copy marker (\.). This can only happen in encodings that can
"embed" ascii characters as the second byte. One example of such sequence
is '\x5ca45c2e666f6f' in Big5 encoding. If you put that in a file, and
load it with COPY FROM, you'd incorrectly get an "end-of-copy marker
corrupt" error.

Backpatch to all supported versions.

Reviewed-by: John Naylor, Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/a897f84f-8dca-8798-3139-07da5bb38728%40iki.fi
---
 src/backend/commands/copyfromparse.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index b843d315b1..315b16fd7a 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1084,7 +1084,7 @@ CopyReadLineText(CopyFromState cstate)
 break;
 			}
 			else if (!cstate->opts.csv_mode)
-
+			{
 /*
  * If we are here, it means we found a backslash followed by
  * something other than a period.  In non-CSV mode, anything
@@ -1095,8 +1095,16 @@ CopyReadLineText(CopyFromState cstate)
  * backslashes are not special, so we want to process the
  * character after the backslash just like a normal character,
  * so we don't increment in those cases.
+ *
+ * Set 'c' to skip whole character correctly in multi-byte
+ * encodings.  If we don't have the whole character in the
+ * buffer yet, we might loop back to process it, after all,
+ * but that's OK because multi-byte characters cannot have any
+ * special meaning.
  */
 raw_buf_ptr++;
+c = c2;
+			}
 		}
 
 		/*
-- 
2.30.0



Re: Multiple full page writes in a single checkpoint?

2021-02-04 Thread Bruce Momjian
On Wed, Feb  3, 2021 at 08:07:16PM -0500, Bruce Momjian wrote:
> > > I can try to add a hint-bit-page-write page counter, but that might
> > > overflow, and then we will need a way to change the LSN anyway.
> > 
> > That's just a question of width...
> 
> Yeah, the hint bit counter is just delaying the inevitable, plus it
> changes the page format, which I am trying to avoid.  Also, I need this
> dummy record only if the page is marked clean, meaning a write
> to the file system already happened in the current checkpoint --- should
> not be too bad.

In looking at your comments on Sawada-san's POC patch for buffer
encryption:


https://www.postgresql.org/message-id/20210112193431.2edcz776qjen7kao%40alap3.anarazel.de

I see that he put a similar function call in exactly the same place I
did, but you pointed out that he was inserting into WAL while holding a
buffer lock.

I restructured my patch to not make that same mistake, and modified it
for non-permanent buffers --- attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

>From d535b08d264281348ab9e8eac645a5382ec35a8a Mon Sep 17 00:00:00 2001
From: Bruce Momjian 
Date: Thu, 4 Feb 2021 13:43:56 -0500
Subject: [PATCH] hint squash commit

---
 src/backend/access/rmgrdesc/xlogdesc.c   |  6 +-
 src/backend/access/transam/xlog.c|  4 ++
 src/backend/access/transam/xloginsert.c  | 16 +
 src/backend/replication/logical/decode.c |  1 +
 src/backend/storage/buffer/bufmgr.c  | 89 
 src/include/access/xloginsert.h  |  2 +
 src/include/catalog/pg_control.h |  1 +
 7 files changed, 90 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 92cc7ea073..16a967bb71 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -80,7 +80,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 
 		appendStringInfoString(buf, xlrec->rp_name);
 	}
-	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
+	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT ||
+			 info == XLOG_HINT_LSN)
 	{
 		/* no further information to print */
 	}
@@ -185,6 +186,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_HINT_LSN:
+			id = "HINT_LSN";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f03bd473e2..f67316ee07 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10260,6 +10260,10 @@ xlog_redo(XLogReaderState *record)
 			UnlockReleaseBuffer(buffer);
 		}
 	}
+	else if (info == XLOG_HINT_LSN)
+	{
+		/* nothing to do here */
+	}
 	else if (info == XLOG_BACKUP_END)
 	{
 		XLogRecPtr	startpoint;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245e..7635a3d44d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -980,6 +980,22 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 	return recptr;
 }
 
+/*
+ * On the first hint bit change during a checkpoint, XLogSaveBufferForHint()
+ * writes a full page image to the WAL and returns a new LSN which can be
+ * assigned to the page.  Cluster file encryption needs a new LSN for
+ * every hint bit page write because the LSN is used in the nonce, and
+ * the nonce must be unique.  Therefore, this routine just advances the LSN
+ * so the page can be assigned a new LSN.
+ */
+XLogRecPtr
+XLogLSNForHint(void)
+{
+	XLogBeginInsert();
+
+	return XLogInsert(RM_XLOG_ID, XLOG_HINT_LSN);
+}
+
 /*
  * Write a WAL record containing a full image of a page. Caller is responsible
  * for writing the page to disk after calling this routine.
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index afa1df00d0..2ada89c6b1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -223,6 +223,7 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPW_CHANGE:
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
+		case XLOG_HINT_LSN:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..b9f3f76798 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3810,13 +3810,14 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		 * If we need to protect hint bit updates from torn writes, WAL-log a
 		 * full page image of the page. This full page image is only necessary
 		 * if the hint bit update is the first change to the page since the
-		 * last checkpoint.
+		 * last checkpoint.  If cluster file encryption 

Re: Support for NSS as a libpq TLS backend

2021-02-04 Thread Jacob Champion
On Thu, 2021-02-04 at 16:30 +0900, Michael Paquier wrote:
> On Tue, Feb 02, 2021 at 08:33:35PM +, Jacob Champion wrote:
> > Note that this changes the error message printed during the invalid-
> > root tests, because NSS is now sending the root of the chain. So the
> > server's issuer is considered untrusted rather than unrecognized.
> 
> I think that it is not a good idea to attach the since-v*.diff patches
> into the threads.  This causes the CF bot to fail in applying those
> patches.

Ah, sorry about that. Is there an extension I can use (or lack thereof)
that the CF bot will ignore, or does it scan the attachment contents?

--Jacob


Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-04 Thread Alvaro Herrera
On 2021-Feb-05, Julien Rouhaud wrote:

> Thanks, that's way better, copied in v3.  I'm still a bit worried about that
> description though, as that flag isn't consistently set for the FOR UPDATE
> case.  Well, to be more precise it's maybe consistently set when the hint bits
> are computed, but in some cases the flag is later cleared, so you won't
> reliably find it in the tuple.

Hmm, that sounds bogus.  I think the resetting of the other bits should
be undone afterwards, but I'm not sure that we correctly set
KEYS_UPDATED again after the TOAST business.  (What that code does, from
memory, is make the tuple look as if it is fully updated, which is
necessary during the TOAST handling; if the bits are not correctly set
transiently, that's okay.  But it needs to be reinstated later, once the
TOAST stuff is finished.)

-- 
Álvaro Herrera39°49'30"S 73°17'W
"El número de instalaciones de UNIX se ha elevado a 10,
y se espera que este número aumente" (UPM, 1972)




Re: Minimal logical decoding on standbys

2021-02-04 Thread Fabrízio de Royes Mello
On Thu, Feb 4, 2021 at 1:49 PM Drouvot, Bertrand 
wrote:
>
> Had to do a little change to make it compile again (re-add the heapRel
> argument in _bt_delitems_delete() that was removed by commit
> dc43492e46c7145a476cb8ca6200fc8eefe673ef).
>
> Given that this attached v9 version:
>
> - compiles successfully on current master
> - passes "make check"
> - passes the 2 associated tap tests "make -C src/test/recovery check
>   PROVE_TESTS=t/022_standby_logical_decoding_xmins.pl PROVE_FLAGS=-v" and
>   "make -C src/test/recovery check
>   PROVE_TESTS=t/023_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v"
>

Perfect thanks... will review ASAP!

> wouldn't that make sense to (re)add this patch in the commitfest?
>

Added to next commitfest: https://commitfest.postgresql.org/32/2968/

Regards,

-- 
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com


Re: Is Recovery actually paused?

2021-02-04 Thread Robert Haas
On Thu, Feb 4, 2021 at 7:46 AM Dilip Kumar  wrote:
> How can we do that? This is not a 1-byte flag but an enum, so I don't
> think we can read the state atomically without a spin lock here.

I think this discussion of atomics is confused. Let's talk about what
atomic reads and writes mean. Imagine that you have a 64-bit value
0x0101010101010101. Somebody sets it to 0x0202020202020202. Imagine
that just as they are doing that, someone else reads the value and
gets 0x0202020201010101, because half of the value has been updated
and the other half has not been updated yet. This kind of thing
can actually happen on some platforms and what it means is that on
those platforms 8-byte reads and writes are not atomic. The idea of an
"atom" is that it can't be split into pieces but these reads and
writes on some platforms are actually not "atomic" because they are
split into two 4-byte pieces. But there's no such thing as a 1-byte
read or write not being atomic. In theory you could imagine a computer
where, when you change 0x01 to 0x23, a read in the middle sees 0x21
or 0x13 or something, but no actual computers behave that way, or at
least no mainstream ones that anybody cares about. So the idea that
you somehow need a lock to prevent this is just wrong.

Concurrent programs also suffer from another problem which is
reordering of operations, which can happen either as the program is
compiled or as the program is executed by the CPU. The CPU can see you
set a->x = 1 and a->y = 2 and decide to update y first and then x even
though you wrote it the other way around in the program text. To
prevent this, we have barrier operations; see README.barrier in the
source tree for a longer explanation. Atomic operations like
compare-and-exchange are also full barriers, so that they not only
prevent the torn read/write problem described above, but also enforce
order of operations more strictly.

Now I don't know whether a lock is needed here or not. Maybe it is;
perhaps for consistency with other code, perhaps because the lock
acquire and release is serving the function of a barrier; or perhaps
to guard against some other hazard. But saying that it's because
reading or writing a 1-byte value might not be atomic does not sound
correct.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-02-04 Thread Dilip Kumar
On Wed, Feb 3, 2021 at 2:07 AM Robert Haas  wrote:

While going through your comments, I need some suggestions about one
of them; apart from that one, all the other comments look fine to me.
>
> It puzzles me that CompareCompressionMethodAndDecompress() calls
> slot_getallattrs() just before clearing the slot. It seems like this
> ought to happen before we loop over the attributes, so that we don't
> need to call slot_getattr() every time.

Yeah, actually, I thought I would avoid calling slot_getallattrs if
none of the attributes got decompressed.  I agree that if we call it
up front we can avoid calling slot_getattr, but slot_getattr is only
called for attributes with attlen -1.  I agree that if we call
slot_getattr for attnum n then it will deform all the attributes
before that, but then slot_getallattrs only needs to deform the
remaining attributes, not all of them.  Still, maybe we can call
slot_getallattrs as soon as we see the first attribute with attlen -1
and then avoid the subsequent slot_getattr calls; that may be better
than what I have, because we would avoid calling slot_getattr for many
attributes, especially when there are many varlenas.

> See the comment for that
> function. But even if we didn't do that for some reason, why would we
> do it here? If it's already been done, it shouldn't do anything, and
> if it hasn't been done, it might overwrite some of the values we just
> poked into tts_values.

It will not overwrite those values because slot_getallattrs will only
fetch the values for "attnum > slot->tts_nvalid" so whatever we
already fetched will not be overwritten.  I just did that at the end to
optimize the normal cases where we are not doing "insert into select *
from", so that those can get away without calling slot_getallattrs at
all.  However, calling slot_getattr for each varlena might cost us
extra, so I am okay with calling slot_getallattrs this early.

> It also seems suspicious that we can get away
> with clearing the slot and then again marking it valid. I'm not sure
> it really works like that. Like, can't clearing the slot invalidate
> pointers stored in tts_values[]? For instance, if they are pointers
> into an in-memory heap tuple, tts_heap_clear() is going to free the
> tuple; if they are pointers into a buffer, tts_buffer_heap_clear() is
> going to unpin it.

Yeah, that's completely wrong.  I think I missed that part.  One
solution could be to detach the tuple from the slot and then
materialize it, so that it forms a tuple with the new values, after
which we can clear the old tuple.  But that seems a bit hacky.

> I think the supported procedure for this sort of
> thing is to have a second slot, set tts_values, tts_isnull etc. and
> then materialize the slot. After materializing the new slot, it's
> independent of the old slot, which can then be cleared. See for
> example tts_virtual_materialize().

Okay, so if we take a new slot then we also need to set the slot
reference in the ScanState, otherwise it might still point to the old
slot.  I haven't yet analyzed all the places where we might be keeping
a reference to the old slot, or I may be missing something.

Anyway, I will get a better idea once I try to implement this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-04 Thread Julien Rouhaud
On Thu, Feb 04, 2021 at 01:22:35PM -0300, Alvaro Herrera wrote:
> On 2021-Feb-05, Julien Rouhaud wrote:
> 
> >  - HEAP_KEYS_UPDATED
> >This bit lives in t_infomask2.  If set, indicates that the XMAX updated
> > -  this tuple and changed the key values, or it deleted the tuple.
> > -  It's set regardless of whether the XMAX is a TransactionId or a 
> > MultiXactId.
> > +  this tuple and changed the key values, or it deleted the tuple.  It can 
> > also
> > +  be set in combination of HEAP_XMAX_LOCK_ONLY.  It's set regardless of 
> > whether
> > +  the XMAX is a TransactionId or a MultiXactId.
> 
> I think we should reword this more completely, to avoid saying one thing
> (that the op is an update or delete) and then contradicting ourselves
> (that it can also be a lock).  I propose this:
> 
>   This bit lives in t_infomask2.  If set, it indicates that the
>   operation(s) done by the XMAX compromise the tuple key, such as
>   a SELECT FOR UPDATE, an UPDATE that modifies the columns of the
>   key, or a DELETE.

Thanks, that's way better, copied in v3.  I'm still a bit worried about that
description though, as that flag isn't consistently set for the FOR UPDATE
case.  Well, to be more precise it's maybe consistently set when the hint bits
are computed, but in some cases the flag is later cleared, so you won't
reliably find it in the tuple.
>From d2eb1a1314a92adc2d0115192d02164aef09bb88 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Fri, 22 Jan 2021 00:06:34 +0800
Subject: [PATCH v3] Allow HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED
 combination.

This combination of hint bits was previously detected as a form of corruption,
but it can be obtained with some mix of SELECT ... FOR UPDATE and UPDATE
queries.

Discussion: https://postgr.es/m/20210124061758.GA11756@nol
---
 contrib/amcheck/t/001_verify_heapam.pl | 14 +-
 contrib/amcheck/verify_heapam.c|  7 ---
 src/backend/access/heap/README.tuplock |  7 ---
 src/backend/access/heap/hio.c  |  8 +++-
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/contrib/amcheck/t/001_verify_heapam.pl b/contrib/amcheck/t/001_verify_heapam.pl
index a2f65b826d..6050feb712 100644
--- a/contrib/amcheck/t/001_verify_heapam.pl
+++ b/contrib/amcheck/t/001_verify_heapam.pl
@@ -4,7 +4,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 
-use Test::More tests => 79;
+use Test::More tests => 80;
 
 my ($node, $result);
 
@@ -47,6 +47,9 @@ detects_heap_corruption(
 #
 fresh_test_table('test');
 $node->safe_psql('postgres', q(VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test));
+detects_no_corruption(
+   "verify_heapam('test')",
+   "all-frozen not corrupted table");
 corrupt_first_page('test');
 detects_heap_corruption("verify_heapam('test')",
"all-frozen corrupted table");
@@ -92,6 +95,15 @@ sub fresh_test_table
ALTER TABLE $relname ALTER b SET STORAGE external;
INSERT INTO $relname (a, b)
(SELECT gs, repeat('b',gs*10) FROM 
generate_series(1,1000) gs);
+   BEGIN;
+   SAVEPOINT s1;
+   SELECT 1 FROM $relname WHERE a = 42 FOR UPDATE;
+   UPDATE $relname SET b = b WHERE a = 42;
+   RELEASE s1;
+   SAVEPOINT s1;
+   SELECT 1 FROM $relname WHERE a = 42 FOR UPDATE;
+   UPDATE $relname SET b = b WHERE a = 42;
+   COMMIT;
));
 }
 
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 88ab32490c..49f5ca0ef2 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -608,13 +608,6 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
   
ctx->tuphdr->t_hoff, ctx->lp_len));
header_garbled = true;
}
-   if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
-   (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED))
-   {
-   report_corruption(ctx,
- pstrdup("tuple is marked as 
only locked, but also claims key columns were updated"));
-   header_garbled = true;
-   }
 
if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
(ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI))
diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index d03ddf6cdc..6441e8baf0 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -146,9 +146,10 @@ The following infomask bits are applicable:
   FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit.
 
 - HEAP_KEYS_UPDATED
-  This bit lives in t_infomask2.  If set, indicates that the XMAX updated
-  this tuple and changed the key values, or it deleted the tuple.
-  It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.
+  

Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-04 Thread Alvaro Herrera
On 2021-Feb-05, Julien Rouhaud wrote:

>  - HEAP_KEYS_UPDATED
>This bit lives in t_infomask2.  If set, indicates that the XMAX updated
> -  this tuple and changed the key values, or it deleted the tuple.
> -  It's set regardless of whether the XMAX is a TransactionId or a 
> MultiXactId.
> +  this tuple and changed the key values, or it deleted the tuple.  It can 
> also
> +  be set in combination of HEAP_XMAX_LOCK_ONLY.  It's set regardless of 
> whether
> +  the XMAX is a TransactionId or a MultiXactId.

I think we should reword this more completely, to avoid saying one thing
(that the op is an update or delete) and then contradicting ourselves
(that it can also be a lock).  I propose this:

This bit lives in t_infomask2.  If set, it indicates that the
operation(s) done by the XMAX compromise the tuple key, such as
a SELECT FOR UPDATE, an UPDATE that modifies the columns of the
key, or a DELETE.

Also, I just noticed that the paragraph just above this one says that
HEAP_XMAX_EXCL_LOCK is used for both SELECT FOR UPDATE and SELECT FOR NO
KEY UPDATE, and that this bit is what differentiates them.

-- 
Álvaro Herrera   Valdivia, Chile
"Oh, great altar of passive entertainment, bestow upon me thy discordant images
at such speed as to render linear thought impossible" (Calvin a la TV)




Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-04 Thread Julien Rouhaud
On Thu, Feb 04, 2021 at 08:34:13PM +0530, Mahendra Singh Thalor wrote:
> On Tue, 2 Feb 2021 at 09:51, Julien Rouhaud  wrote:
> >
> > On Mon, Feb 01, 2021 at 02:00:48PM -0300, Alvaro Herrera wrote:
> > > So I think that I misspoke earlier in this thread when I said this is a
> > > bug, and that the right fix here is to remove the Assert() and change
> > > amcheck to match.
> >
> > I'm attaching a patch to do so.
> 
> Thanks Julien for the patch.
> 
> Patch looks good to me and it fixes the problem. I think we can
> register it in the CF.

Thanks for looking at it!  I just created an entry for the next commitfest.

> >
> > Changing the name may be overkill, but claryfing the hint bit usage in
> > README.tuplock would definitely be useful, especially since the combination
> > isn't always produced.  How about adding something like:
> >
> >  HEAP_KEYS_UPDATED
> >   This bit lives in t_infomask2.  If set, indicates that the XMAX updated
> >   this tuple and changed the key values, or it deleted the tuple.
> > +  It can also be set in combination of HEAP_XMAX_LOCK_ONLY.
> >   It's set regardless of whether the XMAX is a TransactionId or a 
> > MultiXactId.
> Makes sense. Can you please update this?

Sure, done in attached v2!
>From d74c216d8817659237f26baec789c6b79d337296 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Fri, 22 Jan 2021 00:06:34 +0800
Subject: [PATCH v2] Allow HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED
 combination.

This combination of hint bits was previously detected as a form of corruption,
but it can be obtained with some mix of SELECT ... FOR UPDATE and UPDATE
queries.

Discussion: https://postgr.es/m/20210124061758.GA11756@nol
---
 contrib/amcheck/t/001_verify_heapam.pl | 14 +-
 contrib/amcheck/verify_heapam.c|  7 ---
 src/backend/access/heap/README.tuplock |  5 +++--
 src/backend/access/heap/hio.c  |  8 +++-
 4 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/contrib/amcheck/t/001_verify_heapam.pl b/contrib/amcheck/t/001_verify_heapam.pl
index a2f65b826d..6050feb712 100644
--- a/contrib/amcheck/t/001_verify_heapam.pl
+++ b/contrib/amcheck/t/001_verify_heapam.pl
@@ -4,7 +4,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 
-use Test::More tests => 79;
+use Test::More tests => 80;
 
 my ($node, $result);
 
@@ -47,6 +47,9 @@ detects_heap_corruption(
 #
 fresh_test_table('test');
 $node->safe_psql('postgres', q(VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test));
+detects_no_corruption(
+   "verify_heapam('test')",
+   "all-frozen not corrupted table");
 corrupt_first_page('test');
 detects_heap_corruption("verify_heapam('test')",
"all-frozen corrupted table");
@@ -92,6 +95,15 @@ sub fresh_test_table
ALTER TABLE $relname ALTER b SET STORAGE external;
INSERT INTO $relname (a, b)
(SELECT gs, repeat('b',gs*10) FROM 
generate_series(1,1000) gs);
+   BEGIN;
+   SAVEPOINT s1;
+   SELECT 1 FROM $relname WHERE a = 42 FOR UPDATE;
+   UPDATE $relname SET b = b WHERE a = 42;
+   RELEASE s1;
+   SAVEPOINT s1;
+   SELECT 1 FROM $relname WHERE a = 42 FOR UPDATE;
+   UPDATE $relname SET b = b WHERE a = 42;
+   COMMIT;
));
 }
 
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 88ab32490c..49f5ca0ef2 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -608,13 +608,6 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
   
ctx->tuphdr->t_hoff, ctx->lp_len));
header_garbled = true;
}
-   if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
-   (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED))
-   {
-   report_corruption(ctx,
- pstrdup("tuple is marked as 
only locked, but also claims key columns were updated"));
-   header_garbled = true;
-   }
 
if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
(ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI))
diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index d03ddf6cdc..6d357565a8 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -147,8 +147,9 @@ The following infomask bits are applicable:
 
 - HEAP_KEYS_UPDATED
   This bit lives in t_infomask2.  If set, indicates that the XMAX updated
-  this tuple and changed the key values, or it deleted the tuple.
-  It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.
+  this tuple and changed the key values, or it deleted the tuple.  It can also
+  be set in combination of HEAP_XMAX_LOCK_ONLY.  It's set regardless of whether
+  the XMAX is a TransactionId or a MultiXactId.
 
 

Re: Removing support for COPY FROM STDIN in protocol version 2

2021-02-04 Thread Heikki Linnakangas

On 04/02/2021 17:35, Tom Lane wrote:

Alvaro Herrera  writes:

Yeah, the changes I was thinking about are all in libpq-int.h so that's
not really a problem.  But one enum in libpq-fe.h renumbers values, and
I think it's better to keep the old value labelled as "unused" to avoid
any changes.


Oh, yeah, can't do that.  libpq-fe.h probably shouldn't change at all;
but certainly we can't renumber existing enum values there.


Ah, right, there's even a comment above the enum that says that's a no 
no. But yeah, fixing that, I see no need for .so version bump.


- Heikki




Re: Removing support for COPY FROM STDIN in protocol version 2

2021-02-04 Thread Tom Lane
Alvaro Herrera  writes:
> Yeah, the changes I was thinking about are all in libpq-int.h so that's
> not really a problem.  But one enum in libpq-fe.h renumbers values, and
> I think it's better to keep the old value labelled as "unused" to avoid
> any changes.

Oh, yeah, can't do that.  libpq-fe.h probably shouldn't change at all;
but certainly we can't renumber existing enum values there.

regards, tom lane




Re: get rid of <foreignphrase> tags in the docs?

2021-02-04 Thread Tom Lane
John Naylor  writes:
> While looking at the proposed removal of the v2 protocol, I noticed that we
> italicize some, but not all, instances of 'per se', 'pro forma', and 'ad
> hoc'. I'd say these are widespread enough in formal registers of English
> that they hardly need to be called out as foreign, so I propose removing
> the <foreignphrase> tags for those words.

+1, nobody italicizes those in normal usage.

> The other case is 'voilà', found in rules.sgml. The case for italics here
> is stronger, but looking at that file, I actually think a more
> generic-sounding phrase here would be preferable.

Yeah, seeing that we only use that in one place, I think we could do
without it.  Looks like something as pedestrian as "The results are:"
would do fine.

regards, tom lane




Re: Removing support for COPY FROM STDIN in protocol version 2

2021-02-04 Thread Alvaro Herrera
On 2021-Feb-04, Tom Lane wrote:

> Alvaro Herrera  writes:
> > On 2021-Feb-04, Heikki Linnakangas wrote:
> >> Ok, here we go.
> 
> > Are you going to bump the .so version for this?  I think that should be
> > done, since some functions disappear and there are struct changes.  It
> > is curious, though, to see that exports.txt needs no changes.
> 
> Uh, what?  There should be no externally visible ABI changes in libpq
> (he says without having read the patch).  If there's a need for a library
> major version bump, that'd be sufficient reason not to do this IMO.

Yeah, the changes I was thinking about are all in libpq-int.h so that's
not really a problem.  But one enum in libpq-fe.h renumbers values, and
I think it's better to keep the old value labelled as "unused" to avoid
any changes.

-- 
Álvaro Herrera   Valdivia, Chile




Re: Snapbuild woes followup

2021-02-04 Thread Alvaro Herrera
On 2021-Jan-25, Andres Freund wrote:

> See attached patch...

Looks good to me.

I was wondering if there would be a point in using a FullTransactionId
instead of an unadorned one.  I don't know what's the true risk of
an Xid wraparound occurring here, but it seems easier to reason about.
But then that's probably a larger change to make all of snapbuild use
FullTransactionIds, so not for this patch to worry about.

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: Removing support for COPY FROM STDIN in protocol version 2

2021-02-04 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Feb-04, Heikki Linnakangas wrote:
>> Ok, here we go.

> Are you going to bump the .so version for this?  I think that should be
> done, since some functions disappear and there are struct changes.  It
> is curious, though, to see that exports.txt needs no changes.

Uh, what?  There should be no externally visible ABI changes in libpq
(he says without having read the patch).  If there's a need for a library
major version bump, that'd be sufficient reason not to do this IMO.

regards, tom lane




Re: Is MaxHeapAttributeNumber a reasonable restriction for foreign-tables?

2021-02-04 Thread Kohei KaiGai
2021年2月4日(木) 23:45 Tom Lane :
>
> Amit Langote  writes:
> > On Thu, Feb 4, 2021 at 4:24 PM Kohei KaiGai  wrote:
> >> I think that MaxHeapAttributeNumber is a senseless restriction for foreign-
> >> tables. How about your opinions?
>
> > My first reaction to this was a suspicion that the
> > MaxHeapAttributeNumber limit would be too ingrained in PostgreSQL's
> > architecture to consider this matter lightly, but actually browsing
> > the code, that may not really be the case.
>
> You neglected to search for MaxTupleAttributeNumber...
>
> I'm quite skeptical of trying to raise this limit significantly.
>
> In the first place, you'd have to worry about the 2^15 limit on
> int16 AttrNumbers --- and keep in mind that that has to be enough
> for reasonable-size joins, not only an individual table.  If you
> join a dozen or so max-width tables, you're already most of the way
> to that limit.
>
free_parsestate() also prevents using a target list with more than
MaxTupleAttributeNumber entries.
(But that is a reasonable restriction, because we cannot guarantee that a
HeapTupleTableSlot is not used during query execution.)

> In the second place, as noted by the comment you quoted, there are
> algorithms in various places that are O(N^2) (or maybe even worse?)
> in the number of columns they're dealing with.
>
Only at table creation time, isn't it?
If N is not small (probably >100), we can use a temporary HTAB to ensure
that no duplicate column name is supplied.

> In the third place, I've yet to see a use-case that didn't represent
> crummy table design.  Pushing the table off to a remote server doesn't
> make it less crummy design.
>
I hit this limitation when creating a foreign table that tries to map an
Apache Arrow file containing ~2,500 attributes of scientific observation
data. Apache Arrow internally uses a columnar format, and queries on this
data set reference only 10-15 columns on average, so it would make query
execution much more efficient.

Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: Is MaxHeapAttributeNumber a reasonable restriction for foreign-tables?

2021-02-04 Thread Amit Langote
On Thu, Feb 4, 2021 at 11:45 PM Tom Lane  wrote:
> Amit Langote  writes:
> > On Thu, Feb 4, 2021 at 4:24 PM Kohei KaiGai  wrote:
> >> I think that MaxHeapAttributeNumber is a senseless restriction for foreign-
> >> tables. How about your opinions?
>
> > My first reaction to this was a suspicion that the
> > MaxHeapAttributeNumber limit would be too ingrained in PostgreSQL's
> > architecture to consider this matter lightly, but actually browsing
> > the code, that may not really be the case.
>
> You neglected to search for MaxTupleAttributeNumber...

Ah, I did.  Although, even its usage seems mostly limited to modules
under src/backend/access/heap.

> I'm quite skeptical of trying to raise this limit significantly.
>
> In the first place, you'd have to worry about the 2^15 limit on
> int16 AttrNumbers --- and keep in mind that that has to be enough
> for reasonable-size joins, not only an individual table.  If you
> join a dozen or so max-width tables, you're already most of the way
> to that limit.
>
> In the second place, as noted by the comment you quoted, there are
> algorithms in various places that are O(N^2) (or maybe even worse?)
> in the number of columns they're dealing with.

Those are certainly intimidating considerations.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Removing support for COPY FROM STDIN in protocol version 2

2021-02-04 Thread Alvaro Herrera
On 2021-Feb-04, Heikki Linnakangas wrote:

> On 04/02/2021 08:54, Michael Paquier wrote:
> > On Wed, Feb 03, 2021 at 11:29:37AM -0500, Tom Lane wrote:
> > > Then let's kill it dead, server and libpq both.
> > 
> > Yeah.
> 
> Ok, here we go.

Are you going to bump the .so version for this?  I think that should be
done, since some functions disappear and there are struct changes.  It
is curious, though, to see that exports.txt needs no changes.

(I'm not sure what's our protocol for so-version changes. Do we wait
till end of cycle, or do we put it together with the commit that
modifies the library? src/tools/RELEASE_CHANGES doesn't say)

-- 
Álvaro Herrera   Valdivia, Chile




Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-04 Thread Mahendra Singh Thalor
On Tue, 2 Feb 2021 at 09:51, Julien Rouhaud  wrote:
>
> On Mon, Feb 01, 2021 at 02:00:48PM -0300, Alvaro Herrera wrote:
> > On 2021-Jan-24, Julien Rouhaud wrote:
> >
> > > While working on pg14 compatibility for an extension relying on an apparently
> > > uncommon combination of FOR UPDATE and stored function calls, I hit some new
> > > Asserts introduced in 866e24d47db (Extend amcheck to check heap pages):
> > >
> > > +   /*
> > > +* Do not allow tuples with invalid combinations of hint bits to be placed
> > > +* on a page.  These combinations are detected as corruption by the
> > > +* contrib/amcheck logic, so if you disable one or both of these
> > > +* assertions, make corresponding changes there.
> > > +*/
> > > +   Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> > > +(tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
> > >
> > >
> > > I attach a simple self contained script to reproduce the problem, the last
> > > UPDATE triggering the Assert.
> >
> > Maybe we should contest the idea that this is a sensible thing to Assert
> > against.  AFAICS this was originally suggested here:
> > https://www.postgresql.org/message-id/flat/CAFiTN-syyHc3jZoou51v0SR8z0POoNfktqEO6MaGig4YS8mosA%40mail.gmail.com#ad215d0ee0606b5f67bbc57d011c96b8
> > and it appears now to have been a bad idea.  If I recall correctly,
> > HEAP_KEYS_UPDATED is supposed to distinguish locks/updates that don't
> > modify the key columns from those that do.  Since SELECT FOR UPDATE
> > stands for a future update that may modify arbitrary portions of the
> > tuple (including "key" columns), then it produces that bit, just as said
> > UPDATE or a DELETE; as opposed to SELECT FOR NO KEY UPDATE which stands
> > for a future UPDATE that will only change columns that aren't part of
> > any keys.
>
> Thanks for the clarification, that makes sense.
>
> > So I think that I misspoke earlier in this thread when I said this is a
> > bug, and that the right fix here is to remove the Assert() and change
> > amcheck to match.
>
> I'm attaching a patch to do so.

Thanks Julien for the patch.

Patch looks good to me and it is fixing the problem. I think we can
register in CF.

>
> > Separately, maybe it'd also be good to have a test case based on
> > Julien's SQL snippet that produces this particular infomask combination
> > (and other interesting combinations) and passes them through VACUUM etc
> > to see that everything behaves correctly.
>
> I also updated amcheck perl regression tests to generate such a combination,
> add added an additional pass of verify_heapam() just after the VACUUM.
>
> >
> > You could also argue the HEAP_KEYS_UPDATED is a misnomer and that we'd
> > do well to change its name, and update README.tuplock.
>
> Changing the name may be overkill, but clarifying the hint bit usage in
> README.tuplock would definitely be useful, especially since the combination
> isn't always produced.  How about adding something like:
>
>  HEAP_KEYS_UPDATED
>   This bit lives in t_infomask2.  If set, indicates that the XMAX updated
>   this tuple and changed the key values, or it deleted the tuple.
> +  It can also be set in combination with HEAP_XMAX_LOCK_ONLY.
>   It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.
Makes sense. Can you please update this?

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: adding wait_start column to pg_locks

2021-02-04 Thread torikoshia

On 2021-02-03 11:23, Fujii Masao wrote:
> 64-bit fetches are not atomic on some platforms. So spinlock is
> necessary when updating "waitStart" without holding the partition
> lock? Also GetLockStatusData() needs spinlock when reading
> "waitStart"?
>
> Also it might be worth thinking to use 64-bit atomic operations like
> pg_atomic_read_u64(), for that.


Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and 
pg_atomic_write_u64().


waitStart is TimestampTz, i.e., int64, but it seems pg_atomic_read_xxx
and pg_atomic_write_xxx only support unsigned integers, so I cast the type.


I may be using these functions not correctly, so if something is wrong, 
I would appreciate any comments.



About the documentation, since your suggestion seems better than v6, I 
used it as is.



Regards,

--
Atsushi Torikoshi

From 38a3d8996c4b1690cf18cdb1015e270201d34330 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Thu, 4 Feb 2021 23:23:36 +0900
Subject: [PATCH v7] To examine the duration of locks, we did join on pg_locks
 and pg_stat_activity and used columns such as query_start or state_change.
 However, since they are the moment when queries have started or their state
 has changed, we could not get the lock duration in this way. This patch adds
 a new field "waitstart" preserving lock acquisition wait start time.

Note that updating this field and lock acquisition are not performed
synchronously for performance reasons.  Therefore, depending on the
timing, it can happen that waitstart is NULL even though granted is
false.

Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a...@oss.nttdata.com

---
 contrib/amcheck/expected/check_btree.out |  4 ++--
 doc/src/sgml/catalogs.sgml   | 13 +
 src/backend/storage/ipc/standby.c| 17 +++--
 src/backend/storage/lmgr/lock.c  |  8 
 src/backend/storage/lmgr/proc.c  | 14 ++
 src/backend/utils/adt/lockfuncs.c|  9 -
 src/include/catalog/pg_proc.dat  |  6 +++---
 src/include/storage/lock.h   |  2 ++
 src/include/storage/proc.h   |  1 +
 src/test/regress/expected/rules.out  |  5 +++--
 10 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 13848b7449..5a3f1ef737 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -97,8 +97,8 @@ SELECT bt_index_parent_check('bttest_b_idx');
 SELECT * FROM pg_locks
 WHERE relation = ANY(ARRAY['bttest_a', 'bttest_a_idx', 'bttest_b', 'bttest_b_idx']::regclass[])
 AND pid = pg_backend_pid();
- locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath 
-----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+------+---------+----------
+ locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart 
+----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+------+---------+----------+-----------
 (0 rows)
 
 COMMIT;
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 865e826fb0..7df4c30a65 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -10592,6 +10592,19 @@ SCRAM-SHA-256$iteration count:
lock table
   
  
+
+ 
+  
+   <structfield>waitstart</structfield> <type>timestamptz</type>
+  
+  
+   Time when the server process started waiting for this lock,
+   or null if the lock is held.
+   Note that this can be null for a very short period of time after
+   the wait started even though <structfield>granted</structfield>
+   is false.
+  
+ 
 

   
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..1c8135ba74 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -539,13 +539,26 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 void
 ResolveRecoveryConflictWithLock(LOCKTAG locktag, bool logging_conflict)
 {
-	TimestampTz ltime;
+	TimestampTz ltime, now;
 
 	Assert(InHotStandby);
 
 	ltime = GetStandbyLimitTime();
+	now = GetCurrentTimestamp();
 
-	if (GetCurrentTimestamp() >= ltime && ltime != 0)
+	/*
+	 * Record waitStart using the current time obtained for comparison
+	 * with ltime.
+	 *
+	 * It would be ideal if this could be done synchronously with updating
+	 * the lock information.  However, since that would mean holding
+	 * partitionLock longer, hurting performance, we do it here
+	 * asynchronously.
+	 */
+	if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+		

get rid of <foreignphrase> tags in the docs?

2021-02-04 Thread John Naylor
Hi,

While looking at the proposed removal of the v2 protocol, I noticed that we
italicize some, but not all, instances of 'per se', 'pro forma', and 'ad
hoc'. I'd say these are widespread enough in formal registers of English
that they hardly need to be called out as foreign, so I propose removing
the <foreignphrase> tags for those words. Alternatively, we could just add tags to make
existing usage consistent, but I have little reason to think it will stay
that way. It's also impractical to go and search for other possible words
that should have been italicized but weren't.

The other case is 'voilà', found in rules.sgml. The case for italics here
is stronger, but looking at that file, I actually think a more
generic-sounding phrase here would be preferable.

Other opinions?

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-04 Thread Ronan Dunklau


> Hmmm, that sure seems like a bug, or at least unexpected behavior (that
> I don't see mentioned in the docs).
>
> But the patch seems borked in some way:
>
> $ patch -p1 < ~/keep_attstattargets_on_reindex_concurrently.patch
> patch:  Only garbage was found in the patch input.
>
> There seem to be strange escape characters and so on, how did you create
> the patch? Maybe some syntax coloring, or something?


You're right, I had syntax coloring in the output, sorry.

Please find attached a correct patch.

Regards,

--
Ronan Dunklau

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a70fe4d2c..b4adb32af0 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -107,7 +107,8 @@ static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
 		  List *indexColNames,
 		  Oid accessMethodObjectId,
 		  Oid *collationObjectId,
-		  Oid *classObjectId);
+		  Oid *classObjectId,
+		  int32 *attstattargets);
 static void InitializeAttributeOids(Relation indexRelation,
 	int numatts, Oid indexoid);
 static void AppendAttributeTuples(Relation indexRelation, Datum *attopts);
@@ -272,7 +273,8 @@ ConstructTupleDescriptor(Relation heapRelation,
 		 List *indexColNames,
 		 Oid accessMethodObjectId,
 		 Oid *collationObjectId,
-		 Oid *classObjectId)
+		 Oid *classObjectId,
+		 int32 *attstattargets)
 {
 	int			numatts = indexInfo->ii_NumIndexAttrs;
 	int			numkeyatts = indexInfo->ii_NumIndexKeyAttrs;
@@ -310,12 +312,14 @@ ConstructTupleDescriptor(Relation heapRelation,
 
 		MemSet(to, 0, ATTRIBUTE_FIXED_PART_SIZE);
 		to->attnum = i + 1;
-		to->attstattarget = -1;
 		to->attcacheoff = -1;
 		to->attislocal = true;
 		to->attcollation = (i < numkeyatts) ?
 			collationObjectId[i] : InvalidOid;
-
+		if (attstattargets != NULL)
+			to->attstattarget = attstattargets[i];
+		else
+			to->attstattarget = -1;
 		/*
 		 * Set the attribute name as specified by caller.
 		 */
@@ -697,6 +701,7 @@ index_create(Relation heapRelation,
 			 Oid *collationObjectId,
 			 Oid *classObjectId,
 			 int16 *coloptions,
+			 int32 *colstattargets,
 			 Datum reloptions,
 			 bits16 flags,
 			 bits16 constr_flags,
@@ -882,7 +887,8 @@ index_create(Relation heapRelation,
 			indexColNames,
 			accessMethodObjectId,
 			collationObjectId,
-			classObjectId);
+			classObjectId,
+			colstattargets);
 
 	/*
 	 * Allocate an OID for the index, unless we were told what to use.
@@ -1413,11 +1419,13 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 optionDatum;
 	oidvector  *indclass;
 	int2vector *indcoloptions;
+	int32  *colstattargets;
 	bool		isnull;
 	List	   *indexColNames = NIL;
 	List	   *indexExprs = NIL;
 	List	   *indexPreds = NIL;
 
+
 	indexRelation = index_open(oldIndexId, RowExclusiveLock);
 
 	/* The new index needs some information from the old index */
@@ -1501,10 +1509,11 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 			true);
 
 	/*
-	 * Extract the list of column names and the column numbers for the new
+	 * Extract the list of column names, column numbers and stattargets for the new
 	 * index information.  All this information will be used for the index
 	 * creation.
 	 */
+	colstattargets = palloc(sizeof(int32) * oldInfo->ii_NumIndexAttrs);
 	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
 	{
 		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
@@ -1512,8 +1521,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 
 		indexColNames = lappend(indexColNames, NameStr(att->attname));
 		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		colstattargets[i] = att->attstattarget;
 	}
-
 	/*
 	 * Now create the new index.
 	 *
@@ -1534,13 +1543,14 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 			  indexRelation->rd_indcollation,
 			  indclass->values,
 			  indcoloptions->values,
+			  colstattargets,
 			  optionDatum,
 			  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT,
 			  0,
 			  true, /* allow table to be a system catalog? */
 			  false,	/* is_internal? */
 			  NULL);
-
+	pfree(colstattargets);
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
 	ReleaseSysCache(indexTuple);
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index d7b806020d..09c1be3611 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -313,7 +313,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
  list_make2("chunk_id", "chunk_seq"),
  BTREE_AM_OID,
  rel->rd_rel->reltablespace,
- collationObjectId, classObjectId, coloptions, (Datum) 0,
+ collationObjectId, classObjectId, coloptions, NULL, (Datum) 0,
  INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
 
 	

Re: Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-04 Thread Tomas Vondra
On 2/4/21 11:04 AM, Ronan Dunklau wrote:
> Hello !
> 
> ...
> 
> junk=# REINDEX INDEX CONCURRENTLY t1_date_trunc_idx;
> REINDEX
> junk=# \d+ t1_date_trunc_idx
>                            Index "public.t1_date_trunc_idx"
>    Column   |            Type             | Key? |         Definition          | Storage | Stats target
> ------------+-----------------------------+------+-----------------------------+---------+--------------
>  date_trunc | timestamp without time zone | yes  | date_trunc('day'::text, ts) | plain   |
> btree, for table "public.t1"
> 
> 
> I'm attaching a patch possibly solving the problem, but maybe the proposed 
> changes will be too intrusive ?
> 

Hmmm, that sure seems like a bug, or at least unexpected behavior (that
I don't see mentioned in the docs).

But the patch seems borked in some way:

$ patch -p1 < ~/keep_attstattargets_on_reindex_concurrently.patch
patch:  Only garbage was found in the patch input.

There seem to be strange escape characters and so on, how did you create
the patch? Maybe some syntax coloring, or something?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Is MaxHeapAttributeNumber a reasonable restriction for foreign-tables?

2021-02-04 Thread Tom Lane
Amit Langote  writes:
> On Thu, Feb 4, 2021 at 4:24 PM Kohei KaiGai  wrote:
>> I think that MaxHeapAttributeNumber is a senseless restriction for foreign-
>> tables. How about your opinions?

> My first reaction to this was a suspicion that the
> MaxHeapAttributeNumber limit would be too ingrained in PostgreSQL's
> architecture to consider this matter lightly, but actually browsing
> the code, that may not really be the case.

You neglected to search for MaxTupleAttributeNumber...

I'm quite skeptical of trying to raise this limit significantly.

In the first place, you'd have to worry about the 2^15 limit on
int16 AttrNumbers --- and keep in mind that that has to be enough
for reasonable-size joins, not only an individual table.  If you
join a dozen or so max-width tables, you're already most of the way
to that limit.

In the second place, as noted by the comment you quoted, there are
algorithms in various places that are O(N^2) (or maybe even worse?)
in the number of columns they're dealing with.

In the third place, I've yet to see a use-case that didn't represent
crummy table design.  Pushing the table off to a remote server doesn't
make it less crummy design.

regards, tom lane




Re: Is Recovery actually paused?

2021-02-04 Thread Dilip Kumar
On Thu, Feb 4, 2021 at 6:16 PM Dilip Kumar  wrote:
>
> On Thu, Feb 4, 2021 at 4:58 PM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Feb 4, 2021 at 10:28 AM Dilip Kumar  wrote:
> > > Please find an updated patch which addresses these comments.
> >
> > Thanks for the patch. I tested the new function pg_get_wal_replay_pause_state:
> >
> > postgres=# select pg_get_wal_replay_pause_state();
> >  pg_get_wal_replay_pause_state
> > ---
> >  not paused
> > postgres=# select pg_wal_replay_pause();
> >  pg_wal_replay_pause
> > -
> >
> > (1 row)
> >
> > I can also see the "pause requested" state after I put a gdb
> > breakpoint in WaitForWALToBecomeAvailable in the standby startup
> > process .
> >
> > postgres=# select pg_get_wal_replay_pause_state();
> >  pg_get_wal_replay_pause_state
> > ---
> >  pause requested
> > (1 row)
> >
> > postgres=# select pg_get_wal_replay_pause_state();
> >  pg_get_wal_replay_pause_state
> > ---
> >  paused
> > (1 row)
> >
> > Mostly, the v10 patch looks good to me, except below minor comments:
> >
> > 1) A typo in commit message - "just check" --> "just checks"
> >
> > 2) How about
> > +Returns recovery pause state. The return values are not paused
> > instead of
> > +Returns recovery pause state, the return values are not paused
> >
> > 3) I think it is 'get wal replay pause state', instead of { oid =>
> > '1137', descr => 'get wal replay is pause state',
> >
> > 4) can we just do this
> > /*
> >  * If recovery pause is requested then set it paused.  While we are in
> >  * the loop, user might resume and pause again so set this every time.
> >  */
> > if (((volatile XLogCtlData *) XLogCtl)->recoveryPauseState ==
> > RECOVERY_PAUSE_REQUESTED)
> > SetRecoveryPause(RECOVERY_PAUSED);
> > instead of
> > /*
> >  * If recovery pause is requested then set it paused.  While we are in
> >  * the loop, user might resume and pause again so set this every time.
> >  */
> > SpinLockAcquire(&XLogCtl->info_lck);
> > if (XLogCtl->recoveryPauseState == RECOVERY_PAUSE_REQUESTED)
> > XLogCtl->recoveryPauseState = RECOVERY_PAUSED;
> > SpinLockRelease(&XLogCtl->info_lck);
> >
> > I think it's okay, since we take a spinlock anyways in
> > GetRecoveryPauseState(). See the below comment and also a relevant
> > commit 6ba4ecbf477e0b25dd7bde1b0c4e07fc2da19348 on why it's not
> > necessary taking spinlock always:
> > /*
> >  * Pause WAL replay, if requested by a hot-standby session via
> >  * SetRecoveryPause().
> >  *
> >  * Note that we intentionally don't take the info_lck 
> > spinlock
> >  * here.  We might therefore read a slightly stale value of
> >  * the recoveryPause flag, but it can't be very stale (no
> >  * worse than the last spinlock we did acquire).  Since a
> >  * pause request is a pretty asynchronous thing anyway,
> >  * possibly responding to it one WAL record later than we
> >  * otherwise would is a minor issue, so it doesn't seem worth
> >  * adding another spinlock cycle to prevent that.
> >  */
>
> How can we do that?  This is not a 1-byte flag but an enum, so I don't
> think we can read its state atomically without a spinlock here.

I have fixed the other comments and the updated patch is attached.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 0d7059a8809cbd28e7a4668c6b99e0c297c564f3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 27 Jan 2021 16:46:04 +0530
Subject: [PATCH v11] Provide a new interface to get the recovery pause status

Currently, pg_is_wal_replay_paused just checks whether the recovery
pause is requested or not but it doesn't actually tell whether the
recovery is actually paused or not.  So basically there is no way for
the user to know the actual status of the pause request.  This patch
provides a new interface pg_get_wal_replay_pause_state that will
return the actual status of the recovery pause, i.e. 'not paused' if
pause is not requested, 'pause requested' if pause is requested but
recovery is not yet paused and 'paused' if recovery is actually paused.
---
 doc/src/sgml/func.sgml | 32 ++---
 src/backend/access/transam/xlog.c  | 64 +-
 src/backend/access/transam/xlogfuncs.c | 50 +++---
 src/include/access/xlog.h  | 12 +--
 src/include/catalog/pg_proc.dat|  4 +++
 5 files changed, 134 insertions(+), 28 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f30eaa3..eb131a6 100644
--- a/doc/src/sgml/func.sgml
+++ 

Re: making update/delete of inheritance trees scale better

2021-02-04 Thread Robert Haas
On Thu, Feb 4, 2021 at 4:33 AM Amit Langote  wrote:
> So would zheap refetch a tuple using the "ctid" column in the plan's
> output tuple and then use some other columns from the fetched tuple to
> actually do the update?

Yes.

> To be clear, the new refetch in ExecModifyTable() is to fill in the
> unchanged columns in the new tuple.  If we rejigger the
> table_tuple_update() API to receive a partial tuple (essentially
> what's in 'planSlot' passed to ExecUpdate) as opposed to the full
> tuple, we wouldn't need the refetch.

I don't think we should assume that every AM needs the unmodified
columns. Imagine a table AM that's a columnar store. Imagine that each
column is stored completely separately, so you have to look up the TID
once per column and then stick in the new values. Well, clearly you
want to skip this completely for columns that don't need to be
modified. If someone gives you all the columns it actually sucks,
because now you have to look them all up again just to figure out
which ones you need to change, whereas if they gave you only the
unmodified columns you could just do nothing for those and save a
bunch of work.

zheap, though, is always going to need to take another look at the
tuple to do the update, unless you can pass up some values through
hidden columns. I'm not exactly sure how expensive that really is,
though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is Recovery actually paused?

2021-02-04 Thread Dilip Kumar
On Thu, Feb 4, 2021 at 4:58 PM Bharath Rupireddy
 wrote:
>
> On Thu, Feb 4, 2021 at 10:28 AM Dilip Kumar  wrote:
> > Please find an updated patch which addresses these comments.
>
> Thanks for the patch. I tested the new function pg_get_wal_replay_pause_state:
>
> postgres=# select pg_get_wal_replay_pause_state();
>  pg_get_wal_replay_pause_state
> ---
>  not paused
> postgres=# select pg_wal_replay_pause();
>  pg_wal_replay_pause
> -
>
> (1 row)
>
> I can also see the "pause requested" state after I put a gdb
> breakpoint in WaitForWALToBecomeAvailable in the standby startup
> process .
>
> postgres=# select pg_get_wal_replay_pause_state();
>  pg_get_wal_replay_pause_state
> ---
>  pause requested
> (1 row)
>
> postgres=# select pg_get_wal_replay_pause_state();
>  pg_get_wal_replay_pause_state
> ---
>  paused
> (1 row)
>
> Mostly, the v10 patch looks good to me, except below minor comments:
>
> 1) A typo in commit message - "just check" --> "just checks"
>
> 2) How about
> +Returns recovery pause state. The return values are not paused
> instead of
> +Returns recovery pause state, the return values are not paused
>
> 3) I think it is 'get wal replay pause state', instead of { oid =>
> '1137', descr => 'get wal replay is pause state',
>
> 4) can we just do this
> /*
>  * If recovery pause is requested then set it paused.  While we are in
>  * the loop, user might resume and pause again so set this every time.
>  */
> if (((volatile XLogCtlData *) XLogCtl)->recoveryPauseState ==
> RECOVERY_PAUSE_REQUESTED)
> SetRecoveryPause(RECOVERY_PAUSED);
> instead of
> /*
>  * If recovery pause is requested then set it paused.  While we are in
>  * the loop, user might resume and pause again so set this every time.
>  */
> SpinLockAcquire(&XLogCtl->info_lck);
> if (XLogCtl->recoveryPauseState == RECOVERY_PAUSE_REQUESTED)
> XLogCtl->recoveryPauseState = RECOVERY_PAUSED;
> SpinLockRelease(&XLogCtl->info_lck);
>
> I think it's okay, since we take a spinlock anyways in
> GetRecoveryPauseState(). See the below comment and also a relevant
> commit 6ba4ecbf477e0b25dd7bde1b0c4e07fc2da19348 on why it's not
> necessary taking spinlock always:
> /*
>  * Pause WAL replay, if requested by a hot-standby session via
>  * SetRecoveryPause().
>  *
>  * Note that we intentionally don't take the info_lck spinlock
>  * here.  We might therefore read a slightly stale value of
>  * the recoveryPause flag, but it can't be very stale (no
>  * worse than the last spinlock we did acquire).  Since a
>  * pause request is a pretty asynchronous thing anyway,
>  * possibly responding to it one WAL record later than we
>  * otherwise would is a minor issue, so it doesn't seem worth
>  * adding another spinlock cycle to prevent that.
>  */

How can we do that? This is not a 1-byte flag, it is an enum, so I don't
think we can read that state atomically without a spinlock here.
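For illustration, the point can be shown with a small stand-alone sketch. This is not the actual PostgreSQL code: XLogCtlData and the state names here are mocked, and a pthread mutex stands in for the info_lck spinlock. Since an enum is typically wider than one byte, both the reader and the writer take the lock:

```c
#include <assert.h>
#include <pthread.h>

/* Mocked stand-ins for the shared recovery-pause state; illustrative only. */
typedef enum
{
	RECOVERY_NOT_PAUSED,
	RECOVERY_PAUSE_REQUESTED,
	RECOVERY_PAUSED
} RecoveryPauseState;

typedef struct
{
	pthread_mutex_t info_lck;	/* stand-in for the info_lck spinlock */
	RecoveryPauseState recoveryPauseState;
} XLogCtlData;

static XLogCtlData XLogCtlStorage = {PTHREAD_MUTEX_INITIALIZER, RECOVERY_NOT_PAUSED};
static XLogCtlData *XLogCtl = &XLogCtlStorage;

/*
 * An unlocked read of a multi-byte enum could in principle observe a torn
 * value, so the reader takes the lock just like the writer does.
 */
static RecoveryPauseState
GetRecoveryPauseState(void)
{
	RecoveryPauseState state;

	pthread_mutex_lock(&XLogCtl->info_lck);
	state = XLogCtl->recoveryPauseState;
	pthread_mutex_unlock(&XLogCtl->info_lck);
	return state;
}

static void
SetRecoveryPause(RecoveryPauseState state)
{
	pthread_mutex_lock(&XLogCtl->info_lck);
	XLogCtl->recoveryPauseState = state;
	pthread_mutex_unlock(&XLogCtl->info_lck);
}
```

A single-byte flag could arguably be read without the lock; the enum is what forces the locked read here.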


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: WIP: System Versioned Temporal Table

2021-02-04 Thread easteregg
hi,

I tested the temporal patch ( https://commitfest.postgresql.org/26/2316/ ) with
the current 14devel applied on top of ef3d461 without any conflicts.
I built with no special options passed to ./configure and noticed that the
postgresql-client-13 from the Debian repositories crashes with the \d command.

to reproduce the issue:

  CREATE TABLE test (
id int PRIMARY KEY generated ALWAYS AS IDENTITY,
name text NOT NULL,
start_timestamp timestamp with time zone GENERATED ALWAYS AS ROW START,
end_timestamp timestamp with time zone GENERATED ALWAYS AS ROW END,
PERIOD FOR SYSTEM_TIME (start_timestamp, end_timestamp)
  );

  \d test

It fails after outputting the table information with this backtrace:

  free(): invalid pointer
  [1]587783 abort (core dumped)  psql -X -U easteregg -h localhost postgres

  (gdb) bt 50
  #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  #1  0x7f21a62e0537 in __GI_abort () at abort.c:79
  #2  0x7f21a6339768 in __libc_message (action=action@entry=do_abort, 
fmt=fmt@entry=0x7f21a6447e31 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
  #3  0x7f21a6340a5a in malloc_printerr (str=str@entry=0x7f21a644605e 
"free(): invalid pointer") at malloc.c:5347
  #4  0x7f21a6341c14 in _int_free (av=, p=, 
have_lock=0) at malloc.c:4173
  #5  0x55c9fa47b602 in printTableCleanup 
(content=content@entry=0x7ffece7e41c0) at ./build/../src/fe_utils/print.c:3250
  #6  0x55c9fa444aa3 in describeOneTableDetails (schemaname=, schemaname@entry=0x55c9fbebfee6 "public", relationname=, 
oid=oid@entry=0x55c9fbebfee0 "16436", verbose=verbose@entry=false) at 
./build/../src/bin/psql/describe.c:3337
  #7  0x55c9fa4490c9 in describeTableDetails 
(pattern=pattern@entry=0x55c9fbebf540 "abk", verbose=verbose@entry=false, 
showSystem=) at ./build/../src/bin/psql/describe.c:1421
  #8  0x55c9fa4372ff in exec_command_d 
(scan_state=scan_state@entry=0x55c9fbebd130, 
active_branch=active_branch@entry=true, cmd=cmd@entry=0x55c9fbebf430 "d") at 
./build/../src/bin/psql/command.c:722
  #9  0x55c9fa43ae2b in exec_command (previous_buf=0x55c9fbebd3a0, 
query_buf=0x55c9fbebd270, cstack=0x55c9fbebd250, scan_state=0x55c9fbebd130, 
cmd=0x55c9fbebf430 "d") at ./build/../src/bin/psql/command.c:317
  #10 HandleSlashCmds (scan_state=scan_state@entry=0x55c9fbebd130, 
cstack=cstack@entry=0x55c9fbebd250, query_buf=0x55c9fbebd270, 
previous_buf=0x55c9fbebd3a0) at ./build/../src/bin/psql/command.c:220
  #11 0x55c9fa4539e0 in MainLoop (source=0x7f21a6479980 <_IO_2_1_stdin_>) 
at ./build/../src/bin/psql/mainloop.c:502
  #12 0x55c9fa433d64 in main (argc=, argv=0x7ffece7e47f8) at 
./build/../src/bin/psql/startup.c:441

the client is this version:

  apt-cache policy postgresql-client-13
  postgresql-client-13:
Installed: 13.1-1.pgdg+2+b3
Candidate: 13.1-1.pgdg+2+b3
Version table:
   *** 13.1-1.pgdg+2+b3 100
  100 http://apt.postgresql.org/pub/repos/apt sid-pgdg-testing/main 
amd64 Packages
  100 /var/lib/dpkg/status

The 14devel version from my build, or a self-compiled REL_13_STABLE client,
does not crash.
I was wondering if this might pose a security concern.


I am a bit out of my depth here, but would be glad to help if any
information is missing.
With kind regards, 
Richard




Re: Is MaxHeapAttributeNumber a reasonable restriction for foreign-tables?

2021-02-04 Thread Amit Langote
Hello,

On Thu, Feb 4, 2021 at 4:24 PM Kohei KaiGai  wrote:
> I noticed that CheckAttributeNamesTypes() prevents creating a table that has
> more than MaxHeapAttributeNumber (1600) columns, and does so for foreign tables too.
> IIUC, this magic number comes from the requirement that the length of the
> null-bitmap be covered by t_hoff in HeapTupleHeaderData.
> For heap tables, it seems to me a reasonable restriction to prevent overrun of
> the null-bitmap. On the other hand, do we have a proper reason to apply the same
> restriction to foreign tables as well?
>
> Foreign tables have their own unique internal data structures instead of
> PostgreSQL's heap format, and some foreign data can have thousands of
> attributes in their structured data.
> I think that MaxHeapAttributeNumber is a senseless restriction for foreign
> tables. What are your opinions?

My first reaction to this was a suspicion that the
MaxHeapAttributeNumber limit would be too ingrained in PostgreSQL's
architecture to consider this matter lightly, but actually browsing
the code, that may not really be the case.  Other than
src/backend/access/heap/*, here are the places that check it:

catalog/heap.c: CheckAttributeNamesTypes() that you mentioned:

/* Sanity check on column count */
if (natts < 0 || natts > MaxHeapAttributeNumber)
ereport(ERROR,
(errcode(ERRCODE_TOO_MANY_COLUMNS),
 errmsg("tables can have at most %d columns",
MaxHeapAttributeNumber)));

tablecmds.c: MergeAttributes():

/*
 * Check for and reject tables with too many columns. We perform this
 * check relatively early for two reasons: (a) we don't run the risk of
 * overflowing an AttrNumber in subsequent code (b) an O(n^2) algorithm is
 * okay if we're processing <= 1600 columns, but could take minutes to
 * execute if the user attempts to create a table with hundreds of
 * thousands of columns.
 *
 * Note that we also need to check that we do not exceed this figure after
 * including columns from inherited relations.
 */
if (list_length(schema) > MaxHeapAttributeNumber)
ereport(ERROR,
(errcode(ERRCODE_TOO_MANY_COLUMNS),
 errmsg("tables can have at most %d columns",
MaxHeapAttributeNumber)));


tablecmds.c: ATExecAddColumn():

/* Determine the new attribute's number */
newattnum = ((Form_pg_class) GETSTRUCT(reltup))->relnatts + 1;
if (newattnum > MaxHeapAttributeNumber)
ereport(ERROR,
(errcode(ERRCODE_TOO_MANY_COLUMNS),
 errmsg("tables can have at most %d columns",
MaxHeapAttributeNumber)));

So, unless I am terribly wrong, we may have a shot at revisiting the
decision that would have set this limit.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: logical replication worker accesses catalogs in error context callback

2021-02-04 Thread Amit Kapila
On Thu, Feb 4, 2021 at 5:38 PM Bharath Rupireddy
 wrote:
>
> On Thu, Feb 4, 2021 at 4:00 PM Amit Kapila  wrote:
> > > > About 0001, have we tried to reproduce the actual bug here which means
> > > > when the error_callback is called we should face some problem? I feel
> > > > with the correct testcase we should hit the Assert
> > > > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > > > there we expect the transaction to be in a valid state. I understand
> > > > that the transaction is in a broken state at that time but having a
> > > > testcase to hit the actual bug makes it easy to test the fix.
> > >
> > > I have not tried hitting the Assert(IsTransactionState()) in
> > > SearchCatCacheInternal. To do that, I need to figure out hitting
> > > "incorrect binary data format in logical replication column" error in
> > > either slot_modify_data or slot_store_data so that we will enter the
> > > error callback slot_store_error_callback and then IsTransactionState()
> > > should return false i.e. txn shouldn't be in TRANS_INPROGRESS.
> > >
> >
> > Even, if you hit that via debugger it will be sufficient or you can
> > write another elog/ereport there to achieve the same. The exact test
> > case to hit that error is not mandatory.
>
> Thanks Amit. I verified it with gdb. I attached gdb to the logical
> replication worker. In slot_store_data's for loop, I intentionally set
> CurrentTransactionState->state = TRANS_DEFAULT,
>

What happens if you don't change CurrentTransactionState->state?

-- 
With Regards,
Amit Kapila.




Re: logical replication worker accesses catalogs in error context callback

2021-02-04 Thread Bharath Rupireddy
On Thu, Feb 4, 2021 at 4:00 PM Amit Kapila  wrote:
> > > About 0001, have we tried to reproduce the actual bug here which means
> > > when the error_callback is called we should face some problem? I feel
> > > with the correct testcase we should hit the Assert
> > > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > > there we expect the transaction to be in a valid state. I understand
> > > that the transaction is in a broken state at that time but having a
> > > testcase to hit the actual bug makes it easy to test the fix.
> >
> > I have not tried hitting the Assert(IsTransactionState()) in
> > SearchCatCacheInternal. To do that, I need to figure out hitting
> > "incorrect binary data format in logical replication column" error in
> > either slot_modify_data or slot_store_data so that we will enter the
> > error callback slot_store_error_callback and then IsTransactionState()
> > should return false i.e. txn shouldn't be in TRANS_INPROGRESS.
> >
>
> Even, if you hit that via debugger it will be sufficient or you can
> write another elog/ereport there to achieve the same. The exact test
> case to hit that error is not mandatory.

Thanks Amit. I verified it with gdb. I attached gdb to the logical
replication worker. In slot_store_data's for loop, I intentionally set
CurrentTransactionState->state = TRANS_DEFAULT, and jumped to the
existing error "incorrect binary data format in logical replication
column", so that the slot_store_error_callback is called. While we are
in the error context callback:

On master: since the system catalogues are accessed in
slot_store_error_callback, the Assert(IsTransactionState()) in
SearchCatCacheInternal failed and the error we intend to see is not
logged and we see below in the subscriber server log and the session
in the subscriber gets restarted.
2021-02-04 17:26:27.517 IST [2269230] ERROR:  could not send data to
WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2021-02-04 17:26:27.518 IST [2269190] LOG:  background worker "logical
replication worker" (PID 2269230) exited with exit code 1

With patch: since we avoided system catalogue access in
slot_store_error_callback, we see the error that we intentionally
jumped to, in the subscriber server log.
2021-02-04 17:27:37.542 IST [2269424] ERROR:  incorrect binary data
format in logical replication column 1
2021-02-04 17:27:37.542 IST [2269424] CONTEXT:  processing remote data
for replication target relation "public.t1" column "a1", remote type
integer, local type integer
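As an illustrative stand-in for the pattern the fix relies on (the struct and callback names mirror the thread, but this is not the actual PostgreSQL code): everything the callback needs is captured as plain strings before any error can be raised, so the callback itself only formats text and never has to consult the catalogs while the transaction may be in an aborted state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* All strings are captured up front; the callback does no lookups. */
typedef struct SlotErrCallbackArg
{
	const char *rel_name;		/* captured before processing begins */
	const char *att_name;
	const char *remote_type;
	const char *local_type;
} SlotErrCallbackArg;

static void
slot_store_error_callback(void *arg, char *buf, size_t buflen)
{
	SlotErrCallbackArg *cbarg = (SlotErrCallbackArg *) arg;

	/* Pure formatting -- safe even in a broken transaction state. */
	snprintf(buf, buflen,
			 "processing remote data for replication target relation \"%s\" "
			 "column \"%s\", remote type %s, local type %s",
			 cbarg->rel_name, cbarg->att_name,
			 cbarg->remote_type, cbarg->local_type);
}
```

The real callback additionally hangs off the error-context stack; the essential property is the same: no syscache access from inside it.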

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Is Recovery actually paused?

2021-02-04 Thread Dilip Kumar
On Thu, Feb 4, 2021 at 4:58 PM Bharath Rupireddy
 wrote:
>
> On Thu, Feb 4, 2021 at 10:28 AM Dilip Kumar  wrote:
> > Please find an updated patch which addresses these comments.
>
> Thanks for the patch. I tested the new function pg_get_wal_replay_pause_state:
>
> postgres=# select pg_get_wal_replay_pause_state();
>  pg_get_wal_replay_pause_state
> ---
>  not paused
> postgres=# select pg_wal_replay_pause();
>  pg_wal_replay_pause
> -
>
> (1 row)
>
> I can also see the "pause requested" state after I put a gdb
> breakpoint in WaitForWALToBecomeAvailable in the standby startup
> process .
>
> postgres=# select pg_get_wal_replay_pause_state();
>  pg_get_wal_replay_pause_state
> ---
>  pause requested
> (1 row)
>
> postgres=# select pg_get_wal_replay_pause_state();
>  pg_get_wal_replay_pause_state
> ---
>  paused
> (1 row)
>
> Mostly, the v10 patch looks good to me, except below minor comments:

Thanks for the testing.

> 1) A typo in commit message - "just check" --> "just checks"
>
> 2) How about
> +Returns recovery pause state. The return values are not 
> paused
> instead of
> +Returns recovery pause state, the return values are not 
> paused
>
> 3) I think it is 'get wal replay pause state', instead of { oid =>
> '1137', descr => 'get wal replay is pause state',
>
> 4) can we just do this
> /*
>  * If recovery pause is requested then set it paused.  While we are in
>  * the loop, user might resume and pause again so set this every time.
>  */
> if (((volatile XLogCtlData *) XLogCtl)->recoveryPauseState ==
> RECOVERY_PAUSE_REQUESTED)
> SetRecoveryPause(RECOVERY_PAUSED);
> instead of
> /*
>  * If recovery pause is requested then set it paused.  While we are in
>  * the loop, user might resume and pause again so set this every time.
>  */
> SpinLockAcquire(&XLogCtl->info_lck);
> if (XLogCtl->recoveryPauseState == RECOVERY_PAUSE_REQUESTED)
> XLogCtl->recoveryPauseState = RECOVERY_PAUSED;
> SpinLockRelease(&XLogCtl->info_lck);
>
> I think it's okay, since we take a spinlock anyways in
> GetRecoveryPauseState(). See the below comment and also a relevant
> commit 6ba4ecbf477e0b25dd7bde1b0c4e07fc2da19348 on why it's not
> necessary taking spinlock always:
> /*
>  * Pause WAL replay, if requested by a hot-standby session via
>  * SetRecoveryPause().
>  *
>  * Note that we intentionally don't take the info_lck spinlock
>  * here.  We might therefore read a slightly stale value of
>  * the recoveryPause flag, but it can't be very stale (no
>  * worse than the last spinlock we did acquire).  Since a
>  * pause request is a pretty asynchronous thing anyway,
>  * possibly responding to it one WAL record later than we
>  * otherwise would is a minor issue, so it doesn't seem worth
>  * adding another spinlock cycle to prevent that.
>  */
>

I will work on these comments and send the updated patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Greg Nancarrow
On Thu, Feb 4, 2021 at 11:56 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> >
> > So, the results indicate that after the patch we touch more buffers
> > during planning which I think is because of accessing the partition
> > information, and during execution, the patch touches fewer buffers for
> > the same reason. But why this can reduce the time with patch? I think
> > this needs some investigation.
>
> I guess another factor other than shared buffers is relcache and catcache.  
> The patched version loads those cached entries for all partitions of the 
> insert target table during the parallel-safety check in planning, while the 
> unpatched version has to gradually build those cache entries during 
> execution.  How can we confirm its effect?
>

I believe that we can confirm its effect by invalidating relcache and
catcache, in both the patched and unpatched versions, just after the
parallel-safety checks are performed in the planner, and then running
tests and comparing the performance.

So that's exactly what I did (adding a call to
InvalidateSystemCaches() just after the parallel-safety checks in the
planner).
I found that then the unpatched version always performed better than
the patched version for tests inserting 1000 records into a table with
100,200,500 and 1000 partitions.
Looking at the breakdown of the timing for each Insert, the Planning
Time was always significantly more for the patched version (expected,
because it does extra checks), but the Execution Time was very similar
for both the patched and unpatched versions.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: DROP TABLE can crash the replication sync worker

2021-02-04 Thread Amit Kapila
On Thu, Feb 4, 2021 at 5:31 AM Peter Smith  wrote:
>
> On Wed, Feb 3, 2021 at 11:49 PM Amit Kapila  wrote:
> >
> > On Wed, Feb 3, 2021 at 2:53 PM Peter Smith  wrote:
> > >
> > > Hi Hackers.
> > >
> > > As discovered in another thread [master] there is an *existing* bug in
> > > the PG HEAD code which can happen if a DROP TABLE is done at same time
> > > a replication tablesync worker is running.
> > >
> > > It seems the table's relid that the sync worker is using gets ripped
> > > out from underneath it and is invalidated by the DROP TABLE. Any
> > > subsequent use of that relid will go wrong.
> > >
> >
> > Where exactly did you pause the tablesync worker while dropping the
> > table? We acquire the lock on the table in LogicalRepSyncTableStart
> > and then keep it for the entire duration of the tablesync worker, so DROP
> > TABLE shouldn't be allowed.
> >
>
> I have a breakpoint set on LogicalRepSyncTableStart. The DROP TABLE is
> done while paused on that breakpoint, so no code of
> LogicalRepSyncTableStart has even executed yet.
>

Fair enough. So, you are hitting this problem in finish_sync_worker()
while logging the message because by that time the relation is
dropped. I think it is good to fix that but we don't want the patch
you have attached here, we can fix it locally in finish_sync_worker()
by constructing a different message (something like: "logical
replication table synchronization worker for subscription \"%s\" has
finished") when we can't get rel_name from rel id. This doesn't appear
to be as serious a problem as we were talking about in the patch
"Allow multiple xacts during table sync in logical replication." [1]
because there we don't hold the lock on the table for the entire
duration of tablesync. So, even if we want to fix this problem, it would
be more appropriate for back-branches if we push the patch [1].

[1] - 
https://www.postgresql.org/message-id/CAHut%2BPtAKP1FoHbUEWN%2Ba%3D8Pg_njsJKc9Zoz05A_ewJSvjX2MQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Is Recovery actually paused?

2021-02-04 Thread Bharath Rupireddy
On Thu, Feb 4, 2021 at 10:28 AM Dilip Kumar  wrote:
> Please find an updated patch which addresses these comments.

Thanks for the patch. I tested the new function pg_get_wal_replay_pause_state:

postgres=# select pg_get_wal_replay_pause_state();
 pg_get_wal_replay_pause_state
---
 not paused
postgres=# select pg_wal_replay_pause();
 pg_wal_replay_pause
-

(1 row)

I can also see the "pause requested" state after I put a gdb
breakpoint in WaitForWALToBecomeAvailable in the standby startup
process .

postgres=# select pg_get_wal_replay_pause_state();
 pg_get_wal_replay_pause_state
---
 pause requested
(1 row)

postgres=# select pg_get_wal_replay_pause_state();
 pg_get_wal_replay_pause_state
---
 paused
(1 row)

Mostly, the v10 patch looks good to me, except below minor comments:

1) A typo in commit message - "just check" --> "just checks"

2) How about
+Returns recovery pause state. The return values are not paused
instead of
+Returns recovery pause state, the return values are not paused

3) I think it is 'get wal replay pause state', instead of { oid =>
'1137', descr => 'get wal replay is pause state',

4) can we just do this
/*
 * If recovery pause is requested then set it paused.  While we are in
 * the loop, user might resume and pause again so set this every time.
 */
if (((volatile XLogCtlData *) XLogCtl)->recoveryPauseState ==
RECOVERY_PAUSE_REQUESTED)
SetRecoveryPause(RECOVERY_PAUSED);
instead of
/*
 * If recovery pause is requested then set it paused.  While we are in
 * the loop, user might resume and pause again so set this every time.
 */
SpinLockAcquire(&XLogCtl->info_lck);
if (XLogCtl->recoveryPauseState == RECOVERY_PAUSE_REQUESTED)
    XLogCtl->recoveryPauseState = RECOVERY_PAUSED;
SpinLockRelease(&XLogCtl->info_lck);

I think it's okay, since we take a spinlock anyways in
GetRecoveryPauseState(). See the below comment and also a relevant
commit 6ba4ecbf477e0b25dd7bde1b0c4e07fc2da19348 on why it's not
necessary taking spinlock always:
/*
 * Pause WAL replay, if requested by a hot-standby session via
 * SetRecoveryPause().
 *
 * Note that we intentionally don't take the info_lck spinlock
 * here.  We might therefore read a slightly stale value of
 * the recoveryPause flag, but it can't be very stale (no
 * worse than the last spinlock we did acquire).  Since a
 * pause request is a pretty asynchronous thing anyway,
 * possibly responding to it one WAL record later than we
 * otherwise would is a minor issue, so it doesn't seem worth
 * adding another spinlock cycle to prevent that.
 */

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Tid scan improvements

2021-02-04 Thread David Rowley
On Thu, 4 Feb 2021 at 10:31, David Rowley  wrote:
>
> Thanks for looking at this.
>
> On Thu, 4 Feb 2021 at 10:19, Andres Freund  wrote:
> > Perhaps something like
> >
> > typedef struct TableScanTidRange TableScanTidRange;
> >
> > TableScanTidRange* table_scan_tid_range_start(TableScanDesc sscan, 
> > ItemPointer mintid, ItemPointer maxtid);
> > bool table_scan_tid_range_nextslot(TableScanDesc sscan, TableScanTidRange 
> > *range, TupleTableSlot *slot);
> > void table_scan_tid_range_end(TableScanDesc sscan, TableScanTidRange* 
> > range);
> >
> > would work better? That'd allow an AM to have arbitrarily large state
> > for a tid range scan, would make it clear what the lifetime of the
> > ItemPointer mintid, ItemPointer maxtid are etc.  Wouldn't, on the API
> > level, prevent multiple tid range scans from being in progress at the
> > same times though :(. Perhaps we could add a TableScanTidRange* pointer
> > to TableScanDesc which'd be checked/set by tableam.h which'd prevent that?
>
> Maybe the TableScanTidRange can just have a field to store the
> TableScanDesc. That way table_scan_tid_range_nextslot and
> table_scan_tid_range_end can just pass the TableScanTidRange pointer.
>
> That way it seems like it would be ok for multiple scans to be going
> on concurrently as nobody should be reusing the TableScanDesc.

I ended up adding just two new API functions to table AM.

void (*scan_set_tid_range) (TableScanDesc sscan,
   ItemPointer mintid,
   ItemPointer maxtid);

and
bool (*scan_tid_range_nextslot) (TableScanDesc sscan,
ScanDirection direction,
TupleTableSlot *slot);

I added an additional function in tableam.h that does not have a
corresponding API function:

static inline TableScanDesc
table_tid_range_start(Relation rel, Snapshot snapshot,
  ItemPointer mintid,
  ItemPointer maxtid)

This just calls the standard scan_begin and then calls scan_set_tid_range
with the specified mintid and maxtid.

I also added 2 new fields to TableScanDesc:

ItemPointerData rs_mintid;
ItemPointerData rs_maxtid;

I didn't quite see a need to have a new start and end scan API function.
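As a rough illustration of the contract the two callbacks implement, here is a toy model (not the patch itself): plain integers stand in for (block, offset) ItemPointers, and a bare struct stands in for TableScanDesc. scan_set_tid_range() records the bounds in the scan descriptor, and scan_tid_range_nextslot() then returns only tuples whose "TID" falls inside [mintid, maxtid].

```c
#include <assert.h>
#include <stdbool.h>

/* Toy scan descriptor; integers stand in for ItemPointerData. */
typedef struct TableScanDescData
{
	int rs_mintid;
	int rs_maxtid;
	int rs_next;				/* current scan position */
	int rs_ntuples;				/* tuples in the "table" */
} TableScanDescData;

static void
scan_set_tid_range(TableScanDescData *scan, int mintid, int maxtid)
{
	scan->rs_mintid = mintid;
	scan->rs_maxtid = maxtid;
	scan->rs_next = mintid;		/* start directly at the lower bound */
}

/* Returns false once the range (or the table) is exhausted. */
static bool
scan_tid_range_nextslot(TableScanDescData *scan, int *tid_out)
{
	if (scan->rs_next > scan->rs_maxtid || scan->rs_next >= scan->rs_ntuples)
		return false;
	*tid_out = scan->rs_next++;
	return true;
}
```

The efficiency win over a qual on ctid is visible even in the toy: the scan starts at the lower bound instead of filtering every tuple.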

Updated patch attached.

David
From 36c34e3089d2ad224e7606df14209d8cc199aa67 Mon Sep 17 00:00:00 2001
From: "dgrow...@gmail.com" 
Date: Thu, 21 Jan 2021 16:48:15 +1300
Subject: [PATCH v13] Add TID Range Scans to support efficient scanning ranges
 of TIDs

This adds a new node type named TID Range Scan.  The query planner will
generate paths for TID Range scans when quals are discovered on base
relations which search for ranges of ctid.  These ranges may be open at
either end.

To support this, a new optional callback function has been added to table
AM which is named scan_getnextslot_inrange.  This function accepts a
minimum and maximum ItemPointer to allow efficient retrieval of tuples
within this range.  Table AMs where scanning ranges of TIDs does not make
sense or is difficult to implement efficiently may choose to not implement
this function.

Author: Edmund Horner and David Rowley
Discussion: 
https://postgr.es/m/CAMyN-kB-nFTkF=va_jpwfno08s0d-yk0f741s2b7ldmyai8...@mail.gmail.com
---
 src/backend/access/heap/heapam.c   | 147 
 src/backend/access/heap/heapam_handler.c   |   3 +
 src/backend/commands/explain.c |  23 ++
 src/backend/executor/Makefile  |   1 +
 src/backend/executor/execAmi.c |   6 +
 src/backend/executor/execProcnode.c|  10 +
 src/backend/executor/nodeTidrangescan.c| 413 +
 src/backend/nodes/copyfuncs.c  |  24 ++
 src/backend/nodes/outfuncs.c   |  13 +
 src/backend/optimizer/README   |   1 +
 src/backend/optimizer/path/costsize.c  |  95 +
 src/backend/optimizer/path/tidpath.c   | 117 +-
 src/backend/optimizer/plan/createplan.c|  98 +
 src/backend/optimizer/plan/setrefs.c   |  16 +
 src/backend/optimizer/plan/subselect.c |   6 +
 src/backend/optimizer/util/pathnode.c  |  29 ++
 src/backend/optimizer/util/plancat.c   |   6 +
 src/backend/optimizer/util/relnode.c   |   3 +
 src/backend/storage/page/itemptr.c |  58 +++
 src/include/access/heapam.h|   6 +-
 src/include/access/relscan.h   |   4 +
 src/include/access/tableam.h   |  84 -
 src/include/catalog/pg_operator.dat|   6 +-
 src/include/executor/nodeTidrangescan.h|  23 ++
 src/include/nodes/execnodes.h  |  18 +
 src/include/nodes/nodes.h  |   3 +
 src/include/nodes/pathnodes.h  |  18 +
 src/include/nodes/plannodes.h  |  13 +
 src/include/optimizer/cost.h   |   3 +
 src/include/optimizer/pathnode.h   |   4 +
 src/include/storage/itemptr.h  |   2 +
 src/test/regress/expected/tidrangescan.out | 302 +++
 src/test/regress/parallel_schedule |   2 +-
 src/test/regress/serial_schedule   |   1 +
 src/test/regress/sql/tidrangescan.sql  | 104 

Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-04 Thread Amit Kapila
On Thu, Feb 4, 2021 at 6:26 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > On Mon, Jan 18, 2021 at 2:40 PM Tang, Haiying
> >  wrote:
> > > Execute EXPLAIN on Patched:
> > > postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part
> > select * from test_data1;
> > >QUERY PLAN
> > >
> > ---
> > -
> > >  Insert on public.test_part  (cost=0.00..15.00 rows=0 width=0) (actual
> > time=44.139..44.140 rows=0 loops=1)
> > >Buffers: shared hit=1005 read=1000 dirtied=3000 written=2000
> > >->  Seq Scan on public.test_data1  (cost=0.00..15.00 rows=1000
> > width=8) (actual time=0.007..0.201 rows=1000 loops=1)
> > >  Output: test_data1.a, test_data1.b
> > >  Buffers: shared hit=5
> > >  Planning:
> > >Buffers: shared hit=27011
> > >  Planning Time: 24.526 ms
> > >  Execution Time: 44.981 ms
> > >
> > > Execute EXPLAIN on non-Patched:
> > > postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part
> > select * from test_data1;
> > >QUERY PLAN
> > >
> > ---
> > -
> > >  Insert on public.test_part  (cost=0.00..15.00 rows=0 width=0) (actual
> > time=72.656..72.657 rows=0 loops=1)
> > >Buffers: shared hit=22075 read=1000 dirtied=3000 written=2000
> > >->  Seq Scan on public.test_data1  (cost=0.00..15.00 rows=1000
> > width=8) (actual time=0.010..0.175 rows=1000 loops=1)
> > >  Output: test_data1.a, test_data1.b
> > >  Buffers: shared hit=5
> > >  Planning:
> > >Buffers: shared hit=72
> > >  Planning Time: 0.135 ms
> > >  Execution Time: 79.058 ms
> > >
> >
> > So, the results indicate that after the patch we touch more buffers
> > during planning which I think is because of accessing the partition
> > information, and during execution, the patch touches fewer buffers for
> > the same reason. But why this can reduce the time with patch? I think
> > this needs some investigation.
>
> I guess another factor other than shared buffers is relcache and catcache.  
> The patched version loads those cached entries for all partitions of the 
> insert target table during the parallel-safety check in planning, while the 
> unpatched version has to gradually build those cache entries during execution.
>

Right.

>  How can we confirm its effect?
>

I am not sure, but if your theory is correct, then shouldn't both have the
same performance in consecutive runs?


-- 
With Regards,
Amit Kapila.




Re: logical replication worker accesses catalogs in error context callback

2021-02-04 Thread Amit Kapila
On Wed, Feb 3, 2021 at 4:31 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 28, 2021 at 11:14 AM Amit Kapila  wrote:
> >
> > On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy
> >  wrote:
> > >
> > > On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu  wrote:
> > >
> > > Thanks for pointing to the changes in the commit message. I corrected
> > > them. Attaching v4 patch set, consider it for further review.
> > >
> >
> > About 0001, have we tried to reproduce the actual bug here which means
> > when the error_callback is called we should face some problem? I feel
> > with the correct testcase we should hit the Assert
> > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > there we expect the transaction to be in a valid state. I understand
> > that the transaction is in a broken state at that time but having a
> > testcase to hit the actual bug makes it easy to test the fix.
>
> I have not tried hitting the Assert(IsTransactionState()) in
> SearchCatCacheInternal. To do that, I need to figure out hitting
> "incorrect binary data format in logical replication column" error in
> either slot_modify_data or slot_store_data so that we will enter the
> error callback slot_store_error_callback and then IsTransactionState()
> should return false i.e. txn shouldn't be in TRANS_INPROGRESS.
>

Even, if you hit that via debugger it will be sufficient or you can
write another elog/ereport there to achieve the same. The exact test
case to hit that error is not mandatory.

-- 
With Regards,
Amit Kapila.




Re: Asynchronous Append on postgres_fdw nodes.

2021-02-04 Thread Etsuro Fujita
On Mon, Feb 1, 2021 at 12:06 PM Etsuro Fujita  wrote:
> On Tue, Nov 17, 2020 at 6:56 PM Etsuro Fujita  wrote:
> > * I haven't yet done anything about the issue on postgres_fdw's
> > handling of concurrent data fetches by multiple ForeignScan nodes
> > (below *different* Append nodes in the query) using the same
> > connection discussed in [2].  I modified the patch to just disable
> > applying this feature to problematic test cases in the postgres_fdw
> > regression tests, by a new GUC enable_async_append.
>
> A solution for the issue would be a scheduler designed to handle such
> data fetches more efficiently, but I don’t think it’s easy to create
> such a scheduler.  Rather than doing so, I'd like to propose allowing
> FDWs to disable async execution themselves in problematic cases during
> executor startup, as a first cut.  What I have in
> mind for that is:
>
> 1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
> to record, for each of the async-capable and non-async-capable
> ForeignScan(s), the information on a connection to be used for the
> ForeignScan into EState during BeginForeignScan().
>
> 2) After doing ExecProcNode() to each SubPlan and the main query tree
> in InitPlan(), we give the FDW a chance to a) reconsider, for each of
> the async-capable ForeignScan(s), whether the ForeignScan can be
> executed asynchronously as planned, based on the information stored
> into EState in #1, and then b) disable async execution of the
> ForeignScan if not.

s/ExecProcNode()/ExecInitNode()/.  Sorry for that.  I’ll post an
updated patch for this in a few days.

Best regards,
Etsuro Fujita




Re: pg_replication_origin_drop API potential race condition

2021-02-04 Thread Amit Kapila
On Thu, Feb 4, 2021 at 1:31 PM Peter Smith  wrote:
>
> PSA a patch which I think implements what we are talking about.
>

This doesn't seem correct to me. Have you tested that the patch
resolves the problem reported originally? Because the lockmode
(RowExclusiveLock) you have used in the patch will allow multiple
callers to acquire it at the same time. The other thing I don't like
about this is that it first acquires the lock in the function
replorigin_drop_by_name and then acquires the same lock again, in a
different mode, in replorigin_drop.

What I was imagining was code the same as replorigin_drop but with the
first parameter as the name instead of the id; additionally, it will
check the existence of the origin via replorigin_by_name after
acquiring the lock. So you can move all the common code from
replorigin_drop (starting from the restart label through to the end,
leaving out table_close) to a separate function, say
replorigin_drop_guts, and then call it from both replorigin_drop and
replorigin_drop_by_name.

Now, I have also thought about directly changing replorigin_drop, but
it is an exposed API, so let's keep it as it is because some
extensions might be using it. We can always drop it later if required.

-- 
With Regards,
Amit Kapila.




Re: Next Commitfest Manager.

2021-02-04 Thread Ibrar Ahmed
On Thu, Feb 4, 2021 at 2:00 AM David Steele  wrote:

> On 2/3/21 3:13 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Ibrar Ahmed (ibrar.ah...@gmail.com) wrote:
> >> Has anyone else already volunteered for that? It is my first time, so I
> >> will need some access, if all agree.
> >
> > Thanks for volunteering!
> >
> > That said, our last commitfest tends to be the most difficult as it's
> > the last opportunity for features to land in time for the next major
> > release and, for my part at least, I think it'd be best to have
> > someone who has experience running a CF previously manage it.
> >
> > To that end, I've talked to David Steele, who has run this last CF for
> > the past few years and we're in agreement that he's willing to run this
> > CF again this year, assuming there's no objections.  What we've thought
> > to suggest is that you follow along with David as he runs this CF and
> > then offer to run the July CF.  Of course, we would encourage you and
> > David to communicate and for you to ask David any questions you have
> > about how he handles things as part of the CF.  This is in line with how
> > other CF managers have started out also.
> >
> > Open to your thoughts, as well as those of anyone else who wishes to
> > comment.
>
> +1. This all sounds good to me!
>
> --
> -David
> da...@pgmasters.net

Sounds good, I am happy to work with David.



-- 
Ibrar Ahmed


Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-04 Thread Ronan Dunklau
Hello !

We encountered the following bug recently in production: when running REINDEX 
CONCURRENTLY on an index, the attstattarget is reset to 0.

Consider the following example: 

junk=# \d+ t1_date_trunc_idx
                                     Index "public.t1_date_trunc_idx"
   Column   |            Type             | Key? |         Definition          | Storage | Stats target
------------+-----------------------------+------+-----------------------------+---------+--------------
 date_trunc | timestamp without time zone | yes  | date_trunc('day'::text, ts) | plain   | 1000
btree, for table "public.t1"

junk=# REINDEX INDEX t1_date_trunc_idx;
REINDEX
junk=# \d+ t1_date_trunc_idx
                                     Index "public.t1_date_trunc_idx"
   Column   |            Type             | Key? |         Definition          | Storage | Stats target
------------+-----------------------------+------+-----------------------------+---------+--------------
 date_trunc | timestamp without time zone | yes  | date_trunc('day'::text, ts) | plain   | 1000
btree, for table "public.t1"

junk=# REINDEX INDEX CONCURRENTLY t1_date_trunc_idx;
REINDEX
junk=# \d+ t1_date_trunc_idx
                                     Index "public.t1_date_trunc_idx"
   Column   |            Type             | Key? |         Definition          | Storage | Stats target
------------+-----------------------------+------+-----------------------------+---------+--------------
 date_trunc | timestamp without time zone | yes  | date_trunc('day'::text, ts) | plain   | 
btree, for table "public.t1"


I'm attaching a patch that possibly solves the problem, but maybe the
proposed changes are too intrusive?

Regards,

-- 
Ronan Dunklau

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a70fe4d2c..b4adb32af0 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -107,7 +107,8 @@ static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
 		  List *indexColNames,
 		  Oid accessMethodObjectId,
 		  Oid *collationObjectId,
-		  Oid *classObjectId);
+		  Oid *classObjectId,
+		  int32 *attstattargets);
 static void InitializeAttributeOids(Relation indexRelation,
 	int numatts, Oid indexoid);
 static void AppendAttributeTuples(Relation indexRelation, Datum *attopts);
@@ -272,7 +273,8 @@ ConstructTupleDescriptor(Relation heapRelation,
 		 List *indexColNames,
 		 Oid accessMethodObjectId,
 		 Oid *collationObjectId,
-		 Oid *classObjectId)
+		 Oid *classObjectId,
+		 int32 *attstattargets)
 {
 	int			numatts = indexInfo->ii_NumIndexAttrs;
 	int			numkeyatts = indexInfo->ii_NumIndexKeyAttrs;
@@ -310,12 +312,14 @@ ConstructTupleDescriptor(Relation heapRelation,
 
 		MemSet(to, 0, ATTRIBUTE_FIXED_PART_SIZE);
 		to->attnum = i + 1;
-		to->attstattarget = -1;
 		to->attcacheoff = -1;
 		to->attislocal = true;
 		to->attcollation = (i < numkeyatts) ?
 			collationObjectId[i] : InvalidOid;
-
+		if(attstattargets != NULL)
+			to->attstattarget = attstattargets[i];
+		else
+			to->attstattarget = -1;
 		/*
 		 * Set the attribute name as specified by caller.
 		 */
@@ -697,6 +701,7 @@ index_create(Relation heapRelation,
 			 Oid *collationObjectId,
 			 Oid *classObjectId,
 			 int16 *coloptions,
+			 int32 *colstattargets,
 			 Datum reloptions,
 			 bits16 flags,
 			 bits16 constr_flags,
@@ -882,7 +887,8 @@ index_create(Relation heapRelation,
 			indexColNames,
 			accessMethodObjectId,
 			collationObjectId,
-			classObjectId);
+			classObjectId,
+			colstattargets);
 
 	/*
 	 * Allocate an OID for the index, unless we were told what to use.
@@ -1413,11 +1419,13 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 optionDatum;
 	oidvector  *indclass;
 	int2vector *indcoloptions;
+	int32  *colstattargets;
 	bool		isnull;
 	List	   *indexColNames = NIL;
 	List	   *indexExprs = NIL;
 	List	   *indexPreds = NIL;
 
+
 	indexRelation = index_open(oldIndexId, RowExclusiveLock);
 
 	/* The new index needs some information from the old index */
@@ -1501,10 +1509,11 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 			true);
 
 	/*
-	 * Extract the list of column names and the column numbers for the new
+	 * Extract the list of column names, column numbers and stattargets for the new
 	 * 

Re: logical replication worker accesses catalogs in error context callback

2021-02-04 Thread Bharath Rupireddy
On Thu, Feb 4, 2021 at 5:16 AM Andres Freund  wrote:
> > > > About 0001, have we tried to reproduce the actual bug here which means
> > > > when the error_callback is called we should face some problem? I feel
> > > > with the correct testcase we should hit the Assert
> > > > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > > > there we expect the transaction to be in a valid state. I understand
> > > > that the transaction is in a broken state at that time but having a
> > > > testcase to hit the actual bug makes it easy to test the fix.
> > >
> > > I think it's easy to hit - add a trivial DEBUG message to XLogWrite(),
> > > and run anything involving logical rep using a low enough log
> > > level. There might even be enough messages with a low enough level
> > > already somewhere in the relevant paths.
> >
> > I'm not sure how adding a DEBUG message to XLogWrite() would ensure
> > the logical replication worker on the subscriber system enters
> > slot_store_error_callback and hits the Assert(IsTransactionState());?
> > Could you please elaborate? Or I may be missing something here to
> > understand.
>
> XLogWrite() is in a critical section, the DEBUG message causes the
> error context callbacks to be called, the callbacks allocate memory,
> which triggers an assertion.

I see that XLogWrite() can be called from the logical replication
worker (apply_handle_commit -> apply_handle_commit_internal ->
CommitTransactionCommand -> CommitTransaction ->
RecordTransactionCommit -> XLogFlush -> XLogWrite); by that point the
txn state is no longer TRANS_INPROGRESS, so IsTransactionState() can
return false. Adding a DEBUG message there can invoke the error
context callback.

But the error context callback, slot_store_error_callback, is used in
slot_store_data and slot_modify_data. In both places we just register
the callback before the attribute-processing for loop and deregister
it afterwards. So, while we are in those for loops, we expect to be in
TRANS_INPROGRESS; see ensure_transaction before slot_store_data, and
the same is true for slot_modify_data. So, to really hit the assertion
failure in the catalog access from slot_store_error_callback, we would
first need to not be in TRANS_INPROGRESS inside those for loops while
also emitting a DEBUG message. I'm not exactly sure how we can achieve
that.

Am I missing something?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Single transaction in the tablesync worker?

2021-02-04 Thread Amit Kapila
On Thu, Feb 4, 2021 at 9:55 AM Ajin Cherian  wrote:
>
> On Wed, Feb 3, 2021 at 11:38 PM Amit Kapila  wrote:
>
> > Thanks for the report. The problem here was that the error occurred
> > when we were trying to copy the large data. Now, before fetching the
> > entire data we issued a rollback that led to this problem. I think the
> > alternative here could be to first fetch the entire data when the
> > error occurred and then issue the following commands. Instead, I have
> > modified the patch to perform 'drop_replication_slot' in the beginning
> > if the relstate is datasync.  Do let me know if you can think of a
> > better way to fix this?
>
> I have verified that the problem is not seen after this patch. I also
> agree with the approach taken for the fix,
>

Thanks. I have fixed one of the issues reported by me earlier [1]
wherein the tablesync worker can repeatedly fail if after dropping the
slot there is an error while updating the SYNCDONE state in the
database. I have moved the drop of the slot just before commit of the
transaction where we are marking the state as SYNCDONE. Additionally,
I have removed unnecessary includes in tablesync.c, updated the docs
for Alter Subscription, and updated the comments at various places in
the patch. I have also updated the commit message this time.

I am still not very happy with the way we handle concurrent drop
origins but probably that would be addressed by the other patch Peter
is working on [2].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JdWv84nMyCpTboBURjj70y3BfO1xdy8SYPRqNxtH7TEA%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAHut%2BPsW6%2B7Ucb1sxjSNBaSYPGAVzQFbAEg4y1KsYQiGCnyGeQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.


v27-0001-Allow-multiple-xacts-during-table-sync-in-logica.patch
Description: Binary data


Re: making update/delete of inheritance trees scale better

2021-02-04 Thread Amit Langote
On Wed, Jan 27, 2021 at 4:42 AM Robert Haas  wrote:
> On Fri, Oct 30, 2020 at 6:26 PM Heikki Linnakangas  wrote:
> > Yeah, you need to access the old tuple to update its t_ctid, but
> > accessing it twice is still more expensive than accessing it once. Maybe
> > you could optimize it somewhat by keeping the buffer pinned or
> > something. Or push the responsibility down to the table AM, passing the
> > AM only the modified columns, and let the AM figure out how to deal with
> > the columns that were not modified, hoping that it can do something smart.
>
> Just as a point of possible interest, back when I was working on
> zheap, I sort of wanted to take this in the opposite direction. In
> effect, a zheap tuple has system columns that don't exist for a heap
> tuple, and you can't do an update or delete without knowing what the
> values for those columns are, so zheap had to just refetch the tuple,
> but that sucked in comparisons with the existing heap, which didn't
> have to do the refetch.

So would zheap refetch a tuple using the "ctid" column in the plan's
output tuple and then use some other columns from the fetched tuple to
actually do the update?

> At the time, I thought maybe the right idea
> would be to extend things so that a table AM could specify an
> arbitrary set of system columns that needed to be bubbled up to the
> point where the update or delete happens, but that seemed really
> complicated to implement and I never tried.

Currently, FDWs can specify tuple-identifying system columns, which
are added to the query's targetlist when rewriteTargetListUD() calls
the AddForeignUpdateTargets() API.

In rewriteTargetListUD(), one can see that the planner assumes that
all local tables, irrespective of their AM, use a "ctid" column to
identify their tuples:

if (target_relation->rd_rel->relkind == RELKIND_RELATION ||
target_relation->rd_rel->relkind == RELKIND_MATVIEW ||
target_relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
/*
 * Emit CTID so that executor can find the row to update or delete.
 */
var = makeVar(parsetree->resultRelation,
  SelfItemPointerAttributeNumber,
  TIDOID,
  -1,
  InvalidOid,
  0);

attrname = "ctid";
}
else if (target_relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
{
/*
 * Let the foreign table's FDW add whatever junk TLEs it wants.
 */
FdwRoutine *fdwroutine;

fdwroutine = GetFdwRoutineForRelation(target_relation, false);

if (fdwroutine->AddForeignUpdateTargets != NULL)
fdwroutine->AddForeignUpdateTargets(parsetree, target_rte,
target_relation);

Maybe the first block could likewise ask the table AM if it prefers to
add a custom set of system columns or just add "ctid" otherwise?

> Here it seems like we're
> thinking of going the other way, and just always doing the refetch.

To be clear, the new refetch in ExecModifyTable() is to fill in the
unchanged columns in the new tuple.  If we rejigger the
table_tuple_update() API to receive a partial tuple (essentially
what's in 'planSlot' passed to ExecUpdate) as opposed to the full
tuple, we wouldn't need the refetch.

We'd need to teach a few other executor routines, such as
ExecWithCheckOptions(), ExecConstraints(), etc. to live with a partial
tuple, but maybe that's doable with some effort.  We could even
optimize away evaluating the constraints if none of the constrained
columns have changed.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




parse mistake in ecpg connect string

2021-02-04 Thread Wang, Shenhao
Hi, hacker

I found that in the function ECPGconnect, the connect-string format is
documented in a comment as:

/*--
 * new style:
 *  <tcp|unix>:postgresql://server[:port|:/unixsocket/path:]
 *  [/db-name][?options]
 *--
*/

But the parse logic seems wrong:

tmp = strrchr(dbname + offset, ':');
tmp2 = strchr(tmp + 1, ':')

the value of tmp2 will always be NULL, so the unix-socket path is ignored.

I have fixed this problem; the patch is attached.
However, since this usage is not documented in the manual [1] (maybe
that is why the problem went unnoticed for so long), how about
deleting this code path directly instead?
Thoughts?

This patch only fixes the problem when a character variable is used to
store the connect string, like:

EXEC SQL BEGIN DECLARE SECTION;
char constr[] = 
"unix:postgresql://localhost:/tmp/a:?port=5435=postgres";
EXEC SQL END DECLARE SECTION;

If I write the source like:
EXEC SQL CONNECT TO unix:postgresql://localhost:/tmp/a:/postgres?port=5435
EXEC SQL CONNECT TO unix:postgresql://localhost/postgres?host=/tmp/a=5435
then ecpg reports an error when parsing the .pgc file.

I will try to fix this problem later, but it seems a little difficult
to add the necessary lex/bison rules.

[1] https://www.postgresql.org/docs/13/ecpg-connect.html#ECPG-CONNECTING

Best regards
Shenhao Wang




0001-fix-strchr-strrchr-mistake.patch
Description: 0001-fix-strchr-strrchr-mistake.patch


Re: pg_replication_origin_drop API potential race condition

2021-02-04 Thread Peter Smith
On Thu, Feb 4, 2021 at 4:43 PM Amit Kapila  wrote:
>
> On Thu, Feb 4, 2021 at 9:57 AM Peter Smith  wrote:
> >
> > On Wed, Feb 3, 2021 at 11:17 PM Amit Kapila  wrote:
> >
> > >
> > > How about if we call replorigin_by_name() inside replorigin_drop after
> > > acquiring the lock? Wouldn't that close this race condition? We are
> > > doing something similar for pg_replication_origin_advance().
> > >
> >
> > Yes, that seems ok.
> >
> > I wonder if it is better to isolate that locked portion
> > (replorigin_by_name + replorigin_drop) so that in addition to being
> > called from pg_replication_origin_drop, we can call it internally from
> > PG code to safely drop the origins.
> >
>
> Yeah, I think that would be really good.

PSA a patch which I think implements what we are talking about.


Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-replorigin_drop_by_name.patch
Description: Binary data