Re: Fix unmatched file identifications

2022-08-08 Thread Masahiko Sawada
On Tue, Aug 9, 2022 at 11:24 AM John Naylor
 wrote:
>
> On Tue, Aug 9, 2022 at 8:58 AM Masahiko Sawada  wrote:
> > I found that there are two .c and .h files whose identification in the
> > header comment doesn't match its actual path.
>
> > The attached small patch fixes them.
>
> Pushed, thanks!

Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Masahiko Sawada
On Tue, Aug 9, 2022 at 7:33 AM Nathan Bossart  wrote:
>
> On Mon, Aug 08, 2022 at 01:46:48PM +0700, John Naylor wrote:
> > Okay, I think it's basically in good shape. Since it should be a bit
> > faster than a couple versions ago, would you be up for retesting with
> > the original test having 8 to 512 writers?
>
> Sure thing.  The results are similar.  As before, the improvements are most
> visible when the arrays are large.
>
> writers  head  patch
> 8672   680
> 16   639   664
> 32   701   689
> 64   705   703
> 128  628   653
> 256  576   627
> 512  530   584
> 768  450   536
> 1024 350   494
>
> > And also add the const
> > markers we discussed upthread?
>
> Oops, sorry about that.  This is done in v9.
>
> > Aside from that, I plan to commit this
> > week unless there is further bikeshedding.
>
> Great, thanks.

The patch looks good to me. One minor point is:

+ * IDENTIFICATION
+ *   src/port/pg_lfind.h

The path doesn't match the actual file path, src/include/port/pg_lfind.h.
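For context, a minimal caller-side sketch of how the new helper is meant to
be used (assuming pg_lfind32(key, array, nelem) as in the patch; the wrapper
function below is made up purely for illustration):

#include "postgres.h"
#include "port/pg_lfind.h"
#include "utils/snapshot.h"

/* Illustrative wrapper only -- not part of the patch. */
static bool
xid_in_snapshot_subxip(Snapshot snapshot, TransactionId xid)
{
    /* TransactionId is a uint32, so it can be passed to pg_lfind32() directly */
    return pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt);
}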

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Fix unmatched file identifications

2022-08-08 Thread Masahiko Sawada
Hi,

I found that there are two files (one .c and one .h) whose identification in the
header comment doesn't match their actual paths.

src/include/common/compression.h has:

  * IDENTIFICATION
  *src/common/compression.h
  *-

src/fe_utils/cancel.c has:

  * src/fe-utils/cancel.c
  *
  *

The attached small patch fixes them.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v1-0001-Fix-unmatched-file-identifications.patch
Description: Binary data


Re: Handle infinite recursion in logical replication setup

2022-08-04 Thread Masahiko Sawada
On Tue, Aug 2, 2022 at 8:59 PM Amit Kapila  wrote:
>
> On Tue, Jul 26, 2022 at 9:07 AM Amit Kapila  wrote:
> >
> > On Tue, Jul 26, 2022 at 7:13 AM Jonathan S. Katz  
> > wrote:
> > >
> > > Thanks for the example. I agree that it is fairly simple to reproduce.
> > >
> > > I understand that "copy_data = force" is meant to protect a user from
> > > hurting themself. I'm not convinced that this is the best way to do so.
> > >
> > > For example today I can subscribe to multiple publications that write to
> > > the same table. If I have a primary key on that table, and two of the
> > > subscriptions try to write an identical ID, we conflict. We don't have
> > > any special flags or modes to guard against that from happening, though
> > > we do have documentation on conflicts and managing them.
> > >
> > > AFAICT the same issue with "copy_data" also exists in the above scenario
> > > too, even without the "origin" attribute.
> > >
> >
> > That's true but there is no parameter like origin = NONE which
> > indicates that constraint violations or duplicate data problems won't
> > occur due to replication. In the current case, I think the situation
> > is different because a user has specifically asked not to replicate
> > any remote data by specifying origin = NONE, which should be dealt
> > differently. Note that current users or their setup won't see any
> > difference/change unless they specify the new parameter origin as
> > NONE.
> >
>
> Let me try to summarize the discussion so that it is easier for others
> to follow. The work in this thread is to avoid loops, and duplicate
> data in logical replication when the operations happened on the same
> table in multiple nodes. It has been explained in email [1] with an
> example of how a logical replication setup can lead to duplicate or
> inconsistent data.
>
> The idea to solve this problem is that we don't replicate data that is
> not generated locally which we can normally identify based on origin
> of data in WAL. The commit 366283961a achieves that for replication
> but still the problem can happen during initial sync which is
> performed internally via copy. We can't differentiate the data in heap
> based on origin. So, we decided to prohibit the subscription
> operations that can perform initial sync (ex. Create Subscription,
> Alter Subscription ... Refresh) by detecting that the publisher has
> subscribed to the same table from some other publisher.
>
> To prohibit the subscription operations, the currently proposed patch
> throws an error.  Then, it also provides a new copy_data option
> 'force' under which the user will still be able to perform the
> operation. This could be useful when the user intentionally wants to
> replicate the initial data even if it contains data from multiple
> nodes (for example, when in a multi-node setup, one decides to get the
> initial data from just one node and then allow replication of data to
> proceed from each of respective nodes).
>
> The other alternative discussed was to just give a warning for
> subscription operations and probably document the steps for users to
> avoid it. But the problem with that is once the user sees this
> warning, it won't be able to do anything except recreate the setup, so
> why not give an error in the first place?
>
> Thoughts?

Thank you for the summary!

I understand that this feature could help in some cases, but I'm really
not sure that adding the new value 'force' for them is worthwhile, and I'm
concerned it could reduce usability.

IIUC this feature would work only when origin = 'none' and copy_data =
'on', and the purpose is to prevent the data from being
duplicated/conflicted by the initial table sync. But there are cases
where duplication/conflict doesn't happen even if origin = 'none' and
copy_data = 'on'. For instance, the table on the publisher might be
empty. Also, even with origin = 'any', copy_data = 'on', there is a
possibility of data duplication/conflict. Why do we need to address
only the case where origin = 'none'? I think that using origin =
'none' doesn't necessarily mean using bi-directional (or N-way)
replication. Even when using uni-directional logical replication with
two nodes, they may use origin = 'none'. Therefore, it seems to me
that this feature works only for a narrow situation and has false
positives.

Since it has been the user's responsibility not to try to make the
data inconsistent by the initial table sync, I think that it might be
sufficient if we note the risk in the documentation.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: support for SSE2 intrinsics

2022-08-03 Thread Masahiko Sawada
Hi,

On Wed, Aug 3, 2022 at 2:01 PM John Naylor  wrote:
>
>
> On Tue, Aug 2, 2022 at 11:53 PM Nathan Bossart  
> wrote:
> > I did a bit of cross-checking, and AFAICT this is a reasonable starting
> > point.  emmintrin.h appears to be sufficient for one of my patches that
> > makes use of SSE2 instructions.  That being said, I imagine it'll be
> > especially important to keep an eye on the buildfarm when this change is
> > committed.
>
> Thanks for checking! Here's a concrete patch for testing.

I also think it's a good start. There is a typo in the commit message:

s/hepler/helper/

The rest looks good to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Introduce wait_for_subscription_sync for TAP tests

2022-08-03 Thread Masahiko Sawada
On Wed, Aug 3, 2022 at 1:51 PM Amit Kapila  wrote:
>
> On Sat, Jul 30, 2022 at 12:25 PM Amit Kapila  wrote:
> >
> > On Thu, Jul 28, 2022 at 6:37 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Jul 27, 2022 at 7:08 PM shiy.f...@fujitsu.com
> > >  wrote:
> > >
> > > I've attached updated patches that incorporated the above comments as
> > > well as the comment from Amit.
> > >
> > > BTW regarding 0001 patch to remove the duplicated wait, should we
> > > backpatch to v15?
> > >
> >
> > I think it is good to clean this test case even for PG15 even though
> > there is no major harm in keeping it. I'll push this early next week
> > by Tuesday unless someone thinks otherwise.
> >
>
> Pushed this one and now I'll look at your other patch.

Thanks!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-08-02 Thread Masahiko Sawada
On Tue, Aug 2, 2022 at 5:31 PM shiy.f...@fujitsu.com
 wrote:
>
> On Mon, Aug 1, 2022 10:31 PM Amit Kapila  wrote:
> >
> > On Mon, Aug 1, 2022 at 7:46 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Fri, Jul 29, 2022 at 3:45 PM Amit Kapila 
> > wrote:
> > > >
> > >
> > > I've attached updated patches for all branches. Please review them.
> > >
> >
> > Thanks, the patches look mostly good to me. I have made minor edits by
> > removing 'likely' from a few places as those don't seem to be adding
> > much value, changed comments at a few places, and was getting
> > compilation in error in v11/10 (snapbuild.c:2111:3: error: ‘for’ loop
> > initial declarations are only allowed in C99 mode) which I have fixed.
> > See attached, unless there are major comments/suggestions, I am
> > planning to push this day after tomorrow (by Wednesday) after another
> > pass.
> >
>
> Thanks for updating the patch.
>
> Here are some minor comments:
>
> 1.
> patches for REL10 ~ REL13:
> + * Mark the transaction as containing catalog changes. In addition, if the
> + * given xid is in the list of the initial running xacts, we mark the
> + * its subtransactions as well. See comments for NInitialRunningXacts and
> + * InitialRunningXacts for additional info.
>
> "mark the its subtransactions"
> ->
> "mark its subtransactions"
>
> 2.
> patches for REL10 ~ REL15:
> In the comment in catalog_change_snapshot.spec, maybe we can use 
> "RUNNING_XACTS"
> instead of "RUNNING_XACT" "XACT_RUNNING", same as the patch for master branch.
>

Thank you for the comments! These have been incorporated in the latest
version v12 patch I just submitted.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-08-02 Thread Masahiko Sawada
On Wed, Aug 3, 2022 at 10:20 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 2 Aug 2022 13:54:43 +0530, Amit Kapila  
> wrote in
> > On Tue, Aug 2, 2022 at 12:00 PM Kyotaro Horiguchi
> >  wrote:
> > >
> > > +   {
> > > +   int save_errno = errno;
> > > +
> > > +   CloseTransientFile(fd);
> > > +
> > > +   if (readBytes < 0)
> > > +   {
> > > +   errno = save_errno;
> > > +   ereport(ERROR,
> > >
> > > Do we need the CloseTransientFile(fd) there?  This call requires errno
> > > to be remembered but anyway OpenTransientFile'd files are to be close
> > > at transaction end.  Actually CloseTransientFile() is not called
> > > before error'ing-out at error in other places.
> ..
> > We just moved it to a separate function as the same code is being
> > duplicated to multiple places.
>
> There are code paths that doesn't CloseTransientFile() explicitly,
> too.  If there were no need of save_errno there, that'd be fine.  But
> otherwise I guess we prefer to let the orphan fds closed by ERROR and
> I don't think we need to preserve the less-preferred code pattern (if
> we actually prefer not to have the explicit call).

Looking at other code in snapbuild.c, we call CloseTransientFile()
before erroring out in SnapBuildSerialize(). I think it's better to
keep this patch consistent with the nearby code. If we prefer the
style of letting ereport(ERROR) clean up the open file, that should be
done for all of them in a separate patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-28 Thread Masahiko Sawada
On Thu, Jul 28, 2022 at 8:57 PM Amit Kapila  wrote:
>
> On Thu, Jul 28, 2022 at 3:23 PM Amit Kapila  wrote:
> >
> > On Tue, Jul 26, 2022 at 1:22 PM Masahiko Sawada  
> > wrote:
> > >
> > > Okay, I've attached an updated patch that does the above idea. Could
> > > you please do the performance tests again to see if the idea can help
> > > reduce the overhead, Shi yu?
> > >
> >
> > While reviewing the patch for HEAD, I have changed a few comments. See
> > attached, if you agree with these changes then include them in the
> > next version.
> >
>
> I have another comment on this patch:
> SnapBuildPurgeOlderTxn()
> {
> ...
> + if (surviving_xids > 0)
> + memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
> + surviving_xids * sizeof(TransactionId))
> ...
>
> For this code to hit, we must have a situation where one or more of
> the xacts in this array must be still running. And, if that is true,
> we would not have started from the restart point where the
> corresponding snapshot (that contains the still running xacts) has
> been serialized because we advance the restart point to not before the
> oldest running xacts restart_decoding_lsn. This may not be easy to
> understand so let me take an example to explain. Say we have two
> transactions t1 and t2, and both have made catalog changes. We want a
> situation where one of those gets purged and the other remains in
> builder->catchange.xip array. I have tried variants of the below
> sequence to see if I can get into the required situation but am not
> able to make it.
>
> Session-1
> Checkpoint -1;
> T1
> DDL
>
> Session-2
> T2
> DDL
>
> Session-3
> Checkpoint-2;
> pg_logical_slot_get_changes()
>  -- Here when we serialize the snapshot corresponding to
> CHECKPOINT-2's running_xact record, we will serialize both t1 and t2
> as catalog-changing xacts.
>
> Session-1
> T1
> Commit;
>
> Checkpoint;
> pg_logical_slot_get_changes()
>  -- Here we will restore from Checkpoint-1's serialized snapshot and
> won't be able to move restart_point to Checkpoint-2 because T2 is
> still open.
>
> Now, as per my understanding, it is only possible to move
> restart_point to Checkpoint-2 if T2 gets committed/rolled-back in
> which case we will never have that in surviving_xids array after the
> purge.
>
> It is possible I am missing something here. Do let me know your thoughts.

Yeah, your description makes sense to me. I've also considered how to
hit this path, but I think it can never be hit. Thinking of it another
way: first of all, at least two catalog-modifying transactions have to
be running while an xl_running_xacts record is written. The serialized
snapshot that is written when we decode the first xl_running_xacts has
two transactions. Then, one of them commits before the second
xl_running_xacts, so the second serialized snapshot has only one
transaction. Then, that transaction also commits after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-28 Thread Masahiko Sawada
On Thu, Jul 28, 2022 at 4:13 PM Amit Kapila  wrote:
>
> On Thu, Jul 28, 2022 at 11:56 AM Masahiko Sawada  
> wrote:
> >
> > On Thu, Jul 28, 2022 at 12:21 PM Amit Kapila  
> > wrote:
> > >
> > > On Thu, Jul 28, 2022 at 7:18 AM Masahiko Sawada  
> > > wrote:
> > > >
> >
> > While editing back branch patches, I realized that the following
> > (parsed->xinfo & XACT_XINFO_HAS_INVALS) and (parsed->nmsgs > 0) are
> > equivalent:
> >
> > +   /*
> > +* If the COMMIT record has invalidation messages, it could have catalog
> > +* changes. It is possible that we didn't mark this transaction as
> > +* containing catalog changes when the decoding starts from a commit
> > +* record without decoding the transaction's other changes. So, we 
> > ensure
> > +* to mark such transactions as containing catalog change.
> > +*
> > +* This must be done before SnapBuildCommitTxn() so that we can include
> > +* these transactions in the historic snapshot.
> > +*/
> > +   if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
> > +   SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
> > + parsed->nsubxacts, parsed->subxacts,
> > + buf->origptr);
> > +
> > /*
> >  * Process invalidation messages, even if we're not interested in the
> >  * transaction's contents, since the various caches need to always be
> >  * consistent.
> >  */
> > if (parsed->nmsgs > 0)
> > {
> > if (!ctx->fast_forward)
> > ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
> >   parsed->nmsgs, parsed->msgs);
> > ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
> > }
> >
> > If that's right, I think we can merge these if branches. We can call
> > ReorderBufferXidSetCatalogChanges() for top-txn and in
> > SnapBuildXidSetCatalogChanges() we mark its subtransactions if top-txn
> > is in the list. What do you think?
> >
>
> Note that this code doesn't exist in 14 and 15, so we need to create
> different patches for those.

Right.

> BTW, how in 13 and lower versions did we
> identify and mark subxacts as having catalog changes without our
> patch?

I think we use HEAP_INPLACE and XLOG_HEAP2_NEW_CID to mark subxacts as well.
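Roughly, that path looks like this (an abridged sketch paraphrasing
snapbuild.c from memory; details differ across branches). The xid passed
down comes from the WAL record itself, so a catalog change logged inside a
subtransaction marks that subtransaction's xid:

/* Abridged sketch of SnapBuildProcessNewCid(); not the exact tree code. */
void
SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
                       XLogRecPtr lsn, xl_heap_new_cid *xlrec)
{
    /*
     * new_cid records are only logged for catalog tuples, so the record's
     * xid (possibly a subxact's xid) is marked as changing the catalog.
     */
    ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

    /* ... tuple-cid bookkeeping omitted ... */
}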

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Improve description of XLOG_RUNNING_XACTS

2022-07-28 Thread Masahiko Sawada
On Thu, Jul 28, 2022 at 3:24 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 28 Jul 2022 09:56:33 +0530, Ashutosh Bapat 
>  wrote in
> > Thanks Masahiko for the updated patch. It looks good to me.
> >
> > I wonder whether the logic should be, similar
> > to ProcArrayApplyRecoveryInfo()
> >  if (xlrec->subxid_overflow)
> > ...
> > else if (xlrec->subxcnt > 0)
> > ...
> >
> > But you may ignore it.
>
> Either is fine if we asuume the record is sound, but since it is
> debugging output, I think we should always output the information *for
> both* . The following change doesn't change the output for a sound
> record.
>
> 
> if (xlrec->subxcnt > 0)
> {
> appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
> for (i = 0; i < xlrec->subxcnt; i++)
> appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
> }
> -   else if (xlrec->subxid_overflow)
> +   if (xlrec->subxid_overflow)
> appendStringInfoString(buf, "; subxid overflowed");
> 

Do you mean that both could be true at the same time? If I read
GetRunningTransactionData() correctly, that doesn't happen.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-28 Thread Masahiko Sawada

On Thu, Jul 28, 2022 at 12:21 PM Amit Kapila  wrote:
>
> On Thu, Jul 28, 2022 at 7:18 AM Masahiko Sawada  wrote:
> >
> > On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila  wrote:
> > >
> >
> > >  I have changed accordingly in the attached
> > > and apart from that slightly modified the comments and commit message.
> > > Do let me know what you think of the attached?
> >
> > It would be better to remember the initial running xacts after
> > SnapBuildRestore() returns true? Because otherwise, we could end up
> > allocating InitialRunningXacts multiple times while leaking the old
> > ones if there are no serialized snapshots that we are interested in.
> >
>
> Right, this makes sense. But note that you can no longer have a check
> (builder->state == SNAPBUILD_START) which I believe is not required.
> We need to do this after restore, in whichever state snapshot was as
> any state other than SNAPBUILD_CONSISTENT can have commits without all
> their changes.

Right.

>
> Accordingly, I think the comment: "Remember the transactions and
> subtransactions that were running when xl_running_xacts record that we
> decoded first was written." needs to be slightly modified to something
> like: "Remember the transactions and subtransactions that were running
> when xl_running_xacts record that we decoded was written.". Change
> this if it is used at any other place in the patch.

Agreed.

>
> > ---
> > +   if (builder->state == SNAPBUILD_START)
> > +   {
> > +   int nxacts = running->subxcnt + running->xcnt;
> > +   Size sz = sizeof(TransactionId) * nxacts;
> > +
> > +   NInitialRunningXacts = nxacts;
> > +   InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
> > +   memcpy(InitialRunningXacts, running->xids, sz);
> > +   qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
> > +   }
> >
> > We should allocate the memory for InitialRunningXacts only when
> > (running->subxcnt + running->xcnt) > 0.
> >
>
> There is no harm in doing that but ideally, that case would have been
> covered by an earlier check "if (running->oldestRunningXid ==
> running->nextXid)" which suggests "No transactions were running, so we
> can jump to consistent."

You're right.

While editing back branch patches, I realized that the following
(parsed->xinfo & XACT_XINFO_HAS_INVALS) and (parsed->nmsgs > 0) are
equivalent:

+   /*
+* If the COMMIT record has invalidation messages, it could have catalog
+* changes. It is possible that we didn't mark this transaction as
+* containing catalog changes when the decoding starts from a commit
+* record without decoding the transaction's other changes. So, we ensure
+* to mark such transactions as containing catalog change.
+*
+* This must be done before SnapBuildCommitTxn() so that we can include
+* these transactions in the historic snapshot.
+*/
+   if (parsed->xinfo & XACT_XINFO_HAS_INVALS)
+   SnapBuildXidSetCatalogChanges(ctx->snapshot_builder, xid,
+ parsed->nsubxacts, parsed->subxacts,
+ buf->origptr);
+
/*
 * Process invalidation messages, even if we're not interested in the
 * transaction's contents, since the various caches need to always be
 * consistent.
 */
if (parsed->nmsgs > 0)
{
if (!ctx->fast_forward)
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
      parsed->nmsgs, parsed->msgs);
ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
}

If that's right, I think we can merge these two if-branches: we can call
ReorderBufferXidSetCatalogChanges() for the top-txn, and in
SnapBuildXidSetCatalogChanges() mark its subtransactions if the top-txn
is in the list. What do you think?
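To illustrate, a rough sketch of what I have in mind (names taken from the
hunks above; this is a sketch only, not the final patch):

/*
 * Sketch only: mark the given top-level transaction's subtransactions as
 * containing catalog changes if the top-level xid was running when the
 * first xl_running_xacts we decoded was written. The caller (the merged
 * branch above) marks the top-level transaction itself via
 * ReorderBufferXidSetCatalogChanges().
 */
void
SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid,
                              int subxcnt, TransactionId *subxacts,
                              XLogRecPtr lsn)
{
    int     i;

    if (NInitialRunningXacts == 0 ||
        bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
                sizeof(TransactionId), xidComparator) == NULL)
        return;

    for (i = 0; i < subxcnt; i++)
        ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
}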

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Improve description of XLOG_RUNNING_XACTS

2022-07-27 Thread Masahiko Sawada
On Thu, Jul 21, 2022 at 10:13 PM Ashutosh Bapat
 wrote:
>
> Hi
>
> On Thu, Jul 21, 2022 at 6:44 AM Masahiko Sawada  wrote:
> >
> > Hi,
> >
> > I realized that standby_desc_running_xacts() in standbydesc.c doesn't
> > describe subtransaction XIDs. I've attached the patch to improve the
> > description. Here is an example by pg_wlaldump:
> >
> > * HEAD
> > rmgr: Standby len (rec/tot): 58/58, tx:  0, lsn:
> > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048
> >
> > * w/ patch
> > rmgr: Standby len (rec/tot): 58/58, tx:  0, lsn:
> > 0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
> > latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048; 1
> > subxacts: 1049
> >
>
> I think this is a good addition to debugging info. +1
>
> If we are going to add 64 subxid numbers then it would help if we
> could be more verbose and print "subxid overflowed" instead of "subxid
> ovf".

Yeah, that looks better, so agreed. I've attached an updated patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v2-0001-Improve-description-of-XLOG_RUNNING_XACTS.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-27 Thread Masahiko Sawada
On Wed, Jul 27, 2022 at 8:33 PM Amit Kapila  wrote:
>
> On Mon, Jul 25, 2022 at 11:26 AM Masahiko Sawada  
> wrote:
> >
> > On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada  
> > wrote:
> >
> > I've attached the patch for REl15 that I forgot.
> >
>
> I feel the place to remember running xacts information in
> SnapBuildProcessRunningXacts is not appropriate. Because in cases
> where there are no running xacts or when xl_running_xact is old enough
> that we can't use it, we don't need that information. I feel we need
> it only when we have to reuse the already serialized snapshot, so,
> won't it be better to initialize at that place in
> SnapBuildFindSnapshot()?

Good point, agreed.

>  I have changed accordingly in the attached
> and apart from that slightly modified the comments and commit message.
> Do let me know what you think of the attached?

Wouldn't it be better to remember the initial running xacts after
SnapBuildRestore() returns true? Otherwise, we could end up
allocating InitialRunningXacts multiple times while leaking the old
ones if there are no serialized snapshots that we are interested in.

---
+   if (builder->state == SNAPBUILD_START)
+   {
+   int nxacts = running->subxcnt + running->xcnt;
+   Size sz = sizeof(TransactionId) * nxacts;
+
+   NInitialRunningXacts = nxacts;
+   InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+   memcpy(InitialRunningXacts, running->xids, sz);
+   qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
+   }

We should allocate the memory for InitialRunningXacts only when
(running->subxcnt + running->xcnt) > 0.
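In other words, something like the following shape (a sketch of the
suggested guard only, reusing the names from the quoted hunk):

    if (builder->state == SNAPBUILD_START)
    {
        int     nxacts = running->subxcnt + running->xcnt;

        /* allocate only when there is something to remember */
        if (nxacts > 0)
        {
            Size    sz = sizeof(TransactionId) * nxacts;

            NInitialRunningXacts = nxacts;
            InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
            memcpy(InitialRunningXacts, running->xids, sz);
            qsort(InitialRunningXacts, nxacts, sizeof(TransactionId),
                  xidComparator);
        }
    }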

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Introduce wait_for_subscription_sync for TAP tests

2022-07-27 Thread Masahiko Sawada
On Wed, Jul 27, 2022 at 8:54 PM Amit Kapila  wrote:
>
> On Tue, Jul 26, 2022 at 1:12 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila  wrote:
> > >
> > > 2.
> > > +# wait for the replication to catchup if required.
> > > +if (defined($publisher))
> > > +{
> > > + croak 'subscription name must be specified' unless defined($subname);
> > > + $publisher->wait_for_catchup($subname, 'replay');
> > > +}
> > > +
> > > +# then, wait for all table states to be ready.
> > > +print "Waiting for all subscriptions in \"$name\" to synchronize 
> > > data\n";
> > > +my $query = qq[SELECT count(1) = 0
> > > + FROM pg_subscription_rel
> > > + WHERE srsubstate NOT IN ('r', 's');];
> > > +$self->poll_query_until($dbname, $query)
> > > + or croak "timed out waiting for subscriber to synchronize data";
> > >
> > > In the tests, I noticed that a few places did wait_for_catchup after
> > > the subscription check, and at other places, we did that check before
> > > as you have it here. Ideally, I think wait_for_catchup should be after
> > > confirming the initial sync is over as without initial sync, the
> > > publisher node won't be completely in sync with the subscriber.
> >
> > What do you mean by the last sentence? I thought the order doesn't
> > matter here. Even if we do wait_for_catchup first then the
> > subscription check, we can make sure that the apply worker caught up
> > and table synchronization has been done, no?
> >
>
> That's right. I thought we should first ensure the subscriber has
> finished operations if possible, like in this case, it can ensure
> table sync has finished and then we can ensure whether publisher and
> subscriber are in sync. That sounds more logical to me.

Makes sense. I've incorporated it in the v3 patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Introduce wait_for_subscription_sync for TAP tests

2022-07-27 Thread Masahiko Sawada
On Wed, Jul 27, 2022 at 7:08 PM shiy.f...@fujitsu.com
 wrote:
>
> On Tue, Jul 26, 2022 3:42 PM Masahiko Sawada  wrote:
> >
> > I've attached an updated patch as well as a patch to remove duplicated
> > waits in 007_ddl.pl.
> >
>
> Thanks for your patch. Here are some comments.

Thank you for the comments!

>
> 1.
> I think some comments need to be changed in the patch.
> For example:
> # Also wait for initial table sync to finish
> # Wait for initial sync to finish as well
>
> Words like "Also" and "as well" can be removed now, we originally used them
> because we wait for catchup and "also" wait for initial sync.

Agreed.

>
> 2.
> In the following places, we can remove wait_for_catchup() and then call it in
> wait_for_subscription_sync().
>
> 2.1.
> 030_origin.pl:
> @@ -128,8 +120,7 @@ $node_B->safe_psql(
>
>  $node_C->wait_for_catchup($appname_B2);
>
> -$node_B->poll_query_until('postgres', $synced_query)
> -  or die "Timed out while waiting for subscriber to synchronize data";
> +$node_B->wait_for_subscription_sync;
>
> 2.2.
> 031_column_list.pl:
> @@ -385,7 +373,7 @@ $node_subscriber->safe_psql(
> ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3
>  ));
>
> -wait_for_subscription_sync($node_subscriber);
> +$node_subscriber->wait_for_subscription_sync;
>
>  $node_publisher->wait_for_catchup('sub1');
>
> 2.3.
> 100_bugs.pl:
> @@ -281,8 +276,7 @@ $node_subscriber->safe_psql('postgres',
>  $node_publisher->wait_for_catchup('tap_sub');
>
>  # Also wait for initial table sync to finish
> -$node_subscriber->poll_query_until('postgres', $synced_query)
> -  or die "Timed out while waiting for subscriber to synchronize data";
> +$node_subscriber->wait_for_subscription_sync;
>
>  is( $node_subscriber->safe_psql(
> 'postgres', "SELECT * FROM tab_replidentity_index"),

Agreed.

I've attached updated patches that incorporated the above comments as
well as the comment from Amit.

BTW, regarding the 0001 patch to remove the duplicated wait, should we
backpatch it to v15? I think we can do that as it's an obvious fix and it
seems to be an oversight in 8f2e2bbf145.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v3-0001-Remove-duplicated-wait-for-subscription-synchroni.patch
Description: Binary data


v3-0002-Introduce-wait_for_subscription_sync-for-TAP-test.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-26 Thread Masahiko Sawada
On Tue, Jul 26, 2022 at 2:18 PM Amit Kapila  wrote:
>
> On Tue, Jul 26, 2022 at 7:00 AM Masahiko Sawada  wrote:
> >
> > On Mon, Jul 25, 2022 at 7:57 PM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > > Hi,
> > >
> > > I did some performance test for the master branch patch (based on v6 
> > > patch) to
> > > see if the bsearch() added by this patch will cause any overhead.
> >
> > Thank you for doing performance tests!
> >
> > >
> > > I tested them three times and took the average.
> > >
> > > The results are as follows, and attach the bar chart.
> > >
> > > case 1
> > > -
> > > No catalog modifying transaction.
> > > Decode 800k pgbench transactions. (8 clients, 100k transactions per 
> > > client)
> > >
> > > master  7.5417
> > > patched 7.4107
> > >
> > > case 2
> > > -
> > > There's one catalog modifying transaction.
> > > Decode 100k/500k/1M transactions.
> > >
> > > 100k500k1M
> > > master  0.0576  0.1491  0.4346
> > > patched 0.0586  0.1500  0.4344
> > >
> > > case 3
> > > -
> > > There are 64 catalog modifying transactions.
> > > Decode 100k/500k/1M transactions.
> > >
> > > 100k500k1M
> > > master  0.0600  0.1666  0.4876
> > > patched 0.0620  0.1653  0.4795
> > >
> > > (Because the result of case 3 shows that there is a overhead of about 3% 
> > > in the
> > > case decoding 100k transactions with 64 catalog modifying transactions, I
> > > tested the next run of 100k xacts with or without catalog modifying
> > > transactions, to see if it affects subsequent decoding.)
> > >
> > > case 4.1
> > > -
> > > After the test steps in case 3 (64 catalog modifying transactions, decode 
> > > 100k
> > > transactions), run 100k xacts and then decode.
> > >
> > > master  0.3699
> > > patched 0.3701
> > >
> > > case 4.2
> > > -
> > > After the test steps in case 3 (64 catalog modifying transactions, decode 
> > > 100k
> > > transactions), run 64 DDLs(without checkpoint) and 100k xacts, then 
> > > decode.
> > >
> > > master  0.3687
> > > patched 0.3696
> > >
> > > Summary of the tests:
> > > After applying this patch, there is a overhead of about 3% in the case 
> > > decoding
> > > 100k transactions with 64 catalog modifying transactions. This is an 
> > > extreme
> > > case, so maybe it's okay.
> >
> > Yes. If we're worried about the overhead and doing bsearch() is the
> > cause, probably we can try simplehash instead of the array.
> >
>
> I am not sure if we need to go that far for this extremely corner
> case. Let's first try your below idea.
>
> > An improvement idea is that we pass the parsed->xinfo down to
> > SnapBuildXidHasCatalogChanges(), and then return from that function
> > before doing bearch() if the parsed->xinfo doesn't have
> > XACT_XINFO_HAS_INVALS. That would save calling bsearch() for
> > non-catalog-modifying transactions. Is it worth trying?
> >
>
> I think this is worth trying and this might reduce some of the
> overhead as well in the case presented by Shi-San.

Okay, I've attached an updated patch that implements the above idea. Could
you please do the performance tests again to see if the idea can help
reduce the overhead, Shi yu?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


master-v9-0001-Add-catalog-modifying-transactions-to-logical-dec.patch
Description: Binary data


Re: Introduce wait_for_subscription_sync for TAP tests

2022-07-26 Thread Masahiko Sawada
On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila  wrote:
>
> On Tue, Jul 26, 2022 at 7:07 AM Masahiko Sawada  wrote:
> >
> > Hi,
> >
> > In tap tests for logical replication, we have the following code in many 
> > places:
> >
> > $node_publisher->wait_for_catchup('tap_sub');
> > my $synced_query =
> >   "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
> > IN ('r', 's');";
> > $node_subscriber->poll_query_until('postgres', $synced_query)
> >   or die "Timed out while waiting for subscriber to synchronize data";
> >
> > Also, we sometime forgot to check either one, like we fixed in commit
> > 1f50918a6fb02207d151e7cb4aae4c36de9d827c.
> >
> > I think we can have a new function to wait for all subscriptions to
> > synchronize data. The attached patch introduce a new function
> > wait_for_subscription_sync(). With this function, we can replace the
> > above code with this one function as follows:
> >
> > $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
> >
>
> +1. This reduces quite some code in various tests and will make it
> easier to write future tests.
>
> Few comments/questions:
> 
> 1.
> -$node_publisher->wait_for_catchup('mysub1');
> -
> -# Also wait for initial table sync to finish.
> -my $synced_query =
> -  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
> IN ('r', 's');";
> -$node_subscriber->poll_query_until('postgres', $synced_query)
> -  or die "Timed out while waiting for subscriber to synchronize data";
> -
>  # Also wait for initial table sync to finish.
> -$node_subscriber->poll_query_until('postgres', $synced_query)
> -  or die "Timed out while waiting for subscriber to synchronize data";
> +$node_subscriber->wait_for_subscription_sync($node_publisher, 'mysub1');
>
> It seems to me without your patch there is an extra poll in the above
> test. If so, we can probably remove that in a separate patch?

Agreed.

>
> 2.
> +# wait for the replication to catchup if required.
> +if (defined($publisher))
> +{
> + croak 'subscription name must be specified' unless defined($subname);
> + $publisher->wait_for_catchup($subname, 'replay');
> +}
> +
> +# then, wait for all table states to be ready.
> +print "Waiting for all subscriptions in \"$name\" to synchronize data\n";
> +my $query = qq[SELECT count(1) = 0
> + FROM pg_subscription_rel
> + WHERE srsubstate NOT IN ('r', 's');];
> +$self->poll_query_until($dbname, $query)
> + or croak "timed out waiting for subscriber to synchronize data";
>
> In the tests, I noticed that a few places did wait_for_catchup after
> the subscription check, and at other places, we did that check before
> as you have it here. Ideally, I think wait_for_catchup should be after
> confirming the initial sync is over as without initial sync, the
> publisher node won't be completely in sync with the subscriber.

What do you mean by the last sentence? I thought the order doesn't
matter here. Even if we do wait_for_catchup first and then the
subscription check, we can still make sure that the apply worker has
caught up and table synchronization has been done, no?

>
> 3. In the code quoted in the previous point, why did you pass the
> second parameter as 'replay' when we have not used that in the tests
> otherwise?

It makes sure to use the (current) default value of $mode in
wait_for_catchup(). But it's probably not necessary, so I've removed it.

I've attached an updated patch as well as a patch to remove duplicated
waits in 007_ddl.pl.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
From 35c7b74b66a179b14fc97a012309f3d3ee9a7e26 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Tue, 26 Jul 2022 16:37:26 +0900
Subject: [PATCH v2 1/2] Remove duplicated wait for subscription
 synchronization in 007_ddl.pl

---
 src/test/subscription/t/007_ddl.pl | 4 
 1 file changed, 4 deletions(-)

diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index cdd6b119ff..01df54229c 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -57,10 +57,6 @@ my $synced_query =
 $node_subscriber->poll_query_until('postgres', $synced_query)
   or die "Timed out while waiting for subscriber to synchronize data";
 
-# Also wait for initial table sync to finish.
-$node_subscriber->poll_query_until('postgres', $synced_query)
-  or die "Timed out while waiting for subscriber to synchronize data";
-
 # Specifying non-existent publication along with add publication.
 ($ret, $stdout, $stderr) = $node_subscri

Introduce wait_for_subscription_sync for TAP tests

2022-07-25 Thread Masahiko Sawada
Hi,

In tap tests for logical replication, we have the following code in many places:

$node_publisher->wait_for_catchup('tap_sub');
my $synced_query =
  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
IN ('r', 's');";
$node_subscriber->poll_query_until('postgres', $synced_query)
  or die "Timed out while waiting for subscriber to synchronize data";

Also, we sometimes forgot to check one of them, as we fixed in commit
1f50918a6fb02207d151e7cb4aae4c36de9d827c.

I think we can have a new function to wait for all subscriptions to
synchronize data. The attached patch introduces a new function
wait_for_subscription_sync(). With this function, we can replace the
above code with this one function as follows:

$node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v1-0001-Introduce-wait_for_subscription_sync-for-TAP-test.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-25 Thread Masahiko Sawada
On Mon, Jul 25, 2022 at 7:57 PM shiy.f...@fujitsu.com
 wrote:
>
> Hi,
>
> I did some performance test for the master branch patch (based on v6 patch) to
> see if the bsearch() added by this patch will cause any overhead.

Thank you for doing performance tests!

>
> I tested them three times and took the average.
>
> The results are as follows, and attach the bar chart.
>
> case 1
> -
> No catalog modifying transaction.
> Decode 800k pgbench transactions. (8 clients, 100k transactions per client)
>
> master  7.5417
> patched 7.4107
>
> case 2
> -
> There's one catalog modifying transaction.
> Decode 100k/500k/1M transactions.
>
> 100k500k1M
> master  0.0576  0.1491  0.4346
> patched 0.0586  0.1500  0.4344
>
> case 3
> -
> There are 64 catalog modifying transactions.
> Decode 100k/500k/1M transactions.
>
> 100k500k1M
> master  0.0600  0.1666  0.4876
> patched 0.0620  0.1653  0.4795
>
> (Because the result of case 3 shows that there is a overhead of about 3% in 
> the
> case decoding 100k transactions with 64 catalog modifying transactions, I
> tested the next run of 100k xacts with or without catalog modifying
> transactions, to see if it affects subsequent decoding.)
>
> case 4.1
> -
> After the test steps in case 3 (64 catalog modifying transactions, decode 100k
> transactions), run 100k xacts and then decode.
>
> master  0.3699
> patched 0.3701
>
> case 4.2
> -
> After the test steps in case 3 (64 catalog modifying transactions, decode 100k
> transactions), run 64 DDLs(without checkpoint) and 100k xacts, then decode.
>
> master  0.3687
> patched 0.3696
>
> Summary of the tests:
> After applying this patch, there is a overhead of about 3% in the case 
> decoding
> 100k transactions with 64 catalog modifying transactions. This is an extreme
> case, so maybe it's okay.

Yes. If we're worried about the overhead and doing bsearch() is the
cause, probably we can try simplehash instead of the array.

An improvement idea is to pass the parsed->xinfo down to
SnapBuildXidHasCatalogChanges(), and then return from that function
before doing bsearch() if the parsed->xinfo doesn't have
XACT_XINFO_HAS_INVALS. That would save calling bsearch() for
non-catalog-modifying transactions. Is it worth trying?
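What I have in mind is roughly the following shape (a sketch only, not a
finished patch; the catchange.xcnt field name is an assumption to go with
the catchange.xip array mentioned upthread):

/* Sketch: let DecodeCommit() pass parsed->xinfo down so we can skip the
 * bsearch() for transactions whose commit record has no invalidations. */
static bool
SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
                              uint32 xinfo)
{
    if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
        return true;

    /* Without invalidation messages the xact cannot be one of the initial
     * catalog-modifying transactions we remembered. */
    if (!(xinfo & XACT_XINFO_HAS_INVALS))
        return false;

    /* Fall back to a bsearch() over the sorted catalog-change xid array */
    return bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
                   sizeof(TransactionId), xidComparator) != NULL;
}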

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-24 Thread Masahiko Sawada
On Mon, Jul 25, 2022 at 10:45 AM Masahiko Sawada  wrote:
>
> On Sat, Jul 23, 2022 at 8:32 PM Amit Kapila  wrote:
> >
> > On Fri, Jul 22, 2022 at 11:48 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Jul 20, 2022 at 5:50 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Wed, Jul 20, 2022 at 1:28 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > >
> > > > This is required if we don't want to introduce a new set of functions
> > > > as you proposed above. I am not sure which one is better w.r.t back
> > > > patching effort later but it seems to me using flag stuff would make
> > > > future back patches easier if we make any changes in
> > > > SnapBuildCommitTxn.
> > >
> > > Understood.
> > >
> > > I've implemented this idea as well for discussion. Both patches have
> > > the common change to remember the initial running transactions and to
> > > purge them when decoding xl_running_xacts records. The difference is
> > > how to mark the transactions as needing to be added to the snapshot.
> > >
> > > In v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch,
> > > when the transaction is in the initial running xact list and its
> > > commit record has XINFO_HAS_INVAL flag, we mark both the top
> > > transaction and its all subtransactions as containing catalog changes
> > > (which also means to create ReorderBufferTXN entries for them). These
> > > transactions are added to the snapshot in SnapBuildCommitTxn() since
> > > ReorderBufferXidHasCatalogChanges () for them returns true.
> > >
> > > In poc_mark_top_txn_has_inval.patch, when the transaction is in the
> > > initial running xacts list and its commit record has XINFO_HAS_INVALS
> > > flag, we set a new flag, say RBTXN_COMMIT_HAS_INVALS, only to the top
> > > transaction.
> > >
> >
> > It seems that the patch has missed the part to check if the xid is in
> > the initial running xacts list?
>
> Oops, right.
>
> >
> > > In SnapBuildCommitTxn(), we add all subtransactions to
> > > the snapshot without checking ReorderBufferXidHasCatalogChanges() for
> > > subtransactions if its top transaction has the RBTXN_COMMIT_HAS_INVALS
> > > flag.
> > >
> > > A difference between the two ideas is the scope of changes: the former
> > > changes only snapbuild.c but the latter changes both snapbuild.c and
> > > reorderbuffer.c. Moreover, while the former uses the existing flag,
> > > the latter adds a new flag to the reorder buffer for dealing with only
> > > this case. I think the former idea is simpler in terms of that. But,
> > > an advantage of the latter idea is that the latter idea can save to
> > > create ReorderBufferTXN entries for subtransactions.
> > >
> > > Overall I prefer the former for now but I'd like to hear what others 
> > > think.
> > >
> >
> > I agree that the latter idea can have better performance in extremely
> > special scenarios but introducing a new flag for the same sounds a bit
> > ugly to me. So, I would also prefer to go with the former idea,
> > however, I would also like to hear what Horiguchi-San and others have
> > to say.
>
> Agreed.
>
> >
> > Few comments on v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during:
> > 1.
> > +void
> > +SnapBuildInitialXactSetCatalogChanges(SnapBuild *builder, TransactionId 
> > xid,
> > +   int subxcnt, TransactionId *subxacts,
> > +   XLogRecPtr lsn)
> > +{
> >
> > I think it is better to name this function as
> > SnapBuildXIDSetCatalogChanges as we use this to mark a particular
> > transaction as having catalog changes.
> >
> > 2. Changed/added a few comments in the attached.
>
> Thank you for the comments.
>
> I've attached updated version patches for the master and back branches.

I've attached the patch for REL15 that I forgot.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


REL15-v8-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-22 Thread Masahiko Sawada
On Wed, Jul 20, 2022 at 5:50 PM Amit Kapila  wrote:
>
> On Wed, Jul 20, 2022 at 1:28 PM Masahiko Sawada  wrote:
> >
> > On Wed, Jul 20, 2022 at 2:19 PM Amit Kapila  wrote:
> > >
> > > On Wed, Jul 20, 2022 at 9:01 AM Masahiko Sawada  
> > > wrote:
> > >
> > > > Another idea would be to have functions, say
> > > > SnapBuildCommitTxnWithXInfo() and SnapBuildCommitTxn_ext(). The latter
> > > > does actual work of handling transaction commits and both
> > > > SnapBuildCommitTxn() and SnapBuildCommit() call
> > > > SnapBuildCommitTxnWithXInfo() with different arguments.
> > > >
> > >
> > > Do you want to say DecodeCommit() instead of SnapBuildCommit() in
> > > above para?
> >
> > I meant that we will call like DecodeCommit() ->
> > SnapBuildCommitTxnWithXInfo() -> SnapBuildCommitTxn_ext(has_invals =
> > true) -> SnapBuildXidHasCatalogChanges(has_invals = true) -> ... If
> > SnapBuildCommitTxn() gets called, it calls SnapBuildCommitTxn_ext()
> > with has_invals = false and behaves the same as before.
> >
>
> Okay, understood. This will work.
>
> > > Yet another idea could be to have another flag
> > > RBTXN_HAS_INVALS which will be set by DecodeCommit for top-level TXN.
> > > Then, we can retrieve it even for each of the subtxn's if and when
> > > required.
> >
> > Do you mean that when checking if the subtransaction has catalog
> > changes, we check if its top-level XID has this new flag?
> >
>
> Yes.
>
> > Why do we
> > need the new flag?
> >
>
> This is required if we don't want to introduce a new set of functions
> as you proposed above. I am not sure which one is better w.r.t back
> patching effort later but it seems to me using flag stuff would make
> future back patches easier if we make any changes in
> SnapBuildCommitTxn.

Understood.

I've implemented this idea as well for discussion. Both patches have
the common change to remember the initial running transactions and to
purge them when decoding xl_running_xacts records. The difference is
how to mark the transactions as needing to be added to the snapshot.

In v7-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch,
when the transaction is in the initial running xact list and its
commit record has the XINFO_HAS_INVAL flag, we mark both the top
transaction and all its subtransactions as containing catalog changes
(which also means creating ReorderBufferTXN entries for them). These
transactions are added to the snapshot in SnapBuildCommitTxn() since
ReorderBufferXidHasCatalogChanges() returns true for them.

In poc_mark_top_txn_has_inval.patch, when the transaction is in the
initial running xacts list and its commit record has XINFO_HAS_INVALS
flag, we set a new flag, say RBTXN_COMMIT_HAS_INVALS, only to the top
transaction. In SnapBuildCommitTxn(), we add all subtransactions to
the snapshot without checking ReorderBufferXidHasCatalogChanges() for
subtransactions if its top transaction has the RBTXN_COMMIT_HAS_INVALS
flag.

A difference between the two ideas is the scope of changes: the former
changes only snapbuild.c but the latter changes both snapbuild.c and
reorderbuffer.c. Moreover, while the former uses the existing flag,
the latter adds a new flag to the reorder buffer for dealing with only
this case. I think the former idea is simpler in that respect. But an
advantage of the latter idea is that it can avoid creating
ReorderBufferTXN entries for subtransactions.

Overall I prefer the former for now but I'd like to hear what others think.

FWIW, I didn't try the idea of adding wrapper functions since it would
be costly in terms of back patching effort in the future.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..4553252d75 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot catalog_change_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
new file mode 100644
index 00..dc4f9b7018
--- /dev/null
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -0,0 +1,44 @@
+Parsed test spec with

Re: slot_creation_error failures

2022-07-21 Thread Masahiko Sawada
On Fri, Jul 22, 2022 at 10:59 AM Thomas Munro  wrote:
>
> Hi,
>
> Here are some recent $SUBJECT on HEAD.  Unfortunately we don't see the
> regression.diffs file :-(

We can see regression.diffs[1]:

== output_iso/regression.diffs 
diff -w -U3 
C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out
C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
--- 
C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out
2022-03-19 06:04:11.806604000 +
+++ 
C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
2022-04-11 21:15:58.342700300 +
@@ -92,23 +92,7 @@
 FROM pg_stat_activity
 WHERE application_name = 'isolation/slot_creation_error/s2';
  
-step s2_init: <... completed>
-FATAL:  terminating connection due to administrator command
-server closed the connection unexpectedly
+PQconsumeInput failed: server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.

-step s1_terminate_s2: <... completed>
-pg_terminate_backend
-
-t
-(1 row)
-
-step s1_c: COMMIT;
-step s1_view_slot:
-SELECT slot_name, slot_type, active FROM pg_replication_slots
WHERE slot_name = 'slot_creation_error'
-
-slot_name|slot_type|active
--+-+--
-(0 rows)
-

Regards,

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2022-04-11%2021%3A04%3A15&stg=test-decoding-check

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-21 Thread Masahiko Sawada
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
 wrote:
>
>
>
> On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada  wrote:
>
> > I’d like to keep the first version simple. We can improve it and add
> > more optimizations later. Using radix tree for vacuum TID storage
> > would still be a big win comparing to using a flat array, even without
> > all these optimizations. In terms of single-value leaves method, I'm
> > also concerned about an extra pointer traversal and extra memory
> > allocation. It's most flexible but multi-value leaves method is also
> > flexible enough for many use cases. Using the single-value method
> > seems to be too much as the first step for me.
> >
> > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > choice for me as the first step . It can cover wider use cases
> > including vacuum TID use cases. And possibly it can cover use cases by
> > combining a hash table or using tree of tree, for example.
>
> These two aspects would also bring it closer to Andres' prototype, which 1) 
> makes review easier and 2) easier to preserve optimization work already done, 
> so +1 from me.

Thanks.

I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
code, but we might find a better way to do that.
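In case it helps while reading the patch, here is a tiny usage sketch of the
kind of API it provides; note that the names radix_tree, rt_create, rt_set,
rt_search, and rt_free below are illustrative assumptions and may not match
the patch exactly:

/* Illustrative only -- the rt_* names are assumptions, not the patch's API. */
radix_tree *tree = rt_create(CurrentMemoryContext);
uint64      val;

/* 64-bit key -> 64-bit value */
rt_set(tree, UINT64CONST(42), UINT64CONST(4242));

if (rt_search(tree, UINT64CONST(42), &val))
    elog(LOG, "found value " UINT64_FORMAT, val);

rt_free(tree);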

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


radixtree_v5.patch
Description: Binary data


Re: Improve description of XLOG_RUNNING_XACTS

2022-07-21 Thread Masahiko Sawada
On Thu, Jul 21, 2022 at 4:29 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 21 Jul 2022 11:21:09 +0900, Fujii Masao  
> wrote in
> >
> >
> > On 2022/07/21 10:13, Masahiko Sawada wrote:
> > > Hi,
> > > I realized that standby_desc_running_xacts() in standbydesc.c doesn't
> > > describe subtransaction XIDs. I've attached the patch to improve the
> > > description.
> >
> > +1
> >
> > The patch looks good to me.
>
> The subxids can reach TOTAL_MAX_CACHED_SUBXIDS =
> PGPROC_MAX_CACHED_SUBXIDS(=64) * PROCARRAY_MAXPROCS. xact_desc_commit
> also shows subtransactions but they are at maximum 64.  I feel like
> -0.3 if there's no obvious advantage showing them.

xxx_desc() functions are debugging-purpose functions, as they are used
by pg_waldump, pg_walinspect, etc. I think such functions should
show all contents unless there is a reason to hide them. In particular,
standby_desc_running_xacts() currently shows subtransaction
information only when subtransactions have overflowed, which got me
confused when inspecting WAL records.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Improve description of XLOG_RUNNING_XACTS

2022-07-20 Thread Masahiko Sawada
Hi,

I realized that standby_desc_running_xacts() in standbydesc.c doesn't
describe subtransaction XIDs. I've attached the patch to improve the
description. Here is an example from pg_waldump:

* HEAD
rmgr: Standby len (rec/tot): 58/58, tx:  0, lsn:
0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048

* w/ patch
rmgr: Standby len (rec/tot): 58/58, tx:  0, lsn:
0/01D0C608, prev 0/01D0C5D8, desc: RUNNING_XACTS nextXid 1050
latestCompletedXid 1047 oldestRunningXid 1048; 1 xacts: 1048; 1
subxacts: 1049
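The change itself is small: roughly the following addition to
standby_desc_running_xacts() (a sketch; see the attached patch for the exact
version):

    if (xlrec->subxcnt > 0)
    {
        int     i;

        appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
        for (i = 0; i < xlrec->subxcnt; i++)
            appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
    }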

Please review it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


0001-Improve-description-of-XLOG_RUNNING_XACTS.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-20 Thread Masahiko Sawada
On Wed, Jul 20, 2022 at 2:19 PM Amit Kapila  wrote:
>
> On Wed, Jul 20, 2022 at 9:01 AM Masahiko Sawada  wrote:
> >
> > On Wed, Jul 20, 2022 at 12:11 PM Amit Kapila  
> > wrote:
> > >
> > > On Tue, Jul 19, 2022 at 7:28 PM Masahiko Sawada  
> > > wrote:
> > >
> > > Why do you think we can't remove
> > > ReorderBufferInitialXactsSetCatalogChanges() from the back branch
> > > patch? I think we don't need to change the existing function
> > > ReorderBufferXidHasCatalogChanges() but instead can have a wrapper
> > > like SnapBuildXidHasCatalogChanges() similar to master branch patch.
> >
> > IIUC we need to change SnapBuildCommitTxn() but it's exposed.
> >
> > Currently, we call like DecodeCommit() -> SnapBuildCommitTxn() ->
> > ReorderBufferXidHasCatalogChanges(). If we have a wrapper function, we
> > call like DecodeCommit() -> SnapBuildCommitTxn() ->
> > SnapBuildXidHasCatalogChanges() ->
> > ReorderBufferXidHasCatalogChanges(). In
> > SnapBuildXidHasCatalogChanges(), we need to check if the transaction
> > has XACT_XINFO_HAS_INVALS, which means DecodeCommit() needs to pass
> > either parsed->xinfo or (parsed->xinfo & XACT_XINFO_HAS_INVALS != 0)
> > down to SnapBuildXidHasCatalogChanges(). However, since
> > SnapBuildCommitTxn(), between DecodeCommit() and
> > SnapBuildXidHasCatalogChanges(), is exposed we cannot change it.
> >
>
> Agreed.
>
> > Another idea would be to have functions, say
> > SnapBuildCommitTxnWithXInfo() and SnapBuildCommitTxn_ext(). The latter
> > does actual work of handling transaction commits and both
> > SnapBuildCommitTxn() and SnapBuildCommit() call
> > SnapBuildCommitTxnWithXInfo() with different arguments.
> >
>
> Do you want to say DecodeCommit() instead of SnapBuildCommit() in
> above para?

I meant that the call chain will be DecodeCommit() ->
SnapBuildCommitTxnWithXInfo() -> SnapBuildCommitTxn_ext(has_invals =
true) -> SnapBuildXidHasCatalogChanges(has_invals = true) -> ... If
SnapBuildCommitTxn() gets called, it calls SnapBuildCommitTxn_ext()
with has_invals = false and behaves the same as before.
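
A rough sketch of that arrangement, only to illustrate the ABI point
(the signatures are abbreviated from the back branches and the bodies
are elided; this is not the actual patch):

/* static worker that does the actual commit handling */
static void
SnapBuildCommitTxn_ext(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
					   int nsubxacts, TransactionId *subxacts,
					   bool has_invals)
{
	/* ... existing commit handling, now aware of has_invals ... */
}

/* exposed function keeps its old signature, so the ABI is unchanged */
void
SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
				   int nsubxacts, TransactionId *subxacts)
{
	SnapBuildCommitTxn_ext(builder, lsn, xid, nsubxacts, subxacts, false);
}

/* new entry point for DecodeCommit(), which knows parsed->xinfo */
void
SnapBuildCommitTxnWithXInfo(SnapBuild *builder, XLogRecPtr lsn,
							TransactionId xid, int nsubxacts,
							TransactionId *subxacts, uint32 xinfo)
{
	SnapBuildCommitTxn_ext(builder, lsn, xid, nsubxacts, subxacts,
						   (xinfo & XACT_XINFO_HAS_INVALS) != 0);
}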

> Yet another idea could be to have another flag
> RBTXN_HAS_INVALS which will be set by DecodeCommit for top-level TXN.
> Then, we can retrieve it even for each of the subtxn's if and when
> required.

Do you mean that when checking if the subtransaction has catalog
changes, we check if its top-level XID has this new flag? Why do we
need the new flag?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-20 Thread Masahiko Sawada
On Wed, Jul 20, 2022 at 4:16 PM Kyotaro Horiguchi
 wrote:
>
> At Wed, 20 Jul 2022 10:58:16 +0900, Masahiko Sawada  
> wrote in
> > On Wed, Jul 20, 2022 at 9:58 AM Kyotaro Horiguchi
> >  wrote:
> > > Mmm. the patch changed that behavior. AllocateSnapshotBuilder always
> > > allocate the array with a fixed size. SnapBuildAddCommittedTxn still
> >  > assumes builder->committed.xip is non-NULL.  SnapBuildRestore *kept*
> > > ondisk.builder.commited.xip populated with a valid array pointer. But
> > > the patch allows committed.xip be NULL, thus in that case,
> > > SnapBuildAddCommitedTxn calls repalloc(NULL) which triggers assertion
> > > failure.
> >
> > IIUC the patch doesn't allow committed.xip to be NULL since we don't
> > overwrite it if builder->committed.xcnt is 0 (i.e.,
> > ondisk.builder.committed.xip is NULL):
>
> I meant that ondisk.builder.committed.xip can be NULL.. But looking
> again that cannot be.  I don't understand what I was looking at at
> that time.
>
> So, sorry for the noise.

No problem. Thank you for your review and comments!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Transparent column encryption

2022-07-20 Thread Masahiko Sawada
4210) at
postmaster.c:4218:3
frame #14: 0x00010650bd57 postgres`ServerLoop at postmaster.c:1808:7
frame #15: 0x0001065094cf postgres`PostmasterMain(argc=5,
argv=0x7faeb2406320) at postmaster.c:1480:11
frame #16: 0x0001063b4dcf postgres`main(argc=5,
argv=0x7faeb2406320) at main.c:197:3
frame #17: 0x7fff721abcc9 libdyld.dylib`start + 1

(2) SEGV by psql

postgres(1:47762)=# create table tbl (t text encrypted with
(column_encryption_key = cek1));
CREATE TABLE
postgres(1:47762)=# insert into tbl values ('test');
INSERT 0 1
postgres(1:47762)=# select * from tbl;
Segmentation fault: 11 (core dumped)

The backtrace is:

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x7fff723a1b36
libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 566
frame #1: 0x00010c509a5f
libpq.5.dylib`get_message_auth_tag(md=0x00010c7f28b8, mac_key="
\x1c,\x98g½ȩ[\x88\x16\x12Kiꔂ\v8g_\x80, mac_key_len=16, encr="test",
encrlen=-12, cekalg=130, md_value="", md_len_p=0x7ffee380a720,
errmsgp=0x7ffee380a8c0) at fe-encrypt-openssl.c:316:2
frame #2: 0x00010c509442
libpq.5.dylib`decrypt_value(res=0x7fa098504770,
cek=0x7fa0985045d0, cekalg=130, input="test", inputlen=4,
errmsgp=0x7ffee380a8c0) at fe-encrypt-openssl.c:429:7
frame #3: 0x00010c4f3c0c
libpq.5.dylib`pqRowProcessor(conn=0x7fa098808e00,
errmsgp=0x7ffee380a8c0) at fe-exec.c:1670:21
frame #4: 0x00010c5013bb
libpq.5.dylib`getAnotherTuple(conn=0x7fa098808e00, msgLength=16)
at fe-protocol3.c:882:6
frame #5: 0x00010c4ffbf2
libpq.5.dylib`pqParseInput3(conn=0x7fa098808e00) at
fe-protocol3.c:410:11
frame #6: 0x00010c4f5b65
libpq.5.dylib`parseInput(conn=0x7fa098808e00) at fe-exec.c:2598:2
frame #7: 0x00010c4f5c59
libpq.5.dylib`PQgetResult(conn=0x7fa098808e00) at fe-exec.c:2684:3
frame #8: 0x00010c401a77
psql`ExecQueryAndProcessResults(query="select * from tbl;",
elapsed_msec=0x7ffee380ab08, svpt_gone_p=0x7ffee380aafe,
is_watch=false, opt=0x,
printQueryFout=0x) at common.c:1514:11
frame #9: 0x00010c402469 psql`SendQuery(query="select * from
tbl;") at common.c:1171:9
frame #10: 0x00010c41a4dd
psql`MainLoop(source=0x7fff98ce8d90) at mainloop.c:439:16
frame #11: 0x00010c426b44 psql`main(argc=3,
argv=0x7ffee380adc8) at startup.c:462:19
frame #12: 0x7fff721abcc9 libdyld.dylib`start + 1
frame #13: 0x7fff721abcc9 libdyld.dylib`start + 1
(lldb) f 1
frame #1: 0x00010c509a5f
libpq.5.dylib`get_message_auth_tag(md=0x00010c7f28b8, mac_key="
\x1c,\x98g½ȩ[\x88\x16\x12Kiꔂ\v8g_\x80, mac_key_len=16, encr="test",
encrlen=-12, cekalg=130, md_value="", md_len_p=0x7ffee380a720,
errmsgp=0x7ffee380a8c0) at fe-encrypt-openssl.c:316:2
   313  #else
   314  memcpy(buf, test_A, sizeof(test_A));
   315  #endif
-> 316  memcpy(buf + PG_AD_LEN, encr, encrlen);
   317  *(int64 *) (buf + PG_AD_LEN + encrlen) =
pg_hton64(PG_AD_LEN * 8);
   318
   319  if (!EVP_DigestSignInit(evp_md_ctx, NULL, md, NULL, pkey))
(lldb) p encrlen
(int) $0 = -12
(lldb)

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-19 Thread Masahiko Sawada
On Wed, Jul 20, 2022 at 12:11 PM Amit Kapila  wrote:
>
> On Tue, Jul 19, 2022 at 7:28 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jul 19, 2022 at 9:25 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jul 19, 2022 at 1:10 PM Masahiko Sawada  
> > > wrote:
> > > >
> > >
> > > > BTW on backbranches, I think that the reason why we add
> > > > initial_running_xacts stuff to ReorderBuffer is that we cannot modify
> > > > SnapBuild that could be serialized. Can we add a (private) array for
> > > > the initial running xacts in snapbuild.c instead of adding new
> > > > variables to ReorderBuffer?
> > > >
> > >
> > > While thinking about this, I wonder if the current patch for back
> > > branches can lead to an ABI break as it changes the exposed structure?
> > > If so, it may be another reason to change it to some other way
> > > probably as you are suggesting.
> >
> > Yeah, it changes the size of ReorderBuffer, which is not good.
> >
>
> So, are you planning to give a try with your idea of making a private
> array for the initial running xacts?

Yes.

>  I am not sure but I guess you are
> proposing to add it in SnapBuild structure, if so, that seems safe as
> that structure is not exposed.

We cannot add it in SnapBuild as it gets serialized to the disk.

>
> > Changing the function names and arguments would also break ABI. So
> > probably we cannot do the above idea of removing
> > ReorderBufferInitialXactsSetCatalogChanges() as well.
> >
>
> Why do you think we can't remove
> ReorderBufferInitialXactsSetCatalogChanges() from the back branch
> patch? I think we don't need to change the existing function
> ReorderBufferXidHasCatalogChanges() but instead can have a wrapper
> like SnapBuildXidHasCatalogChanges() similar to master branch patch.

IIUC we need to change SnapBuildCommitTxn() but it's exposed.

Currently, we call like DecodeCommit() -> SnapBuildCommitTxn() ->
ReorderBufferXidHasCatalogChanges(). If we have a wrapper function, we
call like DecodeCommit() -> SnapBuildCommitTxn() ->
SnapBuildXidHasCatalogChanges() ->
ReorderBufferXidHasCatalogChanges(). In
SnapBuildXidHasCatalogChanges(), we need to check if the transaction
has XACT_XINFO_HAS_INVALS, which means DecodeCommit() needs to pass
either parsed->xinfo or (parsed->xinfo & XACT_XINFO_HAS_INVALS != 0)
down to SnapBuildXidHasCatalogChanges(). However, since
SnapBuildCommitTxn(), between DecodeCommit() and
SnapBuildXidHasCatalogChanges(), is exposed we cannot change it.

Another idea would be to have functions, say
SnapBuildCommitTxnWithXInfo() and SnapBuildCommitTxn_ext(). The latter
does actual work of handling transaction commits and both
SnapBuildCommitTxn() and SnapBuildCommit() call
SnapBuildCommitTxnWithXInfo() with different arguments.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-19 Thread Masahiko Sawada
On Wed, Jul 20, 2022 at 9:58 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 19 Jul 2022 17:31:07 +0900, Masahiko Sawada  
> wrote in
> > On Tue, Jul 19, 2022 at 4:35 PM Kyotaro Horiguchi
> >  wrote:
> > > At Tue, 19 Jul 2022 10:17:15 +0530, Amit Kapila  
> > > wrote in
> > > > Good work. I wonder without comments this may create a problem in the
> > > > future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
> > > > freeing the memory any less robust. Also, for consistency, we can use
> > > > a similar check based on xcnt in the SnapBuildRestore to free the
> > > > memory in the below code:
> > > > + /* set catalog modifying transactions */
> > > > + if (builder->catchange.xip)
> > > > + pfree(builder->catchange.xip);
> > >
> > > But xip must be positive there.  We can add a comment explains that.
> > >
> >
> > Yes, if we add the comment for it, probably we need to explain a gcc's
> > optimization but it seems to be too much to me.
>
> Ah, sorry. I confused with other place in SnapBuildPurgeCommitedTxn.
> I agree to you, that we don't need additional comment *there*.
>
> > > +   catchange_xip = 
> > > ReorderBufferGetCatalogChangesXacts(builder->reorder);
> > >
> > > catchange_xip is allocated in the current context, but ondisk is
> > > allocated in builder->context.  I see it kind of inconsistent (even if
> > > the current context is same with build->context).
> >
> > Right. I thought that since the lifetime of catchange_xip is short,
> > until the end of SnapBuildSerialize() function we didn't need to
> > allocate it in builder->context. But given ondisk, we need to do that
> > for catchange_xip as well. Will fix it.
>
> Thanks.
>
> > > +   if (builder->committed.xcnt > 0)
> > > +   {
> > >
> > > It seems to me comitted.xip is always non-null, so we don't need this.
> > > I don't strongly object to do that, though.
> >
> > But committed.xcnt could be 0, right? We don't need to copy anything
> > by calling memcpy with size = 0 in this case. Also, it looks more
> > consistent with what we do for catchange_xcnt.
>
> Mmm. the patch changed that behavior. AllocateSnapshotBuilder always
> allocate the array with a fixed size. SnapBuildAddCommittedTxn still
 > assumes builder->committed.xip is non-NULL.  SnapBuildRestore *kept*
> ondisk.builder.commited.xip populated with a valid array pointer. But
> the patch allows committed.xip be NULL, thus in that case,
> SnapBuildAddCommitedTxn calls repalloc(NULL) which triggers assertion
> failure.

IIUC the patch doesn't allow committed.xip to be NULL since we don't
overwrite it if builder->committed.xcnt is 0 (i.e.,
ondisk.builder.committed.xip is NULL):

builder->committed.xcnt = ondisk.builder.committed.xcnt;
/* We only allocated/stored xcnt, not xcnt_space xids ! */
/* don't overwrite preallocated xip, if we don't have anything here */
if (builder->committed.xcnt > 0)
{
pfree(builder->committed.xip);
builder->committed.xcnt_space = ondisk.builder.committed.xcnt;
builder->committed.xip = ondisk.builder.committed.xip;
}

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-19 Thread Masahiko Sawada
On Tue, Jul 19, 2022 at 9:25 PM Amit Kapila  wrote:
>
> On Tue, Jul 19, 2022 at 1:10 PM Masahiko Sawada  wrote:
> >
> > On Mon, Jul 18, 2022 at 8:49 PM Amit Kapila  wrote:
> > >
> > > On Sun, Jul 17, 2022 at 6:29 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Fri, Jul 15, 2022 at 3:32 PM shiy.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > >
> > > > I've attached patches for all supported branches including the master.
> > > >
> > >
> > > For back branch patches,
> > > * Wouldn't it be better to move purge logic into the function
> > > SnapBuildPurge* function for the sake of consistency?
> >
> > Agreed.
> >
> > > * Do we really need ReorderBufferInitialXactsSetCatalogChanges()?
> > > Can't we instead have a function similar to
> > > SnapBuildXidHasCatalogChanges() as we have for the master branch? That
> > > will avoid calling it when the snapshot
> > > state is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT
> >
> > Seems a good idea. We would need to pass the information about
> > (parsed->xinfo & XACT_XINFO_HAS_INVALS) to the function but probably
> > we can change ReorderBufferXidHasCatalogChanges() so that it checks
> > the RBTXN_HAS_CATALOG_CHANGES flag and then the initial running xacts
> > array.
> >
>
> Let's try to keep this as much similar to the master branch patch as possible.
>
> > BTW on backbranches, I think that the reason why we add
> > initial_running_xacts stuff to ReorderBuffer is that we cannot modify
> > SnapBuild that could be serialized. Can we add a (private) array for
> > the initial running xacts in snapbuild.c instead of adding new
> > variables to ReorderBuffer?
> >
>
> While thinking about this, I wonder if the current patch for back
> branches can lead to an ABI break as it changes the exposed structure?
> If so, it may be another reason to change it to some other way
> probably as you are suggesting.

Yeah, it changes the size of ReorderBuffer, which is not good.
Changing the function names and arguments would also break ABI. So
probably we cannot do the above idea of removing
ReorderBufferInitialXactsSetCatalogChanges() as well.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-19 Thread Masahiko Sawada
On Tue, Jul 19, 2022 at 4:35 PM Kyotaro Horiguchi
 wrote:

Thank you for the comments!

>
> At Tue, 19 Jul 2022 10:17:15 +0530, Amit Kapila  
> wrote in
> > Good work. I wonder without comments this may create a problem in the
> > future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
> > freeing the memory any less robust. Also, for consistency, we can use
> > a similar check based on xcnt in the SnapBuildRestore to free the
> > memory in the below code:
> > + /* set catalog modifying transactions */
> > + if (builder->catchange.xip)
> > + pfree(builder->catchange.xip);
>
> But xip must be positive there.  We can add a comment explains that.
>

Yes, if we add the comment for it, probably we need to explain a gcc's
optimization but it seems to be too much to me.

>
> +* Array of transactions and subtransactions that had modified 
> catalogs
> +* and were running when the snapshot was serialized.
> +*
> +* We normally rely on HEAP2_NEW_CID and XLOG_XACT_INVALIDATIONS 
> records to
> +* know if the transaction has changed the catalog. But it could 
> happen that
> +* the logical decoding decodes only the commit record of the 
> transaction.
> +* This array keeps track of the transactions that have modified 
> catalogs
>
> (Might be only me, but) "track" makes me think that xids are added and
> removed by activities. On the other hand the array just remembers
> catalog-modifying xids in the last life until the all xids in the list
> gone.
>
> +* and were running when serializing a snapshot, and this array is 
> used to
> +* add such transactions to the snapshot.
> +*
> +* This array is set once when restoring the snapshot, xids are 
> removed
>
> (So I want to add "only" between "are removed").
>
> +* from the array when decoding xl_running_xacts record, and then 
> eventually
> +* becomes empty.

Agreed. WIll fix.

>
>
> +   catchange_xip = ReorderBufferGetCatalogChangesXacts(builder->reorder);
>
> catchange_xip is allocated in the current context, but ondisk is
> allocated in builder->context.  I see it kind of inconsistent (even if
> the current context is same with build->context).

Right. I thought that since the lifetime of catchange_xip is short,
until the end of SnapBuildSerialize() function we didn't need to
allocate it in builder->context. But given ondisk, we need to do that
for catchange_xip as well. Will fix it.

>
>
> +   if (builder->committed.xcnt > 0)
> +   {
>
> It seems to me comitted.xip is always non-null, so we don't need this.
> I don't strongly object to do that, though.

But committed.xcnt could be 0, right? We don't need to copy anything
by calling memcpy with size = 0 in this case. Also, it looks more
consistent with what we do for catchange_xcnt.

>
> -* Remove TXN from its containing list.
> +* Remove TXN from its containing lists.
>
> The comment body only describes abut txn->nodes. I think we need to
> add that for catchange_node.

Will add.

>
>
> +   Assert((xcnt > 0) && (xcnt == rb->catchange_ntxns));
>
> (xcnt > 0) is obvious here (otherwise means dlist_foreach is broken..).
> (xcnt == rb->catchange_ntxns) is not what should be checked here. The
> assert just requires that catchange_txns and catchange_ntxns are
> consistent so it should be checked just after dlist_empty.. I think.
>

If we want to check if catchange_txns and catchange_ntxns are
consistent, should we check (xcnt == rb->catchange_ntxns) as well, no?
This function requires the caller to use rb->catchange_ntxns as the
length of the returned array. I think this assertion ensures that the
actual length of the array is consistent with the length we
pre-calculated.


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-19 Thread Masahiko Sawada
On Tue, Jul 19, 2022 at 4:28 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Sunday, July 17, 2022 9:59 PM Masahiko Sawada  
> wrote:
> > I've attached patches for all supported branches including the master.
> Hi,
>
>
> Minor comments for REL14.
>
> (1) There are some foreign characters in the patches (in the commit message)
>
> When I had a look at your patch for back branches with some editor,
> I could see some unfamiliar full-width characters like below two cases,
> mainly around "single quotes" in the sentences.
>
> Could you please check the entire patches,
> probably by some tool that helps you to detect this kind of characters ?
>
> * the 2nd paragraph of the commit message
>
> ...mark the transaction as containing catalog changes if it窶冱 in the list of 
> the
> initial running transactions ...
>
> * the 3rd paragraph of the same
>
> It doesn窶冲 have the information on which (sub) transaction has catalog 
> changes
>
> FYI, this comment applies to other patches for REL13, REL12, REL11, REL10.
>
>
> (2) typo in the commit message
>
> FROM:
> To fix this problem, this change the reorder buffer so that...
> TO:
> To fix this problem, this changes the reorder buffer so that...
>
>
> (3) typo in ReorderBufferProcessInitialXacts
>
> +   /*
> +* Remove transactions that would have been processed and we don't 
> need to
> +* keep track off anymore.
>
>
> Kindly change
> FROM:
> keep track off
> TO:
> keep track of

Thank you for the comments! I'll address these comments in the next
version patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-19 Thread Masahiko Sawada
On Mon, Jul 18, 2022 at 8:49 PM Amit Kapila  wrote:
>
> On Sun, Jul 17, 2022 at 6:29 PM Masahiko Sawada  wrote:
> >
> > On Fri, Jul 15, 2022 at 3:32 PM shiy.f...@fujitsu.com
> >  wrote:
> > >
> >
> > I've attached patches for all supported branches including the master.
> >
>
> For back branch patches,
> * Wouldn't it be better to move purge logic into the function
> SnapBuildPurge* function for the sake of consistency?

Agreed.

> * Do we really need ReorderBufferInitialXactsSetCatalogChanges()?
> Can't we instead have a function similar to
> SnapBuildXidHasCatalogChanges() as we have for the master branch? That
> will avoid calling it when the snapshot
> state is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT

Seems a good idea. We would need to pass the information about
(parsed->xinfo & XACT_XINFO_HAS_INVALS) to the function but probably
we can change ReorderBufferXidHasCatalogChanges() so that it checks
the RBTXN_HAS_CATALOG_CHANGES flag and then the initial running xacts
array.

BTW on backbranches, I think that the reason why we add
initial_running_xacts stuff to ReorderBuffer is that we cannot modify
SnapBuild that could be serialized. Can we add a (private) array for
the initial running xacts in snapbuild.c instead of adding new
variables to ReorderBuffer? That way, the code would become more
consistent with the changes on the master branch.
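
To illustrate what I mean by a private array in snapbuild.c (the names
below are only illustrative):

/*
 * XIDs that were running when the first XLOG_RUNNING_XACTS record was
 * decoded, kept file-local so that the exposed ReorderBuffer struct
 * stays unchanged.
 */
static TransactionId *InitialRunningXacts = NULL;
static int	NInitialRunningXacts = 0;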

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-19 Thread Masahiko Sawada
On Tue, Jul 19, 2022 at 1:47 PM Amit Kapila  wrote:
>
> On Tue, Jul 19, 2022 at 6:34 AM Masahiko Sawada  wrote:
> >
> > On Mon, Jul 18, 2022 at 1:12 PM Amit Kapila  wrote:
> > >
> > > On Fri, Jul 15, 2022 at 8:09 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > This patch should have the fix for the issue that Shi yu reported. Shi
> > > > yu, could you please test it again with this patch?
> > > >
> > >
> > > Can you explain the cause of the failure and your fix for the same?
> >
> > @@ -1694,6 +1788,8 @@ out:
> > /* be tidy */
> > if (ondisk)
> > pfree(ondisk);
> > +   if (catchange_xip)
> > +   pfree(catchange_xip);
> >
> > Regarding the above code in the previous version patch, looking at the
> > generated assembler code shared by Shi yu offlist, I realized that the
> > “if (catchange_xip)” is removed (folded) by gcc optimization. This is
> > because we dereference catchange_xip before null-pointer check as
> > follow:
> >
> > +   /* copy catalog modifying xacts */
> > +   sz = sizeof(TransactionId) * catchange_xcnt;
> > +   memcpy(ondisk_c, catchange_xip, sz);
> > +   COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
> > +   ondisk_c += sz;
> >
> > Since sz is 0 in this case, memcpy doesn’t do anything actually.
> >
> > By checking the assembler code, I’ve confirmed that gcc does the
> > optimization for these code and setting
> > -fno-delete-null-pointer-checks flag prevents the if statement from
> > being folded. Also, I’ve confirmed that adding the check if
> > "catchange.xcnt > 0” before the null-pointer check also can prevent
> > that. Adding a check  if "catchange.xcnt > 0” looks more robust. I’ve
> > added a similar check for builder->committed.xcnt as well for
> > consistency. builder->committed.xip could have no transactions.
> >
>
> Good work. I wonder without comments this may create a problem in the
> future. OTOH, I don't see adding a check "catchange.xcnt > 0" before
> freeing the memory any less robust. Also, for consistency, we can use
> a similar check based on xcnt in the SnapBuildRestore to free the
> memory in the below code:
> + /* set catalog modifying transactions */
> + if (builder->catchange.xip)
> + pfree(builder->catchange.xip);

I would hesitate to add comments about preventing that particular
optimization. I think we do null-pointer-check-then-pfree in many
places. It seems to me that checking the array length before the
memcpy is more natural than checking both the array length and the
array existence before the pfree.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-18 Thread Masahiko Sawada
On Thu, Jul 14, 2022 at 1:17 PM John Naylor
 wrote:
>
> On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada  wrote:
>
> > > > I think that at this stage it's better to define the design first. For
> > > > example, key size and value size, and these sizes are fixed or can be
> > > > set the arbitary size?
> > >
> > > I don't think we need to start over. Andres' prototype had certain
> > > design decisions built in for the intended use case (although maybe
> > > not clearly documented as such). Subsequent patches in this thread
> > > substantially changed many design aspects. If there were any changes
> > > that made things wonderful for vacuum, it wasn't explained, but Andres
> > > did explain how some of these changes were not good for other uses.
> > > Going to fixed 64-bit keys and values should still allow many future
> > > applications, so let's do that if there's no reason not to.
> >
> > I thought Andres pointed out that given that we store BufferTag (or
> > part of that) into the key, the fixed 64-bit keys might not be enough
> > for buffer mapping use cases. If we want to use wider keys more than
> > 64-bit, we would need to consider it.
>
> It sounds like you've answered your own question, then. If so, I'm
> curious what your current thinking is.
>
> If we *did* want to have maximum flexibility, then "single-value
> leaves" method would be the way to go, since it seems to be the
> easiest way to have variable-length both keys and values. I do have a
> concern that the extra pointer traversal would be a drag on
> performance, and also require lots of small memory allocations.

Agreed.

> I also have some concerns about also simultaneously trying to design
> for the use for buffer mappings. I certainly want to make this good
> for as many future uses as possible, and I'd really like to preserve
> any optimizations already fought for. However, to make concrete
> progress on the thread subject, I also don't think it's the most
> productive use of time to get tied up about the fine details of
> something that will not likely happen for several years at the
> earliest.

I'd like to keep the first version simple. We can improve it and add
more optimizations later. Using a radix tree for vacuum TID storage
would still be a big win compared to using a flat array, even without
all these optimizations. As for the single-value leaves method, I'm
also concerned about the extra pointer traversal and extra memory
allocation. It's the most flexible, but the multi-value leaves method
is also flexible enough for many use cases. Using the single-value
method seems to be too much as a first step for me.

Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases,
including the vacuum TID use case. And possibly it can cover further
use cases by combining it with a hash table or using a tree of trees,
for example.
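
As a purely illustrative example of the vacuum TID case (not code from
any posted patch): a heap TID already fits easily into a 64-bit key,
for instance by packing the block number into the high bits and the
offset number into the low bits, leaving the 64-bit value slot free
for whatever the caller needs:

static inline uint64
tid_to_key(ItemPointer tid)
{
	/* BlockNumber is 32 bits and OffsetNumber is 16 bits, so 48 bits suffice */
	return ((uint64) ItemPointerGetBlockNumber(tid)) << 16 |
		ItemPointerGetOffsetNumber(tid);
}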

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-18 Thread Masahiko Sawada
On Mon, Jul 18, 2022 at 12:28 PM shiy.f...@fujitsu.com
 wrote:
>
> On Fri, Jul 15, 2022 10:39 PM Masahiko Sawada  wrote:
> >
> > This patch should have the fix for the issue that Shi yu reported. Shi
> > yu, could you please test it again with this patch?
> >
>
> Thanks for updating the patch!
> I have tested and confirmed that the problem I found has been fixed.

Thank you for testing!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-18 Thread Masahiko Sawada
On Mon, Jul 18, 2022 at 1:12 PM Amit Kapila  wrote:
>
> On Fri, Jul 15, 2022 at 8:09 PM Masahiko Sawada  wrote:
> >
> > This patch should have the fix for the issue that Shi yu reported. Shi
> > yu, could you please test it again with this patch?
> >
>
> Can you explain the cause of the failure and your fix for the same?

@@ -1694,6 +1788,8 @@ out:
/* be tidy */
if (ondisk)
pfree(ondisk);
+   if (catchange_xip)
+   pfree(catchange_xip);

Regarding the above code in the previous version of the patch, looking
at the generated assembler code shared by Shi yu off-list, I realized
that the “if (catchange_xip)” check is removed (folded) by gcc
optimization. This is because we dereference catchange_xip before the
null-pointer check, as follows:

+   /* copy catalog modifying xacts */
+   sz = sizeof(TransactionId) * catchange_xcnt;
+   memcpy(ondisk_c, catchange_xip, sz);
+   COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
+   ondisk_c += sz;

Since sz is 0 in this case, the memcpy doesn't actually copy anything.

By checking the assembler code, I've confirmed that gcc performs this
optimization on this code and that setting the
-fno-delete-null-pointer-checks flag prevents the if statement from
being folded. Also, I've confirmed that adding a check of
"catchange.xcnt > 0" before the null-pointer check can also prevent
that. Adding a "catchange.xcnt > 0" check looks more robust. I've
added a similar check for builder->committed.xcnt as well for
consistency, since builder->committed.xip could have no transactions.
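
For what it's worth, the effect can be reproduced outside PostgreSQL.
A minimal standalone example follows (this is not PostgreSQL code;
whether the check is actually folded depends on the gcc version and
flags, e.g. -O2 versus -O2 -fno-delete-null-pointer-checks):

#include <stdio.h>
#include <string.h>

static void
serialize(const char *src, size_t len)
{
	char		buf[8];

	/* src is passed to memcpy() before any null-pointer check ... */
	memcpy(buf, src, len);

	/*
	 * ... so the compiler may assume src != NULL from here on (memcpy's
	 * arguments are declared nonnull) and fold this check away, even
	 * though len == 0 made the copy a no-op at runtime.
	 */
	if (src)
		puts("cleanup runs: this is where pfree(NULL) would happen");
	else
		puts("cleanup skipped: src is NULL");
}

int
main(void)
{
	serialize(NULL, 0);
	return 0;
}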

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-17 Thread Masahiko Sawada
On Fri, Jul 15, 2022 at 3:32 PM shiy.f...@fujitsu.com
 wrote:
>
> On Mon, Jul 11, 2022 9:54 PM Masahiko Sawada  wrote:
> >
> > I've attached an updated patch, please review it.
> >
>
> Thanks for your patch. Here are some comments for the REL14-v1 patch.
>
> 1.
> +   Sizesz = sizeof(TransactionId) * nxacts;;
>
> There is a redundant semicolon at the end.
>
> 2.
> +   workspace = MemoryContextAlloc(rb->context, 
> rb->n_initial_running_xacts);
>
> Should it be:
> +   workspace = MemoryContextAlloc(rb->context, sizeof(TransactionId) * 
> rb->n_initial_running_xacts);
>
> 3.
> +   /* bound check if there is at least one transaction to be removed */
> +   if (NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
> +   
> running->oldestRunningXid))
> +   return;
> +
>
> Here, I think it should return if rb->initial_running_xacts[0] is older than
> oldestRunningXid, right? Should it be changed to:
>
> +   if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
> +   
> running->oldestRunningXid))
> +   return;
>
> 4.
> +   if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0)
>
> Maybe we can change it like the following, to be consistent with other places 
> in
> this file. It's also fine if you don't change it.
>
> +   if (parsed->xinfo & XACT_XINFO_HAS_INVALS)

Thank you for the comments!

I've attached patches for all supported branches including the master.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


REL13-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


REL12-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


REL14-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


REL11-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


REL10-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


master-v6-0001-Add-catalog-modifying-transactions-to-logical-dec.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-15 Thread Masahiko Sawada
On Fri, Jul 15, 2022 at 10:43 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, July 14, 2022 10:31 AM Masahiko Sawada  
> wrote:
> > I've attached an updated patch that incorporated comments from Amit and Shi.
> Hi,
>
>
> Minor comments for v4.

Thank you for the comments!

>
> (1) typo in the commit message
>
> "When decoding a COMMIT record, we check both the list and the ReorderBuffer 
> to see if
> if the transaction has modified catalogs."
>
> There are two 'if's in succession in the last sentence of the second 
> paragraph.
>
> (2) The header comment for the spec test
>
> +# Test that decoding only the commit record of the transaction that have
> +# catalog-changed.
>
> Rewording of this part looks required, because "test that ... " requires a 
> complete sentence
> after that, right ?
>
>
> (3) SnapBuildRestore
>
> snapshot_not_interesting:
> if (ondisk.builder.committed.xip != NULL)
> pfree(ondisk.builder.committed.xip);
> return false;
> }
>
> Do we need to add pfree for ondisk.builder.catchange.xip after the 
> 'snapshot_not_interesting' label ?
>
>
> (4) SnapBuildPurgeOlderTxn
>
> +   elog(DEBUG3, "purged catalog modifying transactions from %d 
> to %d",
> +(uint32) builder->catchange.xcnt, surviving_xids);
>
> To make this part more aligned with existing codes,
> probably we can have a look at another elog for debug in the same function.
>
> We should use %u for casted xcnt & surviving_xids,
> while adding a format for xmin if necessary ?

I agreed with all the above comments and incorporated them into the
updated patch.

This patch should have the fix for the issue that Shi yu reported. Shi
yu, could you please test it again with this patch?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v5-0001-Add-catalog-modifying-transactions-to-logical-dec.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-14 Thread Masahiko Sawada
On Thu, Jul 14, 2022 at 12:06 PM Masahiko Sawada  wrote:
>
> On Thu, Jul 14, 2022 at 11:16 AM shiy.f...@fujitsu.com
>  wrote:
> >
> > On Tue, Jul 12, 2022 5:23 PM Masahiko Sawada  wrote:
> > >
> > > On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > It happened when executing the following code because it tried to free a
> > > NULL
> > > > pointer (catchange_xip).
> > > >
> > > > /* be tidy */
> > > > if (ondisk)
> > > > pfree(ondisk);
> > > > +   if (catchange_xip)
> > > > +   pfree(catchange_xip);
> > > >  }
> > > >
> > > > It seems to be related to configure option. I could reproduce it when 
> > > > using
> > > > `./configure --enable-debug`.
> > > > But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -
> > > ggdb"`.
> > >
> > > Hmm, I could not reproduce this problem even if I use ./configure
> > > --enable-debug. And it's weird that we checked if catchange_xip is not
> > > null but we did pfree for it:
> > >
> > > #1  pfree (pointer=0x0) at mcxt.c:1177
> > > #2  0x0078186b in SnapBuildSerialize (builder=0x1fd5e78,
> > > lsn=25719712) at snapbuild.c:1792
> > >
> > > Is it reproducible in your environment?
> >
> > Thanks for your reply! Yes, it is reproducible. And I also reproduced it on 
> > the
> > v4 patch you posted [1].
>
> Thank you for testing!

I've found out the exact cause of this problem and how to fix it. I'll
submit an updated patch next week with my analysis.

Thank you for testing and providing additional information off-list, Shi yu.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-13 Thread Masahiko Sawada
On Thu, Jul 14, 2022 at 11:16 AM shiy.f...@fujitsu.com
 wrote:
>
> On Tue, Jul 12, 2022 5:23 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > > It happened when executing the following code because it tried to free a
> > NULL
> > > pointer (catchange_xip).
> > >
> > > /* be tidy */
> > > if (ondisk)
> > > pfree(ondisk);
> > > +   if (catchange_xip)
> > > +   pfree(catchange_xip);
> > >  }
> > >
> > > It seems to be related to configure option. I could reproduce it when 
> > > using
> > > `./configure --enable-debug`.
> > > But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -
> > ggdb"`.
> >
> > Hmm, I could not reproduce this problem even if I use ./configure
> > --enable-debug. And it's weird that we checked if catchange_xip is not
> > null but we did pfree for it:
> >
> > #1  pfree (pointer=0x0) at mcxt.c:1177
> > #2  0x0078186b in SnapBuildSerialize (builder=0x1fd5e78,
> > lsn=25719712) at snapbuild.c:1792
> >
> > Is it reproducible in your environment?
>
> Thanks for your reply! Yes, it is reproducible. And I also reproduced it on 
> the
> v4 patch you posted [1].

Thank you for testing!

>
> [1] 
> https://www.postgresql.org/message-id/CAD21AoAyNPrOFg%2BQGh%2B%3D4205TU0%3DyrE%2BQyMgzStkH85uBZXptQ%40mail.gmail.com
>
> > If so, could you test it again
> > with the following changes?
> >
> > diff --git a/src/backend/replication/logical/snapbuild.c
> > b/src/backend/replication/logical/snapbuild.c
> > index d015c06ced..a6e76e3781 100644
> > --- a/src/backend/replication/logical/snapbuild.c
> > +++ b/src/backend/replication/logical/snapbuild.c
> > @@ -1788,7 +1788,7 @@ out:
> > /* be tidy */
> > if (ondisk)
> > pfree(ondisk);
> > -   if (catchange_xip)
> > +   if (catchange_xip != NULL)
> > pfree(catchange_xip);
> >  }
> >
>
> I tried this and could still reproduce the problem.

Does the backtrace still show we attempt to pfree a null-pointer?

>
> Besides, I tried the suggestion from Amit [2],  it could be fixed by checking
> the value of catchange_xcnt instead of catchange_xip before pfree.

Could you check whether this problem occurred when we reached there
via the goto path, i.e., did we call
ReorderBufferGetCatalogChangesXacts() or not?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-13 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 12:40 PM shiy.f...@fujitsu.com
 wrote:
>
> On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada  wrote:
> >
> > I've attached an updated patch.
> >
> > While trying this idea, I noticed there is no API to get the length of
> > dlist, as we discussed offlist. Alternative idea was to use List
> > (T_XidList) but I'm not sure it's a great idea since deleting an xid
> > from the list is O(N), we need to implement list_delete_xid, and we
> > need to make sure allocating list node in the reorder buffer context.
> > So in the patch, I added a variable, catchange_ntxns, to keep track of
> > the length of the dlist. Please review it.
> >
>
> Thanks for your patch. Here are some comments on the master patch.

Thank you for the comments.

>
> 1.
> In catalog_change_snapshot.spec, should we use "RUNNING_XACTS record" instead 
> of
> "RUNNING_XACT record" / "XACT_RUNNING record" in the comment?
>
> 2.
> +* Since catchange.xip is sorted, we find the lower bound of
> +* xids that sill are interesting.
>
> Typo?
> "sill" -> "still"
>
> 3.
> +* This array is set once when restoring the snapshot, xids are 
> removed
> +* from the array when decoding xl_running_xacts record, and then 
> eventually
> +* becomes an empty.
>
> +   /* catchange list becomes an empty */
> +   pfree(builder->catchange.xip);
> +   builder->catchange.xip = NULL;
>
> Should "becomes an empty" be modified to "becomes empty"?
>
> 4.
> + * changes that are smaller than ->xmin. Those won't ever get checked via
> + * the ->committed array and ->catchange, respectively. The committed xids 
> will
>
> Should we change
> "the ->committed array and ->catchange"
> to
> "the ->committed or ->catchange array"
> ?

Agreed with all the above comments. These are incorporated in the
latest v4 patch I just sent[1].

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoAyNPrOFg%2BQGh%2B%3D4205TU0%3DyrE%2BQyMgzStkH85uBZXptQ%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-13 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 5:52 PM Amit Kapila  wrote:
>
> On Tue, Jul 12, 2022 at 1:13 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > > >
> > > > > I'm doing benchmark tests and will share the results.
> > > > >
> > > >
> > > > I've done benchmark tests to measure the overhead introduced by doing
> > > > bsearch() every time when decoding a commit record. I've simulated a
> > > > very intensified situation where we decode 1M commit records while
> > > > keeping builder->catchange.xip array but the overhead is negilible:
> > > >
> > > > HEAD: 584 ms
> > > > Patched: 614 ms
> > > >
> > > > I've attached the benchmark script I used. With increasing
> > > > LOG_SNAPSHOT_INTERVAL_MS to 9, the last decoding by
> > > > pg_logicla_slot_get_changes() decodes 1M commit records while keeping
> > > > catalog modifying transactions.
> > > >
> > >
> > > Thanks for the test. We should also see how it performs when (a) we
> > > don't change LOG_SNAPSHOT_INTERVAL_MS,
> >
> > What point do you want to see in this test? I think the performance
> > overhead depends on how many times we do bsearch() and how many
> > transactions are in the list.
> >
>
> Right, I am not expecting any visible performance difference in this
> case. This is to ensure that we are not incurring any overhead in the
> more usual scenarios (or default cases). As per my understanding, the
> purpose of increasing the value of LOG_SNAPSHOT_INTERVAL_MS is to
> simulate a stress case for the changes made by the patch, and keeping
> its value default will test the more usual scenarios.

Agreed.

I've done simple benchmark tests to decode 100k pgbench transactions:

HEAD: 10.34 s
Patched: 10.29 s

I've attached an updated patch that incorporated comments from Amit and Shi.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
From 28ca92c9d95cd05a26a7db6e54704f92b1846943 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 6 Jul 2022 12:53:36 +0900
Subject: [PATCH v4] Add catalog modifying transactions to logical decoding
 serialized snapshot.

Previously, we relied on HEAP2_NEW_CID records and XACT_INVALIDATION
records to know if the transaction has modified the catalog, and that
information is not serialized to snapshot. Therefore, if the logical
decoding decodes only the commit record of the transaction that
actually has modified a catalog, we missed adding its XID to the
snapshot. We ended up looking at catalogs with the wrong snapshot.

To fix this problem, this change adds the list of transaction IDs and
sub-transaction IDs, that have modified catalogs and are running when
snapshot serialization, to the serialized snapshot. When decoding a
COMMIT record, we check both the list and the ReorderBuffer to see if
if the transaction has modified catalogs.

Since this adds additional information to the serialized snapshot, we
cannot backpatch it. For back branches, we take another approach;
remember the last-running-xacts list of the first decoded
RUNNING_XACTS record and check if the transaction whose commit record
has XACT_XINFO_HAS_INVALS and whose XID is in the list. This doesn't
require any file format changes but the transaction will end up being
added to the snapshot even if it has only relcache invalidations.

This commit bumps SNAPBUILD_VERSION because of change in SnapBuild.
---
 contrib/test_decoding/Makefile|   2 +-
 .../expected/catalog_change_snapshot.out  |  44 
 .../specs/catalog_change_snapshot.spec|  39 +++
 .../replication/logical/reorderbuffer.c   |  69 -
 src/backend/replication/logical/snapbuild.c   | 235 --
 src/include/replication/reorderbuffer.h   |  12 +
 6 files changed, 317 insertions(+), 84 deletions(-)
 create mode 100644 contrib/test_decoding/expected/catalog_change_snapshot.out
 create mode 100644 contrib/test_decoding/specs/catalog_change_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b220906479..c7ce603706 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot slot_creation_e

Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 7:59 PM Amit Kapila  wrote:
>
> On Tue, Jul 12, 2022 at 2:53 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > >
> > > It happened when executing the following code because it tried to free a 
> > > NULL
> > > pointer (catchange_xip).
> > >
> > > /* be tidy */
> > > if (ondisk)
> > > pfree(ondisk);
> > > +   if (catchange_xip)
> > > +   pfree(catchange_xip);
> > >  }
> > >
> > > It seems to be related to configure option. I could reproduce it when 
> > > using
> > > `./configure --enable-debug`.
> > > But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og 
> > > -ggdb"`.
> >
> > Hmm, I could not reproduce this problem even if I use ./configure
> > --enable-debug. And it's weird that we checked if catchange_xip is not
> > null but we did pfree for it:
> >
>
> Yeah, this looks weird to me as well but one difference in running
> tests could be the timing of WAL LOG for XLOG_RUNNING_XACTS. That may
> change the timing of SnapBuildSerialize. The other thing we can try is
> by checking the value of catchange_xcnt before pfree.

Yeah, we can try that.

While reading the code, I realized that we try to pfree both ondisk
and catchange_xip even when we have jumped to 'out:':

out:
ReorderBufferSetRestartPoint(builder->reorder,
 builder->last_serialized_snapshot);
/* be tidy */
if (ondisk)
pfree(ondisk);
if (catchange_xip)
pfree(catchange_xip);

But we use both ondisk and catchange_xip only if we didn't jump to
'out:'. If this problem is related to compiler optimization around the
'goto' statement, moving them before 'out:' might be worth trying.

>
> BTW, I think ReorderBufferGetCatalogChangesXacts should have an Assert
> to ensure rb->catchange_ntxns and xcnt are equal. We can probably then
> avoid having xcnt_p as an out parameter as the caller can use
> rb->catchange_ntxns instead.
>

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com
 wrote:
>
> On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada  wrote:
> >
> > I've attached an updated patch.
> >
>
> Hi,
>
> I met a segmentation fault in test_decoding test after applying the patch for 
> master
> branch. Attach the backtrace.

Thank you for testing the patch!

>
> It happened when executing the following code because it tried to free a NULL
> pointer (catchange_xip).
>
> /* be tidy */
> if (ondisk)
> pfree(ondisk);
> +   if (catchange_xip)
> +   pfree(catchange_xip);
>  }
>
> It seems to be related to configure option. I could reproduce it when using
> `./configure --enable-debug`.
> But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked if catchange_xip is not
null but we did pfree for it:

#1  pfree (pointer=0x0) at mcxt.c:1177
#2  0x0078186b in SnapBuildSerialize (builder=0x1fd5e78,
lsn=25719712) at snapbuild.c:1792

Is it reproducible in your environment? If so, could you test it again
with the following changes?

diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index d015c06ced..a6e76e3781 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1788,7 +1788,7 @@ out:
/* be tidy */
if (ondisk)
pfree(ondisk);
-   if (catchange_xip)
+   if (catchange_xip != NULL)
pfree(catchange_xip);
 }

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila  wrote:
>
> On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > I'm doing benchmark tests and will share the results.
> > >
> >
> > I've done benchmark tests to measure the overhead introduced by doing
> > bsearch() every time when decoding a commit record. I've simulated a
> > very intensified situation where we decode 1M commit records while
> > keeping builder->catchange.xip array but the overhead is negilible:
> >
> > HEAD: 584 ms
> > Patched: 614 ms
> >
> > I've attached the benchmark script I used. With increasing
> > LOG_SNAPSHOT_INTERVAL_MS to 9, the last decoding by
> > pg_logicla_slot_get_changes() decodes 1M commit records while keeping
> > catalog modifying transactions.
> >
>
> Thanks for the test. We should also see how it performs when (a) we
> don't change LOG_SNAPSHOT_INTERVAL_MS,

What point do you want to see in this test? I think the performance
overhead depends on how many times we do bsearch() and how many
transactions are in the list. I increased this value to easily
simulate the situation where we decode many commit records while
keeping catalog modifying transactions. But even if we don't change
this value, the result would not change if we don't change how many
commit records we decode.

> and (b) we have more DDL xacts
> so that the array to search is somewhat bigger

I've done the same performance tests while creating 64 catalog
modifying transactions. The result is:

HEAD: 595 ms
Patched: 628 ms

There was no big overhead.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada  wrote:
>
> On Tue, Jul 12, 2022 at 9:48 AM Masahiko Sawada  wrote:
> >
> > On Fri, Jul 8, 2022 at 8:20 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > > 1.
> > > > > > In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
> > > > > > list length of 'catchange_txns' to allocate xids array? If we can do
> > > > > > so, then we will save the need to repalloc as well.
> > > > >
> > > > > Since ReorderBufferGetcatalogChangesXacts() collects all ongoing
> > > > > catalog modifying transactions, the length of the array could be
> > > > > bigger than the one taken last time. We can start with the previous
> > > > > length but I think we cannot remove the need for repalloc.
> > > > >
> > > >
> > > > It is using the list "catchange_txns" to form xid array which
> > > > shouldn't change for the duration of
> > > > ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
> > > > array after its use. Next time in
> > > > ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
> > > > array happens, so not sure why repalloc would be required?
> > >
> > > Oops, I mistook catchange_txns for catchange->xcnt. You're right.
> > > Starting with the length of catchange_txns should be sufficient.
> > >
> >
> > I've attached an updated patch.
> >
> > While trying this idea, I noticed there is no API to get the length of
> > dlist, as we discussed offlist. Alternative idea was to use List
> > (T_XidList) but I'm not sure it's a great idea since deleting an xid
> > from the list is O(N), we need to implement list_delete_xid, and we
> > need to make sure allocating list node in the reorder buffer context.
> > So in the patch, I added a variable, catchange_ntxns, to keep track of
> > the length of the dlist. Please review it.
> >
>
> I'm doing benchmark tests and will share the results.
>

I've done benchmark tests to measure the overhead introduced by doing
bsearch() every time we decode a commit record. I've simulated a very
intense situation where we decode 1M commit records while keeping the
builder->catchange.xip array, but the overhead is negligible:

HEAD: 584 ms
Patched: 614 ms

I've attached the benchmark script I used. By increasing
LOG_SNAPSHOT_INTERVAL_MS to 9, the last decoding by
pg_logical_slot_get_changes() decodes 1M commit records while keeping
catalog modifying transactions.
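
For context, the lookup whose cost this exercises on every commit
record is roughly the following (a simplified sketch, not the exact
patch code; it assumes the SnapBuild catchange fields added by the
patch and the existing xidComparator):

static bool
SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid)
{
	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
		return true;

	if (builder->catchange.xcnt == 0)
		return false;

	return bsearch(&xid, builder->catchange.xip, builder->catchange.xcnt,
				   sizeof(TransactionId), xidComparator) != NULL;
}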

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


bench.spec
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-11 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 9:48 AM Masahiko Sawada  wrote:
>
> On Fri, Jul 8, 2022 at 8:20 PM Masahiko Sawada  wrote:
> >
> > On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila  wrote:
> > >
> > > On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > >
> > > > > 1.
> > > > > In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
> > > > > list length of 'catchange_txns' to allocate xids array? If we can do
> > > > > so, then we will save the need to repalloc as well.
> > > >
> > > > Since ReorderBufferGetcatalogChangesXacts() collects all ongoing
> > > > catalog modifying transactions, the length of the array could be
> > > > bigger than the one taken last time. We can start with the previous
> > > > length but I think we cannot remove the need for repalloc.
> > > >
> > >
> > > It is using the list "catchange_txns" to form xid array which
> > > shouldn't change for the duration of
> > > ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
> > > array after its use. Next time in
> > > ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
> > > array happens, so not sure why repalloc would be required?
> >
> > Oops, I mistook catchange_txns for catchange->xcnt. You're right.
> > Starting with the length of catchange_txns should be sufficient.
> >
>
> I've attached an updated patch.
>
> While trying this idea, I noticed there is no API to get the length of
> dlist, as we discussed offlist. Alternative idea was to use List
> (T_XidList) but I'm not sure it's a great idea since deleting an xid
> from the list is O(N), we need to implement list_delete_xid, and we
> need to make sure allocating list node in the reorder buffer context.
> So in the patch, I added a variable, catchange_ntxns, to keep track of
> the length of the dlist. Please review it.
>

I'm doing benchmark tests and will share the results.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-11 Thread Masahiko Sawada
On Fri, Jul 8, 2022 at 3:43 PM John Naylor  wrote:
>
> On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada  wrote:
>
> > I guess that the tree height is affected by where garbages are, right?
> > For example, even if all garbage in the table is concentrated in
> > 0.5GB, if they exist between 2^17 and 2^18 block, we use the first
> > byte of blockhi. If the table is larger than 128GB, the second byte of
> > the blockhi could be used depending on where the garbage exists.
>
> Right.
>
> > Another variation of how to store TID would be that we use the block
> > number as a key and store a bitmap of the offset as a value. We can
> > use Bitmapset for example,
>
> I like the idea of using existing code to set/check a bitmap if it's
> convenient. But (in case that was implied here) I'd really like to
> stay away from variable-length values, which would require
> "Single-value leaves" (slow). I also think it's fine to treat the
> key/value as just bits, and not care where exactly they came from, as
> we've been talking about.
>
> > or an approach like Roaring bitmap.
>
> This would require two new data structures instead of one. That
> doesn't seem like a path to success.

Agreed.

>
> > I think that at this stage it's better to define the design first. For
> > example, key size and value size, and these sizes are fixed or can be
> > set the arbitary size?
>
> I don't think we need to start over. Andres' prototype had certain
> design decisions built in for the intended use case (although maybe
> not clearly documented as such). Subsequent patches in this thread
> substantially changed many design aspects. If there were any changes
> that made things wonderful for vacuum, it wasn't explained, but Andres
> did explain how some of these changes were not good for other uses.
> Going to fixed 64-bit keys and values should still allow many future
> applications, so let's do that if there's no reason not to.

I thought Andres pointed out that, given that we store a BufferTag (or
part of it) in the key, fixed 64-bit keys might not be enough for buffer
mapping use cases. If we want to support keys wider than 64 bits, we would
need to consider that.
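
For reference, the buffer-mapping key in question looks roughly like this
(paraphrased from buf_internals.h of that era), which is why it cannot fit
into a single 64-bit key:

    typedef struct buftag
    {
        RelFileNode rnode;      /* spcNode, dbNode, relNode: 3 OIDs, 12 bytes */
        ForkNumber  forkNum;    /* 4 bytes */
        BlockNumber blockNum;   /* 4 bytes */
    } BufferTag;                /* 20 bytes in total */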

>
> > For value size, if we support
> > different value sizes specified by the user, we can either embed
> > multiple values in the leaf node (called Multi-value leaves in ART
> > paper)
>
> I don't think "Multi-value leaves" allow for variable-length values,
> FWIW. And now I see I also used this term wrong in my earlier review
> comment -- v3/4 don't actually use "multi-value leaves", but Andres'
> does (going by the multiple leaf types). From the paper: "Multi-value
> leaves: The values are stored in one of four different leaf node
> types, which mirror the structure of inner nodes, but contain values
> instead of pointers."

Right, but sorry, I meant that the user specifies an arbitrary fixed-size
value length at creation time, like we do in dynahash.c.

>
> (It seems v3/v4 could be called a variation of "Combined pointer/value
> slots: If values fit into pointers, no separate node types are
> necessary. Instead, each pointer storage location in an inner node can
> either store a pointer or a value." But without the advantage of
> variable length keys).

Agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-11 Thread Masahiko Sawada
On Fri, Jul 8, 2022 at 8:20 PM Masahiko Sawada  wrote:
>
> On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila  wrote:
> >
> > On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > > 1.
> > > > In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
> > > > list length of 'catchange_txns' to allocate xids array? If we can do
> > > > so, then we will save the need to repalloc as well.
> > >
> > > Since ReorderBufferGetcatalogChangesXacts() collects all ongoing
> > > catalog modifying transactions, the length of the array could be
> > > bigger than the one taken last time. We can start with the previous
> > > length but I think we cannot remove the need for repalloc.
> > >
> >
> > It is using the list "catchange_txns" to form xid array which
> > shouldn't change for the duration of
> > ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
> > array after its use. Next time in
> > ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
> > array happens, so not sure why repalloc would be required?
>
> Oops, I mistook catchange_txns for catchange->xcnt. You're right.
> Starting with the length of catchange_txns should be sufficient.
>

I've attached an updated patch.

While trying this idea, I noticed there is no API to get the length of a
dlist, as we discussed off-list. An alternative idea was to use a List
(T_XidList), but I'm not sure it's a great idea since deleting an xid from
the list is O(N), we would need to implement list_delete_xid, and we would
need to make sure list nodes are allocated in the reorder buffer context.
So in the patch, I added a variable, catchange_ntxns, to keep track of the
length of the dlist. Please review it.
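
To illustrate the approach (names here are illustrative and may differ
slightly from the patch):

    /* Keep the list of catalog-modifying transactions and its length in sync. */
    static void
    remember_catchange_txn(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        dlist_push_tail(&rb->catchange_txns, &txn->catchange_node);
        rb->catchange_ntxns++;      /* dlist has no O(1) length API */
    }

    static void
    forget_catchange_txn(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        dlist_delete(&txn->catchange_node);
        rb->catchange_ntxns--;
    }

With catchange_ntxns available, ReorderBufferGetCatalogChangesXacts() can
size the xid array up front instead of guessing and repalloc'ing.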

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


master-v3-0001-Add-catalog-modifying-transactions-to-logical-dec.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-11 Thread Masahiko Sawada
On Wed, Jul 6, 2022 at 3:01 PM Amit Kapila  wrote:
>
> On Wed, Jul 6, 2022 at 7:38 AM Masahiko Sawada  wrote:
> >
> > I'll post a new version patch in the next email with replying to other 
> > comments.
> >
>
> Okay, thanks for working on this. Few comments/suggestions on
> poc_remember_last_running_xacts_v2 patch:
>
> 1.
> +ReorderBufferSetLastRunningXactsCatalogChanges(ReorderBuffer *rb,
> TransactionId xid,
> +uint32 xinfo, int subxcnt,
> +TransactionId *subxacts, XLogRecPtr lsn)
> +{
> ...
> ...
> +
> + test = bsearch(&xid, rb->last_running_xacts, rb->n_last_running_xacts,
> +sizeof(TransactionId), xidComparator);
> +
> + if (test == NULL)
> + {
> + for (int i = 0; i < subxcnt; i++)
> + {
> + test = bsearch(&subxacts[i], rb->last_running_xacts, 
> rb->n_last_running_xacts,
> +sizeof(TransactionId), xidComparator);
> ...
>
> Is there ever a possibility that the top transaction id is not in the
> running_xacts list but one of its subxids is present? If yes, it is
> not very obvious at least to me so adding a comment here could be
> useful. If not, then why do we need this additional check for each of
> the sub-transaction ids?

I think there is no possibility. The check for subtransactions is not necessary.

>
> 2.
> @@ -627,6 +647,15 @@ DecodeCommit(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
>   commit_time = parsed->origin_timestamp;
>   }
>
> + /*
> + * Set the last running xacts as containing catalog change if necessary.
> + * This must be done before SnapBuildCommitTxn() so that we include catalog
> + * change transactions to the historic snapshot.
> + */
> + ReorderBufferSetLastRunningXactsCatalogChanges(ctx->reorder, xid,
> parsed->xinfo,
> +parsed->nsubxacts, parsed->subxacts,
> +buf->origptr);
> +
>   SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
>  parsed->nsubxacts, parsed->subxacts);
>
> As mentioned previously as well, marking it before SnapBuildCommitTxn
> has one disadvantage, we sometimes do this work even if the snapshot
> state is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT in which case
> SnapBuildCommitTxn wouldn't do anything. Can we instead check whether
> the particular txn has invalidations and is present in the
> last_running_xacts list along with the check
> ReorderBufferXidHasCatalogChanges? I think that has the additional
> advantage that we don't need this additional marking if the xact is
> already marked as containing catalog changes.

Agreed.

>
> 3.
> 1.
> + /*
> + * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
> + * if the transaction has changed the catalog, and that information
> + * is not serialized to SnapBuilder.  Therefore, if the logical
> + * decoding decodes the commit record of the transaction that actually
> + * has done catalog changes without these records, we miss to add
> + * the xid to the snapshot so up creating the wrong snapshot.
>
> The part of the sentence "... snapshot so up creating the wrong
> snapshot." is not clear. In this comment, at one place you have used
> two spaces after a full stop, and at another place, there is one
> space. I think let's follow nearby code practice to use a single space
> before a new sentence.

Agreed.

>
> 4.
> +void
> +ReorderBufferProcessLastRunningXacts(ReorderBuffer *rb,
> xl_running_xacts *running)
> +{
> + /* Quick exit if there is no longer last running xacts */
> + if (likely(rb->n_last_running_xacts == 0))
> + return;
> +
> + /* First call, build the last running xact list */
> + if (rb->n_last_running_xacts == -1)
> + {
> + int nxacts = running->subxcnt + running->xcnt;
> + Size sz = sizeof(TransactionId) * nxacts;;
> +
> + rb->last_running_xacts = MemoryContextAlloc(rb->context, sz);
> + memcpy(rb->last_running_xacts, running->xids, sz);
> + qsort(rb->last_running_xacts, nxacts, sizeof(TransactionId), xidComparator);
> +
> + rb->n_last_running_xacts = nxacts;
> +
> + return;
> + }
>
> a. Can we add the function header comments for this function?

Updated.

> b. We seem to be tracking the running_xact information for the first
> running_xact record after start/restart. The name last_running_xacts
> doesn't sound appropriate for that, how about initial_running_xacts?

Sounds good, updated.

>
> 5.
> + /*
> + * Purge xids in the last running xacts list if we can do that for at least
> + * one xid.
> + */
> + if (NormalTransactionIdPrecedes(rb->last_running_xacts[0],
> + running->oldestRunningXid))
>
> I think it would be a good idea to add a few lines here explaining why
> it is safe to 

Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-08 Thread Masahiko Sawada
On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila  wrote:
>
> On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada  wrote:
> >
> > On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila  wrote:
> > >
> >
> > > 1.
> > > In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
> > > list length of 'catchange_txns' to allocate xids array? If we can do
> > > so, then we will save the need to repalloc as well.
> >
> > Since ReorderBufferGetcatalogChangesXacts() collects all ongoing
> > catalog modifying transactions, the length of the array could be
> > bigger than the one taken last time. We can start with the previous
> > length but I think we cannot remove the need for repalloc.
> >
>
> It is using the list "catchange_txns" to form xid array which
> shouldn't change for the duration of
> ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
> array after its use. Next time in
> ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
> array happens, so not sure why repalloc would be required?

Oops, I mistook catchange_txns for catchange->xcnt. You're right.
Starting with the length of catchange_txns should be sufficient.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-08 Thread Masahiko Sawada
On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila  wrote:
>
> On Fri, Jul 8, 2022 at 6:45 AM Masahiko Sawada  wrote:
> >
> > On Thu, Jul 7, 2022 at 3:40 PM Amit Kapila  wrote:
> > >
> > > On Thu, Jul 7, 2022 at 8:21 AM Masahiko Sawada  
> > > wrote:
> >
> > I've attached the new version patch that incorporates the comments and
> > the optimizations discussed above.
> >
>
> Thanks, few minor comments:

Thank you for the comments.

> 1.
> In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
> list length of 'catchange_txns' to allocate xids array? If we can do
> so, then we will save the need to repalloc as well.

Since ReorderBufferGetCatalogChangesXacts() collects all ongoing
catalog-modifying transactions, the length of the array could be bigger
than the one taken last time. We can start with the previous length, but I
think we cannot remove the need for repalloc.

> 2.
> /* ->committed manipulation */
> static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
>
> The above comment also needs to be changed.
>
> 3. As SnapBuildPurgeCommittedTxn() removes xacts both from committed
> and catchange arrays, the function name no more remains appropriate.
> We can either rename to something like SnapBuildPurgeOlderTxn() or
> move the catchange logic to a different function and call it from
> SnapBuildProcessRunningXacts.
>
> 4.
> + if (TransactionIdEquals(builder->catchange.xip[off],
> + builder->xmin) ||
> + NormalTransactionIdFollows(builder->catchange.xip[off],
> +builder->xmin))
>
> Can we use TransactionIdFollowsOrEquals() instead of above?
>
> 5. Comment change suggestion:
> /*
>   * Remove knowledge about transactions we treat as committed or
> containing catalog
>   * changes that are smaller than ->xmin. Those won't ever get checked via
> - * the ->committed array and ->catchange, respectively. The committed xids 
> will
> - * get checked via the clog machinery.
> + * the ->committed or ->catchange array, respectively. The committed xids 
> will
> + * get checked via the clog machinery. We can ideally remove the transaction
> + * from catchange array once it is finished (committed/aborted) but that 
> could
> + * be costly as we need to maintain the xids order in the array.
>   */
>

Agreed with the above comments.

> Apart from the above, I think there are pending comments for the
> back-branch patch and some performance testing of this work.

Right. I'll incorporate all comments I got so far into these patches
and submit them. Also, will do some benchmark tests.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-07 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 5:49 PM John Naylor  wrote:
>
> On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada  wrote:
>
> > > Looking at the node stats, and then your benchmark code, I think key
> > > construction is a major influence, maybe more than node type. The
> > > key/value scheme tested now makes sense:
> > >
> > > blockhi || blocklo || 9 bits of item offset
> > >
> > > (with the leaf nodes containing a bit map of the lowest few bits of
> > > this whole thing)
> > >
> > > We want the lower fanout nodes at the top of the tree and higher
> > > fanout ones at the bottom.
> >
> > So more inner nodes can fit in CPU cache, right?
>
> My thinking is, on average, there will be more dense space utilization
> in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
> cache, since with my idea a search might have to visit more nodes to
> get the common negative result (indexed tid not found in vacuum's
> list).
>
> > > Note some consequences: If the table has enough columns such that much
> > > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > > dense case the nodes above the leaves will have lower fanout (maybe
> > > they will fit in a node32). Also, the bitmap values in the leaves will
> > > be more empty. In other words, many tables in the wild *resemble* the
> > > sparse case a bit, even if truly all tuples on the page are dead.
> > >
> > > Note also that the dense case in the benchmark above has ~4500 times
> > > more keys than the sparse case, and uses about ~1000 times more
> > > memory. But the runtime is only 2-3 times longer. That's interesting
> > > to me.
> > >
> > > To optimize for the sparse case, it seems to me that the key/value would 
> > > be
> > >
> > > blockhi || 9 bits of item offset || blocklo
> > >
> > > I believe that would make the leaf nodes more dense, with fewer inner
> > > nodes, and could drastically speed up the sparse case, and maybe many
> > > realistic dense cases.
> >
> > Does it have an effect on the number of inner nodes?
> >
> > >  I'm curious to hear your thoughts.
> >
> > Thank you for your analysis. It's worth trying. We use 9 bits for item
> > offset but most pages don't use all bits in practice. So probably it
> > might be better to move the most significant bit of item offset to the
> > left of blockhi. Or more simply:
> >
> > 9 bits of item offset || blockhi || blocklo
>
> A concern here is most tids won't use many bits in blockhi either,
> most often far fewer, so this would make the tree higher, I think.
> Each value of blockhi represents 0.5GB of heap (32TB max). Even with
> very large tables I'm guessing most pages of interest to vacuum are
> concentrated in a few of these 0.5GB "segments".

Right.

I guess the tree height is affected by where the garbage is, right? For
example, even if all garbage in the table is concentrated within 0.5GB, if
it lies between blocks 2^17 and 2^18 we use the first byte of blockhi. If
the table is larger than 128GB, the second byte of blockhi could be used
depending on where the garbage is.

Another variation of how to store TIDs would be to use the block number as
the key and store a bitmap of the offsets as the value. We could use
Bitmapset, for example, or an approach like Roaring bitmaps.
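
A rough sketch of that variation, using the existing Bitmapset API (the two
radix-tree calls are placeholders, not an actual API):

    /* Record a dead tuple as block-number key -> Bitmapset-of-offsets value. */
    static void
    record_dead_tuple(radix_tree *tree, BlockNumber blkno, OffsetNumber offnum)
    {
        Bitmapset  *offsets = NULL;
        Datum       val;

        if (rt_search_placeholder(tree, (uint64) blkno, &val))
            offsets = (Bitmapset *) DatumGetPointer(val);

        offsets = bms_add_member(offsets, (int) offnum);
        rt_set_placeholder(tree, (uint64) blkno, PointerGetDatum(offsets));
    }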

I think that at this stage it's better to define the design first. For
example, what are the key size and value size, and are these sizes fixed or
can they be set to an arbitrary size? Given the buffer mapping use case, we
would need a wider key to store RelFileNode, ForkNumber, and BlockNumber.
On the other hand, limiting the key to a 64-bit integer keeps the logic
simple, and it could possibly still be used in buffer mapping cases by
using a tree of trees. For the value size, if we support different value
sizes specified by the user, we can either embed multiple values in the
leaf node (called Multi-value leaves in the ART paper) or introduce a leaf
node that stores one value (called Single-value leaves).

> And it's possible path compression would change the tradeoffs here.

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-07 Thread Masahiko Sawada
On Thu, Jul 7, 2022 at 3:40 PM Amit Kapila  wrote:
>
> On Thu, Jul 7, 2022 at 8:21 AM Masahiko Sawada  wrote:
> >
> > On Wed, Jul 6, 2022 at 5:55 PM Amit Kapila  wrote:
> > >
> > > On Wed, Jul 6, 2022 at 12:19 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > 2. Are we anytime removing transaction ids from catchanges->xip array?
> > > >
> > > > No.
> > > >
> > > > > If not, is there a reason for the same? I think we can remove it
> > > > > either at commit/abort or even immediately after adding the xid/subxid
> > > > > to committed->xip array.
> > > >
> > > > It might be a good idea but I'm concerned that removing XID from the
> > > > array at every commit/abort or after adding it to committed->xip array
> > > > might be costly as it requires adjustment of the array to keep its
> > > > order. Removing XIDs from the array would make bsearch faster but the
> > > > array is updated reasonably often (every 15 sec).
> > > >
> > >
> > > Fair point. However, I am slightly worried that we are unnecessarily
> > > searching in this new array even when ReorderBufferTxn has the
> > > required information. To avoid that, in function
> > > SnapBuildXidHasCatalogChange(), we can first check
> > > ReorderBufferXidHasCatalogChanges() and then check the array if the
> > > first check doesn't return true. Also, by the way, do we need to
> > > always keep builder->catchanges.xip updated via SnapBuildRestore()?
> > > Isn't it sufficient that we just read and throw away contents from a
> > > snapshot if builder->catchanges.xip is non-NULL?
> >
> > IIUC catchanges.xip is restored only once when restoring a consistent
> > snapshot via SnapBuildRestore(). I think it's necessary to set
> > catchanges.xip for later use in SnapBuildXidHasCatalogChange(). Or did
> > you mean via SnapBuildSerialize()?
> >
>
> Sorry, I got confused about the way restore is used. You are right, it
> will be done once. My main worry is that we shouldn't look at
> builder->catchanges.xip array on an ongoing basis which I think can be
> dealt with by one of the ideas you mentioned below. But, I think we
> can still follow the other suggestion related to moving
> ReorderBufferXidHasCatalogChanges() check prior to checking array.

Agreed. I've incorporated this change in the new version patch.
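
So the check becomes something like the following sketch (not the actual
patch hunk; field names follow the discussion above):

    static bool
    xid_has_catalog_changes(SnapBuild *builder, TransactionId xid)
    {
        /* Cheap in-memory check first */
        if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
            return true;

        /* Fall back to the catalog-modifying XIDs restored from the
         * serialized snapshot, kept sorted in xidComparator order. */
        if (builder->catchanges.xcnt == 0)
            return false;

        return bsearch(&xid, builder->catchanges.xip,
                       builder->catchanges.xcnt,
                       sizeof(TransactionId), xidComparator) != NULL;
    }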

>
> > >
> > > I had additionally thought if can further optimize this solution to
> > > just store this additional information when we need to serialize for
> > > checkpoint record but I think that won't work because walsender can
> > > restart even without resatart of server in which case the same problem
> > > can occur.
> >
> > Yes, probably we need to write catalog modifying transactions for
> > every serialized snapshot.
> >
> > > I am not if sure there is a way to further optimize this
> > > solution, let me know if you have any ideas?
> >
> > I suppose that writing additional information to serialized snapshots
> > would not be a noticeable overhead since we need 4 bytes per
> > transaction and we would not expect there is a huge number of
> > concurrent catalog modifying transactions. But both collecting catalog
> > modifying transactions (especially when there are many ongoing
> > transactions) and bsearch'ing on the XID list every time decoding the
> > COMMIT record could bring overhead.
> >
> > A solution for the first point would be to keep track of catalog
> > modifying transactions by using a linked list so that we can avoid
> > checking all ongoing transactions.
> >
>
> This sounds reasonable to me.
>
> > Regarding the second point, on reflection, I think we need to look up
> > the XID list until all XID in the list is committed/aborted. We can
> > remove XIDs from the list after adding it to committed.xip as you
> > suggested. Or when decoding a RUNNING_XACTS record, we can remove XIDs
> > older than builder->xmin from the list like we do for committed.xip in
> > SnapBuildPurgeCommittedTxn().
> >
>
> I think doing along with RUNNING_XACTS should be fine. At each
> commit/abort, the cost could be high because we need to maintain the
> sort order. In general, I feel any one of these should be okay because
> once the array becomes empty, it won't be used again and there won't
> be any operation related to it during ongoing replication.

I've attached the new version patch that incorporates the comments and
the optimizations discussed above.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v2-0001-Add-catalog-modifying-transactions-to-logical-dec.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-06 Thread Masahiko Sawada
On Wed, Jul 6, 2022 at 5:55 PM Amit Kapila  wrote:
>
> On Wed, Jul 6, 2022 at 12:19 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila  wrote:
> > >
> > > 2. Are we anytime removing transaction ids from catchanges->xip array?
> >
> > No.
> >
> > > If not, is there a reason for the same? I think we can remove it
> > > either at commit/abort or even immediately after adding the xid/subxid
> > > to committed->xip array.
> >
> > It might be a good idea but I'm concerned that removing XID from the
> > array at every commit/abort or after adding it to committed->xip array
> > might be costly as it requires adjustment of the array to keep its
> > order. Removing XIDs from the array would make bsearch faster but the
> > array is updated reasonably often (every 15 sec).
> >
>
> Fair point. However, I am slightly worried that we are unnecessarily
> searching in this new array even when ReorderBufferTxn has the
> required information. To avoid that, in function
> SnapBuildXidHasCatalogChange(), we can first check
> ReorderBufferXidHasCatalogChanges() and then check the array if the
> first check doesn't return true. Also, by the way, do we need to
> always keep builder->catchanges.xip updated via SnapBuildRestore()?
> Isn't it sufficient that we just read and throw away contents from a
> snapshot if builder->catchanges.xip is non-NULL?

IIUC catchanges.xip is restored only once when restoring a consistent
snapshot via SnapBuildRestore(). I think it's necessary to set
catchanges.xip for later use in SnapBuildXidHasCatalogChange(). Or did
you mean via SnapBuildSerialize()?

>
> I had additionally thought if can further optimize this solution to
> just store this additional information when we need to serialize for
> checkpoint record but I think that won't work because walsender can
> restart even without resatart of server in which case the same problem
> can occur.

Yes, probably we need to write catalog modifying transactions for
every serialized snapshot.

> I am not if sure there is a way to further optimize this
> solution, let me know if you have any ideas?

I suppose that writing additional information to serialized snapshots
would not be a noticeable overhead, since we need 4 bytes per transaction
and we would not expect a huge number of concurrent catalog-modifying
transactions. But both collecting catalog-modifying transactions
(especially when there are many ongoing transactions) and bsearch'ing the
XID list every time we decode a COMMIT record could introduce overhead.

A solution for the first point would be to keep track of catalog
modifying transactions by using a linked list so that we can avoid
checking all ongoing transactions.

Regarding the second point, on reflection, I think we need to look up the
XID list until all XIDs in the list are committed/aborted. We can remove
XIDs from the list after adding them to committed.xip as you suggested. Or,
when decoding a RUNNING_XACTS record, we can remove XIDs older than
builder->xmin from the list, like we do for committed.xip in
SnapBuildPurgeCommittedTxn().
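
A sketch of that purge step, run while processing a RUNNING_XACTS record
(names follow the discussion here, not necessarily the final patch):

    static void
    purge_old_catchange_xids(SnapBuild *builder)
    {
        int     surviving = 0;

        for (int off = 0; off < builder->catchanges.xcnt; off++)
        {
            TransactionId xid = builder->catchanges.xip[off];

            /* keep only XIDs that are still >= xmin */
            if (TransactionIdFollowsOrEquals(xid, builder->xmin))
                builder->catchanges.xip[surviving++] = xid;
        }
        builder->catchanges.xcnt = surviving;
    }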

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Issue with pg_stat_subscription_stats

2022-07-06 Thread Masahiko Sawada
On Thu, Jul 7, 2022 at 1:28 AM Andres Freund  wrote:
>
> On 2022-07-05 14:52:45 -0700, Andres Freund wrote:
> > On 2022-07-04 11:01:01 +0900, Masahiko Sawada wrote:
> > > I've attached the patch, fix_drop_subscriptions_stats.patch, to fix it.
> >
> > LGTM. Unless somebody sees a reason not to, I'm planning to commit that to 
> > 15
> > and HEAD.
>
> Pushed.

Thanks!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Issue with pg_stat_subscription_stats

2022-07-06 Thread Masahiko Sawada
On Thu, Jul 7, 2022 at 12:53 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-07-06 11:41:46 +0900, Masahiko Sawada wrote:
> > diff --git a/src/test/regress/sql/subscription.sql 
> > b/src/test/regress/sql/subscription.sql
> > index 74c38ead5d..6a46956f6e 100644
> > --- a/src/test/regress/sql/subscription.sql
> > +++ b/src/test/regress/sql/subscription.sql
> > @@ -30,6 +30,12 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 
> > 'dbname=regress_doesnotexist' PUB
> >  COMMENT ON SUBSCRIPTION regress_testsub IS 'test subscription';
> >  SELECT obj_description(s.oid, 'pg_subscription') FROM pg_subscription s;
> >
> > +-- Check if the subscription stats are created and stats_reset is updated
> > +-- by pg_stat_reset_subscription_stats().
> > +SELECT subname, stats_reset IS NULL stats_reset_is_null FROM 
> > pg_stat_subscription_stats ORDER BY 1;
>
> Why use ORDER BY 1 instead of just getting the stats for the subscription we
> want to test? Seems a bit more robust to show only that one, so we don't get
> unnecessary changes if the test needs to create another subscription or such.

Right, it's more robust. I've updated the patch accordingly.

>
>
> > +SELECT pg_stat_reset_subscription_stats(oid) FROM pg_subscription;
> > +SELECT subname, stats_reset IS NULL stats_reset_is_null FROM 
> > pg_stat_subscription_stats ORDER BY 1;
> > +
>
> Perhaps worth resetting again and checking that the timestamp is bigger than
> the previous timestamp? You can do that with \gset etc.

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v3-0001-Create-subscription-stats-entry-at-CREATE-SUBSCRI.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-06 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 5:09 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
> > On Tue, Jul 5, 2022 at 6:18 AM Andres Freund  wrote:
> > A datum value is convenient to represent both a pointer and a value so
> > I used it to avoid defining node types for inner and leaf nodes
> > separately.
>
> I'm not convinced that's a good goal. I think we're going to want to have
> different key and value types, and trying to unify leaf and inner nodes is
> going to make that impossible.
>
> Consider e.g. using it for something like a buffer mapping table - your key
> might be way too wide to fit it sensibly into 64bit.

Right. It seems better to have an interface so that the user of the radix
tree can specify an arbitrary key size (and perhaps value size too?) on
creation. And we can have separate leaf node types that hold values instead
of pointers. If the value size is no larger than the pointer size, we can
store values within the leaf nodes, but if it's bigger, the leaf node can
probably hold pointers to memory where the values are stored.
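
Just to sketch the kind of interface I have in mind (purely hypothetical,
not the API of any posted patch):

    typedef struct rt_config
    {
        Size    key_size;       /* e.g. sizeof(uint64), or wider for BufferTag */
        Size    value_size;     /* <= sizeof(Pointer): stored inline in leaves;
                                 * larger: leaves point to separately
                                 * allocated memory */
    } rt_config;

    extern radix_tree *rt_create(MemoryContext context, const rt_config *config);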

>
>
> > Since a datum could be 4 bytes or 8 bytes depending it might not be good for
> > some platforms.
>
> Right - thats another good reason why it's problematic. A lot of key types
> aren't going to be 4/8 bytes dependent on 32/64bit, but either / or.
>
>
> > > > +void
> > > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool 
> > > > *found_p)
> > > > +{
> > > > + int shift;
> > > > + boolreplaced;
> > > > + radix_tree_node *node;
> > > > + radix_tree_node *parent = tree->root;
> > > > +
> > > > + /* Empty tree, create the root */
> > > > + if (!tree->root)
> > > > + radix_tree_new_root(tree, key, val);
> > > > +
> > > > + /* Extend the tree if necessary */
> > > > + if (key > tree->max_val)
> > > > + radix_tree_extend(tree, key);
> > >
> > > FWIW, the reason I used separate functions for these in the prototype is 
> > > that
> > > it turns out to generate a lot better code, because it allows non-inlined
> > > function calls to be sibling calls - thereby avoiding the need for a 
> > > dedicated
> > > stack frame. That's not possible once you need a palloc or such, so 
> > > splitting
> > > off those call paths into dedicated functions is useful.
> >
> > Thank you for the info. How much does using sibling call optimization
> > help the performance in this case? I think that these two cases are
> > used only a limited number of times: inserting the first key and
> > extending the tree.
>
> It's not that it helps in the cases moved into separate functions - it's that
> not having that code in the "normal" paths keeps the normal path faster.

Thanks, understood.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-06 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila  wrote:
>
> On Mon, Jul 4, 2022 at 6:12 PM Amit Kapila  wrote:
> >
> > On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada  
> > wrote:
> > >
> > > I've attached three POC patches:
> > >
> >
> > I think it will be a good idea if you can add a short commit message
> > at least to say which patch is proposed for HEAD and which one is for
> > back branches. Also, it would be good if you can add some description
> > of the fix in the commit message. Let's remove poc* from the patch
> > name.
> >
> > Review poc_add_running_catchanges_xacts_to_serialized_snapshot
> > =
>
> Few more comments:

Thank you for the comments.

> 1.
> +
> + /* This array must be sorted in xidComparator order */
> + TransactionId *xip;
> + } catchanges;
>  };
>
> This array contains the transaction ids for subtransactions as well. I
> think it is better mention the same in comments.

Updated.

>
> 2. Are we anytime removing transaction ids from catchanges->xip array?

No.

> If not, is there a reason for the same? I think we can remove it
> either at commit/abort or even immediately after adding the xid/subxid
> to committed->xip array.

It might be a good idea, but I'm concerned that removing an XID from the
array at every commit/abort, or after adding it to the committed->xip
array, might be costly as it requires adjusting the array to keep its
order. Removing XIDs from the array would make bsearch faster, but the
array is updated reasonably often (every 15 sec).
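
For example, removing a single XID from the sorted array means shifting the
tail of the array each time, roughly (a sketch of the cost being described,
not patch code):

    static void
    remove_xid_at(TransactionId *xip, size_t *xcnt, size_t idx)
    {
        memmove(&xip[idx], &xip[idx + 1],
                sizeof(TransactionId) * (*xcnt - idx - 1));
        (*xcnt)--;
    }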

>
> 3.
> + if (readBytes != sz)
> + {
> + int save_errno = errno;
> +
> + CloseTransientFile(fd);
> +
> + if (readBytes < 0)
> + {
> + errno = save_errno;
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not read file \"%s\": %m", path)));
> + }
> + else
> + ereport(ERROR,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> + errmsg("could not read file \"%s\": read %d of %zu",
> + path, readBytes, sz)));
> + }
>
> This is the fourth instance of similar error handling code in
> SnapBuildRestore(). Isn't it better to extract this into a separate
> function?

Good idea, updated.

>
> 4.
> +TransactionId *
> +ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
> +{
> + HASH_SEQ_STATUS hash_seq;
> + ReorderBufferTXNByIdEnt *ent;
> + TransactionId *xids;
> + size_t xcnt = 0;
> + size_t xcnt_space = 64; /* arbitrary number */
> +
> + xids = (TransactionId *) palloc(sizeof(TransactionId) * xcnt_space);
> +
> + hash_seq_init(&hash_seq, rb->by_txn);
> + while ((ent = hash_seq_search(&hash_seq)) != NULL)
> + {
> + ReorderBufferTXN *txn = ent->txn;
> +
> + if (!rbtxn_has_catalog_changes(txn))
> + continue;
>
> It would be better to allocate memory the first time we have to store
> xids. There is a good chance that many a time this function will do
> just palloc without having to store any xid.

Agreed.

>
> 5. Do you think we should do some performance testing for a mix of
> ddl/dml workload to see if it adds any overhead in decoding due to
> serialize/restore doing additional work? I don't think it should add
> some meaningful overhead but OTOH there is no harm in doing some
> testing of the same.

Yes, it would be worth trying. I also believe this change doesn't
introduce noticeable overhead but let's check just in case.

I've attached an updated patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


0001-Add-catalog-modifying-transactions-to-logical-decodi.patch
Description: Binary data


Re: Issue with pg_stat_subscription_stats

2022-07-05 Thread Masahiko Sawada
On Wed, Jul 6, 2022 at 10:48 AM Andres Freund  wrote:
>
> On 2022-07-06 10:25:02 +0900, Masahiko Sawada wrote:
> > > I think most of this could just be pgstat_reset_entry().
> >
> > I think pgstat_reset_entry() doesn't work for this case as it skips
> > resetting the entry if it doesn't exist.
>
> True - but a pgstat_get_entry_ref(create = true); pgstat_reset_entry(); would
> still be shorter?

Indeed. I've updated the patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v2-0001-Create-subscription-stats-entry-at-CREATE-SUBSCRI.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-05 Thread Masahiko Sawada
On Mon, Jul 4, 2022 at 9:42 PM Amit Kapila  wrote:
>
> On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada  
> wrote:
> >
> > I've attached three POC patches:
> >
>
> I think it will be a good idea if you can add a short commit message
> at least to say which patch is proposed for HEAD and which one is for
> back branches. Also, it would be good if you can add some description
> of the fix in the commit message. Let's remove poc* from the patch
> name.

Updated.

>
> Review poc_add_running_catchanges_xacts_to_serialized_snapshot
> =
> 1.
> + /*
> + * Array of transactions that were running when the snapshot serialization
> + * and changed system catalogs,
>
> The part of the sentence after serialization is not very clear.

Updated.

>
> 2.
> - if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
> + if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid) ||
> + bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
> + sizeof(TransactionId), xidComparator) != NULL)
>
> Why are you using xid instead of subxid in bsearch call? Can we add a
> comment to say why it is okay to use xid if there is a valid reason?
> But note, we are using subxid to add to the committed xact array so
> not sure if this is a good idea but I might be missing something.

You're right, subxid should be used here.

>
> Suggestions for improvement in comments:
> -   /*
> -* Update the transactions that are running and changes
> catalogs that are
> -* not committed.
> -*/
> +   /* Update the catalog modifying transactions that are yet not
> committed. */
> if (builder->catchanges.xip)
> pfree(builder->catchanges.xip);
> builder->catchanges.xip =
> ReorderBufferGetCatalogChangesXacts(builder->reorder,
> @@ -1647,7 +1644,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
> COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
> ondisk_c += sz;
>
> -   /* copy catalog-changes xacts */
> +   /* copy catalog modifying xacts */
> sz = sizeof(TransactionId) * builder->catchanges.xcnt;
> memcpy(ondisk_c, builder->catchanges.xip, sz);
> COMP_CRC32C(ondisk->checksum, ondisk_c, sz);

Updated.

I'll post a new version patch in the next email, replying to the other comments.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Issue with pg_stat_subscription_stats

2022-07-05 Thread Masahiko Sawada
On Wed, Jul 6, 2022 at 6:52 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-07-04 11:01:01 +0900, Masahiko Sawada wrote:
> > I've attached the patch, fix_drop_subscriptions_stats.patch, to fix it.
>
> LGTM. Unless somebody sees a reason not to, I'm planning to commit that to 15
> and HEAD.
>
>
> > I've also attached another PoC patch,
> > poc_create_subscription_stats.patch, to create the stats entry when
> > creating the subscription, which address the issue reported in this
> > thread; the pg_stat_reset_subscription_stats() doesn't update the
> > stats_reset if no error is reported yet.
>
> It'd be good for this to include a test.

Agreed.

>
>
> > diff --git a/src/backend/utils/activity/pgstat_subscription.c 
> > b/src/backend/utils/activity/pgstat_subscription.c
> > index e1072bd5ba..ef318b7422 100644
> > --- a/src/backend/utils/activity/pgstat_subscription.c
> > +++ b/src/backend/utils/activity/pgstat_subscription.c
> > @@ -47,8 +47,20 @@ pgstat_report_subscription_error(Oid subid, bool 
> > is_apply_error)
> >  void
> >  pgstat_create_subscription(Oid subid)
> >  {
> > + PgStat_EntryRef *entry_ref;
> > + PgStatShared_Subscription *shstatent;
> > +
> >   pgstat_create_transactional(PGSTAT_KIND_SUBSCRIPTION,
> >   InvalidOid, 
> > subid);
> > +
> > + entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_SUBSCRIPTION,
> > +   
> >   InvalidOid, subid,
> > +   
> >   false);
> > + shstatent = (PgStatShared_Subscription *) entry_ref->shared_stats;
> > +
> > + memset(&shstatent->stats, 0, sizeof(shstatent->stats));
> > +
> > + pgstat_unlock_entry(entry_ref);
> >  }
> >
> >  /*
>
> I think most of this could just be pgstat_reset_entry().

I think pgstat_reset_entry() doesn't work for this case as it skips
resetting the entry if it doesn't exist.

I've attached an updated patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


0001-Create-subscription-stats-entry-on-CREATE-SUBSCRIPTI.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-05 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 7:00 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
> > In both test cases, There is not much difference between using AVX2
> > and SSE2. The more mode types, the more time it takes for loading the
> > data (see sse2_4_16_32_128_256).
>
> Yea, at some point the compiler starts using a jump table instead of branches,
> and that turns out to be a good bit more expensive. And even with branches, it
> obviously adds hard to predict branches. IIRC I fought a bit with the compiler
> to avoid some of that cost, it's possible that got "lost" in Sawada-san's
> patch.
>
>
> Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
> one is not unimportant until we have path compression.

I wanted to start with a smaller number of node types for simplicity. A
16-element node type has been added to the v4 patch I submitted[1]. I think
it's a trade-off between better memory usage and the overhead of growing
(and shrinking) the node type. I'm going to add more node types once the
benchmarks show that it's beneficial.

>
> Right now the node struct sizes are:
> 4 - 48 bytes
> 32 - 296 bytes
> 128 - 1304 bytes
> 256 - 2088 bytes
>
> I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
> bytes, but needing that separate isset array somehow is sad :/. I wonder if a
> smaller "free index" would do the trick? Point to the element + 1 where we
> searched last and start a plain loop there. Particularly in an insert-only
> workload that'll always work, and in other cases it'll still often work I
> think.

radix_tree_node_128->isset is used to distinguish between a null pointer
in inner nodes and the value 0 in leaf nodes. So I guess we could have a
flag indicating whether a node is a leaf or an inner node, so that we can
interpret (Datum) 0 as either a null pointer or 0. Or, if we define
different data types for inner and leaf nodes, we probably don't need it.
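
Roughly, the node looks like this (the field layout is illustrative, not
copied from the patch), which shows why the extra bitmap is needed as long
as inner and leaf nodes share one representation:

    typedef struct radix_tree_node_128
    {
        radix_tree_node n;
        uint8   slot_idxs[256];     /* key chunk -> index into slots[] */
        uint8   isset[128 / 8];     /* set bit = slot in use; without it, a
                                     * leaf storing (Datum) 0 would look the
                                     * same as an empty slot */
        Datum   slots[128];         /* child pointer (inner) or value (leaf) */
    } radix_tree_node_128;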


> One thing I was wondering about is trying to choose node types in
> roughly-power-of-two struct sizes. It's pretty easy to end up with significant
> fragmentation in the slabs right now when inserting as you go, because some of
> the smaller node types will be freed but not enough to actually free blocks of
> memory. If we instead have ~power-of-two sizes we could just use a single slab
> of the max size, and carve out the smaller node types out of that largest
> allocation.

Do you mean we would manage memory allocation (and freeing) for smaller
node types ourselves?

How about using different block sizes for different node types?
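
For instance, something along these lines, using the existing slab
allocator API (the block-size choices are only an example):

    /* One slab per node kind; larger node kinds get larger blocks so that
     * each kind's blocks stay reasonably full. */
    tree->slabs[RADIX_TREE_NODE_KIND_4] =
        SlabContextCreate(ctx, "radix tree node 4",
                          SLAB_DEFAULT_BLOCK_SIZE,
                          sizeof(radix_tree_node_4));
    tree->slabs[RADIX_TREE_NODE_KIND_256] =
        SlabContextCreate(ctx, "radix tree node 256",
                          SLAB_LARGE_BLOCK_SIZE,
                          sizeof(radix_tree_node_256));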

>
> Btw, that fragmentation is another reason why I think it's better to track
> memory usage via memory contexts, rather than doing so based on
> GetMemoryChunkSpace().

Agreed.

>
>
> > > Ideally, node16 and node32 would have the same code with a different
> > > loop count (1 or 2). More generally, there is too much duplication of
> > > code (noted by Andres in his PoC), and there are many variable names
> > > with the node size embedded. This is a bit tricky to make more
> > > general, so we don't need to try it yet, but ideally we would have
> > > something similar to:
> > >
> > > switch (node->kind) // todo: inspect tagged pointer
> > > {
> > >   case RADIX_TREE_NODE_KIND_4:
> > >idx = node_search_eq(node, chunk, 4);
> > >do_action(node, idx, 4, ...);
> > >break;
> > >   case RADIX_TREE_NODE_KIND_32:
> > >idx = node_search_eq(node, chunk, 32);
> > >do_action(node, idx, 32, ...);
> > >   ...
> > > }
>
> FWIW, that should be doable with an inline function, if you pass it the memory
> to the "array" rather than the node directly. Not so sure it's a good idea to
> do dispatch between node types / search methods inside the helper, as you
> suggest below:
>
>
> > > static pg_alwaysinline void
> > > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> > > {
> > > if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
> > >   // do simple loop with (node_simple *) node;
> > > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
> > >   // do vectorized loop where available with (node_vec *) node;
> > > ...
> > > }

Yeah, it's worth trying at some point.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-05 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> > new file mode 100644
> > index 00..bf87f932fd
> > --- /dev/null
> > +++ b/src/backend/lib/radixtree.c
> > @@ -0,0 +1,1763 @@
> > +/*-
> > + *
> > + * radixtree.c
> > + *   Implementation for adaptive radix tree.
> > + *
> > + * This module employs the idea from the paper "The Adaptive Radix Tree: 
> > ARTful
> > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and 
> > Thomas
> > + * Neumann, 2013.
> > + *
> > + * There are some differences from the proposed implementation.  For 
> > instance,
> > + * this radix tree module utilizes AVX2 instruction, enabling us to use 
> > 256-bit
> > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the 
> > paper.
> > + * Also, there is no support for path compression and lazy path expansion. 
> > The
> > + * radix tree supports fixed length of the key so we don't expect the tree 
> > level
> > + * wouldn't be high.
>
> I think we're going to need path compression at some point, fwiw. I'd bet on
> it being beneficial even for the tid case.
>
>
> > + * The key is a 64-bit unsigned integer and the value is a Datum.
>
> I don't think it's a good idea to define the value type to be a datum.

A datum value is convenient to represent both a pointer and a value so
I used it to avoid defining node types for inner and leaf nodes
separately. Since a Datum could be 4 bytes or 8 bytes depending on the
platform, it might not be good for some platforms. But what aspects of
using a Datum do you not like?

>
>
> > +/*
> > + * As we descend a radix tree, we push the node to the stack. The stack is 
> > used
> > + * at deletion.
> > + */
> > +typedef struct radix_tree_stack_data
> > +{
> > + radix_tree_node *node;
> > + struct radix_tree_stack_data *parent;
> > +} radix_tree_stack_data;
> > +typedef radix_tree_stack_data *radix_tree_stack;
>
> I think it's a very bad idea for traversal to need allocations. I really want
> to eventually use this for shared structures (eventually with lock-free
> searches at least), and needing to do allocations while traversing the tree is
> a no-go for that.
>
> Particularly given that the tree currently has a fixed depth, can't you just
> allocate this on the stack once?

Yes, we can do that.

>
> > +/*
> > + * Allocate a new node with the given node kind.
> > + */
> > +static radix_tree_node *
> > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> > +{
> > + radix_tree_node *newnode;
> > +
> > + newnode = (radix_tree_node *) 
> > MemoryContextAllocZero(tree->slabs[kind],
> > +   
> >radix_tree_node_info[kind].size);
> > + newnode->kind = kind;
> > +
> > + /* update the statistics */
> > + tree->mem_used += GetMemoryChunkSpace(newnode);
> > + tree->cnt[kind]++;
> > +
> > + return newnode;
> > +}
>
> Why are you tracking the memory usage at this level of detail? It's *much*
> cheaper to track memory usage via the memory contexts? Since they're dedicated
> for the radix tree, that ought to be sufficient?

Indeed. I'll use MemoryContextMemAllocated instead.
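
A minimal sketch of what that would look like (RADIX_TREE_NODE_KIND_COUNT
is a made-up name for the number of node kinds):

    /* Report memory usage from the slab contexts themselves instead of
     * summing GetMemoryChunkSpace() per allocation. */
    static int64
    radix_tree_memory_usage(radix_tree *tree)
    {
        int64   mem_used = 0;

        for (int kind = 0; kind < RADIX_TREE_NODE_KIND_COUNT; kind++)
            mem_used += MemoryContextMemAllocated(tree->slabs[kind], true);

        return mem_used;
    }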

>
>
> > + else if (idx != n4->n.count)
> > + {
> > + /*
> > +  * the key needs to be 
> > inserted in the middle of the
> > +  * array, make space for the 
> > new key.
> > +  */
> > + memmove(&(n4->chunks[idx + 
> > 1]), &(n4->chunks[idx]),
> > + sizeof(uint8) 
> > * (n4->n.count - idx));
> > + memmove(&(n4->slots[idx + 
> > 1]), &(n4->slots[idx]),
> > + 
> > sizeof(radix_tree_node *) * (n4->n.count - idx));
> > + }
>
> Maybe we coul

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-04 Thread Masahiko Sawada
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada  wrote:
>
> On Tue, Jun 28, 2022 at 10:10 PM John Naylor
>  wrote:
> >
> > On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada  
> > wrote:
> > >
> > > > I
> > > > suspect other optimizations would be worth a lot more than using AVX2:
> > > > - collapsing inner nodes
> > > > - taking care when constructing the key (more on this when we
> > > > integrate with VACUUM)
> > > > ...and a couple Andres mentioned:
> > > > - memory management: in
> > > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > > - node dispatch:
> > > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > > >
> > > > Therefore, I would suggest that we use SSE2 only, because:
> > > > - portability is very easy
> > > > - to avoid a performance hit from indirecting through a function pointer
> > >
> > > Okay, I'll try these optimizations and see if the performance becomes 
> > > better.
> >
> > FWIW, I think it's fine if we delay these until after committing a
> > good-enough version. The exception is key construction and I think
> > that deserves some attention now (more on this below).
>
> Agreed.
>
> >
> > > I've done benchmark tests while changing the node types. The code base
> > > is v3 patch that doesn't have the optimization you mentioned below
> > > (memory management and node dispatch) but I added the code to use SSE2
> > > for node-16 and node-32.
> >
> > Great, this is helpful to visualize what's going on!
> >
> > > * sse2_4_16_48_256
> > > * nkeys = 9091, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 
> > > 916433
> > > * nkeys = 2, height = 3, n4 = 2, n16 = 0, n48 = 207, n256 = 50
> > >
> > > * sse2_4_32_128_256
> > > * nkeys = 9091, height = 3, n4 = 0, n32 = 285, n128 = 916629, 
> > > n256 = 31
> > > * nkeys = 2, height = 3, n4 = 2, n32 = 48, n128 = 208, n256 = 
> > > 1
> >
> > > Observations are:
> > >
> > > In both test cases, There is not much difference between using AVX2
> > > and SSE2. The more mode types, the more time it takes for loading the
> > > data (see sse2_4_16_32_128_256).
> >
> > Good to know. And as Andres mentioned in his PoC, more node types
> > would be a barrier for pointer tagging, since 32-bit platforms only
> > have two spare bits in the pointer.
> >
> > > In dense case, since most nodes have around 100 children, the radix
> > > tree that has node-128 had a good figure in terms of memory usage. On
> >
> > Looking at the node stats, and then your benchmark code, I think key
> > construction is a major influence, maybe more than node type. The
> > key/value scheme tested now makes sense:
> >
> > blockhi || blocklo || 9 bits of item offset
> >
> > (with the leaf nodes containing a bit map of the lowest few bits of
> > this whole thing)
> >
> > We want the lower fanout nodes at the top of the tree and higher
> > fanout ones at the bottom.
>
> So more inner nodes can fit in CPU cache, right?
>
> >
> > Note some consequences: If the table has enough columns such that much
> > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > dense case the nodes above the leaves will have lower fanout (maybe
> > they will fit in a node32). Also, the bitmap values in the leaves will
> > be more empty. In other words, many tables in the wild *resemble* the
> > sparse case a bit, even if truly all tuples on the page are dead.
> >
> > Note also that the dense case in the benchmark above has ~4500 times
> > more keys than the sparse case, and uses about ~1000 times more
> > memory. But the runtime is only 2-3 times longer. That's interesting
> > to me.
> >
> > To optimize for the sparse case, it seems to me that the key/value would be
> >
> > blockhi || 9 bits of item offset || blocklo
> >
> > I believe that would make the leaf nodes more dense, with fewer inner
> > nodes, and could drastically speed up the sparse case, and maybe many
> > realistic dense cases.
>
> Does it have an effect on the number of inner nodes?
>
> >  I'm curious to hear your thoughts.
>
> Thank you for your analysis. It's worth trying. We use 9 bits for item
> offset but most pages don't use all bits in practice. So probably i

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-03 Thread Masahiko Sawada
On Tue, Jun 28, 2022 at 10:10 PM John Naylor
 wrote:
>
> On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada  wrote:
> >
> > > I
> > > suspect other optimizations would be worth a lot more than using AVX2:
> > > - collapsing inner nodes
> > > - taking care when constructing the key (more on this when we
> > > integrate with VACUUM)
> > > ...and a couple Andres mentioned:
> > > - memory management: in
> > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > - node dispatch:
> > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > >
> > > Therefore, I would suggest that we use SSE2 only, because:
> > > - portability is very easy
> > > - to avoid a performance hit from indirecting through a function pointer
> >
> > Okay, I'll try these optimizations and see if the performance becomes 
> > better.
>
> FWIW, I think it's fine if we delay these until after committing a
> good-enough version. The exception is key construction and I think
> that deserves some attention now (more on this below).

Agreed.

>
> > I've done benchmark tests while changing the node types. The code base
> > is v3 patch that doesn't have the optimization you mentioned below
> > (memory management and node dispatch) but I added the code to use SSE2
> > for node-16 and node-32.
>
> Great, this is helpful to visualize what's going on!
>
> > * sse2_4_16_48_256
> > * nkeys = 9091, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 
> > 916433
> > * nkeys = 2, height = 3, n4 = 2, n16 = 0, n48 = 207, n256 = 50
> >
> > * sse2_4_32_128_256
> > * nkeys = 9091, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 
> > = 31
> > * nkeys = 2, height = 3, n4 = 2, n32 = 48, n128 = 208, n256 = 1
>
> > Observations are:
> >
> > In both test cases, There is not much difference between using AVX2
> > and SSE2. The more mode types, the more time it takes for loading the
> > data (see sse2_4_16_32_128_256).
>
> Good to know. And as Andres mentioned in his PoC, more node types
> would be a barrier for pointer tagging, since 32-bit platforms only
> have two spare bits in the pointer.
>
> > In dense case, since most nodes have around 100 children, the radix
> > tree that has node-128 had a good figure in terms of memory usage. On
>
> Looking at the node stats, and then your benchmark code, I think key
> construction is a major influence, maybe more than node type. The
> key/value scheme tested now makes sense:
>
> blockhi || blocklo || 9 bits of item offset
>
> (with the leaf nodes containing a bit map of the lowest few bits of
> this whole thing)
>
> We want the lower fanout nodes at the top of the tree and higher
> fanout ones at the bottom.

So more inner nodes can fit in CPU cache, right?

>
> Note some consequences: If the table has enough columns such that much
> fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> dense case the nodes above the leaves will have lower fanout (maybe
> they will fit in a node32). Also, the bitmap values in the leaves will
> be more empty. In other words, many tables in the wild *resemble* the
> sparse case a bit, even if truly all tuples on the page are dead.
>
> Note also that the dense case in the benchmark above has ~4500 times
> more keys than the sparse case, and uses about ~1000 times more
> memory. But the runtime is only 2-3 times longer. That's interesting
> to me.
>
> To optimize for the sparse case, it seems to me that the key/value would be
>
> blockhi || 9 bits of item offset || blocklo
>
> I believe that would make the leaf nodes more dense, with fewer inner
> nodes, and could drastically speed up the sparse case, and maybe many
> realistic dense cases.

Does it have an effect on the number of inner nodes?

>  I'm curious to hear your thoughts.

Thank you for your analysis. It's worth trying. We use 9 bits for the item
offset, but most pages don't use all bits in practice. So it might be
better to move the most significant bit of the item offset to the left of
blockhi. Or, more simply:

9 bits of item offset || blockhi || blocklo
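
A sketch of the two layouts (ignoring the leaf-level bitmap of the lowest
bits; names and shift widths are illustrative only):

    #define TID_OFFSET_BITS 9

    /* current scheme: blockhi || blocklo || item offset */
    static inline uint64
    tid_key_block_major(BlockNumber blkno, OffsetNumber offnum)
    {
        return ((uint64) blkno << TID_OFFSET_BITS) | offnum;
    }

    /* the variant above: item offset || blockhi || blocklo */
    static inline uint64
    tid_key_offset_major(BlockNumber blkno, OffsetNumber offnum)
    {
        return ((uint64) offnum << 32) | blkno;
    }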

>
> > the other hand, the radix tree that doesn't have node-128 has a better
> > number in terms of insertion performance. This is probably because we
> > need to iterate over 'isset' flags from the beginning of the array in
> > order to find an empty slot when inserting new data. We do the same
> > thing also for node-48 but it was better than node-128 as it's up to
> > 48.
>
> I mentioned in my 

Re: Issue with pg_stat_subscription_stats

2022-07-03 Thread Masahiko Sawada
On Sat, Jul 2, 2022 at 9:52 AM Masahiko Sawada  wrote:
>
>
>
> On Sat, Jul 2, 2022 at 2:53 Andres Freund  wrote:
>>
>> Hi,
>>
>> On 2022-07-01 16:08:48 +0900, Masahiko Sawada wrote:
>> > Yes, my point is that it may be misleading that the subscription stats
>> > are created when a subscription is created.
>>
>> I think it's important to create stats at that time, because otherwise it's
>> basically impossible to ensure that stats are dropped when a transaction 
>> rolls
>> back. If some / all columns should return something else before stats are
>> reported that can be addressed easily by tracking that in a separate field.
>>
>>
>> > On the other hand, I'm not sure we agreed that the behavior that
>> > Melanie reported is not a problem. The user might get confused since
>> > the subscription stats works differently than other stats when a
>> > reset. Previously, the primary reason why I hesitated to create the
>> > subscription stats when creating a subscription is that CREATE
>> > SUBSCRIPTION (with create_slot = false) can be rolled back. But with
>> > the shmem stats, we can easily resolve it by using
>> > pgstat_create_transactional().
>>
>> Yep.
>>
>>
>> > > > The second problem is that the following code in DropSubscription()
>> > > > should be updated:
>> > > >
>> > > > /*
>> > > >  * Tell the cumulative stats system that the subscription is 
>> > > > getting
>> > > >  * dropped. We can safely report dropping the subscription 
>> > > > statistics here
>> > > >  * if the subscription is associated with a replication slot since 
>> > > > we
>> > > >  * cannot run DROP SUBSCRIPTION inside a transaction block.  
>> > > > Subscription
>> > > >  * statistics will be removed later by (auto)vacuum either if it's 
>> > > > not
>> > > >  * associated with a replication slot or if the message for 
>> > > > dropping the
>> > > >  * subscription gets lost.
>> > > >  */
>> > > > if (slotname)
>> > > > pgstat_drop_subscription(subid);
>> > > >
>> > > > I think we can call pgstat_drop_subscription() even if slotname is
>> > > > NULL and need to update the comment.
>> > > >
>> > >
>> > > +1.
>> > >
>> > > > IIUC autovacuum is no longer
>> > > > responsible for garbage collection.
>> > > >
>> > >
>> > > Right, this is my understanding as well.
>> >
>> > Thank you for the confirmation.
>>
>> Want to propose a patch?
>
>
> Yes, I’ll propose a patch.
>

I've attached a patch, fix_drop_subscription_stats.patch, to fix it.

I've also attached another PoC patch,
poc_create_subscription_stats.patch, to create the stats entry when
creating the subscription, which addresses the issue reported in this
thread, namely that pg_stat_reset_subscription_stats() doesn't update
stats_reset if no error has been reported yet.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


fix_drop_subscription_stats.patch
Description: Binary data


poc_create_subscription_stats.patch
Description: Binary data


Re: Issue with pg_stat_subscription_stats

2022-07-01 Thread Masahiko Sawada
On Sat, Jul 2, 2022 at 2:53 Andres Freund  wrote:

> Hi,
>
> On 2022-07-01 16:08:48 +0900, Masahiko Sawada wrote:
> > Yes, my point is that it may be misleading that the subscription stats
> > are created when a subscription is created.
>
> I think it's important to create stats at that time, because otherwise it's
> basically impossible to ensure that stats are dropped when a transaction
> rolls
> back. If some / all columns should return something else before stats are
> reported that can be addressed easily by tracking that in a separate field.
>
>
> > On the other hand, I'm not sure we agreed that the behavior that
> > Melanie reported is not a problem. The user might get confused since
> > the subscription stats works differently than other stats when a
> > reset. Previously, the primary reason why I hesitated to create the
> > subscription stats when creating a subscription is that CREATE
> > SUBSCRIPTION (with create_slot = false) can be rolled back. But with
> > the shmem stats, we can easily resolve it by using
> > pgstat_create_transactional().
>
> Yep.
>
>
> > > > The second problem is that the following code in DropSubscription()
> > > > should be updated:
> > > >
> > > > /*
> > > >  * Tell the cumulative stats system that the subscription is
> getting
> > > >  * dropped. We can safely report dropping the subscription
> statistics here
> > > >  * if the subscription is associated with a replication slot
> since we
> > > >  * cannot run DROP SUBSCRIPTION inside a transaction block.
> Subscription
> > > >  * statistics will be removed later by (auto)vacuum either if
> it's not
> > > >  * associated with a replication slot or if the message for
> dropping the
> > > >  * subscription gets lost.
> > > >  */
> > > > if (slotname)
> > > > pgstat_drop_subscription(subid);
> > > >
> > > > I think we can call pgstat_drop_subscription() even if slotname is
> > > > NULL and need to update the comment.
> > > >
> > >
> > > +1.
> > >
> > > > IIUC autovacuum is no longer
> > > > responsible for garbage collection.
> > > >
> > >
> > > Right, this is my understanding as well.
> >
> > Thank you for the confirmation.
>
> Want to propose a patch?


Yes, I’ll propose a patch.

Regards,
-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


Re: Issue with pg_stat_subscription_stats

2022-07-01 Thread Masahiko Sawada
On Fri, Jul 1, 2022 at 3:01 PM Amit Kapila  wrote:
>
> On Fri, Jul 1, 2022 at 7:12 AM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 16, 2022 at 11:34 PM Masahiko Sawada  
> > wrote:
> > >
> >
> > While looking at this issue again, I realized there seems to be two
> > problems with subscription stats on shmem stats:
> >
> > Firstly, we call pgstat_create_subscription() when creating a
> > subscription but the subscription stats are reported by apply workers.
> > And pgstat_create_subscription() just calls
> > pgstat_create_transactional():
> >
> > void
> > pgstat_create_subscription(Oid subid)
> > {
> > pgstat_create_transactional(PGSTAT_KIND_SUBSCRIPTION,
> > InvalidOid, subid);
> > }
> >
> > I guess calling pgstat_create_subscription() is not necessary for the
> > current usage. On the other hand, if we create the subscription stats
> > there we can resolve the issue Melanie reported in this thread.
> >
>
> It won't create the stats entry in the shared hash table, so the
> behavior should be the same as without shared stats.

Yes, my point is that it may be misleading that the subscription stats
are created when a subscription is created. The initial behavior is
technically the same for pg_stat_database. That is, we don't create
the stats entry for them when creating the object. But we don’t call
pgstat_create_transactional when creating a database (we don’t have a
function like pgstat_create_database()) whereas we do for subscription
stats.

On the other hand, I'm not sure we agreed that the behavior that
Melanie reported is not a problem. The user might get confused since
the subscription stats work differently from other stats when they are
reset. Previously, the primary reason why I hesitated to create the
subscription stats when creating a subscription was that CREATE
SUBSCRIPTION (with create_slot = false) can be rolled back. But with
the shmem stats, we can easily resolve that by using
pgstat_create_transactional().

>
> > The second problem is that the following code in DropSubscription()
> > should be updated:
> >
> > /*
> >  * Tell the cumulative stats system that the subscription is getting
> >  * dropped. We can safely report dropping the subscription statistics 
> > here
> >  * if the subscription is associated with a replication slot since we
> >  * cannot run DROP SUBSCRIPTION inside a transaction block.  
> > Subscription
> >  * statistics will be removed later by (auto)vacuum either if it's not
> >  * associated with a replication slot or if the message for dropping the
> >  * subscription gets lost.
> >  */
> > if (slotname)
> > pgstat_drop_subscription(subid);
> >
> > I think we can call pgstat_drop_subscription() even if slotname is
> > NULL and need to update the comment.
> >
>
> +1.
>
> > IIUC autovacuum is no longer
> > responsible for garbage collection.
> >
>
> Right, this is my understanding as well.

Thank you for the confirmation.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Backup command and functions can cause assertion failure and segmentation fault

2022-07-01 Thread Masahiko Sawada
 when BASE_BACKUP is aborted.
> So attached patch changes do_pg_abort_backup callback so that
> it resets sessionBackupState. I confirmed that, with the patch,
> those assertion failure and segmentation fault didn't happen.

The change looks good to me. I've also confirmed the change fixed the issues.

> But this change has one issue that; if BASE_BACKUP is run while
> a backup is already in progress in the session by pg_backup_start()
> and that session is terminated, the change causes 
> XLogCtl->Insert.runningBackups
> to be decremented incorrectly. That is, XLogCtl->Insert.runningBackups
> is incremented by two by pg_backup_start() and BASE_BACKUP,
> but it's decremented only by one by the termination of the session.
>
> To address this issue, I think that we should disallow BASE_BACKUP
> to run while a backup is already in progress in the *same* session
> as we already do this for pg_backup_start(). Thought? I included
> the code to disallow that in the attached patch.

+1

@@ -233,6 +233,12 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
StringInfo  labelfile;
StringInfo  tblspc_map_file;
    backup_manifest_info manifest;
+   SessionBackupState status = get_backup_status();
+
+   if (status == SESSION_BACKUP_RUNNING)
+   ereport(ERROR,
+   (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+errmsg("a backup is already in progress in this session")));

I think we can move it to the beginning of SendBaseBackup() so we can
avoid bbsink initialization and cleanup in the error case.
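
For example (a rough sketch only, assuming the PG15-era SendBaseBackup()
signature; the rest of the function is elided):

void
SendBaseBackup(BaseBackupCmd *cmd)
{
	/* Reject a nested backup before any bbsink is set up. */
	if (get_backup_status() == SESSION_BACKUP_RUNNING)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("a backup is already in progress in this session")));

	/* ... existing option parsing, bbsink setup, and backup ... */
}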

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Issue with pg_stat_subscription_stats

2022-06-30 Thread Masahiko Sawada
  Hi,

On Wed, Mar 16, 2022 at 11:34 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 16, 2022 at 8:51 PM Amit Kapila  wrote:
> >
> > On Tue, Mar 15, 2022 at 10:09 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Mar 15, 2022 at 3:34 AM Melanie Plageman
> > >  wrote:
> > > >
> > > > On Mon, Mar 14, 2022 at 4:02 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Mon, Mar 14, 2022 at 2:05 AM Melanie Plageman
> > > > >  wrote:
> > > > > >
> > > > > > On Sat, Mar 12, 2022 at 3:15 PM Andres Freund  
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > On 2022-03-12 08:28:35 +0530, Amit Kapila wrote:
> > > > > > > > On Sat, Mar 12, 2022 at 2:14 AM Melanie Plageman
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > So, I noticed that pg_stat_reset_subscription_stats() wasn't 
> > > > > > > > > working
> > > > > > > > > properly, and, upon further investigation, I'm not sure the 
> > > > > > > > > view
> > > > > > > > > pg_stat_subscription_stats is being properly populated.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I have tried the below scenario based on this:
> > > > > > > > Step:1 Create some data that generates conflicts and lead to 
> > > > > > > > apply
> > > > > > > > failures and then check in the view:
> > > > > > >
> > > > > > > I think the problem is present when there was *no* conflict
> > > > > > > previously. Because nothing populates the stats entry without an 
> > > > > > > error, the
> > > > > > > reset doesn't have anything to set the stats_reset field in, 
> > > > > > > which then means
> > > > > > > that the stats_reset field is NULL even though stats have been 
> > > > > > > reset.
> > > > > >
> > > > > > Yes, this is what I meant. stats_reset is not initialized and 
> > > > > > without
> > > > > > any conflict happening to populate the stats, after resetting the 
> > > > > > stats,
> > > > > > the field still does not get populated. I think this is a bit
> > > > > > unexpected.
> > > > > >
> > > > > > psql (15devel)
> > > > > > Type "help" for help.
> > > > > >
> > > > > > mplageman=# select * from pg_stat_subscription_stats ;
> > > > > >  subid | subname | apply_error_count | sync_error_count | 
> > > > > > stats_reset
> > > > > > ---+-+---+--+-
> > > > > >  16398 | mysub   | 0 |0 |
> > > > > > (1 row)
> > > > > >
> > > > > > mplageman=# select pg_stat_reset_subscription_stats(16398);
> > > > > >  pg_stat_reset_subscription_stats
> > > > > > --
> > > > > >
> > > > > > (1 row)
> > > > > >
> > > > > > mplageman=# select * from pg_stat_subscription_stats ;
> > > > > >  subid | subname | apply_error_count | sync_error_count | 
> > > > > > stats_reset
> > > > > > ---+-+---+--+-
> > > > > >  16398 | mysub   | 0 |0 |
> > > > > > (1 row)
> > > > > >
> > > > >
> > > > > Looking at other statistics such as replication slots, shared stats,
> > > > > and SLRU stats, it makes sense that resetting it populates the stats.
> > > > > So we need to fix this issue.
> > > > >
> > > > > However, I think the proposed fix has two problems; it can create an
> > > > > entry for non-existing subscriptions if the user directly calls
> > > > > function pg_stat_get_subscription_stats(), and stats_reset value is
> > > > > not updated in the stats file as it is not done by the stats

Re: Testing autovacuum wraparound (including failsafe)

2022-06-29 Thread Masahiko Sawada
Hi,

On Tue, Feb 1, 2022 at 11:58 AM Masahiko Sawada  wrote:
>
> On Fri, Jun 11, 2021 at 10:19 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2021-06-10 16:42:01 +0300, Anastasia Lubennikova wrote:
> > > Cool. Thank you for working on that!
> > > Could you please share a WIP patch for the $subj? I'd be happy to help 
> > > with
> > > it.
> >
> > I've attached the current WIP state, which hasn't evolved much since
> > this message... I put the test in 
> > src/backend/access/heap/t/001_emergency_vacuum.pl
> > but I'm not sure that's the best place. But I didn't think
> > src/test/recovery is great either.
> >
>
> Thank you for sharing the WIP patch.
>
> Regarding point (1) you mentioned (StartupSUBTRANS() takes a long time
> for zeroing out all pages), how about using single-user mode instead
> of preparing the transaction? That is, after pg_resetwal we check the
> ages of datfrozenxid by executing a query in single-user mode. That
> way, we don’t need to worry about autovacuum concurrently running
> while checking the ages of frozenxids. I’ve attached a PoC patch that
> does the scenario like:
>
> 1. start cluster with autovacuum=off and create tables with a few data
> and make garbage on them
> 2. stop cluster and do pg_resetwal
> 3. start cluster in single-user mode
> 4. check age(datfrozenxid)
> 5. stop cluster
> 6. start cluster and wait for autovacuums to increase template0,
> template1, and postgres datfrozenxids

The above steps are wrong.

I think we can expose a function in an extension used only by this
test that sets nextXid to a future value while zeroing out the
clog/subtrans pages it needs; we don't need to fill all clog/subtrans
pages between oldestActiveXID and nextXid. I've attached a PoC patch
adding this regression test and am going to register it in the next
CF.

BTW, while testing the emergency situation, I found there is a race
condition where anti-wraparound vacuum isn't invoked with the settings
autovacuum = off and autovacuum_max_workers = 1. An autovacuum worker
sends a signal to the postmaster after advancing datfrozenxid in
SetTransactionIdLimit(). But with these settings, if the autovacuum
launcher attempts to launch a worker before the autovacuum worker that
signaled the postmaster finishes, the launcher exits without launching
a worker because there are no free workers. A new launcher won't be
started until a new XID is generated (and only when that XID % 65536 ==
0). Although autovacuum_max_workers = 1 is not mandatory for this
test, it makes it easier to verify the order of operations.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v1-0001-Add-regression-tests-for-emergency-vacuums.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-28 Thread Masahiko Sawada
Hi,

On Mon, Jun 27, 2022 at 8:12 PM John Naylor
 wrote:
>
> On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada  wrote:
>
> [v3 patch]
>
> Hi Masahiko,
>
> Since there are new files, and they are pretty large, I've attached
> most specific review comments and questions as a diff rather than in
> the email body. This is not a full review, which will take more time
> -- this is a first pass mostly to aid my understanding, and discuss
> some of the design and performance implications.
>
> I tend to think it's a good idea to avoid most cosmetic review until
> it's close to commit, but I did mention a couple things that might
> enhance readability during review.

Thank you for reviewing the patch!

>
> As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:
>
> > On Thu, Jun 16, 2022 at 4:30 PM John Naylor
> >  wrote:
> > >
> > > For now, though, I'd like to question
> > > why we even need to use 32-byte registers in the first place. For one,
> > > the paper referenced has 16-pointer nodes, but none for 32 (next level
> > > is 48 and uses a different method to find the index of the next
> > > pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> > > of his patch a couple weeks ago I don't recall a reason mentioned for
> > > it.
> >
> > I might be wrong but since AVX2 instruction set is introduced in
> > Haswell microarchitecture in 2013 and the referenced paper is
> > published in the same year, the art didn't use AVX2 instruction set.
>
> Sure, but with a bit of work the same technique could be done on that
> node size with two 16-byte registers.
>
> > 32-pointer nodes are better from a memory perspective as you
> > mentioned. Andres' prototype supports both 16-pointer nodes and
> > 32-pointer nodes (out of 6 node types). This would provide better
> > memory usage but on the other hand, it would also bring overhead of
> > switching the node type.
>
> Right, using more node types provides smaller increments of node size.
> Just changing node type can be better or worse, depending on the
> input.
>
> > Anyway, it's an important design decision to
> > support which size of node to support. It should be done based on
> > experiment results and documented.
>
> Agreed. I would add that in the first step, we want something
> straightforward to read and easy to integrate into our codebase.

Agreed.



> I
> suspect other optimizations would be worth a lot more than using AVX2:
> - collapsing inner nodes
> - taking care when constructing the key (more on this when we
> integrate with VACUUM)
> ...and a couple Andres mentioned:
> - memory management: in
> https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> - node dispatch:
> https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
>
> Therefore, I would suggest that we use SSE2 only, because:
> - portability is very easy
> - to avoid a performance hit from indirecting through a function pointer

Okay, I'll try these optimizations and see if the performance becomes better.

>
> When the PG16 cycle opens, I will work separately on ensuring the
> portability of using SSE2, so you can focus on other aspects.

Thanks!

> I think it would be a good idea to have both node16 and node32 for testing.
> During benchmarking we can delete one or the other and play with the
> other thresholds a bit.

I've done benchmark tests while changing the node types. The code base
is the v3 patch, which doesn't have the optimizations you mentioned
below (memory management and node dispatch), but I added code to use
SSE2 for node-16 and node-32. The 'name' in the results below indicates
the instruction set used (AVX2 or SSE2) and the node types. For
instance, sse2_4_32_48_256 means the radix tree has four node types
with 4, 32, 48, and 256 pointers respectively, and uses the SSE2
instruction set.

* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)
select prepare(
100, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);

 name                   size     attach       lookup
 avx2_4_32_128_256      1154 MB  6742.53 ms   47765.63 ms
 avx2_4_32_48_256       1839 MB  4239.35 ms   40528.39 ms
 sse2_4_16_128_256      1154 MB  6994.43 ms   40383.85 ms
 sse2_4_16_32_128_256   1154 MB  7239.35 ms   43542.39 ms
 sse2_4_16_48_256       1839 MB  4404.63 ms   36048.96 ms
 sse2_4_32_128_256      1154 MB  6

Re: Postgres do not allow to create many tables with more than 63-symbols prefix

2022-06-26 Thread Masahiko Sawada
On Fri, Jun 24, 2022 at 2:12 PM Andrey Lepikhov
 wrote:
>
> Moved from the pgsql-bugs mailing list [1].
>
> On 6/23/22 07:03, Masahiko Sawada wrote:
>  > Hi,
>  >
>  > On Sat, Jun 4, 2022 at 4:03 AM Andrey Lepikhov
>  >  wrote:
>  >>
>  >> According to subj you can try to create many tables (induced by the case
>  >> of partitioned table) with long prefix - see 6727v.sql for reproduction.
>  >> But now it's impossible because of logic of the makeUniqueTypeName()
>  >> routine.
>  >> You get the error:
>  >> ERROR:  could not form array type name for type ...
>  >>
>  >> It is very corner case, of course. But solution is easy and short. So,
>  >> why not to fix? - See the patch in attachment.
>  >
>  > While this seems to be a good improvement, I think it's not a bug.
>  > Probably we cannot backpatch it as it will end up having type names
>  > defined by different naming rules. I'd suggest discussing it on
>  > -hackers.
> Done.

Thanks for updating the patch. Please register this item for the next
CF if you haven't already.

>
>  > Regarding the patch, I think we can merge makeUniqueTypeName() to
>  > makeArrayTypeName() as there is no caller of makeUniqueTypeName() who
>  > pass tryOriginal = true.
> I partially agree with you. But I have one reason to leave
> makeUniqueTypeName() separated:
> It may be used in other codes with auto generated types. For example, I
> think, the DefineRelation routine should choose composite type instead
> of using the same name as the table.

Okay.

I have one comment on the v2 patch:

 +   for(;;)
 {
 -   dest[i - 1] = '_';
 -   strlcpy(dest + i, typeName, NAMEDATALEN - i);
 -   if (namelen + i >= NAMEDATALEN)
 -   truncate_identifier(dest, NAMEDATALEN, false);
 -
 if (!SearchSysCacheExists2(TYPENAMENSP,
 -  CStringGetDatum(dest),
 +  CStringGetDatum(type_name),
ObjectIdGetDatum(typeNamespace)))
 -   return pstrdup(dest);
 +   return type_name;
 +
 +   /* Previous attempt was failed. Prepare a new one. */
 +   pfree(type_name);
 +   snprintf(suffix, sizeof(suffix), "%d", ++pass);
 +   type_name = makeObjectName("", typeName, suffix);
 }

 return NULL;

I think it's better to break from the loop instead of returning from
there. That way, we won't need "return NULL".
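
For illustration, the loop could be shaped like this (a sketch based on
the hunk above, not a complete function):

	for (;;)
	{
		if (!SearchSysCacheExists2(TYPENAMENSP,
								   CStringGetDatum(type_name),
								   ObjectIdGetDatum(typeNamespace)))
			break;				/* found an unused name */

		/* Previous attempt failed. Prepare a new one. */
		pfree(type_name);
		snprintf(suffix, sizeof(suffix), "%d", ++pass);
		type_name = makeObjectName("", typeName, suffix);
	}

	return type_name;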

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Support logical replication of DDLs

2022-06-23 Thread Masahiko Sawada
On Thu, Jun 23, 2022 at 7:00 PM Amit Kapila  wrote:
>
> On Wed, Jun 22, 2022 at 11:09 AM Masahiko Sawada  
> wrote:
> >
> > I've attached a WIP patch for adding regression tests for DDL deparse.
> > The patch can be applied on
> > v9-0001-Functions-to-deparse-DDL-commands.patch Hou recently
> > submitted[1]. The basic idea is to define the event trigger to deparse
> > DDLs, run the regression tests, load the deparsed DDLs to another
> > database cluster, dump both databases, and compare the dumps.
> >
>
> Thanks for working on this. It is a good start. I think this will be
> helpful to see the missing DDL support. Do we want to run this as part
> of every regression run? Won't it increase the regression time as this
> seems to run internally the regression tests twice?

Yes, it will increase the regression test time, but we already do a
similar thing in 002_pg_upgrade.pl and 027_stream_regress.pl, and it
seems worth adding to me.

>
> Do we need a different trigger to capture drop cases as there are
> separate deparsing routines for them, for example
> deparse_drop_table()?

Right, we need to capture drop cases by another trigger.

>
> > [2] deparsing "ALTER INDEX tbl_idx ALTER COLUMN 2 SET STATISTICS
> > 1000;" causes an assertion failure.
> >
>
> Sorry, it is not clear to me whether you are talking about some
> pre-existing bug or a bug in the proposed patch?

I meant there is a bug in the v9 DDL deparse patch.

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Fix instability in subscription regression test

2022-06-23 Thread Masahiko Sawada
On Thu, Jun 23, 2022 at 8:28 PM houzj.f...@fujitsu.com
 wrote:
>
> Hi,
>
> I noticed BF member wrasse failed in 028_row_filter.pl.
>
> #   Failed test 'check publish_via_partition_root behavior'
> #   at t/028_row_filter.pl line 669.
> #  got: ''
> # expected: '1|100
> #  ...
>
> Log:
> 2022-06-23 11:27:42.387 CEST [20589:3] 028_row_filter.pl LOG: statement: 
> ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION WITH (copy_data = true)
> 2022-06-23 11:27:42.470 CEST [20589:4] 028_row_filter.pl LOG: disconnection: 
> session time: 0:00:00.098 user=nm database=postgres host=[local]
> 2022-06-23 11:27:42.611 CEST [20593:1] LOG: logical replication table 
> synchronization worker for subscription "tap_sub", table 
> "tab_rowfilter_partitioned" has started
> ...
> 2022-06-23 11:27:43.197 CEST [20610:3] 028_row_filter.pl LOG: statement: 
> SELECT a, b FROM tab_rowfilter_partitioned ORDER BY 1, 2
> ...
> 2022-06-23 11:27:43.689 CEST [20593:2] LOG: logical replication table 
> synchronization worker for subscription "tap_sub", table 
> "tab_rowfilter_partitioned" has finished
>
> From the Log, I can see it query the target table before the table sync is
> over. So, I think the reason is that we didn't wait for table sync to
> finish after refreshing the publication. Sorry for not catching that
> ealier. Here is a patch to fix it.

+1

The patch looks good to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication

2022-06-22 Thread Masahiko Sawada
Hi,

On Wed, Jun 22, 2022 at 2:44 PM RKN Sai Krishna
 wrote:
>
> Hi,
>
> Problem: Today when a data page is corrupted in the primary postgres with 
> physical replication (sync or async standbys), there seems to be no way to 
> repair it easily and we rely on PITR to recreate the postgres server or drop 
> the corrupted table (of course this is not an option for important customer 
> tables, but may be okay for some maintenance or temporary tables). PITR is 
> costly to do in a production environment oftentimes as it involves creation 
> of the full-blown postgres from the base backup and causing downtime for the 
> customers.
>
> Solution: Make use of the uncorrupted page present in sync or async standby. 
> The proposed tool/extension pg_page_repair (as we call it) can fetch the 
> uncorrupted page from sync or async standby and overwrite the corrupted one 
> on the primary. Yes, there will be a challenge in making sure that the WAL is 
> replayed completely and standby is up-to-date so that we are sure that stale 
> pages are not copied across. A simpler idea could be that the pg_page_repair 
> can wait until the standby replays/catches up with the primary's flush LSN 
> before fetching the uncorrupted page. A downside of this approach is that the 
> pg_page_repair waits for long or rather infinitely if the replication lag is 
> huge. As we all know that the replication lag is something a good postgres 
> solution will always monitor to keep it low, if true, the pg_page_repair is 
> guaranteed to not wait for longer. Another idea could be that the 
> pg_page_repair gets the base page from the standby and applies all the WAL 
> records pertaining to the corrupted page using the base page to get the 
> uncorrupted page. This requires us to pull the replay logic from the core to 
> pg_page_repair which isn't easy. Hence we propose to go with approach 1, but 
> open to discuss on approach 2 as well. We suppose that the solution proposed 
> in this thread holds good even for pages corresponding to indexes.

I'm interested in this topic and recalled I did some research on the
first idea while writing experimental code several years ago[1].

The corruption that can be fixed by this feature is mainly physical
corruption introduced, for example, by storage array cache corruption,
array firmware bugs, or filesystem bugs. Is that right? Logically corrupt
blocks are much more likely to have been introduced as a result of a
failure or a bug in PostgreSQL, which would end up propagating to
physical standbys.

>
> Implementation Choices: pg_page_repair can either take the corrupted page 
> info (db id, rel id, block number etc.) or just a relation name and 
> automatically figure out the corrupted page using pg_checksums for instance 
> or just database name and automatically figure out all the corrupted pages. 
> It can either repair the corrupted pages online (only the corrupted table is 
> inaccessible, the server continues to run) or take downtime if there are many 
> corrupted pages.

Since the server must be shut down cleanly before running pg_checksums,
if we want to verify page checksums while the server is running we
would need the online checksum verification we discussed before[2].

>
> Future Scope: pg_page_repair can be integrated to the core so that the 
> postgres will repair the pages automatically without manual intervention.
>
> Other Solutions: We did consider an approach where the tool could obtain the 
> FPI from WAL and replay till the latest WAL record to repair the page. But 
> there could be limitations such as FPI and related WAL not being available in 
> primary/archive location.

How do we efficiently find the FPI of the corrupted page in the WAL? We
could scan WAL records backward, but that could take quite a long
time.

Regards,

[1] https://github.com/MasahikoSawada/pgtools/tree/master/page_repair
[2] 
https://www.postgresql.org/message-id/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Missing reference to pgstat_replslot.c in pgstat.c

2022-06-22 Thread Masahiko Sawada
On Wed, Jun 22, 2022 at 3:29 PM Drouvot, Bertrand  wrote:
>
> Hi hackers,
>
> I think there's a missing reference to pgstat_replslot.c in pgstat.c.
>
> Attached a tiny patch to fix it.

+1

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Support logical replication of DDLs

2022-06-21 Thread Masahiko Sawada
On Wed, Jun 15, 2022 at 1:00 PM Amit Kapila  wrote:
>
> On Wed, Jun 15, 2022 at 5:44 AM Zheng Li  wrote:
> >
> >
> > While I agree that the deparser is needed to handle the potential
> > syntax differences between
> > the pub/sub, I think it's only relevant for the use cases where only a
> > subset of tables in the database
> > are replicated. For other use cases where all tables, functions and
> > other objects need to be replicated,
> > (for example, creating a logical replica for major version upgrade)
> > there won't be any syntax difference to
> > handle and the schemas are supposed to match exactly between the
> > pub/sub. In other words the user seeks to create an identical replica
> > of the source database and the DDLs should be replicated
> > as is in this case.
> >
>
> I think even for database-level replication we can't assume that
> source and target will always have the same data in which case "Create
> Table As ..", "Alter Table .. " kind of statements can't be replicated
> as it is because that can lead to different results. The other point
> is tomorrow we can extend the database level option/syntax to exclude
> a few objects (something like [1]) as well in which case we again need
> to filter at the publisher level.

Good point.

Regarding the idea of using the parse-tree representation produced by
nodeToString(), I’ve not read the patch yet but I'm not sure it's a
good idea. A node's field name could change in a major version. If a
publisher sends a parse-tree string representation to a subscriber on a
newer major version, the subscriber needs to be able to parse the
old-format parse-tree string in order to reconstruct the DDL, which
hurts maintainability considerably. On the other hand, the format of
the deparsed JSON string would not change often.

>
> >
>  So I think it's an overkill to use deparser for
> > such use cases. It also costs more space and
> > time using deparsing. For example, the following simple ALTER TABLE
> > command incurs 11 times more space
> > in the WAL record if we were to use the format from the deparser,
> > there will also be time and CPU overhead from the deparser.
> >
> ...
> >
> > So I think it's better to define DDL replication levels [1] to tailor
> > for the two different use cases. We can use different logging format
> > based on the DDL replication level. For example,
> > we can simply log the DDL query string and the search_path for
> > database level DDL replication. But for table level DDL replication we
> > need to use the deparser format in order to
> > handle the potential syntax differences and schema mapping requests.
> >
>
> I think having different logging formats is worth considering but I am
> not sure we can distinguish it for database and table level
> replication because of the reasons mentioned above. One thing which
> may need a different format is the replication of global objects like
> roles, tablespace, etc. but we haven't analyzed them in detail about
> those. I feel we may also need a different syntax altogether to
> replicate such objects. Also, I think we may want to optimize the
> current format in some cases so that the WAL amount could be reduced.
>
> I feel if we think that deparsing is required for this project then
> probably at this stage it would be a good idea to explore ways to have
> independent ways to test it. One way is to do testing via the logical
> replication of DDL (aka via this patch) and the other is to write an
> independent test suite as Sawada-San seems to be speculating above
> [2]. I am not sure if there is any progress yet on the independent
> test suite front yet.

I've attached a WIP patch for adding regression tests for DDL deparse.
The patch can be applied on top of
v9-0001-Functions-to-deparse-DDL-commands.patch, which Hou recently
submitted[1]. The basic idea is to define an event trigger that
deparses DDLs, run the regression tests, load the deparsed DDLs into
another database cluster, dump both databases, and compare the dumps.
Since the patch doesn't support deparsing all DDLs and there is a
bug[2], the attached regression test runs only CREATE TABLE and some
ALTER TABLE commands instead of the full regression tests.

Regards,

[1] 
https://www.postgresql.org/message-id/OS0PR01MB5716B1526C2EDA66907E733B94B39%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] deparsing "ALTER INDEX tbl_idx ALTER COLUMN 2 SET STATISTICS
1000;" causes an assertion failure.

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


0001-WIP-Add-regression-tests-for-DDL-deparse.patch
Description: Binary data


Re: Add index scan progress to pg_stat_progress_vacuum

2022-06-20 Thread Masahiko Sawada
On Mon, Jun 6, 2022 at 11:42 PM Robert Haas  wrote:
>
> On Thu, May 26, 2022 at 11:43 AM Masahiko Sawada  
> wrote:
> > Another idea I came up with is that we can wait for all index vacuums
> > to finish while checking and updating the progress information, and
> > then calls WaitForParallelWorkersToFinish after confirming all index
> > status became COMPLETED. That way, we don’t need to change the
> > parallel query infrastructure. What do you think?
>
> +1 from me. It doesn't seem to me that we should need to add something
> like parallel_vacuum_progress_callback in order to solve this problem,
> because the parallel index vacuum code could just do the waiting
> itself, as you propose here.
>
> The question Sami asks him his reply is a good one, though -- who is
> to say that the leader only needs to update progress at the end, once
> it's finished the index it's handling locally? There will need to be a
> callback system of some kind to allow the leader to update progress as
> other workers finish, even if the leader is still working. I am not
> too sure that the idea of using the vacuum delay points is the best
> plan. I think we should try to avoid piggybacking on such general
> infrastructure if we can, and instead look for a way to tie this to
> something that is specific to parallel vacuum. However, I haven't
> studied the problem so I'm not sure whether there's a reasonable way
> to do that.

One idea would be to add a flag, say report_parallel_vacuum_progress,
to the IndexVacuumInfo struct and expect index AMs to check it and
update the parallel index vacuum progress, say every 1GB of blocks
processed. The flag would be true only when the leader process is
vacuuming an index.
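
For illustration, a sketch of the idea (the flag name is the one
proposed above; the threshold constant and the progress-update helper
are hypothetical, not existing API):

/* hypothetical addition to IndexVacuumInfo in access/genam.h */
bool		report_parallel_vacuum_progress;

/* roughly every 1GB of index blocks processed */
#define REPORT_PARALLEL_VACUUM_EVERY_PAGES \
	((BlockNumber) (((uint64) 1024 * 1024 * 1024) / BLCKSZ))

/* in an index AM's vacuum scan loop */
if (info->report_parallel_vacuum_progress &&
	blkno % REPORT_PARALLEL_VACUUM_EVERY_PAGES == 0)
	parallel_vacuum_update_progress();	/* hypothetical helper */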

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-19 Thread Masahiko Sawada
Hi,

On Thu, Jun 16, 2022 at 4:30 PM John Naylor
 wrote:
>
> On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada  
> wrote:
> > I've attached an updated version patch that changes the configure
> > script. I'm still studying how to support AVX2 on msvc build. Also,
> > added more regression tests.
>
> Thanks for the update, I will take a closer look at the patch in the
> near future, possibly next week.

Thanks!

> For now, though, I'd like to question
> why we even need to use 32-byte registers in the first place. For one,
> the paper referenced has 16-pointer nodes, but none for 32 (next level
> is 48 and uses a different method to find the index of the next
> pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> of his patch a couple weeks ago I don't recall a reason mentioned for
> it.

I might be wrong, but since the AVX2 instruction set was introduced
with the Haswell microarchitecture in 2013 and the referenced paper was
published in the same year, the ART paper didn't use the AVX2
instruction set. 32-pointer nodes are better from a memory perspective,
as you mentioned. Andres' prototype supports both 16-pointer and
32-pointer nodes (out of 6 node types). This would provide better
memory usage, but on the other hand it would also bring the overhead of
switching between node types. Anyway, which node sizes to support is an
important design decision. It should be made based on experiment
results and documented.

> Even if 32-pointer nodes are better from a memory perspective, I
> imagine it should be possible to use two SSE2 registers to find the
> index. It'd be locally slightly more complex, but not much. It might
> not even cost much more in cycles since AVX2 would require indirecting
> through a function pointer. It's much more convenient if we don't need
> a runtime check.

Right.

> There are also thermal and power disadvantages when
> using AXV2 in some workloads. I'm not sure that's the case here, but
> if it is, we'd better be getting something in return.

Good point.

> One more thing in general: In an earlier version, I noticed that
> Andres used the slab allocator and documented why. The last version of
> your patch that I saw had the same allocator, but not the "why".
> Especially in early stages of review, we want to document design
> decisions so it's more clear for the reader.

Indeed. I'll add comments in the next version patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Skipping logical replication transactions on subscriber side

2022-06-16 Thread Masahiko Sawada
On Thu, Jun 16, 2022 at 2:27 AM Robert Haas  wrote:
>
> On Tue, Jun 14, 2022 at 3:54 AM Masahiko Sawada  wrote:
> > > AFAICS, we could do that by:
> > >
> > > 1. De-supporting platforms that have this problem, or
> > > 2. Introducing new typalign values, as Noah proposed back on April 2, or
> > > 3. Somehow forcing values that are sometimes 4-byte aligned and
> > > sometimes 8-byte aligned to be 8-byte alignment on all platforms
> >
> > Introducing new typalign values seems a good idea to me as it's more
> > future-proof. Will this item be for PG16, right? The main concern
> > seems that what this test case enforces would be nuisance when
> > introducing a new system catalog or a new column to the existing
> > catalog but given we're in post PG15-beta1 it is unlikely to happen in
> > PG15.
>
> I agree that we're not likely to introduce a new typalign value any
> sooner than v16. There are a couple of things that bother me about
> that solution. One is that I don't know how many different behaviors
> exist out there in the wild. If we distinguish the alignment of double
> from the alignment of int8, is that good enough, or are there other
> data types whose properties aren't necessarily the same as either of
> those?

Yeah, there might be.

> The other is that 32-bit systems are already relatively rare
> and probably will become more rare until they disappear completely. It
> doesn't seem like a ton of fun to engineer solutions to problems that
> may go away by themselves with the passage of time.

IIUC the systems affected by this problem are not necessarily 32-bit
systems. For instance, hoverfly on the buildfarm is a 64-bit system but
was affected by this problem. According to the XLC manual[1], there is
no difference between 32-bit and 64-bit systems in terms of alignment
for double. FWIW, looking at the manual, a possible solution on AIX
might have been to specify the -qalign=natural compiler option in order
to force 8-byte alignment for double.

Regards,

[1] https://support.scinet.utoronto.ca/Manuals/xlC++-proguide.pdf;
Table 11 on page 10.

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-15 Thread Masahiko Sawada
On Wed, May 25, 2022 at 11:48 AM Masahiko Sawada  wrote:
>
> On Tue, May 10, 2022 at 6:58 PM John Naylor
>  wrote:
> >
> > On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada  
> > wrote:
> > >
> > > Overall, radix tree implementations have good numbers. Once we got an
> > > agreement on moving in this direction, I'll start a new thread for
> > > that and move the implementation further; there are many things to do
> > > and discuss: deletion, API design, SIMD support, more tests etc.
> >
> > +1
> >
>
> Thanks!
>
> I've attached an updated version patch. It is still WIP but I've
> implemented deletion and improved test cases and comments.

I've attached an updated version of the patch that changes the
configure script. I'm still studying how to support AVX2 in the MSVC
build. I've also added more regression tests.

The integration with lazy vacuum and parallel vacuum is missing for
now. In order to support parallel vacuum, the radix tree needs to
support being created in DSA memory.

I've added this item to the next CF.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index d3562d6fee..a56d6e89da 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -676,3 +676,27 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_ARMV8_CRC32C_INTRINSICS
+
+# PGAC_AVX2_INTRINSICS
+# 
+# Check if the compiler supports the Intel AVX2 instructinos.
+#
+# If the intrinsics are supported, sets pgac_avx2_intrinsics, and CFLAGS_AVX2.
+AC_DEFUN([PGAC_AVX2_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm256_set_1_epi8 _mm256_cmpeq_epi8 _mm256_movemask_epi8 CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [__m256i vec = _mm256_set1_epi8(0);
+   __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+   return _mm256_movemask_epi8(cmp) > 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_AVX2="$1"
+  pgac_avx2_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX2_INTRINSICS
diff --git a/configure b/configure
index 7dec6b7bf9..6ebc15a8c1 100755
--- a/configure
+++ b/configure
@@ -645,6 +645,7 @@ XGETTEXT
 MSGMERGE
 MSGFMT_FLAGS
 MSGFMT
+CFLAGS_AVX2
 PG_CRC32C_OBJS
 CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
@@ -18829,6 +18830,82 @@ $as_echo "slicing-by-8" >&6; }
 fi
 
 
+# Check for Intel AVX2 intrinsics.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=" >&5
+$as_echo_n "checking for _mm256i CFLAGS=... " >&6; }
+if ${pgac_cv_avx2_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include 
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+   __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+   return _mm256_movemask_epi8(cmp) > 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx2_intrinsics_=yes
+else
+  pgac_cv_avx2_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics_" >&5
+$as_echo "$pgac_cv_avx2_intrinsics_" >&6; }
+if test x"$pgac_cv_avx2_intrinsics_" = x"yes"; then
+  CFLAGS_AVX2=""
+  pgac_avx2_intrinsics=yes
+fi
+
+if test x"pgac_avx2_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=-mavx2" >&5
+$as_echo_n "checking for _mm256i CFLAGS=-mavx2... " >&6; }
+if ${pgac_cv_avx2_intrinsics__mavx2+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx2"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include 
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+   __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+   return _mm256_movemask_epi8(cmp) > 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx2_intrinsics__mavx2=yes
+else
+  pgac_cv_avx2_intrinsics__mavx2=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics__mavx2" >&5
+$as_echo "$pgac_cv_avx2_intrinsics__mavx2&quo

Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-06-14 Thread Masahiko Sawada
On Tue, Jun 14, 2022 at 3:57 PM Amit Kapila  wrote:
>
> On Mon, Jun 13, 2022 at 8:29 AM Masahiko Sawada  wrote:
> >
> > On Tue, Jun 7, 2022 at 9:32 PM Amit Kapila  wrote:
> > >
> > > On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > >
> > > > poc_add_regression_tests.patch adds regression tests for this bug. The
> > > > regression tests are required for both HEAD and back-patching but I've
> > > > separated this patch for testing the above two patches easily.
> > > >
> >
> > Thank you for the comments.
> >
> > >
> > > Few comments on the test case patch:
> > > ===
> > > 1.
> > > +# For the transaction that TRUNCATEd the table tbl1, the last decoding 
> > > decodes
> > > +# only its COMMIT record, because it starts from the RUNNING_XACT
> > > record emitted
> > > +# during the first checkpoint execution.  This transaction must be 
> > > marked as
> > > +# catalog-changes while decoding the COMMIT record and the decoding
> > > of the INSERT
> > > +# record must read the pg_class with the correct historic snapshot.
> > > +permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate"
> > > "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert"
> > > "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
> > >
> > > Will this test always work? What if we get an additional running_xact
> > > record between steps "s0_commit" and "s0_begin" that is logged via
> > > bgwriter? You can mimic that by adding an additional checkpoint
> > > between those two steps. If we do that, the test will pass even
> > > without the patch because I think the last decoding will start
> > > decoding from this new running_xact record.
> >
> > Right. It could pass depending on the timing but doesn't fail
> > depending on the timing. I think we need to somehow stop bgwriter to
> > make the test case stable but it seems unrealistic.
> >
>
> Agreed, in my local testing for this case, I use to increase
> LOG_SNAPSHOT_INTERVAL_MS to avoid such a situation but I understand it
> is not practical via test.
>
> > Do you have any
> > better ideas?
> >
>
> No, I don't have any better ideas. I think it is better to add some
> information related to this in the comments because it may help to
> improve the test in the future if we come up with a better idea.

I also don't have any better ideas for making it stable, and I agree.
I've attached an updated version of the patch adding the regression
tests.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


poc_add_regression_tests_v2.patch
Description: Binary data


Re: Skipping logical replication transactions on subscriber side

2022-06-14 Thread Masahiko Sawada
On Mon, Jun 13, 2022 at 11:25 PM Robert Haas  wrote:
>
> On Sun, Apr 17, 2022 at 11:22 PM Noah Misch  wrote:
> > > Yes, but it could be false positives in some cases. For instance, the
> > > column {oid, bool, XLogRecPtr} should be okay on ALIGNOF_DOUBLE == 4
> > > and 8 platforms but the new test fails.
> >
> > I'm happy with that, because the affected author should look for 
> > padding-free
> > layouts before settling on your example layout.  If the padding-free layouts
> > are all unacceptable, the author should update the expected sanity_check.out
> > to show the one row where the test "fails".
>
> I realize that it was necessary to get something committed quickly
> here to unbreak the buildfarm, but this is really a mess. As I
> understand it, the problem here is that typalign='d' is either 4 bytes
> or 8 depending on how the 'double' type is aligned on that platform,
> but we use that typalign value also for some other data types that may
> not be aligned in the same way as 'double'. Consequently, it's
> possible to have a situation where the behavior of the C compiler
> diverges from the behavior of heap_form_tuple(). To avoid that, we
> need every catalog column that uses typalign=='d' to begin on an
> 8-byte boundary. We also want all such columns to occur before the
> first NameData column in the catalog, to guard against the possibility
> that NAMEDATALEN has been redefined to an odd value. I think this set
> of constraints is a nuisance and that it's mostly good luck we haven't
> run into any really awkward problems here so far.
>
> In many of our catalogs, the first member is an OID and the second
> member of the struct is of type NameData: pg_namespace, pg_class,
> pg_proc, etc. That common design pattern is in direct contradiction to
> the desires of this test case. As soon as someone wants to add a
> typalign='d' member to any of those system catalogs, the struct layout
> is going to have to get shuffled around -- and then it will look
> different from all the other ones. Or else we'd have to rearrange them
> all to move all the NameData columns to the end. I feel like it's
> weird to introduce a test case that so obviously flies in the face of
> how catalog layout has been done up to this point, especially for the
> sake of a hypothetical user who want to set NAMEDATALEN to an odd
> number. I doubt such scenarios have been thoroughly tested, or ever
> will be. Perhaps instead we ought to legislate that NAMEDATALEN must
> be a multiple of 8, or some such thing.
>
> The other constraint, that typalign='d' fields must always fall on an
> 8 byte boundary, is probably less annoying in practice, but it's easy
> to imagine a future catalog running into trouble. Let's say we want to
> introduce a new catalog that has only an Oid column and a float8
> column. Perhaps with 0-3 bool or uint8 columns as well, or with any
> number of NameData columns as well. Well, the only way to satisfy this
> constraint is to put the float8 column first and the Oid column after
> it, which immediately makes it look different from every other catalog
> we have. It's hard to feel like that would be a good solution here. I
> think we ought to try to engineer a solution where heap_form_tuple()
> is going to do the same thing as the C compiler without the sorts of
> extra rules that this test case enforces.

These seem to be valid concerns.

> AFAICS, we could do that by:
>
> 1. De-supporting platforms that have this problem, or
> 2. Introducing new typalign values, as Noah proposed back on April 2, or
> 3. Somehow forcing values that are sometimes 4-byte aligned and
> sometimes 8-byte aligned to be 8-byte alignment on all platforms

Introducing new typalign values seems like a good idea to me as it's
more future-proof. This item will be for PG16, right? The main concern
seems to be that what this test case enforces would be a nuisance when
introducing a new system catalog or a new column to an existing
catalog, but given that we're past PG15 beta1 that is unlikely to
happen in PG15.


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-06-12 Thread Masahiko Sawada
On Tue, Jun 7, 2022 at 9:32 PM Amit Kapila  wrote:
>
> On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada  
> wrote:
> >
> > On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada  
> > wrote:
> > >
> >
> > poc_add_regression_tests.patch adds regression tests for this bug. The
> > regression tests are required for both HEAD and back-patching but I've
> > separated this patch for testing the above two patches easily.
> >

Thank you for the comments.

>
> Few comments on the test case patch:
> ===
> 1.
> +# For the transaction that TRUNCATEd the table tbl1, the last decoding 
> decodes
> +# only its COMMIT record, because it starts from the RUNNING_XACT
> record emitted
> +# during the first checkpoint execution.  This transaction must be marked as
> +# catalog-changes while decoding the COMMIT record and the decoding
> of the INSERT
> +# record must read the pg_class with the correct historic snapshot.
> +permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate"
> "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert"
> "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
>
> Will this test always work? What if we get an additional running_xact
> record between steps "s0_commit" and "s0_begin" that is logged via
> bgwriter? You can mimic that by adding an additional checkpoint
> between those two steps. If we do that, the test will pass even
> without the patch because I think the last decoding will start
> decoding from this new running_xact record.

Right. Depending on the timing, the test could pass even without the
fix, but it doesn't fail because of the timing. I think we would need
to somehow stop the bgwriter to make the test case stable, but that
seems unrealistic. Do you have any better ideas?

>
> 2.
> +step "s1_get_changes" { SELECT data FROM
> pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
> 'include-xids', '0'); }
>
> It is better to skip empty transactions by using 'skip-empty-xacts' to
> avoid any transaction from a background process like autovacuum. We
> have previously seen some buildfarm failures due to that.

Agreed.

>
> 3. Did you intentionally omit the .out from the test case patch?

No, I'll add the .out file in the next version of the patch.

>
> 4.
> This transaction must be marked as
> +# catalog-changes while decoding the COMMIT record and the decoding
> of the INSERT
> +# record must read the pg_class with the correct historic snapshot.
>
> /marked as catalog-changes/marked as containing catalog changes

Agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Add index scan progress to pg_stat_progress_vacuum

2022-06-02 Thread Masahiko Sawada
On Fri, May 27, 2022 at 10:52 AM Imseih (AWS), Sami  wrote:
>
> >Another idea I came up with is that we can wait for all index vacuums
> >to finish while checking and updating the progress information, and
> >then calls WaitForParallelWorkersToFinish after confirming all index
> >status became COMPLETED. That way, we don’t need to change the
> >parallel query infrastructure. What do you think?
>
> Thinking about this a bit more, the idea of using
> WaitForParallelWorkersToFinish
> Will not work if you have a leader worker that is
> stuck on a large index. The progress will not be updated
> until the leader completes. Even if the parallel workers
> finish.

Right.

>
> What are your thought about piggybacking on the
> vacuum_delay_point to update progress. The leader can
> perhaps keep a counter to update progress every few thousand
> calls to vacuum_delay_point.
>
> This goes back to your original idea to keep updating progress
> while scanning the indexes.

I think we can have the leader process wait for all index statuses to
become COMPLETED before calling WaitForParallelWorkersToFinish(). While
waiting, the leader can update its progress information. Once the
leader has confirmed that all index statuses became COMPLETED, it can
wait for the workers to finish with WaitForParallelWorkersToFinish().
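
A very rough sketch of what I have in mind is below (untested; the
progress counter name PROGRESS_VACUUM_INDEXES_COMPLETED and some field
accesses are my assumptions, not actual code):

/*
 * Poll the shared index status array, report how many indexes are done,
 * and only afterwards wait for the parallel workers to exit.
 */
static void
parallel_vacuum_wait_for_index_completion(ParallelVacuumState *pvs)
{
	for (;;)
	{
		int			ncompleted = 0;

		for (int i = 0; i < pvs->nindexes; i++)
		{
			/* PVIndVacStatus can be read atomically */
			if (pvs->indstats[i].status == PARALLEL_INDVAC_STATUS_COMPLETED)
				ncompleted++;
		}

		/* hypothetical progress counter for indexes vacuumed so far */
		pgstat_progress_update_param(PROGRESS_VACUUM_INDEXES_COMPLETED,
									 ncompleted);

		if (ncompleted == pvs->nindexes)
			break;

		/* don't busy-wait; wake up on latch set or after a short timeout */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 100L, WAIT_EVENT_PARALLEL_FINISH);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}

	/* every index is COMPLETED, so this should return quickly */
	WaitForParallelWorkersToFinish(pvs->pcxt);
}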

Regarding updating the progress in vacuum_delay_point, it might have
side effects since it's called for every page and is used not only by
vacuum but also by analyze, but it seems worth trying.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Perform streaming logical transactions by background workers and parallel apply

2022-05-31 Thread Masahiko Sawada
On Tue, May 31, 2022 at 5:53 PM Amit Kapila  wrote:
>
> On Mon, May 30, 2022 at 5:08 PM Amit Kapila  wrote:
> >
> > On Mon, May 30, 2022 at 2:22 PM wangw.f...@fujitsu.com
> >  wrote:
> > >
> > > Attach the new patches(only changed 0001 and 0002)
> > >
> >
>
> This patch allows the same replication origin to be used by the main
> apply worker and the bgworker that uses it to apply streaming
> transactions. See the changes [1] in the patch. I am not completely
> sure whether that is a good idea even though I could not spot or think
> of problems that can't be fixed in your patch. I see that currently
> both the main apply worker and bgworker will assign MyProcPid to the
> assigned origin slot, this can create the problem because
> ReplicationOriginExitCleanup() can clean it up even though the main
> apply worker or another bgworker is still using that origin slot.

Good point.

> Now,
> one way to fix is that we assign only the main apply worker's
> MyProcPid to session_replication_state->acquired_by. I have tried to
> think about the concurrency issues as multiple workers could now point
> to the same replication origin state. I think it is safe because the
> patch maintains the commit order by allowing only one process to
> commit at a time, so no two workers will be operating on the same
> origin at the same time. Even, though there is no case where the patch
> will try to advance the session's origin concurrently, it appears safe
> to do so as we change/advance the session_origin LSNs under
> replicate_state LWLock.

Right. That way, the cleanup is done only by the main apply worker.
For safety, the bgworker could also check that the origin has already
been acquired by its leader (the main apply worker) process.
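
In replorigin_session_setup(), that could look roughly like the
following (only a sketch; passing the leader's PID down as a leader_pid
argument is my assumption about how the patch would do it):

		if (curstate->acquired_by != 0 &&
			curstate->acquired_by != leader_pid)
			ereport(ERROR,
					(errcode(ERRCODE_OBJECT_IN_USE),
					 errmsg("replication origin with OID %d is already active for PID %d",
							curstate->roident, curstate->acquired_by)));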

>
> Another idea could be that we allow multiple replication origins (one
> for each bgworker and one for the main apply worker) for the apply
> workers corresponding to a subscription. Then on restart, we can find
> the highest LSN among all the origins for a subscription. This should
> work primarily because we will maintain the commit order. Now, for
> this to work we need to somehow map all the origins for a subscription
> and one possibility is that we have a subscription id in each of the
> origin names. Currently we use ("pg_%u", MySubscription->oid) as
> origin_name. We can probably append some unique identifier number for
> each worker to allow each origin to have a subscription id. We need to
> drop all origins for a particular subscription on DROP SUBSCRIPTION. I
> think having multiple origins for the same subscription will have some
> additional work when we try to filter changes based on origin.

That also seems workable, but it needs additional work and resources.

> The advantage of the first idea is that it won't increase the need to
> have more origins per subscription but it is quite possible that I am
> missing something and there are problems due to which we can't use
> that approach.

I prefer the first idea as it's simpler than the second one. I don't
see any concurrency problem so far, unless I'm missing something.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Support logical replication of DDLs

2022-05-30 Thread Masahiko Sawada
On Fri, May 27, 2022 at 11:03 AM Amit Kapila  wrote:
>
> On Fri, May 27, 2022 at 3:49 AM Zheng Li  wrote:
> >
> > Hi Masahiko,
> >
> > > Thank you for updating the patches!
> > >
> > > I've not looked at these patches in-depth yet but with this approach,
> > > what do you think we can handle the DDL syntax differences between
> > > major versions? DDL syntax or behavior could be changed by future
> > > changes and I think we need to somehow deal with the differences. For
> >
> > > example, if the user uses logical replication for major version
> > > upgrade, the publisher is older than the subscriber. We might have to
> > > rewrite the DDL before applying to the subscriber because the DDL
> > > executed on the publisher no longer work on a new PostgreSQL version
> >
> > I don't think we will allow this kind of situation to happen in the
> > first place for
> > backward compatibility. If a DDL no longer works on a new version of
> > PostgreSQL, the user will have to change the application code as well.
> > So even if it happens for
> > whatever reason, we could either
> > 1. fail the apply worker and let the user fix such DDL because they'll
> > have to fix the application code anyway when this happens.
> > 2. add guard rail logic in the apply worker to automatically fix such
> > DDL if possible, knowing the version of the source and target. Similar
> > logic must have been implemented for pg_dump/restore/upgrade.
> >
> > > or we might have to add some options to the DDL before the application
> > > in order to keep the same behavior. This seems to require a different
> > > solution from what the patch does for the problem you mentioned such
> >
> > > as "DDL involving multiple tables where only some tables are
> > > replicated”.
> >
> > First of all, this case can only happen when the customer chooses to
> > only replicate a subset of the tables in a database in which case
> > table level DDL replication is chosen instead of database level DDL
> > replication (where all tables
> > and DDLs are replicated). I think the solution would be:
> > 1. make best effort to detect such DDLs on the publisher and avoid
> > logging of such DDLs in table level DDL replication.
> > 2. apply worker will fail to replay such command due to missing
> > objects if such DDLs didn't get filtered on the publisher for some
> > reason. This should be rare and I think it's OK even if it happens,
> > we'll find out
> > why and fix it.
> >
>
> FWIW, both these cases could be handled with the deparsing approach,
> and the handling related to the drop of multiple tables where only a
> few are published is already done in the last POC patch shared by Ajin
> [1].
>

Right. So I'm inclined to think that the deparsing approach is better
on this point as well as on the point Álvaro mentioned before[1].

Regards,

[1] 
https://www.postgresql.org/message-id/202204081134.6tcmf5cxl3sz%40alvherre.pgsql

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-05-29 Thread Masahiko Sawada
On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada  wrote:
>
> On Tue, May 24, 2022 at 2:18 PM Amit Kapila  wrote:
> >
> > On Tue, May 24, 2022 at 7:58 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, May 23, 2022 at 2:39 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, May 23, 2022 at 10:03 AM Kyotaro Horiguchi
> > > >  wrote:
> > > > >
> > > > > At Sat, 21 May 2022 15:35:58 +0530, Amit Kapila 
> > > > >  wrote in
> > > > > > I think if we don't have any better ideas then we should go with
> > > > > > either this or one of the other proposals in this thread. The other
> > > > > > idea that occurred to me is whether we can somehow update the 
> > > > > > snapshot
> > > > > > we have serialized on disk about this information. On each
> > > > > > running_xact record when we serialize the snapshot, we also try to
> > > > > > purge the committed xacts (via SnapBuildPurgeCommittedTxn). So, 
> > > > > > during
> > > > > > that we can check if there are committed xacts to be purged and if 
> > > > > > we
> > > > > > have previously serialized the snapshot for the prior running xact
> > > > > > record, if so, we can update it with the list of xacts that have
> > > > > > catalog changes. If this is feasible then I think we need to somehow
> > > > > > remember the point where we last serialized the snapshot (maybe by
> > > > > > using builder->last_serialized_snapshot). Even, if this is feasible 
> > > > > > we
> > > > > > may not be able to do this in back-branches because of the 
> > > > > > disk-format
> > > > > > change required for this.
> > > > > >
> > > > > > Thoughts?
> > >
> > > It seems to work, could you draft the patch?
> > >
> >
> > I can help with the review and discussion.
>
> Okay, I'll draft the patch for this idea.

I've attached three POC patches:

poc_remember_last_running_xacts_v2.patch is a rebased patch of my
previous proposal[1]. This is based on the original proposal: we
remember the last-running-xacts list of the first decoded
RUNNING_XACTS record and, for a transaction whose commit record has
XACT_XINFO_HAS_INVALS, check whether its xid is in that list. This
doesn't require any file format changes, but the transaction will end
up being added to the snapshot even if it has only relcache
invalidations.
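
At commit decoding time the check would be something like this (pseudo
code only; last_running_xacts and n_last_running_xacts are names I made
up for the remembered list):

	/*
	 * The commit record carries invalidations and the xid was running in
	 * the first decoded RUNNING_XACTS record, so treat the transaction as
	 * one that has catalog changes.
	 */
	if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0 &&
		TransactionIdInArray(xid, builder->last_running_xacts,
							 builder->n_last_running_xacts))
		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);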

poc_add_running_catchanges_xacts_to_serialized_snapshot.patch is a
patch for the idea Amit Kapila proposed, with some changes. The basic
approach is to remember the list of xids that changed catalogs and
were running when serializing the snapshot. The list of xids is kept
in the snapshot builder state and is serialized to and restored from
the serialized snapshot. When decoding a commit record, we check
whether the transaction is already marked as containing catalog
changes or its xid is in the list; if so, we add it to the snapshot.
Unlike the first patch, it properly adds only transactions that have
actually changed catalogs, but as Amit mentioned before, this idea
cannot be back-patched as it changes the on-disk format of the
serialized snapshot.

poc_add_regression_tests.patch adds regression tests for this bug. The
regression tests are required for both HEAD and the back branches, but
I've kept them in a separate patch so that the above two patches can be
tested easily.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8da5f9089c..19123cbfa3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -4821,6 +4821,45 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	txn->toast_hash = NULL;
 }
 
+/*
+ * Return palloc'ed array of the transactions that have changed catalogs.
+ * The returned array is sorted in xidComparator order.
+ *
+ * The caller must free the returned array when done with it.
+ */
+TransactionId *
+ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
+{
+	HASH_SEQ_STATUS hash_seq;
+	ReorderBufferTXNByIdEnt *ent;
+	TransactionId *xids;
+	size_t	xcnt = 0;
+	size_t	xcnt_space = 64; /* arbitrary number */
+
+	xids = (TransactionId *) palloc(sizeof(TransactionId) * xcnt_space);
+
+	hash_seq_init(&hash_seq, rb->by_txn);
+	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	{
+		ReorderBufferTXN *txn = ent->txn;
+
+		if (!rbtxn_has_catalog_changes(txn))
+			continue;
+
+		if (xcnt >= xcnt_space)
+		{
+			xcnt_space *= 2;
+			xids = repalloc(xids, 

Re: Support logical replication of DDLs

2022-05-27 Thread Masahiko Sawada
On Fri, May 27, 2022 at 7:19 AM Zheng Li  wrote:
>
> Hi Masahiko,
>
> > Thank you for updating the patches!
> >
> > I've not looked at these patches in-depth yet but with this approach,
> > what do you think we can handle the DDL syntax differences between
> > major versions? DDL syntax or behavior could be changed by future
> > changes and I think we need to somehow deal with the differences. For
>
> > example, if the user uses logical replication for major version
> > upgrade, the publisher is older than the subscriber. We might have to
> > rewrite the DDL before applying to the subscriber because the DDL
> > executed on the publisher no longer work on a new PostgreSQL version
>
> I don't think we will allow this kind of situation to happen in the
> first place for
> backward compatibility.

It seems like a big limitation to me.

> If a DDL no longer works on a new version of
> PostgreSQL, the user will have to change the application code as well.
> So even if it happens for
> whatever reason, we could either
> 1. fail the apply worker and let the user fix such DDL because they'll
> have to fix the application code anyway when this happens.

Once the apply worker has received the DDL, if the DDL doesn't work on
the subscriber, it will keep failing and retrying until the problem is
fixed. If the failure is due to a syntax error, how would the user fix
it?

> 2. add guard rail logic in the apply worker to automatically fix such
> DDL if possible, knowing the version of the source and target. Similar
> logic must have been implemented for pg_dump/restore/upgrade.

Unless I'm missing something, there is no such logic in
pg_dump/restore/upgrade. When we use pg_dump/pg_restore for a major
version upgrade, we use the newer-version pg_dump to fetch objects from
the older-version server and then restore them with the newer-version
pg_restore.

>
> > or we might have to add some options to the DDL before the application
> > in order to keep the same behavior. This seems to require a different
> > solution from what the patch does for the problem you mentioned such
>
> > as "DDL involving multiple tables where only some tables are
> > replicated”.
>
> First of all, this case can only happen when the customer chooses to
> only replicate a subset of the tables in a database in which case
> table level DDL replication is chosen instead of database level DDL
> replication (where all tables
> and DDLs are replicated). I think the solution would be:
> 1. make best effort to detect such DDLs on the publisher and avoid
> logging of such DDLs in table level DDL replication.

I think it's better to support this case.

> 2. apply worker will fail to replay such command due to missing
> objects if such DDLs didn't get filtered on the publisher for some
> reason. This should be rare and I think it's OK even if it happens,
> we'll find out
> why and fix it.

I'm not sure it's rare, since replicating a subset of tables is a
common use case of logical replication. But even if we want to go this
way, I think we should consider how to handle it at this stage;
otherwise we will end up redesigning it later.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Add index scan progress to pg_stat_progress_vacuum

2022-05-26 Thread Masahiko Sawada
On Fri, May 6, 2022 at 4:26 AM Imseih (AWS), Sami  wrote:
>
> Thank you for the feedback!
>
> >I think we can pass the progress update function to
> >   WaitForParallelWorkersToFinish(), which seems simpler. And we can call
>
> Directly passing the callback to WaitForParallelWorkersToFinish
> will require us to modify the function signature.
>
> To me, it seemed simpler and touches less code to have
> the caller set the callback in the ParallelContext.

Okay, but if we do that, I think we should add a comment about when
it's used: the callback is invoked only by
WaitForParallelWorkersToFinish(), not by
WaitForParallelWorkersToExit().
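
Just to sketch the comment wording I have in mind (this is illustrative
only, not the real ParallelContext definition, and the callback field
names are my guesses at what the patch adds):

typedef void (*ParallelWorkerProgressCallback) (void *arg);

struct IllustrativeParallelContext
{
	/*
	 * If set, invoked periodically while the leader waits in
	 * WaitForParallelWorkersToFinish() so that it can update its progress
	 * information; it is NOT invoked by WaitForParallelWorkersToExit().
	 */
	ParallelWorkerProgressCallback parallel_progress_callback;
	void	   *parallel_progress_callback_arg;
};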

Another idea I came up with is to wait for all index vacuums to finish
while checking and updating the progress information, and then call
WaitForParallelWorkersToFinish() after confirming that all index
statuses became COMPLETED. That way, we don't need to change the
parallel query infrastructure. What do you think?

>
> >the function after updating the index status to
> >PARALLEL_INDVAC_STATUS_COMPLETED.
>
> I also like this better. Will make the change.
>
> >BTW, currently we don't need a lock for touching index status since
> >each worker touches different indexes. But after this patch, the
> >leader will touch all index status, do we need a lock for that?
>
> I do not think locking is needed here. The leader and workers
> will continue to touch different indexes to update the status.
>
> However, if the process is a leader, it will call the function
> which will go through indstats and count how many
> Indexes have a status of PARALLEL_INDVAC_STATUS_COMPLETED.
> This value is then reported to the leaders backend only.

I was concerned that the leader process could report wrong progress if
updating and checking the index status happened concurrently. But I
think it should be fine since PVIndVacStatus can be read atomically.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Support logical replication of DDLs

2022-05-26 Thread Masahiko Sawada
On Sat, May 14, 2022 at 6:02 AM Zheng Li  wrote:
>
> > > 4. The handling related to partition tables seems missing because, on
> > > the subscriber-side, it always creates a relation entry in
> > > pg_subscription_rel which won't work. Check its interaction with
> > > publish_via_partition_root.
> >
> > I will test it out.
>
> Hi,
>
> patch 0010 properly handles partitioned table creation on the apply
> worker. Whether a replicated partitioned table should be added to
> pg_subscription_rel catalog depends on the setting of
> publish_via_partition_root of the publication. Thus we need to connect
> to the source DB and check if the partitioned table is in
> pg_catalog.pg_publication_tables after the apply worker creates the
> partitioned table.
>
> Thanks to Borui Yang for enabling and testing replication of DDL type
> T_RenameStmt in patch 0009.
>
> I've also rebased all the patches. Github branch of the same change
> can be found here:
> https://github.com/zli236/postgres/commits/ddl_replication

Thank you for updating the patches!

I've not looked at these patches in depth yet, but with this approach,
how do you think we can handle DDL syntax differences between major
versions? DDL syntax or behavior could be changed by future releases,
and I think we need to somehow deal with the differences. For example,
if the user uses logical replication for a major version upgrade, the
publisher is older than the subscriber. We might have to rewrite the
DDL before applying it on the subscriber because the DDL executed on
the publisher no longer works on the new PostgreSQL version, or we
might have to add some options to the DDL before applying it in order
to keep the same behavior. This seems to require a different solution
from what the patch does for the problem you mentioned, such as "DDL
involving multiple tables where only some tables are replicated".

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-05-24 Thread Masahiko Sawada
On Tue, May 24, 2022 at 2:18 PM Amit Kapila  wrote:
>
> On Tue, May 24, 2022 at 7:58 AM Masahiko Sawada  wrote:
> >
> > On Mon, May 23, 2022 at 2:39 PM Amit Kapila  wrote:
> > >
> > > On Mon, May 23, 2022 at 10:03 AM Kyotaro Horiguchi
> > >  wrote:
> > > >
> > > > At Sat, 21 May 2022 15:35:58 +0530, Amit Kapila 
> > > >  wrote in
> > > > > I think if we don't have any better ideas then we should go with
> > > > > either this or one of the other proposals in this thread. The other
> > > > > idea that occurred to me is whether we can somehow update the snapshot
> > > > > we have serialized on disk about this information. On each
> > > > > running_xact record when we serialize the snapshot, we also try to
> > > > > purge the committed xacts (via SnapBuildPurgeCommittedTxn). So, during
> > > > > that we can check if there are committed xacts to be purged and if we
> > > > > have previously serialized the snapshot for the prior running xact
> > > > > record, if so, we can update it with the list of xacts that have
> > > > > catalog changes. If this is feasible then I think we need to somehow
> > > > > remember the point where we last serialized the snapshot (maybe by
> > > > > using builder->last_serialized_snapshot). Even, if this is feasible we
> > > > > may not be able to do this in back-branches because of the disk-format
> > > > > change required for this.
> > > > >
> > > > > Thoughts?
> >
> > It seems to work, could you draft the patch?
> >
>
> I can help with the review and discussion.

Okay, I'll draft the patch for this idea.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-05-24 Thread Masahiko Sawada
On Tue, May 10, 2022 at 6:58 PM John Naylor
 wrote:
>
> On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada  wrote:
> >
> > Overall, radix tree implementations have good numbers. Once we got an
> > agreement on moving in this direction, I'll start a new thread for
> > that and move the implementation further; there are many things to do
> > and discuss: deletion, API design, SIMD support, more tests etc.
>
> +1
>

Thanks!

I've attached an updated version patch. It is still WIP but I've
implemented deletion and improved test cases and comments.

> (FWIW, I think the current thread is still fine.)

Okay, agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


radixtree_wip_v2.patch
Description: Binary data


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-05-23 Thread Masahiko Sawada
On Mon, May 23, 2022 at 2:39 PM Amit Kapila  wrote:
>
> On Mon, May 23, 2022 at 10:03 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Sat, 21 May 2022 15:35:58 +0530, Amit Kapila  
> > wrote in
> > > I think if we don't have any better ideas then we should go with
> > > either this or one of the other proposals in this thread. The other
> > > idea that occurred to me is whether we can somehow update the snapshot
> > > we have serialized on disk about this information. On each
> > > running_xact record when we serialize the snapshot, we also try to
> > > purge the committed xacts (via SnapBuildPurgeCommittedTxn). So, during
> > > that we can check if there are committed xacts to be purged and if we
> > > have previously serialized the snapshot for the prior running xact
> > > record, if so, we can update it with the list of xacts that have
> > > catalog changes. If this is feasible then I think we need to somehow
> > > remember the point where we last serialized the snapshot (maybe by
> > > using builder->last_serialized_snapshot). Even, if this is feasible we
> > > may not be able to do this in back-branches because of the disk-format
> > > change required for this.
> > >
> > > Thoughts?

It seems to work, could you draft the patch?

> >
> > I didn't look it closer, but it seems to work. I'm not sure how much
> > spurious invalidations at replication start impacts on performance,
> > but it is promising if the impact is significant.
> >
>
> It seems Sawada-San's patch is doing at each commit not at the start
> of replication and I think that is required because we need this each
> time for replication restart. So, I feel this will be an ongoing
> overhead for spurious cases with the current approach.
>
> >  That being said I'm
> > a bit negative for doing that in post-beta1 stage.
> >
>
> Fair point. We can use the do it early in PG-16 if the approach is
> feasible, and backpatch something on lines of what Sawada-San or you
> proposed.

+1.

I proposed two approaches, [1] and [2], and I prefer [1].
Horiguchi-san's idea[3] also looks good, but I think it's better to
somehow deal with the problem he mentioned:

> One problem with this is that change creates the case where multiple
> ReorderBufferTXNs share the same first_lsn.  I haven't come up with a
> clean idea to avoid relaxing the restriction of AssertTXNLsnOrder..

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoAn-k6OpZ6HSAH_G91tpTXR6KYvkf663kg6EqW-f6sz1w%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAD21AoD00wV4gt-53ze%2BZB8n4bqJrdH8J_UnDHddy8S2A%2Ba25g%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/20211008.165055.1621145185927268721.horikyota.ntt%40gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Intermittent buildfarm failures on wrasse

2022-05-18 Thread Masahiko Sawada
On Sun, May 15, 2022 at 12:29 AM Alvaro Herrera  wrote:
>
> On 2022-Apr-20, Masahiko Sawada wrote:
>
> > > MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
> > >   (proc->statusFlags & PROC_XMIN_FLAGS);
> > >
> > > Perhaps the latter is more future-proof.
>
> > Copying only xmin-related flags in this way also makes sense to me and
> > there is no problem at least for now. A note would be that when we
> > introduce a new flag that needs to be copied in the future, we need to
> > make sure to add it to PROC_XMIN_FLAGS so it is copied. Otherwise a
> > similar issue we fixed by 0f0cfb494004befb0f6e could happen again.
>
> OK, done this way -- patch attached.

Thank you for updating the patch.

>
> Reading the comment I wrote about it, I wonder if flags
> PROC_AFFECTS_ALL_HORIZONS and PROC_IN_LOGICAL_DECODING should also be
> included.  I think the only reason we don't care at this point is that
> walsenders (logical or otherwise) do not engage in snapshot copying.
> But if we were to implement usage of parallel workers sharing a common
> snapshot to do table sync in parallel, then it ISTM it would be
> important to copy at least the latter.  Not sure there are any cases
> were we might care about the former.

Yeah, the comment (and the new name) seems inconsistent with the flags
actually included. I think we can include all xmin-related flags in
PROC_XMIN_FLAGS, as the comment says. That way, it would also be useful
for other use cases, and I don't see any downside for now.
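
To make that concrete, I'm thinking of something along these lines
(take the exact flag set as an assumption; whether
PROC_AFFECTS_ALL_HORIZONS really needs to be copied is the open
question above):

/* xmin-related status flags that should be copied when copying a snapshot */
#define PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC | \
						 PROC_IN_LOGICAL_DECODING | \
						 PROC_AFFECTS_ALL_HORIZONS)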

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



