Re: logical decoding and replication of sequences, take 2

2024-03-06 Thread Tomas Vondra
Hi,

Let me share a bit of an update regarding this patch and PG17. I have
discussed this patch and how to move it forward with a couple hackers
(both within EDB and outside), and my takeaway is that the patch is not
quite baked yet, not enough to make it into PG17 :-(

There are two main reasons / concerns leading to this conclusion:

* correctness of the decoding part

There are (were) doubts about decoding during startup, before the
snapshot gets consistent, when we can make "temporarily incorrect"
decisions about whether a change is transactional. While the behavior is
ultimately correct (we treat all changes as non-transactional and
discard them), it seems "dirty", and it's unclear to me whether it might
cause more serious issues down the line (not necessarily bugs, but
perhaps making it harder to implement future changes).

* handling of sequences in built-in replication

Per the patch, sequences need to be added to the publication explicitly.
But there were suggestions we might (should) add certain sequences
automatically - e.g. sequences backing SERIAL/BIGSERIAL columns, etc.
I’m not sure we really want to do that, and so far I assumed we would
start with the manual approach and move to automatic addition in the
future. But the agreement seems to be it would be a pretty significant
"breaking change", and something we probably don’t want to do.


If someone has an opinion on either of the two issues (either way),
I'd like to hear it.


Obviously, I'm not particularly happy about this outcome. And I'm also
somewhat cautious because this patch was already committed+reverted in
PG16 cycle, and doing the same thing in PG17 is not on my wish list.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2024-02-21 Thread Dilip Kumar
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas  wrote:
>
> On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar  wrote:
> > So the problem is that we might consider a transactional change as
> > non-transactional and mark this flag as true.
>
> But it's not "might" right? It's absolutely 100% certain that we will
> consider that transaction's changes as non-transactional ... because
> when we're in fast-forward mode, the table of new relfilenodes is not
> built, and so whenever we check whether any transaction made a new
> relfilenode for this sequence, the answer will be no.
>
> > But what would have happened if we had identified it correctly as
> > transactional?  In such cases, we wouldn't have set the flag here,
> > but we would have set it while processing DecodeAbort/DecodeCommit,
> > so the net effect would be the same, no?  You may ask what happens
> > if the Abort/Commit record never appears in the WAL, but this flag
> > is specifically for the upgrade case, and in that case we have to do
> > a clean shutdown, so it may not be an issue.  But in the future, if
> > we try to use 'ctx->processing_required' for something else where a
> > clean shutdown is not guaranteed, then this flag can be set
> > incorrectly.
> >
> > I am not arguing that this is a perfect design but I am just making a
> > point about why it would work.
>
> Even if this argument is correct (and I don't know if it is), the code
> and comments need some updating. We should not be testing a flag that
> is guaranteed false with comments that make it sound like the value of
> the flag is trustworthy when it isn't.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-21 Thread Robert Haas
On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar  wrote:
> So the problem is that we might consider a transactional change as
> non-transactional and mark this flag as true.

But it's not "might" right? It's absolutely 100% certain that we will
consider that transaction's changes as non-transactional ... because
when we're in fast-forward mode, the table of new relfilenodes is not
built, and so whenever we check whether any transaction made a new
relfilenode for this sequence, the answer will be no.

> But what would have happened if we had identified it correctly as
> transactional?  In such cases, we wouldn't have set the flag here,
> but we would have set it while processing DecodeAbort/DecodeCommit,
> so the net effect would be the same, no?  You may ask what happens if
> the Abort/Commit record never appears in the WAL, but this flag is
> specifically for the upgrade case, and in that case we have to do a
> clean shutdown, so it may not be an issue.  But in the future, if we
> try to use 'ctx->processing_required' for something else where a
> clean shutdown is not guaranteed, then this flag can be set
> incorrectly.
>
> I am not arguing that this is a perfect design but I am just making a
> point about why it would work.

Even if this argument is correct (and I don't know if it is), the code
and comments need some updating. We should not be testing a flag that
is guaranteed false with comments that make it sound like the value of
the flag is trustworthy when it isn't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-21 Thread Dilip Kumar
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas  wrote:
>
> On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar  wrote:
> > > But I am wondering why this flag is always set to true in
> > > DecodeTXNNeedSkip() irrespective of commit or abort, given that
> > > aborted transactions are not supposed to be replayed.  If my
> > > observation is correct that the flag shouldn't be set to true for
> > > an aborted transaction, then we have a problem with sequences,
> > > where we are identifying transactional changes as
> > > non-transactional ones: for transactional changes this should
> > > depend upon the commit status.
> >
> > I have checked this case with Amit Kapila.  So it seems in the cases
> > where we have sent the prepared transaction or streamed in-progress
> > transaction we would need to send the abort also, and for that reason,
> > we are setting 'ctx->processing_required' as true so that if these
> > WALs are not streamed we do not allow upgrade of such slots.
>
> I don't find this explanation clear enough for me to understand.


Explanation of why we set 'ctx->processing_required' to true in both
DecodeCommit and DecodeAbort:
--
For upgrading logical replication slots, it's essential to ensure that
these slots are completely synchronized with the subscriber.  To verify
that, we process all the pending WAL in 'fast_forward' mode and check
whether there is any decodable WAL left.  In short, any WAL record type
that we would stream downstream in normal mode (i.e. not fast_forward
mode) is considered decodable, and that includes the abort record.
That's the reason why at transaction commit/abort we need to set
'ctx->processing_required' to true: some decodable WAL exists, so we
cannot upgrade this slot.
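As a toy illustration of this rule, here is a hedged Python sketch (not PostgreSQL code; the record names and the `can_upgrade_slot` helper are invented for this example): during the fast-forward scan, any record that would have been streamed downstream in normal decoding marks the slot as having pending work, which blocks the upgrade.

```python
# Hypothetical model of the fast-forward upgrade check; record names
# and the helper are invented for illustration only.
DECODABLE = {"COMMIT", "ABORT", "NONTRANSACTIONAL_SEQ_CHANGE"}

def can_upgrade_slot(pending_wal):
    """Scan pending WAL in 'fast_forward' style: if anything decodable
    remains (anything we would stream in normal mode, aborts included),
    the slot is not fully caught up and must not be upgraded."""
    processing_required = any(rec in DECODABLE for rec in pending_wal)
    return not processing_required

# Only non-decodable bookkeeping records left: the slot can be upgraded.
assert can_upgrade_slot(["RUNNING_XACTS", "CHECKPOINT"])
# A pending abort still counts as decodable WAL, so upgrade is refused.
assert not can_upgrade_slot(["RUNNING_XACTS", "ABORT"])
```

This is why the abort record matters here even though aborted transactions are never replayed: its mere decodability means the slot has not consumed all its WAL.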

Why is the check below safe?
> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

So the problem is that we might consider a transactional change as
non-transactional and mark this flag as true.  But what would have
happened if we had identified it correctly as transactional?  In such
cases, we wouldn't have set the flag here, but we would have set it
while processing DecodeAbort/DecodeCommit, so the net effect would be
the same, no?  You may ask what happens if the Abort/Commit record
never appears in the WAL, but this flag is specifically for the
upgrade case, and in that case we have to do a clean shutdown, so it
may not be an issue.  But in the future, if we try to use
'ctx->processing_required' for something else where a clean shutdown
is not guaranteed, then this flag can be set incorrectly.

I am not arguing that this is a perfect design but I am just making a
point about why it would work.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Robert Haas
On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar  wrote:
> > But I am wondering why this flag is always set to true in
> > DecodeTXNNeedSkip() irrespective of commit or abort, given that
> > aborted transactions are not supposed to be replayed.  If my
> > observation is correct that the flag shouldn't be set to true for
> > an aborted transaction, then we have a problem with sequences,
> > where we are identifying transactional changes as non-transactional
> > ones: for transactional changes this should depend upon the commit
> > status.
>
> I have checked this case with Amit Kapila.  So it seems in the cases
> where we have sent the prepared transaction or streamed in-progress
> transaction we would need to send the abort also, and for that reason,
> we are setting 'ctx->processing_required' as true so that if these
> WALs are not streamed we do not allow upgrade of such slots.

I don't find this explanation clear enough for me to understand.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Dilip Kumar
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar  wrote:
>
> On Tue, Feb 20, 2024 at 3:38 PM Robert Haas  wrote:
>
> > Let's say fast_forward is true. Then smgr_decode() is going to skip
> > recording anything about the relfilenode, so we'll identify all
> > sequence changes as non-transactional. But look at how this case is
> > handled in seq_decode():
> >
> > + if (ctx->fast_forward)
> > + {
> > + /*
> > + * We need to set processing_required flag to notify the sequence
> > + * change existence to the caller. Usually, the flag is set when
> > + * either the COMMIT or ABORT records are decoded, but this must be
> > + * turned on here because the non-transactional logical message is
> > + * decoded without waiting for these records.
> > + */
> > + if (!transactional)
> > + ctx->processing_required = true;
> > +
> > + return;
> > + }
>
> It appears that the 'processing_required' flag was introduced as part
> of supporting upgrades for logical replication slots. Its purpose is
> to determine whether a slot is fully caught up, meaning that there are
> no pending decodable changes left before it can be upgraded.
>
> So now, if some change was transactional but we identified it as
> non-transactional, we will set 'ctx->processing_required = true',
> i.e. we temporarily set this flag for the wrong reason.  But even if
> the change had been correctly identified initially, the flag would
> have been set to true in the DecodeTXNNeedSkip() function anyway,
> regardless of whether the transaction is committed or aborted.  As a
> result, the flag would eventually be set to 'true', and the behavior
> would align with the intended logic.
>
> But I am wondering why this flag is always set to true in
> DecodeTXNNeedSkip() irrespective of commit or abort, given that
> aborted transactions are not supposed to be replayed.  If my
> observation is correct that the flag shouldn't be set to true for an
> aborted transaction, then we have a problem with sequences, where we
> are identifying transactional changes as non-transactional ones: for
> transactional changes this should depend upon the commit status.

I have checked this case with Amit Kapila.  So it seems in the cases
where we have sent the prepared transaction or streamed in-progress
transaction we would need to send the abort also, and for that reason,
we are setting 'ctx->processing_required' as true so that if these
WALs are not streamed we do not allow upgrade of such slots.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Amit Kapila
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
 wrote:
>
> On 2/13/24 17:37, Robert Haas wrote:
>
> > In other words, the fact that some sequence changes are
> > non-transactional creates ordering hazards that don't exist if there
> > are no non-transactional changes. So in that way, sequences are
> > different from table modifications, where applying the transactions in
> > order of commit is all we need to do. Here we need to apply the
> > transactions in order of commit and also apply the non-transactional
> > changes at the right point in the sequence. Consider the following
> > alternative apply sequence:
> >
> > 1. T1.
> > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> > the subsequent nextval)
> > 3. T3's nextval
> > 4. T2's first nextval
> >
> > That's still in commit order. It's also wrong.
> >
>
> Yes, this would be wrong. Thankfully the apply is not allowed to reorder
> the changes like this, because that's not what "non-transactional" means
> in this context.
>
> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally).
>

In this regard, I have another scenario in mind where the apply order
could differ for changes within the same transaction. For
example,

Transaction T1
Begin;
Insert ..
Insert ..
nextval .. --consider this generates WAL
..
Insert ..
nextval .. --consider this generates WAL

In this case, if the nextval operations are applied in a different
order (i.e., before the inserts), then there could be some
inconsistency.  Say the apply doesn't follow the above order; then a
trigger fired on both pub and sub for each row insert that refers to
the current sequence value to make some decision could behave
differently on the publisher and subscriber. If this is not how the
patch will behave then fine, but otherwise, isn't this something we
should be worried about?
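To make the concern concrete, here is a small Python sketch (purely illustrative, no PostgreSQL code involved; the change list and helper are invented) of how a row trigger that reads the current sequence value would observe different values if the nextval changes were applied ahead of the inserts instead of at their original positions:

```python
def apply_changes(changes):
    """Replay a flat change list; 'nextval' advances the sequence, and
    each 'insert' fires a row trigger that reads the current value."""
    seq_value = 0
    trigger_observations = []
    for change in changes:
        if change == "nextval":
            seq_value += 1
        else:  # "insert": the trigger records what it sees
            trigger_observations.append(seq_value)
    return trigger_observations

# Original order from the example transaction above:
original = ["insert", "insert", "nextval", "insert", "nextval"]
# Hypothetical reordering where the nextval changes are applied first:
reordered = ["nextval", "nextval", "insert", "insert", "insert"]

assert apply_changes(original) == [0, 0, 1]
assert apply_changes(reordered) == [2, 2, 2]  # triggers diverge
```

If the apply side preserves the original positions of non-transactional changes, as Tomas describes, the divergence above cannot happen.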

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Amit Kapila
On Tue, Feb 20, 2024 at 5:39 PM Tomas Vondra
 wrote:
>
> On 2/20/24 06:54, Amit Kapila wrote:
> > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
> >  wrote:
> >>
> >> On 12/19/23 13:54, Christophe Pettus wrote:
> >>> Hi,
> >>>
> >>> I wanted to hop in here on one particular issue:
> >>>
>  On Dec 12, 2023, at 02:01, Tomas Vondra  
>  wrote:
>  - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>  better solution for distributed (esp. active-active) systems. But there
>  are important use cases that are likely to keep using regular sequences
>  (online upgrades of single-node instances, existing systems, ...).
> >>>
> >>> +1.
> >>>
> >>> Right now, the lack of sequence replication is a rather large
> >>> foot-gun on logical replication upgrades.  Copying the sequences
> >>> over during the cutover period is doable, of course, but:
> >>>
> >>> (a) There's no out-of-the-box tooling that does it, so everyone has
> >>> to write some scripts just for that one function.
> >>>
> >>> (b) It's one more thing that extends the cutover window.
> >>>
> >>
> >> I agree it's an annoying gap for this use case. But if this is the only
> >> use cases, maybe a better solution would be to provide such tooling
> >> instead of adding it to the logical decoding?
> >>
> >> It might seem a bit strange if most data is copied by replication
> >> directly, while sequences need special handling, ofc.
> >>
> >
> > One difference between the logical replication of tables and
> > sequences is that with synchronous_commit (and
> > synchronous_standby_names) we can guarantee that after failover a
> > transaction's table data either is or is not replicated, whereas
> > for sequences we can't guarantee that because of their
> > non-transactional nature. Say there are two transactions T1 and T2:
> > it is possible that T1's entire table data and sequence data are
> > committed and replicated, but for T2 only the sequence data is
> > replicated. So, after failover to the logical subscriber in such a
> > case, if one routes T2 to the new node again (as it was not
> > successful previously), then it would needlessly perform the
> > sequence changes again. I don't know how much that matters, but
> > that would probably be the difference between the replication of
> > tables and sequences.
> >
>
> I don't quite follow what the problem with synchronous_commit is :-(
>
> For sequences, we log the changes ahead, i.e. even if nextval() did not
> write anything into WAL, it's still safe because these changes are
> covered by the WAL generated some time ago (up to ~32 values back). And
> that's certainly subject to synchronous_commit, right?
>
> There certainly are issues with sequences and syncrep:
>
> https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9...@enterprisedb.com
>
> but that's unrelated to logical replication.
>
> FWIW I don't think we'd re-apply sequence changes needlessly, because
> the worker does update the origin after applying non-transactional
> changes. So after the replication gets restarted, we'd skip what we
> already applied, no?
>

It will work for restarts, but I was trying to discuss what happens in
the scenario where the publisher node goes down and we fail over to
the subscriber node and make it the primary node (a failover case).
After that, all unfinished transactions will be re-routed to the new
primary. Consider a theoretical case where we send sequence changes of
yet-uncommitted transactions directly from WAL buffers (something
like 91f2cae7a4 does for physical replication) and then immediately
the primary or publisher node crashes. After failover to the
subscriber node, the application will re-route unfinished transactions
to the new primary. In such a situation, I think there is a chance
that we will update the sequence value when the new primary has
already received/applied that update via replication. This is what I
was saying: there is probably a difference between tables and
sequences, since for tables such a replicated change would be rolled
back. Having said that, this is probably no different from what would
happen in the case of physical replication.

> But maybe there is an issue and I'm just not getting it. Could you maybe
> share an example of T1/T2, with a replication restart and what you think
> would happen?
>
> > I agree with your point above that for upgrades some tool like
> > pg_copysequence, which would provide a way to copy sequence data to
> > subscribers from the publisher, would meet the need.
> >
>
> Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet
> another tool users would need to use.
>

But can a logical replica be used for failover? We don't have any way
to replicate/sync the slots on subscribers, nor do we have a mechanism
to replicate existing publications. I think if we want to achieve
failover to a logical subscriber, we need to replicate/sync the
required logical and physical slots to the subscribers. I haven't
thought through it completely so there 

Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Tomas Vondra



On 2/20/24 06:54, Amit Kapila wrote:
> On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
>  wrote:
>>
>> On 12/19/23 13:54, Christophe Pettus wrote:
>>> Hi,
>>>
>>> I wanted to hop in here on one particular issue:
>>>
 On Dec 12, 2023, at 02:01, Tomas Vondra  
 wrote:
 - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
 better solution for distributed (esp. active-active) systems. But there
 are important use cases that are likely to keep using regular sequences
 (online upgrades of single-node instances, existing systems, ...).
>>>
>>> +1.
>>>
>>> Right now, the lack of sequence replication is a rather large
>>> foot-gun on logical replication upgrades.  Copying the sequences
>>> over during the cutover period is doable, of course, but:
>>>
>>> (a) There's no out-of-the-box tooling that does it, so everyone has
>>> to write some scripts just for that one function.
>>>
>>> (b) It's one more thing that extends the cutover window.
>>>
>>
>> I agree it's an annoying gap for this use case. But if this is the only
>> use cases, maybe a better solution would be to provide such tooling
>> instead of adding it to the logical decoding?
>>
>> It might seem a bit strange if most data is copied by replication
>> directly, while sequences need special handling, ofc.
>>
> 
> One difference between the logical replication of tables and
> sequences is that with synchronous_commit (and
> synchronous_standby_names) we can guarantee that after failover a
> transaction's table data either is or is not replicated, whereas for
> sequences we can't guarantee that because of their non-transactional
> nature. Say there are two transactions T1 and T2: it is possible
> that T1's entire table data and sequence data are committed and
> replicated, but for T2 only the sequence data is replicated. So,
> after failover to the logical subscriber in such a case, if one
> routes T2 to the new node again (as it was not successful
> previously), then it would needlessly perform the sequence changes
> again. I don't know how much that matters, but that would probably
> be the difference between the replication of tables and sequences.
> 

I don't quite follow what the problem with synchronous_commit is :-(

For sequences, we log the changes ahead, i.e. even if nextval() did not
write anything into WAL, it's still safe because these changes are
covered by the WAL generated some time ago (up to ~32 values back). And
that's certainly subject to synchronous_commit, right?
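The log-ahead behavior described above can be sketched as follows, an illustrative Python model assuming a cache of 32 pre-logged values (SEQ_LOG_VALS is the real PostgreSQL constant; the state dict and helper are invented for this sketch):

```python
SEQ_LOG_VALS = 32  # values logged ahead per sequence WAL record

def nextval(state, wal):
    """Advance the sequence; emit WAL only when the pre-logged range
    runs out, covering the next SEQ_LOG_VALS values in advance."""
    state["value"] += 1
    if state["value"] > state["logged_upto"]:
        state["logged_upto"] = state["value"] + SEQ_LOG_VALS
        wal.append(state["logged_upto"])  # record covers future values
    return state["value"]

state, wal = {"value": 0, "logged_upto": 0}, []
returned = [nextval(state, wal) for _ in range(40)]

# Only two WAL records were needed for 40 nextval() calls:
assert wal == [33, 66]
# After a crash, replay restores 'logged_upto'; restarting from there
# can skip values but never repeats one already handed out, because
# every returned value is covered by WAL written before it was used.
assert wal[-1] >= max(returned)
```

This is the sense in which every nextval() result is "covered" by earlier WAL, and why that coverage is subject to synchronous_commit like any other WAL.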

There certainly are issues with sequences and syncrep:

https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9...@enterprisedb.com

but that's unrelated to logical replication.

FWIW I don't think we'd re-apply sequence changes needlessly, because
the worker does update the origin after applying non-transactional
changes. So after the replication gets restarted, we'd skip what we
already applied, no?

But maybe there is an issue and I'm just not getting it. Could you maybe
share an example of T1/T2, with a replication restart and what you think
would happen?

> I agree with your point above that for upgrades some tool like
> pg_copysequence, which would provide a way to copy sequence data to
> subscribers from the publisher, would meet the need.
> 

Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet
another tool users would need to use.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Dilip Kumar
On Tue, Feb 20, 2024 at 3:38 PM Robert Haas  wrote:

> Let's say fast_forward is true. Then smgr_decode() is going to skip
> recording anything about the relfilenode, so we'll identify all
> sequence changes as non-transactional. But look at how this case is
> handled in seq_decode():
>
> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

It appears that the 'processing_required' flag was introduced as part
of supporting upgrades for logical replication slots. Its purpose is
to determine whether a slot is fully caught up, meaning that there are
no pending decodable changes left before it can be upgraded.

So now, if some change was transactional but we identified it as
non-transactional, we will set 'ctx->processing_required = true', i.e.
we temporarily set this flag for the wrong reason.  But even if the
change had been correctly identified initially, the flag would have
been set to true in the DecodeTXNNeedSkip() function anyway,
regardless of whether the transaction is committed or aborted.  As a
result, the flag would eventually be set to 'true', and the behavior
would align with the intended logic.

But I am wondering why this flag is always set to true in
DecodeTXNNeedSkip() irrespective of commit or abort, given that
aborted transactions are not supposed to be replayed.  If my
observation is correct that the flag shouldn't be set to true for an
aborted transaction, then we have a problem with sequences, where we
are identifying transactional changes as non-transactional ones: for
transactional changes this should depend upon the commit status.

On another thought, can there be a situation where we have wrongly
identified a change as non-transactional and set this flag, and the
commit/abort record never appears in the WAL, so it is never decoded?
That can also lead to an incorrect decision during the upgrade.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Robert Haas
On Tue, Feb 20, 2024 at 1:32 PM Dilip Kumar  wrote:
> You might be interested in more detail in [1], where I first
> reported this problem, and in [2], where we concluded why this is
> not creating a real problem.
>
> [1] 
> https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
> [2] 
> https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

Thanks. Dilip and I just spent a lot of time talking this through on a
call. One of the key bits of logic is here:

+ /* Skip the change if already processed (per the snapshot). */
+ if (transactional &&
+ !SnapBuildProcessChange(builder, xid, buf->origptr))
+ return;
+ else if (!transactional &&
+ (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
+   SnapBuildXactNeedsSkip(builder, buf->origptr)))
+ return;

As a stylistic note, I think this would be clearer if it were
written if (transactional) { if (!SnapBuildProcessChange()) return; }
else { if (something else) return; }.

Now, on to correctness. It's possible for us to identify a
transactional change as non-transactional if smgr_decode() was called
for the relfilenode before SNAPBUILD_FULL_SNAPSHOT was reached. In
that case, if !SnapBuildProcessChange() would have been true, then we
need SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
SnapBuildXactNeedsSkip(builder, buf->origptr) to also be true.
Otherwise, we'll process this change when we wouldn't have otherwise.
But Dilip made an argument to me about this which seems correct to me.
snapbuild.h says that SNAPBUILD_CONSISTENT is reached only when we
find a point where all transactions that were running at the time we
reached SNAPBUILD_FULL_SNAPSHOT have finished. So if this transaction
is one for which we incorrectly identified the sequence change as
non-transactional, then we cannot be in the SNAPBUILD_CONSISTENT state
yet, so SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT will be
true, hence the whole "or" condition will be true and we'll return. So
far, so good.

I think, anyway. I haven't comprehensively verified that the comment
in snapbuild.h accurately reflects what the code actually does. But if
it does, then presumably we shouldn't see a record for which we might
have mistakenly identified a change as non-transactional after
reaching SNAPBUILD_CONSISTENT, which seems to be good enough to
guarantee that the mistake won't matter.
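Dilip's argument can be compressed into a small model (an illustrative Python sketch; the state ordering mirrors snapbuild.h, everything else is invented): a sequence change can only be misclassified if its transaction was already running before SNAPBUILD_FULL_SNAPSHOT, and by definition we cannot reach SNAPBUILD_CONSISTENT while such a transaction is still in progress, so the non-transactional skip condition always fires for the misclassified change.

```python
# Snapshot states in build order, as in snapbuild.h.
START, BUILDING, FULL_SNAPSHOT, CONSISTENT = range(4)

def skip_change(transactional, state, change_already_processed=False):
    """Simplified form of the skip test quoted above (the real code
    also consults SnapBuildXactNeedsSkip, omitted here)."""
    if transactional:
        return change_already_processed
    return state != CONSISTENT

# A misclassified change comes from a transaction that was running
# before FULL_SNAPSHOT, so when we decode it the state is at most
# FULL_SNAPSHOT (CONSISTENT requires all such transactions to finish):
for state in (START, BUILDING, FULL_SNAPSHOT):
    assert skip_change(transactional=False, state=state)

# Once CONSISTENT is reached, no misclassification is possible, and a
# genuinely non-transactional change is processed (not skipped).
assert not skip_change(transactional=False, state=CONSISTENT)
```

Under this reading, the skip condition is what keeps the temporary misclassification from ever being streamed.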

However, the logic in smgr_decode() doesn't only care about the
snapshot state. It also cares about the fast-forward flag:

+ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+ ctx->fast_forward)
+ return;

Let's say fast_forward is true. Then smgr_decode() is going to skip
recording anything about the relfilenode, so we'll identify all
sequence changes as non-transactional. But look at how this case is
handled in seq_decode():

+ if (ctx->fast_forward)
+ {
+ /*
+ * We need to set processing_required flag to notify the sequence
+ * change existence to the caller. Usually, the flag is set when
+ * either the COMMIT or ABORT records are decoded, but this must be
+ * turned on here because the non-transactional logical message is
+ * decoded without waiting for these records.
+ */
+ if (!transactional)
+ ctx->processing_required = true;
+
+ return;
+ }

This seems suspicious. Why are we testing the transactional flag here
if it's guaranteed to be false? My guess is that the person who wrote
this code thought that the flag would be accurate even in this case,
but that doesn't seem to be true. So this case probably needs some
more thought.

It's definitely not great that this logic is so complicated; it's
really hard to verify that all the tests match up well enough to keep
us out of trouble.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-20 Thread Dilip Kumar
On Tue, Feb 20, 2024 at 10:30 AM Robert Haas  wrote:
>
> Is the rule that changes are transactional if and only if the current
> transaction has assigned a new relfilenode to the sequence?

Yes, that's the rule.

> Why does the logic get confused if the state of the snapshot changes?

The rule doesn't change, but the way this identification is
implemented in the decoding gets confused and treats transactional
changes as non-transactional.  Whether a sequence change is
transactional or not is determined based on what WAL we have decoded
from the particular transaction, and whether we decode a particular
WAL record depends upon the snapshot state (it's about what we decode,
not necessarily what we send).  So if the snapshot state changed
mid-transaction, that means we haven't decoded the WAL record which
created the new relfilenode, but we will decode the WAL record which
operates on the sequence.  So here we will assume the change is
non-transactional whereas it was transactional, because we did not
decode some of the changes of the transaction which we rely on for
identifying whether it is transactional.
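A minimal Python model of this failure mode (illustrative only; the record tuples and the classifier are invented): classification consults a table of relfilenodes built only from the records we actually decoded, so starting decoding mid-transaction loses the NEW_RELFILENODE record and the subsequent sequence change looks non-transactional.

```python
def classify_seq_changes(wal, first_decoded_index):
    """Classify each SEQ_CHANGE as transactional iff we decoded a
    NEW_RELFILENODE for that (xid, sequence) earlier; records before
    first_decoded_index were skipped (snapshot not ready yet)."""
    new_relfilenodes = set()
    results = []
    for i, (kind, xid, seq) in enumerate(wal):
        if i < first_decoded_index:
            continue  # never decoded, so never recorded
        if kind == "NEW_RELFILENODE":
            new_relfilenodes.add((xid, seq))
        elif kind == "SEQ_CHANGE":
            results.append("transactional"
                           if (xid, seq) in new_relfilenodes
                           else "non-transactional")
    return results

wal = [("NEW_RELFILENODE", 1, "s1"), ("SEQ_CHANGE", 1, "s1")]
# Decoding the whole transaction classifies the change correctly:
assert classify_seq_changes(wal, 0) == ["transactional"]
# Starting mid-transaction misses the relfilenode and misclassifies:
assert classify_seq_changes(wal, 1) == ["non-transactional"]
```

The misclassification is thus a property of partial decoding, not of the rule itself.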


> My naive reaction is that it kinda sounds like you're relying on two
> different mistakes cancelling each other out, and that might be a bad
> idea, because maybe there's some situation where they don't. But I
> don't understand the issue well enough to have an educated opinion at
> this point.

I would say the first one is a mistake in identifying a transactional
change as non-transactional during decoding, and that mistake happens
only when we decode the transaction partially.  But we never stream
partially decoded transactions downstream, which means that even
though we have made a mistake in decoding it, we are not streaming it,
so our mistake does not turn into a real problem.  But again, I agree
there is a temporarily wrong decision, and if we try to do something
else based on that decision then it could be an issue.

You might be interested in more detail in [1], where I first reported
this problem, and in [2], where we concluded why this is not creating
a real problem.

[1] 
https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-19 Thread Amit Kapila
On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
 wrote:
>
> On 12/19/23 13:54, Christophe Pettus wrote:
> > Hi,
> >
> > I wanted to hop in here on one particular issue:
> >
> >> On Dec 12, 2023, at 02:01, Tomas Vondra  
> >> wrote:
> >> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >> better solution for distributed (esp. active-active) systems. But there
> >> are important use cases that are likely to keep using regular sequences
> >> (online upgrades of single-node instances, existing systems, ...).
> >
> > +1.
> >
> > Right now, the lack of sequence replication is a rather large
> > foot-gun on logical replication upgrades.  Copying the sequences
> > over during the cutover period is doable, of course, but:
> >
> > (a) There's no out-of-the-box tooling that does it, so everyone has
> > to write some scripts just for that one function.
> >
> > (b) It's one more thing that extends the cutover window.
> >
>
> I agree it's an annoying gap for this use case. But if this is the only
> use cases, maybe a better solution would be to provide such tooling
> instead of adding it to the logical decoding?
>
> It might seem a bit strange if most data is copied by replication
> directly, while sequences need special handling, ofc.
>

One difference between the logical replication of tables and sequences
is that with synchronous_commit (and synchronous_standby_names) we can
guarantee that after failover a transaction's data either is or is not
replicated, whereas for sequences we can't guarantee that because of
their non-transactional nature. Say there are two transactions T1 and
T2: it is possible that T1's entire table data and sequence data are
committed and replicated, but only T2's sequence data is replicated.
So, after failover to the logical subscriber in such a case, if one
routes T2 again to the new node (as it was not successful previously),
it would needlessly perform the sequence changes again. I don't know
how much that matters, but that would probably be the difference
between the replication of tables and sequences.

I agree with your point above that for upgrades some tool like
pg_copysequence, providing a way to copy sequence data from the
publisher to subscribers, would suffice.
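The shape of such a tool is simple: read each sequence's state on the
publisher and replay it on the subscriber via setval(). A rough sketch
of the statement generation (the pg_copysequence name is hypothetical,
and the sequence states below are hard-coded where a real tool would
query the publisher's pg_sequences view):

```python
# Sketch of what a "pg_copysequence"-style tool would emit for the
# subscriber. The tool name is hypothetical; a real implementation
# would read (schemaname, sequencename, last_value, ...) from the
# publisher's pg_sequences view instead of hard-coding them.

def setval_statements(sequences):
    """Turn (name, last_value, is_called) tuples into setval() calls
    to replay on the subscriber during cutover."""
    stmts = []
    for name, last_value, is_called in sequences:
        stmts.append(
            "SELECT pg_catalog.setval('%s', %d, %s);"
            % (name, last_value, "true" if is_called else "false"))
    return stmts

publisher_state = [
    ("public.orders_id_seq", 4711, True),
    ("public.invoice_seq", 1, False),   # sequence never called yet
]
for stmt in setval_statements(publisher_state):
    print(stmt)
# SELECT pg_catalog.setval('public.orders_id_seq', 4711, true);
# SELECT pg_catalog.setval('public.invoice_seq', 1, false);
```

The three-argument setval() preserves the is_called flag, so a
never-used sequence still hands out its first value after cutover.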

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2024-02-19 Thread Robert Haas
On Fri, Feb 16, 2024 at 1:57 AM Tomas Vondra
 wrote:
> For me, the part that I feel most uneasy about is the decoding while the
> snapshot is still being built (and can flip to consistent snapshot
> between the relfilenode creation and sequence change, confusing the
> logic that decides which changes are transactional).
>
> It seems "a bit weird" that we either keep the "simple" logic that may
> end up with incorrect "non-transactional" result, but happens to then
> work fine because we immediately discard the change.
>
> But it still feels better than the alternative, which requires us to
> start decoding stuff (relfilenode creation) before building a proper
> snapshot is consistent, which we didn't do before - or at least not in
> this particular way. While I don't have a practical example where it
> would cause trouble now, I have a nagging feeling it might easily cause
> trouble in the future by making some new features harder to implement.

I don't understand the issues here well enough to comment. Is there a
good write-up someplace I can read to understand the design here?

Is the rule that changes are transactional if and only if the current
transaction has assigned a new relfilenode to the sequence?

Why does the logic get confused if the state of the snapshot changes?

My naive reaction is that it kinda sounds like you're relying on two
different mistakes cancelling each other out, and that might be a bad
idea, because maybe there's some situation where they don't. But I
don't understand the issue well enough to have an educated opinion at
this point.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-15 Thread Tomas Vondra
On 2/15/24 05:16, Robert Haas wrote:
> On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
>  wrote:
>> The way I think about non-transactional sequence changes is as if they
>> were tiny transactions that happen "fully" (including commit) at the LSN
> where the sequence change is logged.
> 
> 100% this.
> 
>> It does not mean we can arbitrarily reorder the changes, it only means
>> the changes are applied as if they were independent transactions (but in
>> the same order as they were executed originally). Both with respect to
>> the other non-transactional changes, and to "commits" of other stuff.
> 
> Right, this is very important and I agree completely.
> 
> I'm feeling more confident about this now that I heard you say that
> stuff -- this is really the key issue I've been worried about since I
> first looked at this, and I wasn't sure that you were in agreement,
> but it sounds like you are. I think we should (a) fix the locking bug
> I found (but that can be independent of this patch) and (b) make sure
> that this patch documents the points from the quoted material above so
> that everyone who reads the code (and maybe tries to enhance it) is
> clear on what the assumptions are.
> 
> (I haven't checked whether it documents that stuff or not. I'm just
> saying it should, because I think it's a subtlety that someone might
> miss.)
> 

Thanks for thinking about these issues with reordering events. Good
that we seem to be in agreement and that you feel more confident about this.
I'll check if there's a good place to document this.

For me, the part that I feel most uneasy about is the decoding while the
snapshot is still being built (and can flip to consistent snapshot
between the relfilenode creation and sequence change, confusing the
logic that decides which changes are transactional).

It seems "a bit weird" that we either keep the "simple" logic that may
end up with incorrect "non-transactional" result, but happens to then
work fine because we immediately discard the change.

But it still feels better than the alternative, which requires us to
start decoding stuff (relfilenode creation) before building a proper
snapshot is consistent, which we didn't do before - or at least not in
this particular way. While I don't have a practical example where it
would cause trouble now, I have a nagging feeling it might easily cause
trouble in the future by making some new features harder to implement.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2024-02-14 Thread Robert Haas
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
 wrote:
> The way I think about non-transactional sequence changes is as if they
> were tiny transactions that happen "fully" (including commit) at the LSN
> where the sequence change is logged.

100% this.

> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally). Both with respect to
> the other non-transactional changes, and to "commits" of other stuff.

Right, this is very important and I agree completely.

I'm feeling more confident about this now that I heard you say that
stuff -- this is really the key issue I've been worried about since I
first looked at this, and I wasn't sure that you were in agreement,
but it sounds like you are. I think we should (a) fix the locking bug
I found (but that can be independent of this patch) and (b) make sure
that this patch documents the points from the quoted material above so
that everyone who reads the code (and maybe tries to enhance it) is
clear on what the assumptions are.

(I haven't checked whether it documents that stuff or not. I'm just
saying it should, because I think it's a subtlety that someone might
miss.)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-14 Thread Tomas Vondra



On 2/13/24 17:37, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra
>  wrote:
>> Right, locks + apply in commit order gives us this guarantee (I can't
>> think of a case where it wouldn't be the case).
> 
> I couldn't find any cases of inadequate locking other than the one I 
> mentioned.
> 
>> Doesn't the whole logical replication critically depend on the commit
>> order? If you decide to arbitrarily reorder/delay the transactions, all
>> kinds of really bad things can happen. That's a generic problem, it
>> applies to all kinds of objects, not just sequences - a parallel apply
>> would need to detect this sort of dependencies (e.g. INSERT + DELETE of
>> the same key), and do something about it.
> 
> Yes, but here I'm not just talking about the commit order. I'm talking
> about the order of applying non-transactional operations relative to
> commits.
> 
> Consider:
> 
> T1: CREATE SEQUENCE s;
> T2: BEGIN;
> T2: SELECT nextval('s');
> T3: SELECT nextval('s');
> T2: ALTER SEQUENCE s INCREMENT 2;
> T2: SELECT nextval('s');
> T2: COMMIT;
> 

It's not clear to me if you're talking about nextval() that happens to
generate WAL, or nextval() covered by WAL generated by a previous call.

I'm going to assume it's the former, i.e. nextval() that generated WAL
describing the *next* sequence chunk, because without WAL there's
nothing to apply and therefore no issue with T3 ordering.

The way I think about non-transactional sequence changes is as if they
were tiny transactions that happen "fully" (including commit) at the LSN
where the sequence change is logged.


> The commit order is T1 < T3 < T2, but T3 makes no transactional
> changes, so the commit order is really just T1 < T2. But it's
> completely wrong to say that all we need to do is apply T1 before we
> apply T2. The correct order of application is:
> 
> 1. T1.
> 2. T2's first nextval
> 3. T3's nextval
> 4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
> 

Is that quite true? If T3 generated WAL (for the nextval call), it will
be applied at that particular LSN. AFAIK that guarantees it happens
after the first T2 change (which is also non-transactional) and before
the transactional T2 change (because that creates a new relfilenode).

> In other words, the fact that some sequence changes are
> non-transactional creates ordering hazards that don't exist if there
> are no non-transactional changes. So in that way, sequences are
> different from table modifications, where applying the transactions in
> order of commit is all we need to do. Here we need to apply the
> transactions in order of commit and also apply the non-transactional
> changes at the right point in the sequence. Consider the following
> alternative apply sequence:
> 
> 1. T1.
> 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
> 3. T3's nextval
> 4. T2's first nextval
> 
> That's still in commit order. It's also wrong.
> 

Yes, this would be wrong. Thankfully the apply is not allowed to reorder
the changes like this, because that's not what "non-transactional" means
in this context.

It does not mean we can arbitrarily reorder the changes, it only means
the changes are applied as if they were independent transactions (but in
the same order as they were executed originally). Both with respect to
the other non-transactional changes, and to "commits" of other stuff.

(for serial apply, at least)

> Imagine that you commit this patch and someone later wants to do
> parallel logical apply. So every time they finish decoding a
> transaction, they stick it in a queue to be applied by the next
> available worker. But, non-transactional changes are very simple, so
> we just directly apply those in the main process. Well, kaboom! But
> now this can happen with the above example.
> 
> 1. Decode T1. Add to queue for apply.
> 2. Before the (idle) apply worker has a chance to pull T1 out of the
> queue, decode the first nextval and try to apply it.
> 
> Oops. We're trying to apply a modification to a sequence that hasn't
> been created yet. I'm not saying that this kind of hypothetical is a
> reason not to commit the patch. But it seems like we're not on the
> same page about what the ordering requirements are here. I'm just
> making the argument that those non-transactional operations actually
> act like mini-transactions. They need to happen at the right time
> relative to the real transactions. A non-transactional operation needs
> to be applied after any transactions that commit before it is logged,
> and before any transactions that commit after it's logged.
> 

How is this issue specific to sequences? AFAIK this is a general problem
with transactions that depend on each other. Consider for example this:

T1: INSERT INTO t (id) VALUES (1);
T2: DELETE FROM t WHERE id = 1;

If you parallelize this in a naive way, maybe T2 gets applied before T1.
In which case the DELETE won't find the row yet.
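That naive parallelization hazard is easy to demonstrate with a toy
model (all names invented): two workers apply commit-ordered
transactions independently, the DELETE worker happens to run first, and
the end state is wrong.

```python
# Toy illustration of the dependency hazard: handing commit-ordered
# transactions to independent workers without dependency tracking can
# run the DELETE before the INSERT it depends on. Names are invented;
# the sleep just forces the problematic interleaving deterministically.

import threading
import time

table = set()

def apply_t1():              # T1: INSERT INTO t (id) VALUES (1);
    time.sleep(0.05)         # this worker happens to be slow
    table.add(1)

def apply_t2():              # T2: DELETE FROM t WHERE id = 1;
    table.discard(1)         # finds nothing: the row isn't there yet

w1 = threading.Thread(target=apply_t1)
w2 = threading.Thread(target=apply_t2)
w1.start(); w2.start()
w1.join(); w2.join()

# The DELETE was a no-op, so the row survives: the wrong end state.
# A correct parallel apply would have detected the T1 -> T2 dependency
# (same key) and ordered T2 after T1, leaving the table empty.
assert table == {1}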

There's 

Re: logical decoding and replication of sequences, take 2

2024-02-13 Thread Robert Haas
On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra
 wrote:
> Right, locks + apply in commit order gives us this guarantee (I can't
> think of a case where it wouldn't be the case).

I couldn't find any cases of inadequate locking other than the one I mentioned.

> Doesn't the whole logical replication critically depend on the commit
> order? If you decide to arbitrarily reorder/delay the transactions, all
> kinds of really bad things can happen. That's a generic problem, it
> applies to all kinds of objects, not just sequences - a parallel apply
> would need to detect this sort of dependencies (e.g. INSERT + DELETE of
> the same key), and do something about it.

Yes, but here I'm not just talking about the commit order. I'm talking
about the order of applying non-transactional operations relative to
commits.

Consider:

T1: CREATE SEQUENCE s;
T2: BEGIN;
T2: SELECT nextval('s');
T3: SELECT nextval('s');
T2: ALTER SEQUENCE s INCREMENT 2;
T2: SELECT nextval('s');
T2: COMMIT;

The commit order is T1 < T3 < T2, but T3 makes no transactional
changes, so the commit order is really just T1 < T2. But it's
completely wrong to say that all we need to do is apply T1 before we
apply T2. The correct order of application is:

1. T1.
2. T2's first nextval
3. T3's nextval
4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
the subsequent nextval)

In other words, the fact that some sequence changes are
non-transactional creates ordering hazards that don't exist if there
are no non-transactional changes. So in that way, sequences are
different from table modifications, where applying the transactions in
order of commit is all we need to do. Here we need to apply the
transactions in order of commit and also apply the non-transactional
changes at the right point in the sequence. Consider the following
alternative apply sequence:

1. T1.
2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
the subsequent nextval)
3. T3's nextval
4. T2's first nextval

That's still in commit order. It's also wrong.
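One way to state the requirement operationally: replay the stream
strictly in log order, applying each non-transactional sequence change
on the spot (as its own mini-transaction) while buffering transactional
changes until their commit record. A toy simulation of the example
above (event encoding and names invented, Python used only for
illustration):

```python
# Toy replay loop: non-transactional sequence changes are applied at
# the position where they were logged; transactional changes are
# buffered per-xid and applied at commit. Event encoding is invented.

def replay(stream):
    buffered = {}       # xid -> list of transactional changes
    applied = []        # what the subscriber executes, in order
    for kind, xid, payload in stream:   # stream is in LSN order
        if kind == "nontx_seq":
            applied.append(payload)     # apply immediately
        elif kind == "change":
            buffered.setdefault(xid, []).append(payload)
        elif kind == "commit":
            applied.extend(buffered.pop(xid, []))
    return applied

stream = [
    ("change", "T1", "CREATE SEQUENCE s"),
    ("commit", "T1", None),
    ("nontx_seq", "T2", "T2's first nextval"),  # logged before T3's
    ("nontx_seq", "T3", "T3's nextval"),
    ("change", "T2", "ALTER SEQUENCE s INCREMENT 2"),
    ("change", "T2", "T2's post-ALTER nextval"),
    ("commit", "T2", None),
]

# Matches the correct application order 1-4 listed above.
assert replay(stream) == [
    "CREATE SEQUENCE s",
    "T2's first nextval",
    "T3's nextval",
    "ALTER SEQUENCE s INCREMENT 2",
    "T2's post-ALTER nextval",
]
```

Note that "commit order alone" cannot distinguish this from the wrong
ordering: the position of the non-transactional events in the log is
what pins them between the surrounding commits.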

Imagine that you commit this patch and someone later wants to do
parallel logical apply. So every time they finish decoding a
transaction, they stick it in a queue to be applied by the next
available worker. But, non-transactional changes are very simple, so
we just directly apply those in the main process. Well, kaboom! But
now this can happen with the above example.

1. Decode T1. Add to queue for apply.
2. Before the (idle) apply worker has a chance to pull T1 out of the
queue, decode the first nextval and try to apply it.

Oops. We're trying to apply a modification to a sequence that hasn't
been created yet. I'm not saying that this kind of hypothetical is a
reason not to commit the patch. But it seems like we're not on the
same page about what the ordering requirements are here. I'm just
making the argument that those non-transactional operations actually
act like mini-transactions. They need to happen at the right time
relative to the real transactions. A non-transactional operation needs
to be applied after any transactions that commit before it is logged,
and before any transactions that commit after it's logged.

> Yes, I think this is a bug in handling of owned sequences - from the
> moment the "ALTER TABLE ... SET UNLOGGED" is executed, the two sessions
> generate duplicate values (until S1 commits, at which point the
> values generated in S2 get "forgotten").
>
> It seems we end up updating both relfilenodes, which is clearly wrong.
>
> Seems like a bug independent of the decoding, IMO.

Yeah.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-01-27 Thread Tomas Vondra
On 1/26/24 15:39, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra
>  wrote:
>> I did try to explain how this works (and why) in a couple places:
>>
>> 1) the commit message
>> 2) reorderbuffer header comment
>> 3) ReorderBufferSequenceIsTransactional comment (and nearby)
>>
>> It's possible this does not meet your expectations, ofc. Maybe there
>> should be a separate README for this - I haven't found anything like
>> that for logical decoding in general, which is why I did (1)-(3).
> 
> I read over these and I do think they answer a bunch of questions, but
> I don't think they answer all of the questions.
> 
> Suppose T1 creates a sequence and commits. Then T2 calls nextval().
> Then T3 drops the sequence. According to the commit message, T2's
> change will be "replayed immediately after decoding". But it's
> essential to replay T2's change after we replay T1 and before we
> replay T3, and the comments don't explain why that's guaranteed.
> 
> The answer might be "locks". If we always replay a transaction
> immediately when we see it's commit record then in the example above
> we're fine, because the commit record for the transaction that creates
> the sequence must precede the nextval() call, since the sequence won't
> be visible until the transaction commits, and also because T1 holds a
> lock on it at that point sufficient to hedge out nextval. And the
> nextval record must precede the point where T3 takes an exclusive lock
> on the sequence.
> 

Right, locks + apply in commit order gives us this guarantee (I can't
think of a case where it wouldn't be the case).

> Note, however, that this chain of reasoning critically depends on us
> never delaying application of a transaction. If we might reach T1's
> commit record and say "hey, let's hold on to this for a bit and replay
> it after we've decoded some more," everything immediately breaks,
> unless we also delay application of T2's non-transactional update in
> such a way that it's still guaranteed to happen after T1. I wonder if
> this kind of situation would be a problem for a future parallel-apply
> feature. It wouldn't work, for example, to hand T1 and T3 off (in that
> order) to a separate apply process but handle T2's "non-transactional"
> message directly, because it might handle that message before the
> application of T1 got completed.
> 

Doesn't the whole logical replication critically depend on the commit
order? If you decide to arbitrarily reorder/delay the transactions, all
kinds of really bad things can happen. That's a generic problem, it
applies to all kinds of objects, not just sequences - a parallel apply
would need to detect this sort of dependencies (e.g. INSERT + DELETE of
the same key), and do something about it.

Similar for sequences, where the important event is allocation of a new
relfilenode.

If anything, it's easier for sequences, because the relfilenode tracking
gives us an explicit (and easy) way to detect these dependencies between
transactions.
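The "explicit (and easy) way to detect these dependencies" can be
sketched very simply (a toy model with invented structures, not the
patch's actual bookkeeping): remember which transaction last assigned a
relfilenode to each sequence, and require any later non-transactional
change on that sequence to wait for that transaction's apply.

```python
# Sketch of relfilenode-based dependency tracking for a hypothetical
# parallel apply. All structures are invented for illustration.

creator_of = {}   # sequence -> xid that last created its relfilenode
completed = set() # xids whose apply has finished

def note_relfilenode(seq, xid):
    """Record that xid assigned a new relfilenode to seq."""
    creator_of[seq] = xid

def can_apply_nontx_change(seq):
    """A non-transactional change on seq may be applied only once the
    transaction that created its relfilenode has been applied (or if
    no such transaction was observed at all)."""
    xid = creator_of.get(seq)
    return xid is None or xid in completed

note_relfilenode("s", "T1")        # T1: CREATE SEQUENCE s (new node)
assert not can_apply_nontx_change("s")   # must wait for T1's apply
completed.add("T1")                      # T1 finished applying
assert can_apply_nontx_change("s")       # now safe to apply nextval
```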

> This also seems to depend on every transactional operation that might
> affect a future non-transactional operation holding a lock that would
> conflict with that non-transactional operation. For example, if ALTER
> SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence,
> then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART
> WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what
> the semantics of that would be -- would T1's second nextval() see the
> sequence restart, or what? But if the effect of T1's second nextval
> does depend in some way on the ALTER SEQUENCE operation which precedes
> it in the WAL stream, then we might have some trouble here, because
> both nextvals precede the commit of T2. Fortunately, this sequence of
> events is foreclosed by locking.
> 

I don't quite follow :-(

AFAIK this theory hinges on not having the right lock, but I believe
ALTER SEQUENCE does obtain the lock (at least in cases that assign a new
relfilenode). Which means such reordering should not be possible,
because nextval() in other transactions will then wait until commit. And
all nextval() calls in the same transaction will be treated as
transactional.

So I think this works OK. If something does not lock the sequence in a
way that would prevent other xacts from doing nextval() on it, it's not a
change that would change the relfilenode - and so it does not switch the
sequence into a transactional mode.

> But I did find one somewhat-similar case in which that's not so.
> 
> S1: create table withseq (a bigint generated always as identity);
> S1: begin;
> S2: select nextval('withseq_a_seq');
> S1: alter table withseq set unlogged;
> S2: select nextval('withseq_a_seq');
> 
> I think this is a bug in the code that supports owned sequences rather
> than a problem that this patch should have to do something about. When
> a sequence is flipped between logged and unlogged directly, we take a
> stronger lock than we do here when 

Re: logical decoding and replication of sequences, take 2

2024-01-26 Thread Robert Haas
On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra
 wrote:
> I did try to explain how this works (and why) in a couple places:
>
> 1) the commit message
> 2) reorderbuffer header comment
> 3) ReorderBufferSequenceIsTransactional comment (and nearby)
>
> It's possible this does not meet your expectations, ofc. Maybe there
> should be a separate README for this - I haven't found anything like
> that for logical decoding in general, which is why I did (1)-(3).

I read over these and I do think they answer a bunch of questions, but
I don't think they answer all of the questions.

Suppose T1 creates a sequence and commits. Then T2 calls nextval().
Then T3 drops the sequence. According to the commit message, T2's
change will be "replayed immediately after decoding". But it's
essential to replay T2's change after we replay T1 and before we
replay T3, and the comments don't explain why that's guaranteed.

The answer might be "locks". If we always replay a transaction
immediately when we see its commit record then in the example above
we're fine, because the commit record for the transaction that creates
the sequence must precede the nextval() call, since the sequence won't
be visible until the transaction commits, and also because T1 holds a
lock on it at that point sufficient to hedge out nextval. And the
nextval record must precede the point where T3 takes an exclusive lock
on the sequence.

Note, however, that this chain of reasoning critically depends on us
never delaying application of a transaction. If we might reach T1's
commit record and say "hey, let's hold on to this for a bit and replay
it after we've decoded some more," everything immediately breaks,
unless we also delay application of T2's non-transactional update in
such a way that it's still guaranteed to happen after T1. I wonder if
this kind of situation would be a problem for a future parallel-apply
feature. It wouldn't work, for example, to hand T1 and T3 off (in that
order) to a separate apply process but handle T2's "non-transactional"
message directly, because it might handle that message before the
application of T1 got completed.

This also seems to depend on every transactional operation that might
affect a future non-transactional operation holding a lock that would
conflict with that non-transactional operation. For example, if ALTER
SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence,
then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART
WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what
the semantics of that would be -- would T1's second nextval() see the
sequence restart, or what? But if the effect of T1's second nextval
does depend in some way on the ALTER SEQUENCE operation which precedes
it in the WAL stream, then we might have some trouble here, because
both nextvals precede the commit of T2. Fortunately, this sequence of
events is foreclosed by locking.

But I did find one somewhat-similar case in which that's not so.

S1: create table withseq (a bigint generated always as identity);
S1: begin;
S2: select nextval('withseq_a_seq');
S1: alter table withseq set unlogged;
S2: select nextval('withseq_a_seq');

I think this is a bug in the code that supports owned sequences rather
than a problem that this patch should have to do something about. When
a sequence is flipped between logged and unlogged directly, we take a
stronger lock than we do here when it's done in this indirect way.
Also, I'm not quite sure if it would pose a problem for sequence
decoding anyway: it changes the relfilenode, but not the value. But
this is the *kind* of problem that could make the approach unsafe:
supposedly transactional changes being interleaved with supposedly
non-transactional changes, in such a way that the non-transactional
changes might get applied at the wrong time relative to the
transactional changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-01-24 Thread Tomas Vondra



On 1/23/24 21:47, Robert Haas wrote:
> On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra
>  wrote:
>> 1) desirability: We want a built-in way to handle sequences in logical
>> replication. I think everyone agrees this is not a way to do distributed
>> sequences in an active-active setups, but that there are other use cases
>> that need this feature - typically upgrades / logical failover.
> 
> Yeah. I find it extremely hard to take seriously the idea that this
> isn't a valuable feature. How else are you supposed to do a logical
> failover without having your entire application break?
> 
>> 2) performance: There was concern about the performance impact, and that
>> it affects everyone, including those who don't replicate sequences (as
>> the overhead is mostly incurred before calls to output plugin etc.).
>>
>> The agreement was that the best way is to have a CREATE SUBSCRIPTION
>> option that would instruct the upstream to decode sequences. By default
>> this option is 'off' (because that's the no-overhead case), but it can
>> be enabled for each subscription.
> 
> Seems reasonable, at least unless and until we come up with something better.
> 
>> 3) correctness: The last point is about making "transactional" flag
>> correct when the snapshot state changes mid-transaction, originally
>> pointed out by Dilip [4]. Per [5] this however happens to work
>> correctly, because while we identify the change as 'non-transactional'
>> (which is incorrect), we immediately throw it away (so we don't try to
>> apply it, which would error-out).
> 
> I've said this before, but I still find this really scary. It's
> unclear to me that we can simply classify updates as transactional or
> non-transactional and expect things to work. If it's possible, I hope
> we have a really good explanation somewhere of how and why it's
> possible. If we do, can somebody point me to it so I can read it?
> 

I did try to explain how this works (and why) in a couple places:

1) the commit message
2) reorderbuffer header comment
3) ReorderBufferSequenceIsTransactional comment (and nearby)

It's possible this does not meet your expectations, ofc. Maybe there
should be a separate README for this - I haven't found anything like
that for logical decoding in general, which is why I did (1)-(3).

> To be possibly slightly more clear about my concern, I think the scary
> case is where we have transactional and non-transactional things
> happening to the same sequence in close temporal proximity, either
> within the same session or across two or more sessions.  If a
> non-transactional change can get reordered ahead of some transactional
> change upon which it logically depends, or behind some transactional
> change that logically depends on it, then we have trouble. I also
> wonder if there are any cases where the same operation is partly
> transactional and partly non-transactional.
> 

I certainly understand this concern, and to some extent I even share it.
Having to differentiate between transactional and non-transactional
changes certainly confused me more than once. It's especially confusing,
because the decoding implicitly changes the perceived ordering/atomicity
of the events.

That being said, I don't think it gets reordered the way you're
concerned about. The "transactionality" is determined by a relfilenode
change, so how could the reordering happen? We'd have to misidentify a
change in either direction - and for a nontransactional->transactional
change that's clearly not possible. There has to be a new relfilenode
in that xact.

In the other direction (transactional->nontransactional), it can happen
if we fail to decode the relfilenode record. Which is what we discussed
earlier, but came to the conclusion that it actually works OK.

Of course, there might be bugs. I spent quite a bit of effort reviewing
and testing this, but there still might be something wrong. But I think
that applies to any feature.

What would be worse is some sort of thinko in the approach in general. I
don't have a good answer to that, unfortunately - I think it works, but
how would I know for sure? We explored multiple alternative approaches
and all of them crashed and burned ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2024-01-23 Thread Robert Haas
On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra
 wrote:
> 1) desirability: We want a built-in way to handle sequences in logical
> replication. I think everyone agrees this is not a way to do distributed
> sequences in an active-active setups, but that there are other use cases
> that need this feature - typically upgrades / logical failover.

Yeah. I find it extremely hard to take seriously the idea that this
isn't a valuable feature. How else are you supposed to do a logical
failover without having your entire application break?

> 2) performance: There was concern about the performance impact, and that
> it affects everyone, including those who don't replicate sequences (as
> the overhead is mostly incurred before calls to output plugin etc.).
>
> The agreement was that the best way is to have a CREATE SUBSCRIPTION
> option that would instruct the upstream to decode sequences. By default
> this option is 'off' (because that's the no-overhead case), but it can
> be enabled for each subscription.

Seems reasonable, at least unless and until we come up with something better.

> 3) correctness: The last point is about making "transactional" flag
> correct when the snapshot state changes mid-transaction, originally
> pointed out by Dilip [4]. Per [5] this however happens to work
> correctly, because while we identify the change as 'non-transactional'
> (which is incorrect), we immediately throw it away (so we don't try to
> apply it, which would error-out).

I've said this before, but I still find this really scary. It's
unclear to me that we can simply classify updates as transactional or
non-transactional and expect things to work. If it's possible, I hope
we have a really good explanation somewhere of how and why it's
possible. If we do, can somebody point me to it so I can read it?

To be possibly slightly more clear about my concern, I think the scary
case is where we have transactional and non-transactional things
happening to the same sequence in close temporal proximity, either
within the same session or across two or more sessions.  If a
non-transactional change can get reordered ahead of some transactional
change upon which it logically depends, or behind some transactional
change that logically depends on it, then we have trouble. I also
wonder if there are any cases where the same operation is partly
transactional and partly non-transactional.
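
To make the classification rule concrete, here is a toy Python model of it
(hypothetical and heavily simplified, not the actual patch code): a sequence
change is treated as transactional iff the sequence's relfilenode was created
by the same still-in-progress transaction; everything else is emitted
immediately as non-transactional.

```python
# Toy model (not the actual PostgreSQL code) of the patch's classification
# rule: a sequence change is "transactional" iff the sequence's relfilenode
# was created by the same in-progress transaction.

class ReorderBufferModel:
    def __init__(self):
        # xid -> set of relfilenodes created in that transaction
        self.created = {}

    def smgr_create(self, xid, relfilenode):
        # decoded from an XLOG_SMGR_CREATE record
        self.created.setdefault(xid, set()).add(relfilenode)

    def classify_seq_change(self, xid, relfilenode):
        # decoded from a sequence WAL record
        if relfilenode in self.created.get(xid, set()):
            return "transactional"      # queued, replayed at commit
        return "non-transactional"      # sent to the output plugin right away

    def commit_or_abort(self, xid):
        # transaction end: the per-transaction entries are dropped
        self.created.pop(xid, None)


rb = ReorderBufferModel()
rb.smgr_create(xid=500, relfilenode=(1663, 16384, 24576))
print(rb.classify_seq_change(500, (1663, 16384, 24576)))   # transactional
print(rb.classify_seq_change(501, (1663, 16384, 24576)))   # non-transactional
```

The reordering hazard above is then the question of whether this per-xid
lookup can ever disagree with the order in which the changes logically
happened.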

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-21 Thread Tomas Vondra
On 12/15/23 03:33, Amit Kapila wrote:
> On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat
>  wrote:
>>
>> On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila  wrote:
>>>
>>> It can only be cleaned if we process it but xact_decode won't allow us
>>> to process it and I don't think it would be a good idea to add another
>>> hack for sequences here. See below code:
>>>
>>> xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>>> {
>>> SnapBuild  *builder = ctx->snapshot_builder;
>>> ReorderBuffer *reorder = ctx->reorder;
>>> XLogReaderState *r = buf->record;
>>> uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
>>>
>>> /*
>>> * If the snapshot isn't yet fully built, we cannot decode anything, so
>>> * bail out.
>>> */
>>> if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
>>> return;
>>
>> That may be true for a transaction which is decoded, but I think all
>> the transactions which are added to ReorderBuffer should be cleaned up
>> once they have been processed irrespective of whether they are
>> decoded/sent downstream or not. In this case I see the sequence hash
>> being cleaned up for the sequence related transaction in Hayato's
>> reproducer.
>>
> 
> It was because the test you are using was not designed to show the
> problem I mentioned. In this case, the rollback was after a full
> snapshot state was reached.
> 

Right, I haven't tried to reproduce this, but it very much looks like
the entry would not be removed if the xact aborts/commits before the
snapshot reaches FULL state.

I suppose one way to deal with this would be to first check if an entry
for the same relfilenode exists. If it does, the original transaction
must have terminated, but we haven't cleaned it up yet - in which case
we can just "move" the relfilenode to the new one.

However, can't that happen even with full snapshots? I mean, let's say a
transaction creates a relfilenode and terminates without writing an
abort record (surely that's possible, right?). And then another xact
comes and generates the same relfilenode (presumably that's unlikely,
but perhaps possible?). Aren't we in pretty much the same situation,
until the next RUNNING_XACTS cleans up the hash table?


I think tracking all relfilenodes would fix the original issue (with
misclassifying changes as (non-)transactional), and the tweak that "moves" the
relfilenode to the new xact would fix this other issue too.
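
A rough sketch of that "move" tweak (illustrative names only, not a patch
against the actual code): if a relfilenode being created is already tracked
under another xid, that transaction must have terminated without its
commit/abort record being decoded, so the stale entry is simply transferred
to the new transaction.

```python
# Rough, hypothetical sketch of the "move" tweak (illustrative names only):
# a relfilenode cannot be live in two transactions at once, so if a create
# finds an existing entry under another xid, the previous owner must already
# be gone and the entry can be moved to the new transaction.

class RelfilenodeTracker:
    def __init__(self):
        self.owner = {}           # relfilenode -> xid that created it

    def record_create(self, xid, relfilenode):
        # Overwriting an existing entry "moves" the relfilenode to the
        # new transaction, discarding the stale entry.
        self.owner[relfilenode] = xid

    def is_transactional(self, xid, relfilenode):
        return self.owner.get(relfilenode) == xid

    def finish(self, xid):
        # commit/abort cleanup of everything this xid still owns
        self.owner = {rfn: x for rfn, x in self.owner.items() if x != xid}


t = RelfilenodeTracker()
t.record_create(700, "node-A")   # xact 700 aborts; abort record never decoded
t.record_create(701, "node-A")   # same relfilenode reused by xact 701
print(t.is_transactional(701, "node-A"))   # True: moved to the new xact
print(t.is_transactional(700, "node-A"))   # False: the stale entry is gone
```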

That being said, I feel a bit uneasy about it, for similar reasons as
Amit. If we start processing records before full snapshot, that seems
like moving the assumptions a bit. For example it means we'd create
ReorderBufferTXN entries for cases that'd have skipped before. OTOH this
is (or should be) only a very temporary period while starting the
replication, I believe.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-12-21 Thread Tomas Vondra
On 12/19/23 13:54, Christophe Pettus wrote:
> Hi,
> 
> I wanted to hop in here on one particular issue:
> 
>> On Dec 12, 2023, at 02:01, Tomas Vondra  
>> wrote:
>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>> better solution for distributed (esp. active-active) systems. But there
>> are important use cases that are likely to keep using regular sequences
>> (online upgrades of single-node instances, existing systems, ...).
> 
> +1.
> 
> Right now, the lack of sequence replication is a rather large 
> foot-gun on logical replication upgrades.  Copying the sequences
> over during the cutover period is doable, of course, but:
> 
> (a) There's no out-of-the-box tooling that does it, so everyone has
> to write some scripts just for that one function.
>
> (b) It's one more thing that extends the cutover window.
> 

I agree it's an annoying gap for this use case. But if this is the only
use case, maybe a better solution would be to provide such tooling
instead of adding it to the logical decoding?

It might seem a bit strange if most data is copied by replication
directly, while sequences need special handling, ofc.

> I don't think it is a good idea to make it mandatory: for example, 
> there's a strong use case for replicating a table but not a sequence 
> associated with it.  But it's definitely a missing feature in
> logical replication.

I don't think the plan was to make replication of sequences mandatory,
certainly not with the built-in replication. If you don't add sequences
to the publication, the sequence changes will be skipped.

But it still needs to be part of the decoding, which adds overhead for
all logical decoding uses, even if the sequence changes end up being
discarded. That's somewhat annoying, especially considering sequences
are a fairly common part of the WAL stream.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-12-19 Thread Christophe Pettus
Hi,

I wanted to hop in here on one particular issue:

> On Dec 12, 2023, at 02:01, Tomas Vondra  wrote:
> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> better solution for distributed (esp. active-active) systems. But there
> are important use cases that are likely to keep using regular sequences
> (online upgrades of single-node instances, existing systems, ...).

+1.

Right now, the lack of sequence replication is a rather large foot-gun on 
logical replication upgrades.  Copying the sequences over during the cutover 
period is doable, of course, but:

(a) There's no out-of-the-box tooling that does it, so everyone has to write 
some scripts just for that one function.
(b) It's one more thing that extends the cutover window.

I don't think it is a good idea to make it mandatory: for example, there's a 
strong use case for replicating a table but not a sequence associated with it.  
But it's definitely a missing feature in logical replication.



Re: logical decoding and replication of sequences, take 2

2023-12-14 Thread Amit Kapila
On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat
 wrote:
>
> On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila  wrote:
> >
> > It can only be cleaned if we process it but xact_decode won't allow us
> > to process it and I don't think it would be a good idea to add another
> > hack for sequences here. See below code:
> >
> > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> > {
> > SnapBuild  *builder = ctx->snapshot_builder;
> > ReorderBuffer *reorder = ctx->reorder;
> > XLogReaderState *r = buf->record;
> > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
> >
> > /*
> > * If the snapshot isn't yet fully built, we cannot decode anything, so
> > * bail out.
> > */
> > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
> > return;
>
> That may be true for a transaction which is decoded, but I think all
> the transactions which are added to ReorderBuffer should be cleaned up
> once they have been processed irrespective of whether they are
> decoded/sent downstream or not. In this case I see the sequence hash
> being cleaned up for the sequence related transaction in Hayato's
> reproducer.
>

It was because the test you are using was not designed to show the
problem I mentioned. In this case, the rollback was after a full
snapshot state was reached.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-14 Thread Euler Taveira
On Thu, Dec 14, 2023, at 12:44 PM, Ashutosh Bapat wrote:
> I haven't found the code path from where the sequence cleanup gets
> called. But it's being called. Am I missing something?

ReorderBufferCleanupTXN.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: logical decoding and replication of sequences, take 2

2023-12-14 Thread Ashutosh Bapat
On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila  wrote:
>
> On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat
>  wrote:
> >
> > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar  wrote:
> > >
> > > I think you forgot to attach the patch.
> >
> > Sorry. Here it is.
> >
> > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila  wrote:
> > >
> > > >
> > > It looks like the solution works. But this is the only place where we
> > > process a change before SNAPSHOT reaches FULL. But this is also the
> > > only record which affects a decision to queue or not queue a following
> > > change. So it should be ok. The sequence_hash'es are separate for each
> > > transaction and they are cleaned when processing COMMIT record.
> > > >
> > >
> > > But it is possible that even commit or abort also happens before the
> > > snapshot reaches full state in which case the hash table will have
> > > stale or invalid (for aborts) entries. That will probably be cleaned
> > > at a later point by running_xact records.
> >
> > Why would cleaning wait till running_xact records? Won't txn entry
> > itself be removed when processing the commit/abort record? At the same
> > time the sequence hash will be cleaned as well.
> >
> > > Now, I think in theory, it
> > > is possible that the same RelFileLocator can again be allocated before
> > > we clean up the existing entry which can probably confuse the system.
> >
> > How? The transaction allocating the first time would be cleaned before
> > it happens the second time. So shouldn't matter.
> >
>
> It can only be cleaned if we process it but xact_decode won't allow us
> to process it and I don't think it would be a good idea to add another
> hack for sequences here. See below code:
>
> xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> {
> SnapBuild  *builder = ctx->snapshot_builder;
> ReorderBuffer *reorder = ctx->reorder;
> XLogReaderState *r = buf->record;
> uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
>
> /*
> * If the snapshot isn't yet fully built, we cannot decode anything, so
> * bail out.
> */
> if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
> return;

That may be true for a transaction which is decoded, but I think all
the transactions which are added to ReorderBuffer should be cleaned up
once they have been processed irrespective of whether they are
decoded/sent downstream or not. In this case I see the sequence hash
being cleaned up for the sequence related transaction in Hayato's
reproducer. See attached patch with a diagnostic change and the output
below (notice sequence cleanup called on transaction 767).
2023-12-14 21:06:36.756 IST [386957] LOG:  logical decoding found
initial starting point at 0/15B2F68
2023-12-14 21:06:36.756 IST [386957] DETAIL:  Waiting for transactions
(approximately 1) older than 767 to end.
2023-12-14 21:06:36.756 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: smgr_decode. snapshot
is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: seq_decode. snapshot
is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: skipped
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.710 IST [386957] LOG:  logical decoding found
initial consistent point at 0/15B3388
2023-12-14 21:07:05.710 IST [386957] DETAIL:  Waiting for transactions
(approximately 1) older than 768 to end.
2023-12-14 21:07:05.710 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:39.292 IST [386298] LOG:  checkpoint starting: time
2023-12-14 21:07:40.919 IST [386957] LOG:  XXX: seq_decode. snapshot
is SNAPBUILD_FULL_SNAPSHOT
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  XXX: the sequence is transactional
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  sequence cleanup called on
transaction 767
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  logical decoding found
consistent point at 0/15B3518
2023-12-14 21:07:40.919 IST [386957] DETAIL:  There are no running transactions.
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');

We see similar output when pg_logical_slot_get_changes() is called.

I haven't found the code path from where the sequence cleanup gets
called. But it's being called. Am I missing something?

Re: logical decoding and replication of sequences, take 2

2023-12-14 Thread Amit Kapila
On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat
 wrote:
>
> On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar  wrote:
> >
> > I think you forgot to attach the patch.
>
> Sorry. Here it is.
>
> On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila  wrote:
> >
> > >
> > It looks like the solution works. But this is the only place where we
> > process a change before SNAPSHOT reaches FULL. But this is also the
> > only record which affects a decision to queue or not queue a following
> > change. So it should be ok. The sequence_hash'es are separate for each
> > transaction and they are cleaned when processing COMMIT record.
> > >
> >
> > But it is possible that even commit or abort also happens before the
> > snapshot reaches full state in which case the hash table will have
> > stale or invalid (for aborts) entries. That will probably be cleaned
> > at a later point by running_xact records.
>
> Why would cleaning wait till running_xact records? Won't txn entry
> itself be removed when processing the commit/abort record? At the same
> time the sequence hash will be cleaned as well.
>
> > Now, I think in theory, it
> > is possible that the same RelFileLocator can again be allocated before
> > we clean up the existing entry which can probably confuse the system.
>
> How? The transaction allocating the first time would be cleaned before
> it happens the second time. So shouldn't matter.
>

It can only be cleaned if we process it but xact_decode won't allow us
to process it and I don't think it would be a good idea to add another
hack for sequences here. See below code:

xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
SnapBuild  *builder = ctx->snapshot_builder;
ReorderBuffer *reorder = ctx->reorder;
XLogReaderState *r = buf->record;
uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;

/*
* If the snapshot isn't yet fully built, we cannot decode anything, so
* bail out.
*/
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
return;


-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-14 Thread Ashutosh Bapat
On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar  wrote:
>
> I think you forgot to attach the patch.

Sorry. Here it is.

On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila  wrote:
>
> >
> It looks like the solution works. But this is the only place where we
> process a change before SNAPSHOT reaches FULL. But this is also the
> only record which affects a decision to queue or not queue a following
> change. So it should be ok. The sequence_hash'es are separate for each
> transaction and they are cleaned when processing COMMIT record.
> >
>
> But it is possible that even commit or abort also happens before the
> snapshot reaches full state in which case the hash table will have
> stale or invalid (for aborts) entries. That will probably be cleaned
> at a later point by running_xact records.

Why would cleaning wait till running_xact records? Won't txn entry
itself be removed when processing the commit/abort record? At the same
time the sequence hash will be cleaned as well.

> Now, I think in theory, it
> is possible that the same RelFileLocator can again be allocated before
> we clean up the existing entry which can probably confuse the system.

How? The transaction allocating the first time would be cleaned before
it happens the second time. So shouldn't matter.

-- 
Best Wishes,
Ashutosh Bapat
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 73e38cafd8..8e2ebbd8e0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1543,10 +1543,15 @@ smgr_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	elog(LOG, "XXX: smgr_decode. snapshot is %s", SnapBuildIdentify(builder));
 
 	/*
-	 * If we don't have snapshot or we are just fast-forwarding, there is no
-	 * point in decoding relfilenode information.
+	 * If we are not building snapshot yet or we are just fast-forwarding, there
+	 * is no point in decoding relfilenode information.  If the sequence
+	 * associated with relfilenode here is changed in the same transaction but
+	 * after snapshot was built, the relfilenode needs to be present in the
+	 * sequence hash table so that the change can be deemed as transactional.
+	 * Otherwise it will not be queued. Hence we process this change even before
+	 * we have built snapshot.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_BUILDING_SNAPSHOT ||
 		ctx->fast_forward)
 	{
 		elog(LOG, "XXX: skipped");


Re: logical decoding and replication of sequences, take 2

2023-12-14 Thread Amit Kapila
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat
 wrote:
>
> On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar  wrote:
>
> > > >
> > >
> > > It is correct that we can make a wrong decision about whether a change
> > > is transactional or non-transactional when sequence DDL happens before
> > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > > after that state. However, one thing to note here is that we won't try
> > > to stream such a change because for non-transactional cases we don't
> > > proceed unless the snapshot is in a consistent state. Now, if the
> > > decision had been correct then we would probably have queued the
> > > sequence change and discarded at commit.
> > >
> > > One thing that we deviate here is that for non-sequence transactional
> > > cases (including logical messages), we immediately start queuing the
> > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > > SnapBuildProcessChange() returns true which is quite possible) and
> > > take final decision at commit/prepare/abort time. However, that won't
> > > be the case for sequences because of the dependency of determining
> > > transactional cases on one of the prior records. Now, I am not
> > > completely sure at this stage if such a deviation can cause any
> > > problem and or whether we are okay to have such a deviation for
> > > sequences.
> >
> > Okay, so this particular scenario that I raised is somehow saved, I
> > mean although we are considering transactional sequence operation as
> > non-transactional we also know that if some of the changes for a
> > transaction are skipped because the snapshot was not FULL that means
> > that transaction cannot be streamed, because that transaction has to
> > be committed before the snapshot becomes CONSISTENT (based on the snapshot
> > state change machinery).  Ideally, based on the same logic that the
> > snapshot is not consistent, the non-transactional sequence changes are
> > also skipped.  But the only thing that makes me a bit uncomfortable is
> > that even though the result is not wrong, we have made some wrong
> > intermediate decisions, i.e. considered a transactional change as
> > non-transactional.
> >
> > One solution to this issue is that, even if the snapshot state does
> > not reach FULL just add the sequence relids to the hash, I mean that
> > hash is only maintained for deciding whether the sequence is changed
> > in that transaction or not.  So not adding such relids to the hash
> > seems like the root cause of the issue.  Honestly, I haven't analyzed this
> > idea in detail about how easy it would be to add only these changes to
> > the hash and what are the other dependencies, but this seems like a
> > worthwhile direction IMHO.
>
>
...
> It looks like the solution works. But this is the only place where we
> process a change before SNAPSHOT reaches FULL. But this is also the
> only record which affects a decision to queue or not queue a following
> change. So it should be ok. The sequence_hash'es are separate for each
> transaction and they are cleaned when processing COMMIT record.
>


But it is possible that even commit or abort also happens before the
snapshot reaches full state in which case the hash table will have
stale or invalid (for aborts) entries. That will probably be cleaned
at a later point by running_xact records. Now, I think in theory, it
is possible that the same RelFileLocator can again be allocated before
we clean up the existing entry which can probably confuse the system.
It might or might not be a problem in practice but I think the more
assumptions we add for sequences, the more difficult it will become to
ensure its correctness.
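
To make the hazard concrete, a toy illustration (hypothetical, simplified): a
stale entry left behind because an abort record was never decoded still
claims the relfilenode when the same RelFileLocator is later allocated again,
until a running_xacts record finally triggers cleanup.

```python
# Toy illustration (hypothetical, simplified) of the hazard: a stale entry,
# left behind because an abort record was never decoded, still claims the
# relfilenode when the same RelFileLocator is allocated again.

seq_hash = {"rf-X": 767}   # created by xact 767, which aborted; the abort was
                           # skipped (snapshot not yet FULL), so the entry
                           # was never removed

# Later the same RelFileLocator is handed out to a new transaction (770).
# Until cleanup runs, lookups still attribute it to the dead transaction:
print(seq_hash.get("rf-X"))          # 767 - stale owner
print(seq_hash.get("rf-X") == 770)   # False - the new xact is not recognized
```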

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Dilip Kumar
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat
 wrote:
>
> On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar  wrote:
>
> > > >
> > >
> > > It is correct that we can make a wrong decision about whether a change
> > > is transactional or non-transactional when sequence DDL happens before
> > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > > after that state. However, one thing to note here is that we won't try
> > > to stream such a change because for non-transactional cases we don't
> > > proceed unless the snapshot is in a consistent state. Now, if the
> > > decision had been correct then we would probably have queued the
> > > sequence change and discarded at commit.
> > >
> > > One thing that we deviate here is that for non-sequence transactional
> > > cases (including logical messages), we immediately start queuing the
> > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > > SnapBuildProcessChange() returns true which is quite possible) and
> > > take final decision at commit/prepare/abort time. However, that won't
> > > be the case for sequences because of the dependency of determining
> > > transactional cases on one of the prior records. Now, I am not
> > > completely sure at this stage if such a deviation can cause any
> > > problem and or whether we are okay to have such a deviation for
> > > sequences.
> >
> > Okay, so this particular scenario that I raised is somehow saved, I
> > mean although we are considering transactional sequence operation as
> > non-transactional we also know that if some of the changes for a
> > transaction are skipped because the snapshot was not FULL that means
> > that transaction cannot be streamed, because that transaction has to
> > be committed before the snapshot becomes CONSISTENT (based on the snapshot
> > state change machinery).  Ideally, based on the same logic that the
> > snapshot is not consistent, the non-transactional sequence changes are
> > also skipped.  But the only thing that makes me a bit uncomfortable is
> > that even though the result is not wrong, we have made some wrong
> > intermediate decisions, i.e. considered a transactional change as
> > non-transactional.
> >
> > One solution to this issue is that, even if the snapshot state does
> > not reach FULL just add the sequence relids to the hash, I mean that
> > hash is only maintained for deciding whether the sequence is changed
> > in that transaction or not.  So not adding such relids to the hash
> > seems like the root cause of the issue.  Honestly, I haven't analyzed this
> > idea in detail about how easy it would be to add only these changes to
> > the hash and what are the other dependencies, but this seems like a
> > worthwhile direction IMHO.
>
> I also thought about the same solution. I tried this solution as the
> attached patch on top of Hayato's diagnostic changes.

I think you forgot to attach the patch.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Ashutosh Bapat
On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar  wrote:

> > >
> >
> > It is correct that we can make a wrong decision about whether a change
> > is transactional or non-transactional when sequence DDL happens before
> > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > after that state. However, one thing to note here is that we won't try
> > to stream such a change because for non-transactional cases we don't
> > proceed unless the snapshot is in a consistent state. Now, if the
> > decision had been correct then we would probably have queued the
> > sequence change and discarded at commit.
> >
> > One thing that we deviate here is that for non-sequence transactional
> > cases (including logical messages), we immediately start queuing the
> > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > SnapBuildProcessChange() returns true which is quite possible) and
> > take final decision at commit/prepare/abort time. However, that won't
> > be the case for sequences because of the dependency of determining
> > transactional cases on one of the prior records. Now, I am not
> > completely sure at this stage if such a deviation can cause any
> > problem and or whether we are okay to have such a deviation for
> > sequences.
>
> Okay, so this particular scenario that I raised is somehow saved, I
> mean although we are considering transactional sequence operation as
> non-transactional we also know that if some of the changes for a
> transaction are skipped because the snapshot was not FULL that means
> that transaction cannot be streamed, because that transaction has to
> be committed before the snapshot becomes CONSISTENT (based on the snapshot
> state change machinery).  Ideally, based on the same logic that the
> snapshot is not consistent, the non-transactional sequence changes are
> also skipped.  But the only thing that makes me a bit uncomfortable is
> that even though the result is not wrong, we have made some wrong
> intermediate decisions, i.e. considered a transactional change as
> non-transactional.
>
> One solution to this issue is that, even if the snapshot state does
> not reach FULL just add the sequence relids to the hash, I mean that
> hash is only maintained for deciding whether the sequence is changed
> in that transaction or not.  So not adding such relids to the hash seems
> like the root cause of the issue.  Honestly, I haven't analyzed this
> idea in detail about how easy it would be to add only these changes to
> the hash and what are the other dependencies, but this seems like a
> worthwhile direction IMHO.

I also thought about the same solution. I tried this solution as the
attached patch on top of Hayato's diagnostic changes. The following log
messages are seen in the server error log. Those indicate that the
sequence change was correctly deemed as a transactional change (line
2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is
transactional).
2023-12-14 12:12:50.550 IST [321229] ERROR: relation
"pg_replication_slot" does not exist at character 15
2023-12-14 12:12:50.550 IST [321229] STATEMENT: select * from
pg_replication_slot;
2023-12-14 12:12:57.289 IST [321229] LOG: logical decoding found
initial starting point at 0/1598D50
2023-12-14 12:12:57.289 IST [321229] DETAIL: Waiting for transactions
(approximately 1) older than 759 to end.
2023-12-14 12:12:57.289 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: smgr_decode. snapshot
is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: seq_decode. snapshot is
SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: skipped
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.552 IST [321229] LOG: logical decoding found
initial consistent point at 0/1599170
2023-12-14 12:13:49.552 IST [321229] DETAIL: Waiting for transactions
(approximately 1) older than 760 to end.
2023-12-14 12:13:49.552 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.591 IST [321229] LOG: XXX: seq_decode. snapshot is
SNAPBUILD_FULL_SNAPSHOT
2023-12-14 12:14:55.591 IST [321230] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional
2023-12-14 12:14:55.591 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.813 IST [321229] LOG: logical decoding found
consistent point at 0/15992E8
2023-12-14 12:14:55.813 IST [321229] DETAIL: There are no running transactions.
2023-12-14 12:14:55.813 IST 

Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Dilip Kumar
On Wed, Dec 13, 2023 at 6:26 PM Amit Kapila  wrote:
>
> > > But can this even happen? Can we start decoding in the middle of a
> > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> > > messages, where we also call the output plugin in non-transactional cases.
> >
> > It's not a problem for logical messages because whether the message is
> > transactional or non-transactional is decided while WAL-logging the message
> > itself.  But here our problem starts with deciding whether the change
> > is transactional vs non-transactional, because if we insert the
> > 'relfilenode' in hash then the subsequent sequence change in the same
> > transaction would be considered transactional otherwise
> > non-transactional.
> >
>
> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state. However, one thing to note here is that we won't try
> to stream such a change because for non-transactional cases we don't
> proceed unless the snapshot is in a consistent state. Now, if the
> decision had been correct then we would probably have queued the
> sequence change and discarded it at commit.
>
> One way in which we deviate here is that for non-sequence transactional
> cases (including logical messages), we immediately start queuing the
> changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> SnapBuildProcessChange() returns true which is quite possible) and
> take final decision at commit/prepare/abort time. However, that won't
> be the case for sequences because of the dependency of determining
> transactional cases on one of the prior records. Now, I am not
> completely sure at this stage if such a deviation can cause any
> problem, or whether we are okay to have such a deviation for
> sequences.

Okay, so the particular scenario I raised is saved after all: although we
consider a transactional sequence operation as non-transactional, we also
know that if some changes of a transaction were skipped because the
snapshot was not yet FULL, that transaction cannot be streamed, because
it must commit before the snapshot becomes CONSISTENT (per the snapshot
state-change machinery).  And by the same logic - the snapshot is not yet
consistent - the non-transactional sequence changes are skipped as well.
The only thing that makes me a bit uncomfortable is that even though the
result is not wrong, we have made a wrong intermediate decision, i.e.
treated a transactional change as non-transactional.

One solution to this issue is that, even if the snapshot state has not
reached FULL, we just add the sequence relids to the hash; after all, the
hash is maintained only for deciding whether the sequence was changed in
that transaction or not.  So not adding such relids to the hash seems to
be the root cause of the issue.  Honestly, I haven't analyzed in detail
how easy it would be to add only these changes to the hash and what the
other dependencies are, but this seems like a worthwhile direction IMHO.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: logical decoding and replication of sequences, take 2

2023-12-13 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state.

I found a workload for which the decoder makes the wrong decision.

# Prerequisite

Apply the attached patch for inspecting the sequence status. It can be applied
atop the v20231203 patch set.
Also, a table and a sequence must be defined:

```
CREATE TABLE foo (var int);
CREATE SEQUENCE s;
```

# Workload

Then, you can execute concurrent transactions from three clients like below:

Client-1

BEGIN;
INSERT INTO foo VALUES (1);

Client-2

SELECT pg_create_logical_replication_slot('slot', 'test_decoding');

Client-3

BEGIN;
ALTER SEQUENCE s MAXVALUE 5000;
COMMIT;
SAVEPOINT s1;
SELECT setval('s', 2000);
ROLLBACK;

SELECT pg_logical_slot_get_changes('slot', 'test_decoding');

# Result and analysis

At first, the lines below are output to the log. This means that the WAL
records for ALTER SEQUENCE were decoded but skipped because the snapshot
was still being built.

```
...
LOG:  logical decoding found initial starting point at 0/154D238
DETAIL:  Waiting for transactions (approximately 1) older than 741 to end.
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: skipped
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: skipped
...
```

Note that the `seq_decode...` line above was not output by `setval()`; it came
from the ALTER SEQUENCE statement. Below is the call stack for inserting the WAL.

```
XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
fill_seq_fork_with_data
fill_seq_with_data
AlterSequence
```

Then subsequent lines read as follows. This means that the snapshot became
FULL and `setval()` was wrongly regarded as non-transactional.

```
LOG:  logical decoding found initial consistent point at 0/154D658
DETAIL:  Waiting for transactions (approximately 1) older than 742 to end.
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: the sequence is non-transactional
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 
'test_decoding');
LOG:  XXX: not consistent: skipped
```

The change is then discarded by the code below, because the snapshot is not
yet CONSISTENT. Had it been classified as transactional, we would have queued
the change, though the transaction would still be skipped at commit.

```
else if (!transactional &&
		 (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
		  SnapBuildXactNeedsSkip(builder, buf->origptr)))
	return;
```

In any case, we have found a scenario in which we make a wrong decision. This
example happens to be harmless - nothing wrong is output - but I'm not sure
every such case is.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d48d88081f..73e38cafd8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1397,12 +1397,17 @@ seq_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(r), buf->origptr);
 
+	elog(LOG, "XXX: seq_decode. snapshot is %s", SnapBuildIdentify(builder));
+
 	/*
 	 * If we don't have snapshot or we are just fast-forwarding, there is no
 	 * point in decoding sequences.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
+	{
+		elog(LOG, "XXX: skipped");
 		return;
+	}
 
 	/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
@@ -1437,14 +1442,22 @@ seq_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 							 target_locator,
 							 NULL);
 
+	elog(LOG, "XXX: the

Re: logical decoding and replication of sequences, take 2

2023-12-13 Thread Amit Kapila
On Thu, Dec 7, 2023 at 10:41 AM Dilip Kumar  wrote:
>
> On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra
>  wrote:
> >
> > Yes, if something like this happens, that'd be a problem:
> >
> > 1) decoding starts, with
> >
> >SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT
> >
> > 2) transaction that creates a new relfilenode gets decoded, but we skip
> >it because we don't have the correct snapshot
> >
> > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT
> >
> > 4) we decode sequence change from nextval() for the sequence
> >
> > This would lead to us attempting to apply sequence change for a
> > relfilenode that's not visible yet (and may even get aborted).
> >
> > But can this even happen? Can we start decoding in the middle of a
> > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> > messages, where we also call the output plugin in non-transactional cases.
>
> It's not a problem for logical messages because whether the message is
> transactional or non-transactional is decided when the message itself is
> WAL-logged.  But here our problem starts with deciding whether the change
> is transactional vs non-transactional, because if we insert the
> 'relfilenode' in hash then the subsequent sequence change in the same
> transaction would be considered transactional otherwise
> non-transactional.
>

It is correct that we can make a wrong decision about whether a change
is transactional or non-transactional when sequence DDL happens before
the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
after that state. However, one thing to note here is that we won't try
to stream such a change because for non-transactional cases we don't
proceed unless the snapshot is in a consistent state. Now, if the
decision had been correct then we would probably have queued the
sequence change and discarded it at commit.

One way in which we deviate here is that for non-sequence transactional
cases (including logical messages), we immediately start queuing the
changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
SnapBuildProcessChange() returns true which is quite possible) and
take final decision at commit/prepare/abort time. However, that won't
be the case for sequences because of the dependency of determining
transactional cases on one of the prior records. Now, I am not
completely sure at this stage if such a deviation can cause any
problem, or whether we are okay to have such a deviation for
sequences.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-12 Thread Tomas Vondra
Hi,

There's been a lot discussed over the past month or so, and it's become
difficult to get a good idea of the current state - what issues
remain to be solved, what's unrelated to this patch, and how to move it
forward. Long-running threads tend to be confusing, so I had a short
call with Amit to discuss the current state yesterday, and to make sure
we're on the same page. I believe it was very helpful, and I've promised
to post a short summary of the call - issues, what we agreed seems like
a path forward, etc.

Obviously, I might have misunderstood something, in which case Amit can
correct me. And I'd certainly welcome opinions from others.

In general, we discussed three areas - desirability of the feature,
correctness and performance. I believe a brief summary of the agreement
would be this:

- desirability of the feature: Random IDs (UUIDs etc.) are likely a much
better solution for distributed (esp. active-active) systems. But there
are important use cases that are likely to keep using regular sequences
(online upgrades of single-node instances, existing systems, ...).

- correctness: There's one possible correctness issue, when the snapshot
changes to FULL between the record creating a sequence relfilenode and an
advance of that sequence. This needs to be verified/reproduced, and fixed.

- performance issues: We've agreed the case with a lot of aborts (when
DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've
discussed whether the overhead with many sequence changes (nextval-40)
is acceptable, and/or how to improve it.

Next, I'll go over these points in more details, with my understanding
of what the challenges are, possible solutions etc. Most of this was
discussed/agreed on the call, but some are ideas I had only after the
call when writing this summary.


1) desirability of the feature

Firstly, do we actually want/need this feature? I believe that's very
much a question of what use cases we're targeting.

If we only focus on distributed databases (particularly those with
multiple active nodes), then we probably agree that the right solution
is to not use sequences (~generators of incrementing values) but UUIDs
or similar random identifiers (better not call them sequences, there's
not much sequential about it). The huge advantage is this does not
require replicating any state between the nodes, so logical decoding can
simply ignore them and replicate just the generated values. I don't
think there's any argument about that. If I were building such a
distributed system, I'd certainly use such random IDs.

The question is what to do about the other use cases - online upgrades
relying on logical decoding, failovers to logical replicas, and so on.
Or what to do about existing systems that can't be easily changed to use
different/random identifiers. Those are not really distributed systems
and therefore don't quite need random IDs.

Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can
easily lead to massive write amplification in B-tree indexes, for example.
There are variants like UUIDv7 that reduce the impact, but there are other
trade-offs too.

My takeaway from this is there's still value in having this feature.


2) correctness

The only correctness issue I'm aware of is the question what happens
when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding
the relfilenode creation and the sequence increment, pointed out by
Dilip in [1].

If this happens (and while I don't have a reproducer, I also don't have
a very clear idea why it couldn't happen), it breaks how the patch
decides between transactional and non-transactional sequence changes.

So this seems like a fatal flaw - it definitely needs to be solved. I
don't have a good idea how to do that, unfortunately. The problem is the
dependency on an earlier record, and that this needs to be evaluated
immediately (in the decode phase). Logical messages don't have the same
issue because the "transactional" flag does not depend on earlier stuff,
and other records are not interpreted until apply/commit, when we know
everything relevant was decoded.

I don't know what the solution is. Either we find a way to make sure not
to lose/skip the smgr record, or we need to rethink how we determine the
transactional flag (perhaps even try again adding it to the WAL record,
but we didn't find a way to do that earlier).


3) performance issues

We have discussed two cases - "ddl-abort" and "nextval-40".

The "ddl-abort" is when the workload does a lot of DDL and then aborts
them, leading to profiles dominated by DecodeCommit. The agreement here
is that while this is a valid issue and we should try fixing it, it's
unrelated to this patch. The issue exists even on master. So in the
context of this patch we can ignore this issue.

The "nextval-40" applies to workloads doing a lot of regular sequence
changes. We only decode/apply changes written to WAL, and that happens
only for every 32 increments or so. The test was with a very simple
transaction (just 

Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Dilip Kumar
On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra
 wrote:
>
> Yes, if something like this happens, that'd be a problem:
>
> 1) decoding starts, with
>
>SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT
>
> 2) transaction that creates a new relfilenode gets decoded, but we skip
>it because we don't have the correct snapshot
>
> 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT
>
> 4) we decode sequence change from nextval() for the sequence
>
> This would lead to us attempting to apply sequence change for a
> relfilenode that's not visible yet (and may even get aborted).
>
> But can this even happen? Can we start decoding in the middle of a
> transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> messages, where we also call the output plugin in non-transactional cases.

It's not a problem for logical messages because whether the message is
transactional or non-transactional is decided when the message itself is
WAL-logged.  But here our problem starts with deciding whether the change
is transactional vs non-transactional, because if we insert the
'relfilenode' in hash then the subsequent sequence change in the same
transaction would be considered transactional otherwise
non-transactional.  And XLOG_HEAP2_NEW_CID only changes
snapshot->curcid, which affects just the catalog visibility of
subsequent operations in the same transaction. That's not an issue,
because if some changes of this transaction were seen while the
snapbuild state was < SNAPBUILD_FULL_SNAPSHOT, then this transaction has
to commit before the state changes to SNAPBUILD_CONSISTENT_SNAPSHOT,
i.e. the commit LSN of this transaction is going to be <
start_decoding_at.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Amit Kapila
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra
 wrote:
>
> On 12/6/23 12:05, Dilip Kumar wrote:
> > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila  wrote:
> >>
> >>> Why can't we use the same concept of
> >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> >>> non-transactional changes (have some base snapshot before the first
> >>> change), and whenever there is any catalog change, queue new snapshot
> >>> change also in the queue of the non-transactional sequence change so
> >>> that while sending it to downstream whenever it is necessary we will
> >>> change the historic snapshot?
> >>>
> >>
> >> Oh, do you mean maintain different historic snapshots and then switch
> >> based on the change we are processing? I guess the other thing we need
> >> to consider is the order of processing the changes if we maintain
> >> separate queues that need to be processed.
> >
> > I mean we will not specifically maintain the historic changes, but if
> > there is any catalog change where we are pushing the snapshot to all
> > the transaction's change queue, at the same time we will push this
> > snapshot in the non-transactional sequence queue as well.  I am not
> > sure what is the problem with the ordering?
> >

Currently, we set up the historic snapshot before starting a
transaction to process the change and then adapt the updates to it
while processing the changes for the transaction. Now, while
processing this new queue of non-transactional sequence messages, we
probably need a separate snapshot and updates to it. So we either need
some sort of switching between snapshots, or to do this in separate
transactions.

> > because we will be
> > queueing all non-transactional sequence changes in a separate queue in
> > the order they arrive and as soon as we process the next commit we
> > will process all the non-transactional changes at that time.  Do you
> > see issue with that?
> >
>
> Isn't this (in principle) the idea of queuing the non-transactional
> changes and then applying them on the next commit? Yes, I didn't get
> very far with that, but I got stuck exactly on tracking which snapshot
> to use, so if there's a way to do that, that'd fix my issue.
>
> Also, would this mean we don't need to track the relfilenodes, if we're
> able to query the catalog? Would we be able to check if the relfilenode
> was created by the current xact?
>

I thought this new mechanism was for processing a queue of
non-transactional sequence changes. The tracking of relfilenodes is to
distinguish between transactional and non-transactional messages, so I
think we probably still need that.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Dilip Kumar
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra
 wrote:
>
> On 12/6/23 12:05, Dilip Kumar wrote:
> > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila  wrote:
> >>
> >>> Why can't we use the same concept of
> >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> >>> non-transactional changes (have some base snapshot before the first
> >>> change), and whenever there is any catalog change, queue new snapshot
> >>> change also in the queue of the non-transactional sequence change so
> >>> that while sending it to downstream whenever it is necessary we will
> >>> change the historic snapshot?
> >>>
> >>
> >> Oh, do you mean maintain different historic snapshots and then switch
> >> based on the change we are processing? I guess the other thing we need
> >> to consider is the order of processing the changes if we maintain
> >> separate queues that need to be processed.
> >
> > I mean we will not specifically maintain the historic changes, but if
> > there is any catalog change where we are pushing the snapshot to all
> > the transaction's change queue, at the same time we will push this
> > snapshot in the non-transactional sequence queue as well.  I am not
> > sure what is the problem with the ordering? because we will be
> > queueing all non-transactional sequence changes in a separate queue in
> > the order they arrive and as soon as we process the next commit we
> > will process all the non-transactional changes at that time.  Do you
> > see issue with that?
> >
>
> Isn't this (in principle) the idea of queuing the non-transactional
> changes and then applying them on the next commit?

Yes, it is.

 Yes, I didn't get
> very far with that, but I got stuck exactly on tracking which snapshot
> to use, so if there's a way to do that, that'd fix my issue.

Thinking more about the snapshot issue: do we even need to bother
changing the snapshot while streaming the non-transactional sequence
changes, or can we send all the non-transactional changes with a single
snapshot? The snapshot logically changes due to two events. Case 1: a
transaction that performed a catalog operation commits (this changes the
global snapshot). Case 2: within a transaction there is some catalog
change (this just updates the 'curcid' in the transaction's base
snapshot).

Now, if we stream all the non-transactional sequence changes right
before the next commit, then case 1 does not concern us at all, because
all the changes we have queued so far precede that commit.  As for case
2: if we perform a catalog change on the sequence itself, the following
changes to that sequence will be considered transactional; and if the
change is on some other catalog (not relevant to our sequence
operation), then we need not worry about the command_id change either,
because the visibility of the catalog lookup for our sequence is
unaffected by it.

In short, I am trying to say that we can safely queue the
non-transactional sequence changes and stream them based on the
snapshot we got when we decode the first change, and as long as we are
planning to stream just before the next commit (or next in-progress
stream), we don't ever need to update the snapshot.

> Also, would this mean we don't need to track the relfilenodes, if we're
> able to query the catalog? Would we be able to check if the relfilenode
> was created by the current xact?

I think by querying the catalog and checking the xmin we should be able
to figure that out, but isn't that costlier than looking up the
relfilenode in the hash?  Just to identify whether a change is
transactional or non-transactional you would have to query the catalog.
That means for each change, before deciding whether to add it to the
transaction's change queue or to the non-transactional change queue, we
would have to query the catalog, i.e. start/stop a transaction.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Amit Kapila
On Wed, Dec 6, 2023 at 7:20 PM Tomas Vondra
 wrote:
>
> On 12/6/23 11:19, Amit Kapila wrote:
> > On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra
> >  wrote:
> >>
> >> On 12/3/23 18:52, Tomas Vondra wrote:
> >>> ...
> >>>
> >>
> >> Another idea is that maybe we could somehow inform ReorderBuffer whether
> >> the output plugin even is interested in sequences. That'd help with
> >> cases where we don't even want/need to replicate sequences, e.g. because
> >> the publication does not specify (publish=sequence).
> >>
> >> What happens now in that case is we call ReorderBufferQueueSequence(),
> >> it does the whole dance with starting/aborting the transaction, calls
> >> rb->sequence() which just does "meh" and doesn't do anything. Maybe we
> >> could just short-circuit this by asking the output plugin somehow.
> >>
> >> In an extreme case the plugin may not even specify the sequence
> >> callbacks, and we're still doing all of this.
> >>
> >
> > We could explore this but I guess it won't solve the problem we are
> > facing in cases where all sequences are published and plugin has
> > specified the sequence callbacks. I think it would add the overhead
> > of this check in the positive cases, where we decide to send the
> > changes anyway.
>
> Well, the idea is the check would be very simple (essentially just a
> boolean flag somewhere), so not really measurable.
>
> And if the plugin requests decoding sequences, I guess it's natural it
> may have a bit of overhead. It needs to do more things, after all. It
> needs to be acceptable, ofc.
>

I agree with you that if it can be done cheaply, without measurable
overhead, then it would be a good idea and could serve other purposes as
well (for example, see the discussion in [1]). I had in mind more of
what the patch in email [1] is doing, where it needs to start/stop a
xact, do relcache access, etc., which seems like it could add some
overhead if done for each change, though I haven't measured it so I
can't be sure.

[1] - 
https://www.postgresql.org/message-id/CAGfChW5Qo2SrjJ7rU9YYtZbRaWv6v-Z8MJn%3DdQNx4uCSqDEOHA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Tomas Vondra
On 12/6/23 09:56, Amit Kapila wrote:
> On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra
>  wrote:
>>
>> On 12/5/23 13:17, Amit Kapila wrote:
>>
>>> (b) for transactional
>>> cases, we see overhead due to traversing all the top-level txns and
>>> check the hash table for each one to find whether change is
>>> transactional.
>>>
>>
>> Not really, no. As I explained in my preceding e-mail, this check makes
>> almost no difference - I did expect it to matter, but it doesn't. And I
>> was a bit disappointed the global hash table didn't move the needle.
>>
>> Most of the time is spent in
>>
>> 78.81% 0.00%  postgres  postgres  [.] DecodeCommit (inlined)
>>   |
>>   ---DecodeCommit (inlined)
>>  |
>>  |--72.65%--SnapBuildCommitTxn
>>  | |
>>  |  --72.61%--SnapBuildBuildSnapshot
>>  ||
>>  | --72.09%--pg_qsort
>>  ||
>>  ||--66.24%--pg_qsort
>>  ||  |
>>
>> And there's almost no difference between master and build with sequence
>> decoding - see the attached diff-alter-sequence.perf, comparing the two
>> branches (perf diff -c delta-abs).
>>
> 
> I think in this case the commit time predominates, which hides the
> overhead. We didn't investigate in detail whether that can be improved,
> but if we look at a similar case with aborts [1], it shows the overhead
> of ReorderBufferSequenceIsTransactional(). I understand that aborts
> won't be frequent and it is a somewhat unrealistic test, but it still
> helps to show that there is overhead in
> ReorderBufferSequenceIsTransactional(). Now, I am not sure if we can
> ignore that case, because theoretically the overhead can increase based
> on the number of top-level transactions.
> 
> [1]: 
> https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com
> 

But those profiles were with the "old" patch, with one hash table per
top-level transaction. I see nothing like that with the patch [1] that
replaces that with a single global hash table. With that patch, the
ReorderBufferSequenceIsTransactional() took ~0.5% in any tests I did.

What did have bigger impact is this:

46.12%   1.47%  postgres [.] pg_logical_slot_get_changes_guts
  |
  |--45.12%--pg_logical_slot_get_changes_guts
  ||
  ||--42.34%--LogicalDecodingProcessRecord
  |||
  |||--12.82%--xact_decode
  ||||
  ||||--9.46%--DecodeAbort (inlined)
  ||||   |
  ||||   |--8.44%--ReorderBufferCleanupTXN
  ||||   |   |
  ||||   |   |--3.25%--ReorderBufferSequenceCleanup (inlined)
  ||||   |   |   |
  ||||   |   |   |--1.59%--hash_seq_search
  ||||   |   |   |
  ||||   |   |   |--0.80%--hash_search_with_hash_value
  ||||   |   |   |
  ||||   |   |--0.59%--hash_search
  ||||   |   |  hash_bytes

I guess that could be optimized, but it's also a direct consequence of
the huge number of aborts for transactions that create a relfilenode. For
any other workload this will be negligible.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Tomas Vondra
On 12/6/23 11:19, Amit Kapila wrote:
> On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra
>  wrote:
>>
>> On 12/3/23 18:52, Tomas Vondra wrote:
>>> ...
>>>
>>
>> Another idea is that maybe we could somehow inform ReorderBuffer whether
>> the output plugin even is interested in sequences. That'd help with
>> cases where we don't even want/need to replicate sequences, e.g. because
>> the publication does not specify (publish=sequence).
>>
>> What happens now in that case is we call ReorderBufferQueueSequence(),
>> it does the whole dance with starting/aborting the transaction, calls
>> rb->sequence() which just does "meh" and doesn't do anything. Maybe we
>> could just short-circuit this by asking the output plugin somehow.
>>
>> In an extreme case the plugin may not even specify the sequence
>> callbacks, and we're still doing all of this.
>>
> 
> We could explore this but I guess it won't solve the problem we are
> facing in cases where all sequences are published and plugin has
> specified the sequence callbacks. I think it would add the overhead
> of this check in the positive cases, where we decide to send the
> changes anyway.

Well, the idea is the check would be very simple (essentially just a
boolean flag somewhere), so not really measurable.

And if the plugin requests decoding sequences, I guess it's natural it
may have a bit of overhead. It needs to do more things, after all. It
needs to be acceptable, ofc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Tomas Vondra
On 12/6/23 12:05, Dilip Kumar wrote:
> On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila  wrote:
>>
>>> Why can't we use the same concept of
>>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
>>> non-transactional changes (have some base snapshot before the first
>>> change), and whenever there is any catalog change, queue new snapshot
>>> change also in the queue of the non-transactional sequence change so
>>> that while sending it to downstream whenever it is necessary we will
>>> change the historic snapshot?
>>>
>>
>> Oh, do you mean maintain different historic snapshots and then switch
>> based on the change we are processing? I guess the other thing we need
>> to consider is the order of processing the changes if we maintain
>> separate queues that need to be processed.
> 
> I mean we will not specifically maintain the historic changes, but if
> there is any catalog change where we are pushing the snapshot to all
> the transaction's change queue, at the same time we will push this
> snapshot in the non-transactional sequence queue as well.  I am not
> sure what is the problem with the ordering? because we will be
> queueing all non-transactional sequence changes in a separate queue in
> the order they arrive and as soon as we process the next commit we
> will process all the non-transactional changes at that time.  Do you
> see issue with that?
> 

Isn't this (in principle) the idea of queuing the non-transactional
changes and then applying them on the next commit? Yes, I didn't get
very far with that, but I got stuck exactly on tracking which snapshot
to use, so if there's a way to do that, that'd fix my issue.

Also, would this mean we don't need to track the relfilenodes, if we're
able to query the catalog? Would we be able to check if the relfilenode
was created by the current xact?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Tomas Vondra
On 12/6/23 10:05, Dilip Kumar wrote:
> On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar  wrote:
>>
>> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
>>  wrote:
>>>
> 
> I was also wondering what happens if the sequence changes are
> transactional but the snap builder state changes to
> SNAPBUILD_FULL_SNAPSHOT between the processing of smgr_decode() and
> seq_decode(). That means the RelFileLocator will not be added to the
> hash table, and during seq_decode() we will consider the change
> non-transactional.  I haven't fully analyzed what the real problem is
> in this case, but have we considered it? What happens if a transaction
> having both ALTER SEQUENCE and nextval() gets aborted, but the
> nextval() has been considered non-transactional because the
> smgr_decode() changes were not processed while the snap builder state
> was not yet SNAPBUILD_FULL_SNAPSHOT?
> 

Yes, if something like this happens, that'd be a problem:

1) decoding starts, with

   SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT

2) transaction that creates a new refilenode gets decoded, but we skip
   it because we don't have the correct snapshot

3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT

4) we decode sequence change from nextval() for the sequence

This would lead to us attempting to apply sequence change for a
relfilenode that's not visible yet (and may even get aborted).

But can this even happen? Can we start decoding in the middle of a
transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
messages, where we also call the output plugin in non-transactional cases.
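
The hazard in steps (1)-(4) can be sketched with a toy model (plain
Python, not PostgreSQL code; the class and method names merely mirror
the discussion and are assumptions):

```python
# Toy model of the race discussed above: the smgr record that creates a
# sequence's relfilenode is decoded *before* the snapshot builder
# reaches FULL_SNAPSHOT, so it never lands in the hash table, and the
# later sequence record is then misclassified as non-transactional.
from enum import Enum, auto

class SnapState(Enum):
    START = auto()
    FULL_SNAPSHOT = auto()

class Decoder:
    def __init__(self):
        self.state = SnapState.START
        self.created_relfilenodes = {}  # relfilenode -> creating xid

    def smgr_decode(self, xid, relfilenode):
        if self.state.value < SnapState.FULL_SNAPSHOT.value:
            return  # skipped: snapshot not consistent enough yet
        self.created_relfilenodes[relfilenode] = xid

    def seq_decode(self, xid, relfilenode):
        # transactional iff the relfilenode was created by an in-progress xact
        return ("transactional" if relfilenode in self.created_relfilenodes
                else "non-transactional")

d = Decoder()
d.smgr_decode(xid=100, relfilenode=5001)  # ALTER SEQUENCE: record skipped!
d.state = SnapState.FULL_SNAPSHOT         # snapshot flips mid-transaction
print(d.seq_decode(xid=100, relfilenode=5001))  # -> non-transactional (wrong)
```

Whether decoding can actually flip states in the middle of such a
transaction is exactly the open question above.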


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Dilip Kumar
On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila  wrote:
>
> > Why can't we use the same concept of
> > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> > non-transactional changes (have some base snapshot before the first
> > change), and whenever there is any catalog change, queue new snapshot
> > change also in the queue of the non-transactional sequence change so
> > that while sending it to downstream whenever it is necessary we will
> > change the historic snapshot?
> >
>
> Oh, do you mean maintain different historic snapshots and then switch
> based on the change we are processing? I guess the other thing we need
> to consider is the order of processing the changes if we maintain
> separate queues that need to be processed.

I mean we will not specifically maintain historic snapshots, but
whenever there is a catalog change and we are pushing the snapshot to
every transaction's change queue, we will push this snapshot into the
non-transactional sequence queue as well.  I am not sure what the
problem with the ordering is: we will queue all non-transactional
sequence changes in a separate queue in the order they arrive, and as
soon as we process the next commit we will process all the
non-transactional changes at that point.  Do you see an issue with
that?
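
If it helps, here is a toy sketch of the single-queue idea (illustrative
Python only, not reorderbuffer code; all names are made up): sequence
changes and snapshot markers share one ordered queue, and the drain at
commit switches the historic snapshot whenever a marker is reached.

```python
# Non-transactional sequence changes and snapshot markers in one queue,
# drained in arrival order at the next commit.
from collections import deque

queue = deque()

def queue_sequence_change(seq, value):
    queue.append(("change", seq, value))

def distribute_new_snapshot(snap_id):
    # analogous to SnapBuildDistributeNewCatalogSnapshot() also pushing
    # the snapshot into the non-transactional queue
    queue.append(("snapshot", snap_id, None))

def drain_on_commit(initial_snap):
    current_snap, applied = initial_snap, []
    while queue:
        kind, a, b = queue.popleft()
        if kind == "snapshot":
            current_snap = a          # switch historic snapshot
        else:
            applied.append((a, b, current_snap))
    return applied

queue_sequence_change("s1", 33)
distribute_new_snapshot("snap-2")     # catalog change in between
queue_sequence_change("s1", 65)
print(drain_on_commit("snap-1"))
# -> [('s1', 33, 'snap-1'), ('s1', 65, 'snap-2')]
```

Because there is exactly one queue, a change can never overtake the
snapshot that should apply to it, which is the ordering property being
claimed here.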

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Amit Kapila
On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra
 wrote:
>
> On 12/3/23 18:52, Tomas Vondra wrote:
> > ...
> >
>
> Another idea is that maybe we could somehow inform ReorderBuffer whether
> the output plugin even is interested in sequences. That'd help with
> cases where we don't even want/need to replicate sequences, e.g. because
> the publication does not specify (publish=sequence).
>
> What happens now in that case is we call ReorderBufferQueueSequence(),
> it does the whole dance with starting/aborting the transaction, calls
> rb->sequence() which just does "meh" and doesn't do anything. Maybe we
> could just short-circuit this by asking the output plugin somehow.
>
> In an extreme case the plugin may not even specify the sequence
> callbacks, and we're still doing all of this.
>

We could explore this, but I guess it won't solve the problem we are
facing in cases where all sequences are published and the plugin has
specified the sequence callbacks. I think it would add some overhead
for this check in the positive cases where we decide to send the
changes anyway.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Amit Kapila
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar  wrote:
>
> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
>  wrote:
> >
>
> > Some time ago I floated the idea of maybe "queuing" the sequence changes
> > and only replaying them on the next commit, somehow. But we did run into
> > problems with which snapshot to use, that I didn't know how to solve.
> > Maybe we should try again. The idea is we'd queue the non-transactional
> > changes somewhere (can't be in the transaction, because we must keep
> > them even if it aborts), and then "inject" them into the next commit.
> > That'd mean we wouldn't do the separate start/abort for each change.
>
> Why can't we use the same concept of
> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> non-transactional changes (have some base snapshot before the first
> change), and whenever there is any catalog change, queue new snapshot
> change also in the queue of the non-transactional sequence change so
> that while sending it to downstream whenever it is necessary we will
> change the historic snapshot?
>

Oh, do you mean maintain different historic snapshots and then switch
based on the change we are processing? I guess the other thing we need
to consider is the order of processing the changes if we maintain
separate queues that need to be processed.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Dilip Kumar
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar  wrote:
>
> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
>  wrote:
> >

I was also wondering what happens if the sequence changes are
transactional but the snap builder state changes to
SNAPBUILD_FULL_SNAPSHOT between the processing of smgr_decode() and
seq_decode(). That means the RelFileLocator will not be added to the
hash table, and during seq_decode() we will consider the change
non-transactional.  I haven't fully analyzed what the real problem is
in this case, but have we considered it? What happens if a transaction
having both ALTER SEQUENCE and nextval() gets aborted, but the
nextval() has been considered non-transactional because the
smgr_decode() changes were not processed while the snap builder state
was not yet SNAPBUILD_FULL_SNAPSHOT?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-06 Thread Amit Kapila
On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra
 wrote:
>
> On 12/5/23 13:17, Amit Kapila wrote:
>
> > (b) for transactional
> > cases, we see overhead due to traversing all the top-level txns and
> > check the hash table for each one to find whether change is
> > transactional.
> >
>
> Not really, no. As I explained in my preceding e-mail, this check makes
> almost no difference - I did expect it to matter, but it doesn't. And I
> was a bit disappointed the global hash table didn't move the needle.
>
> Most of the time is spent in
>
> 78.81% 0.00%  postgres  postgres  [.] DecodeCommit (inlined)
>   |
>   ---DecodeCommit (inlined)
>  |
>  |--72.65%--SnapBuildCommitTxn
>  | |
>  |  --72.61%--SnapBuildBuildSnapshot
>  ||
>  | --72.09%--pg_qsort
>  ||
>  ||--66.24%--pg_qsort
>  ||  |
>
> And there's almost no difference between master and build with sequence
> decoding - see the attached diff-alter-sequence.perf, comparing the two
> branches (perf diff -c delta-abs).
>

I think in this case the commit time predominates, which hides the
overhead. We didn't investigate in detail whether that can be improved,
but if we look at a similar abort case [1], it shows the overhead of
ReorderBufferSequenceIsTransactional(). I understand that aborts won't
be frequent and it is a somewhat unrealistic test, but it still helps
to show that there is overhead in
ReorderBufferSequenceIsTransactional(). Now, I am not sure if we can
ignore that case, because theoretically the overhead can increase based
on the number of top-level transactions.

[1]: 
https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-05 Thread Dilip Kumar
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
 wrote:
>

> Some time ago I floated the idea of maybe "queuing" the sequence changes
> and only replaying them on the next commit, somehow. But we did run into
> problems with which snapshot to use, that I didn't know how to solve.
> Maybe we should try again. The idea is we'd queue the non-transactional
> changes somewhere (can't be in the transaction, because we must keep
> them even if it aborts), and then "inject" them into the next commit.
> That'd mean we wouldn't do the separate start/abort for each change.

Why can't we use the same concept as
SnapBuildDistributeNewCatalogSnapshot()? I mean, we keep queuing the
non-transactional changes (with some base snapshot before the first
change), and whenever there is a catalog change, we queue the new
snapshot into the queue of non-transactional sequence changes as well,
so that while sending them downstream we can switch the historic
snapshot whenever necessary.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-12-05 Thread Tomas Vondra
On 12/5/23 13:17, Amit Kapila wrote:
> ...
>> I was hopeful the global hash table would be an improvement, but that
>> doesn't seem to be the case. I haven't done much profiling yet, but I'd
>> guess most of the overhead is due to ReorderBufferQueueSequence()
>> starting and aborting a transaction in the non-transactional case. Which
>> is unfortunate, but I don't know if there's a way to optimize that.
>>
> 
> Before discussing the alternative ideas you shared, let me try to
> clarify my understanding so that we are on the same page. I see two
> observations based on the testing and discussion we had (a) for
> non-transactional cases, the overhead observed is mainly due to
> starting/aborting a transaction for each change;

Yes, I believe that's true. See the attached profiles for nextval.sql
and nextval-40.sql from master and optimized build (with the global
hash), and also a perf-diff. I only include the top 1000 lines for each
profile, that should be enough.

master - current master without patches applied
optimized - master + sequence decoding with global hash table

For nextval, there's almost no difference in the profile. Decoding the
other changes (inserts) is the dominant part, as we only log sequences
every 32 increments.
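
For context, the "every 32 increments" behavior can be modeled roughly
like this (a simplified Python sketch of the server's SEQ_LOG_VALS
pre-logging; the real sequence code also logs the value ahead, which
this toy model omits):

```python
# Why plain nextval() generates so little sequence WAL: a record is
# emitted only about once per SEQ_LOG_VALS (32) values.
SEQ_LOG_VALS = 32

class Seq:
    def __init__(self):
        self.last = 0
        self.log_cnt = 0
        self.wal_records = 0

    def nextval(self):
        self.last += 1
        if self.log_cnt == 0:
            self.wal_records += 1          # emit one WAL record covering
            self.log_cnt = SEQ_LOG_VALS    # the next 32 values
        self.log_cnt -= 1
        return self.last

s = Seq()
for _ in range(1000):
    s.nextval()
print(s.wal_records)  # -> 32 records for 1000 nextval() calls
```

So a nextval-only workload produces roughly one sequence WAL record per
32 calls, which is why the insert decoding dominates the profile.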

For nextval-40, the main increase is likely due to this part

  |--11.09%--seq_decode
  | |
  | |--9.25%--ReorderBufferQueueSequence
  | | |
  | | |--3.56%--AbortCurrentTransaction
  | | ||
  | | | --3.53%--AbortSubTransaction
  | | ||
  | | ||--0.95%--AtSubAbort_Portals
  | | ||  |
  | | ||   --0.83%--hash_seq_search
  | | ||
  | | | --0.83%--ResourceOwnerReleaseInternal
  | | |
  | | |--2.06%--BeginInternalSubTransaction
  | | |  |
  | | |   --1.10%--CommitTransactionCommand
  | | | |
  | | |  --1.07%--StartSubTransaction
  | | |
  | | |--1.28%--CleanupSubTransaction
  | | |  |
  | | |   --0.64%--AtSubCleanup_Portals
  | | | |
  | | |  --0.55%--hash_seq_search
  | | |
  | |  --0.67%--RelidByRelfilenumber

So yeah, that's the transaction stuff in ReorderBufferQueueSequence.

There's also a perf-diff, comparing individual functions.

> (b) for transactional
> cases, we see overhead due to traversing all the top-level txns and
> check the hash table for each one to find whether change is
> transactional.
> 

Not really, no. As I explained in my preceding e-mail, this check makes
almost no difference - I did expect it to matter, but it doesn't. And I
was a bit disappointed the global hash table didn't move the needle.

Most of the time is spent in

78.81% 0.00%  postgres  postgres  [.] DecodeCommit (inlined)
  |
  ---DecodeCommit (inlined)
 |
 |--72.65%--SnapBuildCommitTxn
 | |
 |  --72.61%--SnapBuildBuildSnapshot
 ||
 | --72.09%--pg_qsort
 ||
 ||--66.24%--pg_qsort
 ||  |

And there's almost no difference between master and build with sequence
decoding - see the attached diff-alter-sequence.perf, comparing the two
branches (perf diff -c delta-abs).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

alter-sequence-master.perf.gz
Description: application/gzip


alter-sequence-optimized.perf.gz
Description: application/gzip


diff-alter-sequence.perf.gz
Description: application/gzip


diff-nextval.perf.gz
Description: application/gzip


diff-nextval-40.perf.gz
Description: application/gzip


nextval-40-master.perf.gz
Description: application/gzip


nextval-40-optimized.perf.gz
Description: application/gzip


nextval-master.perf.gz
Description: application/gzip


nextval-optimized.perf.gz
Description: application/gzip


Re: logical decoding and replication of sequences, take 2

2023-12-05 Thread Amit Kapila
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
 wrote:
>
> Thanks for the script. Are you also measuring the time it takes to
> decode this using test_decoding?
>
> FWIW I did more comprehensive suite of tests over the weekend, with a
> couple more variations. I'm attaching the updated scripts, running it
> should be as simple as
>
>   ./run.sh BRANCH TRANSACTIONS RUNS
>
> so perhaps
>
>   ./run.sh master 1000 3
>
> to do 3 runs with 1000 transactions per client. And it'll run a bunch of
> combinations hard-coded in the script, and write the timings into a CSV
> file (with "master" in each row).
>
> I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I
> did this with current master, the basic patch (without the 0002 part),
> and then with the optimized approach (single global hash table, see the
> 0004 part). That's what master / patched / optimized in the results is.
>
> Interestingly enough, the i5 handled this much faster, it seems to be
> better in single-core tasks. The xeon is still running, so the results
> for "optimized" only have one run (out of 3), but shouldn't change much.
>
> Attached is also a table summarizing this, and visualizing the timing
> change (vs. master) in the last couple columns. Green is "faster" than
> master (but we don't really expect that), and "red" means slower than
> master (the more red, the slower).
>
> The results are grouped by script (see the attached .tgz), with either
> 32 or 96 clients (which does affect the timing, but not between master
> and patch). Some executions have no pg_sleep() calls, some have 0.001
> wait (but that doesn't seem to make much difference).
>
> Overall, I'd group the results into about three groups:
>
> 1) good cases [nextval, nextval-40, nextval-abort]
>
> These are cases that slow down a bit, but the slowdown is mostly within
> reasonable bounds (we're making the decoding to do more stuff, so it'd
> be a bit silly to require that extra work to make no impact). And I do
> think this is reasonable, because this is pretty much an extreme / worst
> case behavior. People don't really do just nextval() calls, without
> doing anything else. Not to mention doing aborts for 100% transactions.
>
> So in practice this is going to be within noise (and in those cases the
> results even show speedup, which seems a bit surprising). It's somewhat
> dependent on CPU too - on xeon there's hardly any regression.
>
>
> 2) nextval-40-abort
>
> Here the slowdown is clear, but I'd argue it generally falls in the same
> group as (1). Yes, I'd be happier if it didn't behave like this, but if
> someone can show me a practical workload affected by this ...
>
>
> 3) irrelevant cases [all the alters taking insane amounts of time]
>
> I absolutely refuse to care about these extreme cases where decoding
> 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on
> xeon). If this was a problem for some practical workload, we'd have
> already heard about it I guess. And even if there was such workload, it
> wouldn't be up to this patch to fix that. There's clearly something
> misbehaving in the snapshot builder.
>
>
> I was hopeful the global hash table would be an improvement, but that
> doesn't seem to be the case. I haven't done much profiling yet, but I'd
> guess most of the overhead is due to ReorderBufferQueueSequence()
> starting and aborting a transaction in the non-transactional case. Which
> is unfortunate, but I don't know if there's a way to optimize that.
>

Before discussing the alternative ideas you shared, let me try to
clarify my understanding so that we are on the same page. I see two
observations based on the testing and discussion we had (a) for
non-transactional cases, the overhead observed is mainly due to
starting/aborting a transaction for each change; (b) for transactional
cases, we see overhead due to traversing all the top-level txns and
check the hash table for each one to find whether change is
transactional.

Am, I missing something?

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-12-03 Thread Tomas Vondra
On 12/3/23 18:52, Tomas Vondra wrote:
> ...
> 
> Some time ago I floated the idea of maybe "queuing" the sequence changes
> and only replaying them on the next commit, somehow. But we did run into
> problems with which snapshot to use, that I didn't know how to solve.
> Maybe we should try again. The idea is we'd queue the non-transactional
> changes somewhere (can't be in the transaction, because we must keep
> them even if it aborts), and then "inject" them into the next commit.
> That'd mean we wouldn't do the separate start/abort for each change.
> 

Another idea is that maybe we could somehow inform ReorderBuffer whether
the output plugin even is interested in sequences. That'd help with
cases where we don't even want/need to replicate sequences, e.g. because
the publication does not specify (publish=sequence).

What happens now in that case is we call ReorderBufferQueueSequence(),
it does the whole dance with starting/aborting the transaction, calls
rb->sequence() which just does "meh" and doesn't do anything. Maybe we
could just short-circuit this by asking the output plugin somehow.

In an extreme case the plugin may not even specify the sequence
callbacks, and we're still doing all of this.
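
A minimal sketch of that short-circuit (illustrative Python, not the
actual ReorderBuffer/output-plugin API; the dict-based "plugin" and all
names are assumptions):

```python
# Ask the output plugin whether it cares about sequences before doing
# the expensive start/abort-transaction dance for a non-transactional
# sequence change.
def queue_sequence(rb, change, expensive_work):
    cb = rb.get("sequence_cb")
    if cb is None or not rb.get("publish_sequences", True):
        return "skipped"            # plugin doesn't care: do nothing
    expensive_work()                # start xact, set up snapshot, ...
    cb(change)
    return "sent"

calls = []
rb_no_cb = {"sequence_cb": None}
rb_with_cb = {"sequence_cb": calls.append, "publish_sequences": True}

print(queue_sequence(rb_no_cb, "s1: 42", lambda: calls.append("work")))   # -> skipped
print(queue_sequence(rb_with_cb, "s1: 42", lambda: calls.append("work"))) # -> sent
print(calls)  # -> ['work', 's1: 42']
```

The check itself is one pointer test per change, which is cheap compared
to the transaction dance it avoids — though, as noted in the replies, it
only helps when sequences aren't published at all.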


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: logical decoding and replication of sequences, take 2

2023-12-03 Thread Hayato Kuroda (Fujitsu)
Dear Tomas,

> > I did also performance tests (especially case 3). First of all, there are
> > some variants from yours.
> >
> > 1. patch 0002 was reverted because it has an issue. So this test checks
> >    whether refactoring around ReorderBufferSequenceIsTransactional seems
> >    really needed.
> 
> FWIW I also did the benchmarks without the 0002 patch, for the same
> reason. I forgot to mention that.

Oh, good news. So your benchmark results are quite meaningful.

> 
> Interesting, so what exactly does the transaction do?

It is quite simple - PSA the script file. It was executed with 64 clients in
parallel. The definition of alter_sequence() is the same as you said.
(I used a normal bash script for running them, but your approach may be
smarter.)

> Anyway, I don't
> think this is very surprising - I believe it behaves like this because
> of having to search in many hash tables (one in each toplevel xact). And
> I think the solution I explained before (maintaining a single toplevel
> hash, instead of many per-top-level hashes) would address it.

Agreed. And I can benchmark again with the new patches, once we decide on a
new approach.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



one_client.sh
Description: one_client.sh


Re: logical decoding and replication of sequences, take 2

2023-12-01 Thread Tomas Vondra



On 12/1/23 12:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Tomas,
> 
>> I did some micro-benchmarking today, trying to identify cases where this
>> would cause unexpected problems, either due to having to maintain all
>> the relfilenodes, or due to having to do hash lookups for every sequence
>> change. But I think it's fine, mostly ...
>>
> 
> I did also performance tests (especially case 3). First of all, there are some
> variants from yours.
> 
> 1. patch 0002 was reverted because it has an issue. So this test checks
>    whether refactoring around ReorderBufferSequenceIsTransactional seems
>    really needed.

FWIW I also did the benchmarks without the 0002 patch, for the same
reason. I forgot to mention that.

> 2. per comments from Amit, I also measured the abort case. In this case, the
>alter_sequence() is called but the transaction is aborted.
> 3. I measured while varying the number of clients {8, 16, 32, 64, 128}. In all
>    cases, clients executed 1000 transactions. The performance machine has 128
>    cores, so the result for 128 clients might be saturated.
> 4. A short sleep (0.1s) was added in alter_sequence(), between "alter sequence"
>    and nextval(), because while testing I found that the transaction is too
>    short to execute in parallel. I think this is reasonable because
>    ReorderBufferSequenceIsTransactional() might get worse as parallelism
>    increases.
> 
> I attached one backend process via perf and executed
> pg_logical_slot_get_changes(). The attached txt file shows which functions
> occupied CPU time, especially under pg_logical_slot_get_changes_guts() and
> ReorderBufferSequenceIsTransactional().
> Here are my observations about them.
> 
> * In case of commit, as you said, SnapBuildCommitTxn() seems dominant for the
>   8-64 clients case.
> * For the (commit, 128 clients) case, however, ReorderBufferRestoreChanges()
>   wastes a lot of time. I think this is because the changes exceed
>   logical_decoding_work_mem, so we do not need to analyze this further.
> * In case of abort, CPU time used by ReorderBufferSequenceIsTransactional()
>   grows linearly. This means that we need to think of some solution to avoid
>   the overhead of ReorderBufferSequenceIsTransactional().
> 
> ```
> 8 clients  3.73% occupied time
> 16 7.26%
> 32 15.82%
> 64 29.14%
> 128 46.27%
> ```

Interesting, so what exactly does the transaction do? Anyway, I don't
think this is very surprising - I believe it behaves like this because
of having to search in many hash tables (one in each toplevel xact). And
I think the solution I explained before (maintaining a single toplevel
hash, instead of many per-top-level hashes) would address it.
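
For illustration, the structural difference between the two layouts
looks like this (toy Python, not the patch's code; whether it helps in
practice depends on where the time actually goes, as the benchmarks in
this thread show):

```python
# Per-top-level-transaction hash tables force a scan over all top-level
# xacts for every sequence change; a single global hash keyed by
# relfilenode answers the same question with one lookup.

# per-xact layout: {xid: {relfilenode, ...}}
per_xact = {xid: {xid * 10} for xid in range(1000)}

def is_transactional_per_xact(relfilenode):
    # O(number of top-level transactions) per change
    return any(relfilenode in nodes for nodes in per_xact.values())

# global layout: {relfilenode: creating_xid}
global_hash = {xid * 10: xid for xid in range(1000)}

def is_transactional_global(relfilenode):
    # O(1) per change, regardless of how many xacts are in flight
    return relfilenode in global_hash

assert is_transactional_per_xact(9990) == is_transactional_global(9990) == True
assert is_transactional_per_xact(7) == is_transactional_global(7) == False
print("both layouts agree; the global hash just avoids the per-xact scan")
```

This matches Kuroda-san's abort-case profile, where the per-xact scan
grows with the number of concurrent clients.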

FWIW I find this case interesting, but not very practical, because no
practical workload has that many aborts.

> 
> * In case of abort, I also checked the CPU time used by
>   ReorderBufferAddRelFileLocator(), but it does not seem to depend much on
>   the number of clients.
> 
> ```
> 8 clients 3.66% occupied time
> 16 6.94%
> 32 4.65%
> 64 5.39%
> 128 3.06%
> ```
> 
> As a next step, I plan to run the case which uses the setval() function,
> because it generates more WAL than normal nextval().
> What do you think?
> 

Sure, although I don't think it's much different from the test selecting
40 values from the sequence (in each transaction). That generates about
the same amount of WAL.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-12-01 Thread Tomas Vondra


On 11/30/23 12:56, Amit Kapila wrote:
> On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra
>  wrote:
>>
>> 3) "bad case" - small transactions that generate a lot of relfilenodes
>>
>>   select alter_sequence();
>>
>> where the function is defined like this (I did create 1000 sequences
>> before the test):
>>
>>   CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$
>>   DECLARE
>>   v INT;
>>   BEGIN
>>   v := 1 + (random() * 999)::int;
>>   execute format('alter sequence s%s restart with 1000', v);
>>   perform nextval('s');
>>   END;
>>   $$ LANGUAGE plpgsql;
>>
>> This performs terribly, but it's entirely unrelated to sequences.
>> Current master has exactly the same problem, if transactions do DDL.
>> Like this, for example:
>>
>>   CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$
>>   DECLARE
>>   v INT;
>>   BEGIN
>>   v := 1 + (random() * 999)::int;
>>   execute format('create table t%s (a int)', v);
>>   execute format('drop table t%s', v);
>>   insert into t values (1);
>>   END;
>>   $$ LANGUAGE plpgsql;
>>
>> This has the same impact on master. The perf report shows this:
>>
>>   --98.06%--pg_logical_slot_get_changes_guts
>>|
>> --97.88%--LogicalDecodingProcessRecord
>>  |
>>  --97.56%--xact_decode
>>   |
>>--97.51%--DecodeCommit
>> |
>> |--91.92%--SnapBuildCommitTxn
>> | |
>> |  --91.65%--SnapBuildBuildSnapshot
>> |   |
>> |   --91.14%--pg_qsort
>>
>> The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot
>> takes so long is because:
>>
>> -
>>   Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8)
>>   at snapbuild.c:498
>>   498+ sizeof(TransactionId) *   builder->committed.xcnt
>>   (gdb) p builder->committed.xcnt
>>   $4 = 11532
>> -
>>
>> And with each iteration it grows by 1.
>>
> 
> Can we somehow avoid this either by keeping DDL-related xacts open or
> aborting them?
I'm not sure why the snapshot builder does this, i.e. why we end up
accumulating that many xids, and I didn't have time to look closer. So I
don't know if this would be a solution or not.
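
Back-of-the-envelope arithmetic for the growth seen in the gdb output
(builder->committed.xcnt increasing by 1 per DDL commit), assuming a
snapshot is built at every commit — purely illustrative, not PostgreSQL
code:

```python
# Each DDL transaction adds one xid to builder->committed, and building
# a snapshot at every commit copies (and sorts) the whole array, so
# total work over N DDL commits grows roughly quadratically.
def total_copied_xids(n_ddl_commits):
    # at commit k the committed array holds ~k xids
    return sum(range(1, n_ddl_commits + 1))

print(total_copied_xids(1000))    # ->   500500 xids copied in total
print(total_copied_xids(10000))   # -> 50005000 (10x commits, ~100x work)
```

The per-commit cost grows linearly and the cumulative cost
quadratically (worse with the qsort factored in), which matches the
pg_qsort-dominated profile above.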

> Also, will it make any difference to use setval as
> do_setval() seems to be logging each time?
> 

I think that's pretty much what case (2) does, as it calls nextval()
enough times for each transaction to generate WAL. But I don't think this
is a very sensible benchmark - it's an extreme case, but practical cases
are far closer to case (1) because sequences are intermixed with other
activity. No one really does just nextval() calls.

> If possible, can you share the scripts? Kuroda-San has access to the
> performance machine, he may be able to try it as well.
> 

Sure, attached. But it's a very primitive script, nothing fancy.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

scripts.tgz
Description: application/compressed-tar


RE: logical decoding and replication of sequences, take 2

2023-12-01 Thread Hayato Kuroda (Fujitsu)
Dear Tomas,

> I did some micro-benchmarking today, trying to identify cases where this
> would cause unexpected problems, either due to having to maintain all
> the relfilenodes, or due to having to do hash lookups for every sequence
> change. But I think it's fine, mostly ...
>

I did also performance tests (especially case 3). First of all, there are some
variants from yours.

1. patch 0002 was reverted because it has an issue. So this test checks whether
   refactoring around ReorderBufferSequenceIsTransactional seems really needed.
2. per comments from Amit, I also measured the abort case. In this case, the
   alter_sequence() is called but the transaction is aborted.
3. I measured while varying the number of clients {8, 16, 32, 64, 128}. In all
   cases, clients executed 1000 transactions. The performance machine has 128
   cores, so the result for 128 clients might be saturated.
4. A short sleep (0.1s) was added in alter_sequence(), between "alter sequence"
   and nextval(), because while testing I found that the transaction is too
   short to execute in parallel. I think this is reasonable because
   ReorderBufferSequenceIsTransactional() might get worse as parallelism
   increases.

I attached one backend process via perf and executed
pg_logical_slot_get_changes(). The attached txt file shows which functions
occupied CPU time, especially under pg_logical_slot_get_changes_guts() and
ReorderBufferSequenceIsTransactional().
Here are my observations about them.

* In case of commit, as you said, SnapBuildCommitTxn() seems dominant for the
  8-64 clients case.
* For the (commit, 128 clients) case, however, ReorderBufferRestoreChanges()
  wastes a lot of time. I think this is because the changes exceed
  logical_decoding_work_mem, so we do not need to analyze this further.
* In case of abort, CPU time used by ReorderBufferSequenceIsTransactional()
  grows linearly. This means that we need to think of some solution to avoid
  the overhead of ReorderBufferSequenceIsTransactional().

```
8 clients  3.73% occupied time
16 7.26%
32 15.82%
64 29.14%
128 46.27%
```

* In case of abort, I also checked the CPU time used by
  ReorderBufferAddRelFileLocator(), but it does not seem to depend much on
  the number of clients.

```
8 clients 3.66% occupied time
16 6.94%
32 4.65%
64 5.39%
128 3.06%
```

As a next step, I plan to run the case which uses the setval() function,
because it generates more WAL than normal nextval().
What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


# commit case

## 128 clients

-   97.77% 0.00%  postgres  postgres  [.] pg_logical_slot_get_changes_guts
   - 97.77% pg_logical_slot_get_changes_guts
  - 97.77% LogicalDecodingProcessRecord
 - 97.71% xact_decode
- 54.95% ReorderBufferProcessTXN
   - 32.64% ReorderBufferRestoreChanges
  + 23.37% FileRead
4.58% __memcpy_ssse3_back
  + 4.14% MemoryContextAllocZero
   + 7.73% ReorderBufferCopySnap.isra.21
   + 5.94% ReorderBufferSerializeTXN
   + 5.78% ReorderBufferCleanupTXN
   + 2.04% RelationIdGetRelation
- 42.75% SnapBuildCommitTxn
   - 34.56% ReorderBufferQueueChange
  - 34.36% ReorderBufferSerializeTXN
   27.64% __write_nocancel
   4.42% __memcpy_ssse3_back
 + 1.52% OpenTransientFilePerm
   + 7.97% SnapBuildBuildSnapshot

 0.04% 0.00%  postgres  postgres  [.] ReorderBufferSequenceIsTransactional

## 64 clients

-   86.49% 0.04%  postgres  postgres  [.] pg_logical_slot_get_changes_guts
   - 86.45% pg_logical_slot_get_changes_guts
  - 86.37% LogicalDecodingProcessRecord
 - 84.79% xact_decode
- 51.05% SnapBuildCommitTxn
   + 49.77% SnapBuildBuildSnapshot
 0.53% ReorderBufferXidHasBaseSnapshot
- 33.45% ReorderBufferProcessTXN
   + 21.37% ReorderBufferCopySnap.isra.21
   + 4.86% RelationIdGetRelation
   + 2.31% RelidByRelfilenumber
   + 0.84% AbortCurrentTransaction
 0.81% ReorderBufferCleanupTXN
 - 1.10% seq_decode
  0.65% ReorderBufferSequenceIsTransactional

-    1.04% 0.09%  postgres  postgres  [.] ReorderBufferSequenceIsTransactional

## 32 clients

-   82.20% 0.13%  postgres  postgres  [.] pg_logical_slot_get_changes_guts
   - 82.09% pg_logical_slot_get_changes_guts
  - 81.86% LogicalDecodingProcessRecord
 - 80.01% xact_decode
- 49.38% SnapBuildCommitTxn
   + 48.10% SnapBuildBuildSnapshot
 0.50% ReorderBufferXidHasBaseSnapshot
- 29.98% ReorderBufferProcessTXN
   + 10.52% ReorderBufferCopySnap.isra.21
   + 7.64% RelationIdGetRelation
   + 4.11% RelidByRelfilenumber
   + 1.70% 

Re: logical decoding and replication of sequences, take 2

2023-11-30 Thread Amit Kapila
On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra
 wrote:
>
> 3) "bad case" - small transactions that generate a lot of relfilenodes
>
>   select alter_sequence();
>
> where the function is defined like this (I did create 1000 sequences
> before the test):
>
>   CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$
>   DECLARE
>   v INT;
>   BEGIN
>   v := 1 + (random() * 999)::int;
>   execute format('alter sequence s%s restart with 1000', v);
>   perform nextval('s');
>   END;
>   $$ LANGUAGE plpgsql;
>
> This performs terribly, but it's entirely unrelated to sequences.
> Current master has exactly the same problem, if transactions do DDL.
> Like this, for example:
>
>   CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$
>   DECLARE
>   v INT;
>   BEGIN
>   v := 1 + (random() * 999)::int;
>   execute format('create table t%s (a int)', v);
>   execute format('drop table t%s', v);
>   insert into t values (1);
>   END;
>   $$ LANGUAGE plpgsql;
>
> This has the same impact on master. The perf report shows this:
>
>   --98.06%--pg_logical_slot_get_changes_guts
>|
> --97.88%--LogicalDecodingProcessRecord
>  |
>  --97.56%--xact_decode
>   |
>--97.51%--DecodeCommit
> |
> |--91.92%--SnapBuildCommitTxn
> | |
> |  --91.65%--SnapBuildBuildSnapshot
> |   |
> |   --91.14%--pg_qsort
>
> The sequence decoding is maybe ~1%. The reason why SnapBuildBuildSnapshot
> takes so long is because:
>
> -
>   Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8)
>   at snapbuild.c:498
>   498+ sizeof(TransactionId) *   builder->committed.xcnt
>   (gdb) p builder->committed.xcnt
>   $4 = 11532
> -
>
> And with each iteration it grows by 1.
>

Can we somehow avoid this, either by keeping DDL-related xacts open or
by aborting them? Also, would it make any difference to use setval(), as
do_setval() seems to WAL-log every time?

If possible, can you share the scripts? Kuroda-San has access to the
performance machine, he may be able to try it as well.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Peter Smith
On Wed, Nov 29, 2023 at 11:45 PM Tomas Vondra
 wrote:
>
>
>
> On 11/27/23 23:06, Peter Smith wrote:
> > FWIW, here are some more minor review comments for v20231127-3-0001
> >
> > ==
> > .../replication/logical/reorderbuffer.c
> >
> > 3.
> > + *   To decide if a sequence change is transactional, we maintain a hash
> > + *   table of relfilenodes created in each (sub)transactions, along with
> > + *   the XID of the (sub)transaction that created the relfilenode. The
> > + *   entries from substransactions are copied to the top-level transaction
> > + *   to make checks cheaper. The hash table gets cleaned up when the
> > + *   transaction completes (commit/abort).
> >
> > /substransactions/subtransactions/
> >
>
> Will fix.

FYI - I think this typo still exists in the patch v20231128-0001.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Tomas Vondra
On 11/29/23 15:41, Tomas Vondra wrote:
> ...
>>
>> One thing that worries me about that approach is that it can suck with
>> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
>> records. We have previously fixed some such workloads in logical
>> decoding where decoding a transaction containing truncation of a table
>> with a lot of partitions (1000 or more) used to take a very long time.
>> Don't we face performance issues in such scenarios?
>>
> 
> I don't think we do, really. We will have to decode the SMGR records and
> add the relfilenodes to the hash table(s), but I don't think that affects
> the lookup performance much. What I think might be a problem is if we
> have many top-level transactions, especially if those transactions do
> something that creates a relfilenode. Because then we'll have to do a
> hash_search for each of them, and that might be measurable even if each
> lookup is O(1). And we do the lookup for every sequence change ...
> 

I did some micro-benchmarking today, trying to identify cases where this
would cause unexpected problems, either due to having to maintain all
the relfilenodes, or due to having to do hash lookups for every sequence
change. But I think it's fine, mostly ...

I did all the following tests with 64 clients. I may try more, but even
with this there should be a fair number of concurrent transactions, which
determines the number of top-level transactions in reorderbuffer. I'll
try with more clients tomorrow, but I don't think it'll change stuff.

The test is fairly simple - run a particular number of transactions
(might be 1000 * 64, or more). And then measure how long it takes to
decode the changes using test_decoding.

Now, the various workloads I tried:

1) "good case" - small OLTP transactions, a couple nextval('s') calls

  begin;
  insert into t values (1);
  select nextval('s');
  insert into t values (1);
  commit;

This is pretty fine: the sequence part of reorderbuffer is barely
measurable, around 1% of the total CPU time. That is expected, because we
only WAL-log every 32nd increment or so.
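To make the batching concrete, here is a toy Python model of how nextval() amortizes WAL records: a record is written only once every SEQ_LOG_VALS increments, and it logs a value that far ahead of the current one so recovery can never hand out duplicates. This is a simplified sketch, not PostgreSQL's actual sequence code; the names and bookkeeping are illustrative only.

```python
# Toy model of nextval() WAL batching: one WAL record per SEQ_LOG_VALS
# increments, each recording a "future" value SEQ_LOG_VALS ahead.
SEQ_LOG_VALS = 32

class ToySequence:
    def __init__(self):
        self.value = 0      # last value handed out
        self.log_cnt = 0    # increments remaining before the next WAL record
        self.wal = []       # pretend WAL: logged "future" values

    def nextval(self):
        if self.log_cnt == 0:
            # batch exhausted: WAL-log a value SEQ_LOG_VALS ahead
            self.wal.append(self.value + SEQ_LOG_VALS + 1)
            self.log_cnt = SEQ_LOG_VALS
        self.log_cnt -= 1
        self.value += 1
        return self.value

seq = ToySequence()
for _ in range(1000):
    seq.nextval()

# 1000 increments produce only ceil(1000/32) = 32 WAL records, which is
# why workload (1) barely registers in the decoding profile.
print(len(seq.wal))
```

With this amortization, the decoder sees roughly one sequence change per 32 nextval() calls, matching the ~1% share observed above.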

2) "good case" - same as (1), but with enough nextval calls to always WAL-log


  begin;
  insert into t values (1);
  select nextval('s') from generate_series(1,40);
  insert into t values (1);
  commit;

Here sequences are more measurable, around 15% of CPU time, but most of
that comes from AbortCurrentTransaction() in the non-transactional
branch of ReorderBufferQueueSequence. I don't think there's a way around
that, and it's entirely unrelated to relfilenodes. The function checking
if the change is transactional (ReorderBufferSequenceIsTransactional) is
less than 1% of the profile - and this is the version that always walks
all top-level transactions.

3) "bad case" - small transactions that generate a lot of relfilenodes

  select alter_sequence();

where the function is defined like this (I did create 1000 sequences
before the test):

  CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$
  DECLARE
  v INT;
  BEGIN
  v := 1 + (random() * 999)::int;
  execute format('alter sequence s%s restart with 1000', v);
  perform nextval('s');
  END;
  $$ LANGUAGE plpgsql;

This performs terribly, but it's entirely unrelated to sequences.
Current master has exactly the same problem, if transactions do DDL.
Like this, for example:

  CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$
  DECLARE
  v INT;
  BEGIN
  v := 1 + (random() * 999)::int;
  execute format('create table t%s (a int)', v);
  execute format('drop table t%s', v);
  insert into t values (1);
  END;
  $$ LANGUAGE plpgsql;

This has the same impact on master. The perf report shows this:

  --98.06%--pg_logical_slot_get_changes_guts
   |
--97.88%--LogicalDecodingProcessRecord
 |
 --97.56%--xact_decode
  |
   --97.51%--DecodeCommit
|
|--91.92%--SnapBuildCommitTxn
| |
|  --91.65%--SnapBuildBuildSnapshot
|   |
|   --91.14%--pg_qsort

The sequence decoding is maybe ~1%. The reason why SnapBuildBuildSnapshot
takes so long is because:

-
  Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8)
  at snapbuild.c:498
  498+ sizeof(TransactionId) *   builder->committed.xcnt
  (gdb) p builder->committed.xcnt
  $4 = 11532
-

And with each iteration it grows by 1. That looks quite weird, possibly
a bug worth fixing, but unrelated to this patch. I can't investigate
this more at the moment, not sure when/if I'll get to that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Tomas Vondra
On 11/29/23 14:42, Amit Kapila wrote:
> On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra
>  wrote:
>>
>> I have been working on the improvements outlined in my
>> preceding e-mail, but I have some bad news - I ran into an issue that I
>> don't know how to solve :-(
>>
>> Consider this transaction:
>>
>>   BEGIN;
>>   ALTER SEQUENCE s RESTART 1000;
>>
>>   SAVEPOINT s1;
>>   ALTER SEQUENCE s RESTART 2000;
>>   ROLLBACK TO s1;
>>
>>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40);
>>   COMMIT;
>>
>> If you try this with the approach relying on rd_newRelfilelocatorSubid
>> and rd_createSubid, it fails like this on the subscriber:
>>
>>   ERROR:  could not map filenode "base/5/16394" to relation OID
>>
>> This happens because ReorderBufferQueueSequence tries to do this in the
>> non-transactional branch:
>>
>>   reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber);
>>
>> and the relfilenode is the one created by the first ALTER. But this is
>> obviously wrong - the changes should have been treated as transactional,
>> because they are tied to the first ALTER. So how did we get there?
>>
>> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup
>> resets the two fields to InvalidSubTransactionId. Which means the
>> rollback in the above transaction also forgets about the first ALTER.
>> Now that I look at the RelationData comments, it actually describes
>> exactly this situation:
>>
>>   *
>>   * rd_newRelfilelocatorSubid is the ID of the highest subtransaction
>>   * the most-recent relfilenumber change has survived into or zero if
>>   * not changed in the current transaction (or we have forgotten
>>   * changing it).  This field is accurate when non-zero, but it can be
>>   * zero when a relation has multiple new relfilenumbers within a
>>   * single transaction, with one of them occurring in a subsequently
>>   * aborted subtransaction, e.g.
>>   *BEGIN;
>>   *TRUNCATE t;
>>   *SAVEPOINT save;
>>   *TRUNCATE t;
>>   *ROLLBACK TO save;
>>   *-- rd_newRelfilelocatorSubid is now forgotten
>>   *
>>
>> The root of this problem is that we'd need some sort of "history" for
>> the field, so that when a subxact aborts, we can restore the previous
>> value. But we obviously don't have that, and I doubt we want to add that
>> to relcache - for example, it'd either need to impose some limit on the
>> history (and thus a failure when we reach the limit), or it'd need to
>> handle histories of arbitrary length.
>>
> 
> Yeah, I think that would be really tricky and we may not want to go there.
> 
>> At this point I don't see a solution for this, which means the best way
>> forward with the sequence decoding patch seems to be the original
>> approach, on the decoding side.
>>
> 
> One thing that worries me about that approach is that it can suck with
> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
> records. We have previously fixed some such workloads in logical
> decoding where decoding a transaction containing truncation of a table
> with a lot of partitions (1000 or more) used to take a very long time.
> Don't we face performance issues in such scenarios?
> 

I don't think we do, really. We will have to decode the SMGR records and
add the relfilenodes to the hash table(s), but I don't think that affects
the lookup performance much. What I think might be a problem is if we
have many top-level transactions, especially if those transactions do
something that creates a relfilenode. Because then we'll have to do a
hash_search for each of them, and that might be measurable even if each
lookup is O(1). And we do the lookup for every sequence change ...
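The decision rule being discussed can be sketched in a few lines: a sequence change is transactional iff its relfilenode was created by the same top-level transaction (or one of its subxacts, whose entries get copied up), so the check is a single lookup in *our* top-level transaction rather than a scan over all of them. A toy Python model, with names that are illustrative only, not PostgreSQL's:

```python
# Rough model of the decoding-side transactional check.
class ToyTopXact:
    def __init__(self, xid):
        self.xid = xid
        self.relfilenodes = set()   # relfilenodes created in this xact tree

def record_smgr_create(txn, relfilenode):
    # decoded XLOG_SMGR_CREATE: remember which xact created this relfilenode
    txn.relfilenodes.add(relfilenode)

def sequence_change_is_transactional(txn, relfilenode):
    # one O(1) lookup in our own top-level xact; no other transaction
    # could have created a relfilenode that is visible to us
    return relfilenode in txn.relfilenodes

t1, t2 = ToyTopXact(1000), ToyTopXact(2000)
record_smgr_create(t1, ("tbsp", 16394))

# ALTER SEQUENCE in t1 created relfilenode 16394 -> transactional for t1
assert sequence_change_is_transactional(t1, ("tbsp", 16394))
# nextval() in t2 on a pre-existing relfilenode -> non-transactional
assert not sequence_change_is_transactional(t2, ("tbsp", 16394))
print("ok")
```

The per-change cost here is one hash lookup per sequence change, which is the cost being weighed in the paragraph above.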

> How do we see this work w.r.t to some sort of global sequences? There
> is some recent discussion where I have raised a similar point [1].
> 
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com
> 

I think those are very different things, even though called "sequences".
AFAIK solutions like snowflakeID or UUIDs don't require replication of
any shared state (that's kinda the whole point), so I don't see why
would it need some special support in logical decoding.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Amit Kapila
On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra
 wrote:
>
> I have been working on the improvements outlined in my
> preceding e-mail, but I have some bad news - I ran into an issue that I
> don't know how to solve :-(
>
> Consider this transaction:
>
>   BEGIN;
>   ALTER SEQUENCE s RESTART 1000;
>
>   SAVEPOINT s1;
>   ALTER SEQUENCE s RESTART 2000;
>   ROLLBACK TO s1;
>
>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40);
>   COMMIT;
>
> If you try this with the approach relying on rd_newRelfilelocatorSubid
> and rd_createSubid, it fails like this on the subscriber:
>
>   ERROR:  could not map filenode "base/5/16394" to relation OID
>
> This happens because ReorderBufferQueueSequence tries to do this in the
> non-transactional branch:
>
>   reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber);
>
> and the relfilenode is the one created by the first ALTER. But this is
> obviously wrong - the changes should have been treated as transactional,
> because they are tied to the first ALTER. So how did we get there?
>
> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup
> resets the two fields to InvalidSubTransactionId. Which means the
> rollback in the above transaction also forgets about the first ALTER.
> Now that I look at the RelationData comments, it actually describes
> exactly this situation:
>
>   *
>   * rd_newRelfilelocatorSubid is the ID of the highest subtransaction
>   * the most-recent relfilenumber change has survived into or zero if
>   * not changed in the current transaction (or we have forgotten
>   * changing it).  This field is accurate when non-zero, but it can be
>   * zero when a relation has multiple new relfilenumbers within a
>   * single transaction, with one of them occurring in a subsequently
>   * aborted subtransaction, e.g.
>   *BEGIN;
>   *TRUNCATE t;
>   *SAVEPOINT save;
>   *TRUNCATE t;
>   *ROLLBACK TO save;
>   *-- rd_newRelfilelocatorSubid is now forgotten
>   *
>
> The root of this problem is that we'd need some sort of "history" for
> the field, so that when a subxact aborts, we can restore the previous
> value. But we obviously don't have that, and I doubt we want to add that
> to relcache - for example, it'd either need to impose some limit on the
> history (and thus a failure when we reach the limit), or it'd need to
> handle histories of arbitrary length.
>

Yeah, I think that would be really tricky and we may not want to go there.

> At this point I don't see a solution for this, which means the best way
> forward with the sequence decoding patch seems to be the original
> approach, on the decoding side.
>

One thing that worries me about that approach is that it can suck with
the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
records. We have previously fixed some such workloads in logical
decoding where decoding a transaction containing truncation of a table
with a lot of partitions (1000 or more) used to take a very long time.
Don't we face performance issues in such scenarios?

How do we see this work w.r.t to some sort of global sequences? There
is some recent discussion where I have raised a similar point [1].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-11-29 Thread Tomas Vondra



On 11/27/23 23:06, Peter Smith wrote:
> FWIW, here are some more minor review comments for v20231127-3-0001
> 
> ==
> doc/src/sgml/logicaldecoding.sgml
> 
> 1.
> > +  The txn parameter contains meta information about
> +  the transaction the sequence change is part of. Note however that for
> +  non-transactional updates, the transaction may be NULL, depending on
> +  if the transaction already has an XID assigned.
> +  The sequence_lsn has the WAL location of the
> +  sequence update. transactional says if the
> +  sequence has to be replayed as part of the transaction or directly.
> 
> /says if/specifies whether/
> 

Will fix.

> ==
> src/backend/commands/sequence.c
> 
> 2. DecodeSeqTuple
> 
> + memcpy(((char *) tuple->tuple.t_data),
> +data + sizeof(xl_seq_rec),
> +SizeofHeapTupleHeader);
> +
> + memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
> +data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
> +datalen);
> 
> Maybe I am misreading but isn't this just copying 2 contiguous pieces
> of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen)
> achieve the same?
> 

You're right, will fix. I think the code looked different before, got
simplified, and I hadn't noticed it can be a single memcpy().

> ==
> .../replication/logical/reorderbuffer.c
> 
> 3.
> + *   To decide if a sequence change is transactional, we maintain a hash
> + *   table of relfilenodes created in each (sub)transactions, along with
> + *   the XID of the (sub)transaction that created the relfilenode. The
> + *   entries from substransactions are copied to the top-level transaction
> + *   to make checks cheaper. The hash table gets cleaned up when the
> + *   transaction completes (commit/abort).
> 
> /substransactions/subtransactions/
> 

Will fix.

> ~~~
> 
> 4.
> + * A naive approach would be to just loop through all transactions and check
> + * each of them, but there may be (easily thousands) of subtransactions, and
> + * the check happens for each sequence change. So this could be very costly.
> 
> /may be (easily thousands) of/may be (easily thousands of)/
> 
> ~~~

Thanks. I've reworded this to

  ... may be many (easily thousands of) subtransactions ...

> 
> 5. ReorderBufferSequenceCleanup
> 
> + while ((ent = (ReorderBufferSequenceEnt *)
> hash_seq_search(_status)) != NULL)
> + {
> + (void) hash_search(txn->toptxn->sequences_hash,
+(void *) &ent->rlocator,
> +HASH_REMOVE, NULL);
> + }
> 
> Typically, other HASH_REMOVE code I saw would check result for NULL to
> give elog(ERROR, "hash table corrupted");
> 

Good point, I'll add the error check

> ~~~
> 
> 6. ReorderBufferQueueSequence
> 
> + if (xid != InvalidTransactionId)
> + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> 
> How about using the macro: TransactionIdIsValid
> 

Actually, as I wrote in another message, I think the check is not
necessary. Or rather, it should be an assert that the XID is valid. And
yeah, the macro is a good idea.

> ~~~
> 
> 7. ReorderBufferQueueSequence
> 
> + if (reloid == InvalidOid)
> + elog(ERROR, "could not map filenode \"%s\" to relation OID",
> + relpathperm(rlocator,
> + MAIN_FORKNUM));
> 
> How about using the macro: OidIsValid
> 

I chose to keep this consistent with other places in reorderbuffer, and
all of them use the equality check.

> ~~~
> 
> 8.
> + /*
> + * Calculate the first value of the next batch (at which point we
> + * generate and decode another WAL record.
> + */
> 
> Missing ')'
> 

Will fix.

> ~~~
> 
> 9. ReorderBufferAddRelFileLocator
> 
> + /*
> + * We only care about sequence relfilenodes for now, and those always have
> + * a XID. So if there's no XID, don't bother adding them to the hash.
> + */
> + if (xid == InvalidTransactionId)
> + return;
> 
> How about using the macro: TransactionIdIsValid
> 

Will change.

> ~~~
> 
> 10. ReorderBufferProcessTXN
> 
> + if (reloid == InvalidOid)
> + elog(ERROR, "could not map filenode \"%s\" to relation OID",
> + relpathperm(change->data.sequence.locator,
> + MAIN_FORKNUM));
> 
> How about using the macro: OidIsValid
> 

Same as the other Oid check - consistency.

> ~~~
> 
> 11. ReorderBufferChangeSize
> 
> + if (tup)
> + {
> + sz += sizeof(HeapTupleData);
> + len = tup->tuple.t_len;
> + sz += len;
> + }
> 
> Why is the 'sz' increment split into 2 parts?
> 

Because the other branches in ReorderBufferChangeSize do it that way.
You're right, it could be written on a single line.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-11-28 Thread Tomas Vondra
On 11/28/23 12:32, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
>  wrote:
>>
>> I spent a bit of time looking at the proposed change, and unfortunately
>> logging just the boolean flag does not work. A good example is this bit
>> from a TAP test added by the patch for built-in replication (which was
>> not included with the WIP patch):
>>
>>   BEGIN;
>>   ALTER SEQUENCE s RESTART WITH 1000;
>>   SAVEPOINT sp1;
>>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
>>   ROLLBACK TO sp1;
>>   COMMIT;
>>
>> This is expected to produce:
>>
>>   1131|0|t
>>
>> but produces
>>
>>   1000|0|f
>>
>> instead. The reason is very simple - as implemented, the patch simply
>> checks if the relfilenode is from the same top-level transaction, which
>> it is, and sets the flag to "true". So we know the sequence changes need
>> to be queued and replayed as part of this transaction.
>>
>> But then during decoding, we still queue the changes into the subxact,
>> which then aborts, and the changes are discarded. That is not how it's
>> supposed to work, because the new relfilenode is still valid, someone
>> might do nextval() and commit. And the nextval() may not get WAL-logged,
>> so we'd lose this.
>>
>> What I guess we might do is log not just a boolean flag, but the XID of
>> the subtransaction that created the relfilenode. And then during
>> decoding we'd queue the changes into this subtransaction ...
>>
>> 0006 in the attached patch series does this, and it seems to fix the TAP
>> test failure. I left it at the end, to make it easier to run tests
>> without the patch applied.
>>
> 
> Offhand, I don't have any better idea than what you have suggested for
> the problem, but this needs some thought, including the questions asked
> by you. I'll spend some time on it and respond back.
> 

I've been experimenting with the idea to log the XID, and for a moment I
was worried it actually can't work, because subtransactions may not
simply be nested in a linear chain, but may form a tree. And what if the
sequence was altered in a different branch (sibling subxact), not in the
immediate parent. In which case the new SubTransactionGetXid() would
fail, because it just walks the current chain of subtransactions.

I've been thinking about cases like this:

   BEGIN;
   CREATE SEQUENCE s;# XID 1000
   SELECT alter_sequence();  # XID 1001
   SAVEPOINT s1;
   SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000
   ROLLBACK TO s1;
   SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000
   COMMIT;

The XID values are what the sequence wal record will reference, assuming
that the main transaction XID is 1000.

Initially, I thought it's wrong that the nextval() calls reference XID
of the main transaction, because the last relfilenode comes from 1001,
which is the subxact created by alter_sequence() thanks to the exception
handling block. And that's where the approach in reorderbuffer would
queue the changes.

But I think this is actually correct too. When a subtransaction commits
(e.g. when alter_sequence() completes), it essentially becomes part of
the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid
accordingly, setting it to parentSubid.

This also means that SubTransactionGetXid() can't actually fail, because
the ID has to reference an active subtransaction in the current stack.
I'm still concerned about the cost of the lookup, because the list may
be long and the subxact we're looking for may be quite high up the stack,
but I guess we might have another field, caching the XID. It'd need to be
updated
only in AtEOSubXact_cleanup, and at that point we know it's the
immediate parent, so it'd be pretty cheap I think.
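The mechanism described above can be modeled in a few lines: the relation remembers which subtransaction last assigned it a new relfilenode, subxact commit merges that ownership into the parent (as AtEOSubXact_cleanup does with rd_newRelfilelocatorSubid), and a cached XID refreshed only at subxact end keeps the lookup cheap. This is a toy sketch; the field and function names are only loosely based on PostgreSQL's, and the cached-XID field is the hypothetical addition discussed here.

```python
# Toy model: subxact commit reassigns relfilenode ownership to the parent.
class ToyRelation:
    def __init__(self):
        self.new_relfilenode_subid = None  # like rd_newRelfilelocatorSubid
        self.new_relfilenode_xid = None    # hypothetical cached XID

stack = [(1, 1000)]   # (subid, xid) pairs, innermost last; top level = subid 1

def subxact_start(subid, xid):
    stack.append((subid, xid))

def set_new_relfilenode(rel):
    subid, xid = stack[-1]
    rel.new_relfilenode_subid = subid
    rel.new_relfilenode_xid = xid

def subxact_commit(rel):
    subid, _ = stack.pop()
    parent_subid, parent_xid = stack[-1]
    if rel.new_relfilenode_subid == subid:
        # merge into the immediate parent, as at end-of-subxact cleanup
        rel.new_relfilenode_subid = parent_subid
        rel.new_relfilenode_xid = parent_xid

s = ToyRelation()
subxact_start(2, 1001)     # e.g. the subxact created by alter_sequence()
set_new_relfilenode(s)     # ALTER SEQUENCE assigns a new relfilenode
subxact_commit(s)          # subxact commits; ownership moves to the parent

# later nextval() calls now attribute the relfilenode to the top-level xact,
# matching the "XID 1000" annotations in the example transaction above
print(s.new_relfilenode_subid, s.new_relfilenode_xid)
```

Because the field always ends up referencing a subxact still on the active stack, the lookup cannot fail, and the cached XID avoids walking a long stack on every sequence change.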


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-11-28 Thread Amit Kapila
On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
 wrote:
>
> I spent a bit of time looking at the proposed change, and unfortunately
> logging just the boolean flag does not work. A good example is this bit
> from a TAP test added by the patch for built-in replication (which was
> not included with the WIP patch):
>
>   BEGIN;
>   ALTER SEQUENCE s RESTART WITH 1000;
>   SAVEPOINT sp1;
>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
>   ROLLBACK TO sp1;
>   COMMIT;
>
> This is expected to produce:
>
>   1131|0|t
>
> but produces
>
>   1000|0|f
>
> instead. The reason is very simple - as implemented, the patch simply
> checks if the relfilenode is from the same top-level transaction, which
> it is, and sets the flag to "true". So we know the sequence changes need
> to be queued and replayed as part of this transaction.
>
> But then during decoding, we still queue the changes into the subxact,
> which then aborts, and the changes are discarded. That is not how it's
> supposed to work, because the new relfilenode is still valid, someone
> might do nextval() and commit. And the nextval() may not get WAL-logged,
> so we'd lose this.
>
> What I guess we might do is log not just a boolean flag, but the XID of
> the subtransaction that created the relfilenode. And then during
> decoding we'd queue the changes into this subtransaction ...
>
> 0006 in the attached patch series does this, and it seems to fix the TAP
> test failure. I left it at the end, to make it easier to run tests
> without the patch applied.
>

Offhand, I don't have any better idea than what you have suggested for
the problem, but this needs some thought, including the questions asked
by you. I'll spend some time on it and respond back.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-11-27 Thread Peter Smith
FWIW, here are some more minor review comments for v20231127-3-0001

==
doc/src/sgml/logicaldecoding.sgml

1.
+  The txn parameter contains meta information about
+  the transaction the sequence change is part of. Note however that for
+  non-transactional updates, the transaction may be NULL, depending on
+  if the transaction already has an XID assigned.
+  The sequence_lsn has the WAL location of the
+  sequence update. transactional says if the
+  sequence has to be replayed as part of the transaction or directly.

/says if/specifies whether/

==
src/backend/commands/sequence.c

2. DecodeSeqTuple

+ memcpy(((char *) tuple->tuple.t_data),
+data + sizeof(xl_seq_rec),
+SizeofHeapTupleHeader);
+
+ memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
+data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
+datalen);

Maybe I am misreading but isn't this just copying 2 contiguous pieces
of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen)
achieve the same?
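A quick sanity check of this point: the two copies cover contiguous destination and source ranges, so one copy of (SizeofHeapTupleHeader + datalen) bytes is equivalent. Byte strings stand in for the buffers below; the sizes are made up for illustration and are not the real struct sizes.

```python
# Stand-in sizes; only the contiguity of the ranges matters here.
SizeofHeapTupleHeader = 24
xl_seq_rec_size = 16   # stand-in for sizeof(xl_seq_rec)
datalen = 40

data = bytes(range(256))[: xl_seq_rec_size + SizeofHeapTupleHeader + datalen]

# two-step copy, as in the patch: header bytes, then payload bytes
two_step = (data[xl_seq_rec_size : xl_seq_rec_size + SizeofHeapTupleHeader]
            + data[xl_seq_rec_size + SizeofHeapTupleHeader :
                   xl_seq_rec_size + SizeofHeapTupleHeader + datalen])

# single copy of the whole contiguous range
one_step = data[xl_seq_rec_size : xl_seq_rec_size + SizeofHeapTupleHeader + datalen]

assert two_step == one_step
print("ok")
```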

==
.../replication/logical/reorderbuffer.c

3.
+ *   To decide if a sequence change is transactional, we maintain a hash
+ *   table of relfilenodes created in each (sub)transactions, along with
+ *   the XID of the (sub)transaction that created the relfilenode. The
+ *   entries from substransactions are copied to the top-level transaction
+ *   to make checks cheaper. The hash table gets cleaned up when the
+ *   transaction completes (commit/abort).

/substransactions/subtransactions/

~~~

4.
+ * A naive approach would be to just loop through all transactions and check
+ * each of them, but there may be (easily thousands) of subtransactions, and
+ * the check happens for each sequence change. So this could be very costly.

/may be (easily thousands) of/may be (easily thousands of)/

~~~

5. ReorderBufferSequenceCleanup

+ while ((ent = (ReorderBufferSequenceEnt *)
hash_seq_search(_status)) != NULL)
+ {
+ (void) hash_search(txn->toptxn->sequences_hash,
+(void *) &ent->rlocator,
+HASH_REMOVE, NULL);
+ }

Typically, other HASH_REMOVE code I saw would check result for NULL to
give elog(ERROR, "hash table corrupted");

~~~

6. ReorderBufferQueueSequence

+ if (xid != InvalidTransactionId)
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

How about using the macro: TransactionIdIsValid

~~~

7. ReorderBufferQueueSequence

+ if (reloid == InvalidOid)
+ elog(ERROR, "could not map filenode \"%s\" to relation OID",
+ relpathperm(rlocator,
+ MAIN_FORKNUM));

How about using the macro: OidIsValid

~~~

8.
+ /*
+ * Calculate the first value of the next batch (at which point we
+ * generate and decode another WAL record.
+ */

Missing ')'

~~~

9. ReorderBufferAddRelFileLocator

+ /*
+ * We only care about sequence relfilenodes for now, and those always have
+ * a XID. So if there's no XID, don't bother adding them to the hash.
+ */
+ if (xid == InvalidTransactionId)
+ return;

How about using the macro: TransactionIdIsValid

~~~

10. ReorderBufferProcessTXN

+ if (reloid == InvalidOid)
+ elog(ERROR, "could not map filenode \"%s\" to relation OID",
+ relpathperm(change->data.sequence.locator,
+ MAIN_FORKNUM));

How about using the macro: OidIsValid

~~~

11. ReorderBufferChangeSize

+ if (tup)
+ {
+ sz += sizeof(HeapTupleData);
+ len = tup->tuple.t_len;
+ sz += len;
+ }

Why is the 'sz' increment split into 2 parts?

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: logical decoding and replication of sequences, take 2

2023-11-27 Thread Tomas Vondra
On 11/27/23 13:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Amit, Tomas,
> 

 I am wondering that instead of building the infrastructure to know
 whether a particular change is transactional on the decoding side,
 can't we have some flag in the WAL record to note whether the change
 is transactional or not? I have discussed this point with my colleague
 Kuroda-San and we thought that it may be worth exploring whether we
 can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
 determine if the sequence is created/changed in the current
 subtransaction and then record that in WAL record. By this, we need to
 have additional information in the WAL record like XLOG_SEQ_LOG but we
 can probably do it only with wal_level as logical.

>>>
>>> I may not understand the proposal exactly, but it's not enough to know
>>> if it was created in the same subxact. It might have been created in
>>> some earlier subxact in the same top-level xact.
>>>
>>
>> We should be able to detect even some earlier subxact or top-level
>> xact based on rd_createSubid/rd_newRelfilelocatorSubid.
> 
> Here is a small PoC patchset to help your understanding. Please see attached
> files.
> 
> 0001, 0002 were not changed, and 0004 was reassigned to 0003.
> (For now, I focused only on test_decoding, because it is only for evaluation 
> purpose.)
> 
> 0004 is what we really wanted to propose. is_transactional is added to the
> WAL record, and it stores whether the operation is transactional. In order
> to distinguish the status, rd_createSubid and rd_newRelfilelocatorSubid
> are used. According to the comment, they hold a valid value only when the
> relation was changed within the current transaction.
> Also, sequences_hash was not needed anymore, so it and the related
> functions were removed.
> 
> How do you think?
> 

I think it's a very nice idea, assuming it maintains the current
behavior. It makes a lot of code unnecessary, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-11-27 Thread Tomas Vondra



On 11/27/23 12:11, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
>  wrote:
>>
>> On 11/27/23 11:13, Amit Kapila wrote:
>>> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila  
>>> wrote:

 On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
  wrote:
>
> While going over 0001, I realized there might be an optimization for
> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> searches through all top-level transactions, and if there's many of them
> that might be expensive, even if very few of them have any relfilenodes
> in the hash table. It's still linear search, and it needs to happen for
> each sequence change.
>
> But can the relfilenode even be in some other top-level transaction? How
> could it be - our transaction would not see it, and wouldn't be able to
> generate the sequence change. So we should be able to simply check *our*
> transaction (or if it's a subxact, the top-level transaction). Either
> it's there (and it's transactional change), or not (and then it's
> non-transactional change).
>

 I also think the relfilenode should be part of either the current
 top-level xact or one of its subxact, so looking at all the top-level
 transactions for each change doesn't seem advisable.

> The 0004 does this.
>
> This of course hinges on when exactly the transactions get created, and
> assignments processed. For example if this would fire before the txn
> gets assigned to the top-level one, this would break. I don't think this
> can happen thanks to the immediate logging of assignments, but I'm too
> tired to think about it now.
>

 This needs some thought because I think we can't guarantee the
 association till we reach the point where we can actually decode the
 xact. See comments in AssertTXNLsnOrder() [1].

>>
>> I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
>> which says:
>>
>>   /*
>>* Skip the verification if we don't reach the LSN at which we start
>>* decoding the contents of transactions yet because until we reach
>>* the LSN, we could have transactions that don't have the association
>>* between the top-level transaction and subtransaction yet and
>>* consequently have the same LSN.  We don't guarantee this
>>* association until we try to decode the actual contents of
>>* transaction. The ordering of the records prior to the
>>* start_decoding_at LSN should have been checked before the restart.
>>*/
>>
>> But doesn't this say that after we actually start decoding / stop
>> skipping, we should have seen the assignment? We're already decoding
>> transaction contents (because sequence change *is* part of xact, even if
>> we decide to replay it in the non-transactional way).
>>
> 
> It means that the assignment is decided after the start_decoding_at
> point. We haven't decided that we are past start_decoding_at by the
> time the patch computes the transactional flag.
> 

Ah, I see. We're deciding if the change is transactional before calling
SnapBuildXactNeedsSkip. That's a bit unfortunate.

>>>
>>> I am wondering that instead of building the infrastructure to know
>>> whether a particular change is transactional on the decoding side,
>>> can't we have some flag in the WAL record to note whether the change
>>> is transactional or not? I have discussed this point with my colleague
>>> Kuroda-San and we thought that it may be worth exploring whether we
>>> can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
>>> determine if the sequence is created/changed in the current
>>> subtransaction and then record that in WAL record. By this, we need to
>>> have additional information in the WAL record like XLOG_SEQ_LOG but we
>>> can probably do it only with wal_level as logical.
>>>
>>
>> I may not understand the proposal exactly, but it's not enough to know
>> if it was created in the same subxact. It might have been created in
>> some earlier subxact in the same top-level xact.
>>
> 
> We should be able to detect even some earlier subxact or top-level
> xact based on rd_createSubid/rd_newRelfilelocatorSubid.
> 

Interesting. I admit I haven't considered using these fields before, so
I need to familiarize with it a bit, and try if it'd work.

>> FWIW I think one of the earlier patch versions did something like this,
>> by adding a "created" flag in the xlog record. And we concluded doing
>> this on the decoding side is a better solution.
>>
> 
> oh, I thought it would be much simpler than what we are doing on the
> decoding-side. Can you please point me to the email discussion where
> this is concluded or share the reason?
> 

I think the discussion started around [1], and then in a bunch of
following messages (search for "relfilenode").

regards


[1]

Re: logical decoding and replication of sequences, take 2

2023-11-27 Thread Amit Kapila
On Mon, Nov 27, 2023 at 4:41 PM Amit Kapila  wrote:
>
> On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
>  wrote:
> >
>
> > FWIW I think one of the earlier patch versions did something like this,
> > by adding a "created" flag in the xlog record. And we concluded doing
> > this on the decoding side is a better solution.
> >
>
> oh, I thought it would be much simpler than what we are doing on the
> decoding-side. Can you please point me to the email discussion where
> this is concluded or share the reason?
>

I'll check the thread about this point by myself as well but if by
chance you remember it then kindly share it.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-11-27 Thread Amit Kapila
On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
 wrote:
>
> On 11/27/23 11:13, Amit Kapila wrote:
> > On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila  
> > wrote:
> >>
> >> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
> >>  wrote:
> >>>
> >>> While going over 0001, I realized there might be an optimization for
> >>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> >>> searches through all top-level transactions, and if there's many of them
> >>> that might be expensive, even if very few of them have any relfilenodes
> >>> in the hash table. It's still linear search, and it needs to happen for
> >>> each sequence change.
> >>>
> >>> But can the relfilenode even be in some other top-level transaction? How
> >>> could it be - our transaction would not see it, and wouldn't be able to
> >>> generate the sequence change. So we should be able to simply check *our*
> >>> transaction (or if it's a subxact, the top-level transaction). Either
> >>> it's there (and it's transactional change), or not (and then it's
> >>> non-transactional change).
> >>>
> >>
> >> I also think the relfilenode should be part of either the current
> >> top-level xact or one of its subxact, so looking at all the top-level
> >> transactions for each change doesn't seem advisable.
> >>
> >>> The 0004 does this.
> >>>
> >>> This of course hinges on when exactly the transactions get created, and
> >>> assignments processed. For example if this would fire before the txn
> >>> gets assigned to the top-level one, this would break. I don't think this
> >>> can happen thanks to the immediate logging of assignments, but I'm too
> >>> tired to think about it now.
> >>>
> >>
> >> This needs some thought because I think we can't guarantee the
> >> association till we reach the point where we can actually decode the
> >> xact. See comments in AssertTXNLsnOrder() [1].
> >>
>
> I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
> which says:
>
>   /*
>* Skip the verification if we don't reach the LSN at which we start
>* decoding the contents of transactions yet because until we reach
>* the LSN, we could have transactions that don't have the association
>* between the top-level transaction and subtransaction yet and
>* consequently have the same LSN.  We don't guarantee this
>* association until we try to decode the actual contents of
>* transaction. The ordering of the records prior to the
>* start_decoding_at LSN should have been checked before the restart.
>*/
>
> But doesn't this say that after we actually start decoding / stop
> skipping, we should have seen the assignment? We're already decoding
> transaction contents (because sequence change *is* part of xact, even if
> we decide to replay it in the non-transactional way).
>

It means that the assignment is decided after the start_decoding_at
point. We haven't decided that we are past start_decoding_at by the
time the patch computes the transactional flag.

> >
> > I am wondering that instead of building the infrastructure to know
> > whether a particular change is transactional on the decoding side,
> > can't we have some flag in the WAL record to note whether the change
> > is transactional or not? I have discussed this point with my colleague
> > Kuroda-San and we thought that it may be worth exploring whether we
> > can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
> > determine if the sequence is created/changed in the current
> > subtransaction and then record that in WAL record. By this, we need to
> > have additional information in the WAL record like XLOG_SEQ_LOG but we
> > can probably do it only with wal_level as logical.
> >
>
> I may not understand the proposal exactly, but it's not enough to know
> if it was created in the same subxact. It might have been created in
> some earlier subxact in the same top-level xact.
>

We should be able to detect even some earlier subxact or top-level
xact based on rd_createSubid/rd_newRelfilelocatorSubid.

> FWIW I think one of the earlier patch versions did something like this,
> by adding a "created" flag in the xlog record. And we concluded doing
> this on the decoding side is a better solution.
>

oh, I thought it would be much simpler than what we are doing on the
decoding-side. Can you please point me to the email discussion where
this is concluded or share the reason?

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-11-27 Thread Tomas Vondra



On 11/27/23 11:13, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila  wrote:
>>
>> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
>>  wrote:
>>>
>>> While going over 0001, I realized there might be an optimization for
>>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
>>> searches through all top-level transactions, and if there's many of them
>>> that might be expensive, even if very few of them have any relfilenodes
>>> in the hash table. It's still linear search, and it needs to happen for
>>> each sequence change.
>>>
>>> But can the relfilenode even be in some other top-level transaction? How
>>> could it be - our transaction would not see it, and wouldn't be able to
>>> generate the sequence change. So we should be able to simply check *our*
>>> transaction (or if it's a subxact, the top-level transaction). Either
>>> it's there (and it's transactional change), or not (and then it's
>>> non-transactional change).
>>>
>>
>> I also think the relfilenode should be part of either the current
>> top-level xact or one of its subxact, so looking at all the top-level
>> transactions for each change doesn't seem advisable.
>>
>>> The 0004 does this.
>>>
>>> This of course hinges on when exactly the transactions get created, and
>>> assignments processed. For example if this would fire before the txn
>>> gets assigned to the top-level one, this would break. I don't think this
>>> can happen thanks to the immediate logging of assignments, but I'm too
>>> tired to think about it now.
>>>
>>
>> This needs some thought because I think we can't guarantee the
>> association till we reach the point where we can actually decode the
>> xact. See comments in AssertTXNLsnOrder() [1].
>>

I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
which says:

  /*
   * Skip the verification if we don't reach the LSN at which we start
   * decoding the contents of transactions yet because until we reach
   * the LSN, we could have transactions that don't have the association
   * between the top-level transaction and subtransaction yet and
   * consequently have the same LSN.  We don't guarantee this
   * association until we try to decode the actual contents of
   * transaction. The ordering of the records prior to the
   * start_decoding_at LSN should have been checked before the restart.
   */

But doesn't this say that after we actually start decoding / stop
skipping, we should have seen the assignment? We're already decoding
transaction contents (because sequence change *is* part of xact, even if
we decide to replay it in the non-transactional way).

> 
> I am wondering that instead of building the infrastructure to know
> whether a particular change is transactional on the decoding side,
> can't we have some flag in the WAL record to note whether the change
> is transactional or not? I have discussed this point with my colleague
> Kuroda-San and we thought that it may be worth exploring whether we
> can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
> determine if the sequence is created/changed in the current
> subtransaction and then record that in WAL record. By this, we need to
> have additional information in the WAL record like XLOG_SEQ_LOG but we
> can probably do it only with wal_level as logical.
> 

I may not understand the proposal exactly, but it's not enough to know
if it was created in the same subxact. It might have been created in
some earlier subxact in the same top-level xact.

FWIW I think one of the earlier patch versions did something like this,
by adding a "created" flag in the xlog record. And we concluded doing
this on the decoding side is a better solution.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-11-27 Thread Amit Kapila
On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila  wrote:
>
> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
>  wrote:
> >
> > While going over 0001, I realized there might be an optimization for
> > ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> > searches through all top-level transactions, and if there's many of them
> > that might be expensive, even if very few of them have any relfilenodes
> > in the hash table. It's still linear search, and it needs to happen for
> > each sequence change.
> >
> > But can the relfilenode even be in some other top-level transaction? How
> > could it be - our transaction would not see it, and wouldn't be able to
> > generate the sequence change. So we should be able to simply check *our*
> > transaction (or if it's a subxact, the top-level transaction). Either
> > it's there (and it's transactional change), or not (and then it's
> > non-transactional change).
> >
>
> I also think the relfilenode should be part of either the current
> top-level xact or one of its subxact, so looking at all the top-level
> transactions for each change doesn't seem advisable.
>
> > The 0004 does this.
> >
> > This of course hinges on when exactly the transactions get created, and
> > assignments processed. For example if this would fire before the txn
> > gets assigned to the top-level one, this would break. I don't think this
> > can happen thanks to the immediate logging of assignments, but I'm too
> > tired to think about it now.
> >
>
> This needs some thought because I think we can't guarantee the
> association till we reach the point where we can actually decode the
> xact. See comments in AssertTXNLsnOrder() [1].
>

I am wondering that instead of building the infrastructure to know
whether a particular change is transactional on the decoding side,
can't we have some flag in the WAL record to note whether the change
is transactional or not? I have discussed this point with my colleague
Kuroda-San and we thought that it may be worth exploring whether we
can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
determine if the sequence is created/changed in the current
subtransaction and then record that in WAL record. By this, we need to
have additional information in the WAL record like XLOG_SEQ_LOG but we
can probably do it only with wal_level as logical.

One minor point:
It'd also
+ * trigger assert in DecodeSequence.

I don't see DecodeSequence() in the patch. Which exact assert/function
are you referring to here?

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-11-26 Thread Amit Kapila
On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
 wrote:
>
> I've been cleaning up the first two patches to get them committed soon
> (adding the decoding infrastructure + test_decoding), cleaning up stale
> comments, updating commit messages etc. And I think it's ready to go,
> but it's too late over, so I plan going over once more tomorrow and then
> likely push. But if someone wants to take a look, I'd welcome that.
>
> The one issue I found during this cleanup is that the patch was missing
> the changes introduced by 29d0a77fa660 for decoding of other stuff.
>
>   commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
>   Author: Amit Kapila 
>   Date:   Thu Oct 26 06:54:16 2023 +0530
>
>   Migrate logical slots to the new node during an upgrade.
>   ...
>
> I fixed that, but perhaps someone might want to double check ...
>
>
> 0003 is here just for completeness - that's the part adding sequences to
> built-in replication. I haven't done much with it, it needs some cleanup
> too to get it committable. I don't intend to push that right after
> 0001+0002, though.
>
>
> While going over 0001, I realized there might be an optimization for
> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> searches through all top-level transactions, and if there's many of them
> that might be expensive, even if very few of them have any relfilenodes
> in the hash table. It's still linear search, and it needs to happen for
> each sequence change.
>
> But can the relfilenode even be in some other top-level transaction? How
> could it be - our transaction would not see it, and wouldn't be able to
> generate the sequence change. So we should be able to simply check *our*
> transaction (or if it's a subxact, the top-level transaction). Either
> it's there (and it's transactional change), or not (and then it's
> non-transactional change).
>

I also think the relfilenode should be part of either the current
top-level xact or one of its subxact, so looking at all the top-level
transactions for each change doesn't seem advisable.

> The 0004 does this.
>
> This of course hinges on when exactly the transactions get created, and
> assignments processed. For example if this would fire before the txn
> gets assigned to the top-level one, this would break. I don't think this
> can happen thanks to the immediate logging of assignments, but I'm too
> tired to think about it now.
>

This needs some thought because I think we can't guarantee the
association till we reach the point where we can actually decode the
xact. See comments in AssertTXNLsnOrder() [1].

I noticed few minor comments while reading the patch:
1.
+ * turned on here because the non-transactional logical message is
+ * decoded without waiting for these records.

Instead of '.. logical message', shouldn't we say sequence change message?

2.
+ /*
+ * If we found an entry with matchine relfilenode,

typo (matchine)

3.
+  Note that this may not the value obtained by the process updating the
+  process, but the future sequence value written to WAL (typically about
+  32 values ahead).

/may not the value/may not be the value

[1] -
/*
* Skip the verification if we don't reach the LSN at which we start
* decoding the contents of transactions yet because until we reach the
* LSN, we could have transactions that don't have the association between
* the top-level transaction and subtransaction yet and consequently have
* the same LSN.  We don't guarantee this association until we try to
* decode the actual contents of transaction.

-- 
With Regards,
Amit Kapila.




RE: logical decoding and replication of sequences, take 2

2023-10-24 Thread Zhijie Hou (Fujitsu)
On Thursday, October 12, 2023 11:06 PM Tomas Vondra 
 wrote:
>

Hi,

I have been reviewing the patch set, and here are some initial comments.

1.

I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for transactional
sequence changes in ReorderBufferQueueChange().

2.

ReorderBufferSequenceIsTransactional

It seems we call the above function once in sequence_decode() and call it again
in ReorderBufferQueueSequence(). Would it be better to avoid the second call,
as the hash table search does not look cheap?

3.

The patch cleans up the sequence hash table when a transaction COMMITs or
ABORTs (via ReorderBufferAbort() and ReorderBufferReturnTXN()), but it doesn't
seem to destroy the hash table when the transaction is PREPAREd. It's not a big
problem, but would it be better to release the memory earlier by destroying the
table on prepare?

4.

+pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
...
+   /* output BEGIN if we haven't yet, but only for the transactional case 
*/
+   if (transactional)
+   {
+   if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+   {
+   pg_output_begin(ctx, data, txn, false);
+   }
+   txndata->xact_wrote_changes = true;
+   }

I think we should call pg_output_stream_start() instead of pg_output_begin()
for streaming sequence changes.

5.
+   /*
+* Schema should be sent using the original relation because it
+* also sends the ancestor's relation.
+*/
+   maybe_send_schema(ctx, txn, relation, relentry);

The comment seems a bit misleading here; I think it refers to the partition
logic in pgoutput_change().

Best Regards,
Hou zj


Re: logical decoding and replication of sequences, take 2

2023-10-14 Thread Amit Kapila
On Thu, Oct 12, 2023 at 9:03 PM Tomas Vondra
 wrote:
>
> On 7/25/23 12:20, Amit Kapila wrote:
> > ...
> >
> > I have used the debugger to reproduce this as it needs quite some
> > coordination. I just wanted to see if the sequence can go backward and
> > didn't catch up completely before the sequence state is marked
> > 'ready'. On the publisher side, I created a publication with a table
> > and a sequence. Then did the following steps:
> > SELECT nextval('s') FROM generate_series(1,50);
> > insert into t1 values(1);
> > SELECT nextval('s') FROM generate_series(51,150);
> >
> > Then on the subscriber side with some debugging aid, I could find the
> > values in the sequence shown in the previous email. Sorry, I haven't
> > recorded each and every step but, if you think it helps, I can again
> > try to reproduce it and share the steps.
> >
>
> Amit, can you try to reproduce this backwards movement with the latest
> version of the patch?
>

I lost touch with this patch, but IIRC the quoted problem per se
shouldn't occur after the idea to use the page LSN instead of the slot's
LSN for synchronization between the sync and apply workers.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-10-12 Thread Tomas Vondra
On 7/25/23 12:20, Amit Kapila wrote:
> ...
>
> I have used the debugger to reproduce this as it needs quite some
> coordination. I just wanted to see if the sequence can go backward and
> didn't catch up completely before the sequence state is marked
> 'ready'. On the publisher side, I created a publication with a table
> and a sequence. Then did the following steps:
> SELECT nextval('s') FROM generate_series(1,50);
> insert into t1 values(1);
> SELECT nextval('s') FROM generate_series(51,150);
> 
> Then on the subscriber side with some debugging aid, I could find the
> values in the sequence shown in the previous email. Sorry, I haven't
> recorded each and every step but, if you think it helps, I can again
> try to reproduce it and share the steps.
> 

Amit, can you try to reproduce this backwards movement with the latest
version of the patch? I have tried triggering that (mis)behavior, but I
haven't been successful so far. I'm hesitant to declare it resolved, as
it's dependent on timing etc. and you mentioned it required quite some
coordination.


Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-10-12 Thread Tomas Vondra



On 9/13/23 15:18, Ashutosh Bapat wrote:
> On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila  wrote:
>>
>> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila  wrote:
>>>
>>> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
>>>  wrote:

 On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
  wrote:
>
>>
>> But whether or not that's the case, downstream should not request (and
>> hence receive) any changes that have been already applied (and
>> committed) downstream as a principle. I think a way to achieve this is
>> to update the replorigin_session_origin_lsn so that a sequence change
>> applied once is not requested (and hence sent) again.
>>
>
> I guess we could update the origin, per attached 0004. We don't have
> timestamp to set replorigin_session_origin_timestamp, but it seems we
> don't need that.
>
> The attached patch merges the earlier improvements, except for the part
> that experimented with adding a "fake" transaction (which turned out to
> have a number of difficult issues).

 0004 looks good to me.
>>>
>>>
>>> + {
>>>   CommitTransactionCommand();
>>> +
>>> + /*
>>> + * Update origin state so we don't try applying this sequence
>>> + * change in case of crash.
>>> + *
>>> + * XXX We don't have replorigin_session_origin_timestamp, but we
>>> + * can just leave that set to 0.
>>> + */
>>> + replorigin_session_origin_lsn = seq.lsn;
>>>
>>> IIUC, your proposal is to update the replorigin_session_origin_lsn, so
>>> that after restart, it doesn't use some prior origin LSN to start with
>>> which can in turn lead the sequence to go backward. If so, it should
>>> be updated before calling CommitTransactionCommand() as we are doing
>>> in apply_handle_commit_internal(). If that is not the intention then
>>> it is not clear to me how updating replorigin_session_origin_lsn after
>>> commit is helpful.
>>>
>>
>> typedef struct ReplicationState
>> {
>> ...
>> /*
>>  * Location of the latest commit from the remote side.
>>  */
>> XLogRecPtrremote_lsn;
>>
>> This is the variable that will be updated with the value of
>> replorigin_session_origin_lsn. This means we will now track some
>> arbitrary LSN location of the remote side in this variable. The above
>> comment makes me wonder if there is anything we are missing or if it
>> is just a matter of updating this comment because before the patch we
>> always adhere to what is written in the comment.
> 
> I don't think we are missing anything. This value is used to track the
> remote LSN up to which all the commits from upstream have been applied
> locally. Since a non-transactional sequence change is like a single
> WAL record transaction, its LSN acts as the LSN of the mini-commit.
> So it should be fine to update remote_lsn with the sequence WAL record's
> end LSN. That's what the patches do. I don't see any hazard. But you
> are right, we need to update the comments - here and also at other
> places like replorigin_session_advance(), which uses remote_commit as
> the name of the argument that gets assigned to
> ReplicationState::remote_lsn.
> 

I agree - updating the replorigin_session_origin_lsn shouldn't break
anything. As you write, it's essentially a "mini-commit" and the commit
order remains the same.

I'm not sure about resetting replorigin_session_origin_timestamp to 0,
though. It's not something we rely on very much (it may not be correlated
with the commit order etc.). But why should we set it to 0? We don't do
that for regular commits, right? And IMO it makes sense to just use the
timestamp of the last commit before the sequence change.

FWIW I've left this in a separate commit, but I'll merge that into 0002
in the next patch version.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-10-12 Thread Tomas Vondra



On 9/20/23 11:53, Dilip Kumar wrote:
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
>  wrote:
>>
> 
> I was reading through 0001, I noticed this comment in
> ReorderBufferSequenceIsTransactional() function
> 
> + * To decide if a sequence change should be handled as transactional or 
> applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
> 
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the streaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
> 

Yeah, that's a stale comment - the actual code only searches through the
top-level ones (and thus relies on the immediate assignment). As I
wrote in the earlier response, I suspect this code originates from
before I added the GetCurrentTransactionId() calls.

That being said, I do wonder why with the immediate assignments we still
need the bit in ReorderBufferAssignChild that says:

/*
 * We already saw this transaction, but initially added it to the
 * list of top-level txns.  Now that we know it's not top-level,
 * remove it from there.
 */
dlist_delete(&subtxn->node);

I don't think that affects this patch, but it's a bit confusing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: logical decoding and replication of sequences, take 2

2023-09-25 Thread Zhijie Hou (Fujitsu)
On Friday, September 15, 2023 11:11 AM Zhijie Hou (Fujitsu) 
 wrote:
> 
> On Wednesday, August 16, 2023 10:27 PM Tomas Vondra
>  wrote:
> 
> Hi,
> 
> >
> >
> > I guess we could update the origin, per attached 0004. We don't have
> > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > don't need that.
> >
> > The attached patch merges the earlier improvements, except for the
> > part that experimented with adding a "fake" transaction (which turned
> > out to have a number of difficult issues).
> 
> I tried to test the patch and found a crash when calling
> pg_logical_slot_get_changes() to consume sequence changes.

Oh, after checking again, I realized it was my fault - my build environment
was not clean. The case passed after rebuilding. Sorry for the noise.

Best Regards,
Hou zj


Re: logical decoding and replication of sequences, take 2

2023-09-22 Thread Dilip Kumar
On Wed, Sep 20, 2023 at 3:23 PM Dilip Kumar  wrote:
>
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
>  wrote:
> >
>
> I was reading through 0001, I noticed this comment in
> ReorderBufferSequenceIsTransactional() function
>
> + * To decide if a sequence change should be handled as transactional or 
> applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
>
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the streaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
>
> Refer this code
>
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> XLogReaderState *record)
> {
> ...
> /*
> * If the top-level xid is valid, we need to assign the subxact to the
> * top-level xact. We need to do this for all records, hence we do it
> * before the switch.
> */
> if (TransactionIdIsValid(txid))
> {
> ReorderBufferAssignChild(ctx->reorder,
> txid,
> XLogRecGetXid(record),
> buf.origptr);
> }
> }

Some more comments

1.
ReorderBufferSequenceIsTransactional and ReorderBufferSequenceGetXid
are mostly duplicated: the first one just confirms whether the
relfilelocator was created in the transaction, while the other also
returns the XID. I think these two could easily be merged so that we
can avoid the duplicated code.

2.
/*
+ * ReorderBufferTransferSequencesToParent
+ * Copy the relfilenode entries to the parent after assignment.
+ */
+static void
+ReorderBufferTransferSequencesToParent(ReorderBuffer *rb,
+ReorderBufferTXN *txn,
+ReorderBufferTXN *subtxn)

If we agree with my comment in the previous email (i.e. the first WAL
record by a subxid will always include the topxid), then we do not need
this function at all: we can always add the relfilelocator directly to
the top transaction, and we never need to transfer it.

That is all I have for now from the first pass of 0001; later I will do a
more detailed review and look into the other patches as well.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-09-20 Thread Dilip Kumar
On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
 wrote:
>

I was reading through 0001, I noticed this comment in
ReorderBufferSequenceIsTransactional() function

+ * To decide if a sequence change should be handled as transactional or applied
+ * immediately, we track (sequence) relfilenodes created by each transaction.
+ * We don't know if the current sub-transaction was already assigned to the
+ * top-level transaction, so we need to check all transactions.

It says "We don't know if the current sub-transaction was already
assigned to the top-level transaction, so we need to check all
transactions". But IIRC as part of the streaming of in-progress
transactions we have ensured that whenever we are logging the first
change by any subtransaction we include the top transaction ID in it.

Refer to this code

LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *record)
{
...
/*
* If the top-level xid is valid, we need to assign the subxact to the
* top-level xact. We need to do this for all records, hence we do it
* before the switch.
*/
if (TransactionIdIsValid(txid))
{
ReorderBufferAssignChild(ctx->reorder,
txid,
XLogRecGetXid(record),
buf.origptr);
}
}

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: logical decoding and replication of sequences, take 2

2023-09-14 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 16, 2023 10:27 PM Tomas Vondra 
 wrote:

Hi,

> 
> 
> I guess we could update the origin, per attached 0004. We don't have
> timestamp to set replorigin_session_origin_timestamp, but it seems we don't
> need that.
> 
> The attached patch merges the earlier improvements, except for the part that
> experimented with adding a "fake" transaction (which turned out to have a
> number of difficult issues).

I tried to test the patch and found a crash when calling
pg_logical_slot_get_changes() to consume sequence changes.

Steps:

create table t1_seq(a int);
create sequence seq1;
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
'test_decoding', false, true);
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
SELECT data  FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1');


Attach the backtrace in bt.txt.

Best Regards,
Hou zj
(gdb) bt
#0  0x7fc7fc0ddaff in raise () from /lib64/libc.so.6
#1  0x7fc7fc0b0ea5 in abort () from /lib64/libc.so.6
#2  0x7fc7fc120097 in __libc_message () from /lib64/libc.so.6
#3  0x7fc7fc1274ec in malloc_printerr () from /lib64/libc.so.6
#4  0x7fc7fc12929c in _int_free () from /lib64/libc.so.6
#5  0x00ba3a9a in AllocSetDelete (context=0x234e820) at aset.c:652
#6  0x00bae6ab in MemoryContextDelete (context=0x234e820) at mcxt.c:437
#7  0x009171e2 in filter_by_origin_cb_wrapper (ctx=0x2356970, 
origin_id=0)
at logical.c:1234
#8  0x0091153b in FilterByOrigin (ctx=0x2356970, origin_id=0) at 
decode.c:583
#9  0x00911d5b in DecodeInsert (ctx=0x2356970, buf=0x7ffd28f1ad70) at 
decode.c:907
#10 0x00911332 in heap_decode (ctx=0x2356970, buf=0x7ffd28f1ad70) at 
decode.c:488
#11 0x00910983 in LogicalDecodingProcessRecord (ctx=0x2356970, 
record=0x2356d08)
at decode.c:122
#12 0x009192af in pg_logical_slot_get_changes_guts (fcinfo=0x2389380, 
confirm=true,
binary=false) at logicalfuncs.c:258
#13 0x009193e4 in pg_logical_slot_get_changes (fcinfo=0x2389380)
at logicalfuncs.c:325
#14 0x00743ab4 in ExecMakeTableFunctionResult (setexpr=0x2373828,
econtext=0x23736f8, argContext=0x2389280, expectedDesc=0x23341c8, 
randomAccess=false)
at execSRF.c:235
#15 0x0075ed3d in FunctionNext (node=0x23734e8) at nodeFunctionscan.c:95
#16 0x00745312 in ExecScanFetch (node=0x23734e8, accessMtd=0x75ec8b 
,
recheckMtd=0x75f095 ) at execScan.c:132
#17 0x007453b3 in ExecScan (node=0x23734e8, accessMtd=0x75ec8b 
,
recheckMtd=0x75f095 ) at execScan.c:198
#18 0x0075f0df in ExecFunctionScan (pstate=0x23734e8) at 
nodeFunctionscan.c:270
#19 0x00741324 in ExecProcNodeFirst (node=0x23734e8) at 
execProcnode.c:464
#20 0x0073503c in ExecProcNode (node=0x23734e8)
at ../../../src/include/executor/executor.h:273
#21 0x00737ca7 in ExecutePlan (estate=0x23732c0, planstate=0x23734e8,
use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, 
numberTuples=0,
direction=ForwardScanDirection, dest=0x2330518, execute_once=true) at 
execMain.c:1670
#22 0x00735682 in standard_ExecutorRun (queryDesc=0x2365020,
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:365
#23 0x007354bf in ExecutorRun (queryDesc=0x2365020, 
direction=ForwardScanDirection,
count=0, execute_once=true) at execMain.c:309
#24 0x009bfc7e in PortalRunSelect (portal=0x22df3c0, forward=true, 
count=0,
dest=0x2330518) at pquery.c:924
#25 0x009bf93b in PortalRun (portal=0x22df3c0, 
count=9223372036854775807,
isTopLevel=true, run_once=true, dest=0x2330518, altdest=0x2330518, 
qc=0x7ffd28f1b550)
at pquery.c:768
#26 0x009b94c4 in exec_simple_query (
query_string=0x2264410 "SELECT data  FROM 
pg_logical_slot_get_changes('test_slot', NULL, NULL,\n'include-xids', 'false', 
'skip-empty-xacts', '1');") at postgres.c:1274
#27 0x009bdb31 in PostgresMain (dbname=0x229d240 "postgres",
username=0x229d228 "houzj") at postgres.c:4637
#28 0x008f27ef in BackendRun (port=0x2293650) at postmaster.c:4433
#29 0x008f2185 in BackendStartup (port=0x2293650) at postmaster.c:4161
#30 0x008eebff in ServerLoop () at postmaster.c:1778
#31 0x008ee5cf in PostmasterMain (argc=3, argv=0x225d690) at 
postmaster.c:1462

Re: logical decoding and replication of sequences, take 2

2023-09-13 Thread Ashutosh Bapat
On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila  wrote:
>
> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila  wrote:
> >
> > On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> > >  wrote:
> > > >
> > > > >
> > > > > But whether or not that's the case, downstream should not request (and
> > > > > hence receive) any changes that have been already applied (and
> > > > > committed) downstream as a principle. I think a way to achieve this is
> > > > > to update the replorigin_session_origin_lsn so that a sequence change
> > > > > applied once is not requested (and hence sent) again.
> > > > >
> > > >
> > > > I guess we could update the origin, per attached 0004. We don't have
> > > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > > don't need that.
> > > >
> > > > The attached patch merges the earlier improvements, except for the part
> > > > that experimented with adding a "fake" transaction (which turned out to
> > > > have a number of difficult issues).
> > >
> > > 0004 looks good to me.
> >
> >
> > + {
> >   CommitTransactionCommand();
> > +
> > + /*
> > + * Update origin state so we don't try applying this sequence
> > + * change in case of crash.
> > + *
> > + * XXX We don't have replorigin_session_origin_timestamp, but we
> > + * can just leave that set to 0.
> > + */
> > + replorigin_session_origin_lsn = seq.lsn;
> >
> > IIUC, your proposal is to update the replorigin_session_origin_lsn, so
> > that after restart, it doesn't use some prior origin LSN to start with
> > which can in turn lead the sequence to go backward. If so, it should
> > be updated before calling CommitTransactionCommand() as we are doing
> > in apply_handle_commit_internal(). If that is not the intention then
> > it is not clear to me how updating replorigin_session_origin_lsn after
> > commit is helpful.
> >
>
> typedef struct ReplicationState
> {
> ...
> /*
>  * Location of the latest commit from the remote side.
>  */
> XLogRecPtrremote_lsn;
>
> This is the variable that will be updated with the value of
> replorigin_session_origin_lsn. This means we will now track some
> arbitrary LSN location of the remote side in this variable. The above
> comment makes me wonder if there is anything we are missing or if it
> is just a matter of updating this comment because before the patch we
> always adhere to what is written in the comment.

I don't think we are missing anything. This value is used to track the
remote LSN up to which all the commits from upstream have been applied
locally. Since a non-transactional sequence change is like a
single-WAL-record transaction, its LSN acts as the LSN of the
mini-commit. So it should be fine to update remote_lsn with the
sequence WAL record's end LSN, which is what the patches do. I don't
see any hazard. But you are right, we need to update the comments,
here and also at other places such as replorigin_session_advance(),
which uses remote_commit as the name of the argument that gets
assigned to ReplicationState::remote_lsn.

-- 
Best Wishes,
Ashutosh Bapat




Re: logical decoding and replication of sequences, take 2

2023-09-13 Thread Ashutosh Bapat
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
 wrote:
> >
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me. But I need to review the impact of not setting
> replorigin_session_origin_timestamp.

I think it will be good to set replorigin_session_origin_timestamp = 0
explicitly so as not to pick up a garbage value. The timestamp is
written to the commit record; beyond that I don't see any use of it.
It is further passed downstream if there is a cascaded logical
replication setup, but I don't see it being used there either, so it
should be fine to leave it at 0. I don't think we can use logically
replicated sequences in a multi-master environment where the timestamp
may be used to resolve conflicts. Such a setup would require
distributed sequence management, which cannot be achieved by logical
replication alone.

In short, I didn't find any hazard in leaving the
replorigin_session_origin_timestamp as 0.

-- 
Best Wishes,
Ashutosh Bapat




Re: logical decoding and replication of sequences, take 2

2023-08-18 Thread Amit Kapila
On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila  wrote:
>
> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
>  wrote:
> >
> > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> >  wrote:
> > >
> > > >
> > > > But whether or not that's the case, downstream should not request (and
> > > > hence receive) any changes that have been already applied (and
> > > > committed) downstream as a principle. I think a way to achieve this is
> > > > to update the replorigin_session_origin_lsn so that a sequence change
> > > > applied once is not requested (and hence sent) again.
> > > >
> > >
> > > I guess we could update the origin, per attached 0004. We don't have
> > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > don't need that.
> > >
> > > The attached patch merges the earlier improvements, except for the part
> > > that experimented with adding a "fake" transaction (which turned out to
> > > have a number of difficult issues).
> >
> > 0004 looks good to me.
>
>
> + {
>   CommitTransactionCommand();
> +
> + /*
> + * Update origin state so we don't try applying this sequence
> + * change in case of crash.
> + *
> + * XXX We don't have replorigin_session_origin_timestamp, but we
> + * can just leave that set to 0.
> + */
> + replorigin_session_origin_lsn = seq.lsn;
>
> IIUC, your proposal is to update the replorigin_session_origin_lsn, so
> that after restart, it doesn't use some prior origin LSN to start with
> which can in turn lead the sequence to go backward. If so, it should
> be updated before calling CommitTransactionCommand() as we are doing
> in apply_handle_commit_internal(). If that is not the intention then
> it is not clear to me how updating replorigin_session_origin_lsn after
> commit is helpful.
>

typedef struct ReplicationState
{
...
/*
 * Location of the latest commit from the remote side.
 */
XLogRecPtrremote_lsn;

This is the variable that will be updated with the value of
replorigin_session_origin_lsn. This means we will now track some
arbitrary LSN location of the remote side in this variable. The above
comment makes me wonder if there is anything we are missing or if it
is just a matter of updating this comment because before the patch we
always adhere to what is written in the comment.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-08-17 Thread Amit Kapila
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
 wrote:
>
> On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
>  wrote:
> >
> > >
> > > But whether or not that's the case, downstream should not request (and
> > > hence receive) any changes that have been already applied (and
> > > committed) downstream as a principle. I think a way to achieve this is
> > > to update the replorigin_session_origin_lsn so that a sequence change
> > > applied once is not requested (and hence sent) again.
> > >
> >
> > I guess we could update the origin, per attached 0004. We don't have
> > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > don't need that.
> >
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me.


+ {
  CommitTransactionCommand();
+
+ /*
+ * Update origin state so we don't try applying this sequence
+ * change in case of crash.
+ *
+ * XXX We don't have replorigin_session_origin_timestamp, but we
+ * can just leave that set to 0.
+ */
+ replorigin_session_origin_lsn = seq.lsn;

IIUC, your proposal is to update the replorigin_session_origin_lsn, so
that after restart, it doesn't use some prior origin LSN to start with
which can in turn lead the sequence to go backward. If so, it should
be updated before calling CommitTransactionCommand() as we are doing
in apply_handle_commit_internal(). If that is not the intention then
it is not clear to me how updating replorigin_session_origin_lsn after
commit is helpful.
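The ordering concern can be illustrated with a minimal standalone model
(entirely hypothetical names; commit() stands in for
CommitTransactionCommand() stamping the session origin into the
durable commit record):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* what survives a crash (the origin LSN carried by the commit record) */
typedef struct { XLogRecPtr origin_lsn; } DurableState;
/* the in-memory session state (replorigin_session_origin_lsn) */
typedef struct { XLogRecPtr origin_lsn; } SessionState;

/* commit copies the session origin into durable state */
static void
commit(const SessionState *s, DurableState *d)
{
    d->origin_lsn = s->origin_lsn;
}

/* correct ordering: advance the session origin, then commit */
static XLogRecPtr
apply_sequence_change(SessionState *s, DurableState *d, XLogRecPtr seq_lsn)
{
    s->origin_lsn = seq_lsn;   /* set before the commit ... */
    commit(s, d);              /* ... so the commit makes it durable */
    return d->origin_lsn;
}

/* wrong ordering: the advance happens after the commit, so a crash
 * right after the commit loses it and the change is requested again */
static XLogRecPtr
apply_sequence_change_buggy(SessionState *s, DurableState *d, XLogRecPtr seq_lsn)
{
    commit(s, d);
    s->origin_lsn = seq_lsn;
    return d->origin_lsn;      /* still the stale value */
}
```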

>
 But I need to review the impact of not setting
> replorigin_session_origin_timestamp.
>

This may not have a direct impact on built-in replication as I think
we don't rely on it yet but we need to think of out-of-core solutions.
I am not sure if I understood your proposal as per my previous comment
but once you clarify the same, I'll also try to think on the same.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-08-17 Thread Ashutosh Bapat
On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
 wrote:
>
> >
> > But whether or not that's the case, downstream should not request (and
> > hence receive) any changes that have been already applied (and
> > committed) downstream as a principle. I think a way to achieve this is
> > to update the replorigin_session_origin_lsn so that a sequence change
> > applied once is not requested (and hence sent) again.
> >
>
> I guess we could update the origin, per attached 0004. We don't have
> timestamp to set replorigin_session_origin_timestamp, but it seems we
> don't need that.
>
> The attached patch merges the earlier improvements, except for the part
> that experimented with adding a "fake" transaction (which turned out to
> have a number of difficult issues).

0004 looks good to me. But I need to review the impact of not setting
replorigin_session_origin_timestamp.

What fake transaction experiment are you talking about?

--
Best Wishes,
Ashutosh Bapat




Re: logical decoding and replication of sequences, take 2

2023-08-11 Thread Ashutosh Bapat
On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra
 wrote:
>
> Anyway, I think this is "just" a matter of efficiency, not correctness.
> IMHO there are bigger questions regarding the "going back" behavior
> after apply restart.


sequence_decode() has the following code
/* Skip the change if already processed (per the snapshot). */
if (transactional &&
!SnapBuildProcessChange(builder, xid, buf->origptr))
return;
else if (!transactional &&
(SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
SnapBuildXactNeedsSkip(builder, buf->origptr)))
return;

This means that if the subscription restarts, the upstream will *not*
send any non-transactional sequence changes with LSN prior to the LSN
specified by START_REPLICATION command. That should avoid replicating
all the non-transactional sequence changes since
ReplicationSlot::restart_lsn if the subscription restarts.
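The skip rules quoted above can be condensed into a standalone
predicate; this is only a sketch (the real code consults the snapshot
builder rather than taking booleans, and the start-point check lives in
SnapBuildXactNeedsSkip()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef enum { SNAPBUILD_BUILDING, SNAPBUILD_CONSISTENT } SnapBuildState;

/*
 * Sketch of sequence_decode()'s skip decision: transactional changes
 * are skipped when the snapshot says the transaction need not be
 * processed; non-transactional ones are skipped until the snapshot is
 * consistent and the record is past the decoding start point.
 */
static bool
skip_sequence_change(bool transactional, bool snapshot_processes_xact,
                     SnapBuildState state, XLogRecPtr record_lsn,
                     XLogRecPtr start_lsn)
{
    if (transactional)
        return !snapshot_processes_xact;
    return state != SNAPBUILD_CONSISTENT || record_lsn < start_lsn;
}
```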

But in apply_handle_sequence(), we do not update the
replorigin_session_origin_lsn with LSN of the non-transactional
sequence change when it's applied. This means that if a subscription
restarts while it is half way through applying a transaction, those
changes will be replicated again. This will move the sequence
backward. If the subscription keeps restarting again and again while
applying that transaction, we will see the sequence "rubber banding"
[1] on the subscriber. So until the transaction is completely applied,
the other users of the sequence may see duplicate values during this
time. I think this is undesirable.

But I am not able to find a case where this can lead to conflicting
values after failover. If there's only one transaction which is
repeatedly being applied, the rows which use sequence values were
never committed so there's no conflicting value present on the
subscription. The same reasoning can be extended to multiple in-flight
transactions. If another transaction (T2) uses the sequence values
changed by in-flight transaction T1 and if T2 commits before T1, the
sequence changes used by T2 must have LSNs before commit of T2 and
thus they will never be replicated. (See example below).

T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q1
T2
insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q2
COMMIT;
T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q3
COMMIT;

So I am not able to imagine a case when a sequence going backward can
cause conflicting values.

But whether or not that's the case, downstream should not request (and
hence receive) any changes that have been already applied (and
committed) downstream as a principle. I think a way to achieve this is
to update the replorigin_session_origin_lsn so that a sequence change
applied once is not requested (and hence sent) again.

[1] https://en.wikipedia.org/wiki/Rubber_banding

--
Best Wishes,
Ashutosh Bapat




Re: logical decoding and replication of sequences, take 2

2023-08-01 Thread Tomas Vondra



On 8/1/23 04:59, Amit Kapila wrote:
> On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra
>  wrote:
>>
>> On 7/31/23 11:25, Amit Kapila wrote:
>>> ...
>>>
>>> Yeah, I also think this needs a review. This is a sort of new concept
>>> where we don't use the LSN of the slot (for cases where copy returned
>>> a larger value of LSN) or a full_snapshot created corresponding to the
>>> sync slot by Walsender. For the case of the table, we build a full
>>> snapshot because we use that for copying the table but why do we need
>>> to build that for copying the sequence especially when we directly
>>> copy it from the sequence relation without caring for any snapshot?
>>>
>>
>> We need the slot to decode/apply changes during catchup. The main
>> subscription may get ahead, and we need to ensure the WAL is not
>> discarded or something like that. This applies even if the initial sync
>> step does not use the slot/snapshot directly.
>>
> 
> AFAIK, none of these needs a full_snapshot (see usage of
> SnapBuild->building_full_snapshot). The full_snapshot tracks both
> catalog and non-catalog xacts in the snapshot where we require to
> track non-catalog ones because we want to copy the table using that
> snapshot. It is relatively expensive to build a full snapshot and we
> don't do that unless it is required. For the current usage of this
> patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient.
> 

Yeah, you may be right we don't need a full snapshot, because we don't
need to export it. We however still need a snapshot, and it wasn't clear
to me whether you suggest we don't need the slot / snapshot at all.

Anyway, I think this is "just" a matter of efficiency, not correctness.
IMHO there are bigger questions regarding the "going back" behavior
after apply restart.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-07-31 Thread Amit Kapila
On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra
 wrote:
>
> On 7/31/23 11:25, Amit Kapila wrote:
> > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra
> >  wrote:
> >>
> >> On 7/28/23 14:44, Ashutosh Bapat wrote:
> >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> >>>  wrote:
> 
>  Anyway, I was thinking about this a bit more, and it seems it's not as
>  difficult to use the page LSN to ensure sequences don't go backwards.
>  The 0005 change does that, by:
> 
>  1) adding pg_sequence_state, that returns both the sequence state and
> the page LSN
> 
>  2) copy_sequence returns the page LSN
> 
>  3) tablesync then sets this LSN as origin_startpos (which for tables is
> just the LSN of the replication slot)
> 
>  AFAICS this makes it work - we start decoding at the page LSN, so that
>  we  skip the increments that could lead to the sequence going backwards.
> 
> >>>
> >>> I like this design very much. It makes things simpler than complex.
> >>> Thanks for doing this.
> >>>
> >>
> >> I agree it seems simpler. It'd be good to try testing / reviewing it a
> >> bit more, so that it doesn't misbehave in some way.
> >>
> >
> > Yeah, I also think this needs a review. This is a sort of new concept
> > where we don't use the LSN of the slot (for cases where copy returned
> > a larger value of LSN) or a full_snapshot created corresponding to the
> > sync slot by Walsender. For the case of the table, we build a full
> > snapshot because we use that for copying the table but why do we need
> > to build that for copying the sequence especially when we directly
> > copy it from the sequence relation without caring for any snapshot?
> >
>
> We need the slot to decode/apply changes during catchup. The main
> subscription may get ahead, and we need to ensure the WAL is not
> discarded or something like that. This applies even if the initial sync
> step does not use the slot/snapshot directly.
>

AFAIK, none of these needs a full_snapshot (see usage of
SnapBuild->building_full_snapshot). The full_snapshot tracks both
catalog and non-catalog xacts in the snapshot where we require to
track non-catalog ones because we want to copy the table using that
snapshot. It is relatively expensive to build a full snapshot and we
don't do that unless it is required. For the current usage of this
patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-07-31 Thread Tomas Vondra



On 7/31/23 11:25, Amit Kapila wrote:
> On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra
>  wrote:
>>
>> On 7/28/23 14:44, Ashutosh Bapat wrote:
>>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>>>  wrote:

 Anyway, I was thinking about this a bit more, and it seems it's not as
 difficult to use the page LSN to ensure sequences don't go backwards.
 The 0005 change does that, by:

 1) adding pg_sequence_state, that returns both the sequence state and
the page LSN

 2) copy_sequence returns the page LSN

 3) tablesync then sets this LSN as origin_startpos (which for tables is
just the LSN of the replication slot)

 AFAICS this makes it work - we start decoding at the page LSN, so that
 we  skip the increments that could lead to the sequence going backwards.

>>>
>>> I like this design very much. It makes things simpler than complex.
>>> Thanks for doing this.
>>>
>>
>> I agree it seems simpler. It'd be good to try testing / reviewing it a
>> bit more, so that it doesn't misbehave in some way.
>>
> 
> Yeah, I also think this needs a review. This is a sort of new concept
> where we don't use the LSN of the slot (for cases where copy returned
> a larger value of LSN) or a full_snapshot created corresponding to the
> sync slot by Walsender. For the case of the table, we build a full
> snapshot because we use that for copying the table but why do we need
> to build that for copying the sequence especially when we directly
> copy it from the sequence relation without caring for any snapshot?
> 

We need the slot to decode/apply changes during catchup. The main
subscription may get ahead, and we need to ensure the WAL is not
discarded or something like that. This applies even if the initial sync
step does not use the slot/snapshot directly.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-07-31 Thread Amit Kapila
On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra
 wrote:
>
> On 7/28/23 14:44, Ashutosh Bapat wrote:
> > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> >  wrote:
> >>
> >> Anyway, I was thinking about this a bit more, and it seems it's not as
> >> difficult to use the page LSN to ensure sequences don't go backwards.
> >> The 0005 change does that, by:
> >>
> >> 1) adding pg_sequence_state, that returns both the sequence state and
> >>the page LSN
> >>
> >> 2) copy_sequence returns the page LSN
> >>
> >> 3) tablesync then sets this LSN as origin_startpos (which for tables is
> >>just the LSN of the replication slot)
> >>
> >> AFAICS this makes it work - we start decoding at the page LSN, so that
> >> we  skip the increments that could lead to the sequence going backwards.
> >>
> >
> > I like this design very much. It makes things simpler than complex.
> > Thanks for doing this.
> >
>
> I agree it seems simpler. It'd be good to try testing / reviewing it a
> bit more, so that it doesn't misbehave in some way.
>

Yeah, I also think this needs a review. This is a sort of new concept
where we don't use the LSN of the slot (for cases where copy returned
a larger value of LSN) or a full_snapshot created corresponding to the
sync slot by Walsender. For the case of the table, we build a full
snapshot because we use that for copying the table but why do we need
to build that for copying the sequence especially when we directly
copy it from the sequence relation without caring for any snapshot?

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-07-29 Thread Tomas Vondra



On 7/29/23 06:54, Amit Kapila wrote:
> On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra
>  wrote:
>>
>> On 7/28/23 11:42, Amit Kapila wrote:
>>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>>>  wrote:

 On 7/26/23 09:27, Amit Kapila wrote:
> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila  
> wrote:

 Anyway, I was thinking about this a bit more, and it seems it's not as
 difficult to use the page LSN to ensure sequences don't go backwards.

>>>
>>> While studying the changes for this proposal and related areas, I have
>>> a few comments:
>>> 1. I think you need to advance the origin if it is changed due to
>>> copy_sequence(), otherwise, if the sync worker restarts after
>>> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
>>> value.
>>>
>>
>> True, we want to restart at the new origin_startpos.
>>
>>> 2. Between the time of SYNCDONE and READY state, the patch can skip
>>> applying non-transactional sequence changes even if it should apply
>>> it. The reason is that during that state change
>>> should_apply_changes_for_rel() decides whether to apply change based
>>> on the value of remote_final_lsn which won't be set for
>>> non-transactional change. I think we need to send the start LSN of a
>>> non-transactional record and then use that as remote_final_lsn for
>>> such a change.
>>
>> Good catch. remote_final_lsn is set in apply_handle_begin, but that
>> won't happen for sequences. We're already sending the LSN, but
>> logicalrep_read_sequence ignores it - it should be enough to add it to
>> LogicalRepSequence and then set it in apply_handle_sequence().
>>
> 
> As per my understanding, the LSN sent is the EndRecPtr of the record,
> which is the beginning of the next record (i.e. current_record_end + 1).
> For comparing against the current record, we use the start position of
> the record, as we do when we use remote_final_lsn via
> apply_handle_begin().
> 
>>>
>>> 3. For non-transactional sequence change apply, we don't set
>>> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
>>> we are doing in apply_handle_commit_internal() before calling
>>> CommitTransactionCommand(). So, that can lead to the origin moving
>>> backwards after restart which will lead to requesting and applying the
>>> same changes again and for that period of time sequence can go
>>> backwards. This needs some more thought as to what is the correct
>>> behaviour/solution for this.
>>>
>>
>> I think saying "origin moves backwards" is a bit misleading. AFAICS the
>> origin position is not actually moving backwards, it's more that we
>> don't (and can't) move it forwards for each non-transactional change. So
>> yeah, we may re-apply those, and IMHO that's expected - the sequence is
>> allowed to be "ahead" on the subscriber.
>>
> 
> But, if this happens then for a period of time the sequence will go
> backwards relative to what one would have observed before restart.
> 

That is true, but is it really a problem? This whole sequence decoding
thing was meant to allow logical failover - make sure that after switch
to the subscriber, the sequences don't generate duplicate values. From
this POV, the sequence going backwards (back to the confirmed origin
position) is not an issue - it's still far enough (ahead of publisher).

Is that great / ideal? No, I agree with that. But it was considered
acceptable and good enough for the failover use case ...

The only idea how to improve that is we could keep the non-transactional
changes (instead of applying them immediately), and then apply them on
the nearest "commit". That'd mean it's subject to the position tracking,
and the sequence would not go backwards, I think.

So every time we decode a commit, we'd check if we decoded any sequence
changes since the last commit, and merge them (a bit like a subxact).

This would however also mean sequence changes from rolled-back xacts may
not be replicated. I think that'd be fine, but IIRC Andres suggested
it's a valid use case.
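That merge-at-commit idea could be sketched as follows; this is
entirely hypothetical (the patch does not implement it), but it shows
how buffered non-transactional changes would fall under normal origin
position tracking:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define MAX_PENDING 8

/* pending non-transactional sequence changes since the last commit */
typedef struct
{
    long long  values[MAX_PENDING];
    size_t     n;
    XLogRecPtr applied_origin;   /* origin position, moved only at commit */
} SeqBuffer;

/* hold the change instead of applying it immediately */
static void
buffer_sequence_change(SeqBuffer *buf, long long value)
{
    assert(buf->n < MAX_PENDING);
    buf->values[buf->n++] = value;
}

/*
 * On commit, apply the buffered changes (only the last value matters
 * for a sequence) and advance the origin to the commit LSN, so the
 * changes are never re-requested after a restart and the sequence
 * does not go backwards.
 */
static long long
on_commit(SeqBuffer *buf, long long current, XLogRecPtr commit_lsn)
{
    if (buf->n > 0)
        current = buf->values[buf->n - 1];
    buf->n = 0;
    buf->applied_origin = commit_lsn;
    return current;
}
```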


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-07-29 Thread Tomas Vondra



On 7/28/23 14:44, Ashutosh Bapat wrote:
> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>  wrote:
>>
>> Anyway, I was thinking about this a bit more, and it seems it's not as
>> difficult to use the page LSN to ensure sequences don't go backwards.
>> The 0005 change does that, by:
>>
>> 1) adding pg_sequence_state, that returns both the sequence state and
>>the page LSN
>>
>> 2) copy_sequence returns the page LSN
>>
>> 3) tablesync then sets this LSN as origin_startpos (which for tables is
>>just the LSN of the replication slot)
>>
>> AFAICS this makes it work - we start decoding at the page LSN, so that
>> we  skip the increments that could lead to the sequence going backwards.
>>
> 
> I like this design very much. It makes things simpler than complex.
> Thanks for doing this.
> 

I agree it seems simpler. It'd be good to try testing / reviewing it a
bit more, so that it doesn't misbehave in some way.
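The copy-then-catch-up scheme quoted above can be modeled in a few
lines. This is a sketch under the assumption that decoding simply
ignores sequence records below the start position; the LSN/value pairs
are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct
{
    XLogRecPtr lsn;       /* LSN of the WAL record for this increment */
    long long  value;     /* sequence value after the increment */
} SeqChange;

/*
 * Model of tablesync catch-up: start from the value copied by
 * copy_sequence() and apply only changes at or after the page LSN
 * returned with it (used as origin_startpos).  Increments before the
 * page LSN are skipped, so the sequence never goes backwards.
 */
static long long
catch_up(long long copied_value, XLogRecPtr origin_startpos,
         const SeqChange *wal, size_t n)
{
    long long v = copied_value;

    for (size_t i = 0; i < n; i++)
        if (wal[i].lsn >= origin_startpos)
            v = wal[i].value;
    return v;
}
```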

> I am wondering whether we could reuse pg_sequence_last_value() instead
> of adding a new function. But the name of that function doesn't leave
> much space for expanding its functionality, so we are good with a new
> one, perhaps with some code deduplication between the two.
> 

I don't think we should do that; the pg_sequence_last_value() function
is meant to do something different, and I don't think making it also do
what pg_sequence_state() does would make anything simpler.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Amit Kapila
On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra
 wrote:
>
> On 7/28/23 11:42, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> >  wrote:
> >>
> >> On 7/26/23 09:27, Amit Kapila wrote:
> >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila  
> >>> wrote:
> >>
> >> Anyway, I was thinking about this a bit more, and it seems it's not as
> >> difficult to use the page LSN to ensure sequences don't go backwards.
> >>
> >
> > While studying the changes for this proposal and related areas, I have
> > a few comments:
> > 1. I think you need to advance the origin if it is changed due to
> > copy_sequence(), otherwise, if the sync worker restarts after
> > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
> > value.
> >
>
> True, we want to restart at the new origin_startpos.
>
> > 2. Between the time of SYNCDONE and READY state, the patch can skip
> > applying non-transactional sequence changes even if it should apply
> > it. The reason is that during that state change
> > should_apply_changes_for_rel() decides whether to apply change based
> > on the value of remote_final_lsn which won't be set for
> > non-transactional change. I think we need to send the start LSN of a
> > non-transactional record and then use that as remote_final_lsn for
> > such a change.
>
> Good catch. remote_final_lsn is set in apply_handle_begin, but that
> won't happen for sequences. We're already sending the LSN, but
> logicalrep_read_sequence ignores it - it should be enough to add it to
> LogicalRepSequence and then set it in apply_handle_sequence().
>

As per my understanding, the LSN sent is the record's EndRecPtr, which
points to the beginning of the next record (i.e. current_record_end + 1).
For comparing the current record, we should use the record's start
position, as we do when using remote_final_lsn via apply_handle_begin().
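
A toy illustration of why the choice of LSN matters at the boundary
(made-up numbers; the EndRecPtr of record i is the start of record i+1):

```python
# Each record occupies [start, end); its end LSN is the next record's
# start (EndRecPtr). Filtering with "apply if lsn >= cutoff" classifies
# the boundary record differently depending on which LSN is compared.

records = [(100, 120), (120, 150), (150, 180)]   # (start, end), made up
cutoff = 120

by_start = [r for r in records if r[0] >= cutoff]
by_end = [r for r in records if r[1] >= cutoff]

print(len(by_start), len(by_end))   # 2 3 -- the record starting at 100
                                    # sneaks in when its EndRecPtr is used
```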

> >
> > 3. For non-transactional sequence change apply, we don't set
> > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
> > we are doing in apply_handle_commit_internal() before calling
> > CommitTransactionCommand(). So, that can lead to the origin moving
> > backwards after restart which will lead to requesting and applying the
> > same changes again and for that period of time sequence can go
> > backwards. This needs some more thought as to what is the correct
> > behaviour/solution for this.
> >
>
> I think saying "origin moves backwards" is a bit misleading. AFAICS the
> origin position is not actually moving backwards, it's more that we
> don't (and can't) move it forwards for each non-transactional change. So
> yeah, we may re-apply those, and IMHO that's expected - the sequence is
> allowed to be "ahead" on the subscriber.
>

But if this happens, then for a period of time the sequence will go
backwards relative to what one would have observed before the restart.

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Tomas Vondra



On 7/28/23 14:35, Ashutosh Bapat wrote:
> On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra
>  wrote:
>>
>> On 7/24/23 14:57, Ashutosh Bapat wrote:
>>> ...
>>>


 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
 I was thinking maybe we should have it in the transaction (because we
 need to do cleanup at the end). It seems a bit inconvenient, because then
 we'd need to either search htabs in all subxacts, or transfer the
 entries to the top-level xact (otoh, we already do that with snapshots),
 and cleanup on abort.

 What do you think?
>>>
>>> Hash table per transaction seems saner design. Adding it to the top
>>> level transaction should be fine. The entry will contain an XID
>>> anyway. If we add it to every subtransaction we will need to search
>>> hash table in each of the subtransactions when deciding whether a
>>> sequence change is transactional or not. Top transaction is a
>>> reasonable trade off.
>>>
>>
>> It's not clear to me what design you're proposing, exactly.
>>
>>> If we track it in top-level transactions, then we'd need to copy the data
>> whenever a transaction is assigned as a child, and perhaps also remove
>> it when there's a subxact abort.
> 
> I thought, esp. with your changes to assign xid, we will always know
> the top level transaction when a sequence is assigned a relfilenode.
> So the relfilenodes will always get added to the correct hash directly.
> I didn't imagine a case where we will need to copy the hash table from
> sub-transaction to top transaction. If that's true, yes it's
> inconvenient.
> 

Well, it's a matter of efficiency.

To check if a sequence change is transactional, we need to check if it's
for a relfilenode created in the current transaction (it can't be for a
relfilenode created in a concurrent top-level transaction, due to MVCC).

If you don't copy the entries into the top-level xact, you have to walk
all subxacts and search all of those, for each sequence change. And
there may be quite a few of both subxacts and sequence changes ...

I wonder whether we also need to search the other top-level xacts; we
probably do, because the change might come from a subxact that doesn't
have an assignment yet.
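
A conceptual sketch of the lookup being discussed (toy Python, not the
patch's actual C data structures; all names here are illustrative):

```python
# A sequence change is "transactional" iff its relfilenode was created
# by the current top-level transaction. Keeping one relfilenode set per
# top-level xact means a subxact must first be resolved to its parent --
# and a subxact without an assignment yet looks like a top-level xact,
# which is why the other "top-level" entries may need searching too.

class ReorderBuffer:
    def __init__(self):
        self.parent = {}            # subxact xid -> top-level xid
        self.relfilenodes = {}      # top-level xid -> set of new relfilenodes

    def assign(self, subxid, top_xid):
        self.parent[subxid] = top_xid

    def new_relfilenode(self, xid, node):
        top = self.parent.get(xid, xid)
        self.relfilenodes.setdefault(top, set()).add(node)

    def is_transactional(self, xid, node):
        top = self.parent.get(xid, xid)
        return node in self.relfilenodes.get(top, set())


rb = ReorderBuffer()
rb.assign(101, 100)            # subxact 101 belongs to top-level xact 100
rb.new_relfilenode(101, 5001)  # relfilenode created in the subxact
print(rb.is_transactional(101, 5001))  # True -- found via the parent
print(rb.is_transactional(200, 5001))  # False -- other xact, non-transactional
```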

> As to the abort, don't we already remove entries on subtxn abort?
> Having per transaction hash table doesn't seem to change anything
> much.
> 

What entries are we removing? My point is that if we copy the entries to
the top-level xact, we probably need to remove them on abort. Or we
could leave them in the top-level xact hash.

>>
>> And we'd need to still search the hashes in all toplevel transactions on
>> every sequence increment - in principle we can't have increment for a
>> sequence created in another in-progress transaction, but maybe it's just
>> not assigned yet.
> 
> We hold a strong lock on sequence when changing its relfilenode. The
> sequence whose relfilenode is being changed cannot be accessed by any
> concurrent transaction. So I am not able to understand what you are
> trying to say.
> 

How do you know the subxact has already been recognized as such? It may
be treated as a top-level transaction for a while, until the assignment.

> I think a per-(top-level)-transaction hash table is a cleaner design. It
> puts the hash table where it should be. But if that makes the code
> difficult, the current design works too.
> 


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Ashutosh Bapat
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
 wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
> The 0005 change does that, by:
>
> 1) adding pg_sequence_state, that returns both the sequence state and
>the page LSN
>
> 2) copy_sequence returns the page LSN
>
> 3) tablesync then sets this LSN as origin_startpos (which for tables is
>just the LSN of the replication slot)
>
> AFAICS this makes it work - we start decoding at the page LSN, so that
> we skip the increments that could lead to the sequence going backwards.
>

I like this design very much. It makes things simpler rather than more
complex. Thanks for doing this.

I am wondering whether we could reuse pg_sequence_last_value() instead
of adding a new function. But that function's name doesn't leave much
room for expanding its functionality, so we are good with a new one.
Perhaps some code could be deduplicated, though.

-- 
Best Wishes,
Ashutosh Bapat




Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Tomas Vondra
On 7/28/23 11:42, Amit Kapila wrote:
> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>  wrote:
>>
>> On 7/26/23 09:27, Amit Kapila wrote:
>>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila  wrote:
>>
>> Anyway, I was thinking about this a bit more, and it seems it's not as
>> difficult to use the page LSN to ensure sequences don't go backwards.
>>
> 
> While studying the changes for this proposal and related areas, I have
> a few comments:
> 1. I think you need to advance the origin if it is changed due to
> copy_sequence(), otherwise, if the sync worker restarts after
> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
> value.
> 

True, we want to restart at the new origin_startpos.

> 2. Between the time of SYNCDONE and READY state, the patch can skip
> applying non-transactional sequence changes even if it should apply
> it. The reason is that during that state change
> should_apply_changes_for_rel() decides whether to apply change based
> on the value of remote_final_lsn which won't be set for
> non-transactional change. I think we need to send the start LSN of a
> non-transactional record and then use that as remote_final_lsn for
> such a change.

Good catch. remote_final_lsn is set in apply_handle_begin, but that
won't happen for sequences. We're already sending the LSN, but
logicalrep_read_sequence ignores it - it should be enough to add it to
LogicalRepSequence and then set it in apply_handle_sequence().

> 
> 3. For non-transactional sequence change apply, we don't set
> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
> we are doing in apply_handle_commit_internal() before calling
> CommitTransactionCommand(). So, that can lead to the origin moving
> backwards after restart which will lead to requesting and applying the
> same changes again and for that period of time sequence can go
> backwards. This needs some more thought as to what is the correct
> behaviour/solution for this.
> 

I think saying "origin moves backwards" is a bit misleading. AFAICS the
origin position is not actually moving backwards, it's more that we
don't (and can't) move it forwards for each non-transactional change. So
yeah, we may re-apply those, and IMHO that's expected - the sequence is
allowed to be "ahead" on the subscriber.

I don't see a way to improve this, except maybe having a separate LSN
for non-transactional changes (for each origin).
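
Amit's restart scenario can be shown with toy numbers (a Python sketch
under made-up LSNs, not actual replication code):

```python
# Non-transactional sequence changes applied on the subscriber, with
# the origin position only advanced at commits. After a restart the
# worker resumes from the last flushed origin position and re-applies
# later changes, so the observed value temporarily moves backwards.

changes = [(100, 5), (110, 8), (120, 11)]   # (LSN, applied sequence value)
origin_flushed = 100                         # last commit's flushed position

before_restart = [v for _, v in changes]                      # [5, 8, 11]
replayed = [v for lsn, v in changes if lsn > origin_flushed]  # [8, 11]

observed = before_restart + replayed
print(observed)   # [5, 8, 11, 8, 11] -- dips from 11 back to 8
```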

> 4. BTW, while checking this behaviour, I noticed that the initial sync
> worker for sequence mentions the table in the LOG message: "LOG:
> logical replication table synchronization worker for subscription
> "mysub", table "s" has finished". Won't it be better here to refer to
> it as a sequence?
> 

Thanks, I'll fix that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Ashutosh Bapat
On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra
 wrote:
>
> On 7/24/23 14:57, Ashutosh Bapat wrote:
> > ...
> >
> >>
> >>
> >> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> >> I was thinking maybe we should have it in the transaction (because we
> >> need to do cleanup at the end). It seems a bit inconvenient, because then
> >> we'd need to either search htabs in all subxacts, or transfer the
> >> entries to the top-level xact (otoh, we already do that with snapshots),
> >> and cleanup on abort.
> >>
> >> What do you think?
> >
> > Hash table per transaction seems saner design. Adding it to the top
> > level transaction should be fine. The entry will contain an XID
> > anyway. If we add it to every subtransaction we will need to search
> > hash table in each of the subtransactions when deciding whether a
> > sequence change is transactional or not. Top transaction is a
> > reasonable trade off.
> >
>
> It's not clear to me what design you're proposing, exactly.
>
> If we track it in top-level transactions, then we'd need to copy the data
> whenever a transaction is assigned as a child, and perhaps also remove
> it when there's a subxact abort.

I thought, esp. with your changes to assign xid, we will always know
the top level transaction when a sequence is assigned a relfilenode.
So the relfilenodes will always get added to the correct hash directly.
I didn't imagine a case where we will need to copy the hash table from
sub-transaction to top transaction. If that's true, yes it's
inconvenient.

As to the abort, don't we already remove entries on subtxn abort?
Having per transaction hash table doesn't seem to change anything
much.

>
> And we'd need to still search the hashes in all toplevel transactions on
> every sequence increment - in principle we can't have increment for a
> sequence created in another in-progress transaction, but maybe it's just
> not assigned yet.

We hold a strong lock on sequence when changing its relfilenode. The
sequence whose relfilenode is being changed cannot be accessed by any
concurrent transaction. So I am not able to understand what you are
trying to say.

I think a per-(top-level)-transaction hash table is a cleaner design. It
puts the hash table where it should be. But if that makes the code
difficult, the current design works too.

-- 
Best Wishes,
Ashutosh Bapat



