Re: logical decoding and replication of sequences, take 2
Hi, Let me share a bit of an update regarding this patch and PG17. I have discussed this patch and how to move it forward with a couple of hackers (both within EDB and outside), and my takeaway is that the patch is not quite baked yet, not enough to make it into PG17 :-( There are two main reasons / concerns leading to this conclusion: * correctness of the decoding part There are (were) doubts about decoding during startup, before the snapshot gets consistent, when we can get "temporarily incorrect" decisions about whether a change is transactional. While the behavior is ultimately correct (we treat all changes as non-transactional and discard them), it seems "dirty" and it’s unclear to me if it might cause more serious issues down the line (not necessarily bugs, but perhaps making it harder to implement future changes). * handling of sequences in built-in replication Per the patch, sequences need to be added to the publication explicitly. But there were suggestions we might (should) add certain sequences automatically - e.g. sequences backing SERIAL/BIGSERIAL columns, etc. I’m not sure we really want to do that, and so far I assumed we would start with the manual approach and move to automatic addition in the future. But the agreement seems to be that it would be a pretty significant "breaking change", and something we probably don’t want to do. If someone has an opinion on either of the two issues (either way), I'd like to hear it. Obviously, I'm not particularly happy about this outcome. And I'm also somewhat cautious because this patch was already committed+reverted in the PG16 cycle, and doing the same thing in PG17 is not on my wish list. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas wrote: > > On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar wrote: > > So the problem is that we might consider the transaction change as > > non-transaction and mark this flag as true. > > But it's not "might" right? It's absolutely 100% certain that we will > consider that transaction's changes as non-transactional ... because > when we're in fast-forward mode, the table of new relfilenodes is not > built, and so whenever we check whether any transaction made a new > relfilenode for this sequence, the answer will be no. > > > But what would have > > happened if we would have identified it correctly as transactional? > > In such cases, we wouldn't have set this flag here but then we would > > have set this while processing the DecodeAbort/DecodeCommit, so the > > net effect would be the same no? You may question what if the > > Abort/Commit WAL never appears in the WAL, but this flag is > > specifically for the upgrade case, and in that case we have to do a > > clean shutdown so may not be an issue. But in the future, if we try > > to use 'ctx->processing_required' for something else where the clean > > shutdown is not guaranteed then this flag can be set incorrectly. > > > > I am not arguing that this is a perfect design but I am just making a > > point about why it would work. > > Even if this argument is correct (and I don't know if it is), the code > and comments need some updating. We should not be testing a flag that > is guaranteed false with comments that make it sound like the value of > the flag is trustworthy when it isn't. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar wrote: > So the problem is that we might consider the transaction change as > non-transaction and mark this flag as true. But it's not "might" right? It's absolutely 100% certain that we will consider that transaction's changes as non-transactional ... because when we're in fast-forward mode, the table of new relfilenodes is not built, and so whenever we check whether any transaction made a new relfilenode for this sequence, the answer will be no. > But what would have > happened if we would have identified it correctly as transactional? > In such cases, we wouldn't have set this flag here but then we would > have set this while processing the DecodeAbort/DecodeCommit, so the > net effect would be the same no? You may question what if the > Abort/Commit WAL never appears in the WAL, but this flag is > specifically for the upgrade case, and in that case we have to do a > clean shutdown so may not be an issue. But in the future, if we try > to use 'ctx->processing_required' for something else where the clean > shutdown is not guaranteed then this flag can be set incorrectly. > > I am not arguing that this is a perfect design but I am just making a > point about why it would work. Even if this argument is correct (and I don't know if it is), the code and comments need some updating. We should not be testing a flag that is guaranteed false with comments that make it sound like the value of the flag is trustworthy when it isn't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas wrote: > > On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar wrote: > > > But I am wondering why this flag is always set to true in > > > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the > > > aborted transactions are not supposed to be replayed? So if my > > > observation is correct that for the aborted transaction, this > > > shouldn't be set to true then we have a problem with sequence where we > > > are identifying the transactional changes as non-transaction changes > > > because now for transactional changes this should depend upon commit > > > status. > > > > I have checked this case with Amit Kapila. So it seems in the cases > > where we have sent the prepared transaction or streamed in-progress > > transaction we would need to send the abort also, and for that reason, > > we are setting 'ctx->processing_required' as true so that if these > > WALs are not streamed we do not allow upgrade of such slots. > > I don't find this explanation clear enough for me to understand. An explanation of why we set 'ctx->processing_required' to true from DecodeCommit as well as DecodeAbort: -- For upgrading logical replication slots, it's essential to ensure these slots are completely synchronized with the subscriber. To identify that, we process all the pending WAL in 'fast_forward' mode and check whether there is any decodable WAL. So, in short, any WAL type that we stream to the standby in normal mode (no fast_forward mode) is considered decodable, and so is the abort WAL. That's the reason why, at the end of the transaction commit/abort, we need to set 'ctx->processing_required' to true, i.e. some decodable WAL exists, so we cannot upgrade this slot. Why is the below check safe? > + if (ctx->fast_forward) > + { > + /* > + * We need to set processing_required flag to notify the sequence > + * change existence to the caller. 
Usually, the flag is set when > + * either the COMMIT or ABORT records are decoded, but this must be > + * turned on here because the non-transactional logical message is > + * decoded without waiting for these records. > + */ > + if (!transactional) > + ctx->processing_required = true; > + > + return; > + } So the problem is that we might consider a transactional change as non-transactional and mark this flag as true. But what would have happened if we had identified it correctly as transactional? In such cases, we wouldn't have set this flag here, but then we would have set it while processing DecodeAbort/DecodeCommit, so the net effect would be the same, no? You may ask what happens if the Abort/Commit record never appears in the WAL, but this flag is specifically for the upgrade case, and in that case we have to do a clean shutdown, so it may not be an issue. But in the future, if we try to use 'ctx->processing_required' for something else where a clean shutdown is not guaranteed, then this flag can be set incorrectly. I am not arguing that this is a perfect design, but I am just making a point about why it would work. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
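Dilip's argument above can be sketched as a toy simulation (the record-stream model and all names below are mine, not PostgreSQL's): whichever way the sequence change is classified, processing_required ends up true as long as the commit/abort record is eventually decoded, and the two paths diverge only if the WAL ends before that record, i.e. the clean-shutdown caveat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Record types in a toy WAL stream; names are illustrative only. */
typedef enum { REC_SEQ_CHANGE, REC_COMMIT, REC_ABORT } RecType;

/*
 * Decode a toy record stream and report whether processing_required ends
 * up set.  With 'misclassify' true, every sequence change is treated as
 * non-transactional (mimicking fast-forward mode, where the relfilenode
 * table is not built) and the flag is set immediately; with it false,
 * the flag is set only when the COMMIT/ABORT record is decoded.
 */
bool decode_stream(const RecType *recs, size_t n, bool misclassify)
{
    bool processing_required = false;

    for (size_t i = 0; i < n; i++)
    {
        switch (recs[i])
        {
            case REC_SEQ_CHANGE:
                if (misclassify)        /* seq_decode() fast-forward branch */
                    processing_required = true;
                break;
            case REC_COMMIT:
            case REC_ABORT:             /* DecodeTXNNeedSkip() path */
                processing_required = true;
                break;
        }
    }
    return processing_required;
}

/* If the WAL ends with the commit/abort record, both paths agree. */
bool net_effect_equal_with_abort(void)
{
    const RecType recs[] = {REC_SEQ_CHANGE, REC_ABORT};
    return decode_stream(recs, 2, true) == decode_stream(recs, 2, false);
}

/* If the commit/abort never appears, the two paths diverge -- the
 * caveat about relying on a clean shutdown. */
bool truncated_wal_diverges(void)
{
    const RecType recs[] = {REC_SEQ_CHANGE};
    return decode_stream(recs, 1, true) && !decode_stream(recs, 1, false);
}
```

The toy divergence case is exactly the future-use concern: a caller that cannot rely on seeing the transaction end record would get a flag that was set for the wrong reason.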
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar wrote: > > But I am wondering why this flag is always set to true in > > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the > > aborted transactions are not supposed to be replayed? So if my > > observation is correct that for the aborted transaction, this > > shouldn't be set to true then we have a problem with sequence where we > > are identifying the transactional changes as non-transaction changes > > because now for transactional changes this should depend upon commit > > status. > > I have checked this case with Amit Kapila. So it seems in the cases > where we have sent the prepared transaction or streamed in-progress > transaction we would need to send the abort also, and for that reason, > we are setting 'ctx->processing_required' as true so that if these > WALs are not streamed we do not allow upgrade of such slots. I don't find this explanation clear enough for me to understand. -- Robert Haas EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar wrote: > > On Tue, Feb 20, 2024 at 3:38 PM Robert Haas wrote: > > > Let's say fast_forward is true. Then smgr_decode() is going to skip > > recording anything about the relfilenode, so we'll identify all > > sequence changes as non-transactional. But look at how this case is > > handled in seq_decode(): > > > > + if (ctx->fast_forward) > > + { > > + /* > > + * We need to set processing_required flag to notify the sequence > > + * change existence to the caller. Usually, the flag is set when > > + * either the COMMIT or ABORT records are decoded, but this must be > > + * turned on here because the non-transactional logical message is > > + * decoded without waiting for these records. > > + */ > > + if (!transactional) > > + ctx->processing_required = true; > > + > > + return; > > + } > > It appears that the 'processing_required' flag was introduced as part > of supporting upgrades for logical replication slots. Its purpose is > to determine whether a slot is fully caught up, meaning that there are > no pending decodable changes left before it can be upgraded. > > So now if some change was transactional but we have identified it as > non-transaction then we will mark this flag 'ctx->processing_required > = true;' so we temporarily set this flag incorrectly, but even if the > flag would have been correctly identified initially, it would have > been set again to true in the DecodeTXNNeedSkip() function regardless > of whether the transaction is committed or aborted. As a result, the > flag would eventually be set to 'true', and the behavior would align > with the intended logic. > > But I am wondering why this flag is always set to true in > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the > aborted transactions are not supposed to be replayed? 
So if my > observation is correct that for the aborted transaction, this > shouldn't be set to true then we have a problem with sequence where we > are identifying the transactional changes as non-transaction changes > because now for transactional changes this should depend upon commit > status. I have checked this case with Amit Kapila. So it seems that in cases where we have sent a prepared transaction or streamed an in-progress transaction, we would need to send the abort as well, and for that reason we set 'ctx->processing_required' to true so that if these WALs are not streamed we do not allow the upgrade of such slots. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra wrote: > > On 2/13/24 17:37, Robert Haas wrote: > > > In other words, the fact that some sequence changes are > > non-transactional creates ordering hazards that don't exist if there > > are no non-transactional changes. So in that way, sequences are > > different from table modifications, where applying the transactions in > > order of commit is all we need to do. Here we need to apply the > > transactions in order of commit and also apply the non-transactional > > changes at the right point in the sequence. Consider the following > > alternative apply sequence: > > > > 1. T1. > > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and > > the subsequent nextval) > > 3. T3's nextval > > 4. T2's first nextval > > > > That's still in commit order. It's also wrong. > > > > Yes, this would be wrong. Thankfully the apply is not allowed to reorder > the changes like this, because that's not what "non-transactional" means > in this context. > > It does not mean we can arbitrarily reorder the changes, it only means > the changes are applied as if they were independent transactions (but in > the same order as they were executed originally). > In this regard, I have another scenario in mind where the apply order could be different for changes in the same transaction. For example: Transaction T1 Begin; Insert .. Insert .. nextval .. --consider this generates WAL .. Insert .. nextval .. --consider this generates WAL In this case, if the nextval operations are applied in a different order (i.e. before the Inserts), then there could be some inconsistency. Say it doesn't follow the above order during apply; then a trigger fired on both pub and sub for each row insert that refers to the current sequence value to make some decision could behave differently on the publisher and subscriber. If this is not how the patch will behave then fine, but otherwise, isn't this something that we should be worried about? 
-- With Regards, Amit Kapila.
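A tiny model of this ordering concern (purely illustrative; the event list and helper names are mine): replaying the changes in their original order preserves the sequence value a per-row trigger would observe at each insert, while hoisting the nextval calls ahead of the inserts changes what the trigger sees.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { EV_INSERT, EV_NEXTVAL } EvType;

/*
 * Replay a change stream and record the sequence value visible at each
 * INSERT -- i.e. what a trigger reading the current sequence value would
 * see when the row is applied.
 */
size_t replay(const EvType *evs, size_t n, long *seen)
{
    long seqval = 0;
    size_t nseen = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (evs[i] == EV_NEXTVAL)
            seqval++;                 /* apply the sequence change */
        else
            seen[nseen++] = seqval;   /* trigger observes current value */
    }
    return nseen;
}

/* Origin order from the example: Insert, Insert, nextval, Insert, nextval. */
bool reordering_changes_trigger_view(void)
{
    const EvType origin[]    = {EV_INSERT, EV_INSERT, EV_NEXTVAL,
                                EV_INSERT, EV_NEXTVAL};
    const EvType reordered[] = {EV_NEXTVAL, EV_NEXTVAL, EV_INSERT,
                                EV_INSERT, EV_INSERT};
    long a[5], b[5];

    replay(origin, 5, a);
    replay(reordered, 5, b);

    /* in order, triggers see 0, 0, 1; nextvals-first, they see 2, 2, 2 */
    return a[0] == 0 && a[2] == 1 && b[0] == 2;
}
```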
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 5:39 PM Tomas Vondra wrote: > > On 2/20/24 06:54, Amit Kapila wrote: > > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra > > wrote: > >> > >> On 12/19/23 13:54, Christophe Pettus wrote: > >>> Hi, > >>> > >>> I wanted to hop in here on one particular issue: > >>> > On Dec 12, 2023, at 02:01, Tomas Vondra > wrote: > - desirability of the feature: Random IDs (UUIDs etc.) are likely a much > better solution for distributed (esp. active-active) systems. But there > are important use cases that are likely to keep using regular sequences > (online upgrades of single-node instances, existing systems, ...). > >>> > >>> +1. > >>> > >>> Right now, the lack of sequence replication is a rather large > >>> foot-gun on logical replication upgrades. Copying the sequences > >>> over during the cutover period is doable, of course, but: > >>> > >>> (a) There's no out-of-the-box tooling that does it, so everyone has > >>> to write some scripts just for that one function. > >>> > >>> (b) It's one more thing that extends the cutover window. > >>> > >> > >> I agree it's an annoying gap for this use case. But if this is the only > >> use cases, maybe a better solution would be to provide such tooling > >> instead of adding it to the logical decoding? > >> > >> It might seem a bit strange if most data is copied by replication > >> directly, while sequences need special handling, ofc. > >> > > > > One difference between the logical replication of tables and sequences > > is that we can guarantee with synchronous_commit (and > > synchronous_standby_names) that after failover transactions data is > > replicated or not whereas for sequences we can't guarantee that > > because of their non-transactional nature. Say, there are two > > transactions T1 and T2, it is possible that T1's entire table data and > > sequence data are committed and replicated but T2's sequence data is > > replicated. 
So, after failover to logical subscriber in such a case if > > one routes T2 again to the new node as it was not successful > > previously then it would needlessly perform the sequence changes > > again. I don't how much that matters but that would probably be the > > difference between the replication of tables and sequences. > > > > I don't quite follow what the problem with synchronous_commit is :-( > > For sequences, we log the changes ahead, i.e. even if nextval() did not > write anything into WAL, it's still safe because these changes are > covered by the WAL generated some time ago (up to ~32 values back). And > that's certainly subject to synchronous_commit, right? > > There certainly are issues with sequences and syncrep: > > https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9...@enterprisedb.com > > but that's unrelated to logical replication. > > FWIW I don't think we'd re-apply sequence changes needlessly, because > the worker does update the origin after applying non-transactional > changes. So after the replication gets restarted, we'd skip what we > already applied, no? > It will work for restarts, but I was trying to discuss what happens in the scenario where the publisher node goes down and we fail over to the subscriber node and make it the primary node (a failover case). After that, all unfinished transactions will be re-routed to the new primary. Consider a theoretical case where we send sequence changes of yet-uncommitted transactions directly from WAL buffers (something like 91f2cae7a4 does for physical replication) and then immediately the primary or publisher node crashes. After failover to the subscriber node, the application will re-route unfinished transactions to the new primary. In such a situation, I think there is a chance that we will update the sequence value even though the new primary has already received/applied that update via replication. 
This is what I meant when I said there is probably a difference between tables and sequences: for tables, such a replicated change would be rolled back. Having said that, this is probably no different from what would happen in the case of physical replication. > But maybe there is an issue and I'm just not getting it. Could you maybe > share an example of T1/T2, with a replication restart and what you think > would happen? > > > I agree with your point above that for upgrades some tool like > > pg_copysequence where we can provide a way to copy sequence data to > > subscribers from the publisher would suffice the need. > > > > Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet > another tool users would need to use. > But can a logical replica be used for failover? We don't have any way to replicate/sync the slots on subscribers, nor do we have a mechanism to replicate existing publications. I think if we want to achieve failover to a logical subscriber we need to replicate/sync the required logical and physical slots to the subscribers. I haven't thought through it completely so there
Re: logical decoding and replication of sequences, take 2
On 2/20/24 06:54, Amit Kapila wrote: > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra > wrote: >> >> On 12/19/23 13:54, Christophe Pettus wrote: >>> Hi, >>> >>> I wanted to hop in here on one particular issue: >>> On Dec 12, 2023, at 02:01, Tomas Vondra wrote: - desirability of the feature: Random IDs (UUIDs etc.) are likely a much better solution for distributed (esp. active-active) systems. But there are important use cases that are likely to keep using regular sequences (online upgrades of single-node instances, existing systems, ...). >>> >>> +1. >>> >>> Right now, the lack of sequence replication is a rather large >>> foot-gun on logical replication upgrades. Copying the sequences >>> over during the cutover period is doable, of course, but: >>> >>> (a) There's no out-of-the-box tooling that does it, so everyone has >>> to write some scripts just for that one function. >>> >>> (b) It's one more thing that extends the cutover window. >>> >> >> I agree it's an annoying gap for this use case. But if this is the only >> use cases, maybe a better solution would be to provide such tooling >> instead of adding it to the logical decoding? >> >> It might seem a bit strange if most data is copied by replication >> directly, while sequences need special handling, ofc. >> > > One difference between the logical replication of tables and sequences > is that we can guarantee with synchronous_commit (and > synchronous_standby_names) that after failover transactions data is > replicated or not whereas for sequences we can't guarantee that > because of their non-transactional nature. Say, there are two > transactions T1 and T2, it is possible that T1's entire table data and > sequence data are committed and replicated but T2's sequence data is > replicated. So, after failover to logical subscriber in such a case if > one routes T2 again to the new node as it was not successful > previously then it would needlessly perform the sequence changes > again. 
I don't how much that matters but that would probably be the > difference between the replication of tables and sequences. > I don't quite follow what the problem with synchronous_commit is :-( For sequences, we log the changes ahead, i.e. even if nextval() did not write anything into WAL, it's still safe because these changes are covered by the WAL generated some time ago (up to ~32 values back). And that's certainly subject to synchronous_commit, right? There certainly are issues with sequences and syncrep: https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9...@enterprisedb.com but that's unrelated to logical replication. FWIW I don't think we'd re-apply sequence changes needlessly, because the worker does update the origin after applying non-transactional changes. So after the replication gets restarted, we'd skip what we already applied, no? But maybe there is an issue and I'm just not getting it. Could you maybe share an example of T1/T2, with a replication restart and what you think would happen? > I agree with your point above that for upgrades some tool like > pg_copysequence where we can provide a way to copy sequence data to > subscribers from the publisher would suffice the need. > Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet another tool users would need to use. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
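The "log the changes ahead" behavior can be sketched roughly like this (a deliberately simplified model; PostgreSQL's actual logic in sequence.c differs in details such as exactly when the record is emitted, though SEQ_LOG_VALS is indeed 32). The point is that the last WAL record always covers a value at or beyond anything nextval() has handed out, so crash recovery can never re-issue a returned value:

```c
#include <assert.h>
#include <stdbool.h>

#define SEQ_LOG_VALS 32      /* values pre-logged ahead, as in sequence.c */

typedef struct
{
    long last_value;         /* last value handed out */
    long log_cnt;            /* values still covered by the last WAL record */
    long wal_value;          /* value the last WAL record would restore */
} ToySeq;

/* Hand out the next value, emitting a WAL record covering the next
 * SEQ_LOG_VALS values whenever the previous record is used up. */
long toy_nextval(ToySeq *s)
{
    if (s->log_cnt == 0)
    {
        s->wal_value = s->last_value + 1 + SEQ_LOG_VALS;
        s->log_cnt = SEQ_LOG_VALS + 1;
    }
    s->log_cnt--;
    return ++s->last_value;
}

/* After a crash, replay restores the sequence to wal_value, so no value
 * that was already returned can ever be handed out again. */
bool recovery_covers_all_returned_values(int ncalls)
{
    ToySeq s = {0, 0, 0};
    long last = 0;

    for (int i = 0; i < ncalls; i++)
        last = toy_nextval(&s);
    return s.wal_value >= last;
}
```

This is also why a decoded sequence value can run up to ~32 ahead of what nextval() actually returned, as mentioned above.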
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 3:38 PM Robert Haas wrote: > Let's say fast_forward is true. Then smgr_decode() is going to skip > recording anything about the relfilenode, so we'll identify all > sequence changes as non-transactional. But look at how this case is > handled in seq_decode(): > > + if (ctx->fast_forward) > + { > + /* > + * We need to set processing_required flag to notify the sequence > + * change existence to the caller. Usually, the flag is set when > + * either the COMMIT or ABORT records are decoded, but this must be > + * turned on here because the non-transactional logical message is > + * decoded without waiting for these records. > + */ > + if (!transactional) > + ctx->processing_required = true; > + > + return; > + } It appears that the 'processing_required' flag was introduced as part of supporting upgrades for logical replication slots. Its purpose is to determine whether a slot is fully caught up, meaning that there are no pending decodable changes left before it can be upgraded. So now if some change was transactional but we have identified it as non-transactional then we will mark this flag 'ctx->processing_required = true;' so we temporarily set this flag incorrectly, but even if the change had been correctly identified initially, the flag would have been set to true again in the DecodeTXNNeedSkip() function regardless of whether the transaction committed or aborted. As a result, the flag would eventually be set to 'true', and the behavior would align with the intended logic. But I am wondering why this flag is always set to true in DecodeTXNNeedSkip() irrespective of commit or abort. Is it because aborted transactions are not supposed to be replayed? So if my observation is correct that for an aborted transaction this shouldn't be set to true, then we have a problem with sequences, where we are identifying transactional changes as non-transactional changes, because for transactional changes this should depend upon the commit status. 
On another thought, can there be a situation where we have wrongly identified a change as non-transactional and set this flag, and the commit/abort record never appeared in the WAL so it was never decoded? That could also lead to an incorrect decision during the upgrade. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 1:32 PM Dilip Kumar wrote: > You might be interested in more detail [1] where I first reported this > problem and also [2] where we concluded why this is not creating a > real problem. > > [1] > https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com > [2] > https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com Thanks. Dilip and I just spent a lot of time talking this through on a call. One of the key bits of logic is here: + /* Skip the change if already processed (per the snapshot). */ + if (transactional && + !SnapBuildProcessChange(builder, xid, buf->origptr)) + return; + else if (!transactional && + (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT || + SnapBuildXactNeedsSkip(builder, buf->origptr))) + return; As a stylistic note, I think this would be more clear if it were written if (transactional) { if (!SnapBuildProcessChange()) return; } else { if (something else) return; }. Now, on to correctness. It's possible for us to identify a transactional change as non-transactional if smgr_decode() was called for the relfilenode before SNAPBUILD_FULL_SNAPSHOT was reached. In that case, if !SnapBuildProcessChange() would have been true, then we need SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT || SnapBuildXactNeedsSkip(builder, buf->origptr) to also be true. Otherwise, we'll process this change when we wouldn't have otherwise. But Dilip made an argument to me about this which seems correct to me. snapbuild.h says that SNAPBUILD_CONSISTENT is reached only when we find a point where any transaction that was running at the time we reached SNAPBUILD_FULL_SNAPSHOT has finished. 
So if this transaction is one for which we incorrectly identified the sequence change as non-transactional, then we cannot be in the SNAPBUILD_CONSISTENT state yet, so SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT will be true, and hence the whole "or" condition will be true and we'll return. So far, so good. I think, anyway. I haven't comprehensively verified that the comment in snapbuild.h accurately reflects what the code actually does. But if it does, then presumably we shouldn't see a record for which we might have mistakenly identified a change as non-transactional after reaching SNAPBUILD_CONSISTENT, which seems to be good enough to guarantee that the mistake won't matter. However, the logic in smgr_decode() doesn't only care about the snapshot state. It also cares about the fast-forward flag: + if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT || + ctx->fast_forward) + return; Let's say fast_forward is true. Then smgr_decode() is going to skip recording anything about the relfilenode, so we'll identify all sequence changes as non-transactional. But look at how this case is handled in seq_decode(): + if (ctx->fast_forward) + { + /* + * We need to set processing_required flag to notify the sequence + * change existence to the caller. Usually, the flag is set when + * either the COMMIT or ABORT records are decoded, but this must be + * turned on here because the non-transactional logical message is + * decoded without waiting for these records. + */ + if (!transactional) + ctx->processing_required = true; + + return; + } This seems suspicious. Why are we testing the transactional flag here if it's guaranteed to be false? My guess is that the person who wrote this code thought that the flag would be accurate even in this case, but that doesn't seem to be true. So this case probably needs some more thought. 
It's definitely not great that this logic is so complicated; it's really hard to verify that all the tests match up well enough to keep us out of trouble. -- Robert Haas EDB: http://www.enterprisedb.com
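As a sanity check on the stylistic suggestion above, the flattened condition from the patch hunk and the nested if/else form can be shown equivalent by brute force. This is a sketch, not the real code: the boolean parameters stand in for the results of SnapBuildProcessChange(), the SNAPBUILD_CONSISTENT test, and SnapBuildXactNeedsSkip().

```c
#include <assert.h>
#include <stdbool.h>

/* Flattened form, as in the patch hunk quoted above. */
bool skip_flat(bool transactional, bool proc_change,
               bool consistent, bool needs_skip)
{
    if (transactional && !proc_change)
        return true;
    else if (!transactional && (!consistent || needs_skip))
        return true;
    return false;
}

/* Nested form, as suggested: branch on 'transactional' first. */
bool skip_nested(bool transactional, bool proc_change,
                 bool consistent, bool needs_skip)
{
    if (transactional)
    {
        if (!proc_change)
            return true;
    }
    else
    {
        if (!consistent || needs_skip)
            return true;
    }
    return false;
}

/* Exhaustively compare the two forms over all 16 input combinations. */
bool forms_equivalent(void)
{
    for (int m = 0; m < 16; m++)
    {
        bool t = (m & 1) != 0, p = (m & 2) != 0;
        bool c = (m & 4) != 0, s = (m & 8) != 0;

        if (skip_flat(t, p, c, s) != skip_nested(t, p, c, s))
            return false;
    }
    return true;
}
```

The nested form also makes it obvious that only one of the two tests can ever apply to a given change, which is part of what makes the flat version hard to read.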
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 10:30 AM Robert Haas wrote: > > Is the rule that changes are transactional if and only if the current > transaction has assigned a new relfilenode to the sequence? Yes, that's the rule. > Why does the logic get confused if the state of the snapshot changes? The rule doesn't change, but the way this identification is implemented in the decoding gets confused and treats a transactional change as non-transactional. The identification of whether the sequence change is transactional or not is based on what WAL we have decoded from the particular transaction, and whether we decode a particular WAL record depends upon the snapshot state (it's about what we decode, not necessarily what we send). So if the snapshot state changed mid-transaction, that means we haven't decoded the WAL which created the new relfilenode, but we will decode the WAL which is operating on the sequence. So here we will assume the change is non-transactional whereas it was transactional, because we did not decode some of the changes of the transaction that we rely on for identifying whether it is transactional or not. > My naive reaction is that it kinda sounds like you're relying on two > different mistakes cancelling each other out, and that might be a bad > idea, because maybe there's some situation where they don't. But I > don't understand the issue well enough to have an educated opinion at > this point. I would say the first one is a mistake in identifying a transactional change as non-transactional, and that mistake happens only when we decode the transaction partially. But we never stream partially decoded transactions downstream, which means even though we have made a mistake in decoding it, we are not streaming it, so our mistake does not turn into a real problem. But again, I agree there is a temporary wrong decision, and if we try to do something else based on this decision then it could be an issue. 
You might be interested in more detail [1] where I first reported this problem and also [2] where we concluded why this is not creating a real problem. [1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
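The "two mistakes cancelling out" can be made concrete with a simplified model (the types and functions below are mine; they only mimic the shape of smgr_decode()/seq_decode(), not the real implementation): a relfilenode creation missed before the full snapshot leads to a misclassification, but while that transaction is still running the snapshot cannot yet be consistent, so the misclassified change is discarded anyway.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SNAP_BUILDING, SNAP_FULL, SNAP_CONSISTENT } SnapState;

typedef struct
{
    bool relfilenode_recorded;  /* did we track the new relfilenode? */
} ToyXact;

/* Mimics smgr_decode(): the new relfilenode is tracked only once a full
 * snapshot exists. */
void toy_decode_smgr_create(ToyXact *xact, SnapState state)
{
    if (state >= SNAP_FULL)
        xact->relfilenode_recorded = true;
}

/* Mimics seq_decode(): classify the change, then decide whether it is
 * processed.  Returns true if the change would be streamed. */
bool toy_decode_seq_change(const ToyXact *xact, SnapState state)
{
    bool transactional = xact->relfilenode_recorded;

    if (!transactional && state != SNAP_CONSISTENT)
        return false;           /* non-transactional change is discarded */
    return true;
}

/* Relfilenode creation decoded before the full snapshot: the change is
 * misclassified as non-transactional, but since that xact was running
 * before SNAP_FULL, the state cannot be SNAP_CONSISTENT yet while it is
 * still in progress, so the change is (correctly) discarded. */
bool misclassified_change_is_discarded(void)
{
    ToyXact xact = { false };

    toy_decode_smgr_create(&xact, SNAP_BUILDING);   /* missed */
    return !toy_decode_seq_change(&xact, SNAP_FULL);
}

/* Once decoding happens with a full snapshot, classification is correct
 * and the transactional change is kept. */
bool correct_classification_is_kept(void)
{
    ToyXact xact = { false };

    toy_decode_smgr_create(&xact, SNAP_CONSISTENT);
    return toy_decode_seq_change(&xact, SNAP_CONSISTENT);
}
```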
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra wrote: > > On 12/19/23 13:54, Christophe Pettus wrote: > > Hi, > > > > I wanted to hop in here on one particular issue: > > > >> On Dec 12, 2023, at 02:01, Tomas Vondra > >> wrote: > >> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much > >> better solution for distributed (esp. active-active) systems. But there > >> are important use cases that are likely to keep using regular sequences > >> (online upgrades of single-node instances, existing systems, ...). > > > > +1. > > > > Right now, the lack of sequence replication is a rather large > > foot-gun on logical replication upgrades. Copying the sequences > > over during the cutover period is doable, of course, but: > > > > (a) There's no out-of-the-box tooling that does it, so everyone has > > to write some scripts just for that one function. > > > > (b) It's one more thing that extends the cutover window. > > > > I agree it's an annoying gap for this use case. But if this is the only > use cases, maybe a better solution would be to provide such tooling > instead of adding it to the logical decoding? > > It might seem a bit strange if most data is copied by replication > directly, while sequences need special handling, ofc. > One difference between the logical replication of tables and sequences is that we can guarantee with synchronous_commit (and synchronous_standby_names) that after a failover a transaction's data is either replicated or not, whereas for sequences we can't guarantee that because of their non-transactional nature. Say there are two transactions T1 and T2; it is possible that T1's entire table data and sequence data are committed and replicated but only T2's sequence data is replicated. So, after failover to the logical subscriber in such a case, if one routes T2 again to the new node as it was not successful previously, then it would needlessly perform the sequence changes again. 
I don't know how much that matters, but that is probably the key difference between the replication of tables and sequences. I agree with your point above that for upgrades some tool like pg_copysequence, providing a way to copy sequence data from the publisher to subscribers, would suffice. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Fri, Feb 16, 2024 at 1:57 AM Tomas Vondra wrote: > For me, the part that I feel most uneasy about is the decoding while the > snapshot is still being built (and can flip to consistent snapshot > between the relfilenode creation and sequence change, confusing the > logic that decides which changes are transactional). > > It seems "a bit weird" that we either keep the "simple" logic that may > end up with an incorrect "non-transactional" result, but happens to then > work fine because we immediately discard the change. > > But it still feels better than the alternative, which requires us to > start decoding stuff (relfilenode creation) before the snapshot being > built is consistent, which we didn't do before - or at least not in > this particular way. While I don't have a practical example where it > would cause trouble now, I have a nagging feeling it might easily cause > trouble in the future by making some new features harder to implement. I don't understand the issues here well enough to comment. Is there a good write-up someplace I can read to understand the design here? Is the rule that changes are transactional if and only if the current transaction has assigned a new relfilenode to the sequence? Why does the logic get confused if the state of the snapshot changes? My naive reaction is that it kinda sounds like you're relying on two different mistakes cancelling each other out, and that might be a bad idea, because maybe there's some situation where they don't. But I don't understand the issue well enough to have an educated opinion at this point. -- Robert Haas EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On 2/15/24 05:16, Robert Haas wrote: > On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra > wrote: >> The way I think about non-transactional sequence changes is as if they >> were tiny transactions that happen "fully" (including commit) at the LSN >> where the sequence change is logged. > > 100% this. > >> It does not mean we can arbitrarily reorder the changes, it only means >> the changes are applied as if they were independent transactions (but in >> the same order as they were executed originally). Both with respect to >> the other non-transactional changes, and to "commits" of other stuff. > > Right, this is very important and I agree completely. > > I'm feeling more confident about this now that I heard you say that > stuff -- this is really the key issue I've been worried about since I > first looked at this, and I wasn't sure that you were in agreement, > but it sounds like you are. I think we should (a) fix the locking bug > I found (but that can be independent of this patch) and (b) make sure > that this patch documents the points from the quoted material above so > that everyone who reads the code (and maybe tries to enhance it) is > clear on what the assumptions are. > > (I haven't checked whether it documents that stuff or not. I'm just > saying it should, because I think it's a subtlety that someone might > miss.) > Thanks for thinking about these issues with reordering events. Good that we seem to be in agreement and that you feel more confident about this. I'll check if there's a good place to document this. For me, the part that I feel most uneasy about is the decoding while the snapshot is still being built (and can flip to consistent snapshot between the relfilenode creation and sequence change, confusing the logic that decides which changes are transactional). It seems "a bit weird" that we either keep the "simple" logic that may end up with an incorrect "non-transactional" result, but happens to then work fine because we immediately discard the change.
But it still feels better than the alternative, which requires us to start decoding stuff (relfilenode creation) before the snapshot being built is consistent, which we didn't do before - or at least not in this particular way. While I don't have a practical example where it would cause trouble now, I have a nagging feeling it might easily cause trouble in the future by making some new features harder to implement. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra wrote: > The way I think about non-transactional sequence changes is as if they > were tiny transactions that happen "fully" (including commit) at the LSN > where the sequence change is logged. 100% this. > It does not mean we can arbitrarily reorder the changes, it only means > the changes are applied as if they were independent transactions (but in > the same order as they were executed originally). Both with respect to > the other non-transactional changes, and to "commits" of other stuff. Right, this is very important and I agree completely. I'm feeling more confident about this now that I heard you say that stuff -- this is really the key issue I've been worried about since I first looked at this, and I wasn't sure that you were in agreement, but it sounds like you are. I think we should (a) fix the locking bug I found (but that can be independent of this patch) and (b) make sure that this patch documents the points from the quoted material above so that everyone who reads the code (and maybe tries to enhance it) is clear on what the assumptions are. (I haven't checked whether it documents that stuff or not. I'm just saying it should, because I think it's a subtlety that someone might miss.) -- Robert Haas EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On 2/13/24 17:37, Robert Haas wrote: > On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra > wrote: >> Right, locks + apply in commit order gives us this guarantee (I can't >> think of a case where it wouldn't be the case). > > I couldn't find any cases of inadequate locking other than the one I > mentioned. > >> Doesn't the whole logical replication critically depend on the commit >> order? If you decide to arbitrarily reorder/delay the transactions, all >> kinds of really bad things can happen. That's a generic problem, it >> applies to all kinds of objects, not just sequences - a parallel apply >> would need to detect this sort of dependencies (e.g. INSERT + DELETE of >> the same key), and do something about it. > > Yes, but here I'm not just talking about the commit order. I'm talking > about the order of applying non-transactional operations relative to > commits. > > Consider: > > T1: CREATE SEQUENCE s; > T2: BEGIN; > T2: SELECT nextval('s'); > T3: SELECT nextval('s'); > T2: ALTER SEQUENCE s INCREMENT 2; > T2: SELECT nextval('s'); > T2: COMMIT; > It's not clear to me if you're talking about nextval() that happens to generate WAL, or nextval() covered by WAL generated by a previous call. I'm going to assume it's the former, i.e. nextval() that generated WAL describing the *next* sequence chunk, because without WAL there's nothing to apply and therefore no issue with T3 ordering. The way I think about non-transactional sequence changes is as if they were tiny transactions that happen "fully" (including commit) at the LSN where the sequence change is logged. > The commit order is T1 < T3 < T2, but T3 makes no transactional > changes, so the commit order is really just T1 < T2. But it's > completely wrong to say that all we need to do is apply T1 before we > apply T2. The correct order of application is: > > 1. T1. > 2. T2's first nextval > 3. T3's nextval > 4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and > the subsequent nextval) > Is that quite true?
If T3 generated WAL (for the nextval call), it will be applied at that particular LSN. AFAIK that guarantees it happens after the first T2 change (which is also non-transactional) and before the transactional T2 change (because that creates a new relfilenode). > In other words, the fact that some sequence changes are > non-transactional creates ordering hazards that don't exist if there > are no non-transactional changes. So in that way, sequences are > different from table modifications, where applying the transactions in > order of commit is all we need to do. Here we need to apply the > transactions in order of commit and also apply the non-transactional > changes at the right point in the sequence. Consider the following > alternative apply sequence: > > 1. T1. > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and > the subsequent nextval) > 3. T3's nextval > 4. T2's first nextval > > That's still in commit order. It's also wrong. > Yes, this would be wrong. Thankfully the apply is not allowed to reorder the changes like this, because that's not what "non-transactional" means in this context. It does not mean we can arbitrarily reorder the changes, it only means the changes are applied as if they were independent transactions (but in the same order as they were executed originally). Both with respect to the other non-transactional changes, and to "commits" of other stuff. (for serial apply, at least) > Imagine that you commit this patch and someone later wants to do > parallel logical apply. So every time they finish decoding a > transaction, they stick it in a queue to be applied by the next > available worker. But, non-transactional changes are very simple, so > we just directly apply those in the main process. Well, kaboom! But > now this can happen with the above example. > > 1. Decode T1. Add to queue for apply. > 2. Before the (idle) apply worker has a chance to pull T1 out of the > queue, decode the first nextval and try to apply it. 
> > Oops. We're trying to apply a modification to a sequence that hasn't > been created yet. I'm not saying that this kind of hypothetical is a > reason not to commit the patch. But it seems like we're not on the > same page about what the ordering requirements are here. I'm just > making the argument that those non-transactional operations actually > act like mini-transactions. They need to happen at the right time > relative to the real transactions. A non-transactional operation needs > to be applied after any transactions that commit before it is logged, > and before any transactions that commit after it's logged. > How is this issue specific to sequences? AFAIK this is a general problem with transactions that depend on each other. Consider for example this: T1: INSERT INTO t (id) VALUES (1); T2: DELETE FROM t WHERE id = 1; If you parallelize this in a naive way, maybe T2 gets applied before T1. In which case the DELETE won't find the row yet. There's
Re: logical decoding and replication of sequences, take 2
On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra wrote: > Right, locks + apply in commit order gives us this guarantee (I can't > think of a case where it wouldn't be the case). I couldn't find any cases of inadequate locking other than the one I mentioned. > Doesn't the whole logical replication critically depend on the commit > order? If you decide to arbitrarily reorder/delay the transactions, all > kinds of really bad things can happen. That's a generic problem, it > applies to all kinds of objects, not just sequences - a parallel apply > would need to detect this sort of dependencies (e.g. INSERT + DELETE of > the same key), and do something about it. Yes, but here I'm not just talking about the commit order. I'm talking about the order of applying non-transactional operations relative to commits. Consider: T1: CREATE SEQUENCE s; T2: BEGIN; T2: SELECT nextval('s'); T3: SELECT nextval('s'); T2: ALTER SEQUENCE s INCREMENT 2; T2: SELECT nextval('s'); T2: COMMIT; The commit order is T1 < T3 < T2, but T3 makes no transactional changes, so the commit order is really just T1 < T2. But it's completely wrong to say that all we need to do is apply T1 before we apply T2. The correct order of application is: 1. T1. 2. T2's first nextval 3. T3's nextval 4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and the subsequent nextval) In other words, the fact that some sequence changes are non-transactional creates ordering hazards that don't exist if there are no non-transactional changes. So in that way, sequences are different from table modifications, where applying the transactions in order of commit is all we need to do. Here we need to apply the transactions in order of commit and also apply the non-transactional changes at the right point in the sequence. Consider the following alternative apply sequence: 1. T1. 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and the subsequent nextval) 3. T3's nextval 4. 
T2's first nextval That's still in commit order. It's also wrong. Imagine that you commit this patch and someone later wants to do parallel logical apply. So every time they finish decoding a transaction, they stick it in a queue to be applied by the next available worker. But, non-transactional changes are very simple, so we just directly apply those in the main process. Well, kaboom! But now this can happen with the above example. 1. Decode T1. Add to queue for apply. 2. Before the (idle) apply worker has a chance to pull T1 out of the queue, decode the first nextval and try to apply it. Oops. We're trying to apply a modification to a sequence that hasn't been created yet. I'm not saying that this kind of hypothetical is a reason not to commit the patch. But it seems like we're not on the same page about what the ordering requirements are here. I'm just making the argument that those non-transactional operations actually act like mini-transactions. They need to happen at the right time relative to the real transactions. A non-transactional operation needs to be applied after any transactions that commit before it is logged, and before any transactions that commit after it's logged. > Yes, I think this is a bug in handling of owned sequences - from the > moment the "ALTER TABLE ... SET UNLOGGED" is executed, the two sessions > generate duplicate values (until the S1 is committed, at which point the > values generated in S2 get "forgotten"). > > It seems we end up updating both relfilenodes, which is clearly wrong. > > Seems like a bug independent of the decoding, IMO. Yeah. -- Robert Haas EDB: http://www.enterprisedb.com
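For what it's worth, the required ordering in the example upthread can be sketched in a few lines of Python (an illustrative toy with invented LSNs, not PostgreSQL code): both commits and non-transactional sequence changes are applied strictly in LSN order, which is exactly what rules out the wrong orderings described above:

```python
def apply_order(events):
    """events: (lsn, label) pairs for commits and non-transactional
    sequence changes alike; everything applies strictly in LSN order."""
    return [label for lsn, label in sorted(events)]

# Commit order alone is T1 < T2 (T3 makes no transactional changes),
# but the non-transactional changes pin down the full ordering:
events = [
    (10, "T1 (CREATE SEQUENCE s)"),
    (20, "T2's first nextval"),
    (30, "T3's nextval"),
    (40, "T2's transactional changes (ALTER SEQUENCE + nextval)"),
]
order = apply_order(events)
```

A parallel apply that handed the commits to workers but applied the (lsn 20, 30) sequence changes directly would break exactly this invariant.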
Re: logical decoding and replication of sequences, take 2
On 1/26/24 15:39, Robert Haas wrote: > On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra > wrote: >> I did try to explain how this works (and why) in a couple places: >> >> 1) the commit message >> 2) reorderbuffer header comment >> 3) ReorderBufferSequenceIsTransactional comment (and nearby) >> >> It's possible this does not meet your expectations, ofc. Maybe there >> should be a separate README for this - I haven't found anything like >> that for logical decoding in general, which is why I did (1)-(3). > > I read over these and I do think they answer a bunch of questions, but > I don't think they answer all of the questions. > > Suppose T1 creates a sequence and commits. Then T2 calls nextval(). > Then T3 drops the sequence. According to the commit message, T2's > change will be "replayed immediately after decoding". But it's > essential to replay T2's change after we replay T1 and before we > replay T3, and the comments don't explain why that's guaranteed. > > The answer might be "locks". If we always replay a transaction > immediately when we see its commit record then in the example above > we're fine, because the commit record for the transaction that creates > the sequence must precede the nextval() call, since the sequence won't > be visible until the transaction commits, and also because T1 holds a > lock on it at that point sufficient to lock out nextval. And the > nextval record must precede the point where T3 takes an exclusive lock > on the sequence. > Right, locks + apply in commit order gives us this guarantee (I can't think of a case where it wouldn't be the case). > Note, however, that this chain of reasoning critically depends on us > never delaying application of a transaction.
If we might reach T1's > commit record and say "hey, let's hold on to this for a bit and replay > it after we've decoded some more," everything immediately breaks, > unless we also delay application of T2's non-transactional update in > such a way that it's still guaranteed to happen after T1. I wonder if > this kind of situation would be a problem for a future parallel-apply > feature. It wouldn't work, for example, to hand T1 and T3 off (in that > order) to a separate apply process but handle T2's "non-transactional" > message directly, because it might handle that message before the > application of T1 got completed. > Doesn't the whole logical replication critically depend on the commit order? If you decide to arbitrarily reorder/delay the transactions, all kinds of really bad things can happen. That's a generic problem, it applies to all kinds of objects, not just sequences - a parallel apply would need to detect this sort of dependencies (e.g. INSERT + DELETE of the same key), and do something about it. Similar for sequences, where the important event is allocation of a new relfilenode. If anything, it's easier for sequences, because the relfilenode tracking gives us an explicit (and easy) way to detect these dependencies between transactions. > This also seems to depend on every transactional operation that might > affect a future non-transactional operation holding a lock that would > conflict with that non-transactional operation. For example, if ALTER > SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence, > then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART > WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what > the semantics of that would be -- would T1's second nextval() see the > sequence restart, or what? 
But if the effect of T1's second nextval > does depend in some way on the ALTER SEQUENCE operation which precedes > it in the WAL stream, then we might have some trouble here, because > both nextvals precede the commit of T2. Fortunately, this sequence of > events is foreclosed by locking. > I don't quite follow :-( AFAIK this theory hinges on not having the right lock, but I believe ALTER SEQUENCE does obtain the lock (at least in cases that assign a new relfilenode). Which means such reordering should not be possible, because nextval() in other transactions will then wait until commit. And all nextval() calls in the same transaction will be treated as transactional. So I think this works OK. If something does not lock the sequence in a way that would prevent other xacts from doing nextval() on it, it's not a change that would change the relfilenode - and so it does not switch the sequence into a transactional mode. > But I did find one somewhat-similar case in which that's not so. > > S1: create table withseq (a bigint generated always as identity); > S1: begin; > S2: select nextval('withseq_a_seq'); > S1: alter table withseq set unlogged; > S2: select nextval('withseq_a_seq'); > > I think this is a bug in the code that supports owned sequences rather > than a problem that this patch should have to do something about. When > a sequence is flipped between logged and unlogged directly, we take a > stronger lock than we do here when
Re: logical decoding and replication of sequences, take 2
On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra wrote: > I did try to explain how this works (and why) in a couple places: > > 1) the commit message > 2) reorderbuffer header comment > 3) ReorderBufferSequenceIsTransactional comment (and nearby) > > It's possible this does not meet your expectations, ofc. Maybe there > should be a separate README for this - I haven't found anything like > that for logical decoding in general, which is why I did (1)-(3). I read over these and I do think they answer a bunch of questions, but I don't think they answer all of the questions. Suppose T1 creates a sequence and commits. Then T2 calls nextval(). Then T3 drops the sequence. According to the commit message, T2's change will be "replayed immediately after decoding". But it's essential to replay T2's change after we replay T1 and before we replay T3, and the comments don't explain why that's guaranteed. The answer might be "locks". If we always replay a transaction immediately when we see its commit record then in the example above we're fine, because the commit record for the transaction that creates the sequence must precede the nextval() call, since the sequence won't be visible until the transaction commits, and also because T1 holds a lock on it at that point sufficient to lock out nextval. And the nextval record must precede the point where T3 takes an exclusive lock on the sequence. Note, however, that this chain of reasoning critically depends on us never delaying application of a transaction. If we might reach T1's commit record and say "hey, let's hold on to this for a bit and replay it after we've decoded some more," everything immediately breaks, unless we also delay application of T2's non-transactional update in such a way that it's still guaranteed to happen after T1. I wonder if this kind of situation would be a problem for a future parallel-apply feature.
It wouldn't work, for example, to hand T1 and T3 off (in that order) to a separate apply process but handle T2's "non-transactional" message directly, because it might handle that message before the application of T1 got completed. This also seems to depend on every transactional operation that might affect a future non-transactional operation holding a lock that would conflict with that non-transactional operation. For example, if ALTER SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence, then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what the semantics of that would be -- would T1's second nextval() see the sequence restart, or what? But if the effect of T1's second nextval does depend in some way on the ALTER SEQUENCE operation which precedes it in the WAL stream, then we might have some trouble here, because both nextvals precede the commit of T2. Fortunately, this sequence of events is foreclosed by locking. But I did find one somewhat-similar case in which that's not so. S1: create table withseq (a bigint generated always as identity); S1: begin; S2: select nextval('withseq_a_seq'); S1: alter table withseq set unlogged; S2: select nextval('withseq_a_seq'); I think this is a bug in the code that supports owned sequences rather than a problem that this patch should have to do something about. When a sequence is flipped between logged and unlogged directly, we take a stronger lock than we do here when it's done in this indirect way. Also, I'm not quite sure if it would pose a problem for sequence decoding anyway: it changes the relfilenode, but not the value. But this is the *kind* of problem that could make the approach unsafe: supposedly transactional changes being interleaved with supposedly non-transactional changes, in such a way that the non-transactional changes might get applied at the wrong time relative to the transactional changes.
-- Robert Haas EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On 1/23/24 21:47, Robert Haas wrote: > On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra > wrote: >> 1) desirability: We want a built-in way to handle sequences in logical >> replication. I think everyone agrees this is not a way to do distributed >> sequences in active-active setups, but that there are other use cases >> that need this feature - typically upgrades / logical failover. > > Yeah. I find it extremely hard to take seriously the idea that this > isn't a valuable feature. How else are you supposed to do a logical > failover without having your entire application break? > >> 2) performance: There was concern about the performance impact, and that >> it affects everyone, including those who don't replicate sequences (as >> the overhead is mostly incurred before calls to output plugin etc.). >> >> The agreement was that the best way is to have a CREATE SUBSCRIPTION >> option that would instruct the upstream to decode sequences. By default >> this option is 'off' (because that's the no-overhead case), but it can >> be enabled for each subscription. > > Seems reasonable, at least unless and until we come up with something better. > >> 3) correctness: The last point is about making the "transactional" flag >> correct when the snapshot state changes mid-transaction, originally >> pointed out by Dilip [4]. Per [5] this however happens to work >> correctly, because while we identify the change as 'non-transactional' >> (which is incorrect), we immediately throw it away (so we don't try to >> apply it, which would error-out). > > I've said this before, but I still find this really scary. It's > unclear to me that we can simply classify updates as transactional or > non-transactional and expect things to work. If it's possible, I hope > we have a really good explanation somewhere of how and why it's > possible. If we do, can somebody point me to it so I can read it?
> I did try to explain how this works (and why) in a couple places: 1) the commit message 2) reorderbuffer header comment 3) ReorderBufferSequenceIsTransactional comment (and nearby) It's possible this does not meet your expectations, ofc. Maybe there should be a separate README for this - I haven't found anything like that for logical decoding in general, which is why I did (1)-(3). > To be possibly slightly more clear about my concern, I think the scary > case is where we have transactional and non-transactional things > happening to the same sequence in close temporal proximity, either > within the same session or across two or more sessions. If a > non-transactional change can get reordered ahead of some transactional > change upon which it logically depends, or behind some transactional > change that logically depends on it, then we have trouble. I also > wonder if there are any cases where the same operation is partly > transactional and partly non-transactional. > I certainly understand this concern, and to some extent I even share it. Having to differentiate between transactional and non-transactional changes certainly confused me more than once. It's especially confusing, because the decoding implicitly changes the perceived ordering/atomicity of the events. That being said, I don't think it gets reordered the way you're concerned about. The "transactionality" is determined by relfilenode change, so how could the reordering happen? We'd have to misidentify a change in either direction - and for a nontransactional->transactional change that's clearly not possible. There has to be a new relfilenode in that xact. In the other direction (transactional->nontransactional), it can happen if we fail to decode the relfilenode record. Which is what we discussed earlier, but came to the conclusion that it actually works OK. Of course, there might be bugs. I spent quite a bit of effort reviewing and testing this, but there still might be something wrong.
But I think that applies to any feature. What would be worse is some sort of thinko in the approach in general. I don't have a good answer to that, unfortunately - I think it works, but how would I know for sure? We explored multiple alternative approaches and all of them crashed and burned ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
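To spell out the classification rule mentioned above, here is a simplified sketch (the class and method names are invented; this is not the actual reorderbuffer code): a sequence change is transactional iff the transaction that logged it also created a new relfilenode for that sequence, and everything else is applied immediately.

```python
class SequenceTracking:
    """Toy model of per-transaction relfilenode tracking."""

    def __init__(self):
        self.created = {}  # xid -> set of relfilenodes created by that xact

    def record_new_relfilenode(self, xid, relfilenode):
        # Logged when CREATE/ALTER SEQUENCE assigns a new relfilenode
        # inside transaction xid.
        self.created.setdefault(xid, set()).add(relfilenode)

    def is_transactional(self, xid, relfilenode):
        # Transactional iff this xact created the sequence's relfilenode;
        # otherwise the change applies immediately (non-transactional).
        return relfilenode in self.created.get(xid, set())

    def finish(self, xid):
        # On commit/abort, forget the xact's relfilenodes.
        self.created.pop(xid, None)

t = SequenceTracking()
t.record_new_relfilenode(xid=501, relfilenode=1234)
```

Misclassification in the nontransactional->transactional direction would require an entry in `created` that was never recorded, which cannot happen; the other direction requires failing to decode the relfilenode record, which is the startup case discussed above.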
Re: logical decoding and replication of sequences, take 2
On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra wrote: > 1) desirability: We want a built-in way to handle sequences in logical > replication. I think everyone agrees this is not a way to do distributed > sequences in active-active setups, but that there are other use cases > that need this feature - typically upgrades / logical failover. Yeah. I find it extremely hard to take seriously the idea that this isn't a valuable feature. How else are you supposed to do a logical failover without having your entire application break? > 2) performance: There was concern about the performance impact, and that > it affects everyone, including those who don't replicate sequences (as > the overhead is mostly incurred before calls to output plugin etc.). > > The agreement was that the best way is to have a CREATE SUBSCRIPTION > option that would instruct the upstream to decode sequences. By default > this option is 'off' (because that's the no-overhead case), but it can > be enabled for each subscription. Seems reasonable, at least unless and until we come up with something better. > 3) correctness: The last point is about making the "transactional" flag > correct when the snapshot state changes mid-transaction, originally > pointed out by Dilip [4]. Per [5] this however happens to work > correctly, because while we identify the change as 'non-transactional' > (which is incorrect), we immediately throw it away (so we don't try to > apply it, which would error-out). I've said this before, but I still find this really scary. It's unclear to me that we can simply classify updates as transactional or non-transactional and expect things to work. If it's possible, I hope we have a really good explanation somewhere of how and why it's possible. If we do, can somebody point me to it so I can read it?
To be possibly slightly more clear about my concern, I think the scary case is where we have transactional and non-transactional things happening to the same sequence in close temporal proximity, either within the same session or across two or more sessions. If a non-transactional change can get reordered ahead of some transactional change upon which it logically depends, or behind some transactional change that logically depends on it, then we have trouble. I also wonder if there are any cases where the same operation is partly transactional and partly non-transactional. -- Robert Haas EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On 12/15/23 03:33, Amit Kapila wrote: > On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat > wrote: >> >> On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila wrote: >>> >>> It can only be cleaned if we process it but xact_decode won't allow us >>> to process it and I don't think it would be a good idea to add another >>> hack for sequences here. See below code: >>> >>> xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) >>> { >>> SnapBuild *builder = ctx->snapshot_builder; >>> ReorderBuffer *reorder = ctx->reorder; >>> XLogReaderState *r = buf->record; >>> uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; >>> >>> /* >>> * If the snapshot isn't yet fully built, we cannot decode anything, so >>> * bail out. >>> */ >>> if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) >>> return; >> >> That may be true for a transaction which is decoded, but I think all >> the transactions which are added to ReorderBuffer should be cleaned up >> once they have been processed irrespective of whether they are >> decoded/sent downstream or not. In this case I see the sequence hash >> being cleaned up for the sequence related transaction in Hayato's >> reproducer. >> > > It was because the test you are using was not designed to show the > problem I mentioned. In this case, the rollback was after a full > snapshot state was reached. > Right, I haven't tried to reproduce this, but it very much looks like the entry would not be removed if the xact aborts/commits before the snapshot reaches FULL state. I suppose one way to deal with this would be to first check if an entry for the same relfilenode exists. If it does, the original transaction must have terminated, but we haven't cleaned it up yet - in which case we can just "move" the relfilenode to the new one. However, can't that happen even with full snapshots? I mean, let's say a transaction creates a relfilenode and terminates without writing an abort record (surely that's possible, right?).
And then another xact comes and generates the same relfilenode (presumably that's unlikely, but perhaps possible?). Aren't we in pretty much the same situation, until the next RUNNING_XACTS cleans up the hash table? I think tracking all relfilenodes would fix the original issue (with treating some changes as transactional), and the tweak that "moves" the relfilenode to the new xact would fix this other issue too. That being said, I feel a bit uneasy about it, for similar reasons as Amit. If we start processing records before full snapshot, that seems like moving the assumptions a bit. For example it means we'd create ReorderBufferTXN entries for cases that'd have skipped before. OTOH this is (or should be) only a very temporary period while starting the replication, I believe. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 12/19/23 13:54, Christophe Pettus wrote:
> Hi,
>
> I wanted to hop in here on one particular issue:
>
>> On Dec 12, 2023, at 02:01, Tomas Vondra wrote:
>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>> better solution for distributed (esp. active-active) systems. But there
>> are important use cases that are likely to keep using regular sequences
>> (online upgrades of single-node instances, existing systems, ...).
>
> +1.
>
> Right now, the lack of sequence replication is a rather large
> foot-gun on logical replication upgrades. Copying the sequences
> over during the cutover period is doable, of course, but:
>
> (a) There's no out-of-the-box tooling that does it, so everyone has
> to write some scripts just for that one function.
>
> (b) It's one more thing that extends the cutover window.
>

I agree it's an annoying gap for this use case. But if this is the only use
case, maybe a better solution would be to provide such tooling instead of
adding it to the logical decoding? It might seem a bit strange if most data
is copied by replication directly while sequences need special handling, of
course.

> I don't think it is a good idea to make it mandatory: for example,
> there's a strong use case for replicating a table but not a sequence
> associated with it. But it's definitely a missing feature in
> logical replication.

I don't think the plan was to make replication of sequences mandatory,
certainly not with the built-in replication. If you don't add sequences to
the publication, the sequence changes will be skipped.

But they still need to be part of the decoding, which adds overhead for all
logical decoding uses, even if the sequence changes end up being discarded.
That's somewhat annoying, especially considering sequences are a fairly
common part of the WAL stream.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
Hi,

I wanted to hop in here on one particular issue:

> On Dec 12, 2023, at 02:01, Tomas Vondra wrote:
> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> better solution for distributed (esp. active-active) systems. But there
> are important use cases that are likely to keep using regular sequences
> (online upgrades of single-node instances, existing systems, ...).

+1.

Right now, the lack of sequence replication is a rather large foot-gun on
logical replication upgrades. Copying the sequences over during the cutover
period is doable, of course, but:

(a) There's no out-of-the-box tooling that does it, so everyone has to
write some scripts just for that one function.

(b) It's one more thing that extends the cutover window.

I don't think it is a good idea to make it mandatory: for example, there's
a strong use case for replicating a table but not a sequence associated
with it. But it's definitely a missing feature in logical replication.
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat wrote:
>
> On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila wrote:
> >
> > It can only be cleaned if we process it but xact_decode won't allow us
> > to process it and I don't think it would be a good idea to add another
> > hack for sequences here. See below code:
> >
> > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> > {
> >     SnapBuild *builder = ctx->snapshot_builder;
> >     ReorderBuffer *reorder = ctx->reorder;
> >     XLogReaderState *r = buf->record;
> >     uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
> >
> >     /*
> >      * If the snapshot isn't yet fully built, we cannot decode anything, so
> >      * bail out.
> >      */
> >     if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
> >         return;
>
> That may be true for a transaction which is decoded, but I think all
> the transactions which are added to ReorderBuffer should be cleaned up
> once they have been processed irrespective of whether they are
> decoded/sent downstream or not. In this case I see the sequence hash
> being cleaned up for the sequence-related transaction in Hayato's
> reproducer.
>

It was because the test you are using was not designed to show the problem
I mentioned. In this case, the rollback was after a full snapshot state was
reached.

-- 
With Regards,
Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023, at 12:44 PM, Ashutosh Bapat wrote:
> I haven't found the code path from where the sequence cleanup gets
> called. But it's being called. Am I missing something?

ReorderBufferCleanupTXN.

-- 
Euler Taveira
EDB   https://www.enterprisedb.com/
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila wrote:
>
> On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat wrote:
> >
> > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote:
> > >
> > > I think you forgot to attach the patch.
> >
> > Sorry. Here it is.
> >
> > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote:
> > >
> > > > It looks like the solution works. But this is the only place where we
> > > > process a change before SNAPSHOT reaches FULL. But this is also the
> > > > only record which affects a decision to queue/not a following change.
> > > > So it should be ok. The sequence_hash'es are separate for each
> > > > transaction and they are cleaned when processing the COMMIT record.
> > >
> > > But it is possible that even commit or abort also happens before the
> > > snapshot reaches full state in which case the hash table will have
> > > stale or invalid (for aborts) entries. That will probably be cleaned
> > > at a later point by running_xact records.
> >
> > Why would cleaning wait till running_xact records? Won't the txn entry
> > itself be removed when processing the commit/abort record? At the same
> > time the sequence hash will be cleaned as well.
> >
> > > Now, I think in theory, it
> > > is possible that the same RelFileLocator can again be allocated before
> > > we clean up the existing entry which can probably confuse the system.
> >
> > How? The transaction allocating the first time would be cleaned before
> > it happens the second time. So it shouldn't matter.
> >
>
> It can only be cleaned if we process it but xact_decode won't allow us
> to process it and I don't think it would be a good idea to add another
> hack for sequences here. See below code:
>
> xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> {
>     SnapBuild *builder = ctx->snapshot_builder;
>     ReorderBuffer *reorder = ctx->reorder;
>     XLogReaderState *r = buf->record;
>     uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
>
>     /*
>      * If the snapshot isn't yet fully built, we cannot decode anything, so
>      * bail out.
>      */
>     if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
>         return;

That may be true for a transaction which is decoded, but I think all the
transactions which are added to ReorderBuffer should be cleaned up once
they have been processed, irrespective of whether they are decoded/sent
downstream or not. In this case I see the sequence hash being cleaned up
for the sequence-related transaction in Hayato's reproducer.

See attached patch with a diagnostic change and the output below (notice
"sequence cleanup called on transaction 767"):

2023-12-14 21:06:36.756 IST [386957] LOG:  logical decoding found initial starting point at 0/15B2F68
2023-12-14 21:06:36.756 IST [386957] DETAIL:  Waiting for transactions (approximately 1) older than 767 to end.
2023-12-14 21:06:36.756 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: skipped
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.710 IST [386957] LOG:  logical decoding found initial consistent point at 0/15B3388
2023-12-14 21:07:05.710 IST [386957] DETAIL:  Waiting for transactions (approximately 1) older than 768 to end.
2023-12-14 21:07:05.710 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:39.292 IST [386298] LOG:  checkpoint starting: time
2023-12-14 21:07:40.919 IST [386957] LOG:  XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  XXX: the sequence is transactional
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  sequence cleanup called on transaction 767
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  logical decoding found consistent point at 0/15B3518
2023-12-14 21:07:40.919 IST [386957] DETAIL:  There are no running transactions.
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');

We see similar output when pg_logical_slot_get_changes() is called.

I haven't found the code path from where the sequence cleanup gets called.
But it's being called. Am I missing something?
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat wrote:
>
> On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote:
> >
> > I think you forgot to attach the patch.
>
> Sorry. Here it is.
>
> On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote:
> >
> > > It looks like the solution works. But this is the only place where we
> > > process a change before SNAPSHOT reaches FULL. But this is also the
> > > only record which affects a decision to queue/not a following change.
> > > So it should be ok. The sequence_hash'es are separate for each
> > > transaction and they are cleaned when processing the COMMIT record.
> >
> > But it is possible that even commit or abort also happens before the
> > snapshot reaches full state in which case the hash table will have
> > stale or invalid (for aborts) entries. That will probably be cleaned
> > at a later point by running_xact records.
>
> Why would cleaning wait till running_xact records? Won't the txn entry
> itself be removed when processing the commit/abort record? At the same
> time the sequence hash will be cleaned as well.
>
> > Now, I think in theory, it
> > is possible that the same RelFileLocator can again be allocated before
> > we clean up the existing entry which can probably confuse the system.
>
> How? The transaction allocating the first time would be cleaned before
> it happens the second time. So it shouldn't matter.
>

It can only be cleaned if we process it but xact_decode won't allow us to
process it, and I don't think it would be a good idea to add another hack
for sequences here. See below code:

xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
    SnapBuild *builder = ctx->snapshot_builder;
    ReorderBuffer *reorder = ctx->reorder;
    XLogReaderState *r = buf->record;
    uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;

    /*
     * If the snapshot isn't yet fully built, we cannot decode anything, so
     * bail out.
     */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
        return;

-- 
With Regards,
Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote:
>
> I think you forgot to attach the patch.

Sorry. Here it is.

On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote:
>
> > It looks like the solution works. But this is the only place where we
> > process a change before SNAPSHOT reaches FULL. But this is also the
> > only record which affects a decision to queue/not a following change.
> > So it should be ok. The sequence_hash'es are separate for each
> > transaction and they are cleaned when processing the COMMIT record.
>
> But it is possible that even commit or abort also happens before the
> snapshot reaches full state in which case the hash table will have
> stale or invalid (for aborts) entries. That will probably be cleaned
> at a later point by running_xact records.

Why would cleaning wait till running_xact records? Won't the txn entry
itself be removed when processing the commit/abort record? At the same time
the sequence hash will be cleaned as well.

> Now, I think in theory, it
> is possible that the same RelFileLocator can again be allocated before
> we clean up the existing entry which can probably confuse the system.

How? The transaction allocating the first time would be cleaned before it
happens the second time. So it shouldn't matter.

-- 
Best Wishes,
Ashutosh Bapat

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 73e38cafd8..8e2ebbd8e0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1543,10 +1543,15 @@ smgr_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	elog(LOG, "XXX: smgr_decode. snapshot is %s", SnapBuildIdentify(builder));
 
 	/*
-	 * If we don't have snapshot or we are just fast-forwarding, there is no
-	 * point in decoding relfilenode information.
+	 * If we are not building snapshot yet or we are just fast-forwarding, there
+	 * is no point in decoding relfilenode information. If the sequence
+	 * associated with relfilenode here is changed in the same transaction but
+	 * after snapshot was built, the relfilenode needs to be present in the
+	 * sequence hash table so that the change can be deemed as transactional.
+	 * Otherwise it will not be queued. Hence we process this change even before
+	 * we have built snapshot.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_BUILDING_SNAPSHOT ||
 		ctx->fast_forward)
 	{
 		elog(LOG, "XXX: skipped");
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat wrote:
>
> On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar wrote:
> >
> > > It is correct that we can make a wrong decision about whether a change
> > > is transactional or non-transactional when sequence DDL happens before
> > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > > after that state. However, one thing to note here is that we won't try
> > > to stream such a change because for non-transactional cases we don't
> > > proceed unless the snapshot is in a consistent state. Now, if the
> > > decision had been correct then we would probably have queued the
> > > sequence change and discarded at commit.
> > >
> > > One thing that we deviate here is that for non-sequence transactional
> > > cases (including logical messages), we immediately start queuing the
> > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > > SnapBuildProcessChange() returns true, which is quite possible) and
> > > take the final decision at commit/prepare/abort time. However, that
> > > won't be the case for sequences because of the dependency of
> > > determining transactional cases on one of the prior records. Now, I am
> > > not completely sure at this stage if such a deviation can cause any
> > > problem or whether we are okay to have such a deviation for sequences.
> >
> > Okay, so this particular scenario that I raised is somehow saved. I
> > mean, although we are considering a transactional sequence operation as
> > non-transactional, we also know that if some of the changes for a
> > transaction are skipped because the snapshot was not FULL, that means
> > the transaction cannot be streamed, because it has to be committed
> > before the snapshot becomes CONSISTENT (based on the snapshot state
> > change machinery). Ideally, based on the same logic that the snapshot
> > is not consistent, the non-transactional sequence changes are also
> > skipped. But the only thing that makes me a bit uncomfortable is that
> > even though the result is not wrong, we have made some wrong
> > intermediate decisions, i.e. considered a transactional change as
> > non-transactional.
> >
> > One solution to this issue is that, even if the snapshot state does
> > not reach FULL, just add the sequence relids to the hash. I mean, that
> > hash is only maintained for deciding whether the sequence is changed
> > in that transaction or not, so not adding such relids to the hash seems
> > like the root cause of the issue. Honestly, I haven't analyzed this
> > idea in detail about how easy it would be to add only these changes to
> > the hash and what the other dependencies are, but this seems like a
> > worthwhile direction IMHO.
>
> ...
>
> It looks like the solution works. But this is the only place where we
> process a change before SNAPSHOT reaches FULL. But this is also the
> only record which affects a decision to queue/not a following change.
> So it should be ok. The sequence_hash'es are separate for each
> transaction and they are cleaned when processing the COMMIT record.
>

But it is possible that even commit or abort also happens before the
snapshot reaches full state, in which case the hash table will have stale
or invalid (for aborts) entries. That will probably be cleaned at a later
point by running_xact records. Now, I think in theory, it is possible that
the same RelFileLocator can again be allocated before we clean up the
existing entry, which can probably confuse the system. It might or might
not be a problem in practice, but I think the more assumptions we add for
sequences, the more difficult it will become to ensure its correctness.

-- 
With Regards,
Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat wrote:
>
> On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar wrote:
> >
> > > It is correct that we can make a wrong decision about whether a change
> > > is transactional or non-transactional when sequence DDL happens before
> > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > > after that state. However, one thing to note here is that we won't try
> > > to stream such a change because for non-transactional cases we don't
> > > proceed unless the snapshot is in a consistent state. Now, if the
> > > decision had been correct then we would probably have queued the
> > > sequence change and discarded at commit.
> > >
> > > One thing that we deviate here is that for non-sequence transactional
> > > cases (including logical messages), we immediately start queuing the
> > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > > SnapBuildProcessChange() returns true, which is quite possible) and
> > > take the final decision at commit/prepare/abort time. However, that
> > > won't be the case for sequences because of the dependency of
> > > determining transactional cases on one of the prior records. Now, I am
> > > not completely sure at this stage if such a deviation can cause any
> > > problem or whether we are okay to have such a deviation for sequences.
> >
> > Okay, so this particular scenario that I raised is somehow saved. I
> > mean, although we are considering a transactional sequence operation as
> > non-transactional, we also know that if some of the changes for a
> > transaction are skipped because the snapshot was not FULL, that means
> > the transaction cannot be streamed, because it has to be committed
> > before the snapshot becomes CONSISTENT (based on the snapshot state
> > change machinery). Ideally, based on the same logic that the snapshot
> > is not consistent, the non-transactional sequence changes are also
> > skipped. But the only thing that makes me a bit uncomfortable is that
> > even though the result is not wrong, we have made some wrong
> > intermediate decisions, i.e. considered a transactional change as
> > non-transactional.
> >
> > One solution to this issue is that, even if the snapshot state does
> > not reach FULL, just add the sequence relids to the hash. I mean, that
> > hash is only maintained for deciding whether the sequence is changed
> > in that transaction or not, so not adding such relids to the hash seems
> > like the root cause of the issue. Honestly, I haven't analyzed this
> > idea in detail about how easy it would be to add only these changes to
> > the hash and what the other dependencies are, but this seems like a
> > worthwhile direction IMHO.
>
> I also thought about the same solution. I tried this solution as the
> attached patch on top of Hayato's diagnostic changes.

I think you forgot to attach the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar wrote:
>
> > It is correct that we can make a wrong decision about whether a change
> > is transactional or non-transactional when sequence DDL happens before
> > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > after that state. However, one thing to note here is that we won't try
> > to stream such a change because for non-transactional cases we don't
> > proceed unless the snapshot is in a consistent state. Now, if the
> > decision had been correct then we would probably have queued the
> > sequence change and discarded at commit.
> >
> > One thing that we deviate here is that for non-sequence transactional
> > cases (including logical messages), we immediately start queuing the
> > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > SnapBuildProcessChange() returns true, which is quite possible) and
> > take the final decision at commit/prepare/abort time. However, that
> > won't be the case for sequences because of the dependency of determining
> > transactional cases on one of the prior records. Now, I am not
> > completely sure at this stage if such a deviation can cause any problem
> > or whether we are okay to have such a deviation for sequences.
>
> Okay, so this particular scenario that I raised is somehow saved. I
> mean, although we are considering a transactional sequence operation as
> non-transactional, we also know that if some of the changes for a
> transaction are skipped because the snapshot was not FULL, that means
> the transaction cannot be streamed, because it has to be committed
> before the snapshot becomes CONSISTENT (based on the snapshot state
> change machinery). Ideally, based on the same logic that the snapshot
> is not consistent, the non-transactional sequence changes are also
> skipped. But the only thing that makes me a bit uncomfortable is that
> even though the result is not wrong, we have made some wrong
> intermediate decisions, i.e. considered a transactional change as
> non-transactional.
>
> One solution to this issue is that, even if the snapshot state does
> not reach FULL, just add the sequence relids to the hash. I mean, that
> hash is only maintained for deciding whether the sequence is changed
> in that transaction or not, so not adding such relids to the hash seems
> like the root cause of the issue. Honestly, I haven't analyzed this
> idea in detail about how easy it would be to add only these changes to
> the hash and what the other dependencies are, but this seems like a
> worthwhile direction IMHO.

I also thought about the same solution. I tried this solution as the
attached patch on top of Hayato's diagnostic changes.

The following log messages are seen in the server error log. They indicate
that the sequence change was correctly deemed a transactional change (the
line "XXX: the sequence is transactional"):

2023-12-14 12:12:50.550 IST [321229] ERROR:  relation "pg_replication_slot" does not exist at character 15
2023-12-14 12:12:50.550 IST [321229] STATEMENT:  select * from pg_replication_slot;
2023-12-14 12:12:57.289 IST [321229] LOG:  logical decoding found initial starting point at 0/1598D50
2023-12-14 12:12:57.289 IST [321229] DETAIL:  Waiting for transactions (approximately 1) older than 759 to end.
2023-12-14 12:12:57.289 IST [321229] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG:  XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 12:13:49.551 IST [321229] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG:  XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 12:13:49.551 IST [321229] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG:  XXX: skipped
2023-12-14 12:13:49.551 IST [321229] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.552 IST [321229] LOG:  logical decoding found initial consistent point at 0/1599170
2023-12-14 12:13:49.552 IST [321229] DETAIL:  Waiting for transactions (approximately 1) older than 760 to end.
2023-12-14 12:13:49.552 IST [321229] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.591 IST [321229] LOG:  XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT
2023-12-14 12:14:55.591 IST [321230] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.591 IST [321229] LOG:  XXX: the sequence is transactional
2023-12-14 12:14:55.591 IST [321229] STATEMENT:  SELECT pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.813 IST [321229] LOG:  logical decoding found consistent point at 0/15992E8
2023-12-14 12:14:55.813 IST [321229] DETAIL:  There are no running transactions.
2023-12-14 12:14:55.813 IST
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 13, 2023 at 6:26 PM Amit Kapila wrote:
>
> > > But can this even happen? Can we start decoding in the middle of a
> > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> > > messages, where we also call the output plugin in non-transactional
> > > cases.
> >
> > It's not a problem for logical messages because whether the message is
> > transactional or non-transactional is decided while WAL logs the
> > message itself. But here our problem starts with deciding whether the
> > change is transactional vs non-transactional, because if we insert the
> > 'relfilenode' in the hash then the subsequent sequence change in the
> > same transaction would be considered transactional, otherwise
> > non-transactional.
>
> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state. However, one thing to note here is that we won't try
> to stream such a change because for non-transactional cases we don't
> proceed unless the snapshot is in a consistent state. Now, if the
> decision had been correct then we would probably have queued the
> sequence change and discarded at commit.
>
> One thing that we deviate here is that for non-sequence transactional
> cases (including logical messages), we immediately start queuing the
> changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> SnapBuildProcessChange() returns true, which is quite possible) and
> take the final decision at commit/prepare/abort time. However, that
> won't be the case for sequences because of the dependency of determining
> transactional cases on one of the prior records. Now, I am not
> completely sure at this stage if such a deviation can cause any problem
> or whether we are okay to have such a deviation for sequences.

Okay, so this particular scenario that I raised is somehow saved. I mean,
although we are considering a transactional sequence operation as
non-transactional, we also know that if some of the changes for a
transaction are skipped because the snapshot was not FULL, that means the
transaction cannot be streamed, because it has to be committed before the
snapshot becomes CONSISTENT (based on the snapshot state change machinery).
Ideally, based on the same logic that the snapshot is not consistent, the
non-transactional sequence changes are also skipped. But the only thing
that makes me a bit uncomfortable is that even though the result is not
wrong, we have made some wrong intermediate decisions, i.e. considered a
transactional change as non-transactional.

One solution to this issue is that, even if the snapshot state does not
reach FULL, just add the sequence relids to the hash. I mean, that hash is
only maintained for deciding whether the sequence is changed in that
transaction or not, so not adding such relids to the hash seems like the
root cause of the issue. Honestly, I haven't analyzed this idea in detail
about how easy it would be to add only these changes to the hash and what
the other dependencies are, but this seems like a worthwhile direction
IMHO.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
RE: logical decoding and replication of sequences, take 2
Dear hackers, > It is correct that we can make a wrong decision about whether a change > is transactional or non-transactional when sequence DDL happens before > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > after that state. I found a workload which decoder distinguish wrongly. # Prerequisite Apply an attached patch for inspecting the sequence status. It can be applied atop v20231203 patch set. Also, a table and a sequence must be defined: ``` CREATE TABLE foo (var int); CREATE SEQUENCE s; ``` # Workload Then, you can execute concurrent transactions from three clients like below: Client-1 BEGIN; INSERT INTO foo VALUES (1); Client-2 SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); Client-3 BEGIN; ALTER SEQUENCE s MAXVALUE 5000; COMMIT; SAVEPOINT s1; SELECT setval('s', 2000); ROLLBACK; SELECT pg_logical_slot_get_changes('slot', 'test_decoding'); # Result and analysis At first, below lines would be output on the log. This meant that WAL records for ALTER SEQUENCE were decoded but skipped because the snapshot had been building. ``` ... LOG: logical decoding found initial starting point at 0/154D238 DETAIL: Waiting for transactions (approximately 1) older than 741 to end. STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: skipped STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding'); LOG: XXX: skipped ... ``` Note that above `seq_decode...` line was not output via `setval()`, it was done by ALTER SEQUENCE statement. Below is a call stack for inserting WAL. 
```
XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG)
fill_seq_fork_with_data
fill_seq_with_data
AlterSequence
```

Then the subsequent lines look like the following. This means that the snapshot has become FULL and `setval()` is wrongly regarded as non-transactional.

```
LOG: logical decoding found initial consistent point at 0/154D658
DETAIL: Waiting for transactions (approximately 1) older than 742 to end.
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: the sequence is non-transactional
STATEMENT: SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG: XXX: not consistent: skipped
```

The change is then discarded because the snapshot has not become CONSISTENT yet, per the code below. If it had been transactional, we would have queued this change, though the transaction would be skipped at commit.

```
else if (!transactional &&
         (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
          SnapBuildXactNeedsSkip(builder, buf->origptr)))
    return;
```

But anyway, we have found a case where we make a wrong decision. This example happens to be lucky (it does not produce wrong output), but I'm not sure the same holds for all cases like this.

Best Regards, Hayato Kuroda FUJITSU LIMITED

```
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d48d88081f..73e38cafd8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1397,12 +1397,17 @@ seq_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(r), buf->origptr);
 
+	elog(LOG, "XXX: seq_decode. snapshot is %s", SnapBuildIdentify(builder));
+
 	/*
 	 * If we don't have snapshot or we are just fast-forwarding, there is no
 	 * point in decoding sequences.
 	 */
 	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
+	{
+		elog(LOG, "XXX: skipped");
 		return;
+	}
 
 	/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
@@ -1437,14 +1442,22 @@
 					 target_locator, NULL);
 
+	elog(LOG, "XXX: the
```
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 7, 2023 at 10:41 AM Dilip Kumar wrote: > > On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra > wrote: > > > > Yes, if something like this happens, that'd be a problem: > > > > 1) decoding starts, with > > > >SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT > > > > 2) transaction that creates a new refilenode gets decoded, but we skip > >it because we don't have the correct snapshot > > > > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT > > > > 4) we decode sequence change from nextval() for the sequence > > > > This would lead to us attempting to apply sequence change for a > > relfilenode that's not visible yet (and may even get aborted). > > > > But can this even happen? Can we start decoding in the middle of a > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical > > messages, where we also call the output plugin in non-transactional cases. > > It's not a problem for logical messages because whether the message is > transaction or non-transactional is decided while WAL logs the message > itself. But here our problem starts with deciding whether the change > is transactional vs non-transactional, because if we insert the > 'relfilenode' in hash then the subsequent sequence change in the same > transaction would be considered transactional otherwise > non-transactional. > It is correct that we can make a wrong decision about whether a change is transactional or non-transactional when sequence DDL happens before the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens after that state. However, one thing to note here is that we won't try to stream such a change because for non-transactional cases we don't proceed unless the snapshot is in a consistent state. Now, if the decision had been correct then we would probably have queued the sequence change and discarded at commit. 
One way in which we deviate here is that for non-sequence transactional cases (including logical messages), we immediately start queuing the changes as soon as we reach the SNAPBUILD_FULL_SNAPSHOT state (provided SnapBuildProcessChange() returns true, which is quite possible) and make the final decision at commit/prepare/abort time. However, that won't be the case for sequences, because determining whether a change is transactional depends on one of the prior records. Now, I am not completely sure at this stage whether such a deviation can cause any problem, and/or whether we are okay to have such a deviation for sequences.

-- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
Hi,

There's been a lot discussed over the past month or so, and it's become difficult to get a good idea of the current state - what issues remain to be solved, what's unrelated to this patch, and how to move it forward. Long-running threads tend to be confusing, so I had a short call with Amit yesterday to discuss the current state, and to make sure we're on the same page. I believe it was very helpful, and I've promised to post a short summary of the call - issues, what we agreed seems like a path forward, etc.

Obviously, I might have misunderstood something, in which case Amit can correct me. And I'd certainly welcome opinions from others.

In general, we discussed three areas - desirability of the feature, correctness and performance. I believe a brief summary of the agreement would be this:

- desirability of the feature: Random IDs (UUIDs etc.) are likely a much better solution for distributed (esp. active-active) systems. But there are important use cases that are likely to keep using regular sequences (online upgrades of single-node instances, existing systems, ...).

- correctness: There's one possible correctness issue, when the snapshot changes to FULL between the record creating a sequence relfilenode and that sequence advancing. This needs to be verified/reproduced, and fixed.

- performance issues: We've agreed the case with a lot of aborts (when DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've discussed whether the overhead with many sequence changes (nextval-40) is acceptable, and/or how to improve it.

Next, I'll go over these points in more detail, with my understanding of what the challenges are, possible solutions, etc. Most of this was discussed/agreed on the call, but some are ideas I had only after the call, when writing this summary.

1) desirability of the feature

Firstly, do we actually want/need this feature? I believe that's very much a question of what use cases we're targeting.
If we only focus on distributed databases (particularly those with multiple active nodes), then we probably agree that the right solution is to not use sequences (~generators of incrementing values) but UUIDs or similar random identifiers (better not to call them sequences, there's not much sequential about them). The huge advantage is that this does not require replicating any state between the nodes, so logical decoding can simply ignore them and replicate just the generated values. I don't think there's any argument about that. If I was building such a distributed system, I'd certainly use such random IDs.

The question is what to do about the other use cases - online upgrades relying on logical decoding, failovers to logical replicas, and so on. Or what to do about existing systems that can't be easily changed to use different/random identifiers. Those are not really distributed systems and therefore don't quite need random IDs. Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can easily lead to massive write amplification, for example. There are variants like UUIDv7 reducing the impact, but there's other stuff.

My takeaway from this is that there's still value in having this feature.

2) correctness

The only correctness issue I'm aware of is the question of what happens when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding the relfilenode creation and the sequence increment, pointed out by Dilip in [1]. If this happens (and while I don't have a reproducer, I also don't have a very clear idea why it couldn't happen), it breaks how the patch decides between transactional and non-transactional sequence changes.

So this seems like a fatal flaw - it definitely needs to be solved. I don't have a good idea how to do that, unfortunately. The problem is the dependency on an earlier record, and that this needs to be evaluated immediately (in the decode phase).
Logical messages don't have the same issue because the "transactional" flag does not depend on earlier stuff, and other records are not interpreted until apply/commit, when we know everything relevant was decoded. I don't know what the solution is. Either we find a way to make sure not to lose/skip the smgr record, or we need to rethink how we determine the transactional flag (perhaps even try again adding it to the WAL record, but we didn't find a way to do that earlier). 3) performance issues We have discussed two cases - "ddl-abort" and "nextval-40". The "ddl-abort" is when the workload does a lot of DDL and then aborts them, leading to profiles dominated by DecodeCommit. The agreement here is that while this is a valid issue and we should try fixing it, it's unrelated to this patch. The issue exists even on master. So in the context of this patch we can ignore this issue. The "nextval-40" applies to workloads doing a lot of regular sequence changes. We only decode/apply changes written to WAL, and that happens only for every 32 increments or so. The test was with a very simple transaction (just
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra wrote:
>
> Yes, if something like this happens, that'd be a problem:
>
> 1) decoding starts, with
>
>    SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT
>
> 2) transaction that creates a new relfilenode gets decoded, but we skip
>    it because we don't have the correct snapshot
>
> 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT
>
> 4) we decode sequence change from nextval() for the sequence
>
> This would lead to us attempting to apply sequence change for a
> relfilenode that's not visible yet (and may even get aborted).
>
> But can this even happen? Can we start decoding in the middle of a
> transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> messages, where we also call the output plugin in non-transactional cases.

It's not a problem for logical messages, because whether the message is transactional or non-transactional is decided when the message itself is WAL-logged. But here our problem starts with deciding whether the change is transactional vs. non-transactional: if we insert the 'relfilenode' into the hash, then subsequent sequence changes in the same transaction are considered transactional, otherwise non-transactional.

And XLOG_HEAP2_NEW_CID is just for changing snapshot->curcid, which only affects the catalog visibility of the upcoming operations in the same transaction. That's not an issue, because if some of the changes of this transaction were seen while the snapbuild state was < SNAPBUILD_FULL_SNAPSHOT, then this transaction has to have committed before the state changed to SNAPBUILD_CONSISTENT_SNAPSHOT, i.e. the commit LSN of this transaction is going to be < start_decoding_at.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra wrote: > > On 12/6/23 12:05, Dilip Kumar wrote: > > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila wrote: > >> > >>> Why can't we use the same concept of > >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > >>> non-transactional changes (have some base snapshot before the first > >>> change), and whenever there is any catalog change, queue new snapshot > >>> change also in the queue of the non-transactional sequence change so > >>> that while sending it to downstream whenever it is necessary we will > >>> change the historic snapshot? > >>> > >> > >> Oh, do you mean maintain different historic snapshots and then switch > >> based on the change we are processing? I guess the other thing we need > >> to consider is the order of processing the changes if we maintain > >> separate queues that need to be processed. > > > > I mean we will not specifically maintain the historic changes, but if > > there is any catalog change where we are pushing the snapshot to all > > the transaction's change queue, at the same time we will push this > > snapshot in the non-transactional sequence queue as well. I am not > > sure what is the problem with the ordering? > > Currently, we set up the historic snapshot before starting a transaction to process the change and then adapt the updates to it while processing the changes for the transaction. Now, while processing this new queue of non-transactional sequence messages, we probably need a separate snapshot and updates to it. So, either we need some sort of switching between snapshots or do it in different transactions. > > because we will be > > queueing all non-transactional sequence changes in a separate queue in > > the order they arrive and as soon as we process the next commit we > > will process all the non-transactional changes at that time. Do you > > see issue with that? 
> > > > Isn't this (in principle) the idea of queuing the non-transactional > changes and then applying them on the next commit? Yes, I didn't get > very far with that, but I got stuck exactly on tracking which snapshot > to use, so if there's a way to do that, that'd fix my issue. > > Also, would this mean we don't need to track the relfilenodes, if we're > able to query the catalog? Would we be able to check if the relfilenode > was created by the current xact? > I thought this new mechanism was for processing a queue of non-transactional sequence changes. The tracking of relfilenodes is to distinguish between transactional and non-transactional messages, so I think we probably still need that. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra wrote:
>
> On 12/6/23 12:05, Dilip Kumar wrote:
> > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila wrote:
> >>
> >>> Why can't we use the same concept of
> >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> >>> non-transactional changes (have some base snapshot before the first
> >>> change), and whenever there is any catalog change, queue new snapshot
> >>> change also in the queue of the non-transactional sequence change so
> >>> that while sending it to downstream whenever it is necessary we will
> >>> change the historic snapshot?
> >>>
> >>
> >> Oh, do you mean maintain different historic snapshots and then switch
> >> based on the change we are processing? I guess the other thing we need
> >> to consider is the order of processing the changes if we maintain
> >> separate queues that need to be processed.
> >
> > I mean we will not specifically maintain the historic changes, but if
> > there is any catalog change where we are pushing the snapshot to all
> > the transaction's change queue, at the same time we will push this
> > snapshot in the non-transactional sequence queue as well. I am not
> > sure what is the problem with the ordering? because we will be
> > queueing all non-transactional sequence changes in a separate queue in
> > the order they arrive and as soon as we process the next commit we
> > will process all the non-transactional changes at that time. Do you
> > see issue with that?
> >
> Isn't this (in principle) the idea of queuing the non-transactional
> changes and then applying them on the next commit?

Yes, it is.

> Yes, I didn't get very far with that, but I got stuck exactly on
> tracking which snapshot to use, so if there's a way to do that, that'd
> fix my issue.
Thinking more about the snapshot issue: do we even need to bother with changing the snapshot at all while streaming the non-transactional sequence changes, or can we send all the non-transactional changes with a single snapshot? A snapshot logically changes due to 2 events: (case 1) when any transaction that has done a catalog operation gets committed (this changes the global snapshot), and (case 2) when there is some catalog change within a transaction (this just updates the 'curcid' in the base snapshot of the transaction).

Now, if we stream all the non-transactional sequence changes right before the next commit, then we are not bothered about (case 1) at all, because all the changes we have queued so far are from before this commit. And if we come to (case 2): if we perform any catalog change on the sequence itself, the following changes on the same sequence are considered transactional; and if the changes are on some other catalog (not relevant to our sequence operation), then we need not worry about the command-id change either, because the visibility of the catalog lookup for our sequence is unaffected by it.

In short, I am trying to say that we can safely queue the non-transactional sequence changes and stream them based on the snapshot we got when we decoded the first change, and as long as we plan to stream just before the next commit (or the next in-progress stream), we never need to update the snapshot.

> Also, would this mean we don't need to track the relfilenodes, if we're
> able to query the catalog? Would we be able to check if the relfilenode
> was created by the current xact?

I think by querying the catalog and checking the xmin we should be able to figure that out, but isn't that costlier than looking up the relfilenode in the hash?
Because just to identify whether the changes are transactional or non-transactional you would have to query the catalog; that means for each change, before we decide whether to add it to the transaction's change queue or to the non-transactional change queue, we would have to query the catalog, i.e. start/stop a transaction?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 6, 2023 at 7:20 PM Tomas Vondra wrote:
>
> On 12/6/23 11:19, Amit Kapila wrote:
> > On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra wrote:
> >>
> >> On 12/3/23 18:52, Tomas Vondra wrote:
> >>> ...
> >>
> >> Another idea is that maybe we could somehow inform ReorderBuffer whether
> >> the output plugin even is interested in sequences. That'd help with
> >> cases where we don't even want/need to replicate sequences, e.g. because
> >> the publication does not specify (publish=sequence).
> >>
> >> What happens now in that case is we call ReorderBufferQueueSequence(),
> >> it does the whole dance with starting/aborting the transaction, calls
> >> rb->sequence() which just does "meh" and doesn't do anything. Maybe we
> >> could just short-circuit this by asking the output plugin somehow.
> >>
> >> In an extreme case the plugin may not even specify the sequence
> >> callbacks, and we're still doing all of this.
> >>
> >
> > We could explore this but I guess it won't solve the problem we are
> > facing in cases where all sequences are published and plugin has
> > specified the sequence callbacks. I think it would add some overhead
> > of this check in positive cases where we decide to anyway do send the
> > changes.
>
> Well, the idea is the check would be very simple (essentially just a
> boolean flag somewhere), so not really measurable.
>
> And if the plugin requests decoding sequences, I guess it's natural it
> may have a bit of overhead. It needs to do more things, after all. It
> needs to be acceptable, ofc.
>

I agree with you that if it can be done cheaply, or without a measurable overhead, then it would be a good idea and can serve other purposes as well. For example, see the discussion in [1]. I had in mind more of what the patch in email [1] is doing, where it needs to start/stop a xact, do relcache access, etc., which seems like it can add some overhead if done for each change, though I haven't measured it so can't be sure.
[1] - https://www.postgresql.org/message-id/CAGfChW5Qo2SrjJ7rU9YYtZbRaWv6v-Z8MJn%3DdQNx4uCSqDEOHA%40mail.gmail.com -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On 12/6/23 09:56, Amit Kapila wrote:
> On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra wrote:
>>
>> On 12/5/23 13:17, Amit Kapila wrote:
>>
>>> (b) for transactional
>>> cases, we see overhead due to traversing all the top-level txns and
>>> check the hash table for each one to find whether change is
>>> transactional.
>>>
>>
>> Not really, no. As I explained in my preceding e-mail, this check makes
>> almost no difference - I did expect it to matter, but it doesn't. And I
>> was a bit disappointed the global hash table didn't move the needle.
>>
>> Most of the time is spent in
>>
>>   78.81%  0.00%  postgres  postgres  [.] DecodeCommit (inlined)
>>           |
>>           ---DecodeCommit (inlined)
>>              |
>>              |--72.65%--SnapBuildCommitTxn
>>              |          |
>>              |           --72.61%--SnapBuildBuildSnapshot
>>              |                     |
>>              |                      --72.09%--pg_qsort
>>              |                                |
>>              |                                |--66.24%--pg_qsort
>>              |                                |          |
>>
>> And there's almost no difference between master and build with sequence
>> decoding - see the attached diff-alter-sequence.perf, comparing the two
>> branches (perf diff -c delta-abs).
>>
>
> I think in this the commit time predominates which hides the overhead.
> We didn't investigate in detail if that can be improved but if we see
> a similar case of abort [1], it shows the overhead of
> ReorderBufferSequenceIsTransactional(). I understand that aborts won't
> be frequent and it is sort of unrealistic test but still helps to show
> that there is overhead in ReorderBufferSequenceIsTransactional(). Now,
> I am not sure if we can ignore that case because theoretically, the
> overhead can increase based on the number of top-level transactions.
>
> [1]:
> https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com
>

But those profiles were with the "old" patch, with one hash table per top-level transaction. I see nothing like that with the patch [1] that replaces that with a single global hash table.
With that patch, the ReorderBufferSequenceIsTransactional() took ~0.5% in any tests I did. What did have a bigger impact is this:

```
46.12%  1.47%  postgres  [.] pg_logical_slot_get_changes_guts
        |
        |--45.12%--pg_logical_slot_get_changes_guts
        |          |
        |          |--42.34%--LogicalDecodingProcessRecord
        |          |          |
        |          |          |--12.82%--xact_decode
        |          |          |          |
        |          |          |          |--9.46%--DecodeAbort (inlined)
        |          |          |          |          |
        |          |          |          |          |--8.44%--ReorderBufferCleanupTXN
        |          |          |          |          |          |
        |          |          |          |          |          |--3.25%--ReorderBufferSequenceCleanup (inlined)
        |          |          |          |          |          |          |
        |          |          |          |          |          |          |--1.59%--hash_seq_search
        |          |          |          |          |          |          |
        |          |          |          |          |          |           --0.80%--hash_search_with_hash_value
        |          |          |          |          |          |
        |          |          |          |          |           --0.59%--hash_search
        |          |          |          |          |                     hash_bytes
```

I guess that could be optimized, but it's also a direct consequence of the huge number of aborts for transactions that create a relfilenode. For any other workload this will be negligible.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 12/6/23 11:19, Amit Kapila wrote: > On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra > wrote: >> >> On 12/3/23 18:52, Tomas Vondra wrote: >>> ... >>> >> >> Another idea is that maybe we could somehow inform ReorderBuffer whether >> the output plugin even is interested in sequences. That'd help with >> cases where we don't even want/need to replicate sequences, e.g. because >> the publication does not specify (publish=sequence). >> >> What happens now in that case is we call ReorderBufferQueueSequence(), >> it does the whole dance with starting/aborting the transaction, calls >> rb->sequence() which just does "meh" and doesn't do anything. Maybe we >> could just short-circuit this by asking the output plugin somehow. >> >> In an extreme case the plugin may not even specify the sequence >> callbacks, and we're still doing all of this. >> > > We could explore this but I guess it won't solve the problem we are > facing in cases where all sequences are published and plugin has > specified the sequence callbacks. I think it would add some overhead > of this check in positive cases where we decide to anyway do send the > changes. Well, the idea is the check would be very simple (essentially just a boolean flag somewhere), so not really measurable. And if the plugin requests decoding sequences, I guess it's natural it may have a bit of overhead. It needs to do more things, after all. It needs to be acceptable, ofc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 12/6/23 12:05, Dilip Kumar wrote: > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila wrote: >> >>> Why can't we use the same concept of >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the >>> non-transactional changes (have some base snapshot before the first >>> change), and whenever there is any catalog change, queue new snapshot >>> change also in the queue of the non-transactional sequence change so >>> that while sending it to downstream whenever it is necessary we will >>> change the historic snapshot? >>> >> >> Oh, do you mean maintain different historic snapshots and then switch >> based on the change we are processing? I guess the other thing we need >> to consider is the order of processing the changes if we maintain >> separate queues that need to be processed. > > I mean we will not specifically maintain the historic changes, but if > there is any catalog change where we are pushing the snapshot to all > the transaction's change queue, at the same time we will push this > snapshot in the non-transactional sequence queue as well. I am not > sure what is the problem with the ordering? because we will be > queueing all non-transactional sequence changes in a separate queue in > the order they arrive and as soon as we process the next commit we > will process all the non-transactional changes at that time. Do you > see issue with that? > Isn't this (in principle) the idea of queuing the non-transactional changes and then applying them on the next commit? Yes, I didn't get very far with that, but I got stuck exactly on tracking which snapshot to use, so if there's a way to do that, that'd fix my issue. Also, would this mean we don't need to track the relfilenodes, if we're able to query the catalog? Would we be able to check if the relfilenode was created by the current xact? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 12/6/23 10:05, Dilip Kumar wrote:
> On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar wrote:
>>
>> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra wrote:
>>>
>
> I was also wondering what happens if the sequence changes are
> transactional but somehow the snap builder state changes to
> SNAPBUILD_FULL_SNAPSHOT in between the processing of the smgr_decode()
> and the seq_decode(), which means the RelFileLocator will not be added
> to the hash table, and during the seq_decode() we will consider the
> change as non-transactional. I haven't fully analyzed what the real
> problem would be in this case, but have we considered it? What happens
> if the transaction having both ALTER SEQUENCE and nextval() gets
> aborted, but the nextval() has been considered non-transactional
> because the smgr_decode() changes were not processed (since the snap
> builder state was not yet SNAPBUILD_FULL_SNAPSHOT)?
>

Yes, if something like this happens, that'd be a problem:

1) decoding starts, with

   SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT

2) transaction that creates a new relfilenode gets decoded, but we skip
   it because we don't have the correct snapshot

3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT

4) we decode sequence change from nextval() for the sequence

This would lead to us attempting to apply a sequence change for a relfilenode that's not visible yet (and may even get aborted).

But can this even happen? Can we start decoding in the middle of a transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID, which is also skipped until SNAPBUILD_FULL_SNAPSHOT? Or logical messages, where we also call the output plugin in non-transactional cases?

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila wrote: > > > Why can't we use the same concept of > > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > > non-transactional changes (have some base snapshot before the first > > change), and whenever there is any catalog change, queue new snapshot > > change also in the queue of the non-transactional sequence change so > > that while sending it to downstream whenever it is necessary we will > > change the historic snapshot? > > > > Oh, do you mean maintain different historic snapshots and then switch > based on the change we are processing? I guess the other thing we need > to consider is the order of processing the changes if we maintain > separate queues that need to be processed. I mean we will not specifically maintain the historic changes, but if there is any catalog change where we are pushing the snapshot to all the transaction's change queue, at the same time we will push this snapshot in the non-transactional sequence queue as well. I am not sure what is the problem with the ordering? because we will be queueing all non-transactional sequence changes in a separate queue in the order they arrive and as soon as we process the next commit we will process all the non-transactional changes at that time. Do you see issue with that? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra wrote: > > On 12/3/23 18:52, Tomas Vondra wrote: > > ... > > > > Another idea is that maybe we could somehow inform ReorderBuffer whether > the output plugin even is interested in sequences. That'd help with > cases where we don't even want/need to replicate sequences, e.g. because > the publication does not specify (publish=sequence). > > What happens now in that case is we call ReorderBufferQueueSequence(), > it does the whole dance with starting/aborting the transaction, calls > rb->sequence() which just does "meh" and doesn't do anything. Maybe we > could just short-circuit this by asking the output plugin somehow. > > In an extreme case the plugin may not even specify the sequence > callbacks, and we're still doing all of this. > We could explore this but I guess it won't solve the problem we are facing in cases where all sequences are published and plugin has specified the sequence callbacks. I think it would add some overhead of this check in positive cases where we decide to anyway do send the changes. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar wrote: > > On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra > wrote: > > > > > Some time ago I floated the idea of maybe "queuing" the sequence changes > > and only replay them on the next commit, somehow. But we did ran into > > problems with which snapshot to use, that I didn't know how to solve. > > Maybe we should try again. The idea is we'd queue the non-transactional > > changes somewhere (can't be in the transaction, because we must keep > > them even if it aborts), and then "inject" them into the next commit. > > That'd mean we wouldn't do the separate start/abort for each change. > > Why can't we use the same concept of > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the > non-transactional changes (have some base snapshot before the first > change), and whenever there is any catalog change, queue new snapshot > change also in the queue of the non-transactional sequence change so > that while sending it to downstream whenever it is necessary we will > change the historic snapshot? > Oh, do you mean maintain different historic snapshots and then switch based on the change we are processing? I guess the other thing we need to consider is the order of processing the changes if we maintain separate queues that need to be processed. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar wrote: > > On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra > wrote: > > I was also wondering what happens if the sequence changes are transactional but somehow the snap builder state changes to SNAPBUILD_FULL_SNAPSHOT in between the processing of smgr_decode() and seq_decode(), which means the RelFileLocator will not be added to the hash table, and during seq_decode() we will consider the change as non-transactional. I haven't fully analyzed what the real problem would be in this case, but have we considered it? What happens if a transaction having both ALTER SEQUENCE and nextval() gets aborted, but the nextval() has been considered non-transactional because the smgr_decode() changes were not processed (the snap builder state not yet being SNAPBUILD_FULL_SNAPSHOT)? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
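The window described above can be made concrete with a toy model (plain Python with invented names; the real logic lives in smgr_decode(), seq_decode(), and the snapshot builder):

```python
# Toy model of the suspected race: the smgr record is only remembered
# once the snapshot builder has reached FULL_SNAPSHOT.  If the state
# flips between the smgr record and the sequence record, the later
# lookup misses and the change is (wrongly) treated as non-transactional.
BUILDING, FULL_SNAPSHOT = "building", "full_snapshot"

def decode(records):
    created = set()          # relfilenodes created by in-progress xacts
    decisions = []
    for kind, rlocator, builder_state in records:
        if kind == "smgr":
            if builder_state == FULL_SNAPSHOT:   # otherwise the record is skipped
                created.add(rlocator)
        elif kind == "seq":
            decisions.append("transactional" if rlocator in created
                             else "non-transactional")
    return decisions

# ALTER SEQUENCE logs the smgr record while the builder is still catching
# up; the nextval() on the new relfilenode arrives after the state flips.
records = [("smgr", "rnode1", BUILDING),
           ("seq", "rnode1", FULL_SNAPSHOT)]
```

Under this model the single sequence change comes out misclassified as non-transactional, which is exactly the scenario the question is about.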
Re: logical decoding and replication of sequences, take 2
On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra wrote: > > On 12/5/23 13:17, Amit Kapila wrote: > > > (b) for transactional > > cases, we see overhead due to traversing all the top-level txns and > > check the hash table for each one to find whether change is > > transactional. > > > > Not really, no. As I explained in my preceding e-mail, this check makes > almost no difference - I did expect it to matter, but it doesn't. And I > was a bit disappointed the global hash table didn't move the needle. > > Most of the time is spent in > > 78.81% 0.00% postgres postgres [.] DecodeCommit (inlined) > | > ---DecodeCommit (inlined) > | > |--72.65%--SnapBuildCommitTxn > | | > | --72.61%--SnapBuildBuildSnapshot > || > | --72.09%--pg_qsort > || > ||--66.24%--pg_qsort > || | > > And there's almost no difference between master and build with sequence > decoding - see the attached diff-alter-sequence.perf, comparing the two > branches (perf diff -c delta-abs). > I think in this case the commit time predominates, which hides the overhead. We didn't investigate in detail whether that can be improved, but if we look at a similar test with aborts [1], it shows the overhead of ReorderBufferSequenceIsTransactional(). I understand that aborts won't be frequent and that it is a somewhat unrealistic test, but it still helps to show that ReorderBufferSequenceIsTransactional() has real overhead. Now, I am not sure we can ignore that case, because theoretically the overhead grows with the number of top-level transactions. [1]: https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
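The scaling concern is easy to see in a toy model: with one hash table per top-level transaction, every lookup pays one probe per open transaction, whereas a single shared table pays one probe total (Python sketch with invented names, not the actual reorderbuffer code):

```python
def is_transactional_per_txn(toplevel_txns, rlocator, probes):
    # One hash probe per top-level transaction: cost grows linearly
    # with the number of concurrent top-level transactions.
    for txn in toplevel_txns:
        probes.append(txn["xid"])
        if rlocator in txn["relfilenodes"]:
            return True
    return False

def is_transactional_global(global_hash, rlocator):
    # A single probe into one shared (relfilenode -> xid) table.
    return rlocator in global_hash
```

With T open top-level transactions and none of them owning the relfilenode, the first variant does T probes per sequence change before answering "non-transactional"; the second answers after one.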
Re: logical decoding and replication of sequences, take 2
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra wrote: > > Some time ago I floated the idea of maybe "queuing" the sequence changes > and only replaying them on the next commit, somehow. But we did run into > problems with which snapshot to use, that I didn't know how to solve. > Maybe we should try again. The idea is we'd queue the non-transactional > changes somewhere (can't be in the transaction, because we must keep > them even if it aborts), and then "inject" them into the next commit. > That'd mean we wouldn't do the separate start/abort for each change. Why can't we use the same concept as SnapBuildDistributeNewCatalogSnapshot()? I mean, we keep queuing the non-transactional changes (with some base snapshot taken before the first change), and whenever there is a catalog change, we also queue a snapshot-change entry in the same queue of non-transactional sequence changes, so that while sending them downstream we can switch the historic snapshot whenever necessary. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
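The proposal could look roughly like the following toy model: one queue holds the non-transactional sequence changes, and a "new snapshot" marker is enqueued at each catalog change, so replay applies each change under the snapshot that was current when it was queued (Python sketch, invented names, in the spirit of SnapBuildDistributeNewCatalogSnapshot()):

```python
# Toy model of queuing non-transactional sequence changes interleaved
# with snapshot-change markers.
def replay(queue, base_snapshot):
    snap = base_snapshot
    sent = []
    for kind, payload in queue:
        if kind == "new_snapshot":     # a catalog change happened here
            snap = payload
        else:                          # a sequence change
            sent.append((payload, snap))
    return sent

queue = [("seq_change", "s = 101"),
         ("new_snapshot", "snap2"),    # catalog change between the two
         ("seq_change", "s = 102")]
```

Replaying this queue sends the first change under the base snapshot and the second under the newer one, which is the behavior the question is aiming for.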
Re: logical decoding and replication of sequences, take 2
On 12/5/23 13:17, Amit Kapila wrote: > ... >> I was hopeful the global hash table would be an improvement, but that >> doesn't seem to be the case. I haven't done much profiling yet, but I'd >> guess most of the overhead is due to ReorderBufferQueueSequence() >> starting and aborting a transaction in the non-transactinal case. Which >> is unfortunate, but I don't know if there's a way to optimize that. >> > > Before discussing the alternative ideas you shared, let me try to > clarify my understanding so that we are on the same page. I see two > observations based on the testing and discussion we had (a) for > non-transactional cases, the overhead observed is mainly due to > starting/aborting a transaction for each change; Yes, I believe that's true. See the attached profiles for nextval.sql and nextval-40.sql from master and optimized build (with the global hash), and also a perf-diff. I only include the top 1000 lines for each profile, that should be enough. master - current master without patches applied optimized - master + sequence decoding with global hash table For nextval, there's almost no difference in the profile. Decoding the other changes (inserts) is the dominant part, as we only log sequences every 32 increments. For nextval-40, the main increase is likely due to this part |--11.09%--seq_decode | | | |--9.25%--ReorderBufferQueueSequence | | | | | |--3.56%--AbortCurrentTransaction | | || | | | --3.53%--AbortSubTransaction | | || | | ||--0.95%--AtSubAbort_Portals | | || | | | || --0.83%--hash_seq_search | | || | | | --0.83%--ResourceOwnerReleaseInternal | | | | | |--2.06%--BeginInternalSubTransaction | | | | | | | --1.10%--CommitTransactionCommand | | | | | | | --1.07%--StartSubTransaction | | | | | |--1.28%--CleanupSubTransaction | | | | | | | --0.64%--AtSubCleanup_Portals | | | | | | | --0.55%--hash_seq_search | | | | | --0.67%--RelidByRelfilenumber So yeah, that's the transaction stuff in ReorderBufferQueueSequence. 
There's also a perf diff, comparing individual functions. > (b) for transactional > cases, we see overhead due to traversing all the top-level txns and > check the hash table for each one to find whether change is > transactional. > Not really, no. As I explained in my preceding e-mail, this check makes almost no difference - I did expect it to matter, but it doesn't. And I was a bit disappointed the global hash table didn't move the needle. Most of the time is spent in 78.81% 0.00% postgres postgres [.] DecodeCommit (inlined) | ---DecodeCommit (inlined) | |--72.65%--SnapBuildCommitTxn | | | --72.61%--SnapBuildBuildSnapshot || | --72.09%--pg_qsort || ||--66.24%--pg_qsort || | And there's almost no difference between master and the build with sequence decoding - see the attached diff-alter-sequence.perf, comparing the two branches (perf diff -c delta-abs). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company alter-sequence-master.perf.gz Description: application/gzip alter-sequence-optimized.perf.gz Description: application/gzip diff-alter-sequence.perf.gz Description: application/gzip diff-nextval.perf.gz Description: application/gzip diff-nextval-40.perf.gz Description: application/gzip nextval-40-master.perf.gz Description: application/gzip nextval-40-optimized.perf.gz Description: application/gzip nextval-master.perf.gz Description: application/gzip nextval-optimized.perf.gz Description: application/gzip
Re: logical decoding and replication of sequences, take 2
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra wrote: > > Thanks for the script. Are you also measuring the time it takes to > decode this using test_decoding? > > FWIW I did a more comprehensive suite of tests over the weekend, with a > couple more variations. I'm attaching the updated scripts, running it > should be as simple as > > ./run.sh BRANCH TRANSACTIONS RUNS > > so perhaps > > ./run.sh master 1000 3 > > to do 3 runs with 1000 transactions per client. And it'll run a bunch of > combinations hard-coded in the script, and write the timings into a CSV > file (with "master" in each row). > > I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I > did this with current master, the basic patch (without the 0002 part), > and then with the optimized approach (single global hash table, see the > 0004 part). That's what master / patched / optimized in the results is. > > Interestingly enough, the i5 handled this much faster, it seems to be > better in single-core tasks. The xeon is still running, so the results > for "optimized" only have one run (out of 3), but shouldn't change much. > > Attached is also a table summarizing this, and visualizing the timing > change (vs. master) in the last couple columns. Green is "faster" than > master (but we don't really expect that), and "red" means slower than > master (the more red, the slower). > > The results are grouped by script (see the attached .tgz), with either > 32 or 96 clients (which does affect the timing, but not between master > and patch). Some executions have no pg_sleep() calls, some have 0.001 > wait (but that doesn't seem to make much difference). > > Overall, I'd group the results into about three groups: > > 1) good cases [nextval, nextval-40, nextval-abort] > > These are cases that slow down a bit, but the slowdown is mostly within > reasonable bounds (we're making the decoding do more stuff, so it'd > be a bit silly to require that extra work to make no impact).
And I do > think this is reasonable, because this is pretty much an extreme / worst > case behavior. People don't really do just nextval() calls, without > doing anything else. Not to mention doing aborts for 100% of transactions. > > So in practice this is going to be within noise (and in those cases the > results even show speedup, which seems a bit surprising). It's somewhat > dependent on CPU too - on xeon there's hardly any regression. > > > 2) nextval-40-abort > > Here the slowdown is clear, but I'd argue it generally falls in the same > group as (1). Yes, I'd be happier if it didn't behave like this, but if > someone can show me a practical workload affected by this ... > > > 3) irrelevant cases [all the alters taking insane amounts of time] > > I absolutely refuse to care about these extreme cases where decoding > 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on > xeon). If this was a problem for some practical workload, we'd have > already heard about it I guess. And even if there was such a workload, it > wouldn't be up to this patch to fix that. There's clearly something > misbehaving in the snapshot builder. > > > I was hopeful the global hash table would be an improvement, but that > doesn't seem to be the case. I haven't done much profiling yet, but I'd > guess most of the overhead is due to ReorderBufferQueueSequence() > starting and aborting a transaction in the non-transactional case. Which > is unfortunate, but I don't know if there's a way to optimize that. > Before discussing the alternative ideas you shared, let me try to clarify my understanding so that we are on the same page. I see two observations based on the testing and discussion we had: (a) for non-transactional cases, the overhead observed is mainly due to starting/aborting a transaction for each change; (b) for transactional cases, we see overhead due to traversing all the top-level txns and checking the hash table for each one to find whether the change is transactional.
Am I missing something? -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On 12/3/23 18:52, Tomas Vondra wrote: > ... > > Some time ago I floated the idea of maybe "queuing" the sequence changes > and only replaying them on the next commit, somehow. But we did run into > problems with which snapshot to use, that I didn't know how to solve. > Maybe we should try again. The idea is we'd queue the non-transactional > changes somewhere (can't be in the transaction, because we must keep > them even if it aborts), and then "inject" them into the next commit. > That'd mean we wouldn't do the separate start/abort for each change. > Another idea is that maybe we could somehow inform ReorderBuffer whether the output plugin is even interested in sequences. That'd help with cases where we don't even want/need to replicate sequences, e.g. because the publication does not specify (publish=sequence). What happens now in that case is that we call ReorderBufferQueueSequence(), it does the whole dance with starting/aborting the transaction, and calls rb->sequence(), which just says "meh" and doesn't do anything. Maybe we could just short-circuit this by asking the output plugin somehow. In an extreme case the plugin may not even specify the sequence callbacks, and we're still doing all of this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
RE: logical decoding and replication of sequences, take 2
Dear Tomas, > > I did also performance tests (especially case 3). First of all, there are > some > variants from yours. > > 1. patch 0002 was reverted because it has an issue. So this test checks > whether >refactoring around ReorderBufferSequenceIsTransactional seems really > needed. > > FWIW I also did the benchmarks without the 0002 patch, for the same > reason. I forgot to mention that. Oh, good news. So your benchmarks are quite meaningful. > > Interesting, so what exactly does the transaction do? It is quite simple - PSA the script file. It was executed with 64 concurrent clients. The definition of alter_sequence() is the same as yours. (I used a normal bash script for running them, but your approach may be smarter.) > Anyway, I don't > think this is very surprising - I believe it behaves like this because > of having to search in many hash tables (one in each toplevel xact). And > I think the solution I explained before (maintaining a single toplevel > hash, instead of many per-top-level hashes) would address that. Agreed. And I can benchmark again, once we decide on a new approach. Best Regards, Hayato Kuroda FUJITSU LIMITED one_client.sh Description: one_client.sh
Re: logical decoding and replication of sequences, take 2
On 12/1/23 12:08, Hayato Kuroda (Fujitsu) wrote: > Dear Tomas, > >> I did some micro-benchmarking today, trying to identify cases where this >> would cause unexpected problems, either due to having to maintain all >> the relfilenodes, or due to having to do hash lookups for every sequence >> change. But I think it's fine, mostly ... >> > > I did also performance tests (especially case 3). First of all, there are some > variants from yours. > > 1. patch 0002 was reverted because it has an issue. So this test checks > whether >refactoring around ReorderBufferSequenceIsTransactional seems really > needed. FWIW I also did the benchmarks without the 0002 patch, for the same reason. I forgot to mention that. > 2. per comments from Amit, I also measured the abort case. In this case, the >alter_sequence() is called but the transaction is aborted. > 3. I measured with changing number of clients {8, 16, 32, 64, 128}. In any > cases, >clients executed 1000 transactions. The performance machine has 128 core > so that >result for 128 clients might be saturated. > 4. a short sleep (0.1s) was added in alter_sequence(), especially between >"alter sequence" and nextval(). Because while testing, I found that the >transaction is too short to execute in parallel. I think it is reasonable >because ReorderBufferSequenceIsTransactional() might be worse when the > parallelism >is increased. > > I attached one backend process via perf and executed > pg_slot_logical_get_changes(). > Attached txt file shows which function occupied CPU time, especially from > pg_logical_slot_get_changes_guts() and ReorderBufferSequenceIsTransactional(). > Here are my observations about them. > > * In case of commit, as you said, SnapBuildCommitTxn() seems dominant for 8-64 > clients case. > * For (commit, 128 clients) case, however, ReorderBufferRestoreChanges() waste > many times. I think this is because changes exceed > logical_decoding_work_mem, > so we do not have to analyze anymore. 
> * In case of abort, CPU time used by ReorderBufferSequenceIsTransactional() > is linearly > longer. This means that we need to think some solution to avoid the > overhead by > ReorderBufferSequenceIsTransactional(). > > ``` > 8 clients 3.73% occupied time > 16 7.26% > 32 15.82% > 64 29.14% > 128 46.27% > ``` Interesting, so what exactly does the transaction do? Anyway, I don't think this is very surprising - I believe it behaves like this because of having to search in many hash tables (one in each toplevel xact). And I think the solution I explained before (maintaining a single toplevel hash, instead of many per-top-level hashes) would address that. FWIW I find this case interesting, but not very practical, because no practical workload has that many aborts. > > * In case of abort, I also checked CPU time used by > ReorderBufferAddRelFileLocator(), but > it seems not so depends on the number of clients. > > ``` > 8 clients 3.66% occupied time > 16 6.94% > 32 4.65% > 64 5.39% > 128 3.06% > ``` > > As next step, I've planned to run the case which uses setval() function, > because it > generates more WALs than normal nextval(); > How do you think? > Sure, although I don't think it's much different from the test selecting 40 values from the sequence (in each transaction). That generates about the same amount of WAL. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
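The single-top-level-hash idea could be sketched like this (toy Python with invented names, not the reorderbuffer code): one shared table maps relfilenode to the creating top-level xid, a reverse index allows cheap cleanup at commit/abort, and a lookup is a single probe regardless of how many transactions are open:

```python
# Toy model of a single shared relfilenode map replacing the
# per-top-level-transaction hash tables.
class GlobalRelfilenodeMap:
    def __init__(self):
        self.by_rnode = {}      # relfilenode -> creating top-level xid
        self.by_xid = {}        # xid -> set of relfilenodes it created

    def record_create(self, xid, rnode):
        # called when decoding an XLOG_SMGR_CREATE-style record
        self.by_rnode[rnode] = xid
        self.by_xid.setdefault(xid, set()).add(rnode)

    def creating_xid(self, rnode):
        # single O(1) probe, independent of the number of open xacts
        return self.by_rnode.get(rnode)

    def transaction_end(self, xid):
        # commit or abort: drop every entry owned by this transaction
        for rnode in self.by_xid.pop(xid, ()):
            del self.by_rnode[rnode]
```

This is exactly the structure whose cost should no longer grow with the number of top-level transactions in the abort-heavy benchmark above.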
Re: logical decoding and replication of sequences, take 2
On 11/30/23 12:56, Amit Kapila wrote: > On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra > wrote: >> >> 3) "bad case" - small transactions that generate a lot of relfilenodes >> >> select alter_sequence(); >> >> where the function is defined like this (I did create 1000 sequences >> before the test): >> >> CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ >> DECLARE >> v INT; >> BEGIN >> v := 1 + (random() * 999)::int; >> execute format('alter sequence s%s restart with 1000', v); >> perform nextval('s'); >> END; >> $$ LANGUAGE plpgsql; >> >> This performs terribly, but it's entirely unrelated to sequences. >> Current master has exactly the same problem, if transactions do DDL. >> Like this, for example: >> >> CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ >> DECLARE >> v INT; >> BEGIN >> v := 1 + (random() * 999)::int; >> execute format('create table t%s (a int)', v); >> execute format('drop table t%s', v); >> insert into t values (1); >> END; >> $$ LANGUAGE plpgsql; >> >> This has the same impact on master. The perf report shows this: >> >> --98.06%--pg_logical_slot_get_changes_guts >>| >> --97.88%--LogicalDecodingProcessRecord >> | >> --97.56%--xact_decode >> | >>--97.51%--DecodeCommit >> | >> |--91.92%--SnapBuildCommitTxn >> | | >> | --91.65%--SnapBuildBuildSnapshot >> | | >> | --91.14%--pg_qsort >> >> The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot >> takes so long is because: >> >> - >> Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) >> at snapbuild.c:498 >> 498+ sizeof(TransactionId) * builder->committed.xcnt >> (gdb) p builder->committed.xcnt >> $4 = 11532 >> - >> >> And with each iteration it grows by 1. >> > > Can we somehow avoid this either by keeping DDL-related xacts open or > aborting them? I'm not sure why the snapshot builder does this, i.e. why we end up accumulating that many xids, and I didn't have time to look closer. So I don't know if this would be a solution or not.
> Also, will it make any difference to use setval as > do_setval() seems to be logging each time? > I think that's pretty much what case (2) does, as it calls nextval() enough times for each transaction to generate WAL. But I don't think this is a very sensible benchmark - it's an extreme case, but practical cases are far closer to case (1), because sequences are intermixed with other activity. No one really does just nextval() calls. > If possible, can you share the scripts? Kuroda-San has access to the > performance machine, he may be able to try it as well. > Sure, attached. But it's a very primitive script, nothing fancy. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company scripts.tgz Description: application/compressed-tar
RE: logical decoding and replication of sequences, take 2
Dear Tomas, > I did some micro-benchmarking today, trying to identify cases where this > would cause unexpected problems, either due to having to maintain all > the relfilenodes, or due to having to do hash lookups for every sequence > change. But I think it's fine, mostly ... > I also did performance tests (especially case 3). First of all, there are some differences from yours. 1. patch 0002 was reverted because it has an issue. So this test checks whether the refactoring around ReorderBufferSequenceIsTransactional is really needed. 2. per comments from Amit, I also measured the abort case. In this case, alter_sequence() is called but the transaction is aborted. 3. I measured with a varying number of clients {8, 16, 32, 64, 128}. In all cases, clients executed 1000 transactions. The performance machine has 128 cores, so the result for 128 clients might be saturated. 4. a short sleep (0.1s) was added in alter_sequence(), especially between "alter sequence" and nextval(), because while testing I found that the transaction is too short to execute in parallel. I think this is reasonable because ReorderBufferSequenceIsTransactional() might get worse as the parallelism increases. I attached one backend process via perf and executed pg_logical_slot_get_changes(). The attached txt file shows which functions occupied CPU time, especially under pg_logical_slot_get_changes_guts() and ReorderBufferSequenceIsTransactional(). Here are my observations about them. * In case of commit, as you said, SnapBuildCommitTxn() seems dominant for the 8-64 clients cases. * For the (commit, 128 clients) case, however, ReorderBufferRestoreChanges() wastes a lot of time. I think this is because the changes exceed logical_decoding_work_mem, so they are spilled to disk and have to be restored. * In case of abort, the CPU time used by ReorderBufferSequenceIsTransactional() grows linearly. This means that we need to think of some solution to avoid the overhead of ReorderBufferSequenceIsTransactional().
```
8 clients    3.73% occupied time
16           7.26%
32          15.82%
64          29.14%
128         46.27%
```

* In case of abort, I also checked the CPU time used by ReorderBufferAddRelFileLocator(), but it does not seem to depend much on the number of clients.

```
8 clients    3.66% occupied time
16           6.94%
32           4.65%
64           5.39%
128          3.06%
```

As a next step, I've planned to run the case which uses the setval() function, because it generates more WAL than a normal nextval(). What do you think?

Best Regards, Hayato Kuroda FUJITSU LIMITED

# commit case

## 128 clients
- 97.77% 0.00% postgres postgres [.] pg_logical_slot_get_changes_guts
  - 97.77% pg_logical_slot_get_changes_guts
    - 97.77% LogicalDecodingProcessRecord
      - 97.71% xact_decode
        - 54.95% ReorderBufferProcessTXN
          - 32.64% ReorderBufferRestoreChanges
            + 23.37% FileRead
              4.58% __memcpy_ssse3_back
            + 4.14% MemoryContextAllocZero
          + 7.73% ReorderBufferCopySnap.isra.21
          + 5.94% ReorderBufferSerializeTXN
          + 5.78% ReorderBufferCleanupTXN
          + 2.04% RelationIdGetRelation
        - 42.75% SnapBuildCommitTxn
          - 34.56% ReorderBufferQueueChange
            - 34.36% ReorderBufferSerializeTXN
                27.64% __write_nocancel
                4.42% __memcpy_ssse3_back
              + 1.52% OpenTransientFilePerm
          + 7.97% SnapBuildBuildSnapshot
  0.04% 0.00% postgres postgres [.] ReorderBufferSequenceIsTransactional

## 64 clients
- 86.49% 0.04% postgres postgres [.] pg_logical_slot_get_changes_guts
  - 86.45% pg_logical_slot_get_changes_guts
    - 86.37% LogicalDecodingProcessRecord
      - 84.79% xact_decode
        - 51.05% SnapBuildCommitTxn
          + 49.77% SnapBuildBuildSnapshot
            0.53% ReorderBufferXidHasBaseSnapshot
        - 33.45% ReorderBufferProcessTXN
          + 21.37% ReorderBufferCopySnap.isra.21
          + 4.86% RelationIdGetRelation
          + 2.31% RelidByRelfilenumber
          + 0.84% AbortCurrentTransaction
            0.81% ReorderBufferCleanupTXN
        - 1.10% seq_decode
            0.65% ReorderBufferSequenceIsTransactional
- 1.04% 0.09% postgres postgres [.] ReorderBufferSequenceIsTransactional

## 32 clients
- 82.20% 0.13% postgres postgres [.] pg_logical_slot_get_changes_guts
  - 82.09% pg_logical_slot_get_changes_guts
    - 81.86% LogicalDecodingProcessRecord
      - 80.01% xact_decode
        - 49.38% SnapBuildCommitTxn
          + 48.10% SnapBuildBuildSnapshot
            0.50% ReorderBufferXidHasBaseSnapshot
        - 29.98% ReorderBufferProcessTXN
          + 10.52% ReorderBufferCopySnap.isra.21
          + 7.64% RelationIdGetRelation
          + 4.11% RelidByRelfilenumber
          + 1.70%
Re: logical decoding and replication of sequences, take 2
On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra wrote: > > 3) "bad case" - small transactions that generate a lot of relfilenodes > > select alter_sequence(); > > where the function is defined like this (I did create 1000 sequences > before the test): > > CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ > DECLARE > v INT; > BEGIN > v := 1 + (random() * 999)::int; > execute format('alter sequence s%s restart with 1000', v); > perform nextval('s'); > END; > $$ LANGUAGE plpgsql; > > This performs terribly, but it's entirely unrelated to sequences. > Current master has exactly the same problem, if transactions do DDL. > Like this, for example: > > CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ > DECLARE > v INT; > BEGIN > v := 1 + (random() * 999)::int; > execute format('create table t%s (a int)', v); > execute format('drop table t%s', v); > insert into t values (1); > END; > $$ LANGUAGE plpgsql; > > This has the same impact on master. The perf report shows this: > > --98.06%--pg_logical_slot_get_changes_guts >| > --97.88%--LogicalDecodingProcessRecord > | > --97.56%--xact_decode > | >--97.51%--DecodeCommit > | > |--91.92%--SnapBuildCommitTxn > | | > | --91.65%--SnapBuildBuildSnapshot > | | > | --91.14%--pg_qsort > > The sequence decoding is maybe ~1%. The reason why SnapBuildSnapshot > takes so long is because: > > - > Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) > at snapbuild.c:498 > 498+ sizeof(TransactionId) * builder->committed.xcnt > (gdb) p builder->committed.xcnt > $4 = 11532 > - > > And with each iteration it grows by 1. > Can we somehow avoid this either by keeping DDL-related xacts open or aborting them? Also, will it make any difference to use setval as do_setval() seems to be logging each time? If possible, can you share the scripts? Kuroda-San has access to the performance machine, he may be able to try it as well. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Wed, Nov 29, 2023 at 11:45 PM Tomas Vondra wrote: > > > > On 11/27/23 23:06, Peter Smith wrote: > > FWIW, here are some more minor review comments for v20231127-3-0001 > > > > == > > .../replication/logical/reorderbuffer.c > > > > 3. > > + * To decide if a sequence change is transactional, we maintain a hash > > + * table of relfilenodes created in each (sub)transactions, along with > > + * the XID of the (sub)transaction that created the relfilenode. The > > + * entries from substransactions are copied to the top-level transaction > > + * to make checks cheaper. The hash table gets cleaned up when the > > + * transaction completes (commit/abort). > > > > /substransactions/subtransactions/ > > > > Will fix. FYI - I think this typo still exists in the patch v20231128-0001. == Kind Regards, Peter Smith. Fujitsu Australia
Re: logical decoding and replication of sequences, take 2
On 11/29/23 15:41, Tomas Vondra wrote: > ... >> >> One thing that worries me about that approach is that it can suck with >> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE >> records. We have previously fixed some such workloads in logical >> decoding where decoding a transaction containing truncation of a table >> with a lot of partitions (1000 or more) used to take a very long time. >> Don't we face performance issues in such scenarios? >> > > I don't think we do, really. We will have to decode the SMGR records and > add the relfilenodes to the hash table(s), but I don't think that affects the > lookup performance much. What I think might be a problem is if we > have many top-level transactions, especially if those transactions do > something that creates a relfilenode. Because then we'll have to do a > hash_search for each of them, and that might be measurable even if each > lookup is O(1). And we do the lookup for every sequence change ... > I did some micro-benchmarking today, trying to identify cases where this would cause unexpected problems, either due to having to maintain all the relfilenodes, or due to having to do hash lookups for every sequence change. But I think it's fine, mostly ... I did all the following tests with 64 clients. I may try more, but even with this there should be a fair number of concurrent transactions, which determines the number of top-level transactions in reorderbuffer. I'll try with more clients tomorrow, but I don't think it'll change stuff. The test is fairly simple - run a particular number of transactions (might be 1000 * 64, or more). And then measure how long it takes to decode the changes using test_decoding. Now, the various workloads I tried: 1) "good case" - small OLTP transactions, a couple nextval('s') calls begin; insert into t values (1); select nextval('s'); insert into t values (1); commit; This is pretty fine, the sequence part of reorderbuffer is really not measurable, it's like 1% of the total CPU time.
Which is expected, because we only wal-log every 32nd increment or so. 2) "good case" - same as (1) but with more nextval calls, to always generate WAL begin; insert into t values (1); select nextval('s') from generate_series(1,40); insert into t values (1); commit; Here sequences are more measurable, it's like 15% of CPU time, but most of that comes from AbortCurrentTransaction() in the non-transactional branch of ReorderBufferQueueSequence. I don't think there's a way around that, and it's entirely unrelated to relfilenodes. The function checking if the change is transactional (ReorderBufferSequenceIsTransactional) is less than 1% of the profile - and this is the version that always walks all top-level transactions. 3) "bad case" - small transactions that generate a lot of relfilenodes select alter_sequence(); where the function is defined like this (I did create 1000 sequences before the test): CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$ DECLARE v INT; BEGIN v := 1 + (random() * 999)::int; execute format('alter sequence s%s restart with 1000', v); perform nextval('s'); END; $$ LANGUAGE plpgsql; This performs terribly, but it's entirely unrelated to sequences. Current master has exactly the same problem, if transactions do DDL. Like this, for example: CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$ DECLARE v INT; BEGIN v := 1 + (random() * 999)::int; execute format('create table t%s (a int)', v); execute format('drop table t%s', v); insert into t values (1); END; $$ LANGUAGE plpgsql; This has the same impact on master. The perf report shows this: --98.06%--pg_logical_slot_get_changes_guts | --97.88%--LogicalDecodingProcessRecord | --97.56%--xact_decode | --97.51%--DecodeCommit | |--91.92%--SnapBuildCommitTxn | | | --91.65%--SnapBuildBuildSnapshot | | | --91.14%--pg_qsort The sequence decoding is maybe ~1%.
The reason why SnapBuildSnapshot takes so long is because: - Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8) at snapbuild.c:498 498+ sizeof(TransactionId) * builder->committed.xcnt (gdb) p builder->committed.xcnt $4 = 11532 - And with each iteration it grows by 1. That looks quite weird, possibly a bug worth fixing, but unrelated to this patch. I can't investigate this more at the moment, not sure when/if I'll get to that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 11/29/23 14:42, Amit Kapila wrote: > On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra > wrote: >> >> I have been hacking on improving the improvements outlined in my >> preceding e-mail, but I have some bad news - I ran into an issue that I >> don't know how to solve :-( >> >> Consider this transaction: >> >> BEGIN; >> ALTER SEQUENCE s RESTART 1000; >> >> SAVEPOINT s1; >> ALTER SEQUENCE s RESTART 2000; >> ROLLBACK TO s1; >> >> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40); >> COMMIT; >> >> If you try this with the approach relying on rd_newRelfilelocatorSubid >> and rd_createSubid, it fails like this on the subscriber: >> >> ERROR: could not map filenode "base/5/16394" to relation OID >> >> This happens because ReorderBufferQueueSequence tries to do this in the >> non-transactional branch: >> >> reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber); >> >> and the relfilenode is the one created by the first ALTER. But this is >> obviously wrong - the changes should have been treated as transactional, >> because they are tied to the first ALTER. So how did we get there? >> >> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup >> resets the two fields to InvalidSubTransactionId. Which means the >> rollback in the above transaction also forgets about the first ALTER. >> Now that I look at the RelationData comments, it actually describes >> exactly this situation: >> >> * >> * rd_newRelfilelocatorSubid is the ID of the highest subtransaction >> * the most-recent relfilenumber change has survived into or zero if >> * not changed in the current transaction (or we have forgotten >> * changing it). This field is accurate when non-zero, but it can be >> * zero when a relation has multiple new relfilenumbers within a >> * single transaction, with one of them occurring in a subsequently >> * aborted subtransaction, e.g. 
>> *    BEGIN;
>> *    TRUNCATE t;
>> *    SAVEPOINT save;
>> *    TRUNCATE t;
>> *    ROLLBACK TO save;
>> *    -- rd_newRelfilelocatorSubid is now forgotten
>> *
>>
>> The root of this problem is that we'd need some sort of "history" for
>> the field, so that when a subxact aborts, we can restore the previous
>> value. But we obviously don't have that, and I doubt we want to add that
>> to relcache - for example, it'd either need to impose some limit on the
>> history (and thus a failure when we reach the limit), or it'd need to
>> handle histories of arbitrary length.
>>
>
> Yeah, I think that would be really tricky and we may not want to go there.
>
>> At this point I don't see a solution for this, which means the best way
>> forward with the sequence decoding patch seems to be the original
>> approach, on the decoding side.
>>
>
> One thing that worries me about that approach is that it can suck with
> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
> records. We have previously fixed some such workloads in logical
> decoding where decoding a transaction containing truncation of a table
> with a lot of partitions (1000 or more) used to take a very long time.
> Don't we face performance issues in such scenarios?
>

I don't think we do, really. We will have to decode the SMGR records and
add the relfilenodes to the hash table(s), but I don't think that
affects the lookup performance too much.

What I think might be a problem is if we have many top-level
transactions, especially if those transactions do something that creates
a relfilenode. Because then we'll have to do a hash_search for each of
them, and that might be measurable even if each lookup is O(1). And we
do the lookup for every sequence change ...

> How do we see this work w.r.t to some sort of global sequences? There
> is some recent discussion where I have raised a similar point [1].
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com
>

I think those are very different things, even though called "sequences".
AFAIK solutions like snowflakeID or UUIDs don't require replication of
any shared state (that's kinda the whole point), so I don't see why it
would need some special support in logical decoding.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra wrote: > > I have been hacking on improving the improvements outlined in my > preceding e-mail, but I have some bad news - I ran into an issue that I > don't know how to solve :-( > > Consider this transaction: > > BEGIN; > ALTER SEQUENCE s RESTART 1000; > > SAVEPOINT s1; > ALTER SEQUENCE s RESTART 2000; > ROLLBACK TO s1; > > INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40); > COMMIT; > > If you try this with the approach relying on rd_newRelfilelocatorSubid > and rd_createSubid, it fails like this on the subscriber: > > ERROR: could not map filenode "base/5/16394" to relation OID > > This happens because ReorderBufferQueueSequence tries to do this in the > non-transactional branch: > > reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber); > > and the relfilenode is the one created by the first ALTER. But this is > obviously wrong - the changes should have been treated as transactional, > because they are tied to the first ALTER. So how did we get there? > > Well, the whole problem is that in case of abort, AtEOSubXact_cleanup > resets the two fields to InvalidSubTransactionId. Which means the > rollback in the above transaction also forgets about the first ALTER. > Now that I look at the RelationData comments, it actually describes > exactly this situation: > > * > * rd_newRelfilelocatorSubid is the ID of the highest subtransaction > * the most-recent relfilenumber change has survived into or zero if > * not changed in the current transaction (or we have forgotten > * changing it). This field is accurate when non-zero, but it can be > * zero when a relation has multiple new relfilenumbers within a > * single transaction, with one of them occurring in a subsequently > * aborted subtransaction, e.g. 
> *BEGIN; > *TRUNCATE t; > *SAVEPOINT save; > *TRUNCATE t; > *ROLLBACK TO save; > *-- rd_newRelfilelocatorSubid is now forgotten > * > > The root of this problem is that we'd need some sort of "history" for > the field, so that when a subxact aborts, we can restore the previous > value. But we obviously don't have that, and I doubt we want to add that > to relcache - for example, it'd either need to impose some limit on the > history (and thus a failure when we reach the limit), or it'd need to > handle histories of arbitrary length. > Yeah, I think that would be really tricky and we may not want to go there. > At this point I don't see a solution for this, which means the best way > forward with the sequence decoding patch seems to be the original > approach, on the decoding side. > One thing that worries me about that approach is that it can suck with the workload that has a lot of DDLs that create XLOG_SMGR_CREATE records. We have previously fixed some such workloads in logical decoding where decoding a transaction containing truncation of a table with a lot of partitions (1000 or more) used to take a very long time. Don't we face performance issues in such scenarios? How do we see this work w.r.t to some sort of global sequences? There is some recent discussion where I have raised a similar point [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On 11/27/23 23:06, Peter Smith wrote: > FWIW, here are some more minor review comments for v20231127-3-0001 > > == > doc/src/sgml/logicaldecoding.sgml > > 1. > + The txn parameter contains meta information > about > + the transaction the sequence change is part of. Note however that for > + non-transactional updates, the transaction may be NULL, depending on > + if the transaction already has an XID assigned. > + The sequence_lsn has the WAL location of the > + sequence update. transactional says if the > + sequence has to be replayed as part of the transaction or directly. > > /says if/specifies whether/ > Will fix. > == > src/backend/commands/sequence.c > > 2. DecodeSeqTuple > > + memcpy(((char *) tuple->tuple.t_data), > +data + sizeof(xl_seq_rec), > +SizeofHeapTupleHeader); > + > + memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader, > +data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader, > +datalen); > > Maybe I am misreading but isn't this just copying 2 contiguous pieces > of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen) > achieve the same? > You're right, will fix. I think the code looked differently before, got simplified and I haven't noticed this can be a single memcpy(). > == > .../replication/logical/reorderbuffer.c > > 3. > + * To decide if a sequence change is transactional, we maintain a hash > + * table of relfilenodes created in each (sub)transactions, along with > + * the XID of the (sub)transaction that created the relfilenode. The > + * entries from substransactions are copied to the top-level transaction > + * to make checks cheaper. The hash table gets cleaned up when the > + * transaction completes (commit/abort). > > /substransactions/subtransactions/ > Will fix. > ~~~ > > 4. > + * A naive approach would be to just loop through all transactions and check > + * each of them, but there may be (easily thousands) of subtransactions, and > + * the check happens for each sequence change. So this could be very costly. 
> > /may be (easily thousands) of/may be (easily thousands of)/ > > ~~~ Thanks. I've reworded this to ... may be many (easily thousands of) subtransactions ... > > 5. ReorderBufferSequenceCleanup > > + while ((ent = (ReorderBufferSequenceEnt *) > hash_seq_search(&scan_status)) != NULL) > + { > + (void) hash_search(txn->toptxn->sequences_hash, > +(void *) &ent->rlocator, > +HASH_REMOVE, NULL); > + } > > Typically, other HASH_REMOVE code I saw would check result for NULL to > give elog(ERROR, "hash table corrupted"); > Good point, I'll add the error check. > ~~~ > > 6. ReorderBufferQueueSequence > > + if (xid != InvalidTransactionId) > + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > How about using the macro: TransactionIdIsValid > Actually, as I wrote in some other message, I think the check is not necessary. Or rather it should be an assert that XID is valid. And yeah, the macro is a good idea. > ~~~ > > 7. ReorderBufferQueueSequence > > + if (reloid == InvalidOid) > + elog(ERROR, "could not map filenode \"%s\" to relation OID", > + relpathperm(rlocator, > + MAIN_FORKNUM)); > > How about using the macro: OidIsValid > I chose to keep this consistent with other places in reorderbuffer, and all of them use the equality check. > ~~~ > > 8. > + /* > + * Calculate the first value of the next batch (at which point we > + * generate and decode another WAL record. > + */ > > Missing ')' > Will fix. > ~~~ > > 9. ReorderBufferAddRelFileLocator > > + /* > + * We only care about sequence relfilenodes for now, and those always have > + * a XID. So if there's no XID, don't bother adding them to the hash. > + */ > + if (xid == InvalidTransactionId) > + return; > > How about using the macro: TransactionIdIsValid > Will change. > ~~~ > > 10. 
ReorderBufferProcessTXN > > + if (reloid == InvalidOid) > + elog(ERROR, "could not map filenode \"%s\" to relation OID", > + relpathperm(change->data.sequence.locator, > + MAIN_FORKNUM)); > > How about using the macro: OidIsValid > Same as the other Oid check - consistency. > ~~~ > > 11. ReorderBufferChangeSize > > + if (tup) > + { > + sz += sizeof(HeapTupleData); > + len = tup->tuple.t_len; > + sz += len; > + } > > Why is the 'sz' increment split into 2 parts? > Because the other branches in ReorderBufferChangeSize do it that way. You're right it might be coded on a single line. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 11/28/23 12:32, Amit Kapila wrote: > On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra > wrote: >> >> I spent a bit of time looking at the proposed change, and unfortunately >> logging just the boolean flag does not work. A good example is this bit >> from a TAP test added by the patch for built-in replication (which was >> not included with the WIP patch): >> >> BEGIN; >> ALTER SEQUENCE s RESTART WITH 1000; >> SAVEPOINT sp1; >> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100); >> ROLLBACK TO sp1; >> COMMIT; >> >> This is expected to produce: >> >> 1131|0|t >> >> but produces >> >> 1000|0|f >> >> instead. The reason is very simple - as implemented, the patch simply >> checks if the relfilenode is from the same top-level transaction, which >> it is, and sets the flag to "true". So we know the sequence changes need >> to be queued and replayed as part of this transaction. >> >> But then during decoding, we still queue the changes into the subxact, >> which then aborts, and the changes are discarded. That is not how it's >> supposed to work, because the new relfilenode is still valid, someone >> might do nextval() and commit. And the nextval() may not get WAL-logged, >> so we'd lose this. >> >> What I guess we might do is log not just a boolean flag, but the XID of >> the subtransaction that created the relfilenode. And then during >> decoding we'd queue the changes into this subtransaction ... >> >> 0006 in the attached patch series does this, and it seems to fix the TAP >> test failure. I left it at the end, to make it easier to run tests >> without the patch applied. >> > > Offhand, I don't have any better idea than what you have suggested for > the problem but this needs some thoughts including the questions asked > by you. I'll spend some time on it and respond back. 
>

I've been experimenting with the idea to log the XID, and for a moment I
was worried it actually can't work, because subtransactions may not be
just nested in a simple way, but form a tree. And what if the sequence
was altered in a different branch (sibling subxact), not in the
immediate parent? In which case the new SubTransactionGetXid() would
fail, because it just walks the current chain of subtransactions.

I've been thinking about cases like this:

  BEGIN;
  CREATE SEQUENCE s;                                       # XID 1000
  SELECT alter_sequence();                                 # XID 1001
  SAVEPOINT s1;
  SELECT COUNT(nextval('s')) FROM generate_series(1,100);  # XID 1000
  ROLLBACK TO s1;
  SELECT COUNT(nextval('s')) FROM generate_series(1,100);  # XID 1000
  COMMIT;

The XID values are what the sequence WAL record will reference, assuming
that the main transaction XID is 1000.

Initially, I thought it's wrong that the nextval() calls reference the
XID of the main transaction, because the last relfilenode comes from
1001, which is the subxact created by alter_sequence() thanks to the
exception handling block. And that's where the approach in reorderbuffer
would queue the changes.

But I think this is actually correct too. When a subtransaction commits
(e.g. when alter_sequence() completes), it essentially becomes part of
the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid
accordingly, setting it to parentSubid. This also means that
SubTransactionGetXid() can't actually fail, because the ID has to
reference an active subtransaction in the current stack.

I'm still concerned about the cost of the lookup, because the list may
be long and the subxact we're looking for may be quite high up the
stack, but I guess we might have another field, caching the XID. It'd
need to be updated only in AtEOSubXact_cleanup, and at that point we
know it's the immediate parent, so it'd be pretty cheap I think.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra wrote: > > I spent a bit of time looking at the proposed change, and unfortunately > logging just the boolean flag does not work. A good example is this bit > from a TAP test added by the patch for built-in replication (which was > not included with the WIP patch): > > BEGIN; > ALTER SEQUENCE s RESTART WITH 1000; > SAVEPOINT sp1; > INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100); > ROLLBACK TO sp1; > COMMIT; > > This is expected to produce: > > 1131|0|t > > but produces > > 1000|0|f > > instead. The reason is very simple - as implemented, the patch simply > checks if the relfilenode is from the same top-level transaction, which > it is, and sets the flag to "true". So we know the sequence changes need > to be queued and replayed as part of this transaction. > > But then during decoding, we still queue the changes into the subxact, > which then aborts, and the changes are discarded. That is not how it's > supposed to work, because the new relfilenode is still valid, someone > might do nextval() and commit. And the nextval() may not get WAL-logged, > so we'd lose this. > > What I guess we might do is log not just a boolean flag, but the XID of > the subtransaction that created the relfilenode. And then during > decoding we'd queue the changes into this subtransaction ... > > 0006 in the attached patch series does this, and it seems to fix the TAP > test failure. I left it at the end, to make it easier to run tests > without the patch applied. > Offhand, I don't have any better idea than what you have suggested for the problem but this needs some thoughts including the questions asked by you. I'll spend some time on it and respond back. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
FWIW, here are some more minor review comments for v20231127-3-0001 == doc/src/sgml/logicaldecoding.sgml 1. + The txn parameter contains meta information about + the transaction the sequence change is part of. Note however that for + non-transactional updates, the transaction may be NULL, depending on + if the transaction already has an XID assigned. + The sequence_lsn has the WAL location of the + sequence update. transactional says if the + sequence has to be replayed as part of the transaction or directly. /says if/specifies whether/ == src/backend/commands/sequence.c 2. DecodeSeqTuple + memcpy(((char *) tuple->tuple.t_data), +data + sizeof(xl_seq_rec), +SizeofHeapTupleHeader); + + memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader, +data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader, +datalen); Maybe I am misreading but isn't this just copying 2 contiguous pieces of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen) achieve the same? == .../replication/logical/reorderbuffer.c 3. + * To decide if a sequence change is transactional, we maintain a hash + * table of relfilenodes created in each (sub)transactions, along with + * the XID of the (sub)transaction that created the relfilenode. The + * entries from substransactions are copied to the top-level transaction + * to make checks cheaper. The hash table gets cleaned up when the + * transaction completes (commit/abort). /substransactions/subtransactions/ ~~~ 4. + * A naive approach would be to just loop through all transactions and check + * each of them, but there may be (easily thousands) of subtransactions, and + * the check happens for each sequence change. So this could be very costly. /may be (easily thousands) of/may be (easily thousands of)/ ~~~ 5. 
ReorderBufferSequenceCleanup + while ((ent = (ReorderBufferSequenceEnt *) hash_seq_search(&scan_status)) != NULL) + { + (void) hash_search(txn->toptxn->sequences_hash, +(void *) &ent->rlocator, +HASH_REMOVE, NULL); + } Typically, other HASH_REMOVE code I saw would check result for NULL to give elog(ERROR, "hash table corrupted"); ~~~ 6. ReorderBufferQueueSequence + if (xid != InvalidTransactionId) + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); How about using the macro: TransactionIdIsValid ~~~ 7. ReorderBufferQueueSequence + if (reloid == InvalidOid) + elog(ERROR, "could not map filenode \"%s\" to relation OID", + relpathperm(rlocator, + MAIN_FORKNUM)); How about using the macro: OidIsValid ~~~ 8. + /* + * Calculate the first value of the next batch (at which point we + * generate and decode another WAL record. + */ Missing ')' ~~~ 9. ReorderBufferAddRelFileLocator + /* + * We only care about sequence relfilenodes for now, and those always have + * a XID. So if there's no XID, don't bother adding them to the hash. + */ + if (xid == InvalidTransactionId) + return; How about using the macro: TransactionIdIsValid ~~~ 10. ReorderBufferProcessTXN + if (reloid == InvalidOid) + elog(ERROR, "could not map filenode \"%s\" to relation OID", + relpathperm(change->data.sequence.locator, + MAIN_FORKNUM)); How about using the macro: OidIsValid ~~~ 11. ReorderBufferChangeSize + if (tup) + { + sz += sizeof(HeapTupleData); + len = tup->tuple.t_len; + sz += len; + } Why is the 'sz' increment split into 2 parts? == Kind Regards, Peter Smith. Fujitsu Australia
Re: logical decoding and replication of sequences, take 2
On 11/27/23 13:08, Hayato Kuroda (Fujitsu) wrote: > Dear Amit, Tomas, > I am wondering that instead of building the infrastructure to know whether a particular change is transactional on the decoding side, can't we have some flag in the WAL record to note whether the change is transactional or not? I have discussed this point with my colleague Kuroda-San and we thought that it may be worth exploring whether we can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine if the sequence is created/changed in the current subtransaction and then record that in WAL record. By this, we need to have additional information in the WAL record like XLOG_SEQ_LOG but we can probably do it only with wal_level as logical. >>> >>> I may not understand the proposal exactly, but it's not enough to know >>> if it was created in the same subxact. It might have been created in >>> some earlier subxact in the same top-level xact. >>> >> >> We should be able to detect even some earlier subxact or top-level >> xact based on rd_createSubid/rd_newRelfilelocatorSubid. > > Here is a small PoC patchset to help your understanding. Please see attached > files. > > 0001, 0002 were not changed, and 0004 was reassigned to 0003. > (For now, I focused only on test_decoding, because it is only for evaluation > purpose.) > > 0004 is what we really wanted to say. is_transactional is added in WAL > record, and it stores > whether the operation is transactional. In order to distinguish the status, > rd_createSubid and > rd_newRelfilelocatorSubid are used. According to the comment, they would be a > valid value > only when the relation was changed within the transaction > Also, sequences_hash was not needed anymore, so it and related functions were > removed. > > How do you think? > I think it's a very nice idea, assuming it maintains the current behavior. It makes a lot of code unnecessary, etc. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 11/27/23 12:11, Amit Kapila wrote: > On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra > wrote: >> >> On 11/27/23 11:13, Amit Kapila wrote: >>> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila >>> wrote: On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra wrote: > > While going over 0001, I realized there might be an optimization for > ReorderBufferSequenceIsTransactional. As coded in 0001, it always > searches through all top-level transactions, and if there's many of them > that might be expensive, even if very few of them have any relfilenodes > in the hash table. It's still linear search, and it needs to happen for > each sequence change. > > But can the relfilenode even be in some other top-level transaction? How > could it be - our transaction would not see it, and wouldn't be able to > generate the sequence change. So we should be able to simply check *our* > transaction (or if it's a subxact, the top-level transaction). Either > it's there (and it's transactional change), or not (and then it's > non-transactional change). > I also think the relfilenode should be part of either the current top-level xact or one of its subxact, so looking at all the top-level transactions for each change doesn't seem advisable. > The 0004 does this. > > This of course hinges on when exactly the transactions get created, and > assignments processed. For example if this would fire before the txn > gets assigned to the top-level one, this would break. I don't think this > can happen thanks to the immediate logging of assignments, but I'm too > tired to think about it now. > This needs some thought because I think we can't guarantee the association till we reach the point where we can actually decode the xact. See comments in AssertTXNLsnOrder() [1]. 
>> >> I suppose you mean the comment before the SnapBuildXactNeedsSkip call, >> which says: >> >> /* >>* Skip the verification if we don't reach the LSN at which we start >>* decoding the contents of transactions yet because until we reach >>* the LSN, we could have transactions that don't have the association >>* between the top-level transaction and subtransaction yet and >>* consequently have the same LSN. We don't guarantee this >>* association until we try to decode the actual contents of >>* transaction. The ordering of the records prior to the >>* start_decoding_at LSN should have been checked before the restart. >>*/ >> >> But doesn't this say that after we actually start decoding / stop >> skipping, we should have seen the assignment? We're already decoding >> transaction contents (because sequence change *is* part of xact, even if >> we decide to replay it in the non-transactional way). >> > > It means to say that the assignment is decided after start_decoding_at > point. We haven't decided that we are past start_decoding_at by the > time the patch is computing the transactional flag. > Ah, I see. We're deciding if the change is transactional before calling SnapBuildXactNeedsSkip. That's a bit unfortunate. >>> >>> I am wondering that instead of building the infrastructure to know >>> whether a particular change is transactional on the decoding side, >>> can't we have some flag in the WAL record to note whether the change >>> is transactional or not? I have discussed this point with my colleague >>> Kuroda-San and we thought that it may be worth exploring whether we >>> can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to >>> determine if the sequence is created/changed in the current >>> subtransaction and then record that in WAL record. By this, we need to >>> have additional information in the WAL record like XLOG_SEQ_LOG but we >>> can probably do it only with wal_level as logical. 
>>> >> >> I may not understand the proposal exactly, but it's not enough to know >> if it was created in the same subxact. It might have been created in >> some earlier subxact in the same top-level xact. >> > > We should be able to detect even some earlier subxact or top-level > xact based on rd_createSubid/rd_newRelfilelocatorSubid. > Interesting. I admit I haven't considered using these fields before, so I need to familiarize with it a bit, and try if it'd work. >> FWIW I think one of the earlier patch versions did something like this, >> by adding a "created" flag in the xlog record. And we concluded doing >> this on the decoding side is a better solution. >> > > oh, I thought it would be much simpler than what we are doing on the > decoding-side. Can you please point me to the email discussion where > this is concluded or share the reason? > I think the discussion started around [1], and then in a bunch of following messages (search for "relfilenode"). regards [1]
Re: logical decoding and replication of sequences, take 2
On Mon, Nov 27, 2023 at 4:41 PM Amit Kapila wrote: > > On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra > wrote: > > > > > FWIW I think one of the earlier patch versions did something like this, > > by adding a "created" flag in the xlog record. And we concluded doing > > this on the decoding side is a better solution. > > > > oh, I thought it would be much simpler than what we are doing on the > decoding-side. Can you please point me to the email discussion where > this is concluded or share the reason? > I'll check the thread about this point by myself as well but if by chance you remember it then kindly share it. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra wrote: > > On 11/27/23 11:13, Amit Kapila wrote: > > On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila > > wrote: > >> > >> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra > >> wrote: > >>> > >>> While going over 0001, I realized there might be an optimization for > >>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always > >>> searches through all top-level transactions, and if there's many of them > >>> that might be expensive, even if very few of them have any relfilenodes > >>> in the hash table. It's still linear search, and it needs to happen for > >>> each sequence change. > >>> > >>> But can the relfilenode even be in some other top-level transaction? How > >>> could it be - our transaction would not see it, and wouldn't be able to > >>> generate the sequence change. So we should be able to simply check *our* > >>> transaction (or if it's a subxact, the top-level transaction). Either > >>> it's there (and it's transactional change), or not (and then it's > >>> non-transactional change). > >>> > >> > >> I also think the relfilenode should be part of either the current > >> top-level xact or one of its subxact, so looking at all the top-level > >> transactions for each change doesn't seem advisable. > >> > >>> The 0004 does this. > >>> > >>> This of course hinges on when exactly the transactions get created, and > >>> assignments processed. For example if this would fire before the txn > >>> gets assigned to the top-level one, this would break. I don't think this > >>> can happen thanks to the immediate logging of assignments, but I'm too > >>> tired to think about it now. > >>> > >> > >> This needs some thought because I think we can't guarantee the > >> association till we reach the point where we can actually decode the > >> xact. See comments in AssertTXNLsnOrder() [1]. 
> >> > > I suppose you mean the comment before the SnapBuildXactNeedsSkip call, > which says: > > /* >* Skip the verification if we don't reach the LSN at which we start >* decoding the contents of transactions yet because until we reach >* the LSN, we could have transactions that don't have the association >* between the top-level transaction and subtransaction yet and >* consequently have the same LSN. We don't guarantee this >* association until we try to decode the actual contents of >* transaction. The ordering of the records prior to the >* start_decoding_at LSN should have been checked before the restart. >*/ > > But doesn't this say that after we actually start decoding / stop > skipping, we should have seen the assignment? We're already decoding > transaction contents (because sequence change *is* part of xact, even if > we decide to replay it in the non-transactional way). > It means to say that the assignment is decided after start_decoding_at point. We haven't decided that we are past start_decoding_at by the time the patch is computing the transactional flag. > > > > I am wondering that instead of building the infrastructure to know > > whether a particular change is transactional on the decoding side, > > can't we have some flag in the WAL record to note whether the change > > is transactional or not? I have discussed this point with my colleague > > Kuroda-San and we thought that it may be worth exploring whether we > > can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to > > determine if the sequence is created/changed in the current > > subtransaction and then record that in WAL record. By this, we need to > > have additional information in the WAL record like XLOG_SEQ_LOG but we > > can probably do it only with wal_level as logical. > > > > I may not understand the proposal exactly, but it's not enough to know > if it was created in the same subxact. It might have been created in > some earlier subxact in the same top-level xact. 
> We should be able to detect even some earlier subxact or top-level xact based on rd_createSubid/rd_newRelfilelocatorSubid. > FWIW I think one of the earlier patch versions did something like this, > by adding a "created" flag in the xlog record. And we concluded doing > this on the decoding side is a better solution. > oh, I thought it would be much simpler than what we are doing on the decoding-side. Can you please point me to the email discussion where this is concluded or share the reason? -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On 11/27/23 11:13, Amit Kapila wrote: > On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila wrote: >> >> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra >> wrote: >>> >>> While going over 0001, I realized there might be an optimization for >>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always >>> searches through all top-level transactions, and if there's many of them >>> that might be expensive, even if very few of them have any relfilenodes >>> in the hash table. It's still linear search, and it needs to happen for >>> each sequence change. >>> >>> But can the relfilenode even be in some other top-level transaction? How >>> could it be - our transaction would not see it, and wouldn't be able to >>> generate the sequence change. So we should be able to simply check *our* >>> transaction (or if it's a subxact, the top-level transaction). Either >>> it's there (and it's transactional change), or not (and then it's >>> non-transactional change). >>> >> >> I also think the relfilenode should be part of either the current >> top-level xact or one of its subxact, so looking at all the top-level >> transactions for each change doesn't seem advisable. >> >>> The 0004 does this. >>> >>> This of course hinges on when exactly the transactions get created, and >>> assignments processed. For example if this would fire before the txn >>> gets assigned to the top-level one, this would break. I don't think this >>> can happen thanks to the immediate logging of assignments, but I'm too >>> tired to think about it now. >>> >> >> This needs some thought because I think we can't guarantee the >> association till we reach the point where we can actually decode the >> xact. See comments in AssertTXNLsnOrder() [1]. 
>> I suppose you mean the comment before the SnapBuildXactNeedsSkip call, which says: /* * Skip the verification if we don't reach the LSN at which we start * decoding the contents of transactions yet because until we reach * the LSN, we could have transactions that don't have the association * between the top-level transaction and subtransaction yet and * consequently have the same LSN. We don't guarantee this * association until we try to decode the actual contents of * transaction. The ordering of the records prior to the * start_decoding_at LSN should have been checked before the restart. */ But doesn't this say that after we actually start decoding / stop skipping, we should have seen the assignment? We're already decoding transaction contents (because sequence change *is* part of xact, even if we decide to replay it in the non-transactional way). > > I am wondering that instead of building the infrastructure to know > whether a particular change is transactional on the decoding side, > can't we have some flag in the WAL record to note whether the change > is transactional or not? I have discussed this point with my colleague > Kuroda-San and we thought that it may be worth exploring whether we > can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to > determine if the sequence is created/changed in the current > subtransaction and then record that in WAL record. By this, we need to > have additional information in the WAL record like XLOG_SEQ_LOG but we > can probably do it only with wal_level as logical. > I may not understand the proposal exactly, but it's not enough to know if it was created in the same subxact. It might have been created in some earlier subxact in the same top-level xact. FWIW I think one of the earlier patch versions did something like this, by adding a "created" flag in the xlog record. And we concluded doing this on the decoding side is a better solution. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila wrote: > > On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra > wrote: > > > > While going over 0001, I realized there might be an optimization for > > ReorderBufferSequenceIsTransactional. As coded in 0001, it always > > searches through all top-level transactions, and if there's many of them > > that might be expensive, even if very few of them have any relfilenodes > > in the hash table. It's still linear search, and it needs to happen for > > each sequence change. > > > > But can the relfilenode even be in some other top-level transaction? How > > could it be - our transaction would not see it, and wouldn't be able to > > generate the sequence change. So we should be able to simply check *our* > > transaction (or if it's a subxact, the top-level transaction). Either > > it's there (and it's transactional change), or not (and then it's > > non-transactional change). > > > > I also think the relfilenode should be part of either the current > top-level xact or one of its subxact, so looking at all the top-level > transactions for each change doesn't seem advisable. > > > The 0004 does this. > > > > This of course hinges on when exactly the transactions get created, and > > assignments processed. For example if this would fire before the txn > > gets assigned to the top-level one, this would break. I don't think this > > can happen thanks to the immediate logging of assignments, but I'm too > > tired to think about it now. > > > > This needs some thought because I think we can't guarantee the > association till we reach the point where we can actually decode the > xact. See comments in AssertTXNLsnOrder() [1]. > I am wondering that instead of building the infrastructure to know whether a particular change is transactional on the decoding side, can't we have some flag in the WAL record to note whether the change is transactional or not? 
I have discussed this point with my colleague Kuroda-San and we thought that it may be worth exploring whether we can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to determine if the sequence is created/changed in the current subtransaction and then record that in WAL record. By this, we need to have additional information in the WAL record like XLOG_SEQ_LOG but we can probably do it only with wal_level as logical. One minor point: It'd also + * trigger assert in DecodeSequence. I don't see DecodeSequence() in the patch. Which exact assert/function are you referring to here? -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra wrote: > > I've been cleaning up the first two patches to get them committed soon > (adding the decoding infrastructure + test_decoding), cleaning up stale > comments, updating commit messages etc. And I think it's ready to go, > but it's too late over, so I plan going over once more tomorrow and then > likely push. But if someone wants to take a look, I'd welcome that. > > The one issue I found during this cleanup is that the patch was missing > the changes introduced by 29d0a77fa660 for decoding of other stuff. > > commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d > Author: Amit Kapila > Date: Thu Oct 26 06:54:16 2023 +0530 > > Migrate logical slots to the new node during an upgrade. > ... > > I fixed that, but perhaps someone might want to double check ... > > > 0003 is here just for completeness - that's the part adding sequences to > built-in replication. I haven't done much with it, it needs some cleanup > too to get it committable. I don't intend to push that right after > 0001+0002, though. > > > While going over 0001, I realized there might be an optimization for > ReorderBufferSequenceIsTransactional. As coded in 0001, it always > searches through all top-level transactions, and if there's many of them > that might be expensive, even if very few of them have any relfilenodes > in the hash table. It's still linear search, and it needs to happen for > each sequence change. > > But can the relfilenode even be in some other top-level transaction? How > could it be - our transaction would not see it, and wouldn't be able to > generate the sequence change. So we should be able to simply check *our* > transaction (or if it's a subxact, the top-level transaction). Either > it's there (and it's transactional change), or not (and then it's > non-transactional change). 
> I also think the relfilenode should be part of either the current top-level xact or one of its subxact, so looking at all the top-level transactions for each change doesn't seem advisable. > The 0004 does this. > > This of course hinges on when exactly the transactions get created, and > assignments processed. For example if this would fire before the txn > gets assigned to the top-level one, this would break. I don't think this > can happen thanks to the immediate logging of assignments, but I'm too > tired to think about it now. > This needs some thought because I think we can't guarantee the association till we reach the point where we can actually decode the xact. See comments in AssertTXNLsnOrder() [1]. I noticed few minor comments while reading the patch: 1. + * turned on here because the non-transactional logical message is + * decoded without waiting for these records. Instead of '.. logical message', shouldn't we say sequence change message? 2. + /* + * If we found an entry with matchine relfilenode, typo (matchine) 3. + Note that this may not the value obtained by the process updating the + process, but the future sequence value written to WAL (typically about + 32 values ahead). /may not the value/may not be the value [1] - /* * Skip the verification if we don't reach the LSN at which we start * decoding the contents of transactions yet because until we reach the * LSN, we could have transactions that don't have the association between * the top-level transaction and subtransaction yet and consequently have * the same LSN. We don't guarantee this association until we try to * decode the actual contents of transaction. -- With Regards, Amit Kapila.
RE: logical decoding and replication of sequences, take 2
On Thursday, October 12, 2023 11:06 PM Tomas Vondra wrote: > Hi, I have been reviewing the patch set, and here are some initial comments. 1. I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for transactional sequence change in ReorderBufferQueueChange(). 2. ReorderBufferSequenceIsTransactional It seems we call the above function once in sequence_decode() and call it again in ReorderBufferQueueSequence(), would it be better to avoid the second call since the hash table search doesn't look cheap? 3. The patch cleans up the sequence hash table when COMMIT or ABORT a transaction (via ReorderBufferAbort() and ReorderBufferReturnTXN()), while it doesn't seem to destroy the hash table on PREPARE of the transaction. It's not a big problem but would it be better to release the memory earlier by destroying the table on prepare? 4. +pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, ... + /* output BEGIN if we haven't yet, but only for the transactional case */ + if (transactional) + { + if (data->skip_empty_xacts && !txndata->xact_wrote_changes) + { + pg_output_begin(ctx, data, txn, false); + } + txndata->xact_wrote_changes = true; + } I think we should call pg_output_stream_start() instead of pg_output_begin() for streaming sequence changes. 5. + /* +* Schema should be sent using the original relation because it +* also sends the ancestor's relation. +*/ + maybe_send_schema(ctx, txn, relation, relentry); The comment seems a bit misleading here; I think it was used for the partition logic in pgoutput_change(). Best Regards, Hou zj
Re: logical decoding and replication of sequences, take 2
On Thu, Oct 12, 2023 at 9:03 PM Tomas Vondra wrote: > > On 7/25/23 12:20, Amit Kapila wrote: > > ... > > > > I have used the debugger to reproduce this as it needs quite some > > coordination. I just wanted to see if the sequence can go backward and > > didn't catch up completely before the sequence state is marked > > 'ready'. On the publisher side, I created a publication with a table > > and a sequence. Then did the following steps: > > SELECT nextval('s') FROM generate_series(1,50); > > insert into t1 values(1); > > SELECT nextval('s') FROM generate_series(51,150); > > > > Then on the subscriber side with some debugging aid, I could find the > > values in the sequence shown in the previous email. Sorry, I haven't > > recorded each and every step but, if you think it helps, I can again > > try to reproduce it and share the steps. > > > > Amit, can you try to reproduce this backwards movement with the latest > version of the patch? > I lost touch with this patch but IIRC the quoted problem per se shouldn't occur after the idea to use page LSN instead of slot's LSN for synchronization between sync and apply worker. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On 7/25/23 12:20, Amit Kapila wrote: > ... > > I have used the debugger to reproduce this as it needs quite some > coordination. I just wanted to see if the sequence can go backward and > didn't catch up completely before the sequence state is marked > 'ready'. On the publisher side, I created a publication with a table > and a sequence. Then did the following steps: > SELECT nextval('s') FROM generate_series(1,50); > insert into t1 values(1); > SELECT nextval('s') FROM generate_series(51,150); > > Then on the subscriber side with some debugging aid, I could find the > values in the sequence shown in the previous email. Sorry, I haven't > recorded each and every step but, if you think it helps, I can again > try to reproduce it and share the steps. > Amit, can you try to reproduce this backwards movement with the latest version of the patch? I have tried triggering that (mis)behavior, but I haven't been successful so far. I'm hesitant to declare it resolved, as it's dependent on timing etc. and you mentioned it required quite some coordination. Thanks! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 9/13/23 15:18, Ashutosh Bapat wrote: > On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila wrote: >> >> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila wrote: >>> >>> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat >>> wrote: On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra wrote: > >> >> But whether or not that's the case, downstream should not request (and >> hence receive) any changes that have been already applied (and >> committed) downstream as a principle. I think a way to achieve this is >> to update the replorigin_session_origin_lsn so that a sequence change >> applied once is not requested (and hence sent) again. >> > > I guess we could update the origin, per attached 0004. We don't have > timestamp to set replorigin_session_origin_timestamp, but it seems we > don't need that. > > The attached patch merges the earlier improvements, except for the part > that experimented with adding a "fake" transaction (which turned out to > have a number of difficult issues). 0004 looks good to me. >>> >>> >>> + { >>> CommitTransactionCommand(); >>> + >>> + /* >>> + * Update origin state so we don't try applying this sequence >>> + * change in case of crash. >>> + * >>> + * XXX We don't have replorigin_session_origin_timestamp, but we >>> + * can just leave that set to 0. >>> + */ >>> + replorigin_session_origin_lsn = seq.lsn; >>> >>> IIUC, your proposal is to update the replorigin_session_origin_lsn, so >>> that after restart, it doesn't use some prior origin LSN to start with >>> which can in turn lead the sequence to go backward. If so, it should >>> be updated before calling CommitTransactionCommand() as we are doing >>> in apply_handle_commit_internal(). If that is not the intention then >>> it is not clear to me how updating replorigin_session_origin_lsn after >>> commit is helpful. >>> >> >> typedef struct ReplicationState >> { >> ... >> /* >> * Location of the latest commit from the remote side. 
>> */ >> XLogRecPtr remote_lsn; >> >> This is the variable that will be updated with the value of >> replorigin_session_origin_lsn. This means we will now track some >> arbitrary LSN location of the remote side in this variable. The above >> comment makes me wonder if there is anything we are missing or if it >> is just a matter of updating this comment because before the patch we >> always adhere to what is written in the comment. > > I don't think we are missing anything. This value is used to track the > remote LSN up to which all the commits from upstream have been applied > locally. Since a non-transactional sequence change is like a single > WAL record transaction, its LSN acts as the LSN of the mini-commit. > So it should be fine to update remote_lsn with sequence WAL record's > end LSN. That's what the patches do. I don't see any hazard. But you > are right, we need to update comments. Here and also at other places > like > replorigin_session_advance() which uses remote_commit as the name of the > argument which gets assigned to ReplicationState::remote_lsn. > I agree - updating the replorigin_session_origin_lsn shouldn't break anything. As you write, it's essentially a "mini-commit" and the commit order remains the same. I'm not sure about resetting replorigin_session_origin_timestamp to 0 though. It's not something we rely on very much (it may not correlate with the commit order etc.). But why should we set it to 0? We don't do that for regular commits, right? And IMO it makes sense to just use the timestamp of the last commit before the sequence change. FWIW I've left this in a separate commit, but I'll merge that into 0002 in the next patch version.
Re: logical decoding and replication of sequences, take 2
On 9/20/23 11:53, Dilip Kumar wrote: > On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra > wrote: >> > > I was reading through 0001, I noticed this comment in > ReorderBufferSequenceIsTransactional() function > > + * To decide if a sequence change should be handled as transactional or > applied > + * immediately, we track (sequence) relfilenodes created by each transaction. > + * We don't know if the current sub-transaction was already assigned to the > + * top-level transaction, so we need to check all transactions. > > It says "We don't know if the current sub-transaction was already > assigned to the top-level transaction, so we need to check all > transactions". But IIRC as part of the streaming of in-progress > transactions we have ensured that whenever we are logging the first > change by any subtransaction we include the top transaction ID in it. > Yeah, that's a stale comment - the actual code only searched through the top-level ones (and thus relies on the immediate assignment). As I wrote in the earlier response, I suspect this code originates from before I added the GetCurrentTransactionId() calls. That being said, I do wonder why with the immediate assignments we still need the bit in ReorderBufferAssignChild that says: /* * We already saw this transaction, but initially added it to the * list of top-level txns. Now that we know it's not top-level, * remove it from there. */ dlist_delete(&subtxn->node); I don't think that affects this patch, but it's a bit confusing. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
RE: logical decoding and replication of sequences, take 2
On Friday, September 15, 2023 11:11 AM Zhijie Hou (Fujitsu) wrote: > > On Wednesday, August 16, 2023 10:27 PM Tomas Vondra > wrote: > > Hi, > > > > > > > I guess we could update the origin, per attached 0004. We don't have > > timestamp to set replorigin_session_origin_timestamp, but it seems we > > don't need that. > > > > The attached patch merges the earlier improvements, except for the > > part that experimented with adding a "fake" transaction (which turned > > out to have a number of difficult issues). > > I tried to test the patch and found a crash when calling > pg_logical_slot_get_changes() to consume sequence changes. Oh, after confirming again, I realize it's my fault that my build environment was not clean. This case passed after rebuilding. Sorry for the noise. Best Regards, Hou zj
Re: logical decoding and replication of sequences, take 2
On Wed, Sep 20, 2023 at 3:23 PM Dilip Kumar wrote: > > On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra > wrote: > > > > I was reading through 0001, I noticed this comment in > ReorderBufferSequenceIsTransactional() function > > + * To decide if a sequence change should be handled as transactional or > applied > + * immediately, we track (sequence) relfilenodes created by each transaction. > + * We don't know if the current sub-transaction was already assigned to the > + * top-level transaction, so we need to check all transactions. > > It says "We don't know if the current sub-transaction was already > assigned to the top-level transaction, so we need to check all > transactions". But IIRC as part of the streaming of in-progress > transactions we have ensured that whenever we are logging the first > change by any subtransaction we include the top transaction ID in it. > > Refer to this code > > LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, > XLogReaderState *record) > { > ... > /* > * If the top-level xid is valid, we need to assign the subxact to the > * top-level xact. We need to do this for all records, hence we do it > * before the switch. > */ > if (TransactionIdIsValid(txid)) > { > ReorderBufferAssignChild(ctx->reorder, > txid, > XLogRecGetXid(record), > buf.origptr); > } > } Some more comments: 1. ReorderBufferSequenceIsTransactional and ReorderBufferSequenceGetXid are duplicated, except that the first one just confirms whether the relfilelocator was created in the transaction and the other also returns the XID, so I think these two could easily be merged to avoid duplicate code. 2. /* + * ReorderBufferTransferSequencesToParent + * Copy the relfilenode entries to the parent after assignment. + */ +static void +ReorderBufferTransferSequencesToParent(ReorderBuffer *rb, +ReorderBufferTXN *txn, +ReorderBufferTXN *subtxn) If we agree with my comment in the previous email (i.e. 
the first WAL record by a subxid will always include the topxid) then we do not need this function at all; we can always add the relfilelocator directly to the top transaction and never need to transfer. That is all I have for now from the first pass of 0001; later I will do a more detailed review and look into the other patches as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra wrote: > I was reading through 0001, I noticed this comment in the ReorderBufferSequenceIsTransactional() function + * To decide if a sequence change should be handled as transactional or applied + * immediately, we track (sequence) relfilenodes created by each transaction. + * We don't know if the current sub-transaction was already assigned to the + * top-level transaction, so we need to check all transactions. It says "We don't know if the current sub-transaction was already assigned to the top-level transaction, so we need to check all transactions". But IIRC as part of the streaming of in-progress transactions we have ensured that whenever we are logging the first change by any subtransaction we include the top transaction ID in it. Refer to this code LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *record) { ... /* * If the top-level xid is valid, we need to assign the subxact to the * top-level xact. We need to do this for all records, hence we do it * before the switch. */ if (TransactionIdIsValid(txid)) { ReorderBufferAssignChild(ctx->reorder, txid, XLogRecGetXid(record), buf.origptr); } } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
RE: logical decoding and replication of sequences, take 2
On Wednesday, August 16, 2023 10:27 PM Tomas Vondra wrote: Hi, > > > I guess we could update the origin, per attached 0004. We don't have > timestamp to set replorigin_session_origin_timestamp, but it seems we don't > need that. > > The attached patch merges the earlier improvements, except for the part that > experimented with adding a "fake" transaction (which turned out to have a > number of difficult issues). I tried to test the patch and found a crash when calling pg_logical_slot_get_changes() to consume sequence changes. Steps: create table t1_seq(a int); create sequence seq1; SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true); INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100); SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'include-xids', 'false', 'skip-empty-xacts', '1'); Attach the backtrace in bt.txt. Best Regards, Hou zj (gdb) bt #0 0x7fc7fc0ddaff in raise () from /lib64/libc.so.6 #1 0x7fc7fc0b0ea5 in abort () from /lib64/libc.so.6 #2 0x7fc7fc120097 in __libc_message () from /lib64/libc.so.6 #3 0x7fc7fc1274ec in malloc_printerr () from /lib64/libc.so.6 #4 0x7fc7fc12929c in _int_free () from /lib64/libc.so.6 #5 0x00ba3a9a in AllocSetDelete (context=0x234e820) at aset.c:652 #6 0x00bae6ab in MemoryContextDelete (context=0x234e820) at mcxt.c:437 #7 0x009171e2 in filter_by_origin_cb_wrapper (ctx=0x2356970, origin_id=0) at logical.c:1234 #8 0x0091153b in FilterByOrigin (ctx=0x2356970, origin_id=0) at decode.c:583 #9 0x00911d5b in DecodeInsert (ctx=0x2356970, buf=0x7ffd28f1ad70) at decode.c:907 #10 0x00911332 in heap_decode (ctx=0x2356970, buf=0x7ffd28f1ad70) at decode.c:488 #11 0x00910983 in LogicalDecodingProcessRecord (ctx=0x2356970, record=0x2356d08) at decode.c:122 #12 0x009192af in pg_logical_slot_get_changes_guts (fcinfo=0x2389380, confirm=true, binary=false) at logicalfuncs.c:258 #13 0x009193e4 in pg_logical_slot_get_changes (fcinfo=0x2389380) at logicalfuncs.c:325 
#14 0x00743ab4 in ExecMakeTableFunctionResult (setexpr=0x2373828, econtext=0x23736f8, argContext=0x2389280, expectedDesc=0x23341c8, randomAccess=false) at execSRF.c:235 #15 0x0075ed3d in FunctionNext (node=0x23734e8) at nodeFunctionscan.c:95 #16 0x00745312 in ExecScanFetch (node=0x23734e8, accessMtd=0x75ec8b , recheckMtd=0x75f095 ) at execScan.c:132 #17 0x007453b3 in ExecScan (node=0x23734e8, accessMtd=0x75ec8b , recheckMtd=0x75f095 ) at execScan.c:198 #18 0x0075f0df in ExecFunctionScan (pstate=0x23734e8) at nodeFunctionscan.c:270 #19 0x00741324 in ExecProcNodeFirst (node=0x23734e8) at execProcnode.c:464 #20 0x0073503c in ExecProcNode (node=0x23734e8) at ../../../src/include/executor/executor.h:273 #21 0x00737ca7 in ExecutePlan (estate=0x23732c0, planstate=0x23734e8, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x2330518, execute_once=true) at execMain.c:1670 #22 0x00735682 in standard_ExecutorRun (queryDesc=0x2365020, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:365 #23 0x007354bf in ExecutorRun (queryDesc=0x2365020, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:309 #24 0x009bfc7e in PortalRunSelect (portal=0x22df3c0, forward=true, count=0, dest=0x2330518) at pquery.c:924 #25 0x009bf93b in PortalRun (portal=0x22df3c0, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2330518, altdest=0x2330518, qc=0x7ffd28f1b550) at pquery.c:768 #26 0x009b94c4 in exec_simple_query ( query_string=0x2264410 "SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,\n'include-xids', 'false', 'skip-empty-xacts', '1');") at postgres.c:1274 #27 0x009bdb31 in PostgresMain (dbname=0x229d240 "postgres", username=0x229d228 "houzj") at postgres.c:4637 #28 0x008f27ef in BackendRun (port=0x2293650) at postmaster.c:4433 #29 0x008f2185 in BackendStartup (port=0x2293650) at postmaster.c:4161 #30 0x008eebff in ServerLoop () at 
postmaster.c:1778 #31 0x008ee5cf in PostmasterMain (argc=3, argv=0x225d690) at postmaster.c:1462
Re: logical decoding and replication of sequences, take 2
On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila wrote: > > On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila wrote: > > > > On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat > > wrote: > > > > > > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra > > > wrote: > > > > > > > > > > > > > > But whether or not that's the case, downstream should not request (and > > > > > hence receive) any changes that have been already applied (and > > > > > committed) downstream as a principle. I think a way to achieve this is > > > > > to update the replorigin_session_origin_lsn so that a sequence change > > > > > applied once is not requested (and hence sent) again. > > > > > > > > > > > > > I guess we could update the origin, per attached 0004. We don't have > > > > timestamp to set replorigin_session_origin_timestamp, but it seems we > > > > don't need that. > > > > > > > > The attached patch merges the earlier improvements, except for the part > > > > that experimented with adding a "fake" transaction (which turned out to > > > > have a number of difficult issues). > > > > > > 0004 looks good to me. > > > > > > + { > > CommitTransactionCommand(); > > + > > + /* > > + * Update origin state so we don't try applying this sequence > > + * change in case of crash. > > + * > > + * XXX We don't have replorigin_session_origin_timestamp, but we > > + * can just leave that set to 0. > > + */ > > + replorigin_session_origin_lsn = seq.lsn; > > > > IIUC, your proposal is to update the replorigin_session_origin_lsn, so > > that after restart, it doesn't use some prior origin LSN to start with > > which can in turn lead the sequence to go backward. If so, it should > > be updated before calling CommitTransactionCommand() as we are doing > > in apply_handle_commit_internal(). If that is not the intention then > > it is not clear to me how updating replorigin_session_origin_lsn after > > commit is helpful. > > > > typedef struct ReplicationState > { > ... 
> /* > * Location of the latest commit from the remote side. > */ > XLogRecPtr remote_lsn; > > This is the variable that will be updated with the value of > replorigin_session_origin_lsn. This means we will now track some > arbitrary LSN location of the remote side in this variable. The above > comment makes me wonder if there is anything we are missing or if it > is just a matter of updating this comment because before the patch we > always adhere to what is written in the comment. I don't think we are missing anything. This value is used to track the remote LSN up to which all the commits from upstream have been applied locally. Since a non-transactional sequence change is like a single WAL record transaction, its LSN acts as the LSN of the mini-commit. So it should be fine to update remote_lsn with the sequence WAL record's end LSN. That's what the patches do. I don't see any hazard. But you are right, we need to update comments. Here and also at other places like replorigin_session_advance() which uses remote_commit as the name of the argument which gets assigned to ReplicationState::remote_lsn. -- Best Wishes, Ashutosh Bapat
Re: logical decoding and replication of sequences, take 2
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat wrote: > > > > The attached patch merges the earlier improvements, except for the part > > that experimented with adding a "fake" transaction (which turned out to > > have a number of difficult issues). > > 0004 looks good to me. But I need to review the impact of not setting > replorigin_session_origin_timestamp. I think it will be good to set replorigin_session_origin_timestamp = 0 explicitly so as not to pick up a garbage value. The timestamp is written to the commit record. Beyond that I don't see any use of it. It is further passed downstream if there is a cascaded logical replication setup. But I don't see it being used. So it should be fine to leave it 0. I don't think we can use logically replicated sequences in a multi-master environment where the timestamp may be used to resolve conflicts. Such a setup will require distributed sequence management, which cannot be achieved by logical replication alone. In short, I didn't find any hazard in leaving the replorigin_session_origin_timestamp as 0. -- Best Wishes, Ashutosh Bapat
Re: logical decoding and replication of sequences, take 2
On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila wrote: > > On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat > wrote: > > > > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra > > wrote: > > > > > > > > > > > But whether or not that's the case, downstream should not request (and > > > > hence receive) any changes that have been already applied (and > > > > committed) downstream as a principle. I think a way to achieve this is > > > > to update the replorigin_session_origin_lsn so that a sequence change > > > > applied once is not requested (and hence sent) again. > > > > > > > > > > I guess we could update the origin, per attached 0004. We don't have > > > timestamp to set replorigin_session_origin_timestamp, but it seems we > > > don't need that. > > > > > > The attached patch merges the earlier improvements, except for the part > > > that experimented with adding a "fake" transaction (which turned out to > > > have a number of difficult issues). > > > > 0004 looks good to me. > > > + { > CommitTransactionCommand(); > + > + /* > + * Update origin state so we don't try applying this sequence > + * change in case of crash. > + * > + * XXX We don't have replorigin_session_origin_timestamp, but we > + * can just leave that set to 0. > + */ > + replorigin_session_origin_lsn = seq.lsn; > > IIUC, your proposal is to update the replorigin_session_origin_lsn, so > that after restart, it doesn't use some prior origin LSN to start with > which can in turn lead the sequence to go backward. If so, it should > be updated before calling CommitTransactionCommand() as we are doing > in apply_handle_commit_internal(). If that is not the intention then > it is not clear to me how updating replorigin_session_origin_lsn after > commit is helpful. > typedef struct ReplicationState { ... /* * Location of the latest commit from the remote side. */ XLogRecPtrremote_lsn; This is the variable that will be updated with the value of replorigin_session_origin_lsn. 
This means we will now track some arbitrary LSN location of the remote side in this variable. The above comment makes me wonder if there is anything we are missing or if it is just a matter of updating this comment because before the patch we always adhere to what is written in the comment. -- With Regards, Amit Kapila.
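The ordering issue raised above (update replorigin_session_origin_lsn before CommitTransactionCommand(), not after) can be sketched as a toy Python simulation. This is not PostgreSQL code; the class and variable names are made up for illustration. The point is that if the origin LSN is made durable in a separate step after the commit, a crash between the two steps leaves the origin behind, and the already-applied change is requested (and applied) again on restart.

```python
# Toy model (not PostgreSQL code) of why the origin LSN should advance
# as part of the commit, not in a separate step afterwards.

class Subscriber:
    def __init__(self):
        self.applied = []      # durably committed changes
        self.origin_lsn = 0    # durably recorded replication origin

    def apply_atomic(self, lsn, change):
        # origin LSN updated within the same "commit" -> always consistent
        self.applied.append(change)
        self.origin_lsn = lsn

    def apply_then_update(self, lsn, change, crash_between=False):
        self.applied.append(change)        # the commit happens first
        if crash_between:
            return                         # crash: origin LSN never advanced
        self.origin_lsn = lsn

def restart_request(sub):
    # on restart, the publisher is asked for everything after origin_lsn
    return sub.origin_lsn

good = Subscriber()
good.apply_atomic(100, "seq: 10")
assert restart_request(good) == 100        # change at LSN 100 not re-sent

bad = Subscriber()
bad.apply_then_update(100, "seq: 10", crash_between=True)
assert restart_request(bad) == 0           # change at LSN 100 is sent again
assert "seq: 10" in bad.applied            # ... even though it was applied
```

In the atomic variant a restart never re-requests an applied change; in the split variant the change is replayed, which is exactly how the sequence can briefly go backward.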
Re: logical decoding and replication of sequences, take 2
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat wrote: > > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra > wrote: > > > > > > > > But whether or not that's the case, downstream should not request (and > > > hence receive) any changes that have been already applied (and > > > committed) downstream as a principle. I think a way to achieve this is > > > to update the replorigin_session_origin_lsn so that a sequence change > > > applied once is not requested (and hence sent) again. > > > > > > > I guess we could update the origin, per attached 0004. We don't have > > timestamp to set replorigin_session_origin_timestamp, but it seems we > > don't need that. > > > > The attached patch merges the earlier improvements, except for the part > > that experimented with adding a "fake" transaction (which turned out to > > have a number of difficult issues). > > 0004 looks good to me. + { CommitTransactionCommand(); + + /* + * Update origin state so we don't try applying this sequence + * change in case of crash. + * + * XXX We don't have replorigin_session_origin_timestamp, but we + * can just leave that set to 0. + */ + replorigin_session_origin_lsn = seq.lsn; IIUC, your proposal is to update the replorigin_session_origin_lsn, so that after restart, it doesn't use some prior origin LSN to start with which can in turn lead the sequence to go backward. If so, it should be updated before calling CommitTransactionCommand() as we are doing in apply_handle_commit_internal(). If that is not the intention then it is not clear to me how updating replorigin_session_origin_lsn after commit is helpful. > But I need to review the impact of not setting > replorigin_session_origin_timestamp. > This may not have a direct impact on built-in replication as I think we don't rely on it yet but we need to think of out-of-core solutions. I am not sure if I understood your proposal as per my previous comment but once you clarify the same, I'll also try to think on the same. 
-- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra wrote: > > > > > But whether or not that's the case, downstream should not request (and > > hence receive) any changes that have been already applied (and > > committed) downstream as a principle. I think a way to achieve this is > > to update the replorigin_session_origin_lsn so that a sequence change > > applied once is not requested (and hence sent) again. > > > > I guess we could update the origin, per attached 0004. We don't have > timestamp to set replorigin_session_origin_timestamp, but it seems we > don't need that. > > The attached patch merges the earlier improvements, except for the part > that experimented with adding a "fake" transaction (which turned out to > have a number of difficult issues). 0004 looks good to me. But I need to review the impact of not setting replorigin_session_origin_timestamp. What fake transaction experiment are you talking about? -- Best Wishes, Ashutosh Bapat
Re: logical decoding and replication of sequences, take 2
On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra wrote: > > Anyway, I think this is "just" a matter of efficiency, not correctness. > IMHO there are bigger questions regarding the "going back" behavior > after apply restart.

sequence_decode() has the following code:

    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

This means that if the subscription restarts, the upstream will *not* send any non-transactional sequence changes with LSN prior to the LSN specified by the START_REPLICATION command. That should avoid replicating all the non-transactional sequence changes since ReplicationSlot::restart_lsn if the subscription restarts. But in apply_handle_sequence(), we do not update the replorigin_session_origin_lsn with the LSN of the non-transactional sequence change when it's applied. This means that if a subscription restarts while it is halfway through applying a transaction, those changes will be replicated again. This will move the sequence backward. If the subscription keeps restarting again and again while applying that transaction, we will see the sequence "rubber banding" [1] on the subscription. So until the transaction is completely applied, other users of the sequence may see duplicate values. I think this is undesirable. But I am not able to find a case where this can lead to conflicting values after failover. If there's only one transaction which is repeatedly being applied, the rows which use sequence values were never committed so there's no conflicting value present on the subscription. The same reasoning can be extended to multiple in-flight transactions.
If another transaction (T2) uses the sequence values changed by in-flight transaction T1 and if T2 commits before T1, the sequence changes used by T2 must have LSNs before the commit of T2 and thus they will never be replicated. (See example below.)

    T1: insert into t1 (nextval('seq'), ...) from generate_series(1, 100); - Q1
    T2: insert into t1 (nextval('seq'), ...) from generate_series(1, 100); - Q2
    T2: COMMIT;
    T1: insert into t1 (nextval('seq'), ...) from generate_series(1, 100); - Q13
    T1: COMMIT;

So I am not able to imagine a case when a sequence going backward can cause conflicting values. But whether or not that's the case, downstream should not request (and hence receive) any changes that have been already applied (and committed) downstream as a principle. I think a way to achieve this is to update the replorigin_session_origin_lsn so that a sequence change applied once is not requested (and hence sent) again. [1] https://en.wikipedia.org/wiki/Rubber_banding -- Best Wishes, Ashutosh Bapat
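The "rubber banding" scenario described above can be sketched as a small simulation (illustrative Python, not PostgreSQL code; LSNs and values are made up). The worker applies non-transactional sequence changes, restarts with the origin still at an old LSN, and replays them - so an observer on the subscriber sees the sequence jump back and then forward again.

```python
# Toy model of "rubber banding": non-transactional sequence changes
# replayed after a subscription restart temporarily move the
# subscriber's view of the sequence backward.

seq_changes = [(10, 100), (20, 200), (30, 300)]   # (lsn, last_value)

def apply_stream(start_lsn, current=0, observed=None):
    observed = observed if observed is not None else []
    for lsn, value in seq_changes:
        if lsn <= start_lsn:
            continue                # skipped per START_REPLICATION position
        current = value
        observed.append(value)
    return current, observed

# first attempt: all three changes applied, then the worker restarts with
# the origin still at LSN 10 (not advanced for non-transactional changes)
_, observed = apply_stream(start_lsn=0)
assert observed == [100, 200, 300]

# replay after restart: the sequence "rubber bands" from 300 back to 200
_, observed = apply_stream(start_lsn=10, current=300, observed=observed)
assert observed == [100, 200, 300, 200, 300]
```

As the thread notes, this is visually odd but does not by itself produce conflicting values: the replayed values never exceed what was already handed out.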
Re: logical decoding and replication of sequences, take 2
On 8/1/23 04:59, Amit Kapila wrote: > On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra > wrote: >> >> On 7/31/23 11:25, Amit Kapila wrote: >>> ... >>> >>> Yeah, I also think this needs a review. This is a sort of new concept >>> where we don't use the LSN of the slot (for cases where copy returned >>> a larger value of LSN) or a full_snapshot created corresponding to the >>> sync slot by Walsender. For the case of the table, we build a full >>> snapshot because we use that for copying the table but why do we need >>> to build that for copying the sequence especially when we directly >>> copy it from the sequence relation without caring for any snapshot? >>> >> >> We need the slot to decode/apply changes during catchup. The main >> subscription may get ahead, and we need to ensure the WAL is not >> discarded or something like that. This applies even if the initial sync >> step does not use the slot/snapshot directly. >> > > AFAIK, none of these needs a full_snapshot (see usage of > SnapBuild->building_full_snapshot). The full_snapshot tracks both > catalog and non-catalog xacts in the snapshot where we require to > track non-catalog ones because we want to copy the table using that > snapshot. It is relatively expensive to build a full snapshot and we > don't do that unless it is required. For the current usage of this > patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient. > Yeah, you may be right we don't need a full snapshot, because we don't need to export it. We however still need a snapshot, and it wasn't clear to me whether you suggest we don't need the slot / snapshot at all. Anyway, I think this is "just" a matter of efficiency, not correctness. IMHO there are bigger questions regarding the "going back" behavior after apply restart. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra wrote: > > On 7/31/23 11:25, Amit Kapila wrote: > > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra > > wrote: > >> > >> On 7/28/23 14:44, Ashutosh Bapat wrote: > >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > >>> wrote: > > Anyway, I was thinking about this a bit more, and it seems it's not as > difficult to use the page LSN to ensure sequences don't go backwards. > The 0005 change does that, by: > > 1) adding pg_sequence_state, that returns both the sequence state and > the page LSN > > 2) copy_sequence returns the page LSN > > 3) tablesync then sets this LSN as origin_startpos (which for tables is > just the LSN of the replication slot) > > AFAICS this makes it work - we start decoding at the page LSN, so that > we skip the increments that could lead to the sequence going backwards. > > >>> > >>> I like this design very much. It makes things simpler than complex. > >>> Thanks for doing this. > >>> > >> > >> I agree it seems simpler. It'd be good to try testing / reviewing it a > >> bit more, so that it doesn't misbehave in some way. > >> > > > > Yeah, I also think this needs a review. This is a sort of new concept > > where we don't use the LSN of the slot (for cases where copy returned > > a larger value of LSN) or a full_snapshot created corresponding to the > > sync slot by Walsender. For the case of the table, we build a full > > snapshot because we use that for copying the table but why do we need > > to build that for copying the sequence especially when we directly > > copy it from the sequence relation without caring for any snapshot? > > > > We need the slot to decode/apply changes during catchup. The main > subscription may get ahead, and we need to ensure the WAL is not > discarded or something like that. This applies even if the initial sync > step does not use the slot/snapshot directly. > AFAIK, none of these needs a full_snapshot (see usage of SnapBuild->building_full_snapshot). 
The full_snapshot tracks both catalog and non-catalog xacts in the snapshot where we require to track non-catalog ones because we want to copy the table using that snapshot. It is relatively expensive to build a full snapshot and we don't do that unless it is required. For the current usage of this patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On 7/31/23 11:25, Amit Kapila wrote: > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra > wrote: >> >> On 7/28/23 14:44, Ashutosh Bapat wrote: >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra >>> wrote: Anyway, I was thinking about this a bit more, and it seems it's not as difficult to use the page LSN to ensure sequences don't go backwards. The 0005 change does that, by: 1) adding pg_sequence_state, that returns both the sequence state and the page LSN 2) copy_sequence returns the page LSN 3) tablesync then sets this LSN as origin_startpos (which for tables is just the LSN of the replication slot) AFAICS this makes it work - we start decoding at the page LSN, so that we skip the increments that could lead to the sequence going backwards. >>> >>> I like this design very much. It makes things simpler than complex. >>> Thanks for doing this. >>> >> >> I agree it seems simpler. It'd be good to try testing / reviewing it a >> bit more, so that it doesn't misbehave in some way. >> > > Yeah, I also think this needs a review. This is a sort of new concept > where we don't use the LSN of the slot (for cases where copy returned > a larger value of LSN) or a full_snapshot created corresponding to the > sync slot by Walsender. For the case of the table, we build a full > snapshot because we use that for copying the table but why do we need > to build that for copying the sequence especially when we directly > copy it from the sequence relation without caring for any snapshot? > We need the slot to decode/apply changes during catchup. The main subscription may get ahead, and we need to ensure the WAL is not discarded or something like that. This applies even if the initial sync step does not use the slot/snapshot directly. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra wrote: > > On 7/28/23 14:44, Ashutosh Bapat wrote: > > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > > wrote: > >> > >> Anyway, I was thinking about this a bit more, and it seems it's not as > >> difficult to use the page LSN to ensure sequences don't go backwards. > >> The 0005 change does that, by: > >> > >> 1) adding pg_sequence_state, that returns both the sequence state and > >>the page LSN > >> > >> 2) copy_sequence returns the page LSN > >> > >> 3) tablesync then sets this LSN as origin_startpos (which for tables is > >>just the LSN of the replication slot) > >> > >> AFAICS this makes it work - we start decoding at the page LSN, so that > >> we skip the increments that could lead to the sequence going backwards. > >> > > > > I like this design very much. It makes things simpler than complex. > > Thanks for doing this. > > > > I agree it seems simpler. It'd be good to try testing / reviewing it a > bit more, so that it doesn't misbehave in some way. > Yeah, I also think this needs a review. This is a sort of new concept where we don't use the LSN of the slot (for cases where copy returned a larger value of LSN) or a full_snapshot created corresponding to the sync slot by Walsender. For the case of the table, we build a full snapshot because we use that for copying the table but why do we need to build that for copying the sequence especially when we directly copy it from the sequence relation without caring for any snapshot? -- With Regards, Amit Kapila.
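The page-LSN design quoted above (pg_sequence_state returns the page LSN, copy_sequence passes it along, tablesync uses it as origin_startpos) can be sketched as a toy simulation. This is illustrative Python, not the actual patch; the WAL contents and LSNs are invented. The invariant it shows is that increments already reflected in the copied sequence page are skipped during catchup, so the sequence never goes backwards.

```python
# Toy model of the page-LSN handoff: copy_sequence() returns the LSN of
# the sequence page it copied, tablesync uses it as origin_startpos, and
# decoding skips everything at or before that point.

wal = [(10, 100), (20, 200), (30, 300)]   # (lsn, sequence last_value)

def copy_sequence():
    # the copied state already reflects all increments up to LSN 20
    page_lsn, value = 20, 200
    return page_lsn, value

def catchup(origin_startpos, value):
    for lsn, v in wal:
        if lsn <= origin_startpos:
            continue              # already contained in the copied page
        assert v >= value         # never moves the sequence backward
        value = v
    return value

start, copied = copy_sequence()
assert catchup(start, copied) == 300
```

Starting decoding at the slot's LSN instead of the page LSN would replay the increment at LSN 10 or 20 on top of a state that already contains it, which is the backward movement the design avoids.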
Re: logical decoding and replication of sequences, take 2
On 7/29/23 06:54, Amit Kapila wrote: > On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra > wrote: >> >> On 7/28/23 11:42, Amit Kapila wrote: >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra >>> wrote: On 7/26/23 09:27, Amit Kapila wrote: > On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila > wrote: Anyway, I was thinking about this a bit more, and it seems it's not as difficult to use the page LSN to ensure sequences don't go backwards. >>> >>> While studying the changes for this proposal and related areas, I have >>> a few comments: >>> 1. I think you need to advance the origin if it is changed due to >>> copy_sequence(), otherwise, if the sync worker restarts after >>> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN >>> value. >>> >> >> True, we want to restart at the new origin_startpos. >> >>> 2. Between the time of SYNCDONE and READY state, the patch can skip >>> applying non-transactional sequence changes even if it should apply >>> it. The reason is that during that state change >>> should_apply_changes_for_rel() decides whether to apply change based >>> on the value of remote_final_lsn which won't be set for >>> non-transactional change. I think we need to send the start LSN of a >>> non-transactional record and then use that as remote_final_lsn for >>> such a change. >> >> Good catch. remote_final_lsn is set in apply_handle_begin, but that >> won't happen for sequences. We're already sending the LSN, but >> logicalrep_read_sequence ignores it - it should be enough to add it to >> LogicalRepSequence and then set it in apply_handle_sequence(). >> > > As per my understanding, the LSN sent is EndRecPtr of record which is > the beginning of the next record (means current_record_end + 1). For > comparing the current record, we use the start_position of the record > as we do when we use the remote_final_lsn via apply_handle_begin(). > >>> >>> 3. 
For non-transactional sequence change apply, we don't set >>> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as >>> we are doing in apply_handle_commit_internal() before calling >>> CommitTransactionCommand(). So, that can lead to the origin moving >>> backwards after restart which will lead to requesting and applying the >>> same changes again and for that period of time sequence can go >>> backwards. This needs some more thought as to what is the correct >>> behaviour/solution for this. >>> >> >> I think saying "origin moves backwards" is a bit misleading. AFAICS the >> origin position is not actually moving backwards, it's more that we >> don't (and can't) move it forwards for each non-transactional change. So >> yeah, we may re-apply those, and IMHO that's expected - the sequence is >> allowed to be "ahead" on the subscriber. >> > > But, if this happens then for a period of time the sequence will go > backwards relative to what one would have observed before restart. > That is true, but is it really a problem? This whole sequence decoding thing was meant to allow logical failover - make sure that after switch to the subscriber, the sequences don't generate duplicate values. From this POV, the sequence going backwards (back to the confirmed origin position) is not an issue - it's still far enough (ahead of publisher). Is that great / ideal? No, I agree with that. But it was considered acceptable and good enough for the failover use case ... The only idea how to improve that is we could keep the non-transactional changes (instead of applying them immediately), and then apply them on the nearest "commit". That'd mean it's subject to the position tracking, and the sequence would not go backwards, I think. So every time we decode a commit, we'd check if we decoded any sequence changes since the last commit, and merge them (a bit like a subxact). This would however also mean sequence changes from rolled-back xacts may not be replicated.
I think that'd be fine, but IIRC Andres suggested it's a valid use case. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
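The "apply on the nearest commit" idea above can be sketched as a toy simulation (illustrative Python, not the patch; names invented). Non-transactional sequence changes are buffered and flushed together with the next decoded commit, so the origin LSN - which does advance at commit - covers them and a restart does not replay them.

```python
# Toy model of buffering non-transactional sequence changes and merging
# them into the next commit (a bit like a subxact), so they fall under
# the origin position tracking.

class Applier:
    def __init__(self):
        self.pending = []     # buffered sequence changes
        self.applied = []
        self.origin_lsn = 0

    def sequence_change(self, lsn, value):
        self.pending.append((lsn, value))   # not applied immediately

    def commit(self, commit_lsn):
        self.applied.extend(self.pending)   # merged into the commit
        self.pending.clear()
        self.origin_lsn = commit_lsn        # now covers the merged changes

a = Applier()
a.sequence_change(10, 100)
a.sequence_change(20, 200)
a.commit(25)
assert a.origin_lsn == 25
assert a.applied == [(10, 100), (20, 200)]
# a restart now requests only changes after LSN 25 -> no replay
```

The trade-off is visible in the model too: changes still sitting in `pending` when the transaction rolls back are simply dropped, which is the rolled-back-xact caveat mentioned above.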
Re: logical decoding and replication of sequences, take 2
On 7/28/23 14:44, Ashutosh Bapat wrote: > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > wrote: >> >> Anyway, I was thinking about this a bit more, and it seems it's not as >> difficult to use the page LSN to ensure sequences don't go backwards. >> The 0005 change does that, by: >> >> 1) adding pg_sequence_state, that returns both the sequence state and >>the page LSN >> >> 2) copy_sequence returns the page LSN >> >> 3) tablesync then sets this LSN as origin_startpos (which for tables is >>just the LSN of the replication slot) >> >> AFAICS this makes it work - we start decoding at the page LSN, so that >> we skip the increments that could lead to the sequence going backwards. >> > > I like this design very much. It makes things simpler than complex. > Thanks for doing this. > I agree it seems simpler. It'd be good to try testing / reviewing it a bit more, so that it doesn't misbehave in some way. > I am wondering whether we could reuse pg_sequence_last_value() instead > of adding a new function. But the name of the function doesn't leave > much space for expanding its functionality. So we are good with a new > one. Probably some code deduplication. > I don't think we should do that - the pg_sequence_last_value() function is meant to do something different, and I don't think making it also do what pg_sequence_state() does would make anything simpler. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra wrote: > > On 7/28/23 11:42, Amit Kapila wrote: > > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > > wrote: > >> > >> On 7/26/23 09:27, Amit Kapila wrote: > >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila > >>> wrote: > >> > >> Anyway, I was thinking about this a bit more, and it seems it's not as > >> difficult to use the page LSN to ensure sequences don't go backwards. > >> > > > > While studying the changes for this proposal and related areas, I have > > a few comments: > > 1. I think you need to advance the origin if it is changed due to > > copy_sequence(), otherwise, if the sync worker restarts after > > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN > > value. > > > > True, we want to restart at the new origin_startpos. > > > 2. Between the time of SYNCDONE and READY state, the patch can skip > > applying non-transactional sequence changes even if it should apply > > it. The reason is that during that state change > > should_apply_changes_for_rel() decides whether to apply change based > > on the value of remote_final_lsn which won't be set for > > non-transactional change. I think we need to send the start LSN of a > > non-transactional record and then use that as remote_final_lsn for > > such a change. > > Good catch. remote_final_lsn is set in apply_handle_begin, but that > won't happen for sequences. We're already sending the LSN, but > logicalrep_read_sequence ignores it - it should be enough to add it to > LogicalRepSequence and then set it in apply_handle_sequence(). > As per my understanding, the LSN sent is EndRecPtr of record which is the beginning of the next record (means current_record_end + 1). For comparing the current record, we use the start_position of the record as we do when we use the remote_final_lsn via apply_handle_begin(). > > > > 3. 
For non-transactional sequence change apply, we don't set > > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as > > we are doing in apply_handle_commit_internal() before calling > > CommitTransactionCommand(). So, that can lead to the origin moving > > backwards after restart which will lead to requesting and applying the > > same changes again and for that period of time sequence can go > > backwards. This needs some more thought as to what is the correct > > behaviour/solution for this. > > > > I think saying "origin moves backwards" is a bit misleading. AFAICS the > origin position is not actually moving backwards, it's more that we > don't (and can't) move it forwards for each non-transactional change. So > yeah, we may re-apply those, and IMHO that's expected - the sequence is > allowed to be "ahead" on the subscriber. > But, if this happens then for a period of time the sequence will go backwards relative to what one would have observed before restart. -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences, take 2
On 7/28/23 14:35, Ashutosh Bapat wrote: > On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra > wrote: >> >> On 7/24/23 14:57, Ashutosh Bapat wrote: >>> ... >>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global. I was thinking maybe we should have it in the transaction (because we need to do cleanup at the end). It seem a bit inconvenient, because then we'd need to either search htabs in all subxacts, or transfer the entries to the top-level xact (otoh, we already do that with snapshots), and cleanup on abort. What do you think? >>> >>> Hash table per transaction seems saner design. Adding it to the top >>> level transaction should be fine. The entry will contain an XID >>> anyway. If we add it to every subtransaction we will need to search >>> hash table in each of the subtransactions when deciding whether a >>> sequence change is transactional or not. Top transaction is a >>> reasonable trade off. >>> >> >> It's not clear to me what design you're proposing, exactly. >> >> If we track it in top-level transactions, then we'd need copy the data >> whenever a transaction is assigned as a child, and perhaps also remove >> it when there's a subxact abort. > > I thought, esp. with your changes to assign xid, we will always know > the top level transaction when a sequence is assigned a relfilenode. > So the refilenodes will always get added to the correct hash directly. > I didn't imagine a case where we will need to copy the hash table from > sub-transaction to top transaction. If that's true, yes it's > inconvenient. > Well, it's a matter of efficiency. To check if a sequence change is transactional, we need to check if it's for a relfilenode created in the current transaction (it can't be for relfilenode created in a concurrent top-level transaction, due to MVCC). If you don't copy the entries into the top-level xact, you have to walk all subxacts and search all of those, for each sequence change. 
And there may be quite a few of both subxacts and sequence changes ... I wonder whether we also need to search the other top-level xacts - we probably do, because it might be a subxact without an assignment, or something like that. > As to the abort, don't we already remove entries on subtxn abort? > Having per transaction hash table doesn't seem to change anything > much. > What entries are we removing? My point is that if we copy the entries to the top-level xact, we probably need to remove them on abort. Or we could leave them in the top-level xact hash. >> >> And we'd need to still search the hashes in all toplevel transactions on >> every sequence increment - in principle we can't have increment for a >> sequence created in another in-progress transaction, but maybe it's just >> not assigned yet. > We hold a strong lock on sequence when changing its relfilenode. The > sequence whose relfilenode is being changed can not be accessed by any > concurrent transaction. So I am not able to understand what you are > trying to say. > How do you know the subxact has already been recognized as such? It may be treated as top-level transaction for a while, until the assignment. > I think per (top level) transaction hash table is cleaner design. It > puts the hash table where it should be. But if that makes code > difficult, current design works too. > regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
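The lookup being debated can be sketched as a toy simulation (illustrative Python, not reorderbuffer code; names invented). A sequence change is transactional iff the sequence's relfilenode was created in the current top-level transaction; if subxact entries are transferred up to the top-level xact on assignment, the check is a single hash probe instead of a walk over all subxacts.

```python
# Toy model of the transactional-change test: a sequence change is
# transactional iff its relfilenode was created by the current
# (top-level) transaction.

class TopXact:
    def __init__(self):
        self.new_relfilenodes = set()   # entries copied up from subxacts

    def assign_subxact(self, subxact_relfilenodes):
        # on xid assignment, transfer entries to the top-level xact
        # (similar to how snapshots are transferred)
        self.new_relfilenodes |= subxact_relfilenodes

    def change_is_transactional(self, relfilenode):
        return relfilenode in self.new_relfilenodes   # single hash probe

top = TopXact()
top.assign_subxact({1234})            # a subxact created relfilenode 1234
assert top.change_is_transactional(1234) is True
assert top.change_is_transactional(5678) is False   # non-transactional
```

The open question in the thread maps to the model directly: before the assignment happens, the subxact's entries are not in `top.new_relfilenodes` yet, which is why other (apparently top-level) xacts may still need to be searched.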
Re: logical decoding and replication of sequences, take 2
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra wrote: > > Anyway, I was thinking about this a bit more, and it seems it's not as > difficult to use the page LSN to ensure sequences don't go backwards. > The 0005 change does that, by: > > 1) adding pg_sequence_state, that returns both the sequence state and >the page LSN > > 2) copy_sequence returns the page LSN > > 3) tablesync then sets this LSN as origin_startpos (which for tables is >just the LSN of the replication slot) > > AFAICS this makes it work - we start decoding at the page LSN, so that > we skip the increments that could lead to the sequence going backwards. > I like this design very much. It makes things simpler than complex. Thanks for doing this. I am wondering whether we could reuse pg_sequence_last_value() instead of adding a new function. But the name of the function doesn't leave much space for expanding its functionality. So we are good with a new one. Probably some code deduplication. -- Best Wishes, Ashutosh Bapat
Re: logical decoding and replication of sequences, take 2
On 7/28/23 11:42, Amit Kapila wrote: > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > wrote: >> >> On 7/26/23 09:27, Amit Kapila wrote: >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila wrote: >> >> Anyway, I was thinking about this a bit more, and it seems it's not as >> difficult to use the page LSN to ensure sequences don't go backwards. >> > > While studying the changes for this proposal and related areas, I have > a few comments: > 1. I think you need to advance the origin if it is changed due to > copy_sequence(), otherwise, if the sync worker restarts after > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN > value. > True, we want to restart at the new origin_startpos. > 2. Between the time of SYNCDONE and READY state, the patch can skip > applying non-transactional sequence changes even if it should apply > it. The reason is that during that state change > should_apply_changes_for_rel() decides whether to apply change based > on the value of remote_final_lsn which won't be set for > non-transactional change. I think we need to send the start LSN of a > non-transactional record and then use that as remote_final_lsn for > such a change. Good catch. remote_final_lsn is set in apply_handle_begin, but that won't happen for sequences. We're already sending the LSN, but logicalrep_read_sequence ignores it - it should be enough to add it to LogicalRepSequence and then set it in apply_handle_sequence(). > > 3. For non-transactional sequence change apply, we don't set > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as > we are doing in apply_handle_commit_internal() before calling > CommitTransactionCommand(). So, that can lead to the origin moving > backwards after restart which will lead to requesting and applying the > same changes again and for that period of time sequence can go > backwards. This needs some more thought as to what is the correct > behaviour/solution for this. 
> I think saying "origin moves backwards" is a bit misleading. AFAICS the origin position is not actually moving backwards, it's more that we don't (and can't) move it forwards for each non-transactional change. So yeah, we may re-apply those, and IMHO that's expected - the sequence is allowed to be "ahead" on the subscriber. I don't see a way to improve this, except maybe having a separate LSN for non-transactional changes (for each origin). > 4. BTW, while checking this behaviour, I noticed that the initial sync > worker for sequence mentions the table in the LOG message: "LOG: > logical replication table synchronization worker for subscription > "mysub", table "s" has finished". Won't it be better here to refer to > it as a sequence? > Thanks, I'll fix that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra wrote: > > On 7/24/23 14:57, Ashutosh Bapat wrote: > > ... > > > >> > >> > >> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global. > >> I was thinking maybe we should have it in the transaction (because we > >> need to do cleanup at the end). It seem a bit inconvenient, because then > >> we'd need to either search htabs in all subxacts, or transfer the > >> entries to the top-level xact (otoh, we already do that with snapshots), > >> and cleanup on abort. > >> > >> What do you think? > > > > Hash table per transaction seems saner design. Adding it to the top > > level transaction should be fine. The entry will contain an XID > > anyway. If we add it to every subtransaction we will need to search > > hash table in each of the subtransactions when deciding whether a > > sequence change is transactional or not. Top transaction is a > > reasonable trade off. > > > > It's not clear to me what design you're proposing, exactly. > > If we track it in top-level transactions, then we'd need copy the data > whenever a transaction is assigned as a child, and perhaps also remove > it when there's a subxact abort. I thought, esp. with your changes to assign xid, we will always know the top level transaction when a sequence is assigned a relfilenode. So the refilenodes will always get added to the correct hash directly. I didn't imagine a case where we will need to copy the hash table from sub-transaction to top transaction. If that's true, yes it's inconvenient. As to the abort, don't we already remove entries on subtxn abort? Having per transaction hash table doesn't seem to change anything much. > > And we'd need to still search the hashes in all toplevel transactions on > every sequence increment - in principle we can't have increment for a > sequence created in another in-progress transaction, but maybe it's just > not assigned yet. We hold a strong lock on sequence when changing its relfilenode. 
The sequence whose relfilenode is being changed can not be accessed by any concurrent transaction. So I am not able to understand what you are trying to say. I think per (top level) transaction hash table is cleaner design. It puts the hash table where it should be. But if that makes code difficult, current design works too. -- Best Wishes, Ashutosh Bapat