Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Jun 17, 2025 at 5:17 AM Masahiko Sawada wrote: > > On Wed, Jun 11, 2025 at 9:50 PM shveta malik wrote: > > > > On Wed, Jun 11, 2025 at 11:31 PM Masahiko Sawada > > wrote: > > > > > > > > BTW have we addressed the point Amit mentioned before[1]? > > > > > > > The one more combination to consider is when someone takes a dump of > > > > an older version and loads it into a newer version. For example, where > > > > users dump from 17.5 and then restore in a newer version, say 17.6 > > > > (which has our fix), the restore will fail due to newer restrictions > > > > added by this patch. Do we need to do anything about it? > > > > > > I think it could be a significant side-effect and we need to do > > > something about that. > > > > > > > After giving it more thought, we have an opinion that this > > side-effect/issue is unlikely to occur if users follow our > > documentation properly. > > > > The recommended approach for upgrading between minor versions is to > > shut down the server and replace the binaries. See 'To update between > > compatible versions' in [1]. > > > > Also it is recommended in docs that we use pg_dump from the newer > > version of PostgreSQL. See 'It is recommended that you use the > > pg_dump' in [2]. This particular recommendation is in the Upgrade doc. > > If needed, we can make a similar recommendation in any of our failover > > specific docs as well, mentioning this particular case. > > > > In brief, our overall understanding is that a) pg_dump is mainly used > > for major versions upgrade b) pg_dump of higher version is used. > > Please let us know if your understanding is different here. > > I agree that the main use case of pg_dump is major version upgrading > but it's not limited to that use case. There might be some users who > have taken backups using pg_dump for truly backup purposes. I'm not > sure we've had such compatibility breakage in a minor version release. > > > Beyond these steps, we could not find any better solution for the > > pointed case. But we are open to exploring and implementing any > > alternative solutions you may have. Feedback is most welcome here. > > I'll share alternative ideas if I come up. > There was another idea proposed in [1] in the beginning, quoting it here: - Another idea considered is to prevent the slot that enables two-phase decoding from being synced to standby. IOW, this means displaying the failover field as false in the view, if there is any possibility that transactions prepared before the two_phase_at position exist (e.g., if restart_lsn is less than two_phase_at). - I believe this approach has lesser limitations and maintains compatibility with the existing pg_dump workflow. The proposal is to show failover as false in pg_replication_slots when restart_lsn is earlier than two_phase_at. This scenario can only occur during the activation of two_phase. In PG17, since enabling two_phase via ALTER SUBSCRIPTION is not permitted, this situation effectively happens only at the creation of the subscription. At that point, if slot synchronization is attempted, it will not synchronize the slot. Synchronization (and creation of the new slot) will begin only once restart_lsn surpasses two_phase_at. The only limitation I can identify here is if a failover occurs during that brief window, the standby might not yet have the slot created. However, I believe the likelihood of this happening is quite low. Overall this looks a safer approach. Please let me know your thoughts on this. [1]: https://www.postgresql.org/message-id/OS0PR01MB57161D9BB5409F229564957994AD2%40OS0PR01MB5716.jpnprd01.prod.outlook.com thanks Shveta
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Jun 11, 2025 at 9:50 PM shveta malik wrote: > > On Wed, Jun 11, 2025 at 11:31 PM Masahiko Sawada > wrote: > > > > > BTW have we addressed the point Amit mentioned before[1]? > > > > > The one more combination to consider is when someone takes a dump of > > > an older version and loads it into a newer version. For example, where > > > users dump from 17.5 and then restore in a newer version, say 17.6 > > > (which has our fix), the restore will fail due to newer restrictions > > > added by this patch. Do we need to do anything about it? > > > > I think it could be a significant side-effect and we need to do > > something about that. > > > > After giving it more thought, we have an opinion that this > side-effect/issue is unlikely to occur if users follow our > documentation properly. > > The recommended approach for upgrading between minor versions is to > shut down the server and replace the binaries. See 'To update between > compatible versions' in [1]. > > Also it is recommended in docs that we use pg_dump from the newer > version of PostgreSQL. See 'It is recommended that you use the > pg_dump' in [2]. This particular recommendation is in the Upgrade doc. > If needed, we can make a similar recommendation in any of our failover > specific docs as well, mentioning this particular case. > > In brief, our overall understanding is that a) pg_dump is mainly used > for major versions upgrade b) pg_dump of higher version is used. > Please let us know if your understanding is different here. I agree that the main use case of pg_dump is major version upgrading but it's not limited to that use case. There might be some users who have taken backups using pg_dump for truly backup purposes. I'm not sure we've had such compatibility breakage in a minor version release. > Beyond these steps, we could not find any better solution for the > pointed case. But we are open to exploring and implementing any > alternative solutions you may have. Feedback is most welcome here. I'll share alternative ideas if I come up. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Jun 12, 2025 at 4:00 AM Peter Smith wrote: > > On Wed, Jun 11, 2025 at 8:16 PM Ajin Cherian wrote: > > > > On Fri, Jun 6, 2025 at 5:07 PM Nisha Moond wrote: > > > > > > > > > Attached v18 patch. > > > - patch-001: modified error messages as suggested above. > > > - patch-002: improved pg_dump docs as per Shveta's off-list suggestions. > > > > > > [1] > > > https://www.postgresql.org/message-id/CAA4eK1%2BB067G8mUJzKUEjc5KSkYq6z0utTaHey-qeRt%2BnZTNJg%40mail.gmail.com > > > > > > -- > > > Thanks, > > > Nisha > > > > Hi Nisha, > > > > Some comments. > > > > Patch 0001 > > 1. Commit message > > > > from "two_phase enables slots." to "two_phase enabled slots" > > > > 2. > > + > > + > > + When a subscription's > > + > linkend="sql-createsubscription-params-with-two-phase">two_phase > > + is in the pending state, setting failover to > > true is not > > + permitted. Once > > > > The term "subscription's two_phase" sounds a bit vague here. > > Referring https://www.postgresql.org/docs/17/sql-createsubscription.html > > and https://www.postgresql.org/docs/current/catalog-pg-subscription.html > > two_phase is referred to both as the "internal state" as well as the > > "mode" of the subscription. > > > > Probably rephrase it to - "When a subscription's internal two_phase > > mode is in the pending state, setting failover to true is not > > permitted. See column subtwophasestate of pg_subscription to know the > > actual internal state of the two-phase mode" (with appropriate tags > > and reference links) > > > > There seems a potential muddling of terms two_phase VS two-phase. > > AFAIK "two_phase" (with the underscore) is the name of the > subscription option; it might also be the name of an internal > structure member or catalog column name. Furthermore, when you are > referring to the option name I think it ought to be double-quoted for > clarity. > > But, unless you are referring specifically to the option/field/column > then IMO it should say two-phase (with the dash). > e.g. The "two_phase" option enables two-phase mode. > I agree with your suggestion to use two-phase when referring to the internal state of the subscription. I've updated the commit message and the respective doc (with reference link) accordingly in v19 patches. -- Thanks, Nisha
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Jun 10, 2025 at 2:29 PM shveta malik wrote: > > On Fri, Jun 6, 2025 at 12:37 PM Nisha Moond wrote: > > > > Attached v18 patch. > > - patch-001: modified error messages as suggested above. > > - patch-002: improved pg_dump docs as per Shveta's off-list suggestions. > > > > Thanks for the patches. Please find few comments: > Thanks for the review. > 1) > + * However, we allow this combination in binary upgrade mode, where > + * pg_upgrade guarantees that all slot changes are consumed and no > + * prepared transactions exist. > + */ > > Can we please mention on which system pg_upgrade ensures that no > prepared txn exists. Is it at the source system or target system or > both? The similar comment is there at other place too, please update > that as well. > pg_upgrade checks for prepared transactions on both source and target systems. I've updated the respective comments. ~~~ Attached v19 patches addressing all the comments in [1] and [2]. [1] https://www.postgresql.org/message-id/CAJpy0uCGE0fi4oMb9A_L27C-fJq%2BVpSO9ZbcVxW5s5F3%2BsX0Lw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFPTHDYXjq_%3Dr2ss6YMQBr%3DwbSfkH%3DuGJFrJzVpaW0eqPhy0RQ%40mail.gmail.com -- Thanks, Nisha From b6c1032a504f699256327160e5b9dfd99ae4f962 Mon Sep 17 00:00:00 2001 From: Nisha Moond Date: Tue, 3 Jun 2025 10:42:53 +0530 Subject: [PATCH v19 1/2] PG17 Approach 3 Fix slot synchronization for two-phase enabled slots. The issue is that the transactions prepared before two-phase decoding is enabled can fail to replicate to the subscriber after being committed on a promoted standby following a failover. This is because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without two_phase_at, the logical decoding might incorrectly identify prepared transaction as already replicated to the subscriber after promotion of standby server, causing them to be skipped. To prevent the risk of losing prepared transactions, we disallow enabling both failover and twophase during slot creation, but permits altering failover to true once ensured that slot's restart_lsn > two_phase_at. The fix enforces the following conditions: 1) Always disallow creating slots with two_phase=true and failover=true. 2) Always disallow creating subscriptions with (two_phase=true, failover=true). 3) Prevent altering the slot's failover to true if two_phase=true and restart_lsn is less than two_phase_at. Otherwise, allow changing failover to true. 4) Disallow altering subscription's failover to true when two-phase state is 'pending'. User can try altering failover again when two-phase state is moved to 'enabled'. --- contrib/test_decoding/expected/slot.out | 2 + contrib/test_decoding/sql/slot.sql| 1 + doc/src/sgml/func.sgml| 13 ++ doc/src/sgml/protocol.sgml| 17 +++ doc/src/sgml/ref/alter_subscription.sgml | 10 ++ doc/src/sgml/ref/create_subscription.sgml | 17 +++ src/backend/commands/subscriptioncmds.c | 28 src/backend/replication/logical/logical.c | 11 ++ src/backend/replication/logical/slotsync.c| 10 ++ src/backend/replication/slot.c| 30 + src/bin/pg_upgrade/t/003_logical_slots.pl | 32 - .../t/040_standby_failover_slots_sync.pl | 124 ++ src/test/regress/expected/subscription.out| 4 + src/test/regress/sql/subscription.sql | 4 + 14 files changed, 302 insertions(+), 1 deletion(-) diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 7de03c79f6f..87b28ad8d55 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -427,6 +427,8 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', ' SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); ERROR: cannot enable failover for a temporary replication slot +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); +ERROR: cannot enable both "failover" and "two_phase" options during replication slot creation SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); ?column? -- diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 580e3ae3bef..a89fe712ff6 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib/test_decoding/sql/slot.sql @@ -182,6 +182,7 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_slot', 'tes SELECT 'init' FROM pg_create_logical_replication_slot('failover_false_slot', 'test_decoding', false, false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', 'test_decoding', false, false); SELECT 'init' FROM pg_create_logical_replication_slot('fail
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Jun 11, 2025 at 11:31 PM Masahiko Sawada wrote: > > BTW have we addressed the point Amit mentioned before[1]? > > > The one more combination to consider is when someone takes a dump of > > an older version and loads it into a newer version. For example, where > > users dump from 17.5 and then restore in a newer version, say 17.6 > > (which has our fix), the restore will fail due to newer restrictions > > added by this patch. Do we need to do anything about it? > > I think it could be a significant side-effect and we need to do > something about that. > After giving it more thought, we have an opinion that this side-effect/issue is unlikely to occur if users follow our documentation properly. The recommended approach for upgrading between minor versions is to shut down the server and replace the binaries. See 'To update between compatible versions' in [1]. Also it is recommended in docs that we use pg_dump from the newer version of PostgreSQL. See 'It is recommended that you use the pg_dump' in [2]. This particular recommendation is in the Upgrade doc. If needed, we can make a similar recommendation in any of our failover specific docs as well, mentioning this particular case. In brief, our overall understanding is that a) pg_dump is mainly used for major versions upgrade b) pg_dump of higher version is used. Please let us know if your understanding is different here. Beyond these steps, we could not find any better solution for the pointed case. But we are open to exploring and implementing any alternative solutions you may have. Feedback is most welcome here. [1]: https://www.postgresql.org/docs/current/upgrading.html [2]: https://www.postgresql.org/docs/current/upgrading.html#UPGRADING-VIA-PGDUMPALL thanks Shveta
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Jun 11, 2025 at 8:16 PM Ajin Cherian wrote: > > On Fri, Jun 6, 2025 at 5:07 PM Nisha Moond wrote: > > > > > > Attached v18 patch. > > - patch-001: modified error messages as suggested above. > > - patch-002: improved pg_dump docs as per Shveta's off-list suggestions. > > > > [1] > > https://www.postgresql.org/message-id/CAA4eK1%2BB067G8mUJzKUEjc5KSkYq6z0utTaHey-qeRt%2BnZTNJg%40mail.gmail.com > > > > -- > > Thanks, > > Nisha > > Hi Nisha, > > Some comments. > > Patch 0001 > 1. Commit message > > from "two_phase enables slots." to "two_phase enabled slots" > > 2. > + > + > + When a subscription's > + linkend="sql-createsubscription-params-with-two-phase">two_phase > + is in the pending state, setting failover to > true is not > + permitted. Once > > The term "subscription's two_phase" sounds a bit vague here. > Referring https://www.postgresql.org/docs/17/sql-createsubscription.html > and https://www.postgresql.org/docs/current/catalog-pg-subscription.html > two_phase is referred to both as the "internal state" as well as the > "mode" of the subscription. > > Probably rephrase it to - "When a subscription's internal two_phase > mode is in the pending state, setting failover to true is not > permitted. See column subtwophasestate of pg_subscription to know the > actual internal state of the two-phase mode" (with appropriate tags > and reference links) > There seems a potential muddling of terms two_phase VS two-phase. AFAIK "two_phase" (with the underscore) is the name of the subscription option; it might also be the name of an internal structure member or catalog column name. Furthermore, when you are referring to the option name I think it ought to be double-quoted for clarity. But, unless you are referring specifically to the option/field/column then IMO it should say two-phase (with the dash). e.g. The "two_phase" option enables two-phase mode. == Kind Regards, Peter Smith. Fujitsu Australia
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Jun 6, 2025 at 12:07 AM Nisha Moond wrote: > > On Thu, Jun 5, 2025 at 3:26 PM Dilip Kumar wrote: > > > > On Thu, Jun 5, 2025 at 2:53 PM Dilip Kumar wrote: > > > > > > On Tue, Jun 3, 2025 at 11:05 AM Nisha Moond > > > wrote: > > > > > > > > > > > > Attached v17 patches. Added a top-up patch 0002 implementing the idea > > > > suggested by Amit above. > > > > > > I have started reviewing this, although I haven't done a complete > > > review yet, but I have a question on the fix we are trying to do, IIUC > > > we are disallowing to use 'two phase' and 'failover' options together > > > at the create slot time and now users has to create slot with one of > > > the option and later enable other option right (if user want to use > > > both options)? But don't you think it will affect usability? because > > > if a user wants to use both the options together then after creating > > > the slot they need to track when is the right time to enable the other > > > option? Not sure if anyone else has this concern or it's just me? > > > > > Some additional comments while quickly glancing at the patch > > > > Thank you for the review. > > > 1. > > + if (two_phase && !IsSyncingReplicationSlots() && !IsBinaryUpgrade) > > + ereport(ERROR, > > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > + errmsg("cannot enable both \"%s\" and \"%s\" options during > > replication slot creation", > > +"failover", "two_phase")); > > > > I think we should also give hints to retry later when a certain > > constraint is met? > > > > Since it’s not possible to create a slot with both options enabled, > this command will always fail. > > As there are no specific constraints to fulfill that would allow > creating a slot with both options enabled, the possible hints are: > - Suggesting to use ALTER_REPLICATION_SLOT command to alter failover > later. But it may not be appropriate here from the user's perspective. > - Recommending to use ALTER SUBSCRIPTION to enable failover later. > But not all slots are created for subscriptions, so this may not apply > in every case. > > Please let me know if you have any suggestions for a suitable hint we > could provide here. > > > Also this is hardcoded options "failover" and "two_phase" so why do we > > need to use %s for contruncting this error message? > > > > I’ve followed the error message guidelines to keep the messages > translator-friendly. The use of %s helps isolate keywords like GUCs or > sub-options so they remain untouched during translation. > > That said, it’s not a strict rule, usage often depends on developer or > committer judgment. After revisiting the docs and re-evaluating this > patch related files, I’ve adjusted the messages with the following in > mind: > - do not use "%s" when terms "failover" and "two-phase" can be used > as concepts and not subscription options. > - use "%s" when referring explicitly to options (or nearby code is so). > - wrap SQL commands in double quotes for clarity. > - since slot.c doesn’t use "%s" in similar contexts, I’ve updated the > message style there for consistency. > > Please let me know if you feel any of the messages could still be > improved further. > > > 2. > > +"failover", "two_phase"), > > + errhint("Use ALTER SUBSCRIPTION ... SET (failover = true) to enable > > \"%s\" after two_phase state is ready", > > + "failover")); > > > > Here we are using a mix of hardcoded string and formatted string, like > > for (failover = true) we hardcoded the "failover" whereas to enable > > \"%s\", we have > > used %s. Better to just directly use failover as we are not depending > > on any variable. Please look at other places as well, I see a few > > more places whereas > > we have used like this. > > > > Modified this message following the reasoning outlined above. > > > 3. + if (slot->data.failover) > > + ereport(ERROR, > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > + errmsg("cannot enable two-phase decoding for failover enabled slot > > \"%s\"", > > + NameStr(slot->data.name; > > > > So for a failover slot we can never enable two_phase, whereas for > > two_phase enabled slot we can enable failover? This seems confusing, > > no? > > > > This restriction is not introduced by this patch. > As Amit pointed out [1], this patch is specific to PG17, where > altering the two_phase option for a subscription is not permitted. > This capability has been added in PG18 onwards. > ~~~ > > Attached v18 patch. > - patch-001: modified error messages as suggested above. > - patch-002: improved pg_dump docs as per Shveta's off-list suggestions. Thank you for updating the patch. BTW have we addressed the point Amit mentioned before[1]? > The one more combination to consider is when someone takes a dump of > an older version and loads it into a newer version. For example, where > users dump from 17.5 and then restore in a newer version, say 17.6 > (which has our fix), the restore will fail due to newer restrictions > added by this patch. Do w
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Jun 6, 2025 at 5:07 PM Nisha Moond wrote: > > > Attached v18 patch. > - patch-001: modified error messages as suggested above. > - patch-002: improved pg_dump docs as per Shveta's off-list suggestions. > > [1] > https://www.postgresql.org/message-id/CAA4eK1%2BB067G8mUJzKUEjc5KSkYq6z0utTaHey-qeRt%2BnZTNJg%40mail.gmail.com > > -- > Thanks, > Nisha Hi Nisha, Some comments. Patch 0001 1. Commit message from "two_phase enables slots." to "two_phase enabled slots" 2. + + + When a subscription's + two_phase + is in the pending state, setting failover to true is not + permitted. Once The term "subscription's two_phase" sounds a bit vague here. Referring https://www.postgresql.org/docs/17/sql-createsubscription.html and https://www.postgresql.org/docs/current/catalog-pg-subscription.html two_phase is referred to both as the "internal state" as well as the "mode" of the subscription. Probably rephrase it to - "When a subscription's internal two_phase mode is in the pending state, setting failover to true is not permitted. See column subtwophasestate of pg_subscription to know the actual internal state of the two-phase mode" (with appropriate tags and reference links) Patch 0002 1. Commit message Typo: change "subscrption" to "subscription" regards, Ajin Cherian Fujitsu Australia
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Jun 6, 2025 at 12:37 PM Nisha Moond wrote: > > Attached v18 patch. > - patch-001: modified error messages as suggested above. > - patch-002: improved pg_dump docs as per Shveta's off-list suggestions. > Thanks for the patches. Please find few comments: 1) + * However, we allow this combination in binary upgrade mode, where + * pg_upgrade guarantees that all slot changes are consumed and no + * prepared transactions exist. + */ Can we please mention on which system pg_upgrade ensures that no prepared txn exists. Is it at the source system or target system or both? The similar comment is there at other place too, please update that as well. 2) + * The failover option cannot be set togather with two_phase during + * CREATE SUBSCRIPTION. It can be enabled later using ALTER + * SUBSCRIPTION after the subscription is restored. + */ togather->together 3) We have these 2 messages in slot.c and subscriptioncmds.c, we can have both in similar way to maintain consistency across the patch. Since we are using 'options' in the msg, thus one with %s looks better + errmsg("cannot enable both failover and two_phase options during replication slot creation")); + errmsg("cannot enable both \"%s\" and \"%s\" options at \"CREATE SUBSCRIPTION\"", +"failover", "two_phase"), 4) Can we please update patch-message for patch002 similar to doc to make it more clear. thanks Shveta
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Jun 5, 2025 at 3:59 PM Amit Kapila wrote: > > On Thu, Jun 5, 2025 at 2:53 PM Dilip Kumar wrote: > > > > On Tue, Jun 3, 2025 at 11:05 AM Nisha Moond > > wrote: > > > > > > > > > Attached v17 patches. Added a top-up patch 0002 implementing the idea > > > suggested by Amit above. > > > > I have started reviewing this, although I haven't done a complete > > review yet, but I have a question on the fix we are trying to do, IIUC > > we are disallowing to use 'two phase' and 'failover' options together > > at the create slot time and now users has to create slot with one of > > the option and later enable other option right (if user want to use > > both options)? But don't you think it will affect usability? because > > if a user wants to use both the options together then after creating > > the slot they need to track when is the right time to enable the other > > option? > > > > Note that this is the restriction for 17 only, as we can't think of a > better way to fix it. For HEAD, we have exposed two_phase_at via the > existing view to fix this bug, see 4868c96bc8. We can't do the same in > backbranches as discussed in email [1]. If you have any better ideas > to fix this in backbranch, then kindly let us know. Understood. On HEAD , we've exposed two_phase_at to fix this problem. It's unfortunate we can't back-port this solution. This means we have to impose new restrictions on what was previously allowed in older versions of PostgreSQL to avoid the issue. While it's not ideal, we don't have a better approach for those older branches right now. Thanks for clarifying. -- Regards, Dilip Kumar Google
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Jun 5, 2025 at 3:26 PM Dilip Kumar wrote: > > On Thu, Jun 5, 2025 at 2:53 PM Dilip Kumar wrote: > > > > On Tue, Jun 3, 2025 at 11:05 AM Nisha Moond > > wrote: > > > > > > > > > Attached v17 patches. Added a top-up patch 0002 implementing the idea > > > suggested by Amit above. > > > > I have started reviewing this, although I haven't done a complete > > review yet, but I have a question on the fix we are trying to do, IIUC > > we are disallowing to use 'two phase' and 'failover' options together > > at the create slot time and now users has to create slot with one of > > the option and later enable other option right (if user want to use > > both options)? But don't you think it will affect usability? because > > if a user wants to use both the options together then after creating > > the slot they need to track when is the right time to enable the other > > option? Not sure if anyone else has this concern or it's just me? > > > Some additional comments while quickly glancing at the patch > Thank you for the review. > 1. > + if (two_phase && !IsSyncingReplicationSlots() && !IsBinaryUpgrade) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot enable both \"%s\" and \"%s\" options during > replication slot creation", > +"failover", "two_phase")); > > I think we should also give hints to retry later when a certain > constraint is met? > Since it’s not possible to create a slot with both options enabled, this command will always fail. As there are no specific constraints to fulfill that would allow creating a slot with both options enabled, the possible hints are: - Suggesting to use ALTER_REPLICATION_SLOT command to alter failover later. But it may not be appropriate here from the user's perspective. - Recommending to use ALTER SUBSCRIPTION to enable failover later. But not all slots are created for subscriptions, so this may not apply in every case. Please let me know if you have any suggestions for a suitable hint we could provide here. > Also this is hardcoded options "failover" and "two_phase" so why do we > need to use %s for contruncting this error message? > I’ve followed the error message guidelines to keep the messages translator-friendly. The use of %s helps isolate keywords like GUCs or sub-options so they remain untouched during translation. That said, it’s not a strict rule, usage often depends on developer or committer judgment. After revisiting the docs and re-evaluating this patch related files, I’ve adjusted the messages with the following in mind: - do not use "%s" when terms "failover" and "two-phase" can be used as concepts and not subscription options. - use "%s" when referring explicitly to options (or nearby code is so). - wrap SQL commands in double quotes for clarity. - since slot.c doesn’t use "%s" in similar contexts, I’ve updated the message style there for consistency. Please let me know if you feel any of the messages could still be improved further. > 2. > +"failover", "two_phase"), > + errhint("Use ALTER SUBSCRIPTION ... SET (failover = true) to enable > \"%s\" after two_phase state is ready", > + "failover")); > > Here we are using a mix of hardcoded string and formatted string, like > for (failover = true) we hardcoded the "failover" whereas to enable > \"%s\", we have > used %s. Better to just directly use failover as we are not depending > on any variable. Please look at other places as well, I see a few > more places whereas > we have used like this. > Modified this message following the reasoning outlined above. > 3. + if (slot->data.failover) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot enable two-phase decoding for failover enabled slot \"%s\"", > + NameStr(slot->data.name; > > So for a failover slot we can never enable two_phase, whereas for > two_phase enabled slot we can enable failover? This seems confusing, > no? > This restriction is not introduced by this patch. As Amit pointed out [1], this patch is specific to PG17, where altering the two_phase option for a subscription is not permitted. This capability has been added in PG18 onwards. ~~~ Attached v18 patch. - patch-001: modified error messages as suggested above. - patch-002: improved pg_dump docs as per Shveta's off-list suggestions. [1] https://www.postgresql.org/message-id/CAA4eK1%2BB067G8mUJzKUEjc5KSkYq6z0utTaHey-qeRt%2BnZTNJg%40mail.gmail.com -- Thanks, Nisha v18-0001-PG17-Approach-3-Fix-slot-synchronization-for-two.patch Description: Binary data v18-0002-Fix-for-pg_dump.patch Description: Binary data
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Jun 4, 2025 at 11:16 PM Masahiko Sawada wrote: > > On Sun, Jun 1, 2025 at 10:25 PM Amit Kapila wrote: > > > > The one more combination to consider is when someone takes a dump of > > an older version and loads it into a newer version. For example, where > > users dump from 17.5 and then restore in a newer version, say 17.6 > > (which has our fix), the restore will fail due to newer restrictions > > added by this patch. Do we need to do anything about it? > > A valid concern. Implementing this change could potentially render > dumps created prior to version 17.5 incompatible with version 17.6 or > later, which seems a significant backwards incompatibility to me. Do > we have any precedence of such incompatibility? > As per the upgrade documentation[1]: - For major version upgrades, it is recommended to use the pg_dump (and pg_dumpall) from the newer version of Postgres. - For minor version upgrades, it is clearly mentioned to use binary replacement. i.e. - simply shut down the server, replace the binaries, and restart - as minor versions are binary-compatible and do not change the storage format. As per the pg_dump documentation[2]: "Because pg_dump is used to transfer data to newer versions of PostgreSQL, the output of pg_dump can be expected to load into PostgreSQL server versions newer than pg_dump's version." By looking at both the docs[1][2], we can draw the conclusion that the above statement from the pg_dump docs[2] is true for the major version and thus our fix should be safe in that context. [1] https://www.postgresql.org/docs/current/upgrading.html [2] https://www.postgresql.org/docs/14/app-pgdump.html -- Thanks, Nisha
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Jun 5, 2025 at 3:59 PM Amit Kapila wrote: > > On Thu, Jun 5, 2025 at 2:53 PM Dilip Kumar wrote: > > > > On Tue, Jun 3, 2025 at 11:05 AM Nisha Moond > > wrote: > > > > > > > > > Attached v17 patches. Added a top-up patch 0002 implementing the idea > > > suggested by Amit above. > > > > I have started reviewing this, although I haven't done a complete > > review yet, but I have a question on the fix we are trying to do, IIUC > > we are disallowing to use 'two phase' and 'failover' options together > > at the create slot time and now users has to create slot with one of > > the option and later enable other option right (if user want to use > > both options)? But don't you think it will affect usability? because > > if a user wants to use both the options together then after creating > > the slot they need to track when is the right time to enable the other > > option? > > > > Note that this is the restriction for 17 only, as we can't think of a > better way to fix it. For HEAD, we have exposed two_phase_at via the > existing view to fix this bug, see 4868c96bc8. We can't do the same in > backbranches as discussed in email [1]. If you have any better ideas > to fix this in backbranch, then kindly let us know. > Sure, I will look at the commit and get back, Thanks. -- Regards, Dilip Kumar Google
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Jun 5, 2025 at 2:53 PM Dilip Kumar wrote: > > On Tue, Jun 3, 2025 at 11:05 AM Nisha Moond wrote: > > > > > > Attached v17 patches. Added a top-up patch 0002 implementing the idea > > suggested by Amit above. > > I have started reviewing this, although I haven't done a complete > review yet, but I have a question on the fix we are trying to do, IIUC > we are disallowing to use 'two phase' and 'failover' options together > at the create slot time and now users has to create slot with one of > the option and later enable other option right (if user want to use > both options)? But don't you think it will affect usability? because > if a user wants to use both the options together then after creating > the slot they need to track when is the right time to enable the other > option? > Note that this is the restriction for 17 only, as we can't think of a better way to fix it. For HEAD, we have exposed two_phase_at via the existing view to fix this bug, see 4868c96bc8. We can't do the same in backbranches as discussed in email [1]. If you have any better ideas to fix this in backbranch, then kindly let us know. [1] - https://www.postgresql.org/message-id/CAA4eK1Jx7Ed_58%2BywXozmHvRRhAVO2Yfcyoi0e5PRexZn5A6Gw%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Jun 5, 2025 at 2:53 PM Dilip Kumar wrote: > > On Tue, Jun 3, 2025 at 11:05 AM Nisha Moond wrote: > > > > > > Attached v17 patches. Added a top-up patch 0002 implementing the idea > > suggested by Amit above. > > I have started reviewing this, although I haven't done a complete > review yet, but I have a question on the fix we are trying to do, IIUC > we are disallowing to use 'two phase' and 'failover' options together > at the create slot time and now users has to create slot with one of > the option and later enable other option right (if user want to use > both options)? But don't you think it will affect usability? because > if a user wants to use both the options together then after creating > the slot they need to track when is the right time to enable the other > option? Not sure if anyone else has this concern or it's just me? > Some additional comments while quickly glancing at the patch 1. + if (two_phase && !IsSyncingReplicationSlots() && !IsBinaryUpgrade) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable both \"%s\" and \"%s\" options during replication slot creation", +"failover", "two_phase")); I think we should also give hints to retry later when a certain constraint is met? Also this is hardcoded options "failover" and "two_phase" so why do we need to use %s for contruncting this error message? 2. +"failover", "two_phase"), + errhint("Use ALTER SUBSCRIPTION ... SET (failover = true) to enable \"%s\" after two_phase state is ready", + "failover")); Here we are using a mix of hardcoded string and formatted string, like for (failover = true) we hardcoded the "failover" whereas to enable \"%s\", we have used %s. Better to just directly use failover as we are not depending on any variable. Please look at other places as well, I see a few more places whereas we have used like this. 3. + if (slot->data.failover) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable two-phase decoding for failover enabled slot \"%s\"", + NameStr(slot->data.name; So for a failover slot we can never enable two_phase, whereas for two_phase enabled slot we can enable failover? This seems confusing, no? -- Regards, Dilip Kumar Google
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Jun 3, 2025 at 11:05 AM Nisha Moond wrote: > > > Attached v17 patches. Added a top-up patch 0002 implementing the idea > suggested by Amit above. I have started reviewing this, although I haven't done a complete review yet, but I have a question on the fix we are trying to do, IIUC we are disallowing to use 'two phase' and 'failover' options together at the create slot time and now users has to create slot with one of the option and later enable other option right (if user want to use both options)? But don't you think it will affect usability? because if a user wants to use both the options together then after creating the slot they need to track when is the right time to enable the other option? Not sure if anyone else has this concern or it's just me? -- Regards, Dilip Kumar Google
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Jun 4, 2025 at 10:46 AM Masahiko Sawada wrote: > > On Sun, Jun 1, 2025 at 10:25 PM Amit Kapila wrote: > > > > > > Yet another idea is to dump the subscription with two_phase = on and > > failover = false. We should do this when both options are 'true' > > during the dump. As we are documenting that we always dump > > createsubscription with connect as false and let users take care (see > > [1] (When dumping logical replication subscriptions ...)), a similar > > reasoning could be given for the failover flag. > > It probably would work for subscriptions but what about logical slots? > If we restore two_phase=on and failover=false, the user would have to > enable the failover via ALTER_REPLICATION_SLOT. I think we should > avoid asking users to execute the replication commands directly. > Sorry, I mixed manually creating a logical slot and restoring a dump, and I think we've already discussed that. Please ignore this comment. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Sun, Jun 1, 2025 at 10:25 PM Amit Kapila wrote: > > > Yet another idea is to dump the subscription with two_phase = on and > failover = false. We should do this when both options are 'true' > during the dump. As we are documenting that we always dump > createsubscription with connect as false and let users take care (see > [1] (When dumping logical replication subscriptions ...)), a similar > reasoning could be given for the failover flag. It probably would work for subscriptions but what about logical slots? If we restore two_phase=on and failover=false, the user would have to enable the failover via ALTER_REPLICATION_SLOT. I think we should avoid asking users to execute the replication commands directly. > The one more combination to consider is when someone takes a dump of > an older version and loads it into a newer version. For example, where > users dump from 17.5 and then restore in a newer version, say 17.6 > (which has our fix), the restore will fail due to newer restrictions > added by this patch. Do we need to do anything about it? A valid concern. Implementing this change could potentially render dumps created prior to version 17.5 incompatible with version 17.6 or later, which seems a significant backwards incompatibility to me. Do we have any precedence of such incompatibility? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Jun 2, 2025 at 10:55 AM Amit Kapila wrote: > > On Fri, May 30, 2025 at 3:00 PM Nisha Moond wrote: > > > > Agree that we need to cover the simple pg_dump and pg_restore with the > > patch. > > > > When pg_dump and pg_restore are used outside of pg_upgrade, there's no > > guarantee that the target system does not have any prepared > > transactions. In such cases, restoring a subscription with both > > two_phase and failover enabled could lead to the bug, so we should > > avoid allowing both options via pg_restore. > > > > Here are a few possible solutions: > > > > 1) Split the CREATE SUBSCRIPTION command of such subscriptions in > > pg_restore : > > first create the sub with two_phase: > > - CREATE SUBSCRIPTION ... with (two_phase = on) > > then enable failover > > - ALTER SUBSCRIPTION ... with (failover=true) > > > > This won't work because we always dump a subscription with the > 'connect' option as 'false'. As changing the subscription's failover > option, needs the connection above split won't work. > > > However, this approach adds complexity in pg_restore and may still > > fail if the slot's restart_lsn is earlier than two_phase_at. > > > > 2) Prevent dump of subscription in non-upgrade mode: > > Raise an error/warning during pg_dump when a subscription with both > > two_phase and failover is encountered (unless in binary upgrade mode). > > That is, either fail the pg_dump, or skip dumping subscriptions in > > such a case with a warning. > > We can include a helpful hint: > > "Disable failover for subscription before dumping and > > re-enable it after restore." > > > > 3) Dump such subscriptions with (two_phase=on, failover=on, > > create_slot=false) together. As pg_dump always sets "connect=false" > > for subscription, it is up to users to reactivate the subscription > > suitably. In this case, we can add some document to warn that it's the > > user's responsibility to ensure the remote slot is in a safe state. > > > > What do you think? > > > > Please let me know if you have any other suggestions for handling this > > in pg_dump/pg_restore. > > > > Yet another idea is to dump the subscription with two_phase = on and > failover = false. We should do this when both options are 'true' > during the dump. As we are documenting that we always dump > createsubscription with connect as false and let users take care (see > [1] (When dumping logical replication subscriptions ...)), a similar > reasoning could be given for the failover flag. > +1 > The one more combination to consider is when someone takes a dump of > an older version and loads it into a newer version. For example, where > users dump from 17.5 and then restore in a newer version, say 17.6 > (which has our fix), the restore will fail due to newer restrictions > added by this patch. Do we need to do anything about it? > > [1] : https://www.postgresql.org/docs/devel/app-pgdump.html > ~~~ Attached v17 patches. Added a top-up patch 0002 implementing the idea suggested by Amit above. Thanks, Nisha v17-0001-PG17-Approach-3-Fix-slot-synchronization-for-two.patch Description: Binary data v17-0002-Fix-for-pg_dump.patch Description: Binary data
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, May 30, 2025 at 3:00 PM Nisha Moond wrote: > > Agree that we need to cover the simple pg_dump and pg_restore with the patch. > > When pg_dump and pg_restore are used outside of pg_upgrade, there's no > guarantee that the target system does not have any prepared > transactions. In such cases, restoring a subscription with both > two_phase and failover enabled could lead to the bug, so we should > avoid allowing both options via pg_restore. > > Here are a few possible solutions: > > 1) Split the CREATE SUBSCRIPTION command of such subscriptions in pg_restore : > first create the sub with two_phase: > - CREATE SUBSCRIPTION ... with (two_phase = on) > then enable failover > - ALTER SUBSCRIPTION ... with (failover=true) > This won't work because we always dump a subscription with the 'connect' option as 'false'. As changing the subscription's failover option, needs the connection above split won't work. > However, this approach adds complexity in pg_restore and may still > fail if the slot's restart_lsn is earlier than two_phase_at. > > 2) Prevent dump of subscription in non-upgrade mode: > Raise an error/warning during pg_dump when a subscription with both > two_phase and failover is encountered (unless in binary upgrade mode). > That is, either fail the pg_dump, or skip dumping subscriptions in > such a case with a warning. > We can include a helpful hint: > "Disable failover for subscription before dumping and > re-enable it after restore." > > 3) Dump such subscriptions with (two_phase=on, failover=on, > create_slot=false) together. As pg_dump always sets "connect=false" > for subscription, it is up to users to reactivate the subscription > suitably. In this case, we can add some document to warn that it's the > user's responsibility to ensure the remote slot is in a safe state. > > What do you think? > > Please let me know if you have any other suggestions for handling this > in pg_dump/pg_restore. > Yet another idea is to dump the subscription with two_phase = on and failover = false. We should do this when both options are 'true' during the dump. As we are documenting that we always dump createsubscription with connect as false and let users take care (see [1] (When dumping logical replication subscriptions ...)), a similar reasoning could be given for the failover flag. The one more combination to consider is when someone takes a dump of an older version and loads it into a newer version. For example, where users dump from 17.5 and then restore in a newer version, say 17.6 (which has our fix), the restore will fail due to newer restrictions added by this patch. Do we need to do anything about it? [1] : https://www.postgresql.org/docs/devel/app-pgdump.html -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, May 30, 2025 at 12:55 AM Masahiko Sawada wrote: > > On Thu, May 29, 2025 at 2:00 AM Nisha Moond wrote: > > > > On Wed, May 28, 2025 at 6:10 AM Masahiko Sawada > > wrote: > > > > > > On Sun, May 25, 2025 at 11:34 PM Nisha Moond > > > wrote: > > > > > > > > to > > > > On Fri, May 23, 2025 at 10:06 PM Masahiko Sawada > > > > wrote: > > > > > > > > > > > > > > > Here are review comments for v14 patch: > > > > > > > > > > > > > Thank you for the review. > > > > > > > > > I think we need to include a basic test case where we simply create a > > > > > subscription with two_phase=true and then enable the failover via > > > > > ALTER SUBSCRIPTION after making sure that slot's restart_lsn passed > > > > > two_phase_at. Probably if we use this case in 003_logical_slots.pl, we > > > > > can avoid creating regress_sub2? > > > > > > > > > > > > > A test has been added in 040_standby_failover_slots_sync.pl to enable > > > > failover via ALTER SUBSCRIPTION. > > > > > > Yes but the slot is originally created via SQL API, which seems > > > uncommon usage in practice. I thought it would be good to have the > > > basic steps in the tests to enable both fields. > > > > > > > Apologies for the earlier misunderstanding. I've updated > > 040_standby_failover_slots_sync.pl to modify the slot creation via the > > CREATE SUBSCRIPTION command as suggested. > > > > > > Regarding regress_sub2 in 003_logical_slots.pl : > > > > If we use ALTER SUBSCRIPTION to set failover=true on regress_sub, it > > > > leads to pg_upgrade failure, as it attempts to create a slot on the > > > > new_node(upgraded version) with both two_phase and failover enabled, > > > > which is an unsupported combination. > > > > > > I think that the pg_upgrade test should cover the case where it > > > restores logical slots with both fields enabled in the new cluster. > > > When I actually tried this case, I realized that pg_upgrade doesn't > > > handle this case; it tried to create the logical slot via SQL API but > > > it failed as we don't allow it to create a logical slot with enabling > > > both two_phase and failover. How can we handle this case? > > > > > > > On further analysis, it seems feasible and safe to allow creation of > > slot and subscription with both two_phase and failover enabled during > > upgrade (pg_upgrade). > > As before starting the upgrade, pg_upgrade ensures that - > > a) All slot changes have been fully consumed. > > b) No prepared transactions exist. > > Additionally, it is also documented[1] - "Obviously, no one should be > > accessing the clusters during the upgrade." > > > > Though, if allowed, the slot may be created with a restart_lsn < > > two_phase_at. IIUC, the guarantees above make it impossible for a > > prepared transaction to exist before two_phase_at, thus preventing the > > bug case. > > True, but I think we need to cover the simple pg_dump and pg_restore > case too, without pg_upgrade. Otherwise the dump file won't work if > the database has at least one subscription with two_phase and failover > enabled. > Agree that we need to cover the simple pg_dump and pg_restore with the patch. When pg_dump and pg_restore are used outside of pg_upgrade, there's no guarantee that the target system does not have any prepared transactions. In such cases, restoring a subscription with both two_phase and failover enabled could lead to the bug, so we should avoid allowing both options via pg_restore. Here are a few possible solutions: 1) Split the CREATE SUBSCRIPTION command of such subscriptions in pg_restore : first create the sub with two_phase: - CREATE SUBSCRIPTION ... with (two_phase = on) then enable failover - ALTER SUBSCRIPTION ... with (failover=true) However, this approach adds complexity in pg_restore and may still fail if the slot's restart_lsn is earlier than two_phase_at. 2) Prevent dump of subscription in non-upgrade mode: Raise an error/warning during pg_dump when a subscription with both two_phase and failover is encountered (unless in binary upgrade mode). That is, either fail the pg_dump, or skip dumping subscriptions in such a case with a warning. We can include a helpful hint: "Disable failover for subscription before dumping and re-enable it after restore." 3) Dump such subscriptions with (two_phase=on, failover=on, create_slot=false) together. As pg_dump always sets "connect=false" for subscription, it is up to users to reactivate the subscription suitably. In this case, we can add some document to warn that it's the user's responsibility to ensure the remote slot is in a safe state. What do you think? Please let me know if you have any other suggestions for handling this in pg_dump/pg_restore. -- Thanks & Regards, Nisha Moond
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, May 29, 2025 at 2:00 AM Nisha Moond wrote: > > On Wed, May 28, 2025 at 6:10 AM Masahiko Sawada wrote: > > > > On Sun, May 25, 2025 at 11:34 PM Nisha Moond > > wrote: > > > > > > to > > > On Fri, May 23, 2025 at 10:06 PM Masahiko Sawada > > > wrote: > > > > > > > > > > > > Here are review comments for v14 patch: > > > > > > > > > > Thank you for the review. > > > > > > > I think we need to include a basic test case where we simply create a > > > > subscription with two_phase=true and then enable the failover via > > > > ALTER SUBSCRIPTION after making sure that slot's restart_lsn passed > > > > two_phase_at. Probably if we use this case in 003_logical_slots.pl, we > > > > can avoid creating regress_sub2? > > > > > > > > > > A test has been added in 040_standby_failover_slots_sync.pl to enable > > > failover via ALTER SUBSCRIPTION. > > > > Yes but the slot is originally created via SQL API, which seems > > uncommon usage in practice. I thought it would be good to have the > > basic steps in the tests to enable both fields. > > > > Apologies for the earlier misunderstanding. I've updated > 040_standby_failover_slots_sync.pl to modify the slot creation via the > CREATE SUBSCRIPTION command as suggested. > > > > Regarding regress_sub2 in 003_logical_slots.pl : > > > If we use ALTER SUBSCRIPTION to set failover=true on regress_sub, it > > > leads to pg_upgrade failure, as it attempts to create a slot on the > > > new_node(upgraded version) with both two_phase and failover enabled, > > > which is an unsupported combination. > > > > I think that the pg_upgrade test should cover the case where it > > restores logical slots with both fields enabled in the new cluster. > > When I actually tried this case, I realized that pg_upgrade doesn't > > handle this case; it tried to create the logical slot via SQL API but > > it failed as we don't allow it to create a logical slot with enabling > > both two_phase and failover. How can we handle this case? > > > > On further analysis, it seems feasible and safe to allow creation of > slot and subscription with both two_phase and failover enabled during > upgrade (pg_upgrade). > As before starting the upgrade, pg_upgrade ensures that - > a) All slot changes have been fully consumed. > b) No prepared transactions exist. > Additionally, it is also documented[1] - "Obviously, no one should be > accessing the clusters during the upgrade." > > Though, if allowed, the slot may be created with a restart_lsn < > two_phase_at. IIUC, the guarantees above make it impossible for a > prepared transaction to exist before two_phase_at, thus preventing the > bug case. True, but I think we need to cover the simple pg_dump and pg_restore case too, without pg_upgrade. Otherwise the dump file won't work if the database has at least one subscription with two_phase and failover enabled. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, May 28, 2025 at 6:10 AM Masahiko Sawada wrote: > > On Sun, May 25, 2025 at 11:34 PM Nisha Moond wrote: > > > > to > > On Fri, May 23, 2025 at 10:06 PM Masahiko Sawada > > wrote: > > > > > > > > > Here are review comments for v14 patch: > > > > > > > Thank you for the review. > > > > > I think we need to include a basic test case where we simply create a > > > subscription with two_phase=true and then enable the failover via > > > ALTER SUBSCRIPTION after making sure that slot's restart_lsn passed > > > two_phase_at. Probably if we use this case in 003_logical_slots.pl, we > > > can avoid creating regress_sub2? > > > > > > > A test has been added in 040_standby_failover_slots_sync.pl to enable > > failover via ALTER SUBSCRIPTION. > > Yes but the slot is originally created via SQL API, which seems > uncommon usage in practice. I thought it would be good to have the > basic steps in the tests to enable both fields. > Apologies for the earlier misunderstanding. I've updated 040_standby_failover_slots_sync.pl to modify the slot creation via the CREATE SUBSCRIPTION command as suggested. > > Regarding regress_sub2 in 003_logical_slots.pl : > > If we use ALTER SUBSCRIPTION to set failover=true on regress_sub, it > > leads to pg_upgrade failure, as it attempts to create a slot on the > > new_node(upgraded version) with both two_phase and failover enabled, > > which is an unsupported combination. > > I think that the pg_upgrade test should cover the case where it > restores logical slots with both fields enabled in the new cluster. > When I actually tried this case, I realized that pg_upgrade doesn't > handle this case; it tried to create the logical slot via SQL API but > it failed as we don't allow it to create a logical slot with enabling > both two_phase and failover. How can we handle this case? > On further analysis, it seems feasible and safe to allow creation of slot and subscription with both two_phase and failover enabled during upgrade (pg_upgrade). As before starting the upgrade, pg_upgrade ensures that - a) All slot changes have been fully consumed. b) No prepared transactions exist. Additionally, it is also documented[1] - "Obviously, no one should be accessing the clusters during the upgrade." Though, if allowed, the slot may be created with a restart_lsn < two_phase_at. IIUC, the guarantees above make it impossible for a prepared transaction to exist before two_phase_at, thus preventing the bug case. Attached is the v16 patch, which includes the following changes: - Enabled creation of slots and subscriptions with both two_phase and failover options when in binary upgrade mode. - Updated 003_logical_slots.pl to configure regress_sub's failover option using ALTER SUBSCRIPTION, removing the need for the additional regress_sub2 subscription. [1] https://www.postgresql.org/docs/current/pgupgrade.html Thanks & Regards, Nisha Moond From 6284430ae19c53180472239ef095798c28ca9475 Mon Sep 17 00:00:00 2001 From: Nisha Moond Date: Fri, 9 May 2025 11:09:23 +0530 Subject: [PATCH v16] PG17 Approach 3 Fix slot synchronization for two_phase enables slots. The issue is that the transactions prepared before two-phase decoding is enabled can fail to replicate to the subscriber after being committed on a promoted standby following a failover. This is because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without two_phase_at, the logical decoding might incorrectly identify prepared transaction as already replicated to the subscriber after promotion of standby server, causing them to be skipped. To prevent the risk of losing prepared transactions, we disallow enabling both failover and twophase during slot creation, but permits altering failover to true once ensured that slot's restart_lsn > two_phase_at. The fix enforces the following conditions: 1) Always disallow creating slots with two_phase=true and failover=true. 2) Always disallow creating subscriptions with (two_phase=true, failover=true). 3) Prevent altering the slot's failover to true if two_phase=true and restart_lsn is less than two_phase_at. Otherwise, allow changing failover to true. 4) Disallow altering slot's failover to true when two_phase state is 'pending'. User can try altering failover again when two_phase state is moved beyond 'pending'. --- contrib/test_decoding/expected/slot.out | 2 + contrib/test_decoding/sql/slot.sql| 1 + doc/src/sgml/func.sgml| 13 ++ doc/src/sgml/protocol.sgml| 17 +++ doc/src/sgml/ref/alter_subscription.sgml | 12 ++ doc/src/sgml/ref/create_subscription.sgml | 17 +++ src/backend/commands/subscriptioncmds.c | 29 src/backend/replication/logical/logical.c | 11 ++ src/backend/replication/logical/slotsync.c| 10 ++ src/backend/replication/slot.c| 29 src/bin/pg_upgrade/t/003_logic
Re: Fix slot synchronization with two_phase decoding enabled
On Sun, May 25, 2025 at 11:34 PM Nisha Moond wrote: > > to > On Fri, May 23, 2025 at 10:06 PM Masahiko Sawada > wrote: > > > > > > Here are review comments for v14 patch: > > > > Thank you for the review. > > > I think we need to include a basic test case where we simply create a > > subscription with two_phase=true and then enable the failover via > > ALTER SUBSCRIPTION after making sure that slot's restart_lsn passed > > two_phase_at. Probably if we use this case in 003_logical_slots.pl, we > > can avoid creating regress_sub2? > > > > A test has been added in 040_standby_failover_slots_sync.pl to enable > failover via ALTER SUBSCRIPTION. Yes but the slot is originally created via SQL API, which seems uncommon usage in practice. I thought it would be good to have the basic steps in the tests to enable both fields. > Regarding regress_sub2 in 003_logical_slots.pl : > If we use ALTER SUBSCRIPTION to set failover=true on regress_sub, it > leads to pg_upgrade failure, as it attempts to create a slot on the > new_node(upgraded version) with both two_phase and failover enabled, > which is an unsupported combination. I think that the pg_upgrade test should cover the case where it restores logical slots with both fields enabled in the new cluster. When I actually tried this case, I realized that pg_upgrade doesn't handle this case; it tried to create the logical slot via SQL API but it failed as we don't allow it to create a logical slot with enabling both two_phase and failover. How can we handle this case? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
to On Fri, May 23, 2025 at 10:06 PM Masahiko Sawada wrote: > > > Here are review comments for v14 patch: > Thank you for the review. > I think we need to include a basic test case where we simply create a > subscription with two_phase=true and then enable the failover via > ALTER SUBSCRIPTION after making sure that slot's restart_lsn passed > two_phase_at. Probably if we use this case in 003_logical_slots.pl, we > can avoid creating regress_sub2? > A test has been added in 040_standby_failover_slots_sync.pl to enable failover via ALTER SUBSCRIPTION. Regarding regress_sub2 in 003_logical_slots.pl : If we use ALTER SUBSCRIPTION to set failover=true on regress_sub, it leads to pg_upgrade failure, as it attempts to create a slot on the new_node(upgraded version) with both two_phase and failover enabled, which is an unsupported combination. > --- > +# Advance the slot's restart_lsn to allow enabling the failover option > +# on a two_phase-enabled subscription using ALTER SUBSCRIPTION. > +$publisher->safe_psql( > +'postgres', qq( > +BEGIN; > +SELECT txid_current(); > +SELECT pg_log_standby_snapshot(); > +COMMIT; > +BEGIN; > +SELECT txid_current(); > +SELECT pg_log_standby_snapshot(); > +COMMIT; > +)); > + > +# Alter subscription to enable failover > +$subscriber1->psql( > +'postgres', > +"ALTER SUBSCRIPTION regress_mysub2 DISABLE; > + ALTER SUBSCRIPTION regress_mysub2 SET (failover = true);"); > > I think we need to wait for the subscription to consume these changes > before disabling the subscription. If the publisher doesn't consume > these WAL records for some reason, the subsequent "ALTER SUBSCRIPTION > ... SET (failover = true)" will fail. > Done. > --- > @@ -25,6 +25,8 @@ $publisher->init( > $publisher->append_conf('postgresql.conf', 'autovacuum = off'); > $publisher->start; > > +$publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int)"); > + > $publisher->safe_psql('postgres', > "CREATE PUBLICATION regress_mypub FOR ALL TABLES;"); > > @@ -42,6 +44,8 @@ my $slot_creation_time_on_primary = $publisher->safe_psql( > SELECT current_timestamp; > ]); > > +$subscriber1->safe_psql('postgres', "CREATE TABLE tab_full (a int)"); > + > # Create a subscription that enables failover. > $subscriber1->safe_psql('postgres', > "CREATE SUBSCRIPTION regress_mysub1 CONNECTION > '$publisher_connstr' PUBLICATION regress_mypub WITH (slot_name = > lsub1_slot, copy_data = false, failover = true, enabled = false);" > @@ -98,6 +102,114 @@ my ($result, $stdout, $stderr) = > $subscriber1->psql('postgres', > ok( $stderr =~ /ERROR: cannot set failover for enabled subscription/, > "altering failover is not allowed for enabled subscription"); > > I think it's better to create the tables before the new test starts > with a comment explaining why we need to create the table here. > Done. ~~~ Please find the updated patch v15, addressing above comments. -- Thanks, Nisha v15-0001-PG17-Approach-3-Fix-slot-synchronization-for-two.patch Description: Binary data
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, May 21, 2025 at 4:02 PM Masahiko Sawada wrote: > > On Fri, May 9, 2025 at 5:09 AM Nisha Moond wrote: > > > > On Thu, May 8, 2025 at 11:35 AM shveta malik wrote: > > > > > > On Wed, May 7, 2025 at 4:36 PM Nisha Moond > > > wrote: > > > > > > > > > > > > Attached is the v13 patch with the above comments addressed. > > > > > > > > -- > > > > > > Thanks for the patch. > > > > > > I think we can have the restriction mentioned under the 'two_phase' > > > section as well along with the 'failover' section in the CREATE > > > SUBSCRIPTION doc, similar to what we have in CREATE_REPLICATION_SLOT. > > > > > > Apart from this, I have no more comments. > > > > > > > Thanks for the review. > > Please find the v14 patch, updated with the above suggestion and > > rephrased changes in alter_subscription.sgml with the missing links. > > Thank you for updating the patch. I'm reviewing the v14 patch and will > share some comments tomorrow. Here are review comments for v14 patch: I think we need to include a basic test case where we simply create a subscription with two_phase=true and then enable the failover via ALTER SUBSCRIPTION after making sure that slot's restart_lsn passed two_phase_at. Probably if we use this case in 003_logical_slots.pl, we can avoid creating regress_sub2? --- +# Advance the slot's restart_lsn to allow enabling the failover option +# on a two_phase-enabled subscription using ALTER SUBSCRIPTION. +$publisher->safe_psql( +'postgres', qq( +BEGIN; +SELECT txid_current(); +SELECT pg_log_standby_snapshot(); +COMMIT; +BEGIN; +SELECT txid_current(); +SELECT pg_log_standby_snapshot(); +COMMIT; +)); + +# Alter subscription to enable failover +$subscriber1->psql( +'postgres', +"ALTER SUBSCRIPTION regress_mysub2 DISABLE; + ALTER SUBSCRIPTION regress_mysub2 SET (failover = true);"); I think we need to wait for the subscription to consume these changes before disabling the subscription. If the publisher doesn't consume these WAL records for some reason, the subsequent "ALTER SUBSCRIPTION ... SET (failover = true)" will fail. --- @@ -25,6 +25,8 @@ $publisher->init( $publisher->append_conf('postgresql.conf', 'autovacuum = off'); $publisher->start; +$publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int)"); + $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_mypub FOR ALL TABLES;"); @@ -42,6 +44,8 @@ my $slot_creation_time_on_primary = $publisher->safe_psql( SELECT current_timestamp; ]); +$subscriber1->safe_psql('postgres', "CREATE TABLE tab_full (a int)"); + # Create a subscription that enables failover. $subscriber1->safe_psql('postgres', "CREATE SUBSCRIPTION regress_mysub1 CONNECTION '$publisher_connstr' PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, copy_data = false, failover = true, enabled = false);" @@ -98,6 +102,114 @@ my ($result, $stdout, $stderr) = $subscriber1->psql('postgres', ok( $stderr =~ /ERROR: cannot set failover for enabled subscription/, "altering failover is not allowed for enabled subscription"); I think it's better to create the tables before the new test starts with a comment explaining why we need to create the table here. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, May 9, 2025 at 5:09 AM Nisha Moond wrote: > > On Thu, May 8, 2025 at 11:35 AM shveta malik wrote: > > > > On Wed, May 7, 2025 at 4:36 PM Nisha Moond wrote: > > > > > > > > > Attached is the v13 patch with the above comments addressed. > > > > > > -- > > > > Thanks for the patch. > > > > I think we can have the restriction mentioned under the 'two_phase' > > section as well along with the 'failover' section in the CREATE > > SUBSCRIPTION doc, similar to what we have in CREATE_REPLICATION_SLOT. > > > > Apart from this, I have no more comments. > > > > Thanks for the review. > Please find the v14 patch, updated with the above suggestion and > rephrased changes in alter_subscription.sgml with the missing links. Thank you for updating the patch. I'm reviewing the v14 patch and will share some comments tomorrow. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu) wrote: > > On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote: > > > > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada > > wrote: > > > > > > While I cannot be entirely certain of my analysis, I believe the root > > > cause might be related to the backward movement of the confirmed_flush > > > LSN. The following scenario seems possible: > > > > > > 1. The walsender enables the two_phase and sets two_phase_at (which > > > should be the same as confirmed_flush). > > > 2. The slot's confirmed_flush regresses for some reason. > > > 3. The slotsync worker retrieves the remote slot information and > > > enables two_phase for the local slot. > > > > > > > Yes, this is possible. Here is my theory as to how it can happen in the > > current > > case. In the failed test, after the primary has prepared a transaction, the > > transaction won't be replicated to the subscriber as two_phase was not > > enabled for the slot. However, subsequent keepalive messages can send the > > latest WAL location to the subscriber and get the confirmation of the same > > from > > the subscriber without its origin being moved. Now, after we restart the > > apply > > worker (due to disable/enable for a subscription), it will use the previous > > origin_lsn to temporarily move back the confirmed flush LSN as explained in > > one of the previous emails in another thread [1]. During this temporary > > movement of confirm flush LSN, the slotsync worker fetches the two_phase_at > > and confirm_flush_lsn values, leading to the assertion failure. We see this > > issue intermittently because it depends on the timing of slotsync worker's > > request to fetch the slot's value. > > Based on this theory, I can reproduce the BF failure in the 040 tap-test on > HEAD after applying the 0001 patch. This is achieved by using the injection > point to stop the walsender from sending a keepalive before receiving the old > origin position from the apply worker, ensuring the confirmed_flush > consistently moves backward before slotsync. > > Additionally, I've reproduced the duplicate data issue on HEAD without > slotsync > using the attached script (after applying the injection point patch). This > issue arises if we immediately disable the subscription after the > confirm_flush_lsn moves backward, preventing the walsender from advancing the > confirm_flush_lsn. > > In this case, if a prepared transaction exists before two_phase_at, then after > re-enabling the subscription, it will replicate that prepared transaction when > decoding the PREPARE record and replicate that again when decoding the COMMIT > PREPARED record. In such cases, the apply worker keeps reporting the error: > > ERROR: transaction identifier "pg_gid_16387_755" is already in use. > > Apart from above, we're investigating whether the same issue can occur in > back-branches and will share the results once ready. > The issue was confirmed to occur on back branches as well, due to confirmed_flush_lsn moving backward. It has now been fixed on HEAD and all supported back-branches down to PG13. For details, refer to the separate thread [1]; the fix was committed (commit: ad5eaf3)[2]. The BF failure has not occurred since the fix, but we’ll continue to keep an eye. [1] https://www.postgresql.org/message-id/CAJpy0uDZ29P=BYB1JDWMCh-6wXaNqMwG1u1mB4=10ly0x7h...@mail.gmail.com [2] https://github.com/postgres/postgres/commit/ad5eaf390c58294e2e4c1509aa87bf13261a5d15 -- Thanks, Nisha
Re: Fix slot synchronization with two_phase decoding enabled
On Sat, May 10, 2025 at 4:59 PM Amit Kapila wrote: > > On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu) > wrote: > > > > On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote: > > > > > > > > > Yes, this is possible. Here is my theory as to how it can happen in the > > > current > > > case. In the failed test, after the primary has prepared a transaction, > > > the > > > transaction won't be replicated to the subscriber as two_phase was not > > > enabled for the slot. However, subsequent keepalive messages can send the > > > latest WAL location to the subscriber and get the confirmation of the > > > same from > > > the subscriber without its origin being moved. Now, after we restart the > > > apply > > > worker (due to disable/enable for a subscription), it will use the > > > previous > > > origin_lsn to temporarily move back the confirmed flush LSN as explained > > > in > > > one of the previous emails in another thread [1]. During this temporary > > > movement of confirm flush LSN, the slotsync worker fetches the > > > two_phase_at > > > and confirm_flush_lsn values, leading to the assertion failure. We see > > > this > > > issue intermittently because it depends on the timing of slotsync worker's > > > request to fetch the slot's value. > > > > Based on this theory, I can reproduce the BF failure in the 040 tap-test on > > HEAD after applying the 0001 patch. This is achieved by using the injection > > point to stop the walsender from sending a keepalive before receiving the > > old > > origin position from the apply worker, ensuring the confirmed_flush > > consistently moves backward before slotsync. > > > > Additionally, I've reproduced the duplicate data issue on HEAD without > > slotsync > > using the attached script (after applying the injection point patch). This > > issue arises if we immediately disable the subscription after the > > confirm_flush_lsn moves backward, preventing the walsender from advancing > > the > > confirm_flush_lsn. > > > > Script contents: > psql -d postgres -p $port_primary -c "create extension > injection_points;SELECT injection_points_attach('process-replies', > 'wait');" > > psql -d postgres -p $port_subscriber -c "alter subscription sub set > (two_phase =on); alter subscription sub enable ;" > > sleep 1 > > psql -d postgres -p $port_subscriber -c "alter subscription sub disable;" > > I think what you said in the above paragraph is happening here. How > can walsender move back the confirm_flush_lsn backwards when it is > waiting due to the injection point? I think I am missing something > here. It would be good if you could add a few comments to your > scripts. > Please see my comments in the attached (updated) script. The testcase to reproduce the issue on HEAD is the same, only the comments have been added to elaborate the flow which moves confirmed_flush backward. thanks Shveta #!/bin/bash # The idea of script is to enable two_phase after Preparing a transaction # and then attempt to move confirmed_lsn backward (prior to two_phase_at) # resulting in replay of PREPARE twice. # Script achieves this by stoppign walsender using injection point when needed. port_primary=5433 port_secondary=5434 port_subscriber=5435 echo '==' echo '=Clean up=' echo '==' pg_ctl stop -D data_primary pg_ctl stop -D data_secondary pg_ctl stop -D data_subscriber rm -rf data_* *log echo '===' echo '=Set up primary server=' echo '===' initdb -D data_primary cat << EOF >> data_primary/postgresql.conf wal_level = logical port = $port_primary #standby_slot_names = 'physical' #log_replication_commands = 'on' #max_slot_wal_keep_size = 64kB max_wal_senders=550 max_worker_processes=1000 max_replication_slots=550 log_replication_commands = 'on' checkpoint_timeout = 1d shared_buffers = 6GB max_worker_processes = 32 max_parallel_maintenance_workers = 24 max_parallel_workers = 32 synchronous_commit = on checkpoint_timeout = 1d max_wal_size = 24GB min_wal_size = 15GB autovacuum = off wal_sender_timeout = 6000s wal_receiver_timeout = 6000s log_min_messages = 'debug1' max_prepared_transactions = 10 EOF pg_ctl -D data_primary start -w -l primary.log psql -d postgres -p $port_primary -c "SELECT * FROM pg_create_physical_replication_slot('physical');" echo '=' echo '=Set up secondary server=' echo '=' psql -d postgres -p $port_primary -c "CHECKPOINT;" pg_basebackup -D data_secondary -p $port_primary cat << EOF >> data_secondary/postgresql.conf port = $port_secondary primary_conninfo = 'port=$port_primary application_name=secondary dbname=postgres' #primary_conninfo = 'port=$port_primary application_name=secondary dbname=postgreis' primary_slot_name = 'physical' hot_standby = on hot_standby_feedback = on #sync_replication_slots = on #standby_slot_names = '' EOF cat << EOF >> data_secondary/standby.signal EOF pg_ctl -D data_secondary start -w -l sec.log psql
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu) wrote: > > On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote: > > > > > > Yes, this is possible. Here is my theory as to how it can happen in the > > current > > case. In the failed test, after the primary has prepared a transaction, the > > transaction won't be replicated to the subscriber as two_phase was not > > enabled for the slot. However, subsequent keepalive messages can send the > > latest WAL location to the subscriber and get the confirmation of the same > > from > > the subscriber without its origin being moved. Now, after we restart the > > apply > > worker (due to disable/enable for a subscription), it will use the previous > > origin_lsn to temporarily move back the confirmed flush LSN as explained in > > one of the previous emails in another thread [1]. During this temporary > > movement of confirm flush LSN, the slotsync worker fetches the two_phase_at > > and confirm_flush_lsn values, leading to the assertion failure. We see this > > issue intermittently because it depends on the timing of slotsync worker's > > request to fetch the slot's value. > > Based on this theory, I can reproduce the BF failure in the 040 tap-test on > HEAD after applying the 0001 patch. This is achieved by using the injection > point to stop the walsender from sending a keepalive before receiving the old > origin position from the apply worker, ensuring the confirmed_flush > consistently moves backward before slotsync. > > Additionally, I've reproduced the duplicate data issue on HEAD without > slotsync > using the attached script (after applying the injection point patch). This > issue arises if we immediately disable the subscription after the > confirm_flush_lsn moves backward, preventing the walsender from advancing the > confirm_flush_lsn. > Script contents: psql -d postgres -p $port_primary -c "create extension injection_points;SELECT injection_points_attach('process-replies', 'wait');" psql -d postgres -p $port_subscriber -c "alter subscription sub set (two_phase =on); alter subscription sub enable ;" sleep 1 psql -d postgres -p $port_subscriber -c "alter subscription sub disable;" I think what you said in the above paragraph is happening here. How can walsender move back the confirm_flush_lsn backwards when it is waiting due to the injection point? I think I am missing something here. It would be good if you could add a few comments to your scripts. -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, May 8, 2025 at 11:35 AM shveta malik wrote: > > On Wed, May 7, 2025 at 4:36 PM Nisha Moond wrote: > > > > > > Attached is the v13 patch with the above comments addressed. > > > > -- > > Thanks for the patch. > > I think we can have the restriction mentioned under the 'two_phase' > section as well along with the 'failover' section in the CREATE > SUBSCRIPTION doc, similar to what we have in CREATE_REPLICATION_SLOT. > > Apart from this, I have no more comments. > Thanks for the review. Please find the v14 patch, updated with the above suggestion and rephrased changes in alter_subscription.sgml with the missing links. -- Thanks, Nisha From c7b4f5dd22a2d629d88c7e34469bac7547c0e2fc Mon Sep 17 00:00:00 2001 From: Nisha Moond Date: Fri, 9 May 2025 11:09:23 +0530 Subject: [PATCH v14] PG17 Approach 3 Fix slot synchronization for two_phase enables slots. The issue is that the transactions prepared before two-phase decoding is enabled can fail to replicate to the subscriber after being committed on a promoted standby following a failover. This is because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without two_phase_at, the logical decoding might incorrectly identify prepared transaction as already replicated to the subscriber after promotion of standby server, causing them to be skipped. To prevent the risk of losing prepared transactions, we disallow enabling both failover and twophase during slot creation, but permits altering failover to true once ensured that slot's restart_lsn > two_phase_at. The fix enforces the following conditions: 1) Always disallow creating slots with two_phase=true and failover=true. 2) Always disallow creating subscriptions with (two_phase=true, failover=true). 3) Prevent altering the slot's failover to true if two_phase=true and restart_lsn is less than two_phase_at. Otherwise, allow changing failover to true. 4) Disallow altering slot's failover to true when two_phase state is 'pending'. User can try altering failover again when two_phase state is moved beyond 'pending'. --- contrib/test_decoding/expected/slot.out | 2 + contrib/test_decoding/sql/slot.sql| 1 + doc/src/sgml/func.sgml| 13 ++ doc/src/sgml/protocol.sgml| 17 +++ doc/src/sgml/ref/alter_subscription.sgml | 12 ++ doc/src/sgml/ref/create_subscription.sgml | 17 +++ src/backend/commands/subscriptioncmds.c | 25 src/backend/replication/logical/logical.c | 11 ++ src/backend/replication/logical/slotsync.c| 10 ++ src/backend/replication/slot.c| 27 + src/bin/pg_upgrade/t/003_logical_slots.pl | 50 ++-- .../t/040_standby_failover_slots_sync.pl | 112 ++ src/test/regress/expected/subscription.out| 4 + src/test/regress/sql/subscription.sql | 4 + 14 files changed, 294 insertions(+), 11 deletions(-) diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 7de03c79f6f..87b28ad8d55 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -427,6 +427,8 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', ' SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); ERROR: cannot enable failover for a temporary replication slot +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); +ERROR: cannot enable both "failover" and "two_phase" options during replication slot creation SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); ?column? -- diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 580e3ae3bef..a89fe712ff6 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib/test_decoding/sql/slot.sql @@ -182,6 +182,7 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_slot', 'tes SELECT 'init' FROM pg_create_logical_replication_slot('failover_false_slot', 'test_decoding', false, false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', 'test_decoding', false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); SELECT slot_name, slot_type, failover FROM pg_replication_slots; diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 697c1a02891..45504cf14d5 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -29076,6 +29076,19 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.settin
RE: Fix slot synchronization with two_phase decoding enabled
On Thu, May 8, 2025 at 6:04 PM Zhijie Hou (Fujitsu) wrote: > > On Tue, May 6, 2025 at 7:22 PM Zhijie Hou (Fujitsu) wrote: > > > > > On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote: > > > > > > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada > > > > > wrote: > > > > > > > > While I cannot be entirely certain of my analysis, I believe the > > > > root cause might be related to the backward movement of the > > > > confirmed_flush LSN. The following scenario seems possible: > > > > > > > > 1. The walsender enables the two_phase and sets two_phase_at > > > > (which should be the same as confirmed_flush). > > > > 2. The slot's confirmed_flush regresses for some reason. > > > > 3. The slotsync worker retrieves the remote slot information and > > > > enables two_phase for the local slot. > > > > > > > > > > Yes, this is possible. Here is my theory as to how it can happen > > > in the current case. In the failed test, after the primary has > > > prepared a transaction, the transaction won't be replicated to the > > > subscriber as two_phase was not enabled for the slot. However, > > > subsequent keepalive messages can send the latest WAL location to > > > the subscriber and get the confirmation of the same from the > > > subscriber without its origin being moved. Now, after we restart > > > the apply worker (due to disable/enable for a subscription), it > > > will use the previous origin_lsn to temporarily move back the > > > confirmed flush LSN as explained in one of the previous emails in another > > > thread [1]. > > > During this temporary movement of confirm flush LSN, the slotsync > > > worker fetches the two_phase_at and confirm_flush_lsn values, > > > leading to the assertion failure. We see this issue intermittently > > > because it depends on the > > timing of slotsync worker's request to fetch the slot's value. > > > > Based on this theory, I can reproduce the BF failure in the 040 > > tap-test on HEAD after applying the 0001 patch. This is achieved by > > using the injection point to stop the walsender from sending a > > keepalive before receiving the old origin position from the apply > > worker, ensuring the confirmed_flush consistently moves backward > > before > slotsync. > > > > Additionally, I've reproduced the duplicate data issue on HEAD > > without slotsync using the attached script (after applying the injection > > point patch). > > This issue arises if we immediately disable the subscription after > > the confirm_flush_lsn moves backward, preventing the walsender from > > advancing the confirm_flush_lsn. > > > > In this case, if a prepared transaction exists before two_phase_at, > > then after re-enabling the subscription, it will replicate that > > prepared transaction when decoding the PREPARE record and replicate > > that again when decoding the COMMIT PREPARED record. In such cases, > > the apply worker keeps reporting the error: > > > > ERROR: transaction identifier "pg_gid_16387_755" is already in use. > > > > Apart from above, we're investigating whether the same issue can > > occur in back-branches and will share the results once ready. > > I reproduced the duplicate data issue on PG17 as well using the > attached shell script. Since PG17 doesn’t allow altering the twophase > option, I created a subscription with two_phase=on and copy_data=on. I > prepared a transaction before the table synchronization was ready, at > a time when the slot's two_phase hadn't been set to true. This setup > can cause in the prepared transaction being replicated twice after > restarting the apply worker and the confirmed_flush_lsn move backwards. > > To ensure the origin position is initialized during table sync, I > inserted some data before the prepared transaction. I added injection > points(0001) to manage the table sync worker's process, allowing the > apply worker to replicate some changes and update the origin position > while table sync was ongoing. The above reproduction of the issue indicates that it has been present since at least PG15, when the twophase subscription option was introduced. I am currently investigating whether the issue occurs without the twophase option. If it does, the fix will need to be applied to all supported branches. I will share the results once they are available. Best Regards, Hou zj
RE: Fix slot synchronization with two_phase decoding enabled
On Tue, May 6, 2025 at 7:22 PM Zhijie Hou (Fujitsu) wrote: > > On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote: > > > > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada > > > wrote: > > > > > > While I cannot be entirely certain of my analysis, I believe the > > > root cause might be related to the backward movement of the > > > confirmed_flush LSN. The following scenario seems possible: > > > > > > 1. The walsender enables the two_phase and sets two_phase_at > > > (which should be the same as confirmed_flush). > > > 2. The slot's confirmed_flush regresses for some reason. > > > 3. The slotsync worker retrieves the remote slot information and > > > enables two_phase for the local slot. > > > > > > > Yes, this is possible. Here is my theory as to how it can happen in > > the current case. In the failed test, after the primary has prepared > > a transaction, the transaction won't be replicated to the subscriber > > as two_phase was not enabled for the slot. However, subsequent > > keepalive messages can send the latest WAL location to the > > subscriber and get the confirmation of the same from the subscriber > > without its origin being moved. Now, after we restart the apply > > worker (due to disable/enable for a subscription), it will use the > > previous origin_lsn to temporarily move back the confirmed flush LSN > > as explained in one of the previous emails in another thread [1]. > > During this temporary movement of confirm flush LSN, the slotsync > > worker fetches the two_phase_at and confirm_flush_lsn values, > > leading to the assertion failure. We see this issue intermittently > > because it depends on the > timing of slotsync worker's request to fetch the slot's value. > > Based on this theory, I can reproduce the BF failure in the 040 > tap-test on HEAD after applying the 0001 patch. This is achieved by > using the injection point to stop the walsender from sending a > keepalive before receiving the old origin position from the apply > worker, ensuring the confirmed_flush consistently moves backward before > slotsync. > > Additionally, I've reproduced the duplicate data issue on HEAD without > slotsync using the attached script (after applying the injection point patch). > This issue arises if we immediately disable the subscription after the > confirm_flush_lsn moves backward, preventing the walsender from > advancing the confirm_flush_lsn. > > In this case, if a prepared transaction exists before two_phase_at, > then after re-enabling the subscription, it will replicate that > prepared transaction when decoding the PREPARE record and replicate > that again when decoding the COMMIT PREPARED record. In such cases, > the apply worker keeps reporting the error: > > ERROR: transaction identifier "pg_gid_16387_755" is already in use. > > Apart from above, we're investigating whether the same issue can occur > in back-branches and will share the results once ready. I reproduced the duplicate data issue on PG17 as well using the attached shell script. Since PG17 doesn’t allow altering the twophase option, I created a subscription with two_phase=on and copy_data=on. I prepared a transaction before the table synchronization was ready, at a time when the slot's two_phase hadn't been set to true. This setup can cause in the prepared transaction being replicated twice after restarting the apply worker and the confirmed_flush_lsn move backwards. To ensure the origin position is initialized during table sync, I inserted some data before the prepared transaction. I added injection points(0001) to manage the table sync worker's process, allowing the apply worker to replicate some changes and update the origin position while table sync was ongoing. Best Regards, Hou zj #!/bin/bash port_primary=5432 port_secondary=5433 port_third=5434 port_secondary_2=5436 port_secondary_3=5437 port_subscriber=5434 port_subscriber2=5435 echo '==' echo '=Clean up=' echo '==' pg_ctl stop -D data_primary pg_ctl stop -D data_secondary pg_ctl stop -D data_third_ pg_ctl stop -D data_secondary_2 pg_ctl stop -D data_secondary_3 pg_ctl stop -D data_subscriber pg_ctl stop -D data_subscriber2 rm -rf data_* *log rm -rf /home/houzj/archivedir/* echo '===' echo '=Set up primary server=' echo '===' initdb -D data_primary cat << EOF >> data_primary/postgresql.conf wal_level = logical port = $port_primary #standby_slot_names = 'physical' #log_replication_commands = 'on' #max_slot_wal_keep_size = 64kB max_wal_senders=550 max_worker_processes=1000 max_replication_slots=550 log_replication_commands = 'on' checkpoint_timeout = 1d shared_buffers = 6GB max_worker_processes = 32 max_parallel_maintenance_workers = 24 max_parallel_workers = 32 synchronous_commit = on checkpoint_timeout = 1d max_wal_size = 24GB min_wal_size = 15GB autovacuum = off wal_sender_timeout = 6000s wal_receiver_timeout = 6000s #log_min_messages = 'debug2' #arch
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, May 7, 2025 at 4:36 PM Nisha Moond wrote: > > > Attached is the v13 patch with the above comments addressed. > > -- Thanks for the patch. I think we can have the restriction mentioned under the 'two_phase' section as well along with the 'failover' section in the CREATE SUBSCRIPTION doc, similar to what we have in CREATE_REPLICATION_SLOT. Apart from this, I have no more comments. thanks Shveta
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, May 7, 2025 at 10:32 AM shveta malik wrote: > > On Mon, May 5, 2025 at 3:29 PM Nisha Moond wrote: > > > > Please find the v12 patch with above suggested changes. > > > > Thanks for the patch. Few comments for doc changes: > > 1) > func.sgml - pg_create_logical_replication_slot: > > +failover. The parameters twophase and > +failover cannot be enabled together when > +creating a replication slot. However, a slot created with > twophase > +enabled can later have failover set to true > +either using ALTER_REPLICATION_SLOT or via > +ALTER SUBSCRIPTION if a subscription is using this > +slot. > > Will it be better if we make a NOTE for this? Also can we create links for > ALTER SUBSCRIPTION and > ALTER_REPLICATION_SLOT? > > 2) > protocol.sgml: > > + The default is false. This option cannot be set together with > + failover when creating a logical replication > slot. > + However, once the slot is created with two_phase = > true, > + failover can be set to true after the slot has > + consumed all the changes up to the point where two-phase decoding > + was enabled. > > Can we make a new paragraph for the new restriction added. Same for > other new restriction in this page. > > 3) > create_subscription.sgml: > > + cannot be enabled together when creating a subscription. However, a > + two_phase enabled subscription can later have > + failover set to true via ALTER > SUBSCRIPTION, > > via-->using > Attached is the v13 patch with the above comments addressed. -- Thanks, Nisha From fce2cdece5e16f11ec940ffd7bac34dc53e5f9da Mon Sep 17 00:00:00 2001 From: Nisha Moond Date: Mon, 7 Apr 2025 09:36:29 +0530 Subject: [PATCH v13] PG17 Approach 3 Fix slot synchronization for two_phase enables slots. The issue is that the transactions prepared before two-phase decoding is enabled can fail to replicate to the subscriber after being committed on a promoted standby following a failover. This is because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without two_phase_at, the logical decoding might incorrectly identify prepared transaction as already replicated to the subscriber after promotion of standby server, causing them to be skipped. To prevent the risk of losing prepared transactions, we disallow enabling both failover and twophase during slot creation, but permits altering failover to true once ensured that slot's restart_lsn > two_phase_at. The fix enforces the following conditions: 1) Always disallow creating slots with two_phase=true and failover=true. 2) Always disallow creating subscriptions with (two_phase=true, failover=true). 3) Prevent altering the slot's failover to true if two_phase=true and restart_lsn is less than two_phase_at. Otherwise, allow changing failover to true. 4) Disallow altering slot's failover to true when two_phase state is 'pending'. User can try altering failover again when two_phase state is moved beyond 'pending'. --- contrib/test_decoding/expected/slot.out | 2 + contrib/test_decoding/sql/slot.sql| 1 + doc/src/sgml/func.sgml| 13 ++ doc/src/sgml/protocol.sgml| 17 +++ doc/src/sgml/ref/alter_subscription.sgml | 8 ++ doc/src/sgml/ref/create_subscription.sgml | 9 ++ src/backend/commands/subscriptioncmds.c | 25 src/backend/replication/logical/logical.c | 11 ++ src/backend/replication/logical/slotsync.c| 10 ++ src/backend/replication/slot.c| 27 + src/bin/pg_upgrade/t/003_logical_slots.pl | 50 ++-- .../t/040_standby_failover_slots_sync.pl | 112 ++ src/test/regress/expected/subscription.out| 4 + src/test/regress/sql/subscription.sql | 4 + 14 files changed, 282 insertions(+), 11 deletions(-) diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 7de03c79f6f..87b28ad8d55 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -427,6 +427,8 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', ' SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); ERROR: cannot enable failover for a temporary replication slot +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); +ERROR: cannot enable both "failover" and "two_phase" options during replication slot creation SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); ?column? -- diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 580e3ae3bef..a89fe712ff6 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, May 5, 2025 at 3:29 PM Nisha Moond wrote: > > Please find the v12 patch with above suggested changes. > Thanks for the patch. Few comments for doc changes: 1) func.sgml - pg_create_logical_replication_slot: +failover. The parameters twophase and +failover cannot be enabled together when +creating a replication slot. However, a slot created with twophase +enabled can later have failover set to true +either using ALTER_REPLICATION_SLOT or via +ALTER SUBSCRIPTION if a subscription is using this +slot. Will it be better if we make a NOTE for this? Also can we create links for ALTER SUBSCRIPTION and ALTER_REPLICATION_SLOT? 2) protocol.sgml: + The default is false. This option cannot be set together with + failover when creating a logical replication slot. + However, once the slot is created with two_phase = true, + failover can be set to true after the slot has + consumed all the changes up to the point where two-phase decoding + was enabled. Can we make a new paragraph for the new restriction added. Same for other new restriction in this page. 3) create_subscription.sgml: + cannot be enabled together when creating a subscription. However, a + two_phase enabled subscription can later have + failover set to true via ALTER SUBSCRIPTION, via-->using thanks Shveta
RE: Fix slot synchronization with two_phase decoding enabled
On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote: > > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada > wrote: > > > > While I cannot be entirely certain of my analysis, I believe the root > > cause might be related to the backward movement of the confirmed_flush > > LSN. The following scenario seems possible: > > > > 1. The walsender enables the two_phase and sets two_phase_at (which > > should be the same as confirmed_flush). > > 2. The slot's confirmed_flush regresses for some reason. > > 3. The slotsync worker retrieves the remote slot information and > > enables two_phase for the local slot. > > > > Yes, this is possible. Here is my theory as to how it can happen in the > current > case. In the failed test, after the primary has prepared a transaction, the > transaction won't be replicated to the subscriber as two_phase was not > enabled for the slot. However, subsequent keepalive messages can send the > latest WAL location to the subscriber and get the confirmation of the same > from > the subscriber without its origin being moved. Now, after we restart the apply > worker (due to disable/enable for a subscription), it will use the previous > origin_lsn to temporarily move back the confirmed flush LSN as explained in > one of the previous emails in another thread [1]. During this temporary > movement of confirm flush LSN, the slotsync worker fetches the two_phase_at > and confirm_flush_lsn values, leading to the assertion failure. We see this > issue intermittently because it depends on the timing of slotsync worker's > request to fetch the slot's value. Based on this theory, I can reproduce the BF failure in the 040 tap-test on HEAD after applying the 0001 patch. This is achieved by using the injection point to stop the walsender from sending a keepalive before receiving the old origin position from the apply worker, ensuring the confirmed_flush consistently moves backward before slotsync. Additionally, I've reproduced the duplicate data issue on HEAD without slotsync using the attached script (after applying the injection point patch). This issue arises if we immediately disable the subscription after the confirm_flush_lsn moves backward, preventing the walsender from advancing the confirm_flush_lsn. In this case, if a prepared transaction exists before two_phase_at, then after re-enabling the subscription, it will replicate that prepared transaction when decoding the PREPARE record and replicate that again when decoding the COMMIT PREPARED record. In such cases, the apply worker keeps reporting the error: ERROR: transaction identifier "pg_gid_16387_755" is already in use. Apart from above, we're investigating whether the same issue can occur in back-branches and will share the results once ready. Best Regards, Hou zj #!/bin/bash port_primary=5432 port_secondary=5433 port_third=5434 port_secondary_2=5436 port_secondary_3=5437 port_subscriber=5434 port_subscriber2=5435 echo '==' echo '=Clean up=' echo '==' pg_ctl stop -D data_primary pg_ctl stop -D data_secondary pg_ctl stop -D data_third_ pg_ctl stop -D data_secondary_2 pg_ctl stop -D data_secondary_3 pg_ctl stop -D data_subscriber pg_ctl stop -D data_subscriber2 rm -rf data_* *log rm -rf /home/houzj/archivedir/* echo '===' echo '=Set up primary server=' echo '===' initdb -D data_primary cat << EOF >> data_primary/postgresql.conf wal_level = logical port = $port_primary #standby_slot_names = 'physical' #log_replication_commands = 'on' #max_slot_wal_keep_size = 64kB max_wal_senders=550 max_worker_processes=1000 max_replication_slots=550 log_replication_commands = 'on' checkpoint_timeout = 1d shared_buffers = 6GB max_worker_processes = 32 max_parallel_maintenance_workers = 24 max_parallel_workers = 32 synchronous_commit = on checkpoint_timeout = 1d max_wal_size = 24GB min_wal_size = 15GB autovacuum = off wal_sender_timeout = 6000s wal_receiver_timeout = 6000s #log_min_messages = 'debug2' #archive_mode = on #archive_command = 'cp %p /home/houzj/archivedir/%f' #restore_command = 'cp /home/houzj/archivedir/%f %p' max_prepared_transactions = 10 EOF pg_ctl -D data_primary start -w -l primary.log psql -d postgres -p $port_primary -c "SELECT * FROM pg_create_physical_replication_slot('physical');" echo '=' echo '=Set up secondary server=' echo '=' psql -d postgres -p $port_primary -c "CHECKPOINT;" pg_basebackup -D data_secondary -p $port_primary cat << EOF >> data_secondary/postgresql.conf port = $port_secondary primary_conninfo = 'port=$port_primary application_name=secondary dbname=postgres' #primary_conninfo = 'port=$port_primary application_name=secondary dbname=postgreis' primary_slot_name = 'physical' hot_standby = on hot_standby_feedback = on #sync_replication_slots = on #standby_slot_names = '' EOF cat << EOF >> data_secondary/standby.signal EOF pg_ctl -D data_secondary start -w psql -d postgres
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, May 5, 2025 at 2:59 AM Nisha Moond wrote: > > On Fri, May 2, 2025 at 3:05 PM shveta malik wrote: > > > > On Fri, May 2, 2025 at 12:57 PM Nisha Moond > > wrote: > > > > > > > > > Please find the v11 patch addressing the above points and all other > > > comments. I have also optimized the test by reducing the number of > > > subscriptions and slots. > > > > > > > Thanks for the patch. Few comments: > > > > 1) > > > > pg_upgrade/t/003_logical_slots.pl: > > > > - "SELECT slot_name, two_phase, failover FROM pg_replication_slots"); > > -is($result, qq(regress_sub|t|t), 'check the slot exists on new cluster'); > > + "SELECT slot_name, two_phase FROM pg_replication_slots"); > > +is($result, qq(regress_sub|t), 'check the slot exists on new cluster'); > > > > With this change, the failover property test during upgrade is > > removed. Shall we add it again either using ALTER SUB to enable > > failover and two_phase together or a new subscription with failover > > alone? > > > > If ALTER SUB is used to set failover for the existing subscription, > then pg_upgrade will fail while trying to create a slot with both > failover and two_phase on the upgraded node $newpub. > So, I've implemented the other suggestion by adding a separate > pub-subs to verify the failover property. > > > 2) > > + The default is false. This option cannot be set together with > > + failover when creating a logical replication > > slot. > > + However, once the slot is created with two_phase = > > true, > > + failover can be set to true after the slot has > > + consumed all transactions up to the point where two-phase > > decoding > > + was enabled. > > > > > > Suggestion: all transactions --> all the changes. > > > > Done. > ~~~ > > Please find the v12 patch with above suggested changes. Thank you for updating the patch! I'll review it and share comments. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, May 6, 2025 at 12:26 AM Masahiko Sawada wrote: > > On Mon, May 5, 2025 at 3:59 AM Amit Kapila wrote: > > > > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada > > wrote: > > > > > > While I cannot be entirely certain of my analysis, I believe the root > > > cause might be related to the backward movement of the confirmed_flush > > > LSN. The following scenario seems possible: > > > > > > 1. The walsender enables the two_phase and sets two_phase_at (which > > > should be the same as confirmed_flush). > > > 2. The slot's confirmed_flush regresses for some reason. > > > 3. The slotsync worker retrieves the remote slot information and > > > enables two_phase for the local slot. > > > > > > > Yes, this is possible. Here is my theory as to how it can happen in > > the current case. In the failed test, after the primary has prepared a > > transaction, the transaction won't be replicated to the subscriber as > > two_phase was not enabled for the slot. However, subsequent keepalive > > messages can send the latest WAL location to the subscriber and get > > the confirmation of the same from the subscriber without its origin > > being moved. Now, after we restart the apply worker (due to > > disable/enable for a subscription), it will use the previous > > origin_lsn to temporarily move back the confirmed flush LSN as > > explained in one of the previous emails in another thread [1]. During > > this temporary movement of confirm flush LSN, the slotsync worker > > fetches the two_phase_at and confirm_flush_lsn values, leading to the > > assertion failure. We see this issue intermittently because it depends > > on the timing of slotsync worker's request to fetch the slot's value. > > > > If this theory is correct, then we need something on the lines of what > > Vignesh proposed in email [2] (Confirm_flush_dont_allow_backward) to > > fix it. > > This theory seems plausible. According to the thread you shared, we > didn't address the issue of backward movement in the slot's > confirmed_flush LSN. I think that we should carefully consider whether > this behavior is acceptable in the first place. If we decide to > maintain this behavior, we would need to modify the slotsync to deal > with such scenarios (note we have already ensured that the synced > slot's confirmed_flush cannot regress). However, I'm concerned about a > case like where the server crashes immediately after persisting the > retreated confirmed_flush LSN (along with restart_lsn or xmin updates) > to disk. In such a case, wouldn't the walsender potentially send > duplicate data? > Yes that could happen but last time when we tried we couldn't reproduce such a case. However, we will try some more to see if we can reproduce a problem even without slotsync. Having said that, we will also try to reproduce the failure with slotsync in the test where BF is showing failure to confirm the above theory. > If so, we would need a patch like Vignesh proposed. > Right, I feel that is the right thing to do, even if we could prove that it can lead to failure in slotsync. -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, May 5, 2025 at 3:59 AM Amit Kapila wrote: > > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada wrote: > > > > While I cannot be entirely certain of my analysis, I believe the root > > cause might be related to the backward movement of the confirmed_flush > > LSN. The following scenario seems possible: > > > > 1. The walsender enables the two_phase and sets two_phase_at (which > > should be the same as confirmed_flush). > > 2. The slot's confirmed_flush regresses for some reason. > > 3. The slotsync worker retrieves the remote slot information and > > enables two_phase for the local slot. > > > > Yes, this is possible. Here is my theory as to how it can happen in > the current case. In the failed test, after the primary has prepared a > transaction, the transaction won't be replicated to the subscriber as > two_phase was not enabled for the slot. However, subsequent keepalive > messages can send the latest WAL location to the subscriber and get > the confirmation of the same from the subscriber without its origin > being moved. Now, after we restart the apply worker (due to > disable/enable for a subscription), it will use the previous > origin_lsn to temporarily move back the confirmed flush LSN as > explained in one of the previous emails in another thread [1]. During > this temporary movement of confirm flush LSN, the slotsync worker > fetches the two_phase_at and confirm_flush_lsn values, leading to the > assertion failure. We see this issue intermittently because it depends > on the timing of slotsync worker's request to fetch the slot's value. > > If this theory is correct, then we need something on the lines of what > Vignesh proposed in email [2] (Confirm_flush_dont_allow_backward) to > fix it. This theory seems plausible. According to the thread you shared, we didn't address the issue of backward movement in the slot's confirmed_flush LSN. I think that we should carefully consider whether this behavior is acceptable in the first place. If we decide to maintain this behavior, we would need to modify the slotsync to deal with such scenarios (note we have already ensured that the synced slot's confirmed_flush cannot regress). However, I'm concerned about a case like where the server crashes immediately after persisting the retreated confirmed_flush LSN (along with restart_lsn or xmin updates) to disk. In such a case, wouldn't the walsender potentially send duplicate data? If so, we would need a patch like Vignesh proposed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada wrote: > > While I cannot be entirely certain of my analysis, I believe the root > cause might be related to the backward movement of the confirmed_flush > LSN. The following scenario seems possible: > > 1. The walsender enables the two_phase and sets two_phase_at (which > should be the same as confirmed_flush). > 2. The slot's confirmed_flush regresses for some reason. > 3. The slotsync worker retrieves the remote slot information and > enables two_phase for the local slot. > Yes, this is possible. Here is my theory as to how it can happen in the current case. In the failed test, after the primary has prepared a transaction, the transaction won't be replicated to the subscriber as two_phase was not enabled for the slot. However, subsequent keepalive messages can send the latest WAL location to the subscriber and get the confirmation of the same from the subscriber without its origin being moved. Now, after we restart the apply worker (due to disable/enable for a subscription), it will use the previous origin_lsn to temporarily move back the confirmed flush LSN as explained in one of the previous emails in another thread [1]. During this temporary movement of confirm flush LSN, the slotsync worker fetches the two_phase_at and confirm_flush_lsn values, leading to the assertion failure. We see this issue intermittently because it depends on the timing of slotsync worker's request to fetch the slot's value. If this theory is correct, then we need something on the lines of what Vignesh proposed in email [2] (Confirm_flush_dont_allow_backward) to fix it. [1]: https://www.postgresql.org/message-id/CAA4eK1%2BzWQwOe5G8zCYGvErnaXh5%2BDbyg_A1Z3uywSf_4%3DT0UA%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CALDaNm3hgow2%2BoEov5jBk4iYP5eQrUCF1yZtW7%2BdV3J__p4KLQ%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, May 2, 2025 at 3:05 PM shveta malik wrote: > > On Fri, May 2, 2025 at 12:57 PM Nisha Moond wrote: > > > > > > Please find the v11 patch addressing the above points and all other > > comments. I have also optimized the test by reducing the number of > > subscriptions and slots. > > > > Thanks for the patch. Few comments: > > 1) > > pg_upgrade/t/003_logical_slots.pl: > > - "SELECT slot_name, two_phase, failover FROM pg_replication_slots"); > -is($result, qq(regress_sub|t|t), 'check the slot exists on new cluster'); > + "SELECT slot_name, two_phase FROM pg_replication_slots"); > +is($result, qq(regress_sub|t), 'check the slot exists on new cluster'); > > With this change, the failover property test during upgrade is > removed. Shall we add it again either using ALTER SUB to enable > failover and two_phase together or a new subscription with failover > alone? > If ALTER SUB is used to set failover for the existing subscription, then pg_upgrade will fail while trying to create a slot with both failover and two_phase on the upgraded node $newpub. So, I've implemented the other suggestion by adding a separate pub-subs to verify the failover property. > 2) > + The default is false. This option cannot be set together with > + failover when creating a logical replication > slot. > + However, once the slot is created with two_phase = > true, > + failover can be set to true after the slot has > + consumed all transactions up to the point where two-phase decoding > + was enabled. > > > Suggestion: all transactions --> all the changes. > Done. ~~~ Please find the v12 patch with above suggested changes. -- Thanks, Nisha From 3de761aa43578952543974af48178234ac37e88d Mon Sep 17 00:00:00 2001 From: Nisha Moond Date: Mon, 7 Apr 2025 09:36:29 +0530 Subject: [PATCH v12] PG17 Approach 3 Fix slot synchronization for two_phase enables slots. The issue is that the transactions prepared before two-phase decoding is enabled can fail to replicate to the subscriber after being committed on a promoted standby following a failover. This is because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without two_phase_at, the logical decoding might incorrectly identify prepared transaction as already replicated to the subscriber after promotion of standby server, causing them to be skipped. To prevent the risk of losing prepared transactions, we disallow enabling both failover and twophase during slot creation, but permits altering failover to true once ensured that slot's restart_lsn > two_phase_at. The fix enforces the following conditions: 1) Always disallow creating slots with two_phase=true and failover=true. 2) Always disallow creating subscriptions with (two_phase=true, failover=true). 3) Prevent altering the slot's failover to true if two_phase=true and restart_lsn is less than two_phase_at. Otherwise, allow changing failover to true. 4) Disallow altering slot's failover to true when two_phase state is 'pending'. User can try altering failover again when two_phase state is moved beyond 'pending'. --- contrib/test_decoding/expected/slot.out | 2 + contrib/test_decoding/sql/slot.sql| 1 + doc/src/sgml/func.sgml| 11 +- doc/src/sgml/protocol.sgml| 14 ++- doc/src/sgml/ref/alter_subscription.sgml | 8 ++ doc/src/sgml/ref/create_subscription.sgml | 9 ++ src/backend/commands/subscriptioncmds.c | 25 src/backend/replication/logical/logical.c | 11 ++ src/backend/replication/logical/slotsync.c| 10 ++ src/backend/replication/slot.c| 27 + src/bin/pg_upgrade/t/003_logical_slots.pl | 50 ++-- .../t/040_standby_failover_slots_sync.pl | 112 ++ src/test/regress/expected/subscription.out| 4 + src/test/regress/sql/subscription.sql | 4 + 14 files changed, 272 insertions(+), 16 deletions(-) diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 7de03c79f6f..87b28ad8d55 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -427,6 +427,8 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', ' SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); ERROR: cannot enable failover for a temporary replication slot +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); +ERROR: cannot enable both "failover" and "two_phase" options during replication slot creation SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); ?column? -- diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 580e3ae3bef..a89fe712f
Re: Fix slot synchronization with two_phase decoding enabled
On Sat, May 3, 2025 at 10:23 AM Tom Lane wrote: > > BF member mule just showed what seems an identical failure [1]: > > 2025-05-03 17:46:20.088 CEST [24308:3] LOG: database system is ready to > accept read-only connections > 2025-05-03 17:46:20.091 CEST [24321:1] LOG: slot sync worker started > 2025-05-03 17:46:20.100 CEST [24322:1] LOG: started streaming WAL from > primary at 0/600 on timeline 1 > 2025-05-03 17:46:20.117 CEST [24321:2] LOG: starting logical decoding for > slot "lsub1_slot" > 2025-05-03 17:46:20.117 CEST [24321:3] DETAIL: Streaming transactions > committing after 0/60049C8, reading WAL from 0/548. > 2025-05-03 17:46:20.117 CEST [24321:4] LOG: logical decoding found > consistent point at 0/548 > 2025-05-03 17:46:20.117 CEST [24321:5] DETAIL: There are no running > transactions. > TRAP: failed Assert("slot->data.two_phase_at <= slot->data.confirmed_flush"), > File: "slotsync.c", Line: 311, PID: 24321 > postgres: standby1: slotsync worker > (ExceptionalCondition+0x59)[0x556c3b8cb3e9] > postgres: standby1: slotsync worker (+0x4cb100)[0x556c3b706100] > postgres: standby1: slotsync worker (+0x4cba4c)[0x556c3b706a4c] > postgres: standby1: slotsync worker > (ReplSlotSyncWorkerMain+0x258)[0x556c3b707598] > postgres: standby1: slotsync worker > (postmaster_child_launch+0x102)[0x556c3b6d6962] > postgres: standby1: slotsync worker (+0x49daea)[0x556c3b6d8aea] > postgres: standby1: slotsync worker (+0x49fb3d)[0x556c3b6dab3d] > postgres: standby1: slotsync worker (PostmasterMain+0xd3f)[0x556c3b6dc04f] > postgres: standby1: slotsync worker (main+0x1ca)[0x556c3b3bee1a] > /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fe08da4824a] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fe08da48305] > postgres: standby1: slotsync worker (_start+0x21)[0x556c3b3bf421] > 2025-05-03 17:46:20.337 CEST [24308:4] LOG: slot sync worker process (PID > 24321) was terminated by signal 6: Aborted > While I cannot be entirely certain of my analysis, I believe the root cause might be related to the backward movement of the confirmed_flush LSN. The following scenario seems possible: 1. The walsender enables the two_phase and sets two_phase_at (which should be the same as confirmed_flush). 2. The slot's confirmed_flush regresses for some reason. 3. The slotsync worker retrieves the remote slot information and enables two_phase for the local slot. If I recall correctly, we previously discussed[1] how a slot's confirmed_flush LSN could move backward depending on the feedback messages received from the downstream system. Based on the following primary server's logs, it appears possible that the subscriber sent feedback LSNs older than the slot's confirmed_flush, consequently causing the confirmed_flush to move backward. However, I should note that this is currently a hypothesis, as I haven't yet been able to reproduce and verify this scenario. Regards, [1] https://www.postgresql.org/message-id/85fff40e-148b-4e86-b921-b4b846289132%40vondra.me -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
BF member mule just showed what seems an identical failure [1]: 2025-05-03 17:46:20.088 CEST [24308:3] LOG: database system is ready to accept read-only connections 2025-05-03 17:46:20.091 CEST [24321:1] LOG: slot sync worker started 2025-05-03 17:46:20.100 CEST [24322:1] LOG: started streaming WAL from primary at 0/600 on timeline 1 2025-05-03 17:46:20.117 CEST [24321:2] LOG: starting logical decoding for slot "lsub1_slot" 2025-05-03 17:46:20.117 CEST [24321:3] DETAIL: Streaming transactions committing after 0/60049C8, reading WAL from 0/548. 2025-05-03 17:46:20.117 CEST [24321:4] LOG: logical decoding found consistent point at 0/548 2025-05-03 17:46:20.117 CEST [24321:5] DETAIL: There are no running transactions. TRAP: failed Assert("slot->data.two_phase_at <= slot->data.confirmed_flush"), File: "slotsync.c", Line: 311, PID: 24321 postgres: standby1: slotsync worker (ExceptionalCondition+0x59)[0x556c3b8cb3e9] postgres: standby1: slotsync worker (+0x4cb100)[0x556c3b706100] postgres: standby1: slotsync worker (+0x4cba4c)[0x556c3b706a4c] postgres: standby1: slotsync worker (ReplSlotSyncWorkerMain+0x258)[0x556c3b707598] postgres: standby1: slotsync worker (postmaster_child_launch+0x102)[0x556c3b6d6962] postgres: standby1: slotsync worker (+0x49daea)[0x556c3b6d8aea] postgres: standby1: slotsync worker (+0x49fb3d)[0x556c3b6dab3d] postgres: standby1: slotsync worker (PostmasterMain+0xd3f)[0x556c3b6dc04f] postgres: standby1: slotsync worker (main+0x1ca)[0x556c3b3bee1a] /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fe08da4824a] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fe08da48305] postgres: standby1: slotsync worker (_start+0x21)[0x556c3b3bf421] 2025-05-03 17:46:20.337 CEST [24308:4] LOG: slot sync worker process (PID 24321) was terminated by signal 6: Aborted regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2025-05-03%2015%3A30%3A02
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, May 2, 2025 at 12:57 PM Nisha Moond wrote: > > > Please find the v11 patch addressing the above points and all other > comments. I have also optimized the test by reducing the number of > subscriptions and slots. > Thanks for the patch. Few comments: 1) pg_upgrade/t/003_logical_slots.pl: - "SELECT slot_name, two_phase, failover FROM pg_replication_slots"); -is($result, qq(regress_sub|t|t), 'check the slot exists on new cluster'); + "SELECT slot_name, two_phase FROM pg_replication_slots"); +is($result, qq(regress_sub|t), 'check the slot exists on new cluster'); With this change, the failover property test during upgrade is removed. Shall we add it again either using ALTER SUB to enable failover and two_phase together or a new subscription with failover alone? 2) + The default is false. This option cannot be set together with + failover when creating a logical replication slot. + However, once the slot is created with two_phase = true, + failover can be set to true after the slot has + consumed all transactions up to the point where two-phase decoding + was enabled. Suggestion: all transactions --> all the changes. thanks Shveta
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 30, 2025 at 5:45 AM Masahiko Sawada wrote: > > On Tue, Apr 29, 2025 at 5:00 AM Nisha Moond wrote: > > Thank you for updating the patch! Here are some comments on v10. > Thanks for reviewing the patch! > --- > + > +# Also wait for two-phase to be enabled > +$subscriber1->poll_query_until( > + 'postgres', qq[ > + SELECT count(1) = 0 FROM pg_subscription WHERE subname = > 'regress_mysub2' and subtwophasestate NOT IN ('e');] > +) or die "Timed out while waiting for subscriber to enable twophase"; > + > +# Try to enable the failover for the subscription, should give error > +($result, $stdout, $stderr) = $subscriber1->psql( > + 'postgres', > + "ALTER SUBSCRIPTION regress_mysub2 DISABLE; > +ALTER SUBSCRIPTION regress_mysub2 SET (failover = true);"); > +ok( $stderr =~ > + qr/ERROR: could not alter replication slot "lsub2_slot": ERROR: > cannot enable failover for a two-phase enabled replication slot/, > + "failover can't be enabled if restart_lsn < two_phase_at on a > two_phase subscription." > +); > > I think it's possible between two tests that the walsender consumes > some changes (e.g. generated by autovacuums) then the second check > fails (i.e., ALTER SUBSCRIPTION SET (failover = ture) succeeds). > Yes, this is a possibility. To account for this negative test case where restart_lsn < two_phase_at, I updated the test to attempt altering a two_phase-enabled slot without any associated subscription. > --- > +# Test that SQL API disallows slot creation with both two_phase and > failover enabled > +($result, $stdout, $stderr) = $publisher->psql('postgres', > + "SELECT pg_create_logical_replication_slot('regress_mysub3', > 'pgoutput', false, true, true);" > +); > +ok( $stderr =~ > + /ERROR: cannot enable both "failover" and "two_phase" options > during replication slot creation/, > + "cannot create slot with both two_phase and failover enabled"); > > Isn't this test already covered by test_decoding/sql/slot.sql? > Yes, it is covered in slot.sql, hence removed from here. > I've attached a patch that contains comment changes I mentioned above. > Please incorporate it if you agree with them. > I have incorporated the suggested changes and updated a couple more places accordingly. ~~~ Please find the v11 patch addressing the above points and all other comments. I have also optimized the test by reducing the number of subscriptions and slots. -- Thanks, Nisha v11-0001-PG17-Approach-3-Fix-slot-synchronization-for-two.patch Description: Binary data
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 30, 2025 at 2:38 AM Zhijie Hou (Fujitsu) wrote: > > On Tue, Apr 29, 2025 at 6:57 PM Amit Kapila wrote: > > > > On Tue, Apr 29, 2025 at 3:01 AM Masahiko Sawada > > wrote: > > > > > > Your fix looks good to me. Is it worth considering putting an > > > assertion to verify if new two_phase_at is equal to or greater than > > > confirmed_lsn (or at least it doesn't go backward), when syncing > > > two_phase_at? > > > > > > > Yeah, it makes sense. But the condition should be reverse (two_phase_at > > should be less than or equal to confirmed_flush). I have done that, changed > > a > > few comments, and committed the patch. > > I noticed a BF failure[1] related to the fix, where the recently added > assertion > Assert(slot->data.two_phase_at <= slot->data.confirmed_flush) was broken. > After > examining the logs and code, I couldn't identify any functionality issues. Yeah, that's weird. > AFAICS, the last slotsync cycle should have retrieved the latest > confirmed_flush and > two_phase_at from the source slot, both expected to be 0/6005150. Here are > some details: > > The standby's log[1] shows the source slot's two_phase_at as 0/6005150. > Meanwhile, the publisher's log reveals that the source slot's confirmed_flush > was already 0/6005150 before the last slotsync. Therefore, the last slotsync > cycle should have fetched confirmed_flush: 0/6005150, two_phase_at: 0/6005150. > > If we assume remote_slot->confirmed_lsn during the last sync cycle is > 0/6005150, then either the local slot's confirmed_lsn had already been updated > to this value in a prior sync, leading it to skip the update in the last cycle > (in which case, the assertion shouldn't be broken), or it should enter the > slot > update branch to set the synced slot to 0/6005150 (see > update_local_synced_slot() for details). Thus, theoretically, the assertion > wouldn't fail. Agreed with your analysis. After enabling the subscription with two_phase=true, the primary server has the following logs that indicate that logical decoding started from 0/6005150, not 0/6004FC8. Given that the slot's two_phase was enabled at this time, the slot's confirmed_flush and two_phase_at were 0/6005150. 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: received replication command: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: acquired logical replication slot "lsub1_slot" 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: 0/6004FC8 has been already streamed, forwarding to 0/6005150 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] LOG: starting logical decoding for slot "lsub1_slot" 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] DETAIL: Streaming transactions committing after 0/6005150, reading WAL from 0/6004A00. 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') > > Beyond functionality problems, another possibility might be the CPU reordered > memory access, causing the assertion to execute before updating > confirmed_flush. Not sure CPU reordered memory access could matter in this case but I don't have other ideas of possible causes. > However, we lack the information needed to verify this, so one > idea is to add some DEBUG message in update_local_synced_slot() to facilitate > tracking if the issue recurs. That would be a good idea. Also, it's unrelated to this issue but probably we are better off doing this assertion check only when enabling two_phase. The assertion is currently checked even when disabling the two_phase, which seems not to make sense, and we don't clear two_phase_at value when disabling two_phase by ReplicationSlotAlter(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
RE: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 29, 2025 at 6:57 PM Amit Kapila wrote: > > On Tue, Apr 29, 2025 at 3:01 AM Masahiko Sawada > wrote: > > > > Your fix looks good to me. Is it worth considering putting an > > assertion to verify if new two_phase_at is equal to or greater than > > confirmed_lsn (or at least it doesn't go backward), when syncing > > two_phase_at? > > > > Yeah, it makes sense. But the condition should be reverse (two_phase_at > should be less than or equal to confirmed_flush). I have done that, changed a > few comments, and committed the patch. I noticed a BF failure[1] related to the fix, where the recently added assertion Assert(slot->data.two_phase_at <= slot->data.confirmed_flush) was broken. After examining the logs and code, I couldn't identify any functionality issues. Here is my analysis: AFAICS, the last slotsync cycle should have retrieved the latest confirmed_flush and two_phase_at from the source slot, both expected to be 0/6005150. Here are some details: The standby's log[1] shows the source slot's two_phase_at as 0/6005150. Meanwhile, the publisher's log reveals that the source slot's confirmed_flush was already 0/6005150 before the last slotsync. Therefore, the last slotsync cycle should have fetched confirmed_flush: 0/6005150, two_phase_at: 0/6005150. If we assume remote_slot->confirmed_lsn during the last sync cycle is 0/6005150, then either the local slot's confirmed_lsn had already been updated to this value in a prior sync, leading it to skip the update in the last cycle (in which case, the assertion shouldn't be broken), or it should enter the slot update branch to set the synced slot to 0/6005150 (see update_local_synced_slot() for details). Thus, theoretically, the assertion wouldn't fail. Beyond functionality problems, another possibility might be the CPU reordered memory access, causing the assertion to execute before updating confirmed_flush. However, we lack the information needed to verify this, so one idea is to add some DEBUG message in update_local_synced_slot() to facilitate tracking if the issue recurs. I'll continue to investigate, but any suggestions or insights would be greatly appreciated. [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=calliphoridae&dt=2025-04-29%2009%3A11%3A31&stg=recovery-check [2] 2025-04-29 09:18:05.641 UTC [1023004][client backend][0/2:0] LOG: statement: SELECT two_phase AND '0/6005150' = two_phase_at from pg_replication_slots WHERE slot_name = 'lsub1_slot' AND synced AND NOT temporary; [3] 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: 0/6004FC8 has been already streamed, forwarding to 0/6005150 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.489 UTC [1022911][client backend][:0] LOG: disconnection: session time: 0:00:00.005 user=bf database=postgres host=[local] 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] LOG: starting logical decoding for slot "lsub1_slot" 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] DETAIL: Streaming transactions committing after 0/6005150, reading WAL from 0/6004A00. 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8 (proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.518 UTC [1022687][client backend][9/19:0] LOG: statement: SELECT slot_name, plugin, confirmed_flush_lsn, restart_lsn, catalog_xmin, two_phase, two_phase_at, failover, database, invalidation_reason FROM pg_catalog.pg_replication_slots WHERE failover and NOT temporary 2025-04-29 09:18:05.524 UTC [1022928][not initialized][:0] LOG: connection received: host=[local] Best Regards, Hou zj
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 29, 2025 at 5:00 AM Nisha Moond wrote: > > On Tue, Apr 29, 2025 at 2:51 PM shveta malik wrote: > > > > On Mon, Apr 28, 2025 at 4:33 PM Nisha Moond > > wrote: > > > > > > Please find the v9 patch, addressing the above and all other comments as > > > well. > > > > > > > Thanks for the patch. > > > > 1) > > > > + The default is false. This option cannot be set together with > > + two_phase when creating the slot. However, > > once > > + the slot is created with two_phase=true, the > > + failover can be set to true after the > > + restart_lsn has advanced beyond its > > + two_phase_at value > > > > As the user is not aware of two_phase_at, we shall change all the docs > > to have a more user friendly explanation. > > > > Done. > > > 2) > > # Confirm that the invalidated slot has been dropped. > > $standby1->wait_for_log( > > - qr/dropped replication slot "lsub1_slot" of database with OID > > [0-9]+/, $log_offset); > > + qr/dropped replication slot "lsub1_slot" of database with OID [0-9]+/, > > + $log_offset); > > > > I think this change is not needed. > > > > The perltidy is auto-correcting it now, though it didn't for earlier > patches. I've kept the original for now. > ~~~ > > Thanks for review, please find the updated patch v10. Thank you for updating the patch! Here are some comments on v10. We refer to the following comment in ReplicationSlotCreate() from many places: /* * Do not allow users to enable both failover and two_phase for slots. * * This is because the two_phase_at field of a slot, which tracks the * LSN, from which two-phase decoding starts, is not synchronized to * standby servers. Without two_phase_at, the logical decoding might * incorrectly identify prepared transaction as already replicated to * the subscriber after promotion of standby server, causing them to * be skipped. * * However, both failover and two_phase enabled slots can be created * during slot synchronization because we need to retain the same * values as the remote slot. */ I think this is an important implementation restriction, so how about describing the overall idea of combination of two_phase and failover in the comment atop slotsync.c or worker.c? It would be easier to find. --- + if (failover && MyReplicationSlot->data.two_phase && + MyReplicationSlot->data.restart_lsn < MyReplicationSlot->data.two_phase_at) + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot enable failover for a two-phase enabled replication slot"), + errdetail("The slot need to consume change upto %X/%X to enable failover.", + LSN_FORMAT_ARGS(MyReplicationSlot->data.two_phase_at))); I think it's better to mention the reason in the error message why we cannot enable two-phase in this situation. How about: - errmsg("cannot enable failover for a two-phase enabled replication slot"), + errmsg("cannot enable failover for a two-phase enabled replication slot due to unconsumed changes"), --- + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable both \"%s\" and \"%s\" options at CREATE SUBSCRIPTION", + "failover", "two_phase"), + errhint("The \"%s\" option can be enabled after \"%s\" state is ready using ALTER SUBSCRIPTION.", + "failover", "two_phase")); How about rewriting the errhint message to: - errhint("The \"%s\" option can be enabled after \"%s\" state is ready using ALTER SUBSCRIPTION.", - "failover", "two_phase")); + errhint("Use ALTER SUBSCRIPTION ... SET (failover = true) to enable \"%s\" after two_phase state is ready", + "failover")); --- @@ -29073,9 +29073,13 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset failover, when set to true, specifies that this slot is enabled to be synced to the standbys so that logical replication can be resumed after -failover. A call to this function has the same effect as -the replication protocol command -CREATE_REPLICATION_SLOT ... LOGICAL. +failover. The parameters twophase and +failover cannot be enabled together when +creating a replication slot. However, a slot created with twophase +enabled can later have failover set to true via +ALTER SUBSCRIPTION, if a subscription is using this +slot. A call to this function has the same effect as the replication +protocol command CREATE_REPLICATION_SLOT ... LOGICAL. Can we also mention ALTER_REPLICATION to enable the failover for slots that are not used by subscriptions? --- +# Also wait for two-phase to be enabled +$subscriber1->poll_query_until( + 'postgres', qq[ + SELECT count(1) = 0 FROM pg_subscript
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 29, 2025 at 2:51 PM shveta malik wrote: > > On Mon, Apr 28, 2025 at 4:33 PM Nisha Moond wrote: > > > > Please find the v9 patch, addressing the above and all other comments as > > well. > > > > Thanks for the patch. > > 1) > > + The default is false. This option cannot be set together with > + two_phase when creating the slot. However, once > + the slot is created with two_phase=true, the > + failover can be set to true after the > + restart_lsn has advanced beyond its > + two_phase_at value > > As the user is not aware of two_phase_at, we shall change all the docs > to have a more user friendly explanation. > Done. > 2) > # Confirm that the invalidated slot has been dropped. > $standby1->wait_for_log( > - qr/dropped replication slot "lsub1_slot" of database with OID > [0-9]+/, $log_offset); > + qr/dropped replication slot "lsub1_slot" of database with OID [0-9]+/, > + $log_offset); > > I think this change is not needed. > The perltidy is auto-correcting it now, though it didn't for earlier patches. I've kept the original for now. ~~~ Thanks for review, please find the updated patch v10. -- Thanks, Nisha From 7aa643c49411f9b855ee01aacb28c6f646d6dc1f Mon Sep 17 00:00:00 2001 From: Nisha Moond Date: Mon, 7 Apr 2025 09:36:29 +0530 Subject: [PATCH v10] PG17 Approach 3 Fix slot synchronization for two_phase enables slots. The issue is that the transactions prepared before two-phase decoding is enabled can fail to replicate to the subscriber after being committed on a promoted standby following a failover. This is because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without two_phase_at, the logical decoding might incorrectly identify prepared transaction as already replicated to the subscriber after promotion of standby server, causing them to be skipped. To prevent the risk of losing prepared transactions, we disallow enabling both failover and twophase during slot creation, but permits altering failover to true once ensured that slot's restart_lsn > two_phase_at. The fix enforces the following conditions: 1) Always disallow creating slots with two_phase=true and failover=true. 2) Always disallow creating subscriptions with (two_phase=true, failover=true). 3) Prevent altering the slot's failover to true if two_phase=true and restart_lsn is less than two_phase_at. Otherwise, allow changing failover to true. 4) Disallow altering slot's failover to true when two_phase state is 'pending'. User can try altering failover again when two_phase state is moved beyond 'pending'. --- contrib/test_decoding/expected/slot.out | 2 + contrib/test_decoding/sql/slot.sql| 1 + doc/src/sgml/func.sgml| 10 +- doc/src/sgml/protocol.sgml| 14 +- doc/src/sgml/ref/alter_subscription.sgml | 8 ++ doc/src/sgml/ref/create_subscription.sgml | 9 ++ src/backend/commands/subscriptioncmds.c | 43 +++ src/backend/replication/logical/logical.c | 12 ++ src/backend/replication/slot.c| 37 ++ src/bin/pg_upgrade/t/003_logical_slots.pl | 6 +- .../t/040_standby_failover_slots_sync.pl | 121 ++ src/test/regress/expected/subscription.out| 4 + src/test/regress/sql/subscription.sql | 4 + 13 files changed, 263 insertions(+), 8 deletions(-) diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 7de03c79f6f..87b28ad8d55 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -427,6 +427,8 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', ' SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); ERROR: cannot enable failover for a temporary replication slot +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); +ERROR: cannot enable both "failover" and "two_phase" options during replication slot creation SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); ?column? -- diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 580e3ae3bef..a89fe712ff6 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib/test_decoding/sql/slot.sql @@ -182,6 +182,7 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_slot', 'tes SELECT 'init' FROM pg_create_logical_replication_slot('failover_false_slot', 'test_decoding', false, false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', 'test_decoding', false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); +SELECT 'init' FROM pg_cr
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 29, 2025 at 3:01 AM Masahiko Sawada wrote: > > Your fix looks good to me. Is it worth considering putting an > assertion to verify if new two_phase_at is equal to or greater than > confirmed_lsn (or at least it doesn't go backward), when syncing > two_phase_at? > Yeah, it makes sense. But the condition should be reverse (two_phase_at should be less than or equal to confirmed_flush). I have done that, changed a few comments, and committed the patch. -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Apr 28, 2025 at 4:33 PM Nisha Moond wrote: > > Please find the v9 patch, addressing the above and all other comments as well. > Thanks for the patch. 1) + The default is false. This option cannot be set together with + two_phase when creating the slot. However, once + the slot is created with two_phase=true, the + failover can be set to true after the + restart_lsn has advanced beyond its + two_phase_at value As the user is not aware of two_phase_at, we shall change all the docs to have a more user friendly explanation. 2) # Confirm that the invalidated slot has been dropped. $standby1->wait_for_log( - qr/dropped replication slot "lsub1_slot" of database with OID [0-9]+/, $log_offset); + qr/dropped replication slot "lsub1_slot" of database with OID [0-9]+/, + $log_offset); I think this change is not needed. thanks Shveta
RE: Fix slot synchronization with two_phase decoding enabled
On Mon, Apr 28, 2025 at 7:33 PM Zhijie Hou (Fujitsu) wrote: > > Thanks for reviewing. Here is V3 patch that addressed it. > > BTW, I also did some tests to confirm the catalog_xmin could still be > ahead in some case, and here is an example: > > 1. Create a failover replication slot named 'logicalslot' on primary > and acquire it in the walsender. > > 2. Log two standby snapshots on primary. Before logging, call > txid_current() To assign a xid, so that each standby snapshot will > hold a new xid in its oldestrunningXid field: >- txid_current(); >- `0/3000420` - `running_xacts` (no running transactions, > oldestrunningXid = 755) >- txid_current(); >- `0/3000488` - `running_xacts` (no running transactions, > oldestrunningXid = 756) > > 3. The walsender sets `0/3000420` as the `candidate_restart_lsn`, 755 as >`candidate_catalog_xmin`, skipping the second `running_xacts` because >`candidate_restart_lsn`/`candidate_catalog_xmin` is already valid. > > 4. After receiving a reply from the apply worker, the walsender assigns >`0/3000420` to `restart_lsn`, `755` to `catalog_xmin`. At this point, the >replication slot 'logicalslot' has `restart_lsn` set to `0/3000420`, >`catalog_xmin` set to `755`. > > 5. On the standby, execute `pg_sync_replication_slots()` to synchronize >'logicalslot'. > > 6. During synchronization, with the initial `restart_lsn` at `0/3000420`, the >sync slot reaches a consistent point at this position. As a result, it does >not update `candidate_restart_lsn` and `candidate_catalog_xmin` at >consistent point (refer to `SnapBuildProcessRunningXacts()`). > > 7. The sync process identifies the second standby snapshot at > `0/3000488` and >uses its LSN as `candidate_restart_lsn`, and use the > oldestrunningXid `756` >as `candidate_catalog_xmin`, eventually updating it to `restart_lsn` and >`catalog_xmin`. > > 8. Now, the synced slot holds `restart_lsn` at `0/3000488`, `catalog_xmin` at >`756`, which are all ahead of the remote slot on the primary server. > > Attaching a script to reproduce the same. > > Note that, to reproduce this stably, we'd better modify the value of > LOG_SNAPSHOT_INTERVAL_MS in bgwriter.c to a bigger number to avoid > unexpected xl_running_xacts logging. In addition to above steps, for those interested in reproducing the specific scenario where two_phase_at advances past the synced confirmed_flush, I'm attaching a new script. This script can reproduce the issue after applying the injection points I provided. Best Regards, Hou zj From a4a314b7041779d6e5725a9d66b0135715d393f4 Mon Sep 17 00:00:00 2001 From: Zhijie Hou Date: Tue, 29 Apr 2025 13:56:25 +0800 Subject: [PATCH] injection point --- src/backend/replication/logical/logical.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c index a8d2e024d34..afe194fb398 100644 --- a/src/backend/replication/logical/logical.c +++ b/src/backend/replication/logical/logical.c @@ -41,6 +41,7 @@ #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" +#include "utils/injection_point.h" #include "utils/inval.h" #include "utils/memutils.h" @@ -603,6 +604,8 @@ CreateDecodingContext(XLogRecPtr start_lsn, SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); } + INJECTION_POINT("initialized-two-phase-at"); + ctx->reorder->output_rewrites = ctx->options.receive_rewrites; ereport(LOG, -- 2.30.0.windows.2 #!/bin/bash port_primary=5432 port_secondary=5433 port_third=5434 port_secondary_2=5436 port_secondary_3=5437 port_subscriber=5434 port_subscriber2=5435 echo '==' echo '=Clean up=' echo '==' pg_ctl stop -D data_primary pg_ctl stop -D data_secondary pg_ctl stop -D data_third_ pg_ctl stop -D data_secondary_2 pg_ctl stop -D data_secondary_3 pg_ctl stop -D data_subscriber pg_ctl stop -D data_subscriber2 rm -rf data_* *log rm -rf /home/houzj/archivedir/* echo '===' echo '=Set up primary server=' echo '===' initdb -D data_primary cat << EOF >> data_primary/postgresql.conf wal_level = logical port = $port_primary #standby_slot_names = 'physical' #log_replication_commands = 'on' #max_slot_wal_keep_size = 64kB max_wal_senders=550 max_worker_processes=1000 max_replication_slots=550 log_replication_commands = 'on' checkpoint_timeout = 1d shared_buffers = 6GB max_worker_processes = 32 max_parallel_maintenance_workers = 24 max_parallel_workers = 32 synchronous_commit = on checkpoint_timeout = 1d max_wal_size = 24GB min_wal_size = 15GB autovacuum = off wal_sender_timeout = 6000s wal_receiver_timeout = 6000s #log_min_messages = 'debug2' #archive_mode = on #archive_command = 'cp %p /home/houzj/archivedir/%f' #restore_command = 'cp /home/houzj/archivedir/%f %p' max_prepared_transactions = 10 EOF pg_ctl -D data_
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Apr 28, 2025 at 11:54 PM Masahiko Sawada wrote: > > On Sat, Apr 26, 2025 at 5:07 AM Amit Kapila wrote: > > > > On Fri, Apr 25, 2025 at 9:57 PM Masahiko Sawada > > wrote: > > > > > > On Fri, Apr 25, 2025 at 3:43 AM Amit Kapila > > > wrote: > > > > > > > > The second can mislead the user > > > > for a long period in cases where prepare and commit have a large time > > > > gap. I feel this will introduce complexity either in the form of code > > > > or in giving the information to the user. > > > > > > Agreed. Both ways introduce complexity so we need to consider the > > > user-unfriendliness (by not having a proper way to enable failover for > > > the two_phase-enabled-slot using SQL API) vs. risk (of introducing > > > complexity). > > > > > > > Right, to me it sounds risky to provide such functionality for SQL API > > in the back branch. > > So do you think it's okay to leave it as a restriction (i.e. there is > no easy way to enable failover for a two_phase-enabled logical slot > created by SQL API)? or do you have any better idea for that? > I don't have any better ideas. I think it is better to leave it as a restriction. -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Apr 28, 2025 at 4:33 AM Zhijie Hou (Fujitsu) wrote: > > On Mon, Apr 28, 2025 at 5:33 PM shveta malik wrote: > > > > On Mon, Apr 28, 2025 at 2:27 PM Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Fri, Apr 18, 2025 at 12:29 PM Amit Kapila wrote: > > > > > > > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) > > > > > > > > > > > > > > - > > > > > Fix > > > > > - > > > > > > > > > > I think we should keep the confirmed_flush even if the previous > > > > > synced restart_lsn/catalog_xmin is newer. Attachments include a > > > > > patch for the > > > > same. > > > > > > > > > > > > > This will fix the case we are facing but adds a new rule for slot > > synchronization. > > > > Can we think of a simpler way to fix this by avoiding updating other > > > > slot fields (like two_phase, two_phase_at) if restart_lsn or > > > > catalog_xmin of the local slot is ahead of the remote slot? > > > > > > Since the fix for xmin advancement during fast-forward decoding has > > > been pushed (commit aaf9e95), I'm attaching the V2 patch to address > > > the assert failure by avoiding updating other slot fields (like > > > two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local slot > > > is > > ahead. > > > > > > > The patch looks good to me. We can have minor comment tweak: > > > > + * Syncing only two_phase_at, without also syncing the latest > > + * confirmed_lsn, might lead to transactions between the old > > + * confirmed_lsn and two_phase_at being unexpectedly decoded and sent > > + * to the subscriber. > > > > We can append "following a promotion". > > Thanks for reviewing. Here is V3 patch that addressed it. > > BTW, I also did some tests to confirm the catalog_xmin could > still be ahead in some case, and here is an example: > > 1. Create a failover replication slot named 'logicalslot' on primary > and acquire it in the walsender. > > 2. Log two standby snapshots on primary. Before logging, call txid_current() > To assign a xid, so that each standby snapshot will hold a new xid in its > oldestrunningXid field: >- txid_current(); >- `0/3000420` - `running_xacts` (no running transactions, oldestrunningXid > = 755) >- txid_current(); >- `0/3000488` - `running_xacts` (no running transactions, oldestrunningXid > = 756) > > 3. The walsender sets `0/3000420` as the `candidate_restart_lsn`, 755 as >`candidate_catalog_xmin`, skipping the second `running_xacts` because >`candidate_restart_lsn`/`candidate_catalog_xmin` is already valid. > > 4. After receiving a reply from the apply worker, the walsender assigns >`0/3000420` to `restart_lsn`, `755` to `catalog_xmin`. At this point, the >replication slot 'logicalslot' has `restart_lsn` set to `0/3000420`, >`catalog_xmin` set to `755`. > > 5. On the standby, execute `pg_sync_replication_slots()` to synchronize >'logicalslot'. > > 6. During synchronization, with the initial `restart_lsn` at `0/3000420`, the >sync slot reaches a consistent point at this position. As a result, it does >not update `candidate_restart_lsn` and `candidate_catalog_xmin` at >consistent point (refer to `SnapBuildProcessRunningXacts()`). > > 7. The sync process identifies the second standby snapshot at `0/3000488` and >uses its LSN as `candidate_restart_lsn`, and use the oldestrunningXid `756` >as `candidate_catalog_xmin`, eventually updating it to `restart_lsn` and >`catalog_xmin`. > > 8. Now, the synced slot holds `restart_lsn` at `0/3000488`, `catalog_xmin` at >`756`, which are all ahead of the remote slot on the primary server. > > Attaching a script to reproduce the same. > > Note that, to reproduce this stably, we'd better modify the value of > LOG_SNAPSHOT_INTERVAL_MS in bgwriter.c to a bigger number to avoid > unexpected xl_running_xacts logging. Thank you for sharing the patch and reproducible steps. I agree that this issue still happens on HEAD and should be fixed. If I understand the problem correctly, we need to avoid setting the values such that two_phase_at > confirmed_lsn. Because otherwise if a transaction is prepared between them, we end up decoding the prepared transaction twice or with an assertion failure if it's enabled. It never happens on the primary since we always set the same value to two_phase_at as confirmed_lsn. Your fix looks good to me. Is it worth considering putting an assertion to verify if new two_phase_at is equal to or greater than confirmed_lsn (or at least it doesn't go backward), when syncing two_phase_at? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Sat, Apr 26, 2025 at 5:07 AM Amit Kapila wrote: > > On Fri, Apr 25, 2025 at 9:57 PM Masahiko Sawada wrote: > > > > On Fri, Apr 25, 2025 at 3:43 AM Amit Kapila wrote: > > > > > > The second can mislead the user > > > for a long period in cases where prepare and commit have a large time > > > gap. I feel this will introduce complexity either in the form of code > > > or in giving the information to the user. > > > > Agreed. Both ways introduce complexity so we need to consider the > > user-unfriendliness (by not having a proper way to enable failover for > > the two_phase-enabled-slot using SQL API) vs. risk (of introducing > > complexity). > > > > Right, to me it sounds risky to provide such functionality for SQL API > in the back branch. So do you think it's okay to leave it as a restriction (i.e. there is no easy way to enable failover for a two_phase-enabled logical slot created by SQL API)? or do you have any better idea for that? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
RE: Fix slot synchronization with two_phase decoding enabled
On Mon, Apr 28, 2025 at 5:33 PM shveta malik wrote: > > On Mon, Apr 28, 2025 at 2:27 PM Zhijie Hou (Fujitsu) > wrote: > > > > On Fri, Apr 18, 2025 at 12:29 PM Amit Kapila wrote: > > > > > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) > > > > > > > > > > > - > > > > Fix > > > > - > > > > > > > > I think we should keep the confirmed_flush even if the previous > > > > synced restart_lsn/catalog_xmin is newer. Attachments include a > > > > patch for the > > > same. > > > > > > > > > > This will fix the case we are facing but adds a new rule for slot > synchronization. > > > Can we think of a simpler way to fix this by avoiding updating other > > > slot fields (like two_phase, two_phase_at) if restart_lsn or > > > catalog_xmin of the local slot is ahead of the remote slot? > > > > Since the fix for xmin advancement during fast-forward decoding has > > been pushed (commit aaf9e95), I'm attaching the V2 patch to address > > the assert failure by avoiding updating other slot fields (like > > two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local slot is > ahead. > > > > The patch looks good to me. We can have minor comment tweak: > > + * Syncing only two_phase_at, without also syncing the latest > + * confirmed_lsn, might lead to transactions between the old > + * confirmed_lsn and two_phase_at being unexpectedly decoded and sent > + * to the subscriber. > > We can append "following a promotion". Thanks for reviewing. Here is V3 patch that addressed it. BTW, I also did some tests to confirm the catalog_xmin could still be ahead in some case, and here is an example: 1. Create a failover replication slot named 'logicalslot' on primary and acquire it in the walsender. 2. Log two standby snapshots on primary. Before logging, call txid_current() To assign a xid, so that each standby snapshot will hold a new xid in its oldestrunningXid field: - txid_current(); - `0/3000420` - `running_xacts` (no running transactions, oldestrunningXid = 755) - txid_current(); - `0/3000488` - `running_xacts` (no running transactions, oldestrunningXid = 756) 3. The walsender sets `0/3000420` as the `candidate_restart_lsn`, 755 as `candidate_catalog_xmin`, skipping the second `running_xacts` because `candidate_restart_lsn`/`candidate_catalog_xmin` is already valid. 4. After receiving a reply from the apply worker, the walsender assigns `0/3000420` to `restart_lsn`, `755` to `catalog_xmin`. At this point, the replication slot 'logicalslot' has `restart_lsn` set to `0/3000420`, `catalog_xmin` set to `755`. 5. On the standby, execute `pg_sync_replication_slots()` to synchronize 'logicalslot'. 6. During synchronization, with the initial `restart_lsn` at `0/3000420`, the sync slot reaches a consistent point at this position. As a result, it does not update `candidate_restart_lsn` and `candidate_catalog_xmin` at consistent point (refer to `SnapBuildProcessRunningXacts()`). 7. The sync process identifies the second standby snapshot at `0/3000488` and uses its LSN as `candidate_restart_lsn`, and use the oldestrunningXid `756` as `candidate_catalog_xmin`, eventually updating it to `restart_lsn` and `catalog_xmin`. 8. Now, the synced slot holds `restart_lsn` at `0/3000488`, `catalog_xmin` at `756`, which are all ahead of the remote slot on the primary server. Attaching a script to reproduce the same. Note that, to reproduce this stably, we'd better modify the value of LOG_SNAPSHOT_INTERVAL_MS in bgwriter.c to a bigger number to avoid unexpected xl_running_xacts logging. Best Regards, Hou zj advance_catalog_xmin.sh Description: advance_catalog_xmin.sh v3-0001-Fix-assertion-failure-when-decoding-synced-two-ph.patch Description: v3-0001-Fix-assertion-failure-when-decoding-synced-two-ph.patch
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Apr 28, 2025 at 12:04 PM shveta malik wrote: > > On Fri, Apr 25, 2025 at 5:53 PM Nisha Moond wrote: > > > > Please find the attached v8 patch with above comments addressed. > > This version includes the documentation updates suggested by > > Sawada-san at [1], and incorporates his feedback from [2] as well. > > > > Thanks for the patches. > > > 1) > Regarding documents, these are the details added: > a) pg_create_logical_replication_slot says "The parameters twophase > and failover cannot be enabled together when creating a replication > slot." > b) CREATE_REPLICATION_SLOT says TWO_PHASE: This option is mutually > exclusive with failover. > c) CREATE SUBSCRIPTION doc says: The options failover and twophase > cannot be enabled together when creating a subscription. > d) ALTER SUB says: When two_phase is in the "pending" state, setting > failover = true is not permitted. Once two_phase is "enabled", > failover = true can be set only if the slot's restart_lsn has advanced > beyond its two_phase_at value. > > The individual points mentioned are correct. But nowhere we get > complete picture as in how user can enable failover and twophase > together when setting up a slot and subscription. I think we can add > brief notes in a,b and c to state how user can actually enable these 2 > together. > Or can add this detail in single page and let other pages refer/point > to that page. Thoughts? > I have updated the docs with a brief explanation of how both options can be enabled together for points (a), (b), and (c). Let’s wait for others’ opinions on whether this detail should stay consolidated on one page, with other pages referring to it. > 2) > > ReplicationSlotCreate(): > > + if (two_phase && !IsSyncingReplicationSlots()) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot enable both \"%s\" and \"%s\" options during > replication slot creation", > +"failover", "two_phase"), > + errhint("The \"%s\" option can be enabled after \"%s\" state is ready.", > + "failover", "two_phase")); > > I think errhint here is not correct. We can not alter failover of > replication slot without /o doing create-sub first and then performing > alter-sub. So either we should mention complete details or remove this > hint, otherwise it is confusing. > Removed the hint. I feel this explanation will be too long for the HINT and also docs have the details. ~~~ Please find the v9 patch, addressing the above and all other comments as well. -- Thanks, Nisha v9-0001-PG17-Approach-3-Fix-slot-synchronization-for-two_.patch Description: Binary data
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Apr 28, 2025 at 2:27 PM Zhijie Hou (Fujitsu) wrote: > > On Fri, Apr 18, 2025 at 12:29 PM Amit Kapila wrote: > > > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) > > > > > > > > - > > > Fix > > > - > > > > > > I think we should keep the confirmed_flush even if the previous synced > > > restart_lsn/catalog_xmin is newer. Attachments include a patch for the > > same. > > > > > > > This will fix the case we are facing but adds a new rule for slot > > synchronization. > > Can we think of a simpler way to fix this by avoiding updating other slot > > fields > > (like two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local > > slot > > is ahead of the remote slot? > > Since the fix for xmin advancement during fast-forward decoding has been > pushed > (commit aaf9e95), I'm attaching the V2 patch to address the assert failure by > avoiding updating other slot fields (like two_phase, two_phase_at) if > restart_lsn or catalog_xmin of the local slot is ahead. > The patch looks good to me. We can have minor comment tweak: + * Syncing only two_phase_at, without also syncing the latest + * confirmed_lsn, might lead to transactions between the old + * confirmed_lsn and two_phase_at being unexpectedly decoded and sent + * to the subscriber. We can append "following a promotion". thanks Shveta
RE: Fix slot synchronization with two_phase decoding enabled
On Fri, Apr 18, 2025 at 12:29 PM Amit Kapila wrote: > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) > > > > - > > Fix > > - > > > > I think we should keep the confirmed_flush even if the previous synced > > restart_lsn/catalog_xmin is newer. Attachments include a patch for the > same. > > > > This will fix the case we are facing but adds a new rule for slot > synchronization. > Can we think of a simpler way to fix this by avoiding updating other slot > fields > (like two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local > slot > is ahead of the remote slot? Since the fix for xmin advancement during fast-forward decoding has been pushed (commit aaf9e95), I'm attaching the V2 patch to address the assert failure by avoiding updating other slot fields (like two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local slot is ahead. After commit aaf9e95, it's less likely for the synced slot's restart/catalog to advance ahead of the source slot (The tests in 040_standby_failover_slots_sync.pl would not trigger it anymore). However, this can still occur in certain scenarios. Here is an example scenario with the steps involved: 1. Create a failover replication slot named 'logicalslot' on primary and acquire it in the walsender. 2. Log two standby snapshots on primary: - `0/3000428` - `running_xacts` (no running transactions) - `0/3000460` - `running_xacts` (no running transactions) 3. The walsender sets `0/3000428` as the `candidate_restart_lsn`, skipping the second `running_xacts` because `candidate_restart_lsn` is already valid. 4. After receiving a reply from the apply worker, the walsender assigns `0/3000428` to `restart_lsn`. At this point, the replication slot 'logicalslot' has `restart_lsn` set to `0/3000428`. 5. On the standby, execute `pg_sync_replication_slots()` to synchronize 'logicalslot'. 6. During synchronization, with the initial `restart_lsn` at `0/3000428`, the sync slot reaches a consistent point at this position. As a result, it does not use this consistent point as `candidate_restart_lsn` (refer to `SnapBuildProcessRunningXacts()`). 7. The sync process identifies the second standby snapshot at `0/3000460` and uses its LSN as `candidate_restart_lsn`, eventually updating it to `restart_lsn`. 8. Now, the synced slot holds `restart_lsn` at `0/3000460`, which is ahead of the remote slot on the primary server. So, if a user prepares a txn and changes the two_phase setting to true after the steps mentioned above, the value for two_phase_at might get synced to the standby while skipping the sync of the latest confirmed_flush_lsn. So I think it's still necessary to fix the assert failure. BTW, this fix is only for HEAD as back-branches do not sync two_phase_at field. Best Regards, Hou zj v2-0001-Fix-assertion-failure-when-decoding-synced-two-ph.patch Description: v2-0001-Fix-assertion-failure-when-decoding-synced-two-ph.patch
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Apr 25, 2025 at 5:53 PM Nisha Moond wrote: > > Please find the attached v8 patch with above comments addressed. > This version includes the documentation updates suggested by > Sawada-san at [1], and incorporates his feedback from [2] as well. > Thanks for the patches. 1) Regarding documents, these are the details added: a) pg_create_logical_replication_slot says "The parameters twophase and failover cannot be enabled together when creating a replication slot." b) CREATE_REPLICATION_SLOT says TWO_PHASE: This option is mutually exclusive with failover. c) CREATE SUBSCRIPTION doc says: The options failover and twophase cannot be enabled together when creating a subscription. d) ALTER SUB says: When two_phase is in the "pending" state, setting failover = true is not permitted. Once two_phase is "enabled", failover = true can be set only if the slot's restart_lsn has advanced beyond its two_phase_at value. The individual points mentioned are correct. But nowhere we get complete picture as in how user can enable failover and twophase together when setting up a slot and subscription. I think we can add brief notes in a,b and c to state how user can actually enable these 2 together. Or can add this detail in single page and let other pages refer/point to that page. Thoughts? 2) ReplicationSlotCreate(): + if (two_phase && !IsSyncingReplicationSlots()) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable both \"%s\" and \"%s\" options during replication slot creation", +"failover", "two_phase"), + errhint("The \"%s\" option can be enabled after \"%s\" state is ready.", + "failover", "two_phase")); I think errhint here is not correct. We can not alter failover of replication slot without /o doing create-sub first and then performing alter-sub. So either we should mention complete details or remove this hint, otherwise it is confusing. 3) CreateSubscription: + if (opts.twophase && opts.failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable both \"%s\" and \"%s\" options at CREATE SUBSCRIPTION", +"failover", "two_phase"), + errhint("The \"%s\" option can be enabled after \"%s\" state is ready.", + "failover", "two_phase")); Shall we mention "using Alter Sub" in the errhint? 4) AlterSubscription: + if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING && + opts.failover) + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot enable failover for a subscription with a pending two_phase state")); +state")); Shall we mention errhint here too? thanks Shveta
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Apr 25, 2025 at 9:57 PM Masahiko Sawada wrote: > > On Fri, Apr 25, 2025 at 3:43 AM Amit Kapila wrote: > > > > On Fri, Apr 25, 2025 at 6:02 AM Masahiko Sawada > > wrote: > > > > > > I realized that users who create a logical slot using > > > pg_create_logical_replication_slot() would not be able to enable both > > > options at slot creation, and there is no easy way to enable the > > > failover after two_phase-enabled-slot creation. Users would need to > > > use ALTER_REPLICATION_SLOT replication command, which seems > > > unrealistics for users to use. On the other hand, if we allow creating > > > a logical slot with enabling failover and two_phase using SQL API, > > > there is still a chance for this bug to occur. Would it be worth > > > considering that if a logical slot is created with enabling failover > > > and two_phase using SQL API, we create the slot with only > > > two_phase=true, then advance the slot until the slot satisfies > > > restart_lsn >= two_phase_at, and then enable the failover? > > > > > > > This means we either need to maintain somewhere that user has provided > > failover flag till restart_lsn >= two_phase_at or and then set > > failover flag in the slot > > I was thinking of this idea. > > > or initially mark it but enable the > > functionality of failover when we reach the condition restart_lsn >= > > two_phase_at. > > IIUC the slot could be synchronized to the standby as soon as we > complete DecodingContextFindStartpoint() for a failover-enabled slot. > So we would need some mechanisms to make sure that the slot is not > synchronized while we're waiting to reach the condition restart_lsn >= > two_phase_at even if the failover is enabled. > So, then we need any state or persistent flag for this. > > Both seem to have different kinds of problems. The first > > idea seems to have an issue with persistence, which means we can lose > > track of the flag after the restart. > > I think we can do this series of operations while the slot is not > persistent, that is the slot is still RS_EPHEMERAL. > But we still need a persistent flag to indicate such slots shouldn't be synced to standby till we reach the condition restart_lsn >= two_phase_at. > > The second can mislead the user > > for a long period in cases where prepare and commit have a large time > > gap. I feel this will introduce complexity either in the form of code > > or in giving the information to the user. > > Agreed. Both ways introduce complexity so we need to consider the > user-unfriendliness (by not having a proper way to enable failover for > the two_phase-enabled-slot using SQL API) vs. risk (of introducing > complexity). > Right, to me it sounds risky to provide such functionality for SQL API in the back branch. -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Apr 25, 2025 at 3:43 AM Amit Kapila wrote: > > On Fri, Apr 25, 2025 at 6:02 AM Masahiko Sawada wrote: > > > > I realized that users who create a logical slot using > > pg_create_logical_replication_slot() would not be able to enable both > > options at slot creation, and there is no easy way to enable the > > failover after two_phase-enabled-slot creation. Users would need to > > use ALTER_REPLICATION_SLOT replication command, which seems > > unrealistics for users to use. On the other hand, if we allow creating > > a logical slot with enabling failover and two_phase using SQL API, > > there is still a chance for this bug to occur. Would it be worth > > considering that if a logical slot is created with enabling failover > > and two_phase using SQL API, we create the slot with only > > two_phase=true, then advance the slot until the slot satisfies > > restart_lsn >= two_phase_at, and then enable the failover? > > > > This means we either need to maintain somewhere that user has provided > failover flag till restart_lsn >= two_phase_at or and then set > failover flag in the slot I was thinking of this idea. > or initially mark it but enable the > functionality of failover when we reach the condition restart_lsn >= > two_phase_at. IIUC the slot could be synchronized to the standby as soon as we complete DecodingContextFindStartpoint() for a failover-enabled slot. So we would need some mechanisms to make sure that the slot is not synchronized while we're waiting to reach the condition restart_lsn >= two_phase_at even if the failover is enabled. > Both seem to have different kinds of problems. The first > idea seems to have an issue with persistence, which means we can lose > track of the flag after the restart. I think we can do this series of operations while the slot is not persistent, that is the slot is still RS_EPHEMERAL. > The second can mislead the user > for a long period in cases where prepare and commit have a large time > gap. I feel this will introduce complexity either in the form of code > or in giving the information to the user. Agreed. Both ways introduce complexity so we need to consider the user-unfriendliness (by not having a proper way to enable failover for the two_phase-enabled-slot using SQL API) vs. risk (of introducing complexity). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 3:57 PM shveta malik wrote: > > On Thu, Apr 24, 2025 at 2:54 PM Nisha Moond wrote: > > > > Please find the updated patch for Approach 3, which implements the > > idea suggested by Swada-san above. > > > > Thank You for the patch. > > 1) > > CreateDecodingContext: > > if (ctx->twophase && !slot->data.two_phase) > { > + /* > + * Do not allow two-phase decoding for failover enabled slots. > + * > + * See comments atop the similar check in ReplicationSlotCreate() for > + * a detailed reason. > + */ > + if (slot->data.failover) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot enable two-phase decoding for failover enabled slot \"%s\"", > + NameStr(slot->data.name; > + > > Can you please add tetscase to cover this scenario? > Added a test for the above scenario. > 2) > We shall update create-sub documents as well for these mutually > exclusive options. Review other pages (alter-sub, create-slot) as well > for any required change. > Updated docs where I felt this new change should be mentioned. Please let me know if I missed any place. > 3) > +## > +# Test that the failover option can be enabled for a two_phase enabled > +# subscription. > +## > > Suggestion: 'Test that the failover option can be enabled for a two_phase > enabled subscription only through Alter Subscription (failover=true)' > > Done. ~~~ Please find the attached v8 patch with above comments addressed. This version includes the documentation updates suggested by Sawada-san at [1], and incorporates his feedback from [2] as well. [1] https://www.postgresql.org/message-id/CAD21AoD08Nb588fAJ%2BJd5xRxtAT8yWnOfZ3zq-K5sto9b3ntsA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAD21AoD08Nb588fAJ%2BJd5xRxtAT8yWnOfZ3zq-K5sto9b3ntsA%40mail.gmail.com -- Thanks, Nisha v8-0001-PG17-Approach-3-Fix-slot-synchronization-for-two_.patch Description: Binary data
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Apr 25, 2025 at 6:02 AM Masahiko Sawada wrote: > > I realized that users who create a logical slot using > pg_create_logical_replication_slot() would not be able to enable both > options at slot creation, and there is no easy way to enable the > failover after two_phase-enabled-slot creation. Users would need to > use ALTER_REPLICATION_SLOT replication command, which seems > unrealistics for users to use. On the other hand, if we allow creating > a logical slot with enabling failover and two_phase using SQL API, > there is still a chance for this bug to occur. Would it be worth > considering that if a logical slot is created with enabling failover > and two_phase using SQL API, we create the slot with only > two_phase=true, then advance the slot until the slot satisfies > restart_lsn >= two_phase_at, and then enable the failover? > This means we either need to maintain somewhere that user has provided failover flag till restart_lsn >= two_phase_at or and then set failover flag in the slot or initially mark it but enable the functionality of failover when we reach the condition restart_lsn >= two_phase_at. Both seem to have different kinds of problems. The first idea seems to have an issue with persistence, which means we can lose track of the flag after the restart. The second can mislead the user for a long period in cases where prepare and commit have a large time gap. I feel this will introduce complexity either in the form of code or in giving the information to the user. -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 10:48 AM Masahiko Sawada wrote: > > On Thu, Apr 24, 2025 at 2:24 AM Nisha Moond wrote: > > > > On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila > > wrote: > > > > > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada > > > wrote: > > > > > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila > > > > wrote: > > > > > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > > > wrote: > > > > > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Approach 2 > > > > > > > > -- > > > > > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover > > > > > > > > together, a more > > > > > > > > flexible strategy could be only restrict failover for slots > > > > > > > > with two-phase > > > > > > > > enabled when there's a possibility of existing prepared > > > > > > > > transactions before > > > > > > > the > > > > > > > > two_phase_at that are not yet replicated. During slot creation > > > > > > > > with > > > > > > > two-phase > > > > > > > > and failover, we could check for any decoded prepared > > > > > > > > transactions when > > > > > > > > determining the decoding start point > > > > > > > > (DecodingContextFindStartpoint). For > > > > > > > > subsequent attempts to alter failover to true, we ensure that > > > > > > > > two_phase_at is > > > > > > > > less than restart_lsn, indicating that all prepared > > > > > > > > transactions have been > > > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > > > > > pros: > > > > > > > > > > > > > > > > This method minimizes restrictions for users. Especially during > > > > > > > > slot creation > > > > > > > > with (two_phase=on, failover=on), as it’s uncommon for > > > > > > > > transactions to > > > > > > > prepare > > > > > > > > during consistent snapshot creation, the restriction becomes > > > > > > > > almost > > > > > > > > unnoticeable. > > > > > > > > > > > > > > I think this approach can work for the transactions that are > > > > > > > prepared > > > > > > > while the slot is created. But if I understand the problem > > > > > > > correctly, > > > > > > > while the initial table sync is performing, the slot's two_phase > > > > > > > is > > > > > > > still false, so we need to deal with the transactions that are > > > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we > > > > > > haven't > > > > > > started decoding when setting two_phase=true during > > > > > > CreateDecodingContext() > > > > > > after tablesync, we could check prepared transactions afterwards > > > > > > during > > > > > > decoding. This could involve reporting an ERROR when skipping a > > > > > > prepared > > > > > > transaction during decoding if its prepare LSN is less than > > > > > > two_phase_at. > > > > > > > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > > > later point of time. > > > > > > > > > > > Alternatively, a simpler method would be to prevent this situation > > > > > > entirely > > > > > > during the CREATE SUBSCRIPTION command. For example, we could > > > > > > restrict slots > > > > > > created with failover set to true and twophase is later modified to > > > > > > true after > > > > > > tablesync. Although the simpler check is more user-visible, it may > > > > > > offer less > > > > > > flexibility. > > > > > > > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > > > smart checks in the back-branch. If we follow what you say here, then > > > > > users have the following ways in PG17 to enable both failover and > > > > > two_phase. (a) During Create Subscription, users can set both > > > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > > > 'copy_data' is true, during Create Subscription, then users can enable > > > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > > > to set 'failover'. > > > > > > > > Yet another idea would be to disallow enabling both two_phase and > > > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > > > add check when enabling failover for the two_phase-enabled-slots. For > > > > example, users who want to enable both need to do two steps: > > > > > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > > > 2. ALTER SUBSCRIPTION with failover = true. > > > > > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > > > the two_phase states is ready (and possibly check if the slot's > > > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > > > the publisher enables the fai
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 2:24 AM Nisha Moond wrote: > > On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila wrote: > > > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada > > wrote: > > > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila > > > wrote: > > > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Approach 2 > > > > > > > -- > > > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover > > > > > > > together, a more > > > > > > > flexible strategy could be only restrict failover for slots with > > > > > > > two-phase > > > > > > > enabled when there's a possibility of existing prepared > > > > > > > transactions before > > > > > > the > > > > > > > two_phase_at that are not yet replicated. During slot creation > > > > > > > with > > > > > > two-phase > > > > > > > and failover, we could check for any decoded prepared > > > > > > > transactions when > > > > > > > determining the decoding start point > > > > > > > (DecodingContextFindStartpoint). For > > > > > > > subsequent attempts to alter failover to true, we ensure that > > > > > > > two_phase_at is > > > > > > > less than restart_lsn, indicating that all prepared transactions > > > > > > > have been > > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > > > pros: > > > > > > > > > > > > > > This method minimizes restrictions for users. Especially during > > > > > > > slot creation > > > > > > > with (two_phase=on, failover=on), as it’s uncommon for > > > > > > > transactions to > > > > > > prepare > > > > > > > during consistent snapshot creation, the restriction becomes > > > > > > > almost > > > > > > > unnoticeable. > > > > > > > > > > > > I think this approach can work for the transactions that are > > > > > > prepared > > > > > > while the slot is created. But if I understand the problem > > > > > > correctly, > > > > > > while the initial table sync is performing, the slot's two_phase is > > > > > > still false, so we need to deal with the transactions that are > > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we > > > > > haven't > > > > > started decoding when setting two_phase=true during > > > > > CreateDecodingContext() > > > > > after tablesync, we could check prepared transactions afterwards > > > > > during > > > > > decoding. This could involve reporting an ERROR when skipping a > > > > > prepared > > > > > transaction during decoding if its prepare LSN is less than > > > > > two_phase_at. > > > > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > > later point of time. > > > > > > > > > Alternatively, a simpler method would be to prevent this situation > > > > > entirely > > > > > during the CREATE SUBSCRIPTION command. For example, we could > > > > > restrict slots > > > > > created with failover set to true and twophase is later modified to > > > > > true after > > > > > tablesync. Although the simpler check is more user-visible, it may > > > > > offer less > > > > > flexibility. > > > > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > > smart checks in the back-branch. If we follow what you say here, then > > > > users have the following ways in PG17 to enable both failover and > > > > two_phase. (a) During Create Subscription, users can set both > > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > > 'copy_data' is true, during Create Subscription, then users can enable > > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > > to set 'failover'. > > > > > > Yet another idea would be to disallow enabling both two_phase and > > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > > add check when enabling failover for the two_phase-enabled-slots. For > > > example, users who want to enable both need to do two steps: > > > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > > 2. ALTER SUBSCRIPTION with failover = true. > > > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > > the two_phase states is ready (and possibly check if the slot's > > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > > the publisher enables the failover for the two_phase-enabled-slot up > > > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > > > passed slot's two_phase_at, otherwise raise an error with the message > > > like "the slot need to consume change upto %X/%X to enable failover". > > > > > > > This should furthe
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 2:54 PM Nisha Moond wrote: > > Please find the updated patch for Approach 3, which implements the > idea suggested by Swada-san above. > Thank You for the patch. 1) CreateDecodingContext: if (ctx->twophase && !slot->data.two_phase) { + /* + * Do not allow two-phase decoding for failover enabled slots. + * + * See comments atop the similar check in ReplicationSlotCreate() for + * a detailed reason. + */ + if (slot->data.failover) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable two-phase decoding for failover enabled slot \"%s\"", + NameStr(slot->data.name; + Can you please add tetscase to cover this scenario? 2) We shall update create-sub documents as well for these mutually exclusive options. Review other pages (alter-sub, create-slot) as well for any required change. 3) +## +# Test that the failover option can be enabled for a two_phase enabled +# subscription. +## Suggestion: 'Test that the failover option can be enabled for a two_phase enabled subscription only through Alter Subscription (failover=true)' thanks Shveta
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila wrote: > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada > wrote: > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila wrote: > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > wrote: > > > > > > > > > > > > > > > > > > -- > > > > > > Approach 2 > > > > > > -- > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover together, > > > > > > a more > > > > > > flexible strategy could be only restrict failover for slots with > > > > > > two-phase > > > > > > enabled when there's a possibility of existing prepared > > > > > > transactions before > > > > > the > > > > > > two_phase_at that are not yet replicated. During slot creation with > > > > > two-phase > > > > > > and failover, we could check for any decoded prepared transactions > > > > > > when > > > > > > determining the decoding start point > > > > > > (DecodingContextFindStartpoint). For > > > > > > subsequent attempts to alter failover to true, we ensure that > > > > > > two_phase_at is > > > > > > less than restart_lsn, indicating that all prepared transactions > > > > > > have been > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > pros: > > > > > > > > > > > > This method minimizes restrictions for users. Especially during > > > > > > slot creation > > > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions > > > > > > to > > > > > prepare > > > > > > during consistent snapshot creation, the restriction becomes almost > > > > > > unnoticeable. > > > > > > > > > > I think this approach can work for the transactions that are prepared > > > > > while the slot is created. But if I understand the problem correctly, > > > > > while the initial table sync is performing, the slot's two_phase is > > > > > still false, so we need to deal with the transactions that are > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we > > > > haven't > > > > started decoding when setting two_phase=true during > > > > CreateDecodingContext() > > > > after tablesync, we could check prepared transactions afterwards during > > > > decoding. This could involve reporting an ERROR when skipping a prepared > > > > transaction during decoding if its prepare LSN is less than > > > > two_phase_at. > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > later point of time. > > > > > > > Alternatively, a simpler method would be to prevent this situation > > > > entirely > > > > during the CREATE SUBSCRIPTION command. For example, we could restrict > > > > slots > > > > created with failover set to true and twophase is later modified to > > > > true after > > > > tablesync. Although the simpler check is more user-visible, it may > > > > offer less > > > > flexibility. > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > smart checks in the back-branch. If we follow what you say here, then > > > users have the following ways in PG17 to enable both failover and > > > two_phase. (a) During Create Subscription, users can set both > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > 'copy_data' is true, during Create Subscription, then users can enable > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > to set 'failover'. > > > > Yet another idea would be to disallow enabling both two_phase and > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > add check when enabling failover for the two_phase-enabled-slots. For > > example, users who want to enable both need to do two steps: > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > 2. ALTER SUBSCRIPTION with failover = true. > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > the two_phase states is ready (and possibly check if the slot's > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > the publisher enables the failover for the two_phase-enabled-slot up > > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > > passed slot's two_phase_at, otherwise raise an error with the message > > like "the slot need to consume change upto %X/%X to enable failover". > > > > This should further simplify the checks with an additional restriction > during the CREATE SUBSCRIPTION time. I am in favor of it because I > want the code in this area to be as much same in HEAD and backbranches > as possible. > Please find the updated patch for Approach 3, which implements the idea suggested by Swada-san above. -- Thanks, Nisha v7-0001-PG17-Appro
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada wrote: > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila wrote: > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > > > > > > > > > > -- > > > > > Approach 2 > > > > > -- > > > > > > > > > > Instead of disallowing the use of two-phase and failover together, a > > > > > more > > > > > flexible strategy could be only restrict failover for slots with > > > > > two-phase > > > > > enabled when there's a possibility of existing prepared transactions > > > > > before > > > > the > > > > > two_phase_at that are not yet replicated. During slot creation with > > > > two-phase > > > > > and failover, we could check for any decoded prepared transactions > > > > > when > > > > > determining the decoding start point (DecodingContextFindStartpoint). > > > > > For > > > > > subsequent attempts to alter failover to true, we ensure that > > > > > two_phase_at is > > > > > less than restart_lsn, indicating that all prepared transactions have > > > > > been > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > pros: > > > > > > > > > > This method minimizes restrictions for users. Especially during slot > > > > > creation > > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > > > prepare > > > > > during consistent snapshot creation, the restriction becomes almost > > > > > unnoticeable. > > > > > > > > I think this approach can work for the transactions that are prepared > > > > while the slot is created. But if I understand the problem correctly, > > > > while the initial table sync is performing, the slot's two_phase is > > > > still false, so we need to deal with the transactions that are > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > > > started decoding when setting two_phase=true during > > > CreateDecodingContext() > > > after tablesync, we could check prepared transactions afterwards during > > > decoding. This could involve reporting an ERROR when skipping a prepared > > > transaction during decoding if its prepare LSN is less than two_phase_at. > > > > > > > It will make it difficult for users to detect it as this happens at a > > later point of time. > > > > > Alternatively, a simpler method would be to prevent this situation > > > entirely > > > during the CREATE SUBSCRIPTION command. For example, we could restrict > > > slots > > > created with failover set to true and twophase is later modified to true > > > after > > > tablesync. Although the simpler check is more user-visible, it may offer > > > less > > > flexibility. > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > smart checks in the back-branch. If we follow what you say here, then > > users have the following ways in PG17 to enable both failover and > > two_phase. (a) During Create Subscription, users can set both > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > 'copy_data' is true, during Create Subscription, then users can enable > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > to set 'failover'. > > Yet another idea would be to disallow enabling both two_phase and > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > add check when enabling failover for the two_phase-enabled-slots. For > example, users who want to enable both need to do two steps: > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > 2. ALTER SUBSCRIPTION with failover = true. > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > the two_phase states is ready (and possibly check if the slot's > two_phase has been enabled too), otherwise raises an ERROR. Then, when > the publisher enables the failover for the two_phase-enabled-slot up > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > passed slot's two_phase_at, otherwise raise an error with the message > like "the slot need to consume change upto %X/%X to enable failover". > This should further simplify the checks with an additional restriction during the CREATE SUBSCRIPTION time. I am in favor of it because I want the code in this area to be as much same in HEAD and backbranches as possible. Please note that the PG17 also suffers from data loss after promotion due to a bug in fast-forward mode as described in email [1]. So, we should try to get that fixed as well. Thanks for your feedback. [1] - https://www.postgresql.org/message-id/OS0PR01MB57163087F86621D44D9A72BF94BB2%40OS0PR01MB5716.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila wrote: > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > > > > > -- > > > > Approach 2 > > > > -- > > > > > > > > Instead of disallowing the use of two-phase and failover together, a > > > > more > > > > flexible strategy could be only restrict failover for slots with > > > > two-phase > > > > enabled when there's a possibility of existing prepared transactions > > > > before > > > the > > > > two_phase_at that are not yet replicated. During slot creation with > > > two-phase > > > > and failover, we could check for any decoded prepared transactions when > > > > determining the decoding start point (DecodingContextFindStartpoint). > > > > For > > > > subsequent attempts to alter failover to true, we ensure that > > > > two_phase_at is > > > > less than restart_lsn, indicating that all prepared transactions have > > > > been > > > > committed and replicated, thus the bug would not happen. > > > > > > > > pros: > > > > > > > > This method minimizes restrictions for users. Especially during slot > > > > creation > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > > prepare > > > > during consistent snapshot creation, the restriction becomes almost > > > > unnoticeable. > > > > > > I think this approach can work for the transactions that are prepared > > > while the slot is created. But if I understand the problem correctly, > > > while the initial table sync is performing, the slot's two_phase is > > > still false, so we need to deal with the transactions that are > > > prepared during the initial table sync too. What do you think? > > > > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > > started decoding when setting two_phase=true during CreateDecodingContext() > > after tablesync, we could check prepared transactions afterwards during > > decoding. This could involve reporting an ERROR when skipping a prepared > > transaction during decoding if its prepare LSN is less than two_phase_at. > > > > It will make it difficult for users to detect it as this happens at a > later point of time. > > > Alternatively, a simpler method would be to prevent this situation entirely > > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > > created with failover set to true and twophase is later modified to true > > after > > tablesync. Although the simpler check is more user-visible, it may offer > > less > > flexibility. > > > > I agree with your point, but OTOH, I am also afraid of adding too many > smart checks in the back-branch. If we follow what you say here, then > users have the following ways in PG17 to enable both failover and > two_phase. (a) During Create Subscription, users can set both > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > 'copy_data' is true, during Create Subscription, then users can enable > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > to set 'failover'. Yet another idea would be to disallow enabling both two_phase and failover at CREATE SUBSCRIPTION regardless of copy_data value and to add check when enabling failover for the two_phase-enabled-slots. For example, users who want to enable both need to do two steps: 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. 2. ALTER SUBSCRIPTION with failover = true. At ALTER SUBSCRIPTION with failover = true, the subscriber checks if the two_phase states is ready (and possibly check if the slot's two_phase has been enabled too), otherwise raises an ERROR. Then, when the publisher enables the failover for the two_phase-enabled-slot up on walrcv_alter_slot() request, it checks the slot's restart_lsn has passed slot's two_phase_at, otherwise raise an error with the message like "the slot need to consume change upto %X/%X to enable failover". Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 22, 2025 at 3:23 PM shveta malik wrote: > > On Fri, Apr 11, 2025 at 1:03 PM Nisha Moond wrote: > > > > > > Patch "v5_aprch_3-0001" implements the above Approach 3. > > > > Thanks Hou-san for implementing approach-2 and providing the patch. > > > > I find Approach 2 better than Approach1. Yet to review Approach3. > Please find my initial comments: > Thanks for the review! I’ve addressed the comments for Approach 2, as it seems the most suitable. We can revisit the others if needed. > > > Approach #2: > > 1) > + ReorderBufferSkipPrepare(reorder, parsed.twophase_xid); > > In xact_decode, why do we have a new call to ReorderBufferSkipPrepare, > can you please add some comments here? > Done. > 2) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("\"%s\" and \"%s\" are mutually exclusive options when \"%s\" > is specified", > > So like approach 1, here as well we disallow creating subscriptions > with both failover and two_phase enabled when create_slot and > copy_data is true? But users can alter the sub later to allow failover > for two-phase enabled slot provided there are no pending PREPARE txns > which are not yet consumed by the subscriber. Is my understanding > correct? Can you please tell all the ways using which the user can > enable both failover and two_phase together? The patch msg is not that > clear. It will be good to add such details in patch message. > We allow creating subscriptions in this scenario as long as no prepared transactions exist before the two_phase_at. Similarly, altering a subscription to set failover = true is permitted, provided the slot's restart_lsn is greater than two_phase_at. Amit has suggested the ways at [1] in which users can enable both failover and two_phase together. I've updated the commit message to include more details about the conditions enforced by the fix. > 3) > > + /* > + * Do not allow users to enable the failover and two_phase options together > + * when create_slot is specified. > + * > + * See comments atop the similar check in DecodingContextFindStartpoint() > + * for a detailed reason. > + */ > + if (!opts.create_slot && opts.twophase && opts.failover) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("\"%s\" and \"%s\" are mutually exclusive options when \"%s\" > is specified", > + > > Is the check '!opts.create_slot' correct? The error msg and comment > says otherwise. > Corrected the check as it should be checking if copy_data is specified. Thanks for the input! Please find the attached v6 patch for approach-2 fix on PG17. [1] https://www.postgresql.org/message-id/CAA4eK1LvMwXxvAzHpK%2BEgjc7vu1NmGxxKcaK_06pE7GKk7JtJQ%40mail.gmail.com -- Thanks, Nisha v6-0001-PG17-Approach-2-Fix-slot-synchronization-for-two_.patch Description: Binary data
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) wrote: > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > wrote: > > > > > > > > > -- > > > Approach 2 > > > -- > > > > > > Instead of disallowing the use of two-phase and failover together, a more > > > flexible strategy could be only restrict failover for slots with two-phase > > > enabled when there's a possibility of existing prepared transactions > > > before > > the > > > two_phase_at that are not yet replicated. During slot creation with > > two-phase > > > and failover, we could check for any decoded prepared transactions when > > > determining the decoding start point (DecodingContextFindStartpoint). For > > > subsequent attempts to alter failover to true, we ensure that > > > two_phase_at is > > > less than restart_lsn, indicating that all prepared transactions have been > > > committed and replicated, thus the bug would not happen. > > > > > > pros: > > > > > > This method minimizes restrictions for users. Especially during slot > > > creation > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > prepare > > > during consistent snapshot creation, the restriction becomes almost > > > unnoticeable. > > > > I think this approach can work for the transactions that are prepared > > while the slot is created. But if I understand the problem correctly, > > while the initial table sync is performing, the slot's two_phase is > > still false, so we need to deal with the transactions that are > > prepared during the initial table sync too. What do you think? > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > started decoding when setting two_phase=true during CreateDecodingContext() > after tablesync, we could check prepared transactions afterwards during > decoding. This could involve reporting an ERROR when skipping a prepared > transaction during decoding if its prepare LSN is less than two_phase_at. > It will make it difficult for users to detect it as this happens at a later point of time. > Alternatively, a simpler method would be to prevent this situation entirely > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > created with failover set to true and twophase is later modified to true after > tablesync. Although the simpler check is more user-visible, it may offer less > flexibility. > I agree with your point, but OTOH, I am also afraid of adding too many smart checks in the back-branch. If we follow what you say here, then users have the following ways in PG17 to enable both failover and two_phase. (a) During Create Subscription, users can set both 'failover' and 'two_phase', if 'copy_data' is false, or (b), if 'copy_data' is true, during Create Subscription, then users can enable 'two_phase' and wait for it to be enabled. Then use Alter Subscription to set 'failover'. -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Apr 11, 2025 at 1:03 PM Nisha Moond wrote: > > > Patch "v5_aprch_3-0001" implements the above Approach 3. > > Thanks Hou-san for implementing approach-2 and providing the patch. > I find Approach 2 better than Approach1. Yet to review Approach3. Please find my initial comments: Approach#1: 1) When slot is created with failover enabled and later we try to create a subscription using that pre-created slot with two-phase enabled, it does not error out. But it silently retains two_phase of the existing slot as false. Please check if an error is possible in this case too. Approach #2: 1) + ReorderBufferSkipPrepare(reorder, parsed.twophase_xid); In xact_decode, why do we have a new call to ReorderBufferSkipPrepare, can you please add some comments here? 2) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("\"%s\" and \"%s\" are mutually exclusive options when \"%s\" is specified", So like approach 1, here as well we disallow creating subscriptions with both failover and two_phase enabled when create_slot and copy_data is true? But users can alter the sub later to allow failover for two-phase enabled slot provided there are no pending PREPARE txns which are not yet consumed by the subscriber. Is my understanding correct? Can you please tell all the ways using which the user can enable both failover and two_phase together? The patch msg is not that clear. It will be good to add such details in patch message. 3) + /* + * Do not allow users to enable the failover and two_phase options together + * when create_slot is specified. + * + * See comments atop the similar check in DecodingContextFindStartpoint() + * for a detailed reason. + */ + if (!opts.create_slot && opts.twophase && opts.failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("\"%s\" and \"%s\" are mutually exclusive options when \"%s\" is specified", + Is the check '!opts.create_slot' correct? The error msg and comment says otherwise. thanks Shveta
RE: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 22, 2025 at 11:23 AM Amit Kapila wrote: > > On Fri, Apr 18, 2025 at 9:58 AM Amit Kapila wrote: > > > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) > > wrote: > > > > > > - > > > Fix > > > - > > > > > > I think we should keep the confirmed_flush even if the previous > > > synced restart_lsn/catalog_xmin is newer. Attachments include a patch > for the same. > > > > > > > This will fix the case we are facing but adds a new rule for slot > > synchronization. Can we think of a simpler way to fix this by avoiding > > updating other slot fields (like two_phase, two_phase_at) if > > restart_lsn or catalog_xmin of the local slot is ahead of the remote > > slot? > > > > Thinking more about this problem, it seems to me that if the catalog_xmin of > synced slot is allowed to be ahead than the remote_slot when there is still an > open (prepared transaction), it could cause data loss. I mean that after the > promotion, some of the required catalog rows could be removed, and decoding > corresponding changes (changes from tables affected by DDL) could give > unexpected results. Those would be protected on primary/publisher because > the catalog_xmin on it was still accurate and behind. If this theory turns > out to > be true, then this is a drawback/bug of the existing fast_forward mode code. I agree that could be a problem. Upon further analysis, I find that the core problem is we do not build a base snapshot when decoding changes during fast forward mode, preventing reference to the minimum transaction ID that remains visible in the snapshot when determining the candidate for catalog_xmin. As a result, catalog_xmin was directly advanced to the oldest running transaction ID found in the latest running_xacts record. In code-level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot could not be reached during fast forward decoding, resulting in rb->txns_by_base_snapshot_lsn being NULL. When advancing catalog_xmin, the system attempts to refer to rb->txns_by_base_snapshot_lsn via ReorderBufferGetOldestXmin(). However, since rb->txns_by_base_snapshot_lsn is NULL, it defaults to directly using running->oldestRunningXid as the candidate for catalog_xmin. For reference, see the implementation details in SnapBuildProcessRunningXacts. I think this is a general issue in fast forward decoding, which not only affect slotsync. I can reproduce the issue using slot_advance SQL API as well. See the attachment for a patch that includes a test to prove that the catalog data that are still required would be removed after premature catalog_xmin advancement during fast forward decoding. If you test that patch on HEAD, you would find that the output missed a column due to vacuum removal. I will start a new thread to report and fix this general. Best Regards, Hou zj 0001-Add-a-test.patch Description: 0001-Add-a-test.patch
Re: Fix slot synchronization with two_phase decoding enabled
On Fri, Apr 18, 2025 at 9:58 AM Amit Kapila wrote: > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) > wrote: > > > > - > > Fix > > - > > > > I think we should keep the confirmed_flush even if the previous synced > > restart_lsn/catalog_xmin is newer. Attachments include a patch for the same. > > > > This will fix the case we are facing but adds a new rule for slot > synchronization. Can we think of a simpler way to fix this by avoiding > updating other slot fields (like two_phase, two_phase_at) if > restart_lsn or catalog_xmin of the local slot is ahead of the remote > slot? > Thinking more about this problem, it seems to me that if the catalog_xmin of synced slot is allowed to be ahead than the remote_slot when there is still an open (prepared transaction), it could cause data loss. I mean that after the promotion, some of the required catalog rows could be removed, and decoding corresponding changes (changes from tables affected by DDL) could give unexpected results. Those would be protected on primary/publisher because the catalog_xmin on it was still accurate and behind. If this theory turns out to be true, then this is a drawback/bug of the existing fast_forward mode code. -- With Regards, Amit Kapila.
RE: Fix slot synchronization with two_phase decoding enabled
On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > wrote: > > > > > > -- > > Approach 2 > > -- > > > > Instead of disallowing the use of two-phase and failover together, a more > > flexible strategy could be only restrict failover for slots with two-phase > > enabled when there's a possibility of existing prepared transactions before > the > > two_phase_at that are not yet replicated. During slot creation with > two-phase > > and failover, we could check for any decoded prepared transactions when > > determining the decoding start point (DecodingContextFindStartpoint). For > > subsequent attempts to alter failover to true, we ensure that two_phase_at > > is > > less than restart_lsn, indicating that all prepared transactions have been > > committed and replicated, thus the bug would not happen. > > > > pros: > > > > This method minimizes restrictions for users. Especially during slot > > creation > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > prepare > > during consistent snapshot creation, the restriction becomes almost > > unnoticeable. > > I think this approach can work for the transactions that are prepared > while the slot is created. But if I understand the problem correctly, > while the initial table sync is performing, the slot's two_phase is > still false, so we need to deal with the transactions that are > prepared during the initial table sync too. What do you think? > Yes, I agree that we need to restrict this case too. Given that we haven't started decoding when setting two_phase=true during CreateDecodingContext() after tablesync, we could check prepared transactions afterwards during decoding. This could involve reporting an ERROR when skipping a prepared transaction during decoding if its prepare LSN is less than two_phase_at. Alternatively, a simpler method would be to prevent this situation entirely during the CREATE SUBSCRIPTION command. For example, we could restrict slots created with failover set to true and twophase is later modified to true after tablesync. Although the simpler check is more user-visible, it may offer less flexibility. What do you think ? Best Regards, Hou zj
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) wrote: > > > -- > Approach 2 > -- > > Instead of disallowing the use of two-phase and failover together, a more > flexible strategy could be only restrict failover for slots with two-phase > enabled when there's a possibility of existing prepared transactions before > the > two_phase_at that are not yet replicated. During slot creation with two-phase > and failover, we could check for any decoded prepared transactions when > determining the decoding start point (DecodingContextFindStartpoint). For > subsequent attempts to alter failover to true, we ensure that two_phase_at is > less than restart_lsn, indicating that all prepared transactions have been > committed and replicated, thus the bug would not happen. > > pros: > > This method minimizes restrictions for users. Especially during slot creation > with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare > during consistent snapshot creation, the restriction becomes almost > unnoticeable. I think this approach can work for the transactions that are prepared while the slot is created. But if I understand the problem correctly, while the initial table sync is performing, the slot's two_phase is still false, so we need to deal with the transactions that are prepared during the initial table sync too. What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) wrote: > > Hi, > > The recently added tests for slotsync on two-phase enabled slots failed[1] due > to a broken assertion triggered while decoding the COMMIT PREPARED record on > the promoted standby. > > - > Analysis > - > > if ((txn->final_lsn < two_phase_at) && is_commit) > { > /* > * txn must have been marked as a prepared transaction and > skipped but > * not sent a prepare. Also, the prepare info must have been > updated > * in txn even if we skip prepare. > */ > Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) == >(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)); > > I think the issue stem from the PREPARED transaction, which was skipped on the > primary, not being skipped on the promoted standby. The root cause appears to > be the latest 'confirmed_flush' (0/6005118) of the source slot not being > synced > to the standby. This results in the confirmed_flush (0/6004F90) of the synced > slot being less than the synced two_phase_at field (0/6005118). Consequently, > the PREPARE record's LSN exceeds the synced confirmed_flush during > decoding, causing it to be incorrectly decoded and sent to the subscriber. > Thus, when > decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but > wasn't skipped. > > -- The LOGs that proves the confirmed_flush is not synced: > > decoding on original slot (primary): > LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 > ... > DETAIL: Streaming transactions committing after 0/6005118, reading > WAL from 0/60049C8. > > decoding on synced slot (promted standby) - failed run: > DETAIL: Streaming transactions committing after 0/6004F90, reading > WAL from 0/60049C8. > > decoding on synced slot (promted standby) - success run: > LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 > ... > DETAIL: Streaming transactions committing after 0/6005118, reading > WAL from 0/60049C8. > -- > > The logs above clearly show that during the test failure, the starting > position > (6004F90) was less than that of a successful run. Additionally, it was lower > than the starting position of the remote slot on the primary, indicating a > failure to synchronize confirmed_lsn. > > The reason confirmed_flush isn't synced should stem from the following check, > which skip the syncing of the confirmed_flush value. I also reproduced the > assertion failure by creating a scenario that leads to sync omission due to > this check: > > /* > * Don't overwrite if we already have a newer catalog_xmin and > * restart_lsn. > */ > if (remote_slot->restart_lsn < slot->data.restart_lsn || > TransactionIdPrecedes(remote_slot->catalog_xmin, > > slot->data.catalog_xmin)) > { > > > This check triggered because the synced catalog_xmin surpassed that of the > source slot (It can be seen from the log that the restart_lsn was synced to > the > same value) ). This situation arises due to several factors: > > On the publisher(primary), the slot may retain an older catalog_xmin and > restart_lsn because it is being consumed by a walsender. In this scenario, if > the apply worker does not immediately confirm that changes have been flushed, > the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old > value for candidate_catalog_xmin/candidate_restart_lsn and would not update > them until the apply worker confirms via LogicalConfirmReceivedLocation. > > However, the slotsync worker has the opportunity to begin WAL reading from a > more recent point, potentially advancing catalog_xmin/restart_lsn to newer > values. > > And it's also possible to advance only catalog_xmin while keep restart_lsn > unchanged, by starting a transaction before the running_xact record. In this > case, it would pass builder->last_serialized_snapshot as restart_lsn to > LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would > point to the position of the last running_xact record instead of the current > one, or the value would be NULL if no snapshot has been serialized before, so > it would not advance restart_lsn. The advancement of catalog_xmin is not > blocked by running transaction, as it would either use the XID of the running > transaction or the oldestRunningXid as candidate. > Your analysis appears correct to me. > > - > Fix > - > > I think we should keep the confirmed_flush even if the previous synced > restart_lsn/catalog_xmin is newer. Attachments include a patch for the same. > This will fix the case we are facing but adds a new rule for slot synchronization. Can we think of a simpler way to fix this by avoiding updating other slot fields (like two_p
RE: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 17, 2025 at 8:44 PM Zhijie Hou (Fujitsu) wrote: > > Hi, > > The recently added tests for slotsync on two-phase enabled slots > failed[1] due to a broken assertion triggered while decoding the > COMMIT PREPARED record on the promoted standby. Here is the missing link: [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2025-04-09%2010%3A36%3A46 Best Regards, Hou zj
RE: Fix slot synchronization with two_phase decoding enabled
Hi, The recently added tests for slotsync on two-phase enabled slots failed[1] due to a broken assertion triggered while decoding the COMMIT PREPARED record on the promoted standby. - Analysis - if ((txn->final_lsn < two_phase_at) && is_commit) { /* * txn must have been marked as a prepared transaction and skipped but * not sent a prepare. Also, the prepare info must have been updated * in txn even if we skip prepare. */ Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) == (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)); I think the issue stem from the PREPARED transaction, which was skipped on the primary, not being skipped on the promoted standby. The root cause appears to be the latest 'confirmed_flush' (0/6005118) of the source slot not being synced to the standby. This results in the confirmed_flush (0/6004F90) of the synced slot being less than the synced two_phase_at field (0/6005118). Consequently, the PREPARE record's LSN exceeds the synced confirmed_flush during decoding, causing it to be incorrectly decoded and sent to the subscriber. Thus, when decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but wasn't skipped. -- The LOGs that proves the confirmed_flush is not synced: decoding on original slot (primary): LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 ... DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8. decoding on synced slot (promted standby) - failed run: DETAIL: Streaming transactions committing after 0/6004F90, reading WAL from 0/60049C8. decoding on synced slot (promted standby) - success run: LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 ... DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8. -- The logs above clearly show that during the test failure, the starting position (6004F90) was less than that of a successful run. Additionally, it was lower than the starting position of the remote slot on the primary, indicating a failure to synchronize confirmed_lsn. The reason confirmed_flush isn't synced should stem from the following check, which skip the syncing of the confirmed_flush value. I also reproduced the assertion failure by creating a scenario that leads to sync omission due to this check: /* * Don't overwrite if we already have a newer catalog_xmin and * restart_lsn. */ if (remote_slot->restart_lsn < slot->data.restart_lsn || TransactionIdPrecedes(remote_slot->catalog_xmin, slot->data.catalog_xmin)) { This check triggered because the synced catalog_xmin surpassed that of the source slot (It can be seen from the log that the restart_lsn was synced to the same value) ). This situation arises due to several factors: On the publisher(primary), the slot may retain an older catalog_xmin and restart_lsn because it is being consumed by a walsender. In this scenario, if the apply worker does not immediately confirm that changes have been flushed, the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old value for candidate_catalog_xmin/candidate_restart_lsn and would not update them until the apply worker confirms via LogicalConfirmReceivedLocation. However, the slotsync worker has the opportunity to begin WAL reading from a more recent point, potentially advancing catalog_xmin/restart_lsn to newer values. And it's also possible to advance only catalog_xmin while keep restart_lsn unchanged, by starting a transaction before the running_xact record. In this case, it would pass builder->last_serialized_snapshot as restart_lsn to LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would point to the position of the last running_xact record instead of the current one, or the value would be NULL if no snapshot has been serialized before, so it would not advance restart_lsn. The advancement of catalog_xmin is not blocked by running transaction, as it would either use the XID of the running transaction or the oldestRunningXid as candidate. - Reproduction - Accompanying this analysis is a script to reproduce the issue. In these scripts, two running_xacts were logged post-slot creation on the publisher. Below is a detailed progression of restart_lsn/catalog_xmin advancement observed on both the publisher and standby: The process is : slot creation -> log running_xact -> prepare a txn -> log running_xact The process of advancing the catalog_xmin/restart_lsn on publisher: slot creation ... 1. 0/301B880 - running_xacts (no running txn, oldestrunningxid = 755) got new catalog xmin 755 at 0/301B880 LOG: got new restart lsn 0/301B880 a
Re: Fix slot synchronization with two_phase decoding enabled
HI, Please find the patches attached for all three approaches. On Wed, Apr 9, 2025 at 10:45 AM Zhijie Hou (Fujitsu) wrote: > > On Thu, Apr 3, 2025 at 3:16 PM Zhijie Hou (Fujitsu) wrote: > > > > On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote: > > > > > > > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila > > > wrote: > > > > > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > > > wrote: > > > > > > > > > > > > Thank you for the explanation! I agree that the issue happens in > > > > > > these > > > cases. > > > > > > > > > > > > As another idea, I wonder if we could somehow defer to make the > > > > > > synced slot as 'sync-ready' until we can ensure that the slot > > > > > > doesn't have any transactions that are prepared before the point > > > > > > of enabling two_phase. For example, when the slotsync worker > > > > > > fetches the remote slot, it remembers the confirmed_flush_lsn > > > > > > (say > > > > > > LSN-1) if the local slot's two_phase becomes true or the local > > > > > > slot is newly created with enabling two_phase, and then it makes > > > > > > the slot 'sync-ready' once it confirmed that the slot's > > > > > > restart_lsn passed > > > LSN-1. Does it work? > > > > > > > > > > Thanks for the idea! > > > > > > > > > > We considered a similar approach in [1] to confirm there is no > > > > > prepared transactions before two_phase_at, but the issue is that > > > > > when the two_phase flag is switched from 'false' to 'true' (as in > > > > > the case with (copy_data=true, failover=true, two_phase=true)). In > > > > > this case, the slot may have already been marked as sync-ready > > > > > before the two_phase flag is enabled, as slotsync is unaware of > > > > > potential > > > future changes to the two_phase flag. > > > > > > > > This can happen because when copy_data is true, tablesync can take a > > > > long time to complete the sync and in the meantime, slot without a > > > > two_phase flag would have been synced to standby. Such a slot would > > > > be marked as sync-ready even if we follow the calculation proposed > > > > by Sawada-san. Note that we enable two_phase once all the tables are > > > > in ready state (See run_apply_worker() and comments atop worker.c > > > > (TWO_PHASE TRANSACTIONS)). > > > > > > Right. It doesn't make sense to make the slot not-sync-ready and then > > > back to sync-ready. > > > > > > While I agree with the approach for HEAD and it seems difficult to > > > find a solution, I'm concerned that disallowing to use both failover > > > and two_phase in a minor release would affect users much. Users who > > > are already using that combination might end up needing to re-think > > > their system architecture. So I'm trying to narrow down use cases > > > where we're going to prohibit or to find workarounds. > > We analyzed more for the backbranch fix, and here is a summary of different > approaches that could be used for PG17. > > -- > Approach 1 > -- > > This is the original approach implemented in V4 patch. > > In this approach, we entirely disallow enabling both failover and the > two-phase > feature together for a replication slot or subscription. > > pros: > > This restriction is simple to implement and easy for users to comprehend. > > cons: > > Users would be unable to use the two-phase feature in conjunction with > failover. > > Following the upgrade to a new release with this fix, existing subscriptions > that have both failover and two-phase enabled would require manual > re-creation, > which is time-consuming. > Patch "v5_aprch_1-0001" implements the above Approach 1. > -- > Approach 2 > -- > > Instead of disallowing the use of two-phase and failover together, a more > flexible strategy could be only restrict failover for slots with two-phase > enabled when there's a possibility of existing prepared transactions before > the > two_phase_at that are not yet replicated. During slot creation with two-phase > and failover, we could check for any decoded prepared transactions when > determining the decoding start point (DecodingContextFindStartpoint). For > subsequent attempts to alter failover to true, we ensure that two_phase_at is > less than restart_lsn, indicating that all prepared transactions have been > committed and replicated, thus the bug would not happen. > > pros: > > This method minimizes restrictions for users. Especially during slot creation > with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare > during consistent snapshot creation, the restriction becomes almost > unnoticeable. > > Users are not always forced to re-create subscriptions post-upgrade. > > cons: > > The logic involved for (restart_lsn > two_phase_at) might be less intuitive > for > users. > > After upgrading, it's recommended
RE: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 3, 2025 at 3:16 PM Zhijie Hou (Fujitsu) wrote: > > On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote: > > > > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila > > wrote: > > > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > > wrote: > > > > > > > > > > Thank you for the explanation! I agree that the issue happens in > > > > > these > > cases. > > > > > > > > > > As another idea, I wonder if we could somehow defer to make the > > > > > synced slot as 'sync-ready' until we can ensure that the slot > > > > > doesn't have any transactions that are prepared before the point > > > > > of enabling two_phase. For example, when the slotsync worker > > > > > fetches the remote slot, it remembers the confirmed_flush_lsn > > > > > (say > > > > > LSN-1) if the local slot's two_phase becomes true or the local > > > > > slot is newly created with enabling two_phase, and then it makes > > > > > the slot 'sync-ready' once it confirmed that the slot's > > > > > restart_lsn passed > > LSN-1. Does it work? > > > > > > > > Thanks for the idea! > > > > > > > > We considered a similar approach in [1] to confirm there is no > > > > prepared transactions before two_phase_at, but the issue is that > > > > when the two_phase flag is switched from 'false' to 'true' (as in > > > > the case with (copy_data=true, failover=true, two_phase=true)). In > > > > this case, the slot may have already been marked as sync-ready > > > > before the two_phase flag is enabled, as slotsync is unaware of > > > > potential > > future changes to the two_phase flag. > > > > > > This can happen because when copy_data is true, tablesync can take a > > > long time to complete the sync and in the meantime, slot without a > > > two_phase flag would have been synced to standby. Such a slot would > > > be marked as sync-ready even if we follow the calculation proposed > > > by Sawada-san. Note that we enable two_phase once all the tables are > > > in ready state (See run_apply_worker() and comments atop worker.c > > > (TWO_PHASE TRANSACTIONS)). > > > > Right. It doesn't make sense to make the slot not-sync-ready and then > > back to sync-ready. > > > > While I agree with the approach for HEAD and it seems difficult to > > find a solution, I'm concerned that disallowing to use both failover > > and two_phase in a minor release would affect users much. Users who > > are already using that combination might end up needing to re-think > > their system architecture. So I'm trying to narrow down use cases > > where we're going to prohibit or to find workarounds. We analyzed more for the backbranch fix, and here is a summary of different approaches that could be used for PG17. -- Approach 1 -- This is the original approach implemented in V4 patch. In this approach, we entirely disallow enabling both failover and the two-phase feature together for a replication slot or subscription. pros: This restriction is simple to implement and easy for users to comprehend. cons: Users would be unable to use the two-phase feature in conjunction with failover. Following the upgrade to a new release with this fix, existing subscriptions that have both failover and two-phase enabled would require manual re-creation, which is time-consuming. -- Approach 2 -- Instead of disallowing the use of two-phase and failover together, a more flexible strategy could be only restrict failover for slots with two-phase enabled when there's a possibility of existing prepared transactions before the two_phase_at that are not yet replicated. During slot creation with two-phase and failover, we could check for any decoded prepared transactions when determining the decoding start point (DecodingContextFindStartpoint). For subsequent attempts to alter failover to true, we ensure that two_phase_at is less than restart_lsn, indicating that all prepared transactions have been committed and replicated, thus the bug would not happen. pros: This method minimizes restrictions for users. Especially during slot creation with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare during consistent snapshot creation, the restriction becomes almost unnoticeable. Users are not always forced to re-create subscriptions post-upgrade. cons: The logic involved for (restart_lsn > two_phase_at) might be less intuitive for users. After upgrading, it's recommended that users verify whether two_phase_at for a source slot is less than restart_lsn of the synced slot. If it isn't, users should be careful when using the synced slot on a standby. This might be complex to understand. -- Approach 3 -- This approach is similar to Approach 2 but simplifies some aspects by avoiding checks for prepared transactions during slot creation. It disallows enabling both
RE: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote: > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila > wrote: > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > > > Thank you for the explanation! I agree that the issue happens in these > cases. Just to share the steps to reproduce the issue when creating the subscription With (create_slot=true, failover = true, two_phase=true, copy_data=false) for reference. - pub: begin a txn A. This is to stop the slot from building a consistent snapshot immediately. - sub: create subscription with (slot_name='sub', create_slot=true, failover = true, two_phase=true, copy_data=false); - pub: Attach to the walsender that will create the slot and add a checkpoint in SnapBuildFindSnapshot() -> (builder->state = SNAPBUILD_FULL_SNAPSHOT;). For now, the state should be BUILDING_SNAPSHOT. - pub: begin another txn B and commit txn A. The snapshot should reach FULL_SNAPSHOT now. - pub: prepared a txn C and then commit txn B. - pub: release the checkpoint in SnapBuildFindSnapshot(), then it should reach the SNAPBUILD_CONSISTENT. Now we have a prepared txn whose prepare_lsn is less than the two_phase_at. - stop the primary and promote the standby. - sub: alter the subscription to use the new primary as the publisher. - commit the prepared transaction on new primary, the following error will be reported on subscriber: LOG: logical replication apply worker for subscription "sub" has started ERROR: prepared transaction with identifier "pg_gid_16398_764" does not exist. > > > > > > > > As another idea, I wonder if we could somehow defer to make the > > > > synced slot as 'sync-ready' until we can ensure that the slot > > > > doesn't have any transactions that are prepared before the point > > > > of enabling two_phase. For example, when the slotsync worker > > > > fetches the remote slot, it remembers the confirmed_flush_lsn (say > > > > LSN-1) if the local slot's two_phase becomes true or the local > > > > slot is newly created with enabling two_phase, and then it makes > > > > the slot 'sync-ready' once it confirmed that the slot's restart_lsn > > > > passed > LSN-1. Does it work? > > > > > > Thanks for the idea! > > > > > > We considered a similar approach in [1] to confirm there is no > > > prepared transactions before two_phase_at, but the issue is that > > > when the two_phase flag is switched from 'false' to 'true' (as in > > > the case with (copy_data=true, failover=true, two_phase=true)). In > > > this case, the slot may have already been marked as sync-ready > > > before the two_phase flag is enabled, as slotsync is unaware of potential > future changes to the two_phase flag. > > > > This can happen because when copy_data is true, tablesync can take a > > long time to complete the sync and in the meantime, slot without a > > two_phase flag would have been synced to standby. Such a slot would be > > marked as sync-ready even if we follow the calculation proposed by > > Sawada-san. Note that we enable two_phase once all the tables are in > > ready state (See run_apply_worker() and comments atop worker.c > > (TWO_PHASE TRANSACTIONS)). > > Right. It doesn't make sense to make the slot not-sync-ready and then back to > sync-ready. > > While I agree with the approach for HEAD and it seems difficult to find a > solution, I'm concerned that disallowing to use both failover and two_phase in > a minor release would affect users much. Users who are already using that > combination might end up needing to re-think their system architecture. So I'm > trying to narrow down use cases where we're going to prohibit or to find > workarounds. > > If we agree with the fix for HEAD, we can push the fix for HEAD first, which > would be better to be done sooner as it needs to bump the catversion. We can > discuss the ideas and workarounds for v17 later. Agreed. I will think more on it. One workaround could be skipping the prepared transaction when such an issue arises, followed by manually replicating the skipped changes. Best Regards, Hou zj
RE: Fix slot synchronization with two_phase decoding enabled
On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila > wrote: > > > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > > wrote: > > > > > > Hi, > > > > > > When testing the slot synchronization with logical replication slots that > > > enabled two_phase decoding, I found that transactions prepared before > two-phase > > > decoding is enabled may fail to replicate to the subscriber after being > > > committed on a promoted standby following a failover. > > > > > > To reproduce this issue, please follow these steps (also detailed in the > > > attached TAP test, v1-0001): > > > > > > 1. sub: create a subscription with (two_phase = false) > > > 2. primary (pub): prepare a txn A. > > > 3. sub: alter subscription set (two_phase = true) and wait for the logical > slot to > > >be synced to standby. > > > 4. primary (pub): stop primary, promote the standby and let the subscriber > use > > >the promoted standby as publisher. > > > 5. promoted standby (pub): COMMIT PREPARED A; > > > 6. sub: the apply worker will report the following ERROR because it didn't > > >receive the PREPARE. > > >ERROR: prepared transaction with identifier "pg_gid_16387_752" > does not exist > > > > > > I think the root cause of this issue is that the two_phase_at field of the > > > slot, which indicates the LSN from which two-phase decoding is enabled > (used to > > > prevent duplicate data transmission for prepared transactions), is not > > > synchronized to the standby server. > > > > > > In step 3, transaction A is not immediately replicated because it occurred > > > before enabling two-phase decoding. Thus, the prepared transaction > should only > > > be replicated after decoding the final COMMIT PREPARED, as referenced > in > > > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at > on the > > > standby, the prepared transaction fails to send at that time. > > > > > > This problem arises after the support for altering the two-phase option > > > (1462aad). > > > > > I suspect that this can happen in PG17 as well, but I need to think > more about it to make a reproducible test case. After further analysis, I was able to reproduce the same issue [1] in PG 17. However, since the proposed fix requires catalog changes and the issue is not a security risk significant enough to justify changing the catalog in back branches, we cannot back-patch the same solution. Following off-list discussions with Amit and Kuroda-san, we are considering disallowing enabling failover and two-phase decoding together for a replication slot, as suggested in attachment 0002. Another idea considered is to prevent the slot that enables two-phase decoding from being synced to standby. IOW, this means displaying the failover field as false in the view, if there is any possibility that transactions prepared before the two_phase_at position exist (e.g., if restart_lsn is less than two_phase_at). However, implementing this change would require additional explanations to users for this new behavior, which seems tricky. > > In the meantime, I have a few minor comments on the proposed patches: > 1. > ## > # Promote the standby1 to primary. Confirm that: > # a) the slot 'lsub1_slot' and 'snap_test_slot' are retained on the new > primary > # b) logical replication for regress_mysub1 is resumed successfully > after failover > # c) changes can be consumed from the synced slot 'snap_test_slot' > ## > -$standby1->start; > $primary->wait_for_replay_catchup($standby1); > > # Capture the time before the standby is promoted > @@ -885,6 +940,15 @@ $standby1->wait_for_catchup('regress_mysub1'); > is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), > "20", 'data replicated from the new primary'); > > +# Commit the prepared transaction > +$standby1->safe_psql('postgres', > + "COMMIT PREPARED 'test_twophase_slotsync';"); > +$standby1->wait_for_catchup('regress_mysub1'); > + > +# Confirm that the prepared transaction is replicated to the subscriber > +is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), > + "21", 'prepared data replicated from the new primary'); > > The commentary above this test should include information about > verifying the replication of previously prepared transactions after > promotion. Also, it would be better if confirm the commit prepared > before confirming the new Insert is replicated after promotion. > > 2. > @@ -249,6 +250,7 @@ update_local_synced_slot(RemoteSlot *remote_slot, > Oid remote_dbid, > SpinLockAcquire(&slot->mutex); > slot->data.restart_lsn = remote_slot->restart_lsn; > slot->data.confirmed_flush = remote_slot->confirmed_lsn; > + slot->data.two_phase_at = remote_slot->two_phase_at; > > Why do we need to update the two_phase_at here when the patch does it > later in this function when local and
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 3, 2025 at 11:08 AM Masahiko Sawada wrote: > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila wrote: > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > > > Thank you for the explanation! I agree that the issue happens in these > > > > cases. > > > > > > > > As another idea, I wonder if we could somehow defer to make the synced > > > > slot as 'sync-ready' until we can ensure that the slot doesn't have > > > > any transactions that are prepared before the point of enabling > > > > two_phase. For example, when the slotsync worker fetches the remote > > > > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local > > > > slot's two_phase becomes true or the local slot is newly created with > > > > enabling two_phase, and then it makes the slot 'sync-ready' once it > > > > confirmed that the slot's restart_lsn passed LSN-1. Does it work? > > > > > > Thanks for the idea! > > > > > > We considered a similar approach in [1] to confirm there is no prepared > > > transactions before two_phase_at, but the issue is that when the > > > two_phase flag > > > is switched from 'false' to 'true' (as in the case with (copy_data=true, > > > failover=true, two_phase=true)). In this case, the slot may have already > > > been > > > marked as sync-ready before the two_phase flag is enabled, as slotsync is > > > unaware of potential future changes to the two_phase flag. > > > > This can happen because when copy_data is true, tablesync can take a > > long time to complete the sync and in the meantime, slot without a > > two_phase flag would have been synced to standby. Such a slot would be > > marked as sync-ready even if we follow the calculation proposed by > > Sawada-san. Note that we enable two_phase once all the tables are in > > ready state (See run_apply_worker() and comments atop worker.c > > (TWO_PHASE TRANSACTIONS)). > > Right. It doesn't make sense to make the slot not-sync-ready and then > back to sync-ready. > > While I agree with the approach for HEAD and it seems difficult to > find a solution, I'm concerned that disallowing to use both failover > and two_phase in a minor release would affect users much. Users who > are already using that combination might end up needing to re-think > their system architecture. So I'm trying to narrow down use cases > where we're going to prohibit or to find workarounds. > > If we agree with the fix for HEAD, we can push the fix for HEAD first, > which would be better to be done sooner as it needs to bump the > catversion. We can discuss the ideas and workarounds for v17 later. > Thanks, I'll push the patch for HEAD and then keep thinking if we have a better way to deal with the problem in 17. BTW, the problem for 17 can happen in a much narrower set of cases as explained in the emails above. -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila wrote: > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > Thank you for the explanation! I agree that the issue happens in these > > > cases. > > > > > > As another idea, I wonder if we could somehow defer to make the synced > > > slot as 'sync-ready' until we can ensure that the slot doesn't have > > > any transactions that are prepared before the point of enabling > > > two_phase. For example, when the slotsync worker fetches the remote > > > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local > > > slot's two_phase becomes true or the local slot is newly created with > > > enabling two_phase, and then it makes the slot 'sync-ready' once it > > > confirmed that the slot's restart_lsn passed LSN-1. Does it work? > > > > Thanks for the idea! > > > > We considered a similar approach in [1] to confirm there is no prepared > > transactions before two_phase_at, but the issue is that when the two_phase > > flag > > is switched from 'false' to 'true' (as in the case with (copy_data=true, > > failover=true, two_phase=true)). In this case, the slot may have already > > been > > marked as sync-ready before the two_phase flag is enabled, as slotsync is > > unaware of potential future changes to the two_phase flag. > > This can happen because when copy_data is true, tablesync can take a > long time to complete the sync and in the meantime, slot without a > two_phase flag would have been synced to standby. Such a slot would be > marked as sync-ready even if we follow the calculation proposed by > Sawada-san. Note that we enable two_phase once all the tables are in > ready state (See run_apply_worker() and comments atop worker.c > (TWO_PHASE TRANSACTIONS)). Right. It doesn't make sense to make the slot not-sync-ready and then back to sync-ready. While I agree with the approach for HEAD and it seems difficult to find a solution, I'm concerned that disallowing to use both failover and two_phase in a minor release would affect users much. Users who are already using that combination might end up needing to re-think their system architecture. So I'm trying to narrow down use cases where we're going to prohibit or to find workarounds. If we agree with the fix for HEAD, we can push the fix for HEAD first, which would be better to be done sooner as it needs to bump the catversion. We can discuss the ideas and workarounds for v17 later. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) wrote: > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > wrote: > > > > Thank you for the explanation! I agree that the issue happens in these > > cases. > > > > As another idea, I wonder if we could somehow defer to make the synced > > slot as 'sync-ready' until we can ensure that the slot doesn't have > > any transactions that are prepared before the point of enabling > > two_phase. For example, when the slotsync worker fetches the remote > > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local > > slot's two_phase becomes true or the local slot is newly created with > > enabling two_phase, and then it makes the slot 'sync-ready' once it > > confirmed that the slot's restart_lsn passed LSN-1. Does it work? > > Thanks for the idea! > > We considered a similar approach in [1] to confirm there is no prepared > transactions before two_phase_at, but the issue is that when the two_phase > flag > is switched from 'false' to 'true' (as in the case with (copy_data=true, > failover=true, two_phase=true)). In this case, the slot may have already been > marked as sync-ready before the two_phase flag is enabled, as slotsync is > unaware of potential future changes to the two_phase flag. > This can happen because when copy_data is true, tablesync can take a long time to complete the sync and in the meantime, slot without a two_phase flag would have been synced to standby. Such a slot would be marked as sync-ready even if we follow the calculation proposed by Sawada-san. Note that we enable two_phase once all the tables are in ready state (See run_apply_worker() and comments atop worker.c (TWO_PHASE TRANSACTIONS)). -- With Regards, Amit Kapila.
RE: Fix slot synchronization with two_phase decoding enabled
On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote: > > > > Hi, > > > > > > > > On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > After further analysis, I was able to reproduce the same issue [1] in > > > > PG 17. > > > > > > > > [1] > > > > - pub: created a slot 'sub' with two_phase=false, then prepared a > > > > transaction > > > > - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', > > > > so > it is > > > > greater than prepared txn lsn. > > > > - sub: create subscription with (slot_name='sub', create_slot=false, > > > > failover = true, two_phase=true, copy_data=false); two_phase_at will > > > > be set to the same as confirmed_flush_lsn which is greater than the > > > prepared transaction. > > > > - stop the primary and promote the standby. > > > > - commit the prepared transaction on standby, the following error will > > > > be > > > > reported on subscriber: > > > > > > It seems to require elaborate steps to reproduce this issue in v17. I > > > wonder > if we > > > could somehow narrow down the cases that we want to prohibit. The patch > for > > > v17 disallows CREATE SUBSCRIPTION to enable both two_phase and > failover, > > > but I guess that it's still safe if it also creates the replication slot > > > (e.g., > > > create_slot is true). If my understanding is right, we can allow users to > specify > > > both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to > > > disallow that in ReplicationSlotCreate(). > > > > Thanks for reviewing the steps. > > > > The current reproducer aims for simplicity; however, I think it's possible > > to > reproduce > > the issue even with create_slot = true, although it requires the help of a > > debugger and additional steps. But as long as there are transactions > prepared > > before the two_phase_at position, and they are skipped due to checks in > > ReorderBufferFinishPrepared() (refer to comments[1] for why we skip > sending > > prepared transaction), the issue can be reproduced. > > > > For instance, when creating a subscription with (copy_data=true, > failover=true, > > two_phase=true), the slot's two_phase setting starts as false and shifts to > > true after table sync (refer to [2] for related comments). During this > > period, > > if a user prepares a transaction where the prepare LSN is less than the > > two_phase_at, the same problem could happen. > > > > Similarly, when setting up a subscription with (copy_data=false, > failover=true, > > two_phase=true), although two_phase is initially set to true and we wait for > > running transactions to finish when building consistent snapshot, a race > > condition may still exist: If the snapshot state reaches FULL_SNAPSHOT, it > > won't check running transactions further (see SnapBuildFindSnapshot() for > > specifics), if a user prepares a transaction at this point, it's possible > > for > > the prepare LSN to be less than the LSN marking the snapshot's consistent > > state, causing the same issue. > > Thank you for the explanation! I agree that the issue happens in these cases. > > As another idea, I wonder if we could somehow defer to make the synced > slot as 'sync-ready' until we can ensure that the slot doesn't have > any transactions that are prepared before the point of enabling > two_phase. For example, when the slotsync worker fetches the remote > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local > slot's two_phase becomes true or the local slot is newly created with > enabling two_phase, and then it makes the slot 'sync-ready' once it > confirmed that the slot's restart_lsn passed LSN-1. Does it work? Thanks for the idea! We considered a similar approach in [1] to confirm there is no prepared transactions before two_phase_at, but the issue is that when the two_phase flag is switched from 'false' to 'true' (as in the case with (copy_data=true, failover=true, two_phase=true)). In this case, the slot may have already been marked as sync-ready before the two_phase flag is enabled, as slotsync is unaware of potential future changes to the two_phase flag. Then the slot have to revert to sync-not-ready after the two_phase flag change, which could be difficult for users to understand. What do you think ? [1] https://www.postgresql.org/message-id/OS0PR01MB57161D9BB5409F229564957994AD2%40OS0PR01MB5716.jpnprd01.prod.outlook.com Best Regards, Hou zj
Re: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) wrote: > > On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote: > > Hi, > > > > > On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > > > > > > > On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila > > > > > > > > wrote: > > > > > > > > > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > > > > > wrote: > > > > > > > > > > > > Hi, > > > > > > > > > > > > When testing the slot synchronization with logical replication > > > > > > slots that enabled two_phase decoding, I found that transactions > > > > > > prepared before > > > > two-phase > > > > > > decoding is enabled may fail to replicate to the subscriber > > > > > > after being committed on a promoted standby following a failover. > > > > > > > > > > > > To reproduce this issue, please follow these steps (also > > > > > > detailed in the attached TAP test, v1-0001): > > > > > > > > > > > > 1. sub: create a subscription with (two_phase = false) 2. > > > > > > primary (pub): prepare a txn A. > > > > > > 3. sub: alter subscription set (two_phase = true) and wait for > > > > > > the logical > > > > slot to > > > > > >be synced to standby. > > > > > > 4. primary (pub): stop primary, promote the standby and let the > > > > > > subscriber > > > > use > > > > > >the promoted standby as publisher. > > > > > > 5. promoted standby (pub): COMMIT PREPARED A; 6. sub: the apply > > > > > > worker will report the following ERROR because it didn't > > > > > >receive the PREPARE. > > > > > >ERROR: prepared transaction with identifier "pg_gid_16387_752" > > > > does not exist > > > > > > > > > > > > I think the root cause of this issue is that the two_phase_at > > > > > > field of the slot, which indicates the LSN from which two-phase > > > > > > decoding is enabled > > > > (used to > > > > > > prevent duplicate data transmission for prepared transactions), > > > > > > is not synchronized to the standby server. > > > > > > > > > > > > In step 3, transaction A is not immediately replicated because > > > > > > it occurred before enabling two-phase decoding. Thus, the > > > > > > prepared transaction > > > > should only > > > > > > be replicated after decoding the final COMMIT PREPARED, as > > > > > > referenced > > > > in > > > > > > ReorderBufferFinishPrepared(). However, due to the invalid > > > > > > two_phase_at > > > > on the > > > > > > standby, the prepared transaction fails to send at that time. > > > > > > > > > > > > This problem arises after the support for altering the two-phase > > > > > > option (1462aad). > > > > > > > > > > > > > > I suspect that this can happen in PG17 as well, but I need to think > > > > more about it to make a reproducible test case. > > > > > > After further analysis, I was able to reproduce the same issue [1] in > > > PG 17. > > > > > > [1] > > > - pub: created a slot 'sub' with two_phase=false, then prepared a > > > transaction > > > - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so > > > it is > > > greater than prepared txn lsn. > > > - sub: create subscription with (slot_name='sub', create_slot=false, > > > failover = true, two_phase=true, copy_data=false); two_phase_at will > > > be set to the same as confirmed_flush_lsn which is greater than the > > prepared transaction. > > > - stop the primary and promote the standby. > > > - commit the prepared transaction on standby, the following error will be > > > reported on subscriber: > > > > It seems to require elaborate steps to reproduce this issue in v17. I > > wonder if we > > could somehow narrow down the cases that we want to prohibit. The patch for > > v17 disallows CREATE SUBSCRIPTION to enable both two_phase and failover, > > but I guess that it's still safe if it also creates the replication slot > > (e.g., > > create_slot is true). If my understanding is right, we can allow users to > > specify > > both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to > > disallow that in ReplicationSlotCreate(). > > Thanks for reviewing the steps. > > The current reproducer aims for simplicity; however, I think it's possible to > reproduce > the issue even with create_slot = true, although it requires the help of a > debugger and additional steps. But as long as there are transactions prepared > before the two_phase_at position, and they are skipped due to checks in > ReorderBufferFinishPrepared() (refer to comments[1] for why we skip sending > prepared transaction), the issue can be reproduced. > > For instance, when creating a subscription with (copy_data=true, > failover=true, > two_phase=true), the slot's two_phase setting starts as false and shifts to > true after table sync (refer to [2] for related comments). During this period, > if a user prepares a transaction where the prepare LSN is less than the > two_phase_at, the same problem could happen. > > Similarly, when setting u
RE: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote: Hi, > > On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > > > > On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila > > > > > > wrote: > > > > > > > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > > > > > Hi, > > > > > > > > > > When testing the slot synchronization with logical replication > > > > > slots that enabled two_phase decoding, I found that transactions > > > > > prepared before > > > two-phase > > > > > decoding is enabled may fail to replicate to the subscriber > > > > > after being committed on a promoted standby following a failover. > > > > > > > > > > To reproduce this issue, please follow these steps (also > > > > > detailed in the attached TAP test, v1-0001): > > > > > > > > > > 1. sub: create a subscription with (two_phase = false) 2. > > > > > primary (pub): prepare a txn A. > > > > > 3. sub: alter subscription set (two_phase = true) and wait for > > > > > the logical > > > slot to > > > > >be synced to standby. > > > > > 4. primary (pub): stop primary, promote the standby and let the > > > > > subscriber > > > use > > > > >the promoted standby as publisher. > > > > > 5. promoted standby (pub): COMMIT PREPARED A; 6. sub: the apply > > > > > worker will report the following ERROR because it didn't > > > > >receive the PREPARE. > > > > >ERROR: prepared transaction with identifier "pg_gid_16387_752" > > > does not exist > > > > > > > > > > I think the root cause of this issue is that the two_phase_at > > > > > field of the slot, which indicates the LSN from which two-phase > > > > > decoding is enabled > > > (used to > > > > > prevent duplicate data transmission for prepared transactions), > > > > > is not synchronized to the standby server. > > > > > > > > > > In step 3, transaction A is not immediately replicated because > > > > > it occurred before enabling two-phase decoding. Thus, the > > > > > prepared transaction > > > should only > > > > > be replicated after decoding the final COMMIT PREPARED, as > > > > > referenced > > > in > > > > > ReorderBufferFinishPrepared(). However, due to the invalid > > > > > two_phase_at > > > on the > > > > > standby, the prepared transaction fails to send at that time. > > > > > > > > > > This problem arises after the support for altering the two-phase > > > > > option (1462aad). > > > > > > > > > > > I suspect that this can happen in PG17 as well, but I need to think > > > more about it to make a reproducible test case. > > > > After further analysis, I was able to reproduce the same issue [1] in > > PG 17. > > > > [1] > > - pub: created a slot 'sub' with two_phase=false, then prepared a > > transaction > > - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so > > it is > > greater than prepared txn lsn. > > - sub: create subscription with (slot_name='sub', create_slot=false, > > failover = true, two_phase=true, copy_data=false); two_phase_at will > > be set to the same as confirmed_flush_lsn which is greater than the > prepared transaction. > > - stop the primary and promote the standby. > > - commit the prepared transaction on standby, the following error will be > > reported on subscriber: > > It seems to require elaborate steps to reproduce this issue in v17. I wonder > if we > could somehow narrow down the cases that we want to prohibit. The patch for > v17 disallows CREATE SUBSCRIPTION to enable both two_phase and failover, > but I guess that it's still safe if it also creates the replication slot > (e.g., > create_slot is true). If my understanding is right, we can allow users to > specify > both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to > disallow that in ReplicationSlotCreate(). Thanks for reviewing the steps. The current reproducer aims for simplicity; however, I think it's possible to reproduce the issue even with create_slot = true, although it requires the help of a debugger and additional steps. But as long as there are transactions prepared before the two_phase_at position, and they are skipped due to checks in ReorderBufferFinishPrepared() (refer to comments[1] for why we skip sending prepared transaction), the issue can be reproduced. For instance, when creating a subscription with (copy_data=true, failover=true, two_phase=true), the slot's two_phase setting starts as false and shifts to true after table sync (refer to [2] for related comments). During this period, if a user prepares a transaction where the prepare LSN is less than the two_phase_at, the same problem could happen. Similarly, when setting up a subscription with (copy_data=false, failover=true, two_phase=true), although two_phase is initially set to true and we wait for running transactions to finish when building consistent snapshot, a race condition may still exist: If the snapshot state reaches FULL_SNAPSHOT, it won't check
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Mar 31, 2025 at 4:34 AM Zhijie Hou (Fujitsu) wrote: > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila > > wrote: > > > > > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > Hi, > > > > > > > > When testing the slot synchronization with logical replication slots > > > > that > > > > enabled two_phase decoding, I found that transactions prepared before > > two-phase > > > > decoding is enabled may fail to replicate to the subscriber after being > > > > committed on a promoted standby following a failover. > > > > > > > > To reproduce this issue, please follow these steps (also detailed in the > > > > attached TAP test, v1-0001): > > > > > > > > 1. sub: create a subscription with (two_phase = false) > > > > 2. primary (pub): prepare a txn A. > > > > 3. sub: alter subscription set (two_phase = true) and wait for the > > > > logical > > slot to > > > >be synced to standby. > > > > 4. primary (pub): stop primary, promote the standby and let the > > > > subscriber > > use > > > >the promoted standby as publisher. > > > > 5. promoted standby (pub): COMMIT PREPARED A; > > > > 6. sub: the apply worker will report the following ERROR because it > > > > didn't > > > >receive the PREPARE. > > > >ERROR: prepared transaction with identifier "pg_gid_16387_752" > > does not exist > > > > > > > > I think the root cause of this issue is that the two_phase_at field of > > > > the > > > > slot, which indicates the LSN from which two-phase decoding is enabled > > (used to > > > > prevent duplicate data transmission for prepared transactions), is not > > > > synchronized to the standby server. > > > > > > > > In step 3, transaction A is not immediately replicated because it > > > > occurred > > > > before enabling two-phase decoding. Thus, the prepared transaction > > should only > > > > be replicated after decoding the final COMMIT PREPARED, as referenced > > in > > > > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at > > on the > > > > standby, the prepared transaction fails to send at that time. > > > > > > > > This problem arises after the support for altering the two-phase option > > > > (1462aad). > > > > > > > > I suspect that this can happen in PG17 as well, but I need to think > > more about it to make a reproducible test case. > > After further analysis, I was able to reproduce the same issue [1] in > PG 17. > > [1] > - pub: created a slot 'sub' with two_phase=false, then prepared a transaction > - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it > is > greater than prepared txn lsn. > - sub: create subscription with (slot_name='sub', create_slot=false, failover > = > true, two_phase=true, copy_data=false); two_phase_at will be set to the same > as confirmed_flush_lsn which is greater than the prepared transaction. > - stop the primary and promote the standby. > - commit the prepared transaction on standby, the following error will be > reported on subscriber: It seems to require elaborate steps to reproduce this issue in v17. I wonder if we could somehow narrow down the cases that we want to prohibit. The patch for v17 disallows CREATE SUBSCRIPTION to enable both two_phase and failover, but I guess that it's still safe if it also creates the replication slot (e.g., create_slot is true). If my understanding is right, we can allow users to specify both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to disallow that in ReplicationSlotCreate(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
RE: Fix slot synchronization with two_phase decoding enabled
On Wed, Apr 2, 2025 at 12:41 PM Amit Kapila wrote: > > On Tue, Apr 1, 2025 at 4:28 PM Zhijie Hou (Fujitsu) > > wrote: > > > > Here is the V3 patch set which addressed all the comments. > > > > Comment 0n 0001 > NULL for logical slots where > + two_phase is false and physical slots. > + > > change above to: > NULL for logical slots where > two_phase is false and for physical slots. Changed. > > Comment on 0002 > +# Create a subscription with two_phase enabled > +$subscriber1->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_mysub2 CONNECTION > '$publisher_connstr' > PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, create_slot = > false, enabled = false, two_phase = true);" > +); > + > +# Enable failover for the subscription ($result, $stdout, $stderr) = > +$subscriber1->psql('postgres', "ALTER SUBSCRIPTION regress_mysub2 > +SET (failover = true)"); ok( $stderr =~ > +/ERROR: cannot enable failover for a two_phase > enabled subscription/, > + "Enabling failover is not allowed for a two_phase enabled > + subscription"); > > Is there a need for this test to be in .pl file? Can't we add it in .sql file? Right, I moved the test into subscription.sql > Apart from the above, I have made minor modifications to the PG17 > patch in the attached. Here is V4 patch set which addressed above comments and passed pgindent test. Best Regards, Hou zj v4-0001-HEAD-Fix-slot-synchronization-for-two_phase-enables-sl.patch Description: v4-0001-HEAD-Fix-slot-synchronization-for-two_phase-enables-sl.patch From e1ad472436f02ed9b410bc96c1b5d0e7952f7883 Mon Sep 17 00:00:00 2001 From: Hou Zhijie Date: Mon, 31 Mar 2025 16:28:57 +0800 Subject: [PATCH v4 2/2] Fix slot synchronization for two_phase enables slots. The issue is that the transactions prepared before two-phase decoding is enabled can fail to replicate to the subscriber after being committed on a promoted standby following a failover. This is because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without two_phase_at, the logical decoding might incorrectly identify prepared transaction as already replicated to the subscriber after promotion of standby server, causing them to be skipped. To address the issue on HEAD, the two_phase_at field of the slot is exposed by the pg_replication_slots view and allows the slot synchronization to copy this value to the corresponding synced slot on the standby server. However, we can't fix the issue in the same way in PG17 where slot synchronization has been introduced, as it requires a change in view and builtin function definition, which in turn requires a catversion bump. Instead, to prevent the risk of losing prepared transactions, we disallow enabling failover and two-phase decoding together for a replication slot. Users are advised to disable slot synchronization for two_phase-enabled slots or re-create the subscriptions with option 'two_phase' as false and 'failover' as true. --- contrib/test_decoding/expected/slot.out| 2 ++ contrib/test_decoding/sql/slot.sql | 1 + src/backend/commands/subscriptioncmds.c| 26 + src/backend/replication/slot.c | 27 ++ src/bin/pg_upgrade/t/003_logical_slots.pl | 6 ++--- src/test/regress/expected/subscription.out | 6 + src/test/regress/sql/subscription.sql | 7 ++ 7 files changed, 72 insertions(+), 3 deletions(-) diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 7de03c79f6f..8fd762dea85 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -427,6 +427,8 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', ' SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); ERROR: cannot enable failover for a temporary replication slot +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); +ERROR: "failover" and "two_phase" are mutually exclusive options SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); ?column? -- diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 580e3ae3bef..a89fe712ff6 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib/test_decoding/sql/slot.sql @@ -182,6 +182,7 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_slot', 'tes SELECT 'init' FROM pg_create_logical_replication_slot('failover_false_slot', 'test_decoding', false, false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', 'test_decoding', false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); +SELECT 'init' FROM pg_cre
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 1, 2025 at 4:28 PM Zhijie Hou (Fujitsu) wrote: > > Here is the V3 patch set which addressed all the comments. > Comment 0n 0001 NULL for logical slots where + two_phase is false and physical slots. + change above to: NULL for logical slots where two_phase is false and for physical slots. Comment on 0002 +# Create a subscription with two_phase enabled +$subscriber1->safe_psql('postgres', + "CREATE SUBSCRIPTION regress_mysub2 CONNECTION '$publisher_connstr' PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, create_slot = false, enabled = false, two_phase = true);" +); + +# Enable failover for the subscription +($result, $stdout, $stderr) = $subscriber1->psql('postgres', + "ALTER SUBSCRIPTION regress_mysub2 SET (failover = true)"); +ok( $stderr =~ /ERROR: cannot enable failover for a two_phase enabled subscription/, + "Enabling failover is not allowed for a two_phase enabled subscription"); Is there a need for this test to be in .pl file? Can't we add it in .sql file? Apart from the above, I have made minor modifications to the PG17 patch in the attached. -- With Regards, Amit Kapila. diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 6db3c9347fa..09c5f0ef685 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -649,10 +649,11 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt, errhint("Subscriptions with the password_required option set to false may only be created or modified by the superuser."))); /* -* Do not allow users to enable failover and two_phase option together. +* Do not allow users to enable the failover and two_phase options +* together. * -* See comments atop the similar check in ReplicationSlotCreate() for -* detailed reasons. +* See comments atop the similar check in ReplicationSlotCreate() for a +* detailed reason. */ if (opts.twophase && opts.failover) ereport(ERROR, @@ -1258,11 +1259,11 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt, "failover"))); /* -* Do not allow users to enable failover and two_phase -* option together. +* Do not allow users to enable the failover and two_phase +* options together. * * See comments atop the similar check in -* ReplicationSlotCreate() for detailed reasons. +* ReplicationSlotCreate() for a detailed reason. */ if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && opts.failover) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index f40028bd941..d251ce95fe6 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -345,14 +345,14 @@ ReplicationSlotCreate(const char *name, bool db_specific, errmsg("cannot enable failover for a temporary replication slot")); /* -* Do not allow users to enable failover for slots that enable -* two-phase decoding. +* Do not allow users to enable both failover and two_phase for slots. * * This is because the two_phase_at field of a slot, which tracks the -* LSN from which two-phase decoding starts, is not synchronized to -* standby servers. Without this field, the logical decoding might +* LSN, from which two-phase decoding starts, is not synchronized to +* standby servers. Without two_phase_at, the logical decoding might * incorrectly identify prepared transaction as already replicated to -* the subscriber, causing them to be skipped. +* the subscriber after promotion of standby server, causing them to be +* skipped. */ if (two_phase) ereport(ERROR, @@ -865,11 +865,10 @@ ReplicationSlotAlter(const char *name, bool failover) errmsg("cannot enable failover for a temporary replication slot")); /* -* Do not allow users to enable failover for slots that enable two-phase -* decoding. +* Do not allow users to enable failover for a two_phase enabled slot. * -* See comment
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Mar 31, 2025 at 5:04 PM Zhijie Hou (Fujitsu) wrote: > > Thanks for the comments, they have been addressed in V2. > In the PG17 patch, we only have a check preventing failover and two_phase in CreateSubscription. Don't we need a similar check for AlterSubscription? Apart from the above, I have a few suggestions for changes in docs, error messages, and comments for both versions. See attached. -- With Regards, Amit Kapila. diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 141c140331d..bff0d143ac5 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -2566,7 +2566,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx The address (LSN) from which the decoding of prepared - transactions is enabled. Always NULL for physical + transactions is enabled. NULL for physical slots. diff --git a/src/test/recovery/t/040_standby_failover_slots_sync.pl b/src/test/recovery/t/040_standby_failover_slots_sync.pl index 67cc6374565..19273a49914 100644 --- a/src/test/recovery/t/040_standby_failover_slots_sync.pl +++ b/src/test/recovery/t/040_standby_failover_slots_sync.pl @@ -896,9 +896,9 @@ is($result, 't', # Promote the standby1 to primary. Confirm that: # a) the slot 'lsub1_slot' and 'snap_test_slot' are retained on the new primary # b) logical replication for regress_mysub1 is resumed successfully after failover -# c) changes from the transaction 'test_twophase_slotsync', which was prepared -#on the old primary, can be consumed from the synced slot 'snap_test_slot' -#once committed on the new primary. +# c) changes from the transaction prepared 'test_twophase_slotsync' can be +#consumed from the synced slot 'snap_test_slot' once committed on the new +#primary. # d) changes can be consumed from the synced slot 'snap_test_slot' ## $primary->wait_for_replay_catchup($standby1); diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 8308ccaad5a..df7ab827f7d 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -657,7 +657,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt, if (opts.twophase && opts.failover) ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot enable failover option when two_phase option is enabled")); + errmsg("failover and two_phase are mutually exclusive options")); /* * If built with appropriate switch, whine when regression-testing diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index c1bbd16254e..da510f60f9c 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -357,7 +357,7 @@ ReplicationSlotCreate(const char *name, bool db_specific, if (two_phase) ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot enable failover for a replication slot that enables two-phase decoding")); + errmsg("failover and two_phase are mutually exclusive options")); } /* @@ -873,7 +873,7 @@ ReplicationSlotAlter(const char *name, bool failover) if (failover && MyReplicationSlot->data.two_phase) ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot enable failover for a replication slot that enables two-phase decoding")); + errmsg("cannot enable failover for a two-phase enabled replication slot")); if (MyReplicationSlot->data.failover != failover) {
RE: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 1, 2025 at 2:09 PM Amit Kapila wrote: > > On Mon, Mar 31, 2025 at 5:0 PM Zhijie Hou (Fujitsu) > wrote: > > > > Thanks for the comments, they have been addressed in V2. > > > > In the PG17 patch, we only have a check preventing failover and > two_phase in CreateSubscription. Don't we need a similar check for > AlterSubscription? > > Apart from the above, I have a few suggestions for changes in docs, > error messages, and comments for both versions. See attached. Thanks for the comments. Here is the V3 patch set which addressed all the comments. I also added a testcase for ALTER SUB in 0002 as suggested by Kuroda-san off-list. Additionally, I found that a test failed in PG17 following this patch because it involved creating a subscription with both failover and two-phase enabled. Since that test was designed to test the upgrade of replication slots and is not directly related to failover functionality, I decided to disable the failover option for test case. Best Regards, Hou zj v3-0001-Fix-slot-synchronization-with-two_phase-decoding-.patch Description: v3-0001-Fix-slot-synchronization-with-two_phase-decoding-.patch From 0c9a8086913ca2e13106df386544821f39c1d41c Mon Sep 17 00:00:00 2001 From: Hou Zhijie Date: Mon, 31 Mar 2025 16:28:57 +0800 Subject: [PATCH v3] Disallow enabling failover for a replication slot that enables two-phase decoding This commit fixes a bug for slot synchronization with logical replication slots that enabled two_phase decoding. As it stands, transactions prepared before two-phase decoding is enabled may fail to replicate to the subscriber after being committed on a promoted standby following a failover. The issue arises because the two_phase_at field of a slot, which tracks the LSN from which two-phase decoding starts, is not synchronized to standby servers. Without this field, the logical decoding might incorrectly identify prepared transaction as already replicated to the subscriber, causing them to be skipped. To address the issue on HEAD, this commit makes the two_phase_at field of the slot visible in the pg_replication_slots view and enables the slot synchronization to copy this value to the corresponding synced slot on the standby server. The bug has been present since the introduction of slot synchronization in PostgreSQL 17. However, due to the need for catalog changes, backpatching this fix is not feasible. Instead, to prevent the risk of losing prepared transactions in prior versions, we now disallow enabling failover and two-phase decoding together for a replication slot. --- contrib/test_decoding/expected/slot.out | 2 ++ contrib/test_decoding/sql/slot.sql| 1 + src/backend/commands/subscriptioncmds.c | 25 + src/backend/replication/slot.c| 28 +++ src/bin/pg_upgrade/t/003_logical_slots.pl | 6 ++-- .../t/040_standby_failover_slots_sync.pl | 23 +++ src/test/regress/expected/subscription.out| 3 ++ src/test/regress/sql/subscription.sql | 4 +++ 8 files changed, 89 insertions(+), 3 deletions(-) diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 7de03c79f6f..8fd762dea85 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -427,6 +427,8 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', ' SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); ERROR: cannot enable failover for a temporary replication slot +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); +ERROR: "failover" and "two_phase" are mutually exclusive options SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); ?column? -- diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 580e3ae3bef..a89fe712ff6 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib/test_decoding/sql/slot.sql @@ -182,6 +182,7 @@ SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_slot', 'tes SELECT 'init' FROM pg_create_logical_replication_slot('failover_false_slot', 'test_decoding', false, false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_default_slot', 'test_decoding', false, false); SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_temp_slot', 'test_decoding', true, false, true); +SELECT 'init' FROM pg_create_logical_replication_slot('failover_twophase_true_slot', 'test_decoding', false, true, true); SELECT 'init' FROM pg_create_physical_replication_slot('physical_slot'); SELECT slot_name, slot_type, failover FROM pg_replication_slots; diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 9467f58a23d..6db3c9347fa 100644 ---
Re: Fix slot synchronization with two_phase decoding enabled
On Mon, Mar 31, 2025 at 5:04 PM Zhijie Hou (Fujitsu) wrote: > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > I suspect that this can happen in PG17 as well, but I need to think > > more about it to make a reproducible test case. > > After further analysis, I was able to reproduce the same issue [1] in > PG 17. > > However, since the proposed fix requires catalog changes and the issue is not > a > security risk significant enough to justify changing the catalog in back > branches, we cannot back-patch the same solution. > Agreed. In the past, as in commit b6e39ca92e, we have backported a catalog-modifying commit, but that is for a CVE. Users need to follow manual steps as explained in 9.6.4 release notes [1], which would be cumbersome for them. This is not a security issue, so we shouldn't backpatch a catalog modifying commit following past. > Following off-list > discussions with Amit and Kuroda-san, we are considering disallowing enabling > failover and two-phase decoding together for a replication slot, as suggested > in attachment 0002. > > Another idea considered is to prevent the slot that enables two-phase decoding > from being synced to standby. IOW, this means displaying the failover field as > false in the view, if there is any possibility that transactions prepared > before the two_phase_at position exist (e.g., if restart_lsn is less than > two_phase_at). However, implementing this change would require additional > explanations to users for this new behavior, which seems tricky. > I find it tricky to explain to users. We need to say that sometimes the slots won't be synced even if the failover is set to true. Users can verify that by checking slot properties on the publisher. Also, on the subscriber, the failover flag in the subscription may still be true unless we do more engineering to make it false. So, I prefer to simply disallow setting failover and two_phase together. We need to recommend to users in release notes for 17 that they need to disable failover for subscriptions where two_phase is enabled or re-create the subscriptions with two_phase=false and failover=true. Users may not like it, but I think it is better than getting a complaint that after promotion of standby the data to subscriber is not getting replicated. [1] - https://www.postgresql.org/docs/9.6/release-9-6-4.html -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Mar 25, 2025 at 12:14 PM Amit Kapila wrote: > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > wrote: > > > > Hi, > > > > When testing the slot synchronization with logical replication slots that > > enabled two_phase decoding, I found that transactions prepared before > > two-phase > > decoding is enabled may fail to replicate to the subscriber after being > > committed on a promoted standby following a failover. > > > > To reproduce this issue, please follow these steps (also detailed in the > > attached TAP test, v1-0001): > > > > 1. sub: create a subscription with (two_phase = false) > > 2. primary (pub): prepare a txn A. > > 3. sub: alter subscription set (two_phase = true) and wait for the logical > > slot to > >be synced to standby. > > 4. primary (pub): stop primary, promote the standby and let the subscriber > > use > >the promoted standby as publisher. > > 5. promoted standby (pub): COMMIT PREPARED A; > > 6. sub: the apply worker will report the following ERROR because it didn't > >receive the PREPARE. > >ERROR: prepared transaction with identifier "pg_gid_16387_752" does not > > exist > > > > I think the root cause of this issue is that the two_phase_at field of the > > slot, which indicates the LSN from which two-phase decoding is enabled > > (used to > > prevent duplicate data transmission for prepared transactions), is not > > synchronized to the standby server. > > > > In step 3, transaction A is not immediately replicated because it occurred > > before enabling two-phase decoding. Thus, the prepared transaction should > > only > > be replicated after decoding the final COMMIT PREPARED, as referenced in > > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at on > > the > > standby, the prepared transaction fails to send at that time. > > > > This problem arises after the support for altering the two-phase option > > (1462aad). > > I suspect that this can happen in PG17 as well, but I need to think more about it to make a reproducible test case. In the meantime, I have a few minor comments on the proposed patches: 1. ## # Promote the standby1 to primary. Confirm that: # a) the slot 'lsub1_slot' and 'snap_test_slot' are retained on the new primary # b) logical replication for regress_mysub1 is resumed successfully after failover # c) changes can be consumed from the synced slot 'snap_test_slot' ## -$standby1->start; $primary->wait_for_replay_catchup($standby1); # Capture the time before the standby is promoted @@ -885,6 +940,15 @@ $standby1->wait_for_catchup('regress_mysub1'); is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), "20", 'data replicated from the new primary'); +# Commit the prepared transaction +$standby1->safe_psql('postgres', + "COMMIT PREPARED 'test_twophase_slotsync';"); +$standby1->wait_for_catchup('regress_mysub1'); + +# Confirm that the prepared transaction is replicated to the subscriber +is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), + "21", 'prepared data replicated from the new primary'); The commentary above this test should include information about verifying the replication of previously prepared transactions after promotion. Also, it would be better if confirm the commit prepared before confirming the new Insert is replicated after promotion. 2. @@ -249,6 +250,7 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid, SpinLockAcquire(&slot->mutex); slot->data.restart_lsn = remote_slot->restart_lsn; slot->data.confirmed_flush = remote_slot->confirmed_lsn; + slot->data.two_phase_at = remote_slot->two_phase_at; Why do we need to update the two_phase_at here when the patch does it later in this function when local and remote values don't match? -- With Regards, Amit Kapila.
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) wrote: > > Hi, > > When testing the slot synchronization with logical replication slots that > enabled two_phase decoding, I found that transactions prepared before > two-phase > decoding is enabled may fail to replicate to the subscriber after being > committed on a promoted standby following a failover. > > To reproduce this issue, please follow these steps (also detailed in the > attached TAP test, v1-0001): > > 1. sub: create a subscription with (two_phase = false) > 2. primary (pub): prepare a txn A. > 3. sub: alter subscription set (two_phase = true) and wait for the logical > slot to >be synced to standby. > 4. primary (pub): stop primary, promote the standby and let the subscriber use >the promoted standby as publisher. > 5. promoted standby (pub): COMMIT PREPARED A; > 6. sub: the apply worker will report the following ERROR because it didn't >receive the PREPARE. >ERROR: prepared transaction with identifier "pg_gid_16387_752" does not > exist > > I think the root cause of this issue is that the two_phase_at field of the > slot, which indicates the LSN from which two-phase decoding is enabled (used > to > prevent duplicate data transmission for prepared transactions), is not > synchronized to the standby server. > > In step 3, transaction A is not immediately replicated because it occurred > before enabling two-phase decoding. Thus, the prepared transaction should only > be replicated after decoding the final COMMIT PREPARED, as referenced in > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at on the > standby, the prepared transaction fails to send at that time. > > This problem arises after the support for altering the two-phase option > (1462aad). > Thanks for the report and patch. I'll look into it. -- With Regards, Amit Kapila.