Re: Why is subscription/t/031_column_list.pl failing so much?
On Sat, Feb 24, 2024 at 2:31 AM Tom Lane wrote: > > Amit Kapila writes: > > Seeing no objections, I have pushed the required test changes to silence > > the BF. > > The farm is still unhappy in the v16 branch. > As this feature and tests were introduced in 15, I think we should backpatch till 15. I'll do that early next week unless you or someone else thinks otherwise. -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
Amit Kapila writes: > Seeing no objections, I have pushed the required test changes to silence the > BF. The farm is still unhappy in the v16 branch. regards, tom lane
Re: Why is subscription/t/031_column_list.pl failing so much?
On Mon, Feb 19, 2024 at 2:19 PM Amit Kapila wrote: > > On Thu, Feb 15, 2024 at 10:46 AM vignesh C wrote: > > > > On Thu, 15 Feb 2024 at 08:42, Amit Kapila wrote: > > > > > > > > To resolve the BF > > > failure, I still feel, we should just recreate the subscription. This > > > is a pre-existing problem and we can track it via a separate patch > > > with a test case targetting such failures. > > > > +1 for going with recreation of the subscription, the changes for this > > are available at [1]. > > [1] - > > https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com > > > > Tom, and others, does anyone still object to going ahead with an idea > by just changing the test to recreate the subscription to silence BF > failures for this test? > Seeing no objections, I have pushed the required test changes to silence the BF. -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
On Thu, Feb 15, 2024 at 10:46 AM vignesh C wrote: > > On Thu, 15 Feb 2024 at 08:42, Amit Kapila wrote: > > > > > To resolve the BF > > failure, I still feel, we should just recreate the subscription. This > > is a pre-existing problem and we can track it via a separate patch > > with a test case targetting such failures. > > +1 for going with recreation of the subscription, the changes for this > are available at [1]. > [1] - > https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com > Tom, and others, does anyone still object to going ahead with an idea by just changing the test to recreate the subscription to silence BF failures for this test? Note that Vignesh has already started a thread [1] to add an option (or otherwise) to ignore publication lookup failures. [1] - https://www.postgresql.org/message-id/CALDaNm0-n8FGAorM%2BbTxkzn%2BAOUyx5%3DL_XmnvOP6T24%2B-NcBKg%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
On Thu, 15 Feb 2024 at 08:42, Amit Kapila wrote: > > On Wed, Feb 14, 2024 at 12:58 PM vignesh C wrote: > > > > On Wed, 7 Feb 2024 at 16:27, vignesh C wrote: > > > > > > I was able to reproduce the issue consistently with the changes shared > > > by Tom Lane at [1]. > > > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE > > > SUBSCRIPTION and verified that the test has passed consistently for > > > >50 runs that I ran. Also the test execution time increased for this > > > case is very negligibly: > > > Without patch: 7.991 seconds > > > With test change patch: 8.121 seconds > > > > > > The test changes for the same are attached. > > > > Alternative, this could also be fixed like the changes proposed by > > Amit at [1]. In this case we ignore publications that are not found > > for the purpose of computing RelSyncEntry attributes. We won't mark > > such an entry as valid till all the publications are loaded without > > anything missing. This means we won't publish operations on tables > > corresponding to that publication till we found such a publication and > > that seems okay. > > > > Tomas had raised a performance issue forcing us to reload it for every > > replicated change/row in case the publications are invalid at [2]. > > > > Did you measure the performance impact? I think it should impact the > performance only when there is a dropped/non-existent publication as > per the subscription definition. The performance was almost similar in this case: Without patch: 7.991 seconds With skip_not_exist_publication option patch: 7.996 seconds > Also, in the same email[2], Tomas > raised another concern that in such cases it is better to break the > replication. I will handle this in the next version > > > How > > about keeping the default option as it is and providing a new option > > skip_not_exist_publication while creating/altering a subscription. In > > this case if skip_not_exist_publication is specified we will ignore > > the case if publication is not present and publications will be kept > > in invalid and get validated later. > > > > I don't know if this deserves a separate option or not but I think it > is better to discuss this in a separate thread. I will start a separate thread for this. > To resolve the BF > failure, I still feel, we should just recreate the subscription. This > is a pre-existing problem and we can track it via a separate patch > with a test case targetting such failures. +1 for going with recreation of the subscription, the changes for this are available at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com Regards, Vignesh
Re: Why is subscription/t/031_column_list.pl failing so much?
On Wed, Feb 14, 2024 at 12:58 PM vignesh C wrote: > > On Wed, 7 Feb 2024 at 16:27, vignesh C wrote: > > > > I was able to reproduce the issue consistently with the changes shared > > by Tom Lane at [1]. > > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE > > SUBSCRIPTION and verified that the test has passed consistently for > > >50 runs that I ran. Also the test execution time increased for this > > case is very negligibly: > > Without patch: 7.991 seconds > > With test change patch: 8.121 seconds > > > > The test changes for the same are attached. > > Alternative, this could also be fixed like the changes proposed by > Amit at [1]. In this case we ignore publications that are not found > for the purpose of computing RelSyncEntry attributes. We won't mark > such an entry as valid till all the publications are loaded without > anything missing. This means we won't publish operations on tables > corresponding to that publication till we found such a publication and > that seems okay. > > Tomas had raised a performance issue forcing us to reload it for every > replicated change/row in case the publications are invalid at [2]. > Did you measure the performance impact? I think it should impact the performance only when there is a dropped/non-existent publication as per the subscription definition. Also, in the same email[2], Tomas raised another concern that in such cases it is better to break the replication. > How > about keeping the default option as it is and providing a new option > skip_not_exist_publication while creating/altering a subscription. In > this case if skip_not_exist_publication is specified we will ignore > the case if publication is not present and publications will be kept > in invalid and get validated later. > I don't know if this deserves a separate option or not but I think it is better to discuss this in a separate thread. To resolve the BF failure, I still feel, we should just recreate the subscription. This is a pre-existing problem and we can track it via a separate patch with a test case targetting such failures. > The attached patch has the changes for the same. Thoughts? > > [1] - > https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com > [2] - > https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com > -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
On Wed, 7 Feb 2024 at 16:27, vignesh C wrote: > > On Wed, 7 Feb 2024 at 15:21, Amit Kapila wrote: > > > > On Tue, Feb 6, 2024 at 8:21 PM Tom Lane wrote: > > > > > > Amit Kapila writes: > > > > Yeah, I was worried about that. The other idea I have previously > > > > thought was to change Alter Subscription to Drop+Create Subscription. > > > > That should also help in bringing stability without losing any > > > > functionality. > > > > > > Hm, why would that fix it? > > > > > > > Because for new subscriptions, we will start reading WAL from the > > latest WAL insert pointer on the publisher which will be after the > > point where publication is created. > > I was able to reproduce the issue consistently with the changes shared > by Tom Lane at [1]. > I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE > SUBSCRIPTION and verified that the test has passed consistently for > >50 runs that I ran. Also the test execution time increased for this > case is very negligibly: > Without patch: 7.991 seconds > With test change patch: 8.121 seconds > > The test changes for the same are attached. Alternative, this could also be fixed like the changes proposed by Amit at [1]. In this case we ignore publications that are not found for the purpose of computing RelSyncEntry attributes. We won't mark such an entry as valid till all the publications are loaded without anything missing. This means we won't publish operations on tables corresponding to that publication till we found such a publication and that seems okay. Tomas had raised a performance issue forcing us to reload it for every replicated change/row in case the publications are invalid at [2]. How about keeping the default option as it is and providing a new option skip_not_exist_publication while creating/altering a subscription. In this case if skip_not_exist_publication is specified we will ignore the case if publication is not present and publications will be kept in invalid and get validated later. The attached patch has the changes for the same. Thoughts? [1] - https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com [2] - https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com Regards, Vignesh diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index 406a3c2dd1..404577725e 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -74,6 +74,7 @@ GetSubscription(Oid subid, bool missing_ok) sub->passwordrequired = subform->subpasswordrequired; sub->runasowner = subform->subrunasowner; sub->failover = subform->subfailover; + sub->skipnotexistpub = subform->subskipnotexistpub; /* Get conninfo */ datum = SysCacheGetAttrNotNull(SUBSCRIPTIONOID, diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 6791bff9dd..bdd367d272 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1359,7 +1359,8 @@ REVOKE ALL ON pg_subscription FROM public; GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subdisableonerr, subpasswordrequired, subrunasowner, subfailover, - subslotname, subsynccommit, subpublications, suborigin) + subskipnotexistpub, subslotname, subsynccommit, subpublications, + suborigin) ON pg_subscription TO public; CREATE VIEW pg_stat_subscription_stats AS diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index a05d69922d..bd59efc73a 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -72,6 +72,7 @@ #define SUBOPT_FAILOVER0x2000 #define SUBOPT_LSN 0x4000 #define SUBOPT_ORIGIN0x8000 +#define SUBOPT_SKIP_NOT_EXISTS_PUB 0x0001 /* check if the 'val' has 'bits' set */ #define IsSet(val, bits) (((val) & (bits)) == (bits)) @@ -97,6 +98,7 @@ typedef struct SubOpts bool passwordrequired; bool runasowner; bool failover; + bool skipnotexistpub; char *origin; XLogRecPtr lsn; } SubOpts; @@ -159,6 +161,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options, opts->runasowner = false; if (IsSet(supported_opts, SUBOPT_FAILOVER)) opts->failover = false; + if (IsSet(supported_opts, SUBOPT_SKIP_NOT_EXISTS_PUB)) + opts->skipnotexistpub = false; if (IsSet(supported_opts, SUBOPT_ORIGIN)) opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY); @@ -316,6 +320,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options, opts->specified_opts |= SUBOPT_FAILOVER; opts->failover = defGetBoolean(defel); } + else if (IsSet(supported_opts, SUBOPT_SKIP_NOT_EXISTS_PUB) && + strcmp(defel->defname, "skip_not_exist_publication") == 0) + { + if (IsSet(opts->specified_opts,
Re: Why is subscription/t/031_column_list.pl failing so much?
On Wed, 7 Feb 2024 at 15:26, Amit Kapila wrote: > > On Wed, Feb 7, 2024 at 2:06 AM Tom Lane wrote: > > > > I wrote: > > > More to the point, aren't these proposals just band-aids that > > > would stabilize the test without fixing the actual problem? > > > The same thing is likely to happen to people in the field, > > > unless we do something drastic like removing ALTER SUBSCRIPTION. > > > > I've been able to make the 031_column_list.pl failure pretty > > reproducible by adding a delay in walsender, as attached. > > > > While I'm not too familiar with this code, it definitely does appear > > that the new walsender is told to start up at an LSN before the > > creation of the publication, and then if it needs to decide whether > > to stream a particular data change before it's reached that creation, > > kaboom! > > > > I read and understood the upthread worries about it not being > > a great idea to ignore publication lookup failures, but I really > > don't see that we have much choice. As an example, if a subscriber > > is humming along reading publication pub1, and then someone > > drops and then recreates pub1 on the publisher, I don't think that > > the subscriber will be able to advance through that gap if there > > are any operations within it that require deciding if they should > > be streamed. > > > > Right. One idea to address those worries was to have a new > subscription option like ignore_nonexistant_pubs (or some better name > for such an option). The 'true' value of this new option means that we > will ignore the publication lookup failures and continue replication, > the 'false' means give an error as we are doing now. If we agree that > such an option is useful or at least saves us in some cases as > discussed in another thread [1], we can keep the default value as true > so that users don't face such errors by default and also have a way to > go back to current behavior. > > > > (That is, contrary to Amit's expectation that > > DROP/CREATE would mask the problem, I suspect it will instead turn > > it into a hard failure. I've not experimented though.) > > > > This is not contrary because I was suggesting to DROP/CREATE > Subscription whereas you are talking of drop and recreate of > Publication. > > > BTW, this same change breaks two other subscription tests: > > 015_stream.pl and 022_twophase_cascade.pl. > > The symptoms are different (no "publication does not exist" errors), > > so maybe these are just test problems not fundamental weaknesses. > > > > As per the initial analysis, this is because those cases have somewhat > larger transactions (more than 64kB) under test so it just times out > waiting for all the data to be replicated. We will do further analysis > and share the findings. Yes, these tests are failing while waiting to catchup the larger transactions to be replicated within180 seconds, as the transactions needs more time to replicate because of the sleep added. To verify this I had tried a couple of things a) I had increased the timeout to a higher value and verified that both the test runs successfully with 1800 seconds timeout. b) I reduced the sleep to 1000 microseconds and verified that both the test runs successfully. So I feel these tests 015_stream.pl and 022_twophase_cascade.pl failing after the sleep added can be ignored. Regards, Vignesh
Re: Why is subscription/t/031_column_list.pl failing so much?
On Wed, 7 Feb 2024 at 15:21, Amit Kapila wrote: > > On Tue, Feb 6, 2024 at 8:21 PM Tom Lane wrote: > > > > Amit Kapila writes: > > > Yeah, I was worried about that. The other idea I have previously > > > thought was to change Alter Subscription to Drop+Create Subscription. > > > That should also help in bringing stability without losing any > > > functionality. > > > > Hm, why would that fix it? > > > > Because for new subscriptions, we will start reading WAL from the > latest WAL insert pointer on the publisher which will be after the > point where publication is created. I was able to reproduce the issue consistently with the changes shared by Tom Lane at [1]. I have made changes to change ALTER SUBSCRIPTION to DROP+CREATE SUBSCRIPTION and verified that the test has passed consistently for >50 runs that I ran. Also the test execution time increased for this case is very negligibly: Without patch: 7.991 seconds With test change patch: 8.121 seconds The test changes for the same are attached. [1] - https://www.postgresql.org/message-id/631312.1707251789%40sss.pgh.pa.us Regards, Vignesh From 35e31ca3d843d81b43a01d537073174f30f5097c Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Wed, 7 Feb 2024 15:49:00 +0530 Subject: [PATCH v3] Fix the intermittent buildfarm failures in 031_column_list test. Fix the intermittent buildfarm failures in 031_column_list test. --- src/test/subscription/t/031_column_list.pl | 43 ++ 1 file changed, 28 insertions(+), 15 deletions(-) diff --git a/src/test/subscription/t/031_column_list.pl b/src/test/subscription/t/031_column_list.pl index 938582e31a..9a97fa5020 100644 --- a/src/test/subscription/t/031_column_list.pl +++ b/src/test/subscription/t/031_column_list.pl @@ -370,7 +370,8 @@ $node_subscriber->safe_psql( $node_subscriber->safe_psql( 'postgres', qq( - ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3 + DROP SUBSCRIPTION sub1; + CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub2, pub3 )); $node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1'); @@ -411,7 +412,8 @@ $node_subscriber->safe_psql( $node_subscriber->safe_psql( 'postgres', qq( - ALTER SUBSCRIPTION sub1 SET PUBLICATION pub4 + DROP SUBSCRIPTION sub1; + CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub4 )); $node_subscriber->wait_for_subscription_sync; @@ -487,7 +489,8 @@ $node_subscriber->safe_psql( $node_subscriber->safe_psql( 'postgres', qq( - ALTER SUBSCRIPTION sub1 SET PUBLICATION pub5 + DROP SUBSCRIPTION sub1; + CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub5 )); $node_subscriber->wait_for_subscription_sync; @@ -601,10 +604,12 @@ $node_publisher->safe_psql( ALTER PUBLICATION pub6 ADD TABLE test_part_a_2 (b); )); -# add the publication to our subscription, wait for sync to complete +# create the subscription for the above publication, wait for sync to +# complete $node_subscriber->safe_psql( 'postgres', qq( - ALTER SUBSCRIPTION sub1 SET PUBLICATION pub6 + DROP SUBSCRIPTION sub1; + CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub6 )); $node_subscriber->wait_for_subscription_sync; @@ -667,10 +672,12 @@ $node_publisher->safe_psql( CREATE PUBLICATION pub7 FOR TABLE test_part_b (a, b) WITH (publish_via_partition_root = true); )); -# add the publication to our subscription, wait for sync to complete +# create the subscription for the above publication, wait for sync to +# complete $node_subscriber->safe_psql( 'postgres', qq( - ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7 + DROP SUBSCRIPTION sub1; + CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub7 )); $node_subscriber->wait_for_subscription_sync; @@ -737,7 +744,8 @@ $node_publisher->safe_psql( ALTER PUBLICATION pub8 ADD TABLE test_part_c_2 (a,b); )); -# add the publication to our subscription, wait for sync to complete +# create the subscription for the above publication, wait for sync to +# complete $node_subscriber->safe_psql( 'postgres', qq( DROP SUBSCRIPTION sub1; @@ -835,10 +843,12 @@ $node_publisher->safe_psql( CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true); )); -# add the publication to our subscription, wait for sync to complete +# create the subscription for the above publication, wait for sync to +# complete $node_subscriber->safe_psql( 'postgres', qq( - ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9 + DROP SUBSCRIPTION sub1; + CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub9 )); $node_subscriber->wait_for_subscription_sync; @@ -880,8 +890,8 @@ $node_publisher->safe_psql( $node_subscriber->safe_psql( 'postgres', qq( CREATE TABLE test_mix_2 (a int PRIMARY KEY, b int, c int); - ALTER SUBSCRIPTION sub1 SET PUBLICATION pub_mix_3, pub_mix_4; - ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION; + DROP SUBSCRIPTION sub1; + CREATE SUBSCRIPTION sub1
Re: Why is subscription/t/031_column_list.pl failing so much?
On Wed, Feb 7, 2024 at 2:06 AM Tom Lane wrote: > > I wrote: > > More to the point, aren't these proposals just band-aids that > > would stabilize the test without fixing the actual problem? > > The same thing is likely to happen to people in the field, > > unless we do something drastic like removing ALTER SUBSCRIPTION. > > I've been able to make the 031_column_list.pl failure pretty > reproducible by adding a delay in walsender, as attached. > > While I'm not too familiar with this code, it definitely does appear > that the new walsender is told to start up at an LSN before the > creation of the publication, and then if it needs to decide whether > to stream a particular data change before it's reached that creation, > kaboom! > > I read and understood the upthread worries about it not being > a great idea to ignore publication lookup failures, but I really > don't see that we have much choice. As an example, if a subscriber > is humming along reading publication pub1, and then someone > drops and then recreates pub1 on the publisher, I don't think that > the subscriber will be able to advance through that gap if there > are any operations within it that require deciding if they should > be streamed. > Right. One idea to address those worries was to have a new subscription option like ignore_nonexistant_pubs (or some better name for such an option). The 'true' value of this new option means that we will ignore the publication lookup failures and continue replication, the 'false' means give an error as we are doing now. If we agree that such an option is useful or at least saves us in some cases as discussed in another thread [1], we can keep the default value as true so that users don't face such errors by default and also have a way to go back to current behavior. > (That is, contrary to Amit's expectation that > DROP/CREATE would mask the problem, I suspect it will instead turn > it into a hard failure. I've not experimented though.) > This is not contrary because I was suggesting to DROP/CREATE Subscription whereas you are talking of drop and recreate of Publication. > BTW, this same change breaks two other subscription tests: > 015_stream.pl and 022_twophase_cascade.pl. > The symptoms are different (no "publication does not exist" errors), > so maybe these are just test problems not fundamental weaknesses. > As per the initial analysis, this is because those cases have somewhat larger transactions (more than 64kB) under test so it just times out waiting for all the data to be replicated. We will do further analysis and share the findings. -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane wrote: > > Amit Kapila writes: > > Yeah, I was worried about that. The other idea I have previously > > thought was to change Alter Subscription to Drop+Create Subscription. > > That should also help in bringing stability without losing any > > functionality. > > Hm, why would that fix it? > Because for new subscriptions, we will start reading WAL from the latest WAL insert pointer on the publisher which will be after the point where publication is created. > More to the point, aren't these proposals just band-aids that > would stabilize the test without fixing the actual problem? Yes, but OTOH, this behavior has been since the beginning of logical replication. This particular test has just exposed it, so keeping BF failing for this particular test doesn't sound like the best way to remember it. -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
On Tue, 6 Feb 2024 at 08:30, Alexander Lakhin wrote: > > Hello Amit, > > 05.02.2024 15:20, Amit Kapila wrote: > > If this can be reproduced frequently then we can even try to test the > > patch in that thread by Osumi-San [1] (I haven't tested that it > > applies cleanly but shouldn't be difficult to make it work) based on > > the theory that walsender is starting up at an LSN somewhere before > > where the publication is created. > > > > [1] - > > https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com > > > > Yes, with the aforementioned modification of bgwriter.c and when running > 20 tests in parallel, I got failures on iterations 20. 3, 21 ..., but with the > updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every > ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran > with no failures. I was able to reproduce the issue with the patch changes suggested in bgwriter, but for me it failed in the 287th run. Then I had run the test 1000 times with the test changes shared at [1] and the test had passed all the 1000 times successfully. I have measured the test execution with average of 10 runs and found that it takes about 1.2 seconds more time to execute with the changes: Without patch: 8.454 seconds With test change patch: 9.672 seconds For the test execution comparison I had used a machine which has total Memory of 755.536 GB, 120 CPUs and RHEL 7 Operating System. [1] - https://www.postgresql.org/message-id/e6ce3cf7-4025-f129-e3ac-0f778469f720%40gmail.com Regards, Vignesh
Re: Why is subscription/t/031_column_list.pl failing so much?
I wrote: > More to the point, aren't these proposals just band-aids that > would stabilize the test without fixing the actual problem? > The same thing is likely to happen to people in the field, > unless we do something drastic like removing ALTER SUBSCRIPTION. I've been able to make the 031_column_list.pl failure pretty reproducible by adding a delay in walsender, as attached. While I'm not too familiar with this code, it definitely does appear that the new walsender is told to start up at an LSN before the creation of the publication, and then if it needs to decide whether to stream a particular data change before it's reached that creation, kaboom! I read and understood the upthread worries about it not being a great idea to ignore publication lookup failures, but I really don't see that we have much choice. As an example, if a subscriber is humming along reading publication pub1, and then someone drops and then recreates pub1 on the publisher, I don't think that the subscriber will be able to advance through that gap if there are any operations within it that require deciding if they should be streamed. (That is, contrary to Amit's expectation that DROP/CREATE would mask the problem, I suspect it will instead turn it into a hard failure. I've not experimented though.) BTW, this same change breaks two other subscription tests: 015_stream.pl and 022_twophase_cascade.pl. The symptoms are different (no "publication does not exist" errors), so maybe these are just test problems not fundamental weaknesses. But "replication falls over if the walsender is slow" isn't something I'd call acceptable. regards, tom lane diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 77c8baa32a..c897ef4c60 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2699,6 +2699,8 @@ WalSndLoop(WalSndSendDataCallback send_data) !pq_is_send_pending()) break; + pg_usleep(1); + /* * If we don't have any pending data in the output buffer, try to send * some more. If there is some, we don't bother to call send_data
Re: Why is subscription/t/031_column_list.pl failing so much?
Amit Kapila writes: > Yeah, I was worried about that. The other idea I have previously > thought was to change Alter Subscription to Drop+Create Subscription. > That should also help in bringing stability without losing any > functionality. Hm, why would that fix it? More to the point, aren't these proposals just band-aids that would stabilize the test without fixing the actual problem? The same thing is likely to happen to people in the field, unless we do something drastic like removing ALTER SUBSCRIPTION. regards, tom lane
Re: Why is subscription/t/031_column_list.pl failing so much?
On Tue, Feb 6, 2024 at 2:30 PM Alexander Lakhin wrote: > > 06.02.2024 09:48, Amit Kapila wrote: > > cool, is it possible to see whether this patch changes the runtime of > > this test in any noticeable way? > > > > Yes, unfortunately it does. > I've measured duration of 100 tests runs without the patch (with pristine > bgwriter and with NO_TEMP_INSTALL): > real6m46,031s > real6m52,406s > real6m51,014s > > But with the patched test, I've got: > real9m39,872s > real9m40,044s > real9m38,236s > (nearly 2 seconds increase per one test run) > > Under Valgrind, the original test run takes: > Files=1, Tests=36, 334 wallclock secs ( 0.02 usr 0.00 sys + 163.14 cusr > 7.98 csys = 171.14 CPU) > > But the patched one: > Files=1, Tests=36, 368 wallclock secs ( 0.02 usr 0.00 sys + 182.16 cusr > 8.90 csys = 191.08 CPU) > (30 seconds increase) > Yeah, I was worried about that. The other idea I have previously thought was to change Alter Subscription to Drop+Create Subscription. That should also help in bringing stability without losing any functionality. -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
06.02.2024 09:48, Amit Kapila wrote: cool, is it possible to see whether this patch changes the runtime of this test in any noticeable way? Yes, unfortunately it does. I've measured duration of 100 tests runs without the patch (with pristine bgwriter and with NO_TEMP_INSTALL): real 6m46,031s real 6m52,406s real 6m51,014s But with the patched test, I've got: real 9m39,872s real 9m40,044s real 9m38,236s (nearly 2 seconds increase per one test run) Under Valgrind, the original test run takes: Files=1, Tests=36, 334 wallclock secs ( 0.02 usr 0.00 sys + 163.14 cusr 7.98 csys = 171.14 CPU) But the patched one: Files=1, Tests=36, 368 wallclock secs ( 0.02 usr 0.00 sys + 182.16 cusr 8.90 csys = 191.08 CPU) (30 seconds increase) Maybe the more CPU-efficient solution would be disabling bgworker, as was proposed in another discussion of tests instability: https://www.postgresql.org/message-id/ZaTxhjnPygOdosJ4%40ip-10-97-1-34.eu-west-3.compute.internal Though I think that devising a way to control bgwriter may take more time than we can afford given the current 031 failure rate on the buildfarm (17 failures for the last 3 days). Best regards, Alexander
Re: Why is subscription/t/031_column_list.pl failing so much?
On Tue, Feb 6, 2024 at 8:30 AM Alexander Lakhin wrote: > > 05.02.2024 15:20, Amit Kapila wrote: > > If this can be reproduced frequently then we can even try to test the > > patch in that thread by Osumi-San [1] (I haven't tested that it > > applies cleanly but shouldn't be difficult to make it work) based on > > the theory that walsender is starting up at an LSN somewhere before > > where the publication is created. > > > > [1] - > > https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com > > > > Yes, with the aforementioned modification of bgwriter.c and when running > 20 tests in parallel, I got failures on iterations 20. 3, 21 ..., but with the > updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every > ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran > with no failures. > cool, is it possible to see whether this patch changes the runtime of this test in any noticeable way? -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
Hello Amit, 05.02.2024 15:20, Amit Kapila wrote: If this can be reproduced frequently then we can even try to test the patch in that thread by Osumi-San [1] (I haven't tested that it applies cleanly but shouldn't be difficult to make it work) based on the theory that walsender is starting up at an LSN somewhere before where the publication is created. [1] - https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com Yes, with the aforementioned modification of bgwriter.c and when running 20 tests in parallel, I got failures on iterations 20. 3, 21 ..., but with the updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran with no failures. Best regards, Alexanderdiff --git a/src/test/subscription/t/031_column_list.pl b/src/test/subscription/t/031_column_list.pl index 938582e31a..b199c8a1bf 100644 --- a/src/test/subscription/t/031_column_list.pl +++ b/src/test/subscription/t/031_column_list.pl @@ -368,6 +368,8 @@ $node_subscriber->safe_psql( CREATE TABLE tab5 (a int PRIMARY KEY, b int, d int); )); +$node_publisher->wait_for_catchup('sub1'); + $node_subscriber->safe_psql( 'postgres', qq( ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3 @@ -409,6 +411,8 @@ $node_subscriber->safe_psql( CREATE TABLE tab6 (a int PRIMARY KEY, b int, c int, d int); )); +$node_publisher->wait_for_catchup('sub1'); + $node_subscriber->safe_psql( 'postgres', qq( ALTER SUBSCRIPTION sub1 SET PUBLICATION pub4 @@ -485,6 +489,8 @@ $node_subscriber->safe_psql( CREATE TABLE tab7 (a int PRIMARY KEY, b int, c int, d int); )); +$node_publisher->wait_for_catchup('sub1'); + $node_subscriber->safe_psql( 'postgres', qq( ALTER SUBSCRIPTION sub1 SET PUBLICATION pub5 @@ -601,6 +607,8 @@ $node_publisher->safe_psql( ALTER PUBLICATION pub6 ADD TABLE test_part_a_2 (b); )); +$node_publisher->wait_for_catchup('sub1'); + # add the publication to our subscription, wait for sync to complete $node_subscriber->safe_psql( 'postgres', qq( @@ -667,6 +675,8 @@ $node_publisher->safe_psql( CREATE PUBLICATION pub7 FOR TABLE test_part_b (a, b) WITH (publish_via_partition_root = true); )); +$node_publisher->wait_for_catchup('sub1'); + # add the publication to our subscription, wait for sync to complete $node_subscriber->safe_psql( 'postgres', qq( @@ -835,6 +845,8 @@ $node_publisher->safe_psql( CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true); )); +$node_publisher->wait_for_catchup('sub1'); + # add the publication to our subscription, wait for sync to complete $node_subscriber->safe_psql( 'postgres', qq( @@ -877,6 +889,8 @@ $node_publisher->safe_psql( INSERT INTO test_mix_2 VALUES (1, 2, 3); )); +$node_publisher->wait_for_catchup('sub1'); + $node_subscriber->safe_psql( 'postgres', qq( CREATE TABLE test_mix_2 (a int PRIMARY KEY, b int, c int); @@ -1016,6 +1030,8 @@ $node_publisher->safe_psql( INSERT INTO s1.t VALUES (1, 2, 3); )); +$node_publisher->wait_for_catchup('sub1'); + $node_subscriber->safe_psql( 'postgres', qq( CREATE SCHEMA s1; @@ -1039,6 +1055,8 @@ is( $node_subscriber->safe_psql('postgres', "SELECT * FROM s1.t ORDER BY a"), 4|5|6), 'two publications, publishing the same relation'); +$node_publisher->wait_for_catchup('sub1'); + # Now resync the subscription, but with publications in the opposite order. # The result should be the same. @@ -1082,6 +1100,8 @@ $node_publisher->safe_psql( INSERT INTO t VALUES (1, 2, 3); )); +$node_publisher->wait_for_catchup('sub1'); + $node_subscriber->safe_psql( 'postgres', qq( DROP SCHEMA s1 CASCADE; @@ -1130,6 +1150,8 @@ $node_publisher->safe_psql( INSERT INTO t VALUES (1, 2, 3); )); +$node_publisher->wait_for_catchup('sub1'); + $node_subscriber->safe_psql( 'postgres', qq( DROP TABLE t;
Re: Why is subscription/t/031_column_list.pl failing so much?
On Sat, Feb 3, 2024 at 12:30 PM Alexander Lakhin wrote: > > 03.02.2024 04:24, Tom Lane wrote: > > I'm still wondering how come the failure seems to have suddenly gotten > > way more common. The only changes that are in vaguely-related places > > and fit the time frame are Amit's 732924043 and 776621a5e, but I sure > > don't see a connection. > > I think the failure rate increased due to tamandua, calliphoridae, > flaviventris, and kestrel were switched from make to meson recently. > If this can be reproduced frequently then we can even try to test the patch in that thread by Osumi-San [1] (I haven't tested that it applies cleanly but shouldn't be difficult to make it work) based on the theory that walsender is starting up at an LSN somewhere before where the publication is created. [1] - https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
Re: Why is subscription/t/031_column_list.pl failing so much?
Hello Tom and Noah, 03.02.2024 04:24, Tom Lane wrote: I'm still wondering how come the failure seems to have suddenly gotten way more common. The only changes that are in vaguely-related places and fit the time frame are Amit's 732924043 and 776621a5e, but I sure don't see a connection. I think the failure rate increased due to tamandua, calliphoridae, flaviventris, and kestrel were switched from make to meson recently. The last `make` builds for them: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-01-31%2016%3A51%3A31 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-01-31%2016%3A51%3A38 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-01-31%2016%3A52%3A37 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel=2024-01-31%2016%3A51%3A53 and since that switch the 031_column_list duration increased significantly, e.g., on the last tamandua `make` run it executed for 7 seconds, but successful `meson` runs give much longer duration: 280/283 postgresql:subscription / subscription/031_column_list OK 38.27s 36 subtests passed 280/283 postgresql:subscription / subscription/031_column_list OK 126.13s 36 subtests passed 280/283 postgresql:subscription / subscription/031_column_list OK 31.93s 36 subtests passed 279/283 postgresql:subscription / subscription/031_column_list OK 99.76s 36 subtests passed So, looking at the tamandua's failure log, I see: 2024-02-02 19:41:19.750 UTC [1579219][postmaster][:0] LOG: starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-12.3.0, 64-bit ... 2024-02-02 19:42:19.973 UTC [1629333][client backend][4/213:0] LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7 2024-02-02 19:42:20.131 UTC [1625765][logical replication apply worker][3/122:0] LOG: logical replication worker for subscription "sub1" will restart because of a parameter change 2024-02-02 19:42:20.137 UTC [1629333][client backend][:0] LOG: disconnection: session time: 0:00:00.212 user=bf database=postgres host=[local] 2024-02-02 19:42:20.191 UTC [1629535][logical replication apply worker][3/124:0] LOG: logical replication apply worker for subscription "sub1" has started ... 2024-02-02 19:42:20.445 UTC [1629535][logical replication apply worker][3/0:0] ERROR: could not receive data from WAL stream: ERROR: publication "pub7" does not exist CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C7698 (The interval between subscriber start and the error is ~ 4 * 15 seconds.) Thus it still may be explained by bgwriter activity, though perhaps autovacuum/checkpointer can add something as well. Best regards, Alexander
Re: Why is subscription/t/031_column_list.pl failing so much?
Noah Misch writes: > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: >> If you look at the buildfarm's failures page and filter down to >> just subscriptionCheck failures, what you find is that all of the >> last 6 such failures are in 031_column_list.pl: >> ... >> I don't see anything that 031_column_list.pl is doing that is much >> different from other subscription tests, so why is it the only one >> failing? And more to the point, what's going wrong exactly? > I don't know, but > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > reported a replacement BgWriterDelay value reproducing it. That hasn't > reproduced it in ~10 runs on my machine, though. Ah, thanks for that link. I like the theory proposed in that thread that the walsender is starting up at an LSN somewhere before where the publication is created. I'm tempted to add some more queries to the test script to see if that can be proven. I'm still wondering how come the failure seems to have suddenly gotten way more common. The only changes that are in vaguely-related places and fit the time frame are Amit's 732924043 and 776621a5e, but I sure don't see a connection. regards, tom lane
Re: Why is subscription/t/031_column_list.pl failing so much?
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote: > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: > > If you look at the buildfarm's failures page and filter down to > > just subscriptionCheck failures, what you find is that all of the > > last 6 such failures are in 031_column_list.pl: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-02-02%2019%3A33%3A16 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-02-02%2011%3A21%3A44 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2020%3A34%3A29 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2016%3A57%3A14 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel=2024-01-31%2022%3A18%3A24 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-01-30%2011%3A29%3A23 > > > > There are some further back too: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2023-11-17%2018%3A28%3A24 > > > > but this definitely got way more common in the last few days. > > > I don't see anything that 031_column_list.pl is doing that is much > > different from other subscription tests, so why is it the only one > > failing? And more to the point, what's going wrong exactly? > > I don't know, but > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > reported a replacement BgWriterDelay value reproducing it. Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to BgWriterDelay. > That hasn't reproduced it in ~10 runs on my machine, though. After 207 successes, it did fail once for me. > > I am suspicious that this somehow represents a failure of the > > historical catalog decoding logic, but I don't see how that theory > > explains this only breaking in one test script.
Re: Why is subscription/t/031_column_list.pl failing so much?
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: > If you look at the buildfarm's failures page and filter down to > just subscriptionCheck failures, what you find is that all of the > last 6 such failures are in 031_column_list.pl: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-02-02%2019%3A33%3A16 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-02-02%2011%3A21%3A44 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2020%3A34%3A29 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2016%3A57%3A14 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel=2024-01-31%2022%3A18%3A24 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-01-30%2011%3A29%3A23 > > There are some further back too: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2023-11-17%2018%3A28%3A24 > > but this definitely got way more common in the last few days. > I don't see anything that 031_column_list.pl is doing that is much > different from other subscription tests, so why is it the only one > failing? And more to the point, what's going wrong exactly? I don't know, but https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com reported a replacement BgWriterDelay value reproducing it. That hasn't reproduced it in ~10 runs on my machine, though. > I am suspicious that this somehow represents a failure of the > historical catalog decoding logic, but I don't see how that theory > explains this only breaking in one test script.
Why is subscription/t/031_column_list.pl failing so much?
If you look at the buildfarm's failures page and filter down to just subscriptionCheck failures, what you find is that all of the last 6 such failures are in 031_column_list.pl: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-02-02%2019%3A33%3A16 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-02-02%2011%3A21%3A44 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2020%3A34%3A29 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2016%3A57%3A14 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel=2024-01-31%2022%3A18%3A24 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-01-30%2011%3A29%3A23 There are some further back too: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2023-11-17%2018%3A28%3A24 but this definitely got way more common in the last few days. Digging down into the logs, these all look pretty similar. Somehow things get into a state where replication connections fail with the publisher reporting "publication does not exist": 2024-02-02 19:42:23.187 UTC [1631708][not initialized][:0] LOG: connection received: host=[local] 2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG: connection authenticated: user="bf" method=trust (/home/bf/bf-build/tamandua/HEAD/pgsql.build/testrun/subscription/031_column_list/data/t_031_column_list_publisher_data/pgdata/pg_hba.conf:117) 2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG: replication connection authorized: user=bf application_name=sub1 2024-02-02 19:42:23.214 UTC [1631708][walsender][4/288:0] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: received replication command: IDENTIFY_SYSTEM 2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: IDENTIFY_SYSTEM 2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"') 2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"') 2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG: acquired logical replication slot "sub1" 2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"') 2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] LOG: starting logical decoding for slot "sub1" 2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] DETAIL: Streaming transactions committing after 0/15BCDD0, reading WAL from 0/15BCDD0. 2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"') 2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] LOG: logical decoding found consistent point at 0/15BCDD0 2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] DETAIL: There are no running transactions. 2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"') 2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] ERROR: publication "pub7" does not exist 2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] CONTEXT: slot "sub1", output plugin "pgoutput", in the change callback, associated LSN 0/15C7698 2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 'any', publication_names '"pub7"') 2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] LOG: released logical replication slot "sub1" 2024-02-02 19:42:23.834 UTC [1631708][walsender][:0] LOG: disconnection: session time: 0:00:00.647 user=bf database=postgres host=[local] and then we just repeat that until the test times out. It fails at different points in the test script (hence, different publication names), but the pattern looks about the same. I don't see anything that 031_column_list.pl is doing that is much different from other subscription tests, so why is it the only one failing? And more to the point, what's going wrong exactly? I am suspicious that this somehow represents a failure of the historical catalog decoding logic, but I don't see how that theory explains this only breaking in one test script. regards, tom lane