Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-23 Thread Amit Kapila
On Sat, Feb 24, 2024 at 2:31 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > Seeing no objections, I have pushed the required test changes to silence 
> > the BF.
>
> The farm is still unhappy in the v16 branch.
>

As this feature and its tests were introduced in 15, I think we should
back-patch this down to 15. I'll do that early next week unless you or
someone else thinks otherwise.

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-23 Thread Tom Lane
Amit Kapila  writes:
> Seeing no objections, I have pushed the required test changes to silence the 
> BF.

The farm is still unhappy in the v16 branch.

regards, tom lane




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-22 Thread Amit Kapila
On Mon, Feb 19, 2024 at 2:19 PM Amit Kapila  wrote:
>
> On Thu, Feb 15, 2024 at 10:46 AM vignesh C  wrote:
> >
> > On Thu, 15 Feb 2024 at 08:42, Amit Kapila  wrote:
> > >
> >
> > > To resolve the BF
> > > failure, I still feel we should just recreate the subscription. This
> > > is a pre-existing problem and we can track it via a separate patch
> > > with a test case targeting such failures.
> >
> > +1 for going with recreation of the subscription, the changes for this
> > are available at [1].
> > [1] - 
> > https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com
> >
>
> Tom, and others, does anyone still object to going ahead with the idea
> of just changing the test to recreate the subscription to silence BF
> failures for this test?
>

Seeing no objections, I have pushed the required test changes to silence the BF.

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-19 Thread Amit Kapila
On Thu, Feb 15, 2024 at 10:46 AM vignesh C  wrote:
>
> On Thu, 15 Feb 2024 at 08:42, Amit Kapila  wrote:
> >
>
> > To resolve the BF
> > failure, I still feel we should just recreate the subscription. This
> > is a pre-existing problem and we can track it via a separate patch
> > with a test case targeting such failures.
>
> +1 for going with recreation of the subscription, the changes for this
> are available at [1].
> [1] - 
> https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com
>

Tom, and others, does anyone still object to going ahead with the idea
of just changing the test to recreate the subscription to silence BF
failures for this test?

Note that Vignesh has already started a thread [1] to add an option
(or some other mechanism) to ignore publication lookup failures.

[1] - 
https://www.postgresql.org/message-id/CALDaNm0-n8FGAorM%2BbTxkzn%2BAOUyx5%3DL_XmnvOP6T24%2B-NcBKg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-14 Thread vignesh C
On Thu, 15 Feb 2024 at 08:42, Amit Kapila  wrote:
>
> On Wed, Feb 14, 2024 at 12:58 PM vignesh C  wrote:
> >
> > On Wed, 7 Feb 2024 at 16:27, vignesh C  wrote:
> > >
> > > I was able to reproduce the issue consistently with the changes shared
> > > by Tom Lane at [1].
> > > I have changed ALTER SUBSCRIPTION to DROP+CREATE SUBSCRIPTION and
> > > verified that the test passed consistently across the >50 runs I did.
> > > Also, the increase in test execution time in this case is negligible:
> > > Without patch: 7.991 seconds
> > > With test change patch:   8.121 seconds
> > >
> > > The test changes for the same are attached.
> >
> > Alternatively, this could also be fixed like the changes proposed by
> > Amit at [1]. In this case we ignore publications that are not found
> > for the purpose of computing RelSyncEntry attributes. We won't mark
> > such an entry as valid until all the publications are loaded without
> > anything missing. This means we won't publish operations on tables
> > corresponding to that publication until we find such a publication,
> > and that seems okay.
> >
> > Tomas had raised a performance issue at [2]: it forces us to reload the
> > publication information for every replicated change/row while the
> > publications are invalid.
> >
>
> Did you measure the performance impact? I think it should impact the
> performance only when there is a dropped/non-existent publication as
> per the subscription definition.

The performance was almost the same in this case:
Without patch: 7.991 seconds
With skip_not_exist_publication option patch: 7.996 seconds

> Also, in the same email[2], Tomas
> raised another concern that in such cases it is better to break the
> replication.

I will handle this in the next version.

> >
> > How about keeping the default behavior as it is and providing a new
> > option, skip_not_exist_publication, while creating/altering a
> > subscription? If skip_not_exist_publication is specified, we will
> > ignore the case where a publication is not present; the publications
> > will be kept invalid and get validated later.
> >
>
> I don't know if this deserves a separate option or not but I think it
> is better to discuss this in a separate thread.

I will start a separate thread for this.

> To resolve the BF
> failure, I still feel we should just recreate the subscription. This
> is a pre-existing problem and we can track it via a separate patch
> with a test case targeting such failures.

+1 for going with recreation of the subscription, the changes for this
are available at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com

Regards,
Vignesh




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-14 Thread Amit Kapila
On Wed, Feb 14, 2024 at 12:58 PM vignesh C  wrote:
>
> On Wed, 7 Feb 2024 at 16:27, vignesh C  wrote:
> >
> > I was able to reproduce the issue consistently with the changes shared
> > by Tom Lane at [1].
> > I have changed ALTER SUBSCRIPTION to DROP+CREATE SUBSCRIPTION and
> > verified that the test passed consistently across the >50 runs I did.
> > Also, the increase in test execution time in this case is negligible:
> > Without patch: 7.991 seconds
> > With test change patch:   8.121 seconds
> >
> > The test changes for the same are attached.
>
> Alternatively, this could also be fixed like the changes proposed by
> Amit at [1]. In this case we ignore publications that are not found
> for the purpose of computing RelSyncEntry attributes. We won't mark
> such an entry as valid until all the publications are loaded without
> anything missing. This means we won't publish operations on tables
> corresponding to that publication until we find such a publication,
> and that seems okay.
>
> Tomas had raised a performance issue at [2]: it forces us to reload the
> publication information for every replicated change/row while the
> publications are invalid.
>

Did you measure the performance impact? I think it should impact the
performance only when there is a dropped/non-existent publication as
per the subscription definition. Also, in the same email[2], Tomas
raised another concern that in such cases it is better to break the
replication.

>
> How about keeping the default behavior as it is and providing a new
> option, skip_not_exist_publication, while creating/altering a
> subscription? If skip_not_exist_publication is specified, we will
> ignore the case where a publication is not present; the publications
> will be kept invalid and get validated later.
>

I don't know if this deserves a separate option or not but I think it
is better to discuss this in a separate thread. To resolve the BF
failure, I still feel we should just recreate the subscription. This
is a pre-existing problem and we can track it via a separate patch
with a test case targeting such failures.

> The attached patch has the changes for the same. Thoughts?
>
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com
> [2] - 
> https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com
>

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-13 Thread vignesh C
On Wed, 7 Feb 2024 at 16:27, vignesh C  wrote:
>
> On Wed, 7 Feb 2024 at 15:21, Amit Kapila  wrote:
> >
> > On Tue, Feb 6, 2024 at 8:21 PM Tom Lane  wrote:
> > >
> > > Amit Kapila  writes:
> > > > Yeah, I was worried about that. The other idea I had previously
> > > > thought of was to change Alter Subscription to Drop+Create Subscription.
> > > > That should also help in bringing stability without losing any
> > > > functionality.
> > >
> > > Hm, why would that fix it?
> > >
> >
> > Because for new subscriptions, we will start reading WAL from the
> > latest WAL insert pointer on the publisher, which will be after the
> > point where the publication is created.
>
> I was able to reproduce the issue consistently with the changes shared
> by Tom Lane at [1].
> I have changed ALTER SUBSCRIPTION to DROP+CREATE SUBSCRIPTION and
> verified that the test passed consistently across the >50 runs I did.
> Also, the increase in test execution time in this case is negligible:
> Without patch: 7.991 seconds
> With test change patch:   8.121 seconds
>
> The test changes for the same are attached.

Alternatively, this could also be fixed like the changes proposed by
Amit at [1]. In this case we ignore publications that are not found
for the purpose of computing RelSyncEntry attributes. We won't mark
such an entry as valid until all the publications are loaded without
anything missing. This means we won't publish operations on tables
corresponding to that publication until we find such a publication,
and that seems okay.

Tomas had raised a performance issue at [2]: it forces us to reload the
publication information for every replicated change/row while the
publications are invalid. How about keeping the default behavior as it
is and providing a new option, skip_not_exist_publication, while
creating/altering a subscription? If skip_not_exist_publication is
specified, we will ignore the case where a publication is not present;
the publications will be kept invalid and get validated later.
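
For illustration, a minimal sketch of how such a subscription might be
created, in the style of the existing TAP tests (the option name,
syntax, and behavior here are only this proposal, nothing exists yet):

# Hypothetical usage of the proposed skip_not_exist_publication option.
# $node_subscriber and $publisher_connstr are assumed to be set up as in
# 031_column_list.pl; "pub_missing" intentionally does not exist yet on
# the publisher.
$node_subscriber->safe_psql(
	'postgres', qq(
	CREATE SUBSCRIPTION sub_skip CONNECTION '$publisher_connstr'
	    PUBLICATION pub2, pub_missing
	    WITH (skip_not_exist_publication = true);
));
# Per the proposal, the affected entries stay invalid and are validated
# once pub_missing is eventually created on the publisher.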

The attached patch has the changes for the same. Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com

Regards,
Vignesh
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 406a3c2dd1..404577725e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -74,6 +74,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->passwordrequired = subform->subpasswordrequired;
 	sub->runasowner = subform->subrunasowner;
 	sub->failover = subform->subfailover;
+	sub->skipnotexistpub = subform->subskipnotexistpub;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttrNotNull(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 6791bff9dd..bdd367d272 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1359,7 +1359,8 @@ REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
   subbinary, substream, subtwophasestate, subdisableonerr,
 			  subpasswordrequired, subrunasowner, subfailover,
-  subslotname, subsynccommit, subpublications, suborigin)
+  subskipnotexistpub, subslotname, subsynccommit, subpublications,
+  suborigin)
 ON pg_subscription TO public;
 
 CREATE VIEW pg_stat_subscription_stats AS
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index a05d69922d..bd59efc73a 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -72,6 +72,7 @@
 #define SUBOPT_FAILOVER0x2000
 #define SUBOPT_LSN	0x4000
 #define SUBOPT_ORIGIN0x8000
+#define SUBOPT_SKIP_NOT_EXISTS_PUB  0x0001
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -97,6 +98,7 @@ typedef struct SubOpts
 	bool		passwordrequired;
 	bool		runasowner;
 	bool		failover;
+	bool		skipnotexistpub;
 	char	   *origin;
 	XLogRecPtr	lsn;
 } SubOpts;
@@ -159,6 +161,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 		opts->runasowner = false;
 	if (IsSet(supported_opts, SUBOPT_FAILOVER))
 		opts->failover = false;
+	if (IsSet(supported_opts, SUBOPT_SKIP_NOT_EXISTS_PUB))
+		opts->skipnotexistpub = false;
 	if (IsSet(supported_opts, SUBOPT_ORIGIN))
 		opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY);
 
@@ -316,6 +320,15 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 			opts->specified_opts |= SUBOPT_FAILOVER;
 			opts->failover = defGetBoolean(defel);
 		}
+		else if (IsSet(supported_opts, SUBOPT_SKIP_NOT_EXISTS_PUB) &&
+ strcmp(defel->defname, "skip_not_exist_publication") == 0)
+		{
+			if (IsSet(opts->specified_opts, 

Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-07 Thread vignesh C
On Wed, 7 Feb 2024 at 15:26, Amit Kapila  wrote:
>
> On Wed, Feb 7, 2024 at 2:06 AM Tom Lane  wrote:
> >
> > I wrote:
> > > More to the point, aren't these proposals just band-aids that
> > > would stabilize the test without fixing the actual problem?
> > > The same thing is likely to happen to people in the field,
> > > unless we do something drastic like removing ALTER SUBSCRIPTION.
> >
> > I've been able to make the 031_column_list.pl failure pretty
> > reproducible by adding a delay in walsender, as attached.
> >
> > While I'm not too familiar with this code, it definitely does appear
> > that the new walsender is told to start up at an LSN before the
> > creation of the publication, and then if it needs to decide whether
> > to stream a particular data change before it's reached that creation,
> > kaboom!
> >
> > I read and understood the upthread worries about it not being
> > a great idea to ignore publication lookup failures, but I really
> > don't see that we have much choice.  As an example, if a subscriber
> > is humming along reading publication pub1, and then someone
> > drops and then recreates pub1 on the publisher, I don't think that
> > the subscriber will be able to advance through that gap if there
> > are any operations within it that require deciding if they should
> > be streamed.
> >
>
> Right. One idea to address those worries was to have a new
> subscription option like ignore_nonexistant_pubs (or some better name
> for such an option). A 'true' value of this new option means that we
> will ignore publication lookup failures and continue replication;
> 'false' means we give an error as we do now. If we agree that such an
> option is useful, or at least saves us in some cases as discussed in
> another thread [1], we can keep the default value as true so that
> users don't face such errors by default and also have a way to go
> back to the current behavior.
>
> >
> > (That is, contrary to Amit's expectation that
> > DROP/CREATE would mask the problem, I suspect it will instead turn
> > it into a hard failure.  I've not experimented though.)
> >
>
> This is not contrary, because I was suggesting DROP/CREATE of the
> Subscription, whereas you are talking about dropping and recreating
> the Publication.
>
> > BTW, this same change breaks two other subscription tests:
> > 015_stream.pl and 022_twophase_cascade.pl.
> > The symptoms are different (no "publication does not exist" errors),
> > so maybe these are just test problems not fundamental weaknesses.
> >
>
> As per the initial analysis, this is because those cases have somewhat
> larger transactions (more than 64kB) under test, so it just times out
> waiting for all the data to be replicated. We will do further analysis
> and share the findings.

Yes, these tests are failing while waiting for the larger transactions
to be replicated within 180 seconds, as the transactions need more time
to replicate because of the added sleep. To verify this I tried a
couple of things: a) I increased the timeout to a higher value and
verified that both tests run successfully with an 1800-second timeout;
b) I reduced the sleep to 1000 microseconds and verified that both
tests run successfully.

So I feel the failures of 015_stream.pl and 022_twophase_cascade.pl
caused by the added sleep can be ignored.

Regards,
Vignesh




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-07 Thread vignesh C
On Wed, 7 Feb 2024 at 15:21, Amit Kapila  wrote:
>
> On Tue, Feb 6, 2024 at 8:21 PM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > > Yeah, I was worried about that. The other idea I had previously
> > > > thought of was to change Alter Subscription to Drop+Create Subscription.
> > > That should also help in bringing stability without losing any
> > > functionality.
> >
> > Hm, why would that fix it?
> >
>
> Because for new subscriptions, we will start reading WAL from the
> latest WAL insert pointer on the publisher, which will be after the
> point where the publication is created.

I was able to reproduce the issue consistently with the changes shared
by Tom Lane at [1].
I have changed ALTER SUBSCRIPTION to DROP+CREATE SUBSCRIPTION and
verified that the test passed consistently across the >50 runs I did.
Also, the increase in test execution time in this case is negligible:
Without patch: 7.991 seconds
With test change patch:   8.121 seconds

The test changes for the same are attached.

[1] - https://www.postgresql.org/message-id/631312.1707251789%40sss.pgh.pa.us

Regards,
Vignesh
From 35e31ca3d843d81b43a01d537073174f30f5097c Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Wed, 7 Feb 2024 15:49:00 +0530
Subject: [PATCH v3] Fix the intermittent buildfarm failures in 031_column_list test.

Fix the intermittent buildfarm failures in 031_column_list test.
---
 src/test/subscription/t/031_column_list.pl | 43 ++
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/src/test/subscription/t/031_column_list.pl b/src/test/subscription/t/031_column_list.pl
index 938582e31a..9a97fa5020 100644
--- a/src/test/subscription/t/031_column_list.pl
+++ b/src/test/subscription/t/031_column_list.pl
@@ -370,7 +370,8 @@ $node_subscriber->safe_psql(
 
 $node_subscriber->safe_psql(
 	'postgres', qq(
-	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3
+	DROP SUBSCRIPTION sub1;
+	CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub2, pub3
 ));
 
 $node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');
@@ -411,7 +412,8 @@ $node_subscriber->safe_psql(
 
 $node_subscriber->safe_psql(
 	'postgres', qq(
-	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub4
+	DROP SUBSCRIPTION sub1;
+	CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub4
 ));
 
 $node_subscriber->wait_for_subscription_sync;
@@ -487,7 +489,8 @@ $node_subscriber->safe_psql(
 
 $node_subscriber->safe_psql(
 	'postgres', qq(
-	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub5
+	DROP SUBSCRIPTION sub1;
+	CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub5
 ));
 
 $node_subscriber->wait_for_subscription_sync;
@@ -601,10 +604,12 @@ $node_publisher->safe_psql(
 	ALTER PUBLICATION pub6 ADD TABLE test_part_a_2 (b);
 ));
 
-# add the publication to our subscription, wait for sync to complete
+# create the subscription for the above publication, wait for sync to
+# complete
 $node_subscriber->safe_psql(
 	'postgres', qq(
-	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub6
+	DROP SUBSCRIPTION sub1;
+	CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub6
 ));
 
 $node_subscriber->wait_for_subscription_sync;
@@ -667,10 +672,12 @@ $node_publisher->safe_psql(
 	CREATE PUBLICATION pub7 FOR TABLE test_part_b (a, b) WITH (publish_via_partition_root = true);
 ));
 
-# add the publication to our subscription, wait for sync to complete
+# create the subscription for the above publication, wait for sync to
+# complete
 $node_subscriber->safe_psql(
 	'postgres', qq(
-	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7
+	DROP SUBSCRIPTION sub1;
+	CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub7
 ));
 
 $node_subscriber->wait_for_subscription_sync;
@@ -737,7 +744,8 @@ $node_publisher->safe_psql(
 	ALTER PUBLICATION pub8 ADD TABLE test_part_c_2 (a,b);
 ));
 
-# add the publication to our subscription, wait for sync to complete
+# create the subscription for the above publication, wait for sync to
+# complete
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	DROP SUBSCRIPTION sub1;
@@ -835,10 +843,12 @@ $node_publisher->safe_psql(
 	CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
 ));
 
-# add the publication to our subscription, wait for sync to complete
+# create the subscription for the above publication, wait for sync to
+# complete
 $node_subscriber->safe_psql(
 	'postgres', qq(
-	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9
+	DROP SUBSCRIPTION sub1;
+	CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub9
 ));
 
 $node_subscriber->wait_for_subscription_sync;
@@ -880,8 +890,8 @@ $node_publisher->safe_psql(
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	CREATE TABLE test_mix_2 (a int PRIMARY KEY, b int, c int);
-	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub_mix_3, pub_mix_4;
-	ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION;
+	DROP SUBSCRIPTION sub1;
+	CREATE SUBSCRIPTION sub1 

Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-07 Thread Amit Kapila
On Wed, Feb 7, 2024 at 2:06 AM Tom Lane  wrote:
>
> I wrote:
> > More to the point, aren't these proposals just band-aids that
> > would stabilize the test without fixing the actual problem?
> > The same thing is likely to happen to people in the field,
> > unless we do something drastic like removing ALTER SUBSCRIPTION.
>
> I've been able to make the 031_column_list.pl failure pretty
> reproducible by adding a delay in walsender, as attached.
>
> While I'm not too familiar with this code, it definitely does appear
> that the new walsender is told to start up at an LSN before the
> creation of the publication, and then if it needs to decide whether
> to stream a particular data change before it's reached that creation,
> kaboom!
>
> I read and understood the upthread worries about it not being
> a great idea to ignore publication lookup failures, but I really
> don't see that we have much choice.  As an example, if a subscriber
> is humming along reading publication pub1, and then someone
> drops and then recreates pub1 on the publisher, I don't think that
> the subscriber will be able to advance through that gap if there
> are any operations within it that require deciding if they should
> be streamed.
>

Right. One idea to address those worries was to have a new
subscription option like ignore_nonexistant_pubs (or some better name
for such an option). A 'true' value of this new option means that we
will ignore publication lookup failures and continue replication;
'false' means we give an error as we do now. If we agree that such an
option is useful, or at least saves us in some cases as discussed in
another thread [1], we can keep the default value as true so that
users don't face such errors by default and also have a way to go
back to the current behavior.
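
For illustration, a minimal sketch of how this might look (the option
does not exist today; the name and the proposed default of true are
only this idea):

# Hypothetical only: a user who wants today's error-on-missing-publication
# behavior back would disable the proposed option explicitly.
$node_subscriber->safe_psql(
	'postgres', qq(
	ALTER SUBSCRIPTION sub1 SET (ignore_nonexistant_pubs = false);
));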

>
> (That is, contrary to Amit's expectation that
> DROP/CREATE would mask the problem, I suspect it will instead turn
> it into a hard failure.  I've not experimented though.)
>

This is not contrary, because I was suggesting DROP/CREATE of the
Subscription, whereas you are talking about dropping and recreating
the Publication.

> BTW, this same change breaks two other subscription tests:
> 015_stream.pl and 022_twophase_cascade.pl.
> The symptoms are different (no "publication does not exist" errors),
> so maybe these are just test problems not fundamental weaknesses.
>

As per the initial analysis, this is because those cases have somewhat
larger transactions (more than 64kB) under test, so it just times out
waiting for all the data to be replicated. We will do further analysis
and share the findings.

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-07 Thread Amit Kapila
On Tue, Feb 6, 2024 at 8:21 PM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > Yeah, I was worried about that. The other idea I have previously
> > thought was to change Alter Subscription to Drop+Create Subscription.
> > That should also help in bringing stability without losing any
> > functionality.
>
> Hm, why would that fix it?
>

Because for new subscriptions, we will start reading WAL from the
latest WAL insert pointer on the publisher, which will be after the
point where the publication is created.
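
For illustration, a minimal sketch of what that change could look like
in the style of 031_column_list.pl (the publication name is just an
example):

# Instead of switching the existing subscription with
#   ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7;
# drop and recreate it, so the new walsender starts decoding from the
# publisher's current WAL insert position, which is past the
# CREATE PUBLICATION record.
$node_subscriber->safe_psql(
	'postgres', qq(
	DROP SUBSCRIPTION sub1;
	CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub7;
));
$node_subscriber->wait_for_subscription_sync($node_publisher, 'sub1');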

> More to the point, aren't these proposals just band-aids that
> would stabilize the test without fixing the actual problem?

Yes, but OTOH, this behavior has existed since the beginning of logical
replication. This particular test has just exposed it, so keeping the BF
failing for this particular test doesn't sound like the best way to
remember it.

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-06 Thread vignesh C
On Tue, 6 Feb 2024 at 08:30, Alexander Lakhin  wrote:
>
> Hello Amit,
>
> 05.02.2024 15:20, Amit Kapila wrote:
> > If this can be reproduced frequently then we can even try to test the
> > patch in that thread by Osumi-San [1] (I haven't tested that it
> > applies cleanly but shouldn't be difficult to make it work) based on
> > the theory that walsender is starting up at an LSN somewhere before
> > where the publication is created.
> >
> > [1] - 
> > https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com
> >
>
> Yes, with the aforementioned modification of bgwriter.c and when running
> 20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
> updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
> ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
> with no failures.

I was able to reproduce the issue with the suggested bgwriter patch
changes, but for me it failed on the 287th run.
Then I ran the test 1000 times with the test changes shared at [1]
and it passed all 1000 times.

I have measured the test execution time (average of 10 runs) and found
that it takes about 1.2 seconds longer to execute with the changes:
Without patch: 8.454 seconds
With test change patch:   9.672 seconds

For the test execution comparison I used a machine with 755.536 GB of
total memory, 120 CPUs, and the RHEL 7 operating system.

[1] - 
https://www.postgresql.org/message-id/e6ce3cf7-4025-f129-e3ac-0f778469f720%40gmail.com

Regards,
Vignesh




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-06 Thread Tom Lane
I wrote:
> More to the point, aren't these proposals just band-aids that
> would stabilize the test without fixing the actual problem?
> The same thing is likely to happen to people in the field,
> unless we do something drastic like removing ALTER SUBSCRIPTION.

I've been able to make the 031_column_list.pl failure pretty
reproducible by adding a delay in walsender, as attached.

While I'm not too familiar with this code, it definitely does appear
that the new walsender is told to start up at an LSN before the
creation of the publication, and then if it needs to decide whether
to stream a particular data change before it's reached that creation,
kaboom!
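
For illustration, a rough diagnostic sketch (not a fix; the publication
and table names are arbitrary, and it only reports the two positions)
that could be dropped into the TAP test:

# Record the publisher's WAL insert position just before the publication
# is created, then compare it with the slot's confirmed_flush_lsn after
# the subscriber switches to that publication.  If the slot position is
# older than the CREATE PUBLICATION record, the walsender may have to
# decide whether to stream a change before it has decoded the
# publication's creation.
my $pub_create_lsn = $node_publisher->safe_psql('postgres',
	"SELECT pg_current_wal_insert_lsn()");
$node_publisher->safe_psql('postgres',
	"CREATE PUBLICATION pub_demo FOR TABLE tab5");
$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION sub1 SET PUBLICATION pub_demo");
my $slot_lsn = $node_publisher->safe_psql('postgres',
	"SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'sub1'");
note "pub_demo created after $pub_create_lsn; slot confirmed_flush_lsn is $slot_lsn";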

I read and understood the upthread worries about it not being
a great idea to ignore publication lookup failures, but I really
don't see that we have much choice.  As an example, if a subscriber
is humming along reading publication pub1, and then someone
drops and then recreates pub1 on the publisher, I don't think that
the subscriber will be able to advance through that gap if there
are any operations within it that require deciding if they should
be streamed.  (That is, contrary to Amit's expectation that
DROP/CREATE would mask the problem, I suspect it will instead turn
it into a hard failure.  I've not experimented though.)

BTW, this same change breaks two other subscription tests:
015_stream.pl and 022_twophase_cascade.pl.
The symptoms are different (no "publication does not exist" errors),
so maybe these are just test problems not fundamental weaknesses.
But "replication falls over if the walsender is slow" isn't
something I'd call acceptable.

regards, tom lane

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 77c8baa32a..c897ef4c60 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2699,6 +2699,8 @@ WalSndLoop(WalSndSendDataCallback send_data)
 			!pq_is_send_pending())
 			break;
 
+		pg_usleep(1);
+
 		/*
 		 * If we don't have any pending data in the output buffer, try to send
 		 * some more.  If there is some, we don't bother to call send_data


Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-06 Thread Tom Lane
Amit Kapila  writes:
> Yeah, I was worried about that. The other idea I had previously
> thought of was to change Alter Subscription to Drop+Create Subscription.
> That should also help in bringing stability without losing any
> functionality.

Hm, why would that fix it?

More to the point, aren't these proposals just band-aids that
would stabilize the test without fixing the actual problem?
The same thing is likely to happen to people in the field,
unless we do something drastic like removing ALTER SUBSCRIPTION.

regards, tom lane




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-06 Thread Amit Kapila
On Tue, Feb 6, 2024 at 2:30 PM Alexander Lakhin  wrote:
>
> 06.02.2024 09:48, Amit Kapila wrote:
> > cool, is it possible to see whether this patch changes the runtime of
> > this test in any noticeable way?
> >
>
> Yes, unfortunately it does.
> I've measured the duration of 100 test runs without the patch (with pristine
> bgwriter and with NO_TEMP_INSTALL):
> real    6m46,031s
> real    6m52,406s
> real    6m51,014s
>
> But with the patched test, I've got:
> real    9m39,872s
> real    9m40,044s
> real    9m38,236s
> (nearly 2 seconds increase per one test run)
>
> Under Valgrind, the original test run takes:
> Files=1, Tests=36, 334 wallclock secs ( 0.02 usr  0.00 sys + 163.14 cusr  
> 7.98 csys = 171.14 CPU)
>
> But the patched one:
> Files=1, Tests=36, 368 wallclock secs ( 0.02 usr  0.00 sys + 182.16 cusr  
> 8.90 csys = 191.08 CPU)
> (30 seconds increase)
>

Yeah, I was worried about that. The other idea I had previously
thought of was to change Alter Subscription to Drop+Create Subscription.
That should also help in bringing stability without losing any
functionality.

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-06 Thread Alexander Lakhin

06.02.2024 09:48, Amit Kapila wrote:

cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?



Yes, unfortunately it does.
I've measured the duration of 100 test runs without the patch (with pristine
bgwriter and with NO_TEMP_INSTALL):
real    6m46,031s
real    6m52,406s
real    6m51,014s

But with the patched test, I've got:
real    9m39,872s
real    9m40,044s
real    9m38,236s
(nearly 2 seconds increase per one test run)

Under Valgrind, the original test run takes:
Files=1, Tests=36, 334 wallclock secs ( 0.02 usr  0.00 sys + 163.14 cusr  7.98 
csys = 171.14 CPU)

But the patched one:
Files=1, Tests=36, 368 wallclock secs ( 0.02 usr  0.00 sys + 182.16 cusr  8.90 
csys = 191.08 CPU)
(30 seconds increase)

Maybe the more CPU-efficient solution would be disabling bgwriter, as was
proposed in another discussion of test instability:
https://www.postgresql.org/message-id/ZaTxhjnPygOdosJ4%40ip-10-97-1-34.eu-west-3.compute.internal

Though I think that devising a way to control bgwriter may take more time
than we can afford given the current 031 failure rate on the buildfarm
(17 failures for the last 3 days).

Best regards,
Alexander




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-05 Thread Amit Kapila
On Tue, Feb 6, 2024 at 8:30 AM Alexander Lakhin  wrote:
>
> 05.02.2024 15:20, Amit Kapila wrote:
> > If this can be reproduced frequently then we can even try to test the
> > patch in that thread by Osumi-San [1] (I haven't tested that it
> > applies cleanly but shouldn't be difficult to make it work) based on
> > the theory that walsender is starting up at an LSN somewhere before
> > where the publication is created.
> >
> > [1] - 
> > https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com
> >
>
> Yes, with the aforementioned modification of bgwriter.c and when running
> 20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
> updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
> ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
> with no failures.
>

cool, is it possible to see whether this patch changes the runtime of
this test in any noticeable way?

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-05 Thread Alexander Lakhin

Hello Amit,

05.02.2024 15:20, Amit Kapila wrote:

If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com



Yes, with the aforementioned modification of bgwriter.c and when running
20 tests in parallel, I got failures on iterations 20, 3, 21, ..., but with the
updated Osumi-San's patch (which adds wait_for_catchup('sub1') before every
ALTER SUBSCRIPTION sub1 SET PUBLICATION ...) applied, 300 iterations ran
with no failures.

Best regards,
Alexander

diff --git a/src/test/subscription/t/031_column_list.pl b/src/test/subscription/t/031_column_list.pl
index 938582e31a..b199c8a1bf 100644
--- a/src/test/subscription/t/031_column_list.pl
+++ b/src/test/subscription/t/031_column_list.pl
@@ -368,6 +368,8 @@ $node_subscriber->safe_psql(
 	CREATE TABLE tab5 (a int PRIMARY KEY, b int, d int);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3
@@ -409,6 +411,8 @@ $node_subscriber->safe_psql(
 	CREATE TABLE tab6 (a int PRIMARY KEY, b int, c int, d int);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub4
@@ -485,6 +489,8 @@ $node_subscriber->safe_psql(
 	CREATE TABLE tab7 (a int PRIMARY KEY, b int, c int, d int);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub5
@@ -601,6 +607,8 @@ $node_publisher->safe_psql(
 	ALTER PUBLICATION pub6 ADD TABLE test_part_a_2 (b);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 # add the publication to our subscription, wait for sync to complete
 $node_subscriber->safe_psql(
 	'postgres', qq(
@@ -667,6 +675,8 @@ $node_publisher->safe_psql(
 	CREATE PUBLICATION pub7 FOR TABLE test_part_b (a, b) WITH (publish_via_partition_root = true);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 # add the publication to our subscription, wait for sync to complete
 $node_subscriber->safe_psql(
 	'postgres', qq(
@@ -835,6 +845,8 @@ $node_publisher->safe_psql(
 	CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 # add the publication to our subscription, wait for sync to complete
 $node_subscriber->safe_psql(
 	'postgres', qq(
@@ -877,6 +889,8 @@ $node_publisher->safe_psql(
 	INSERT INTO test_mix_2 VALUES (1, 2, 3);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	CREATE TABLE test_mix_2 (a int PRIMARY KEY, b int, c int);
@@ -1016,6 +1030,8 @@ $node_publisher->safe_psql(
 	INSERT INTO s1.t VALUES (1, 2, 3);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	CREATE SCHEMA s1;
@@ -1039,6 +1055,8 @@ is( $node_subscriber->safe_psql('postgres', "SELECT * FROM s1.t ORDER BY a"),
 4|5|6),
 	'two publications, publishing the same relation');
 
+$node_publisher->wait_for_catchup('sub1');
+
 # Now resync the subscription, but with publications in the opposite order.
 # The result should be the same.
 
@@ -1082,6 +1100,8 @@ $node_publisher->safe_psql(
 	INSERT INTO t VALUES (1, 2, 3);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	DROP SCHEMA s1 CASCADE;
@@ -1130,6 +1150,8 @@ $node_publisher->safe_psql(
 	INSERT INTO t VALUES (1, 2, 3);
 ));
 
+$node_publisher->wait_for_catchup('sub1');
+
 $node_subscriber->safe_psql(
 	'postgres', qq(
 	DROP TABLE t;


Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-05 Thread Amit Kapila
On Sat, Feb 3, 2024 at 12:30 PM Alexander Lakhin  wrote:
>
> 03.02.2024 04:24, Tom Lane wrote:
> > I'm still wondering how come the failure seems to have suddenly gotten
> > way more common.  The only changes that are in vaguely-related places
> > and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
> > don't see a connection.
>
> I think the failure rate increased because tamandua, calliphoridae,
> flaviventris, and kestrel were switched from make to meson recently.
>

If this can be reproduced frequently then we can even try to test the
patch in that thread by Osumi-San [1] (I haven't tested that it
applies cleanly but shouldn't be difficult to make it work) based on
the theory that walsender is starting up at an LSN somewhere before
where the publication is created.

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Alexander Lakhin

Hello Tom and Noah,

03.02.2024 04:24, Tom Lane wrote:

I'm still wondering how come the failure seems to have suddenly gotten
way more common.  The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.


I think the failure rate increased because tamandua, calliphoridae,
flaviventris, and kestrel were switched from make to meson recently.
The last `make` builds for them:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-01-31%2016%3A51%3A31
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-01-31%2016%3A51%3A38
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-01-31%2016%3A52%3A37
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel=2024-01-31%2016%3A51%3A53

and since that switch the 031_column_list duration increased significantly,
e.g., on the last tamandua `make` run it executed for 7 seconds, but
successful `meson` runs give much longer duration:
280/283 postgresql:subscription / subscription/031_column_list   OK  38.27s   36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list   OK 126.13s   36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list   OK  31.93s   36 subtests passed
279/283 postgresql:subscription / subscription/031_column_list   OK  99.76s   36 subtests passed

So, looking at the tamandua's failure log, I see:
2024-02-02 19:41:19.750 UTC [1579219][postmaster][:0] LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by 
gcc-12.3.0, 64-bit

...
2024-02-02 19:42:19.973 UTC [1629333][client backend][4/213:0] LOG: statement: 
ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7
2024-02-02 19:42:20.131 UTC [1625765][logical replication apply worker][3/122:0] LOG:  logical replication worker for 
subscription "sub1" will restart because of a parameter change
2024-02-02 19:42:20.137 UTC [1629333][client backend][:0] LOG: disconnection: session time: 0:00:00.212 user=bf 
database=postgres host=[local]
2024-02-02 19:42:20.191 UTC [1629535][logical replication apply worker][3/124:0] LOG:  logical replication apply worker 
for subscription "sub1" has started

...
2024-02-02 19:42:20.445 UTC [1629535][logical replication apply worker][3/0:0] ERROR:  could not receive data from WAL 
stream: ERROR:  publication "pub7" does not exist

    CONTEXT:  slot "sub1", output plugin "pgoutput", in the change callback, 
associated LSN 0/15C7698

(The interval between subscriber start and the error is ~ 4 * 15 seconds.)

Thus it still may be explained by bgwriter activity, though perhaps
autovacuum/checkpointer can add something as well.

Best regards,
Alexander




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Tom Lane
Noah Misch  writes:
> On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
>> If you look at the buildfarm's failures page and filter down to
>> just subscriptionCheck failures, what you find is that all of the
>> last 6 such failures are in 031_column_list.pl:
>> ...
>> I don't see anything that 031_column_list.pl is doing that is much
>> different from other subscription tests, so why is it the only one
>> failing?  And more to the point, what's going wrong exactly?

> I don't know, but
> https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> reported a replacement BgWriterDelay value reproducing it.  That hasn't
> reproduced it in ~10 runs on my machine, though.

Ah, thanks for that link.  I like the theory proposed in that thread
that the walsender is starting up at an LSN somewhere before where
the publication is created.  I'm tempted to add some more queries to
the test script to see if that can be proven.

I'm still wondering how come the failure seems to have suddenly gotten
way more common.  The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.

regards, tom lane




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > If you look at the buildfarm's failures page and filter down to
> > just subscriptionCheck failures, what you find is that all of the
> > last 6 such failures are in 031_column_list.pl:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-02-02%2019%3A33%3A16
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-02-02%2011%3A21%3A44
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2020%3A34%3A29
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2016%3A57%3A14
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel=2024-01-31%2022%3A18%3A24
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-01-30%2011%3A29%3A23
> > 
> > There are some further back too:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2023-11-17%2018%3A28%3A24
> > 
> > but this definitely got way more common in the last few days.
> 
> > I don't see anything that 031_column_list.pl is doing that is much
> > different from other subscription tests, so why is it the only one
> > failing?  And more to the point, what's going wrong exactly?
> 
> I don't know, but
> https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> reported a replacement BgWriterDelay value reproducing it.

Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.

> That hasn't reproduced it in ~10 runs on my machine, though.

After 207 successes, it did fail once for me.

> > I am suspicious that this somehow represents a failure of the
> > historical catalog decoding logic, but I don't see how that theory
> > explains this only breaking in one test script.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> If you look at the buildfarm's failures page and filter down to
> just subscriptionCheck failures, what you find is that all of the
> last 6 such failures are in 031_column_list.pl:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-02-02%2019%3A33%3A16
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-02-02%2011%3A21%3A44
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2020%3A34%3A29
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2016%3A57%3A14
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel=2024-01-31%2022%3A18%3A24
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-01-30%2011%3A29%3A23
> 
> There are some further back too:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2023-11-17%2018%3A28%3A24
> 
> but this definitely got way more common in the last few days.

> I don't see anything that 031_column_list.pl is doing that is much
> different from other subscription tests, so why is it the only one
> failing?  And more to the point, what's going wrong exactly?

I don't know, but
https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
reported a replacement BgWriterDelay value reproducing it.  That hasn't
reproduced it in ~10 runs on my machine, though.

> I am suspicious that this somehow represents a failure of the
> historical catalog decoding logic, but I don't see how that theory
> explains this only breaking in one test script.




Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Tom Lane
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-01-30%2011%3A29%3A23

There are some further back too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2023-11-17%2018%3A28%3A24

but this definitely got way more common in the last few days.

Digging down into the logs, these all look pretty similar.  Somehow
things get into a state where replication connections fail with
the publisher reporting "publication does not exist":

2024-02-02 19:42:23.187 UTC [1631708][not initialized][:0] LOG:  connection 
received: host=[local]
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG:  connection 
authenticated: user="bf" method=trust 
(/home/bf/bf-build/tamandua/HEAD/pgsql.build/testrun/subscription/031_column_list/data/t_031_column_list_publisher_data/pgdata/pg_hba.conf:117)
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG:  replication 
connection authorized: user=bf application_name=sub1
2024-02-02 19:42:23.214 UTC [1631708][walsender][4/288:0] LOG:  statement: 
SELECT pg_catalog.set_config('search_path', '', false);
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  received 
replication command: IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  
IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  received 
replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 
(proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  acquired logical 
replication slot "sub1"
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] LOG:  starting logical 
decoding for slot "sub1"
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] DETAIL:  Streaming 
transactions committing after 0/15BCDD0, reading WAL from 0/15BCDD0.
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] LOG:  logical decoding 
found consistent point at 0/15BCDD0
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] DETAIL:  There are no 
running transactions.
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] ERROR:  publication 
"pub7" does not exist
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] CONTEXT:  slot "sub1", 
output plugin "pgoutput", in the change callback, associated LSN 0/15C7698
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] LOG:  released logical 
replication slot "sub1"
2024-02-02 19:42:23.834 UTC [1631708][walsender][:0] LOG:  disconnection: 
session time: 0:00:00.647 user=bf database=postgres host=[local]

and then we just repeat that until the test times out.  It fails at
different points in the test script (hence, different publication
names), but the pattern looks about the same.

I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing?  And more to the point, what's going wrong exactly?

I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.

regards, tom lane