Re: 035_standby_logical_decoding unbounded hang

2024-03-07 Thread Noah Misch
On Thu, Mar 07, 2024 at 02:46:55PM +0500, Andrey M. Borodin wrote:
> I’m not sure whether it is connected, but many patches in the CFbot keep
> hanging in this test. For example [0].

> [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292 

Relevant part:

[22:03:10.292] stderr:
[22:03:10.292] # poll_query_until timed out executing this query:
[22:03:10.292] # SELECT (SELECT catalog_xmin::text::int - 770 from pg_catalog.pg_replication_slots where slot_name = 'injection_activeslot') > 0
[22:03:10.292] # expecting this output:
[22:03:10.292] # t
[22:03:10.292] # last actual query output:
[22:03:10.292] # f
[22:03:10.292] # with stderr:
[22:03:10.292] # Tests were run but no plan was declared and done_testing() was not seen.
[22:03:10.292] # Looks like your test exited with 255 just after 57.

The injection_activeslot test was added after $SUBJECT, in 08a52ab151
(2024-03-06).  It has since been reverted in 65db0cfb4c (2024-03-07).
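
For readers less familiar with the TAP helpers: the timed-out wait corresponds
to a poll_query_until() call roughly like the sketch below.  This is a
reconstruction from the log output above, not the committed test; the
$node_standby variable name and the die message are assumptions.

	$node_standby->poll_query_until(
		'postgres', qq[
		SELECT (SELECT catalog_xmin::text::int - 770
		        FROM pg_catalog.pg_replication_slots
		        WHERE slot_name = 'injection_activeslot') > 0
	]) or die "timed out waiting for catalog_xmin to advance";

poll_query_until() retries the query until it returns 't' or the default
timeout expires, so the failure above means catalog_xmin never advanced within
the allotted time.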




Re: 035_standby_logical_decoding unbounded hang

2024-03-07 Thread Andrey M. Borodin



> On 20 Feb 2024, at 04:09, Noah Misch  wrote:
> 

I’m not sure whether it is connected, but many patches in the CFbot keep hanging
in this test. For example [0].

I haven’t done a root-cause analysis yet, but the hangs may be related to this
thread. Could someone more familiar with these issues take a look?

Thanks!


Best regards, Andrey Borodin.

[0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292 



Re: 035_standby_logical_decoding unbounded hang

2024-02-19 Thread Noah Misch
On Fri, Feb 16, 2024 at 06:37:38AM +, Bertrand Drouvot wrote:
> On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote:
> > On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote:
> > What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> > and then make use of it in create_logical_slot_on_standby() and above? (something
> > like wait_for_restart_lsn_calculation-v1.patch attached).
> > 
> > Waiting for restart_lsn is just a prerequisite for calling
> > pg_log_standby_snapshot(), so I wouldn't separate those two.
> 
> Yeah, makes sense.
> 
> > If we're
> > extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
> > and make the API like one of these:
> > 
> >   $standby->wait_for_subscription_starting_point($primary, $slot_name);
> >   $primary->log_standby_snapshot($standby, $slot_name);
> > 
> > Would you like to finish the patch in such a way?
> 
> Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used
> log_standby_snapshot() as it seems more generic to me.

Thanks.  Pushed at commit 0e16281.




Re: 035_standby_logical_decoding unbounded hang

2024-02-15 Thread Bertrand Drouvot
Hi,

On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote:
> On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote:
> > What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> > and then make use of it in create_logical_slot_on_standby() and above? (something
> > like wait_for_restart_lsn_calculation-v1.patch attached).
> 
> Waiting for restart_lsn is just a prerequisite for calling
> pg_log_standby_snapshot(), so I wouldn't separate those two.

Yeah, makes sense.

> If we're
> extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
> and make the API like one of these:
> 
>   $standby->wait_for_subscription_starting_point($primary, $slot_name);
>   $primary->log_standby_snapshot($standby, $slot_name);
> 
> Would you like to finish the patch in such a way?

Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used
log_standby_snapshot() as it seems more generic to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 07da74cf56..5d3174d0b5 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3181,6 +3181,37 @@ $SIG{TERM} = $SIG{INT} = sub {
 
 =pod
 
+=item $node->log_standby_snapshot(self, standby, slot_name)
+
+Log a standby snapshot on primary once the slot restart_lsn is determined on
+the standby.
+
+=cut
+
+sub log_standby_snapshot
+{
+	my ($self, $standby, $slot_name) = @_;
+	my ($stdout, $stderr);
+
+	# Once the slot's restart_lsn is determined, the standby looks for
+	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
+	# until the slot restart_lsn is determined.
+
+	$standby->poll_query_until(
+		'postgres', qq[
+		SELECT restart_lsn IS NOT NULL
+		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
+	])
+	  or die
+	  "timed out waiting for logical slot to calculate its restart_lsn";
+
+	# Then arrange for the xl_running_xacts record for which the standby is
+	# waiting.
+	$self->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');
+}
+
+=pod
+
 =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname)
 
 Create logical replication slot on given standby
@@ -3206,21 +3237,9 @@ sub create_logical_slot_on_standby
 		'2>',
 		\$stderr);
 
-	# Once the slot's restart_lsn is determined, the standby looks for
-	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
-	# until the slot restart_lsn is determined.
-
-	$self->poll_query_until(
-		'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
-	])
-	  or die
-	  "timed out waiting for logical slot to calculate its restart_lsn";
-
-	# Then arrange for the xl_running_xacts record for which pg_recvlogical is
+	# Arrange for the xl_running_xacts record for which pg_recvlogical is
 	# waiting.
-	$primary->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');
+	$primary->log_standby_snapshot($self, $slot_name);
 
 	$handle->finish();
 
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b020361b29..0710bccc17 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -465,8 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n";
 
 $psql_subscriber{run}->pump_nb();
 
-# Speed up the subscription creation
-$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+# Log the standby snapshot to speed up the subscription creation
+$node_primary->log_standby_snapshot($node_standby, 'tap_sub');
 
 # Explicitly shut down psql instance gracefully - to avoid hangs
 # or worse on windows


Re: 035_standby_logical_decoding unbounded hang

2024-02-15 Thread Noah Misch
On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote:
> On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> > The 035_standby_logical_decoding.pl hang is
> > a race condition arising from an event sequence like this:
> > 
> > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> > - Test script calls pg_log_standby_snapshot() on primary.  Emits XLOG_RUNNING_XACTS.
> > - checkpoint_timeout makes a primary checkpoint finish.  Emits XLOG_RUNNING_XACTS.
> > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
> > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> > 
> > Other test code already has a solution for this, so the attached patches add a
> > timeout and copy the existing solution.  I'm also attaching the hack that
> > makes it 100% reproducible.

> I did a few tests and confirm that the proposed solution fixes the corner case.

Thanks for reviewing.

> What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> and then make use of it in create_logical_slot_on_standby() and above? (something
> like wait_for_restart_lsn_calculation-v1.patch attached).

Waiting for restart_lsn is just a prerequisite for calling
pg_log_standby_snapshot(), so I wouldn't separate those two.  If we're
extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
and make the API like one of these:

  $standby->wait_for_subscription_starting_point($primary, $slot_name);
  $primary->log_standby_snapshot($standby, $slot_name);

Would you like to finish the patch in such a way?
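
For illustration, the second shape would make the test script's call site look
roughly like the sketch below.  This is a sketch only, using the names proposed
above and the 'tap_sub' slot from the test; the helper does not exist yet at
this point in the thread.

	# Wait on the standby until the slot's restart_lsn is determined, then
	# emit on the primary the xl_running_xacts record the standby waits for.
	$node_primary->log_standby_snapshot($node_standby, 'tap_sub');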




Re: 035_standby_logical_decoding unbounded hang

2024-02-14 Thread Bertrand Drouvot
Hi,

On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> Coincidentally, one of my buildfarm animals hung for several weeks in a
> different test, 035_standby_logical_decoding.pl.  A LOG_SNAPSHOT_INTERVAL_MS
> reduction was part of making it reproducible:
> 
> On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > > If you look at the buildfarm's failures page and filter down to
> > > > just subscriptionCheck failures, what you find is that all of the
> > > > last 6 such failures are in 031_column_list.pl:
> 
> > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > > reported a replacement BgWriterDelay value reproducing it.
> > 
> > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> > BgWriterDelay.
> 
> I'm reusing this thread just in case there's overlap with the
> 031_column_list.pl cause and fix.  The 035_standby_logical_decoding.pl hang is
> a race condition arising from an event sequence like this:
> 
> - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> - Test script calls pg_log_standby_snapshot() on primary.  Emits XLOG_RUNNING_XACTS.
> - checkpoint_timeout makes a primary checkpoint finish.  Emits XLOG_RUNNING_XACTS.
> - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
> - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> 
> Other test code already has a solution for this, so the attached patches add a
> timeout and copy the existing solution.  I'm also attaching the hack that
> makes it 100% reproducible.

Thanks!

I did a few tests and confirm that the proposed solution fixes the corner case.

standby-slot-test-1-timeout-v1.patch LGTM.

Regarding standby-slot-test-2-race-v1.patch:

> +# See corresponding create_logical_slot_on_standby() code.
> +$node_standby->poll_query_until(
> + 'postgres', qq[
> + SELECT restart_lsn IS NOT NULL
> + FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
> + ]) or die "timed out waiting for logical slot to calculate its restart_lsn";
> +

What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
and then make use of it in create_logical_slot_on_standby() and above? (something
like wait_for_restart_lsn_calculation-v1.patch attached).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
commit 14d988c748f3e500c44d65e073c276e6d8af6156
Author: Bertrand Drouvot 
Date:   Wed Feb 14 15:21:35 2024 +

wait_for_restart_lsn_calculation

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index e2e70d0dbf..21cf179db1 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3174,6 +3174,28 @@ $SIG{TERM} = $SIG{INT} = sub {
 
 =pod
 
+=item $node->wait_for_restart_lsn_calculation(self, slot_name)
+
+Wait until the given replication slot's restart_lsn has been determined on this node
+
+=cut
+
+sub wait_for_restart_lsn_calculation
+{
+	my ($self, $slot_name) = @_;
+	my ($stdout, $stderr);
+
+	$self->poll_query_until(
+		'postgres', qq[
+		SELECT restart_lsn IS NOT NULL
+		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
+	])
+	  or die
+	  "timed out waiting for logical slot to calculate its restart_lsn";
+}
+
+=pod
+
 =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname)
 
 Create logical replication slot on given standby
@@ -3203,13 +3225,7 @@ sub create_logical_slot_on_standby
 	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
 	# until the slot restart_lsn is determined.
 
-	$self->poll_query_until(
-		'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
-	])
-	  or die
-	  "timed out waiting for logical slot to calculate its restart_lsn";
+	$self->wait_for_restart_lsn_calculation($slot_name);
 
 	# Then arrange for the xl_running_xacts record for which pg_recvlogical is
 	# waiting.
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index f7b7fc7f9e..85330720c5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -465,12 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n";
 
 $psql_subscriber{run}->pump_nb();
 
-# See corresponding create_logical_slot_on_standby() code.
-$node_standby->poll_query_until(
-	'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
-	]) or die "timed out waiting for logical slot to calculate its restart_lsn";
+# Wait for restart_lsn calculation
+$node_standby->wait_for_restart_lsn_calculation('tap_sub');
 

035_standby_logical_decoding unbounded hang

2024-02-10 Thread Noah Misch
Coincidentally, one of my buildfarm animals hung for several weeks in a
different test, 035_standby_logical_decoding.pl.  A LOG_SNAPSHOT_INTERVAL_MS
reduction was part of making it reproducible:

On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > If you look at the buildfarm's failures page and filter down to
> > > just subscriptionCheck failures, what you find is that all of the
> > > last 6 such failures are in 031_column_list.pl:

> > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > reported a replacement BgWriterDelay value reproducing it.
> 
> Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> BgWriterDelay.

I'm reusing this thread just in case there's overlap with the
031_column_list.pl cause and fix.  The 035_standby_logical_decoding.pl hang is
a race condition arising from an event sequence like this:

- Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
- Test script calls pg_log_standby_snapshot() on primary.  Emits XLOG_RUNNING_XACTS.
- checkpoint_timeout makes a primary checkpoint finish.  Emits XLOG_RUNNING_XACTS.
- bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
- CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.

Other test code already has a solution for this, so the attached patches add a
timeout and copy the existing solution.  I'm also attaching the hack that
makes it 100% reproducible.
Author: Noah Misch 
Commit: Noah Misch 

[not for commit] Demonstrate 035_standby_logical_decoding.pl hang.

There are three ways XLOG_RUNNING_XACTS can show up:
- SELECT pg_log_standby_snapshot()
- CHECKPOINT
- bgwriter LOG_SNAPSHOT_INTERVAL_MS

An idle primary won't do the latter two.  If you're unlucky enough for
all three to happen before CREATE_REPLICATION_SLOT's
DecodingContextFindStartpoint() call starts watching, then
CREATE_REPLICATION_SLOT will hang indefinitely.

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index f11ce27..07f987c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -70,7 +70,7 @@ int   BgWriterDelay = 200;
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1
 
 /*
  * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9270d7b..0f170a4 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -1085,6 +1085,7 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
appendStringInfoString(&cmd, " PHYSICAL RESERVE_WAL");
}
 
+   pg_usleep(1000 * 1000);
res = libpqrcv_PQexec(conn->streamConn, cmd.data);
pfree(cmd.data);
 
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index cebfa52..c371369 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -467,6 +467,7 @@ $psql_subscriber{run}->pump_nb();
 
 # Speed up the subscription creation
 $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_primary->safe_psql('postgres', "CHECKPOINT");
 
 # Explicitly shut down psql instance gracefully - to avoid hangs
 # or worse on windows
Author: Noah Misch 
Commit: Noah Misch 

Bound waits in 035_standby_logical_decoding.pl.

One IPC::Run::start() used an IPC::Run::timer() without checking for
expiration.  The other used no timeout or timer.  Back-patch to v16,
which introduced the test.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index c371369..61da915 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -21,7 +21,6 @@ my $node_cascading_standby =
   PostgreSQL::Test::Cluster->new('cascading_standby');
 my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
-my $psql_timeout = IPC::Run::timer($default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -90,7 +89,8 @@ sub make_slot_active
'>',
$to_stdout,
'2>',
-   $to_stderr);
+   $to_stderr,
+
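
For reference, an IPC::Run::start() call bounded by a timeout looks roughly
like the sketch below (an illustration with assumed variable names and
options, not the committed patch).  IPC::Run::timeout() throws an exception
when it expires, unlike IPC::Run::timer(), which merely expires and has to be
checked, so a hung child aborts the test instead of waiting forever:

	# Sketch only; command-line options and variable names are assumptions.
	my $handle = IPC::Run::start(
		[ 'pg_recvlogical', '-d', $connstr, '-S', $slot_name, '--start', '-f', '-' ],
		'>', \$to_stdout,
		'2>', \$to_stderr,
		IPC::Run::timeout($default_timeout));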