Re: subscriptioncheck failure

2021-05-18 Thread vignesh C
On Wed, May 19, 2021 at 10:28 AM Amit Kapila  wrote:
>
> On Tue, May 18, 2021 at 11:25 AM vignesh C  wrote:
> >
> > Thanks for the comments, the attached patch has the changes for the same.
> >
>
> Thanks, I have pushed your patch after making minor changes in the comments.

Thanks for pushing this patch.

Regards,
Vignesh




Re: subscriptioncheck failure

2021-05-18 Thread Amit Kapila
On Tue, May 18, 2021 at 11:25 AM vignesh C  wrote:
>
> Thanks for the comments, the attached patch has the changes for the same.
>

Thanks, I have pushed your patch after making minor changes in the comments.

-- 
With Regards,
Amit Kapila.




Re: subscriptioncheck failure

2021-05-17 Thread vignesh C
On Tue, May 18, 2021 at 9:09 AM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 5:48 PM vignesh C  wrote:
> >
> > On Mon, May 17, 2021 at 10:40 AM Amit Kapila  
> > wrote:
> > >
> > > One more point:
> > > + $node_publisher->wait_for_catchup('tap_sub');
> > > +
> > > + # Ensure a transactional logical decoding message shows up on the slot
> > > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> > > DISABLE");
> > > +
> > > + # wait for the replication connection to drop from the publisher
> > > + $node_publisher->poll_query_until('postgres',
> > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
> > > slot_name = 'tap_sub' AND active='f'", 1);
> > >
> > > In the above sequence, wait_for_catchup will query pg_stat_replication
> > > whereas after disabling subscription we are checking
> > > pg_replication_slots. I understand from your explanation why we can't
> > > rely on pg_stat_replication after DISABLE but it might be better to
> > > check that the slot is active before disabling it. I think currently,
> > > the test assumes that, isn't it better to have an explicit check for
> > > that?
> >
> > I felt this is not required, wait_for_catchup will poll_query_until
> > the state = 'streaming', even if START_REPLICATION takes time, state
> > will be in 'startup' state, this way poll_query_until will take care
> > of handling this.
> >
>
> makes sense, but let's add some comments to clarify the same.
>

Modified.

> > On further analysis I found that we need to do "Alter subscription
> > tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
> > time, Instead we can change pg_logical_slot_peek_binary_changes to
> > pg_logical_slot_get_binary_changes at appropriate steps. This way the
> > common function can be removed and the enable/disable multiple times
> > can be removed.
> >
>
> I think that is a valid point. This was probably kept so that we can
> peek multiple times for the same message to test various things but
> that can be achieved with the way you have changed the test.
>
> One more thing, shouldn't we make auto_vacuum=off for this test by
> using 'append_conf' before starting the publisher. That will avoid the
> risk of empty transactions.

I felt that makes sense, added it.

Thanks for the comments, the attached patch has the changes for the same.

Regards,
Vignesh
From 7617be76996efd17f46cee16f9a9a18a23183f19 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Mon, 17 May 2021 17:23:35 +0530
Subject: [PATCH v4] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 32 -
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f45f..aaafc9f11d 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -11,6 +11,8 @@ use Test::More tests => 5;
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	'autovacuum = off');
 $node_publisher->start;
 
 # Create subscriber node
@@ -35,12 +37,17 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
 );
 
+# Here wait_for_catchup is used intentionally instead of pg_replication_slots
+# as wait_for_catchup waits on poll_query_until until the state becomes
+# 'streaming', which will make sure that the replication slot is active.
+$node_publisher->wait_for_catchup('tap_sub');
+
 # Ensure a transactional logical decoding message shows up on the slot
 $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -77,7 +84,7 @@ is($result, qq(1|pgoutput),
 $result = $node_publisher->safe_psql(
 	'postgres', qq(
 		SELECT get_byte(data, 0)
-		FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
 			'proto_version', '1',
 			'publication_names', 'tap_pub')
 )

Re: subscriptioncheck failure

2021-05-17 Thread Amit Kapila
On Mon, May 17, 2021 at 5:48 PM vignesh C  wrote:
>
> On Mon, May 17, 2021 at 10:40 AM Amit Kapila  wrote:
> >
> > One more point:
> > + $node_publisher->wait_for_catchup('tap_sub');
> > +
> > + # Ensure a transactional logical decoding message shows up on the slot
> > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> > DISABLE");
> > +
> > + # wait for the replication connection to drop from the publisher
> > + $node_publisher->poll_query_until('postgres',
> > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
> > slot_name = 'tap_sub' AND active='f'", 1);
> >
> > In the above sequence, wait_for_catchup will query pg_stat_replication
> > whereas after disabling subscription we are checking
> > pg_replication_slots. I understand from your explanation why we can't
> > rely on pg_stat_replication after DISABLE but it might be better to
> > check that the slot is active before disabling it. I think currently,
> > the test assumes that, isn't it better to have an explicit check for
> > that?
>
> I felt this is not required, wait_for_catchup will poll_query_until
> the state = 'streaming', even if START_REPLICATION takes time, state
> will be in 'startup' state, this way poll_query_until will take care
> of handling this.
>

makes sense, but let's add some comments to clarify the same.

> On further analysis I found that we need to do "Alter subscription
> tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
> time, Instead we can change pg_logical_slot_peek_binary_changes to
> pg_logical_slot_get_binary_changes at appropriate steps. This way the
> common function can be removed and the enable/disable multiple times
> can be removed.
>

I think that is a valid point. This was probably kept so that we can
peek multiple times for the same message to test various things but
that can be achieved with the way you have changed the test.

One more thing, shouldn't we make auto_vacuum=off for this test by
using 'append_conf' before starting the publisher. That will avoid the
risk of empty transactions.

-- 
With Regards,
Amit Kapila.




Re: subscriptioncheck failure

2021-05-17 Thread vignesh C
On Mon, May 17, 2021 at 10:40 AM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 9:06 AM Amit Kapila  wrote:
> >
> > On Thu, May 13, 2021 at 7:06 PM vignesh C  wrote:
> > >
> > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier  
> > > wrote:
> >
> > Few comments:
> > 1.
> > + # Ensure a transactional logical decoding message shows up on the slot
> > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> > DISABLE");
> >
> > After you have encapsulated this command in the function, the above
> > comment doesn't make sense because we do this for both transactional
> > and non-transactional messages. I suggest we can change it to
> > something like: "This is done to ensure a logical decoding message is
> > shown up on the slot".
> >
> > 2.
> > +# Setup the subscription before checking 
> > pg_logical_slot_peek_binary_changes
> > +sub setup_subscription
> >
> > I think here the functionality is more for the catchup of
> > subscription, so it might be better to name the function as
> > subscription_catchup or catchup_subscription. I think you can expand
> > the comments atop this function a bit as suggested by Michael.
> >
>
> One more point:
> + $node_publisher->wait_for_catchup('tap_sub');
> +
> + # Ensure a transactional logical decoding message shows up on the slot
> + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> DISABLE");
> +
> + # wait for the replication connection to drop from the publisher
> + $node_publisher->poll_query_until('postgres',
> + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
> slot_name = 'tap_sub' AND active='f'", 1);
>
> In the above sequence, wait_for_catchup will query pg_stat_replication
> whereas after disabling subscription we are checking
> pg_replication_slots. I understand from your explanation why we can't
> rely on pg_stat_replication after DISABLE but it might be better to
> check that the slot is active before disabling it. I think currently,
> the test assumes that, isn't it better to have an explicit check for
> that?

I felt this is not required, wait_for_catchup will poll_query_until
the state = 'streaming', even if START_REPLICATION takes time, state
will be in 'startup' state, this way poll_query_until will take care
of handling this.

On further analysis I found that we need to do "Alter subscription
tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
time, Instead we can change pg_logical_slot_peek_binary_changes to
pg_logical_slot_get_binary_changes at appropriate steps. This way the
common function can be removed and the enable/disable multiple times
can be removed.  If we are going ahead with this approach the above
comments provided are no more valid.

I have made the changes in similar lines in the attached patch. If you
are ok we can go ahead with the new approach which will simplify the
changes required.
Thoughts?

Regards,
Vignesh
From 33fdfd312a9b652a40891c09dc9e075d2dd41e44 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Mon, 17 May 2021 17:23:35 +0530
Subject: [PATCH v3] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 27 +
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f45f..6a76d96d12 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -35,12 +35,14 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
 );
 
+$node_publisher->wait_for_catchup('tap_sub');
+
 # Ensure a transactional logical decoding message shows up on the slot
 $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -77,7 +79,7 @@ is($result, qq(1|pgoutput),
 $result = $node_publisher->safe_psql(
 	'postgres', qq(
 		SELECT get_byte(data, 0)
-		FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
 			'proto_version', '1',
 			'publication_names', 

Re: subscriptioncheck failure

2021-05-16 Thread Amit Kapila
On Mon, May 17, 2021 at 9:06 AM Amit Kapila  wrote:
>
> On Thu, May 13, 2021 at 7:06 PM vignesh C  wrote:
> >
> > On Thu, May 13, 2021 at 4:41 PM Michael Paquier  wrote:
>
> Few comments:
> 1.
> + # Ensure a transactional logical decoding message shows up on the slot
> + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> DISABLE");
>
> After you have encapsulated this command in the function, the above
> comment doesn't make sense because we do this for both transactional
> and non-transactional messages. I suggest we can change it to
> something like: "This is done to ensure a logical decoding message is
> shown up on the slot".
>
> 2.
> +# Setup the subscription before checking pg_logical_slot_peek_binary_changes
> +sub setup_subscription
>
> I think here the functionality is more for the catchup of
> subscription, so it might be better to name the function as
> subscription_catchup or catchup_subscription. I think you can expand
> the comments atop this function a bit as suggested by Michael.
>

One more point:
+ $node_publisher->wait_for_catchup('tap_sub');
+
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+ # wait for the replication connection to drop from the publisher
+ $node_publisher->poll_query_until('postgres',
+ "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
slot_name = 'tap_sub' AND active='f'", 1);

In the above sequence, wait_for_catchup will query pg_stat_replication
whereas after disabling subscription we are checking
pg_replication_slots. I understand from your explanation why we can't
rely on pg_stat_replication after DISABLE but it might be better to
check that the slot is active before disabling it. I think currently,
the test assumes that, isn't it better to have an explicit check for
that?

-- 
With Regards,
Amit Kapila.




Re: subscriptioncheck failure

2021-05-16 Thread Amit Kapila
On Thu, May 13, 2021 at 7:06 PM vignesh C  wrote:
>
> On Thu, May 13, 2021 at 4:41 PM Michael Paquier  wrote:
> >
> > On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:
> > > +$node_publisher->wait_for_catchup('tap_sub');
> > > +
> > >  # Ensure a transactional logical decoding message shows up on the slot
> > >  $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> > > DISABLE");
> > >
> > >  # wait for the replication connection to drop from the publisher
> > >  $node_publisher->poll_query_until('postgres',
> > > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
> > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE 
> > > slot_name = 'tap_sub' AND active='f'", 1);
> >
> > There are three places in this test where a slot is disabled, followed
> > by a wait to make sure that the slot is gone.  Perhaps it would be
> > better to wrap that in a small-ish routine?
>
> Thanks for the comments, Please find the attached patch having the changes.
>

Few comments:
1.
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");

After you have encapsulated this command in the function, the above
comment doesn't make sense because we do this for both transactional
and non-transactional messages. I suggest we can change it to
something like: "This is done to ensure a logical decoding message is
shown up on the slot".

2.
+# Setup the subscription before checking pg_logical_slot_peek_binary_changes
+sub setup_subscription

I think here the functionality is more for the catchup of
subscription, so it might be better to name the function as
subscription_catchup or catchup_subscription. I think you can expand
the comments atop this function a bit as suggested by Michael.

-- 
With Regards,
Amit Kapila.




Re: subscriptioncheck failure

2021-05-13 Thread Michael Paquier
On Thu, May 13, 2021 at 07:05:55PM +0530, vignesh C wrote:
> Thanks for the comments, Please find the attached patch having the changes.

Cool, thanks for the new version.  I have spent some time
understanding the initial report from Amit as well as what you are
proposing here, and refactoring the test so as the set of CREATE/ALTER
SUBSCRIPTION commands are added within this routine is a good idea.

I would have made the comment on top of setup_subscription a bit more
talkative regarding the fact that it may reuse an existing
subscription, but that's a nit.  Let's wait for Amit and see what he
thinks about what you are proposing.
--
Michael


signature.asc
Description: PGP signature


Re: subscriptioncheck failure

2021-05-13 Thread vignesh C
On Thu, May 13, 2021 at 4:41 PM Michael Paquier  wrote:
>
> On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:
> > +$node_publisher->wait_for_catchup('tap_sub');
> > +
> >  # Ensure a transactional logical decoding message shows up on the slot
> >  $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> > DISABLE");
> >
> >  # wait for the replication connection to drop from the publisher
> >  $node_publisher->poll_query_until('postgres',
> > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
> > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name 
> > = 'tap_sub' AND active='f'", 1);
>
> There are three places in this test where a slot is disabled, followed
> by a wait to make sure that the slot is gone.  Perhaps it would be
> better to wrap that in a small-ish routine?

Thanks for the comments, Please find the attached patch having the changes.

Regards,
Vignesh
From 495b48981f18a5e05453cefeb80838fb60982e0f Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Thu, 13 May 2021 16:00:43 +0530
Subject: [PATCH v2] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 60 +
 1 file changed, 32 insertions(+), 28 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f45f..f1b2642640 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -8,6 +8,35 @@ use PostgresNode;
 use TestLib;
 use Test::More tests => 5;
 
+# Setup the subscription before checking pg_logical_slot_peek_binary_changes
+sub setup_subscription
+{
+	my $node_publisher = shift;
+	my $node_subscriber = shift;
+	my $create_subscription = shift;
+
+	if ($create_subscription == 1)
+	{
+		my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+		$node_subscriber->safe_psql('postgres',
+			"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub");
+	}
+	else
+	{
+		$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+	}
+
+	$node_publisher->wait_for_catchup('tap_sub');
+
+	# Ensure a transactional logical decoding message shows up on the slot
+	$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+	# wait for the replication connection to drop from the publisher
+	$node_publisher->poll_query_until('postgres',
+		"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+	return;
+}
+
 # Create publisher node
 my $node_publisher = get_new_node('publisher');
 $node_publisher->init(allows_streaming => 'logical');
@@ -27,20 +56,10 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE TABLE tab_test (a int primary key)");
 
 # Setup logical replication
-my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub FOR TABLE tab_test");
 
-$node_subscriber->safe_psql('postgres',
-	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
-);
-
-# Ensure a transactional logical decoding message shows up on the slot
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+setup_subscription($node_publisher, $node_subscriber, 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -88,15 +107,7 @@ is( $result, qq(66
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
-$node_publisher->wait_for_catchup('tap_sub');
-
-# ensure a non-transactional logical decoding message shows up on the slot
-$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
-
-# wait for the replication connection to drop from the publisher
-$node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+setup_subscription($node_publisher, $node_subscriber, 0);
 
 $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (1)");
 
@@ -118,14 +129,7 @@ $result = $node_publisher->safe_psql(
 
 is($result, qq(77|0), 'non-transactional message on slot is M');
 
-$node_subscriber->safe_psql('postgres

Re: subscriptioncheck failure

2021-05-13 Thread vignesh C
On Thu, May 13, 2021 at 4:41 PM Michael Paquier  wrote:
>
> On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:
> > +$node_publisher->wait_for_catchup('tap_sub');
> > +
> >  # Ensure a transactional logical decoding message shows up on the slot
> >  $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> > DISABLE");
> >
> >  # wait for the replication connection to drop from the publisher
> >  $node_publisher->poll_query_until('postgres',
> > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
> > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name 
> > = 'tap_sub' AND active='f'", 1);
>
> There are three places in this test where a slot is disabled, followed
> by a wait to make sure that the slot is gone.  Perhaps it would be
> better to wrap that in a small-ish routine?

Yes that would be better, I will make the changes for this and post a patch.

Regards,
Vignesh




Re: subscriptioncheck failure

2021-05-13 Thread Michael Paquier
On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:
> +$node_publisher->wait_for_catchup('tap_sub');
> +
>  # Ensure a transactional logical decoding message shows up on the slot
>  $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> DISABLE");
>  
>  # wait for the replication connection to drop from the publisher
>  $node_publisher->poll_query_until('postgres',
> - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
> + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 
> 'tap_sub' AND active='f'", 1);

There are three places in this test where a slot is disabled, followed
by a wait to make sure that the slot is gone.  Perhaps it would be
better to wrap that in a small-ish routine?
--
Michael


signature.asc
Description: PGP signature


Re: subscriptioncheck failure

2021-05-13 Thread vignesh C
On Thu, May 13, 2021 at 3:12 PM Amit Kapila  wrote:
>
> I noticed $SUBJECT in curculio.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23
>
> The failure test report:
> t/020_messages.pl(Wstat: 7424 Tests: 1 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan.  You planned 5 tests but ran 1.
> Files=21, Tests=149, 79 wallclock secs ( 0.03 usr  0.03 sys + 22.36
> cusr  9.47 csys = 31.89 CPU)
> Result: FAIL
> Makefile:21: recipe for target 'check' failed
> gmake: *** [check] Error 1
>
> As far as I can understand the failure happens due to error:
> 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
> replication slot "tap_sub" is active for PID 7354
> 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> OFFSET 1 LIMIT 1
>
> As per my initial analysis, this happens because in the test after
> creating a subscription, we are not waiting for the subscriber to
> catch up. I think there is a narrow window where it is possible that
> after creating a subscription, the apply worker starts replication
> even after we have done Alter Subscription .. DISABLE. This is also
> visible in buildfarm logs:
>
> 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 0)
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> ...
>
> 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG:  received
> replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
> (proto_version '2', publication_names '"tap_pub"')
>
> Notice that we have checked the messages before receiving the start
> replication command from the subscriber. Now, if we would have waited
> for the subscription to catch up then our check in pg_stat_replication
> wouldn't have given a false positive. I have noticed that we do have
> such wait in other subscription tests but somehow missed it here.

I agree with your analysis, apart from this issue there is one other
small problem. Currently after disabling subscription
pg_stat_replication is used to verify if the walsender is exited. The
steps of the walsender exit includes cleaning up of walsender and then
releasing a replication slot. There will be random test failure if
pg_logical_slot_peek_binary_changes is called in this narrow time
window. I think it is better to fix it by checking the active column
in pg_replication_slot instead of pg_stat_replication which is more
reliable.
Attached patch which has the fixes for the same.
Thoughts?

Regards,
Vignesh
From 5cbfa5614db18925ca92c5e2c7310ff6520b8397 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Thu, 13 May 2021 16:00:43 +0530
Subject: [PATCH v1] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index e5d48ec8a0..7b385b48e3 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -34,12 +34,14 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
 );
 
+$node_publisher->wait_for_catchup('tap_sub');
+
 # Ensure a transactional logical decoding message shows up on the slot
 $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -94,7 +96,7 @@ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe

subscriptioncheck failure

2021-05-13 Thread Amit Kapila
I noticed $SUBJECT in curculio.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23

The failure test report:
t/020_messages.pl(Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 5 tests but ran 1.
Files=21, Tests=149, 79 wallclock secs ( 0.03 usr  0.03 sys + 22.36
cusr  9.47 csys = 31.89 CPU)
Result: FAIL
Makefile:21: recipe for target 'check' failed
gmake: *** [check] Error 1

As far as I can understand the failure happens due to error:
2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
replication slot "tap_sub" is active for PID 7354
2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
OFFSET 1 LIMIT 1

As per my initial analysis, this happens because in the test after
creating a subscription, we are not waiting for the subscriber to
catch up. I think there is a narrow window where it is possible that
after creating a subscription, the apply worker starts replication
even after we have done Alter Subscription .. DISABLE. This is also
visible in buildfarm logs:

2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
SELECT get_byte(data, 0)
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
...

2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG:  received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
(proto_version '2', publication_names '"tap_pub"')

Notice that we have checked the messages before receiving the start
replication command from the subscriber. Now, if we would have waited
for the subscription to catch up then our check in pg_stat_replication
wouldn't have given a false positive. I have noticed that we do have
such wait in other subscription tests but somehow missed it here.

Thoughts?

-- 
With Regards,
Amit Kapila.