Re: subscriptioncheck failure
On Wed, May 19, 2021 at 10:28 AM Amit Kapila wrote: > > On Tue, May 18, 2021 at 11:25 AM vignesh C wrote: > > > > Thanks for the comments, the attached patch has the changes for the same. > > > > Thanks, I have pushed your patch after making minor changes in the comments. Thanks for pushing this patch. Regards, Vignesh
Re: subscriptioncheck failure
On Tue, May 18, 2021 at 11:25 AM vignesh C wrote: > > Thanks for the comments, the attached patch has the changes for the same. > Thanks, I have pushed your patch after making minor changes in the comments. -- With Regards, Amit Kapila.
Re: subscriptioncheck failure
On Tue, May 18, 2021 at 9:09 AM Amit Kapila wrote: > > On Mon, May 17, 2021 at 5:48 PM vignesh C wrote: > > > > On Mon, May 17, 2021 at 10:40 AM Amit Kapila > > wrote: > > > > > > One more point: > > > + $node_publisher->wait_for_catchup('tap_sub'); > > > + > > > + # Ensure a transactional logical decoding message shows up on the slot > > > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > > > DISABLE"); > > > + > > > + # wait for the replication connection to drop from the publisher > > > + $node_publisher->poll_query_until('postgres', > > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE > > > slot_name = 'tap_sub' AND active='f'", 1); > > > > > > In the above sequence, wait_for_catchup will query pg_stat_replication > > > whereas after disabling subscription we are checking > > > pg_replication_slots. I understand from your explanation why we can't > > > rely on pg_stat_replication after DISABLE but it might be better to > > > check that the slot is active before disabling it. I think currently, > > > the test assumes that, isn't it better to have an explicit check for > > > that? > > > > I felt this is not required, wait_for_catchup will poll_query_until > > the state = 'streaming', even if START_REPLICATION takes time, state > > will be in 'startup' state, this way poll_query_until will take care > > of handling this. > > > > makes sense, but let's add some comments to clarify the same. > Modified. > > On further analysis I found that we need to do "Alter subscription > > tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple > > time, Instead we can change pg_logical_slot_peek_binary_changes to > > pg_logical_slot_get_binary_changes at appropriate steps. This way the > > common function can be removed and the enable/disable multiple times > > can be removed. > > > > I think that is a valid point. This was probably kept so that we can > peek multiple times for the same message to test various things but > that can be achieved with the way you have changed the test. > > One more thing, shouldn't we make auto_vacuum=off for this test by > using 'append_conf' before starting the publisher. That will avoid the > risk of empty transactions. I felt that makes sense, added it. Thanks for the comments, the attached patch has the changes for the same. Regards, Vignesh From 7617be76996efd17f46cee16f9a9a18a23183f19 Mon Sep 17 00:00:00 2001 From: vignesh Date: Mon, 17 May 2021 17:23:35 +0530 Subject: [PATCH v4] Messages test failure fix. This test was failing because there was no wait for catch up after creating subscription. After disabling subscription pg_stat_replication was used to verify if the walsender is exited. The steps of walsender exit includes cleaning up of walsendr and then releasing replication slot. There will be random test failure if pg_logical_slot_peek_binary_changes is called in the narrow time window. Fixed it by checking active column in pg_replication_slot instead of pg_stat_replication which is more reliable. --- src/test/subscription/t/020_messages.pl | 32 - 1 file changed, 10 insertions(+), 22 deletions(-) diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl index 0940d0f45f..aaafc9f11d 100644 --- a/src/test/subscription/t/020_messages.pl +++ b/src/test/subscription/t/020_messages.pl @@ -11,6 +11,8 @@ use Test::More tests => 5; # Create publisher node my $node_publisher = get_new_node('publisher'); $node_publisher->init(allows_streaming => 'logical'); +$node_publisher->append_conf('postgresql.conf', + 'autovacuum = off'); $node_publisher->start; # Create subscriber node @@ -35,12 +37,17 @@ $node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub" ); +# Here wait_for_catchup is used intentionally instead of pg_replication_slots +# as wait_for_catchup waits on poll_query_until until the state becomes +# 'streaming', which will make sure that the replication slot is active. +$node_publisher->wait_for_catchup('tap_sub'); + # Ensure a transactional logical decoding message shows up on the slot $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); # wait for the replication connection to drop from the publisher $node_publisher->poll_query_until('postgres', - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); $node_publisher->safe_psql('postgres', "SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')" @@ -77,7 +84,7 @@ is($result, qq(1|pgoutput), $result = $node_publisher->safe_psql( 'postgres', qq( SELECT get_byte(data, 0) - FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, + FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL, 'proto_version', '1', 'publication_names', 'tap_pub') )
Re: subscriptioncheck failure
On Mon, May 17, 2021 at 5:48 PM vignesh C wrote: > > On Mon, May 17, 2021 at 10:40 AM Amit Kapila wrote: > > > > One more point: > > + $node_publisher->wait_for_catchup('tap_sub'); > > + > > + # Ensure a transactional logical decoding message shows up on the slot > > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > > DISABLE"); > > + > > + # wait for the replication connection to drop from the publisher > > + $node_publisher->poll_query_until('postgres', > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE > > slot_name = 'tap_sub' AND active='f'", 1); > > > > In the above sequence, wait_for_catchup will query pg_stat_replication > > whereas after disabling subscription we are checking > > pg_replication_slots. I understand from your explanation why we can't > > rely on pg_stat_replication after DISABLE but it might be better to > > check that the slot is active before disabling it. I think currently, > > the test assumes that, isn't it better to have an explicit check for > > that? > > I felt this is not required, wait_for_catchup will poll_query_until > the state = 'streaming', even if START_REPLICATION takes time, state > will be in 'startup' state, this way poll_query_until will take care > of handling this. > makes sense, but let's add some comments to clarify the same. > On further analysis I found that we need to do "Alter subscription > tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple > time, Instead we can change pg_logical_slot_peek_binary_changes to > pg_logical_slot_get_binary_changes at appropriate steps. This way the > common function can be removed and the enable/disable multiple times > can be removed. > I think that is a valid point. This was probably kept so that we can peek multiple times for the same message to test various things but that can be achieved with the way you have changed the test. One more thing, shouldn't we make auto_vacuum=off for this test by using 'append_conf' before starting the publisher. That will avoid the risk of empty transactions. -- With Regards, Amit Kapila.
Re: subscriptioncheck failure
On Mon, May 17, 2021 at 10:40 AM Amit Kapila wrote: > > On Mon, May 17, 2021 at 9:06 AM Amit Kapila wrote: > > > > On Thu, May 13, 2021 at 7:06 PM vignesh C wrote: > > > > > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier > > > wrote: > > > > Few comments: > > 1. > > + # Ensure a transactional logical decoding message shows up on the slot > > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > > DISABLE"); > > > > After you have encapsulated this command in the function, the above > > comment doesn't make sense because we do this for both transactional > > and non-transactional messages. I suggest we can change it to > > something like: "This is done to ensure a logical decoding message is > > shown up on the slot". > > > > 2. > > +# Setup the subscription before checking > > pg_logical_slot_peek_binary_changes > > +sub setup_subscription > > > > I think here the functionality is more for the catchup of > > subscription, so it might be better to name the function as > > subscription_catchup or catchup_subscription. I think you can expand > > the comments atop this function a bit as suggested by Michael. > > > > One more point: > + $node_publisher->wait_for_catchup('tap_sub'); > + > + # Ensure a transactional logical decoding message shows up on the slot > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > DISABLE"); > + > + # wait for the replication connection to drop from the publisher > + $node_publisher->poll_query_until('postgres', > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE > slot_name = 'tap_sub' AND active='f'", 1); > > In the above sequence, wait_for_catchup will query pg_stat_replication > whereas after disabling subscription we are checking > pg_replication_slots. I understand from your explanation why we can't > rely on pg_stat_replication after DISABLE but it might be better to > check that the slot is active before disabling it. I think currently, > the test assumes that, isn't it better to have an explicit check for > that? I felt this is not required, wait_for_catchup will poll_query_until the state = 'streaming', even if START_REPLICATION takes time, state will be in 'startup' state, this way poll_query_until will take care of handling this. On further analysis I found that we need to do "Alter subscription tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple time, Instead we can change pg_logical_slot_peek_binary_changes to pg_logical_slot_get_binary_changes at appropriate steps. This way the common function can be removed and the enable/disable multiple times can be removed. If we are going ahead with this approach the above comments provided are no more valid. I have made the changes in similar lines in the attached patch. If you are ok we can go ahead with the new approach which will simplify the changes required. Thoughts? Regards, Vignesh From 33fdfd312a9b652a40891c09dc9e075d2dd41e44 Mon Sep 17 00:00:00 2001 From: vignesh Date: Mon, 17 May 2021 17:23:35 +0530 Subject: [PATCH v3] Messages test failure fix. This test was failing because there was no wait for catch up after creating subscription. After disabling subscription pg_stat_replication was used to verify if the walsender is exited. The steps of walsender exit includes cleaning up of walsendr and then releasing replication slot. There will be random test failure if pg_logical_slot_peek_binary_changes is called in the narrow time window. Fixed it by checking active column in pg_replication_slot instead of pg_stat_replication which is more reliable. --- src/test/subscription/t/020_messages.pl | 27 + 1 file changed, 5 insertions(+), 22 deletions(-) diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl index 0940d0f45f..6a76d96d12 100644 --- a/src/test/subscription/t/020_messages.pl +++ b/src/test/subscription/t/020_messages.pl @@ -35,12 +35,14 @@ $node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub" ); +$node_publisher->wait_for_catchup('tap_sub'); + # Ensure a transactional logical decoding message shows up on the slot $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); # wait for the replication connection to drop from the publisher $node_publisher->poll_query_until('postgres', - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); $node_publisher->safe_psql('postgres', "SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')" @@ -77,7 +79,7 @@ is($result, qq(1|pgoutput), $result = $node_publisher->safe_psql( 'postgres', qq( SELECT get_byte(data, 0) - FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, + FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL, 'proto_version', '1', 'publication_names',
Re: subscriptioncheck failure
On Mon, May 17, 2021 at 9:06 AM Amit Kapila wrote: > > On Thu, May 13, 2021 at 7:06 PM vignesh C wrote: > > > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier wrote: > > Few comments: > 1. > + # Ensure a transactional logical decoding message shows up on the slot > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > DISABLE"); > > After you have encapsulated this command in the function, the above > comment doesn't make sense because we do this for both transactional > and non-transactional messages. I suggest we can change it to > something like: "This is done to ensure a logical decoding message is > shown up on the slot". > > 2. > +# Setup the subscription before checking pg_logical_slot_peek_binary_changes > +sub setup_subscription > > I think here the functionality is more for the catchup of > subscription, so it might be better to name the function as > subscription_catchup or catchup_subscription. I think you can expand > the comments atop this function a bit as suggested by Michael. > One more point: + $node_publisher->wait_for_catchup('tap_sub'); + + # Ensure a transactional logical decoding message shows up on the slot + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); + + # wait for the replication connection to drop from the publisher + $node_publisher->poll_query_until('postgres', + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); In the above sequence, wait_for_catchup will query pg_stat_replication whereas after disabling subscription we are checking pg_replication_slots. I understand from your explanation why we can't rely on pg_stat_replication after DISABLE but it might be better to check that the slot is active before disabling it. I think currently, the test assumes that, isn't it better to have an explicit check for that? -- With Regards, Amit Kapila.
Re: subscriptioncheck failure
On Thu, May 13, 2021 at 7:06 PM vignesh C wrote: > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier wrote: > > > > On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote: > > > +$node_publisher->wait_for_catchup('tap_sub'); > > > + > > > # Ensure a transactional logical decoding message shows up on the slot > > > $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > > > DISABLE"); > > > > > > # wait for the replication connection to drop from the publisher > > > $node_publisher->poll_query_until('postgres', > > > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); > > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE > > > slot_name = 'tap_sub' AND active='f'", 1); > > > > There are three places in this test where a slot is disabled, followed > > by a wait to make sure that the slot is gone. Perhaps it would be > > better to wrap that in a small-ish routine? > > Thanks for the comments, Please find the attached patch having the changes. > Few comments: 1. + # Ensure a transactional logical decoding message shows up on the slot + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); After you have encapsulated this command in the function, the above comment doesn't make sense because we do this for both transactional and non-transactional messages. I suggest we can change it to something like: "This is done to ensure a logical decoding message is shown up on the slot". 2. +# Setup the subscription before checking pg_logical_slot_peek_binary_changes +sub setup_subscription I think here the functionality is more for the catchup of subscription, so it might be better to name the function as subscription_catchup or catchup_subscription. I think you can expand the comments atop this function a bit as suggested by Michael. -- With Regards, Amit Kapila.
Re: subscriptioncheck failure
On Thu, May 13, 2021 at 07:05:55PM +0530, vignesh C wrote: > Thanks for the comments, Please find the attached patch having the changes. Cool, thanks for the new version. I have spent some time understanding the initial report from Amit as well as what you are proposing here, and refactoring the test so as the set of CREATE/ALTER SUBSCRIPTION commands are added within this routine is a good idea. I would have made the comment on top of setup_subscription a bit more talkative regarding the fact that it may reuse an existing subscription, but that's a nit. Let's wait for Amit and see what he thinks about what you are proposing. -- Michael signature.asc Description: PGP signature
Re: subscriptioncheck failure
On Thu, May 13, 2021 at 4:41 PM Michael Paquier wrote: > > On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote: > > +$node_publisher->wait_for_catchup('tap_sub'); > > + > > # Ensure a transactional logical decoding message shows up on the slot > > $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > > DISABLE"); > > > > # wait for the replication connection to drop from the publisher > > $node_publisher->poll_query_until('postgres', > > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name > > = 'tap_sub' AND active='f'", 1); > > There are three places in this test where a slot is disabled, followed > by a wait to make sure that the slot is gone. Perhaps it would be > better to wrap that in a small-ish routine? Thanks for the comments, Please find the attached patch having the changes. Regards, Vignesh From 495b48981f18a5e05453cefeb80838fb60982e0f Mon Sep 17 00:00:00 2001 From: vignesh Date: Thu, 13 May 2021 16:00:43 +0530 Subject: [PATCH v2] Messages test failure fix. This test was failing because there was no wait for catch up after creating subscription. After disabling subscription pg_stat_replication was used to verify if the walsender is exited. The steps of walsender exit includes cleaning up of walsendr and then releasing replication slot. There will be random test failure if pg_logical_slot_peek_binary_changes is called in the narrow time window. Fixed it by checking active column in pg_replication_slot instead of pg_stat_replication which is more reliable. --- src/test/subscription/t/020_messages.pl | 60 + 1 file changed, 32 insertions(+), 28 deletions(-) diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl index 0940d0f45f..f1b2642640 100644 --- a/src/test/subscription/t/020_messages.pl +++ b/src/test/subscription/t/020_messages.pl @@ -8,6 +8,35 @@ use PostgresNode; use TestLib; use Test::More tests => 5; +# Setup the subscription before checking pg_logical_slot_peek_binary_changes +sub setup_subscription +{ + my $node_publisher = shift; + my $node_subscriber = shift; + my $create_subscription = shift; + + if ($create_subscription == 1) + { + my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres'; + $node_subscriber->safe_psql('postgres', + "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"); + } + else + { + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE"); + } + + $node_publisher->wait_for_catchup('tap_sub'); + + # Ensure a transactional logical decoding message shows up on the slot + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); + + # wait for the replication connection to drop from the publisher + $node_publisher->poll_query_until('postgres', + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); + return; +} + # Create publisher node my $node_publisher = get_new_node('publisher'); $node_publisher->init(allows_streaming => 'logical'); @@ -27,20 +56,10 @@ $node_subscriber->safe_psql('postgres', "CREATE TABLE tab_test (a int primary key)"); # Setup logical replication -my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres'; $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE tab_test"); -$node_subscriber->safe_psql('postgres', - "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub" -); - -# Ensure a transactional logical decoding message shows up on the slot -$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); - -# wait for the replication connection to drop from the publisher -$node_publisher->poll_query_until('postgres', - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); +setup_subscription($node_publisher, $node_subscriber, 1); $node_publisher->safe_psql('postgres', "SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')" @@ -88,15 +107,7 @@ is( $result, qq(66 'option messages defaults to false so message (M) is not available on slot' ); -$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE"); -$node_publisher->wait_for_catchup('tap_sub'); - -# ensure a non-transactional logical decoding message shows up on the slot -$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); - -# wait for the replication connection to drop from the publisher -$node_publisher->poll_query_until('postgres', - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); +setup_subscription($node_publisher, $node_subscriber, 0); $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (1)"); @@ -118,14 +129,7 @@ $result = $node_publisher->safe_psql( is($result, qq(77|0), 'non-transactional message on slot is M'); -$node_subscriber->safe_psql('postgres
Re: subscriptioncheck failure
On Thu, May 13, 2021 at 4:41 PM Michael Paquier wrote: > > On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote: > > +$node_publisher->wait_for_catchup('tap_sub'); > > + > > # Ensure a transactional logical decoding message shows up on the slot > > $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > > DISABLE"); > > > > # wait for the replication connection to drop from the publisher > > $node_publisher->poll_query_until('postgres', > > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name > > = 'tap_sub' AND active='f'", 1); > > There are three places in this test where a slot is disabled, followed > by a wait to make sure that the slot is gone. Perhaps it would be > better to wrap that in a small-ish routine? Yes that would be better, I will make the changes for this and post a patch. Regards, Vignesh
Re: subscriptioncheck failure
On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote: > +$node_publisher->wait_for_catchup('tap_sub'); > + > # Ensure a transactional logical decoding message shows up on the slot > $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub > DISABLE"); > > # wait for the replication connection to drop from the publisher > $node_publisher->poll_query_until('postgres', > - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = > 'tap_sub' AND active='f'", 1); There are three places in this test where a slot is disabled, followed by a wait to make sure that the slot is gone. Perhaps it would be better to wrap that in a small-ish routine? -- Michael signature.asc Description: PGP signature
Re: subscriptioncheck failure
On Thu, May 13, 2021 at 3:12 PM Amit Kapila wrote: > > I noticed $SUBJECT in curculio. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23 > > The failure test report: > t/020_messages.pl(Wstat: 7424 Tests: 1 Failed: 0) > Non-zero exit status: 29 > Parse errors: Bad plan. You planned 5 tests but ran 1. > Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36 > cusr 9.47 csys = 31.89 CPU) > Result: FAIL > Makefile:21: recipe for target 'check' failed > gmake: *** [check] Error 1 > > As far as I can understand the failure happens due to error: > 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR: > replication slot "tap_sub" is active for PID 7354 > 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT: > SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape') > FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, > 'proto_version', '1', > 'publication_names', 'tap_pub', > 'messages', 'true') > OFFSET 1 LIMIT 1 > > As per my initial analysis, this happens because in the test after > creating a subscription, we are not waiting for the subscriber to > catch up. I think there is a narrow window where it is possible that > after creating a subscription, the apply worker starts replication > even after we have done Alter Subscription .. DISABLE. This is also > visible in buildfarm logs: > > 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT: > SELECT get_byte(data, 0) > FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, > 'proto_version', '1', > 'publication_names', 'tap_pub', > 'messages', 'true') > ... > > 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 > (proto_version '2', publication_names '"tap_pub"') > > Notice that we have checked the messages before receiving the start > replication command from the subscriber. Now, if we would have waited > for the subscription to catch up then our check in pg_stat_replication > wouldn't have given a false positive. I have noticed that we do have > such wait in other subscription tests but somehow missed it here. I agree with your analysis, apart from this issue there is one other small problem. Currently after disabling subscription pg_stat_replication is used to verify if the walsender is exited. The steps of the walsender exit includes cleaning up of walsender and then releasing a replication slot. There will be random test failure if pg_logical_slot_peek_binary_changes is called in this narrow time window. I think it is better to fix it by checking the active column in pg_replication_slot instead of pg_stat_replication which is more reliable. Attached patch which has the fixes for the same. Thoughts? Regards, Vignesh From 5cbfa5614db18925ca92c5e2c7310ff6520b8397 Mon Sep 17 00:00:00 2001 From: vignesh Date: Thu, 13 May 2021 16:00:43 +0530 Subject: [PATCH v1] Messages test failure fix. This test was failing because there was no wait for catch up after creating subscription. After disabling subscription pg_stat_replication was used to verify if the walsender is exited. The steps of walsender exit includes cleaning up of walsendr and then releasing replication slot. There will be random test failure if pg_logical_slot_peek_binary_changes is called in the narrow time window. Fixed it by checking active column in pg_replication_slot instead of pg_stat_replication which is more reliable. --- src/test/subscription/t/020_messages.pl | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl index e5d48ec8a0..7b385b48e3 100644 --- a/src/test/subscription/t/020_messages.pl +++ b/src/test/subscription/t/020_messages.pl @@ -34,12 +34,14 @@ $node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub" ); +$node_publisher->wait_for_catchup('tap_sub'); + # Ensure a transactional logical decoding message shows up on the slot $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); # wait for the replication connection to drop from the publisher $node_publisher->poll_query_until('postgres', - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); $node_publisher->safe_psql('postgres', "SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')" @@ -94,7 +96,7 @@ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE"); # wait for the replication connection to drop from the publisher $node_publisher->poll_query_until('postgres', - 'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0); + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1); $node_publisher->safe
subscriptioncheck failure
I noticed $SUBJECT in curculio. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23 The failure test report: t/020_messages.pl(Wstat: 7424 Tests: 1 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 5 tests but ran 1. Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36 cusr 9.47 csys = 31.89 CPU) Result: FAIL Makefile:21: recipe for target 'check' failed gmake: *** [check] Error 1 As far as I can understand the failure happens due to error: 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR: replication slot "tap_sub" is active for PID 7354 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT: SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape') FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, 'proto_version', '1', 'publication_names', 'tap_pub', 'messages', 'true') OFFSET 1 LIMIT 1 As per my initial analysis, this happens because in the test after creating a subscription, we are not waiting for the subscriber to catch up. I think there is a narrow window where it is possible that after creating a subscription, the apply worker starts replication even after we have done Alter Subscription .. DISABLE. This is also visible in buildfarm logs: 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT: SELECT get_byte(data, 0) FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, 'proto_version', '1', 'publication_names', 'tap_pub', 'messages', 'true') ... 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') Notice that we have checked the messages before receiving the start replication command from the subscriber. Now, if we would have waited for the subscription to catch up then our check in pg_stat_replication wouldn't have given a false positive. I have noticed that we do have such wait in other subscription tests but somehow missed it here. Thoughts? -- With Regards, Amit Kapila.