Re: Failure of subscription tests with topminnow
On Tue, Aug 31, 2021 at 6:13 PM Ajin Cherian wrote: > > On Tue, Aug 31, 2021 at 3:47 PM Masahiko Sawada wrote: > > Thanks Masahiko-san. I have included this change and made a new patch-set. > > Hi Amit, > > I have included your comments as well and also attached the patches > for the back-branches. > You forgot to make changes in 022_twophase_cascade.pl in the head version patch. I have made the required changes and pushed it. Thanks to you and Sawada-San for diagnosing and fixing this problem. -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Tue, Aug 31, 2021 at 3:47 PM Masahiko Sawada wrote: > > On Tue, Aug 31, 2021 at 12:11 PM Amit Kapila wrote: > > > > On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian wrote: > > > > > > On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila > > > wrote: > > > > > > I have made the above changes on HEAD. > > > > > > > Thanks, this looks mostly good to me. I'll push and backpatch this > > tomorrow unless you or someone else thinks otherwise. > > > > Minor comments > > == > > 1. > > $oldpid = $node_publisher->safe_psql('postgres', > > - "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';" > > + "SELECT pid FROM pg_stat_replication WHERE application_name = > > 'tap_sub' AND state = 'streaming';;" > > ); > > > > An extra semicolon at the end of the statement. > > > > 2. > > +# restart of subscription workers. We check the state along with > > application_name > > +# to ensure that the walsender is (re)started. > > > > It is better to keep application_name in an above comment in the > > second line as that will make this line looks a bit more consistent > > with other comments. > > > > 3. In commit message, the text: "The reason was that the test was > > assuming the walsender started before it reaches the 'streaming' state > > and The check to test whether the subscription workers were restarting > > after a change in the subscription was failing." seems to be > > repeated/redundant. > > > > 4. Kindly submit the patches for back-branches. > > The patch with the above comments looks good to me. One minor > suggestion is to change the two messages of "die" to make the > investigation a bit easier. For example, > > $node_publisher->poll_query_until('postgres', > "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = 'tap_sub' AND state = 'streaming';" > ) or die "Timed out while waiting for apply to restart after changing > CONNECTION"; > > and > > $node_publisher->poll_query_until('postgres', > "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = 'tap_sub' AND state = 'streaming';" > ) or die "Timed out while waiting for apply to restart after changing > PUBLICATION"; > > Regards, > Thanks Masahiko-san. I have included this change and made a new patch-set. Hi Amit, I have included your comments as well and also attached the patches for the back-branches. regards, Ajin Cherian Fujitsu Australia head-v4-0001-Fix-the-random-test-failure-in-001_rep_changes.patch Description: Binary data REL-10-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch Description: Binary data REL-11-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch Description: Binary data REL-12-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch Description: Binary data REL-13-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch Description: Binary data REL-14-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch Description: Binary data
Re: Failure of subscription tests with topminnow
On Tue, Aug 31, 2021 at 12:11 PM Amit Kapila wrote: > > On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian wrote: > > > > On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila wrote: > > > > I have made the above changes on HEAD. > > > > Thanks, this looks mostly good to me. I'll push and backpatch this > tomorrow unless you or someone else thinks otherwise. > > Minor comments > == > 1. > $oldpid = $node_publisher->safe_psql('postgres', > - "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';" > + "SELECT pid FROM pg_stat_replication WHERE application_name = > 'tap_sub' AND state = 'streaming';;" > ); > > An extra semicolon at the end of the statement. > > 2. > +# restart of subscription workers. We check the state along with > application_name > +# to ensure that the walsender is (re)started. > > It is better to keep application_name in an above comment in the > second line as that will make this line looks a bit more consistent > with other comments. > > 3. In commit message, the text: "The reason was that the test was > assuming the walsender started before it reaches the 'streaming' state > and The check to test whether the subscription workers were restarting > after a change in the subscription was failing." seems to be > repeated/redundant. > > 4. Kindly submit the patches for back-branches. The patch with the above comments looks good to me. One minor suggestion is to change the two messages of "die" to make the investigation a bit easier. For example, $node_publisher->poll_query_until('postgres', "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';" ) or die "Timed out while waiting for apply to restart after changing CONNECTION"; and $node_publisher->poll_query_until('postgres', "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';" ) or die "Timed out while waiting for apply to restart after changing PUBLICATION"; Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failure of subscription tests with topminnow
On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian wrote: > > On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila wrote: > > I have made the above changes on HEAD. > Thanks, this looks mostly good to me. I'll push and backpatch this tomorrow unless you or someone else thinks otherwise. Minor comments == 1. $oldpid = $node_publisher->safe_psql('postgres', - "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';" + "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';;" ); An extra semicolon at the end of the statement. 2. +# restart of subscription workers. We check the state along with application_name +# to ensure that the walsender is (re)started. It is better to keep application_name in an above comment in the second line as that will make this line looks a bit more consistent with other comments. 3. In commit message, the text: "The reason was that the test was assuming the walsender started before it reaches the 'streaming' state and The check to test whether the subscription workers were restarting after a change in the subscription was failing." seems to be repeated/redundant. 4. Kindly submit the patches for back-branches. -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila wrote: > Isn't it better to check the streaming state when we are fetching > oldpid? If we don't add, then I suspect that the next time someone > adding tests on similar lines might get confused about where to check > the state and where not. Also, if you agree, add some comments before > the test on why it is important to check states. > > For ex., in below queries, the queries used for $oldpid. > my $oldpid = $node_publisher->safe_psql('postgres', > "SELECT pid FROM pg_stat_replication WHERE application_name = > 'tap_sub';" > ); > $node_subscriber->safe_psql('postgres', > "ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr > sslmode=disable'" > ); > $node_publisher->poll_query_until('postgres', > "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = 'tap_sub' AND state = 'streaming';" > ) or die "Timed out while waiting for apply to restart"; > > $oldpid = $node_publisher->safe_psql('postgres', > "SELECT pid FROM pg_stat_replication WHERE application_name = > 'tap_sub';" > ); > $node_subscriber->safe_psql('postgres', > "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only > WITH (copy_data = false)" > ); > $node_publisher->poll_query_until('postgres', > "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = 'tap_sub' AND state = 'streaming';" > I have made the above changes on HEAD. regards, Ajin Cherian Fujitsu Australia v3-0001-Fix-the-random-test-failure-in-001_rep_changes.patch Description: Binary data
Re: Failure of subscription tests with topminnow
On Fri, Aug 27, 2021 at 4:33 PM Ajin Cherian wrote: > > On Fri, Aug 27, 2021 at 3:29 PM Amit Kapila wrote: > > > > > > I think the fix is correct but similar changes are required in > > 022_twophase_cascade.pl as well (search for $oldpid in tests). I am > > not completely sure but I think it is better to make this test change > > in back branches as well to make it stable and reduce such random > > build farm failures. > > I have made the changes in 022_twophase_cascade.pl for HEAD as well as > patches for older branches. > Isn't it better to check the streaming state when we are fetching oldpid? If we don't add, then I suspect that the next time someone adding tests on similar lines might get confused about where to check the state and where not. Also, if you agree, add some comments before the test on why it is important to check states. For ex., in below queries, the queries used for $oldpid. my $oldpid = $node_publisher->safe_psql('postgres', "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';" ); $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr sslmode=disable'" ); $node_publisher->poll_query_until('postgres', "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';" ) or die "Timed out while waiting for apply to restart"; $oldpid = $node_publisher->safe_psql('postgres', "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';" ); $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)" ); $node_publisher->poll_query_until('postgres', "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';" -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Fri, Aug 27, 2021 at 3:29 PM Amit Kapila wrote: > > > I think the fix is correct but similar changes are required in > 022_twophase_cascade.pl as well (search for $oldpid in tests). I am > not completely sure but I think it is better to make this test change > in back branches as well to make it stable and reduce such random > build farm failures. I have made the changes in 022_twophase_cascade.pl for HEAD as well as patches for older branches. regards, Ajin Cherian Fujitsu Australia HEAD-v2-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch Description: Binary data REL-13-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch Description: Binary data REL-12-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch Description: Binary data REL-10-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch Description: Binary data REL-11-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch Description: Binary data REL-14-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch Description: Binary data
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 4:28 PM Ajin Cherian wrote: > > On Thu, Aug 26, 2021 at 2:45 PM Masahiko Sawada wrote: > > Attaching a patch on head that modifies this particular script to also > consider the state of the walsender. > I think the fix is correct but similar changes are required in 022_twophase_cascade.pl as well (search for $oldpid in tests). I am not completely sure but I think it is better to make this test change in back branches as well to make it stable and reduce such random build farm failures. -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 2:45 PM Masahiko Sawada wrote: > I think that it’s possible that the orders of the process writing > disconnections logs and setting 0 to walsender's pid are mismatched. > We set 0 to walsender's pid in WalSndKill() that is called during > on_shmem_exit callback. Once we set 0, pg_stat_replication doesn't > show the entry. On the other hand, disconnections logs are written by > log_disconnections() that is called during on_proc_exit callback. That > is, the following sequence could happen: > > 1. the second walsender (pid = 16475) raises an error as the slot is > already active (held by the first walsender). > 2. the first walsender (pid = 16336) clears its pid on the shmem. > 3. the polling query checks the walsender’s pid, and returns true > (since there is only the second walsender now). > 4. the second walsender clears its pid on the shmem. > 5. the second walsender writes disconnection log. > 6. the first walsender writes disconneciton log. I agree with this. Attaching a patch on head that modifies this particular script to also consider the state of the walsender. regards, Ajin Cherian Fujitsu Australia v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch Description: Binary data
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 12:59 PM Ajin Cherian wrote: > > On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila wrote: > > > > On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian wrote: > > > > > > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila > > > wrote: > > > > > > > > > > > You have a point but if we see the below logs, it seems the second > > > > walsender (#step6) seemed to exited before the first walsender > > > > (#step4). > > > > > > > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: > > > > session time: 0:00:00.036 user=nm database=postgres host=[local] > > > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: > > > > session time: 0:00:06.367 user=nm database=postgres host=[local] > > > > > > > > Isn't it possible that pid is cleared in the other order due to which > > > > we are seeing this problem? > > > > > > If the pid is cleared in the other order, wouldn't the query [1] return a > > > false? > > > > > > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE > > > application_name = 'tap_sub';" > > > > > > > I think it should return true because pid for 16336 is cleared first > > and the remaining one will be 16475. > > Yes, that was what I explained as well. 16336 is PID 'a' (first > walsender) in my explanation. The first walsender should > be cleared first for this theory to work. I think that it’s possible that the orders of the process writing disconnections logs and setting 0 to walsender's pid are mismatched. We set 0 to walsender's pid in WalSndKill() that is called during on_shmem_exit callback. Once we set 0, pg_stat_replication doesn't show the entry. On the other hand, disconnections logs are written by log_disconnections() that is called during on_proc_exit callback. That is, the following sequence could happen: 1. the second walsender (pid = 16475) raises an error as the slot is already active (held by the first walsender). 2. the first walsender (pid = 16336) clears its pid on the shmem. 3. the polling query checks the walsender’s pid, and returns true (since there is only the second walsender now). 4. the second walsender clears its pid on the shmem. 5. the second walsender writes disconnection log. 6. the first walsender writes disconneciton log. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila wrote: > > On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian wrote: > > > > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila wrote: > > > > > > > > You have a point but if we see the below logs, it seems the second > > > walsender (#step6) seemed to exited before the first walsender > > > (#step4). > > > > > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: > > > session time: 0:00:00.036 user=nm database=postgres host=[local] > > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: > > > session time: 0:00:06.367 user=nm database=postgres host=[local] > > > > > > Isn't it possible that pid is cleared in the other order due to which > > > we are seeing this problem? > > > > If the pid is cleared in the other order, wouldn't the query [1] return a > > false? > > > > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE > > application_name = 'tap_sub';" > > > > I think it should return true because pid for 16336 is cleared first > and the remaining one will be 16475. Yes, that was what I explained as well. 16336 is PID 'a' (first walsender) in my explanation. The first walsender should be cleared first for this theory to work. regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian wrote: > > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila wrote: > > > > > You have a point but if we see the below logs, it seems the second > > walsender (#step6) seemed to exited before the first walsender > > (#step4). > > > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: > > session time: 0:00:00.036 user=nm database=postgres host=[local] > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: > > session time: 0:00:06.367 user=nm database=postgres host=[local] > > > > Isn't it possible that pid is cleared in the other order due to which > > we are seeing this problem? > > If the pid is cleared in the other order, wouldn't the query [1] return a > false? > > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE > application_name = 'tap_sub';" > I think it should return true because pid for 16336 is cleared first and the remaining one will be 16475. -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila wrote: > > You have a point but if we see the below logs, it seems the second > walsender (#step6) seemed to exited before the first walsender > (#step4). > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: > session time: 0:00:00.036 user=nm database=postgres host=[local] > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: > session time: 0:00:06.367 user=nm database=postgres host=[local] > > Isn't it possible that pid is cleared in the other order due to which > we are seeing this problem? If the pid is cleared in the other order, wouldn't the query [1] return a false? [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE application_name = 'tap_sub';" regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 7:38 AM Ajin Cherian wrote: > > On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada > wrote: > > > > Luckily these logs have the disconnection times of the tap test client > sessions as well. (not sure why I don't see these when I run these > tests). > > Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063 > 18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT > pid != 16336 FROM pg_stat_replication WHERE application_name = > 'tap_sub'; > : > : > 18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection: > session time: 0:00:00.063 user=nm database=postgres host=[local] > > When the query starts both walsenders are present but when the query > completes both walsenders are gone, the actual query evaluation could > have happened any time in between. This is the rare timing window that > causes this problem. > You have a point but if we see the below logs, it seems the second walsender (#step6) seemed to exited before the first walsender (#step4). 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: session time: 0:00:00.036 user=nm database=postgres host=[local] 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: session time: 0:00:06.367 user=nm database=postgres host=[local] Isn't it possible that pid is cleared in the other order due to which we are seeing this problem? -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada wrote: > > On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian wrote: > > > > On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila > > wrote: > > > > > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada > > > wrote: > > > > > > > > I did a quick check with the following tap test code: > > > > > > > > $node_publisher->poll_query_until('postgres', > > > > qq( > > > > select 1 != foo.column1 from (values(0), (1)) as foo; > > > > )); > > > > > > > > The query returns {t, f} but poll_query_until() never finished. The > > > > same is true when the query returns {f, t}. > > > > > > > > Yes, this is true, I also see the same behaviour. > > > > > > > > This means something different is going on in Ajin's setup. Ajin, can > > > you please share how did you confirm your findings about poll_query? > > > > Relooking at my logs, I think what happens is this: > > > > 1. First walsender 'a' is running. > > 2. Second walsender 'b' starts and attempts at acquiring the slot > > finds that the slot is active for pid a. > > 3. Now both walsenders are active, the query does not return. > > 4. First walsender 'a' times out and exits. > > 5. Now only the second walsender is active and the query returns OK > > because pid != a. > > 6. Second walsender exits with error. > > 7. Another query attempts to get the pid of the running walsender for > > tap_sub but returns null because both walsender exits. > > 8. This null return value results in the next query erroring out and > > the test failing. > > So this is slightly different than what we can see in the topminnow > logs? According to the server logs, step #5 happened (at 18:44:38.016) > before step #4 happened (at 18:44:38.043). > Luckily these logs have the disconnection times of the tap test client sessions as well. (not sure why I don't see these when I run these tests). Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063 18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT pid != 16336 FROM pg_stat_replication WHERE application_name = 'tap_sub'; : : 18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.063 user=nm database=postgres host=[local] When the query starts both walsenders are present but when the query completes both walsenders are gone, the actual query evaluation could have happened any time in between. This is the rare timing window that causes this problem. regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian wrote: > > On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila wrote: > > > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada > > wrote: > > > > > > I did a quick check with the following tap test code: > > > > > > $node_publisher->poll_query_until('postgres', > > > qq( > > > select 1 != foo.column1 from (values(0), (1)) as foo; > > > )); > > > > > > The query returns {t, f} but poll_query_until() never finished. The > > > same is true when the query returns {f, t}. > > > > > Yes, this is true, I also see the same behaviour. > > > > > This means something different is going on in Ajin's setup. Ajin, can > > you please share how did you confirm your findings about poll_query? > > Relooking at my logs, I think what happens is this: > > 1. First walsender 'a' is running. > 2. Second walsender 'b' starts and attempts at acquiring the slot > finds that the slot is active for pid a. > 3. Now both walsenders are active, the query does not return. > 4. First walsender 'a' times out and exits. > 5. Now only the second walsender is active and the query returns OK > because pid != a. > 6. Second walsender exits with error. > 7. Another query attempts to get the pid of the running walsender for > tap_sub but returns null because both walsender exits. > 8. This null return value results in the next query erroring out and > the test failing. So this is slightly different than what we can see in the topminnow logs? According to the server logs, step #5 happened (at 18:44:38.016) before step #4 happened (at 18:44:38.043). > > >Can you additionally check the value of 'state' from > >pg_stat_replication for both the old and new walsender sessions? > > Yes, will try this and post a patch tomorrow. Thanks. I guess the state of the new walsender should be "startup" whereas the old one should be "streaming". Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila wrote: > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada wrote: > > > > I did a quick check with the following tap test code: > > > > $node_publisher->poll_query_until('postgres', > > qq( > > select 1 != foo.column1 from (values(0), (1)) as foo; > > )); > > > > The query returns {t, f} but poll_query_until() never finished. The > > same is true when the query returns {f, t}. > > Yes, this is true, I also see the same behaviour. > > This means something different is going on in Ajin's setup. Ajin, can > you please share how did you confirm your findings about poll_query? Relooking at my logs, I think what happens is this: 1. First walsender 'a' is running. 2. Second walsender 'b' starts and attempts at acquiring the slot finds that the slot is active for pid a. 3. Now both walsenders are active, the query does not return. 4. First walsender 'a' times out and exits. 5. Now only the second walsender is active and the query returns OK because pid != a. 6. Second walsender exits with error. 7. Another query attempts to get the pid of the running walsender for tap_sub but returns null because both walsender exits. 8. This null return value results in the next query erroring out and the test failing. >Can you additionally check the value of 'state' from >pg_stat_replication for both the old and new walsender sessions? Yes, will try this and post a patch tomorrow. regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada wrote: > > I did a quick check with the following tap test code: > > $node_publisher->poll_query_until('postgres', > qq( > select 1 != foo.column1 from (values(0), (1)) as foo; > )); > > The query returns {t, f} but poll_query_until() never finished. The > same is true when the query returns {f, t}. > This means something different is going on in Ajin's setup. Ajin, can you please share how did you confirm your findings about poll_query? -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 5:54 PM Ajin Cherian wrote: > > On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada wrote: > > > > > IIUC the query[1] used for polling returns two rows in this case: {t, > > f} or {f, t}. But did poll_query_until() returned OK in this case even > > if we expected one row of 't'? My guess of how this issue happened is: > > > > 1. the first polling query after "ATLER SUBSCRIPTION CONNECTION" > > passed (for some reason). > > 2. all wal senders exited. > > 3. get the pid of wal sender with application_name 'tap_sub' but got > > nothing. > > 4. the second polling query resulted in a syntax error since $oldpid is > > null. > > > > If the fact that two walsender with the same application_name could > > present in pg_stat_replication view was the cause of this issue, > > poll_query_until() should return OK even if we expected just 't'. I > > might be missing something, though. > > > > [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE > > application_name = '$appname';" > > Yes, the query [1] returns OK with a {f,t} or {t,f} > > [1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = '$appname';" > Can you additionally check the value of 'state' from pg_stat_replication for both the old and new walsender sessions? -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 9:23 PM Amit Kapila wrote: > > On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada wrote: > > > > On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian wrote: > > > > > > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian wrote: > > > > > > > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila > > > > wrote: > > > > > > > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian > > > > > wrote: > > > > > > > > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > > > > > wrote: > > > > > > > > > > > > > But will poll function still poll or exit? Have you tried that? > > > > > > > > > > > > I have forced that condition with a changed query and found that the > > > > > > poll will not exit in case of a NULL return. > > > > > > > > > > > > > > > > What if the query in a poll is fired just before we get an error > > > > > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > > > > > Won't at that stage both old and new walsender's are present, so the > > > > > query might return true. You can check that via debugger by stopping > > > > > just before this error occurs and then check pg_stat_replication view. > > > > > > > > If this error happens then the PID is NOT updated as the pid in the > > > > Replication slot. I have checked this > > > > and explained this in my first email itself > > > > > > > > > > Sorry about the above email, I misunderstood. I was looking at > > > pg_stat_replication_slot rather than pg_stat_replication hence the > > > confusion. > > > Amit is correct, just prior to the walsender erroring out, it briefly > > > appears in the > > > pg_stat_replication, and that is why this error happens. Sorry for the > > > confusion. > > > I just confirmed it, got both the walsenders stopped in the debugger: > > > > > > postgres=# select pid from pg_stat_replication where application_name = > > > 'sub'; > > > pid > > > -- > > > 7899 > > > 7993 > > > (2 rows) > > > > IIUC the query[1] used for polling returns two rows in this case: {t, > > f} or {f, t}. But did poll_query_until() returned OK in this case even > > if we expected one row of 't'? My guess of how this issue happened is: > > > > Yeah, we can check this but I guess as soon as it gets 't', the poll > query will exit. I did a quick check with the following tap test code: $node_publisher->poll_query_until('postgres', qq( select 1 != foo.column1 from (values(0), (1)) as foo; )); The query returns {t, f} but poll_query_until() never finished. The same is true when the query returns {f, t}. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada wrote: > > IIUC the query[1] used for polling returns two rows in this case: {t, > f} or {f, t}. But did poll_query_until() returned OK in this case even > if we expected one row of 't'? My guess of how this issue happened is: > > 1. the first polling query after "ATLER SUBSCRIPTION CONNECTION" > passed (for some reason). > 2. all wal senders exited. > 3. get the pid of wal sender with application_name 'tap_sub' but got nothing. > 4. the second polling query resulted in a syntax error since $oldpid is null. > > If the fact that two walsender with the same application_name could > present in pg_stat_replication view was the cause of this issue, > poll_query_until() should return OK even if we expected just 't'. I > might be missing something, though. > > [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = '$appname';" Yes, the query [1] returns OK with a {f,t} or {t,f} [1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';" regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada wrote: > > On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian wrote: > > > > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian wrote: > > > > > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila > > > wrote: > > > > > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > > > > > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > > > > wrote: > > > > > > > > > > > But will poll function still poll or exit? Have you tried that? > > > > > > > > > > I have forced that condition with a changed query and found that the > > > > > poll will not exit in case of a NULL return. > > > > > > > > > > > > > What if the query in a poll is fired just before we get an error > > > > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > > > > Won't at that stage both old and new walsender's are present, so the > > > > query might return true. You can check that via debugger by stopping > > > > just before this error occurs and then check pg_stat_replication view. > > > > > > If this error happens then the PID is NOT updated as the pid in the > > > Replication slot. I have checked this > > > and explained this in my first email itself > > > > > > > Sorry about the above email, I misunderstood. I was looking at > > pg_stat_replication_slot rather than pg_stat_replication hence the > > confusion. > > Amit is correct, just prior to the walsender erroring out, it briefly > > appears in the > > pg_stat_replication, and that is why this error happens. Sorry for the > > confusion. > > I just confirmed it, got both the walsenders stopped in the debugger: > > > > postgres=# select pid from pg_stat_replication where application_name = > > 'sub'; > > pid > > -- > > 7899 > > 7993 > > (2 rows) > > IIUC the query[1] used for polling returns two rows in this case: {t, > f} or {f, t}. But did poll_query_until() returned OK in this case even > if we expected one row of 't'? My guess of how this issue happened is: > Yeah, we can check this but I guess as soon as it gets 't', the poll query will exit. > 1. the first polling query after "ATLER SUBSCRIPTION CONNECTION" > passed (for some reason). > I think the reason for exit is that we get two rows with the same application_name in pg_stat_replication. > 2. all wal senders exited. > 3. get the pid of wal sender with application_name 'tap_sub' but got nothing. > 4. the second polling query resulted in a syntax error since $oldpid is null. > Your understanding of steps is the same as mine. -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian wrote: > > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian wrote: > > > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila wrote: > > > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > > > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > > > wrote: > > > > > > > > > But will poll function still poll or exit? Have you tried that? > > > > > > > > I have forced that condition with a changed query and found that the > > > > poll will not exit in case of a NULL return. > > > > > > > > > > What if the query in a poll is fired just before we get an error > > > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > > > Won't at that stage both old and new walsender's are present, so the > > > query might return true. You can check that via debugger by stopping > > > just before this error occurs and then check pg_stat_replication view. > > > > If this error happens then the PID is NOT updated as the pid in the > > Replication slot. I have checked this > > and explained this in my first email itself > > > > Sorry about the above email, I misunderstood. I was looking at > pg_stat_replication_slot rather than pg_stat_replication hence the confusion. > Amit is correct, just prior to the walsender erroring out, it briefly > appears in the > pg_stat_replication, and that is why this error happens. Sorry for the > confusion. > I just confirmed it, got both the walsenders stopped in the debugger: > > postgres=# select pid from pg_stat_replication where application_name = 'sub'; > pid > -- > 7899 > 7993 > (2 rows) IIUC the query[1] used for polling returns two rows in this case: {t, f} or {f, t}. But did poll_query_until() returned OK in this case even if we expected one row of 't'? My guess of how this issue happened is: 1. the first polling query after "ATLER SUBSCRIPTION CONNECTION" passed (for some reason). 2. all wal senders exited. 3. get the pid of wal sender with application_name 'tap_sub' but got nothing. 4. the second polling query resulted in a syntax error since $oldpid is null. If the fact that two walsender with the same application_name could present in pg_stat_replication view was the cause of this issue, poll_query_until() should return OK even if we expected just 't'. I might be missing something, though. [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';" Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian wrote: > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila wrote: > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > > wrote: > > > > > > > But will poll function still poll or exit? Have you tried that? > > > > > > I have forced that condition with a changed query and found that the > > > poll will not exit in case of a NULL return. > > > > > > > What if the query in a poll is fired just before we get an error > > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > > Won't at that stage both old and new walsender's are present, so the > > query might return true. You can check that via debugger by stopping > > just before this error occurs and then check pg_stat_replication view. > > If this error happens then the PID is NOT updated as the pid in the > Replication slot. I have checked this > and explained this in my first email itself > Sorry about the above email, I misunderstood. I was looking at pg_stat_replication_slot rather than pg_stat_replication hence the confusion. Amit is correct, just prior to the walsender erroring out, it briefly appears in the pg_stat_replication, and that is why this error happens. Sorry for the confusion. I just confirmed it, got both the walsenders stopped in the debugger: postgres=# select pid from pg_stat_replication where application_name = 'sub'; pid -- 7899 7993 (2 rows) ajin 7896 3326 0 05:22 pts/200:00:00 psql -p 6972 postgres ajin 7897 7882 0 05:22 ?00:00:00 postgres: ajin postgres [local] idle ajin 7899 7882 0 05:22 ?00:00:00 postgres: walsender ajin ::1(37719) START_REPLICATION ajin 7992 3647 0 05:24 ?00:00:00 postgres: logical replication worker for subscription 16426 ajin 7993 7882 0 05:24 ?00:00:00 postgres: walsender ajin ::1(37720) START_REPLICATION regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila wrote: > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > wrote: > > > > > But will poll function still poll or exit? Have you tried that? > > > > I have forced that condition with a changed query and found that the > > poll will not exit in case of a NULL return. > > > > What if the query in a poll is fired just before we get an error > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > Won't at that stage both old and new walsender's are present, so the > query might return true. You can check that via debugger by stopping > just before this error occurs and then check pg_stat_replication view. If this error happens then the PID is NOT updated as the pid in the Replication slot. I have checked this and explained this in my first email itself regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila wrote: > > > But will poll function still poll or exit? Have you tried that? > > I have forced that condition with a changed query and found that the > poll will not exit in case of a NULL return. > What if the query in a poll is fired just before we get an error "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? Won't at that stage both old and new walsender's are present, so the query might return true. You can check that via debugger by stopping just before this error occurs and then check pg_stat_replication view. > > I will try with REL_11_STABLE and see if the behaviour is different. > Okay, but I still suspect some timing issue here. -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila wrote: > But will poll function still poll or exit? Have you tried that? I have forced that condition with a changed query and found that the poll will not exit in case of a NULL return. > Because it is not clear from your explanation how in the first > statement it returns a valid value which leads poll to exit and then > in second statement it returns NULL or maybe nothing. Can you share I don't have an explanation for it either. Maybe things are different in REL_11_STABLE > the log also when you are getting "replication slot "tap_sub" is > active for ..."? If you see in the below log [1], the STATEMENT is > printed twice, I want to see if you also get prints in a similar way > or is it something different? Do you know why it is printed twice? > Yes, I get the same. For every "LOG or ERROR" message, there is the associated STATEMENT message with it. First there is a LOG "received replication command" followed by the STATEMENT and then the ERROR "replication slot .. is active.." followed by the STATEMENT. I will try with REL_11_STABLE and see if the behaviour is different. regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Tue, Aug 24, 2021 at 6:28 PM Ajin Cherian wrote: > > On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila wrote: > > > What happens when there is neither a new walsender nor an old > > walsender is present? It means to run the above statement when a new > > walsender is exited due to error "... slot is active ..." and before a > > new walsender could start. Also, allow old walsender (due to which the > > error occurs) to exit before executing the statement. > > > I tried this, then the query returns a null instead of 'false' because > there is no entry for that application_name. > > postgres=# select pid != 31876 from pg_stat_replication where > application_name = 'sub'; > ?column? > -- > (0 rows) > But will poll function still poll or exit? Have you tried that? Because it is not clear from your explanation how in the first statement it returns a valid value which leads poll to exit and then in second statement it returns NULL or maybe nothing. Can you share the log also when you are getting "replication slot "tap_sub" is active for ..."? If you see in the below log [1], the STATEMENT is printed twice, I want to see if you also get prints in a similar way or is it something different? Do you know why it is printed twice? [1] 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication slot "tap_sub" is active for PID 16336 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila wrote: > What happens when there is neither a new walsender nor an old > walsender is present? It means to run the above statement when a new > walsender is exited due to error "... slot is active ..." and before a > new walsender could start. Also, allow old walsender (due to which the > error occurs) to exit before executing the statement. > I tried this, then the query returns a null instead of 'false' because there is no entry for that application_name. postgres=# select pid != 31876 from pg_stat_replication where application_name = 'sub'; ?column? -- (0 rows) > Also, it seems this failure happens on REL_11_STABLE branch, not sure > if that matters, but it is better to use the same branch if you are > using a different branch to reproduce the issue. > Ok, I didn't realise this. I will try this. regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Tue, Aug 24, 2021 at 11:04 AM Ajin Cherian wrote: > > On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila wrote: > > > The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the > > walsender process. Now, here the problem seems to be that the previous > > walsender process (16336) didn't exit and the new process started to > > use the slot which leads to the error shown in the log. This is > > evident from the below part of log where we can see that 16336 has > > exited after new process started to use the slot. > > > > 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received > > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL > > 0/16BEEB0 (proto_version '1', publication_names > > '"tap_pub","tap_pub_ins_only"') > > 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT: > > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', > > publication_names '"tap_pub","tap_pub_ins_only"') > > 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication > > slot "tap_sub" is active for PID 16336 > > 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT: > > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', > > publication_names '"tap_pub","tap_pub_ins_only"') > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: > > session time: 0:00:00.036 user=nm database=postgres host=[local] > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: > > session time: 0:00:06.367 user=nm database=postgres host=[local] > > > > One idea to solve this is to first disable the subscription, wait for > > the walsender process to exit, and then change the connection string > > and re-enable the subscription. > > I tried to simulate this by putting delays prior to the exit of the > walsender. Even though I see the same error > in the logs that the replication slot is active for the previous PID, > the test case does not fail in the way seen in this case here.. > > The new walsender tries to acquire the slot but seeing that there is > another PID that is active on the slot, it > errors and exits. At no point does this new failed walsender update > its pid for the slot. As a result, the below polled query > would not return a true value > > $node_publisher->poll_query_until('postgres', > "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = 'tap_sub';" > ) or die "Timed out while waiting for apply to restart"; > > In my runs I see that this query is repeated until a new walsender is > able to successfully acquire the slot. > What happens when there is neither a new walsender nor an old walsender is present? It means to run the above statement when a new walsender is exited due to error "... slot is active ..." and before a new walsender could start. Also, allow old walsender (due to which the error occurs) to exit before executing the statement. Also, it seems this failure happens on REL_11_STABLE branch, not sure if that matters, but it is better to use the same branch if you are using a different branch to reproduce the issue. -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila wrote: > The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the > walsender process. Now, here the problem seems to be that the previous > walsender process (16336) didn't exit and the new process started to > use the slot which leads to the error shown in the log. This is > evident from the below part of log where we can see that 16336 has > exited after new process started to use the slot. > > 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL > 0/16BEEB0 (proto_version '1', publication_names > '"tap_pub","tap_pub_ins_only"') > 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', > publication_names '"tap_pub","tap_pub_ins_only"') > 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication > slot "tap_sub" is active for PID 16336 > 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT: > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', > publication_names '"tap_pub","tap_pub_ins_only"') > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: > session time: 0:00:00.036 user=nm database=postgres host=[local] > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: > session time: 0:00:06.367 user=nm database=postgres host=[local] > > One idea to solve this is to first disable the subscription, wait for > the walsender process to exit, and then change the connection string > and re-enable the subscription. I tried to simulate this by putting delays prior to the exit of the walsender. Even though I see the same error in the logs that the replication slot is active for the previous PID, the test case does not fail in the way seen in this case here.. The new walsender tries to acquire the slot but seeing that there is another PID that is active on the slot, it errors and exits. At no point does this new failed walsender update its pid for the slot. As a result, the below polled query would not return a true value $node_publisher->poll_query_until('postgres', "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';" ) or die "Timed out while waiting for apply to restart"; In my runs I see that this query is repeated until a new walsender is able to successfully acquire the slot. I am not able to explain why this query returned true in the topminnow tap test. This would imply that a walsender was able to acquire the slot and update its pid but I don't see how. We also know that if this polled query returns a true (implying a pid other than $oldpid), then the next query in the test is to try and identify the pid: SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub'; >From the topminnow logs we know that this query returned a "NULL", as the value extracted from this is used to formulate the next query which errored out and eventually caused the test case to fail. [16482:3] 001_rep_changes.pl ERROR: syntax error at or near "FROM" at character 16 [16482:4] 001_rep_changes.pl STATEMENT: SELECT pid != FROM pg_stat_replication WHERE application_name = 'tap_sub'; I am not an expert in perl but I looked at the perl function used in the tap test poll_query_until(), this would poll until the query returns a 'true' (unless specified otherwise). I don't see in my tests that the polled function exits if the query returns a NULL. I don't know if differences in the installed perl can cause this difference in behaviour. Can a NULL set be construed as a true and cause the poll to exit? Any suggestions? regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Mon, Aug 16, 2021 at 9:24 AM Michael Paquier wrote: > > Hi all, > > topminnow has just failed in a weird fashion: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-08-15%2013%3A24%3A58 > # SELECT pid != FROM pg_stat_replication WHERE application_name = 'tap_sub'; > # expecting this output: > # t > # last actual query output: > # > # with stderr: > # ERROR: syntax error at or near "FROM" > # LINE 1: SELECT pid != FROM pg_stat_replication WHERE application_na... > > Looking at the logs, it seems like the problem boils down to an active > slot when using ALTER SUBSCRIPTION tap_sub CONNECTION: > 2021-08-15 18:44:38.027 CEST [16473:2] ERROR: could not start WAL > streaming: ERROR: replication slot "tap_sub" is active for PID 16336 > The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the walsender process. Now, here the problem seems to be that the previous walsender process (16336) didn't exit and the new process started to use the slot which leads to the error shown in the log. This is evident from the below part of log where we can see that 16336 has exited after new process started to use the slot. 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication slot "tap_sub" is active for PID 16336 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', publication_names '"tap_pub","tap_pub_ins_only"') 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: session time: 0:00:00.036 user=nm database=postgres host=[local] 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: session time: 0:00:06.367 user=nm database=postgres host=[local] One idea to solve this is to first disable the subscription, wait for the walsender process to exit, and then change the connection string and re-enable the subscription. -- With Regards, Amit Kapila.
Failure of subscription tests with topminnow
Hi all, topminnow has just failed in a weird fashion: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-08-15%2013%3A24%3A58 # SELECT pid != FROM pg_stat_replication WHERE application_name = 'tap_sub'; # expecting this output: # t # last actual query output: # # with stderr: # ERROR: syntax error at or near "FROM" # LINE 1: SELECT pid != FROM pg_stat_replication WHERE application_na... Looking at the logs, it seems like the problem boils down to an active slot when using ALTER SUBSCRIPTION tap_sub CONNECTION: 2021-08-15 18:44:38.027 CEST [16473:2] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is active for PID 16336 There is only one place in 001_rep_changes.pl where this is used. Thoughts? -- Michael signature.asc Description: PGP signature