Re: [HACKERS] More replication race conditions
On Sat, Sep 2, 2017 at 12:03 AM, Alvaro Herrera wrote: > Michael Paquier wrote: >> On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier >> wrote: >> > Today's run has finished with the same failure: >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 >> > Attached is a patch to make this code path wait that the transaction >> > has been replayed. We could use as well synchronous_commit = apply, >> > but I prefer the solution of this patch with a wait query. >> >> I have added an open item for this issue for PG10. The 2PC tests in >> src/test/recovery are new in PG10, introduced by 3082098. > > Thanks, pushed. Thanks, the error has now not happened for 5 runs. Let's see if it ever shows up again. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
Michael Paquier wrote: > On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier > wrote: > > Today's run has finished with the same failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 > > Attached is a patch to make this code path wait that the transaction > > has been replayed. We could use as well synchronous_commit = apply, > > but I prefer the solution of this patch with a wait query. > > I have added an open item for this issue for PG10. The 2PC tests in > src/test/recovery are new in PG10, introduced by 3082098. Thanks, pushed. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On 31 August 2017 at 12:54, Simon Riggs wrote: >> The above-described topic is currently a PostgreSQL 10 open item. Simon, >> since you committed the patch believed to have created it, you own this open >> item. If some other commit is more relevant or if this does not belong as a >> v10 open item, please let us know. Otherwise, please observe the policy on >> open item ownership[1] and send a status update within three calendar days of >> this message. Include a date for your subsequent status update. Testers may >> discover new open items at any time, and I want to plan to get them all fixed >> well in advance of shipping v10. Consequently, I will appreciate your >> efforts >> toward speedy resolution. Thanks. >> >> [1] >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > I acknowledge receipt of the reminder and will fix by end of day tomorrow. Applied. Thanks for the patch, Petr. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On 27 August 2017 at 03:32, Noah Misch wrote: > On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote: >> On 24/08/17 19:54, Tom Lane wrote: >> > sungazer just failed with >> > >> > pg_recvlogical exited with code '256', stdout '' and stderr >> > 'pg_recvlogical: could not send replication command "START_REPLICATION >> > SLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" >> > '1')": ERROR: replication slot "test_slot" is active for PID 8913148 >> > pg_recvlogical: disconnected >> > ' at >> > /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm >> > line 1657. >> > >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 >> > >> > Looks like we're still not there on preventing replication startup >> > race conditions. >> >> Hmm, that looks like "by design" behavior. Slot acquiring will throw >> error if the slot is already used by somebody else (slots use their own >> locking mechanism that does not wait). In this case it seems the >> walsender which was using slot for previous previous step didn't finish >> releasing the slot by the time we start new command. We can work around >> this by changing the test to wait perhaps. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com I acknowledge receipt of the reminder and will fix by end of day tomorrow. -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On Tue, Aug 29, 2017 at 08:44:42PM +0900, Michael Paquier wrote: > On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier > wrote: > > Today's run has finished with the same failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 > > Attached is a patch to make this code path wait that the transaction > > has been replayed. We could use as well synchronous_commit = apply, > > but I prefer the solution of this patch with a wait query. > > I have added an open item for this issue for PG10. The 2PC tests in > src/test/recovery are new in PG10, introduced by 3082098. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Álvaro, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On Sun, Aug 27, 2017 at 02:32:49AM +, Noah Misch wrote: > On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote: > > On 24/08/17 19:54, Tom Lane wrote: > > > sungazer just failed with > > > > > > pg_recvlogical exited with code '256', stdout '' and stderr > > > 'pg_recvlogical: could not send replication command "START_REPLICATION > > > SLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" > > > '1')": ERROR: replication slot "test_slot" is active for PID 8913148 > > > pg_recvlogical: disconnected > > > ' at > > > /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm > > > line 1657. > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 > > > > > > Looks like we're still not there on preventing replication startup > > > race conditions. > > > > Hmm, that looks like "by design" behavior. Slot acquiring will throw > > error if the slot is already used by somebody else (slots use their own > > locking mechanism that does not wait). In this case it seems the > > walsender which was using slot for previous previous step didn't finish > > releasing the slot by the time we start new command. We can work around > > this by changing the test to wait perhaps. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier wrote: > Today's run has finished with the same failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 > Attached is a patch to make this code path wait that the transaction > has been replayed. We could use as well synchronous_commit = apply, > but I prefer the solution of this patch with a wait query. I have added an open item for this issue for PG10. The 2PC tests in src/test/recovery are new in PG10, introduced by 3082098. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On 28/08/17 01:36, Michael Paquier wrote: > On Sun, Aug 27, 2017 at 6:32 PM, Petr Jelinek > wrote: >> Attached should fix this. > > +$node_master->poll_query_until('postgres', > +"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = > 'test_slot' AND active_pid IS NULL)" > +) > + or die "slot never became inactive"; > + > $stdout_recv = $node_master->pg_recvlogical_upto( > I am wondering if a similar check should actually go into > pg_recvlogical_upto() instead. I tend to think that we've learned > things the hard way with 3043c1dd and such. > Putting it there can lead to tests that take forever to finish if written incorrectly. This way they at least just fail. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On Mon, Aug 28, 2017 at 8:33 AM, Tom Lane wrote: > Michael Paquier writes: >> Attached is a patch to make this code path wait that the transaction >> has been replayed. We could use as well synchronous_commit = apply, >> but I prefer the solution of this patch with a wait query. > > Petr proposed a different patch to fix the same problem at > https://www.postgresql.org/message-id/1636c52e-c144-993a-6665-9358f322deda%402ndquadrant.com > > Which one is better? Petr's patch is here to fix the race condition in test 006 for logical decoding. My patch involves the failures that dangomushi sees since yesterday with 2PC test 009. (Note: It took me a while to put this animal in an environment with a more stable connection, so now I hope that we'll get again daily reports) -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On Sun, Aug 27, 2017 at 6:32 PM, Petr Jelinek wrote: > Attached should fix this. +$node_master->poll_query_until('postgres', +"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'test_slot' AND active_pid IS NULL)" +) + or die "slot never became inactive"; + $stdout_recv = $node_master->pg_recvlogical_upto( I am wondering if a similar check should actually go into pg_recvlogical_upto() instead. I tend to think that we've learned things the hard way with 3043c1dd and such. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
Michael Paquier writes: > Attached is a patch to make this code path wait that the transaction > has been replayed. We could use as well synchronous_commit = apply, > but I prefer the solution of this patch with a wait query. Petr proposed a different patch to fix the same problem at https://www.postgresql.org/message-id/1636c52e-c144-993a-6665-9358f322deda%402ndquadrant.com Which one is better? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On Sun, Aug 27, 2017 at 3:34 PM, Michael Paquier wrote: > On Sun, Aug 27, 2017 at 12:03 PM, Tom Lane wrote: >> contains exactly no means of ensuring that the master's transaction has >> been replayed on the standby before we check for its results. It's not >> real clear why it seems to work 99.99% of the time, because, well, there >> isn't any frickin' interlock there ... > > I have noticed this one this morning, and I am planning to address it > with a proper patch soonishly. (I am still fighting a bit to get > dangomushi in a more stable stable, and things run slow on it, so it > is good at catching race conditions of this kind). Today's run has finished with the same failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 Attached is a patch to make this code path wait that the transaction has been replayed. We could use as well synchronous_commit = apply, but I prefer the solution of this patch with a wait query. -- Michael tap-2pc-fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On 27/08/17 04:32, Noah Misch wrote: > On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote: >> On 24/08/17 19:54, Tom Lane wrote: >>> sungazer just failed with >>> >>> pg_recvlogical exited with code '256', stdout '' and stderr >>> 'pg_recvlogical: could not send replication command "START_REPLICATION SLOT >>> "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": >>> ERROR: replication slot "test_slot" is active for PID 8913148 >>> pg_recvlogical: disconnected >>> ' at >>> /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm >>> line 1657. >>> >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 >>> >>> Looks like we're still not there on preventing replication startup >>> race conditions. >> >> Hmm, that looks like "by design" behavior. Slot acquiring will throw >> error if the slot is already used by somebody else (slots use their own >> locking mechanism that does not wait). In this case it seems the >> walsender which was using slot for previous previous step didn't finish >> releasing the slot by the time we start new command. We can work around >> this by changing the test to wait perhaps. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > Attached should fix this. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services >From d66caa3532558828ecb89b4f153cc1f29332b918 Mon Sep 17 00:00:00 2001 From: Petr Jelinek Date: Sun, 27 Aug 2017 11:28:49 +0200 Subject: [PATCH] Fix race condition in logical decoding tap test --- src/test/recovery/t/006_logical_decoding.pl | 5 + 1 file changed, 5 insertions(+) diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl index 4a90e9a..8b35bc8 100644 --- a/src/test/recovery/t/006_logical_decoding.pl +++ b/src/test/recovery/t/006_logical_decoding.pl @@ -78,6 +78,11 @@ chomp($stdout_recv); is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session'); +$node_master->poll_query_until('postgres', +"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'test_slot' AND active_pid IS NULL)" +) + or die "slot never became inactive"; + $stdout_recv = $node_master->pg_recvlogical_upto( 'postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', -- 2.7.4 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On Sun, Aug 27, 2017 at 12:03 PM, Tom Lane wrote: > And *another* replication test race condition just now: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-26%2019%3A37%3A08 > > As best I can interpret this, it's pointing out that this bit in > src/test/recovery/t/009_twophase.pl: > > $cur_master->psql( > 'postgres', " > BEGIN; > CREATE TABLE t_009_tbl2 (id int, msg text); > SAVEPOINT s1; > INSERT INTO t_009_tbl2 VALUES (27, 'issued to ${cur_master_name}'); > PREPARE TRANSACTION 'xact_009_13'; > -- checkpoint will issue XLOG_STANDBY_LOCK that can conflict with lock > -- held by 'create table' statement > CHECKPOINT; > COMMIT PREPARED 'xact_009_13';"); > > $cur_standby->psql( > 'postgres', > "SELECT count(*) FROM t_009_tbl2", > stdout => \$psql_out); > is($psql_out, '1', "Replay prepared transaction with DDL"); > > contains exactly no means of ensuring that the master's transaction has > been replayed on the standby before we check for its results. It's not > real clear why it seems to work 99.99% of the time, because, well, there > isn't any frickin' interlock there ... I have noticed this one this morning, and I am planning to address it with a proper patch soonishly. (I am still fighting a bit to get dangomushi in a more stable stable, and things run slow on it, so it is good at catching race conditions of this kind). -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
And *another* replication test race condition just now: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-26%2019%3A37%3A08 As best I can interpret this, it's pointing out that this bit in src/test/recovery/t/009_twophase.pl: $cur_master->psql( 'postgres', " BEGIN; CREATE TABLE t_009_tbl2 (id int, msg text); SAVEPOINT s1; INSERT INTO t_009_tbl2 VALUES (27, 'issued to ${cur_master_name}'); PREPARE TRANSACTION 'xact_009_13'; -- checkpoint will issue XLOG_STANDBY_LOCK that can conflict with lock -- held by 'create table' statement CHECKPOINT; COMMIT PREPARED 'xact_009_13';"); $cur_standby->psql( 'postgres', "SELECT count(*) FROM t_009_tbl2", stdout => \$psql_out); is($psql_out, '1', "Replay prepared transaction with DDL"); contains exactly no means of ensuring that the master's transaction has been replayed on the standby before we check for its results. It's not real clear why it seems to work 99.99% of the time, because, well, there isn't any frickin' interlock there ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote: > On 24/08/17 19:54, Tom Lane wrote: > > sungazer just failed with > > > > pg_recvlogical exited with code '256', stdout '' and stderr > > 'pg_recvlogical: could not send replication command "START_REPLICATION SLOT > > "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": > > ERROR: replication slot "test_slot" is active for PID 8913148 > > pg_recvlogical: disconnected > > ' at > > /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm > > line 1657. > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 > > > > Looks like we're still not there on preventing replication startup > > race conditions. > > Hmm, that looks like "by design" behavior. Slot acquiring will throw > error if the slot is already used by somebody else (slots use their own > locking mechanism that does not wait). In this case it seems the > walsender which was using slot for previous previous step didn't finish > releasing the slot by the time we start new command. We can work around > this by changing the test to wait perhaps. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Simon, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More replication race conditions
On 24/08/17 19:54, Tom Lane wrote: > sungazer just failed with > > pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: > could not send replication command "START_REPLICATION SLOT "test_slot" > LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR: > replication slot "test_slot" is active for PID 8913148 > pg_recvlogical: disconnected > ' at > /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm > line 1657. > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 > > Looks like we're still not there on preventing replication startup > race conditions. Hmm, that looks like "by design" behavior. Slot acquiring will throw error if the slot is already used by somebody else (slots use their own locking mechanism that does not wait). In this case it seems the walsender which was using slot for previous previous step didn't finish releasing the slot by the time we start new command. We can work around this by changing the test to wait perhaps. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers