Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi,

On 2022-02-22 17:48:55 -0800, Andres Freund wrote:
> I think the minor changes might unfortunately have introduced a race? Before
> the patch just used ConditionVariableSleep(), but now it also has a
> ConditionVariablePrepareToSleep(). Without re-checking the sleep condition
> until
>
> 			/* Wait until the slot is released. */
> 			ConditionVariableSleep(&s->active_cv,
> 								   WAIT_EVENT_REPLICATION_SLOT_DROP);
>
> which directly violates what ConditionVariablePrepareToSleep() documents:
>
>  * This can optionally be called before entering a test/sleep loop.
>  * Doing so is more efficient if we'll need to sleep at least once.
>  * However, if the first test of the exit condition is likely to succeed,
>  * it's more efficient to omit the ConditionVariablePrepareToSleep call.
>  * See comments in ConditionVariableSleep for more detail.
>  *
>  * Caution: "before entering the loop" means you *must* test the exit
>  * condition between calling ConditionVariablePrepareToSleep and calling
>  * ConditionVariableSleep.  If that is inconvenient, omit calling
>  * ConditionVariablePrepareToSleep.
>
> Afaics this means we can potentially sleep forever if the prior owner of the
> slot releases it before the ConditionVariablePrepareToSleep().
>
> There's a comment that's mentioning this danger:
>
> 			/*
> 			 * Prepare the sleep on the slot's condition variable before
> 			 * releasing the lock, to close a possible race condition if the
> 			 * slot is released before the sleep below.
> 			 */
> 			ConditionVariablePrepareToSleep(&s->active_cv);
>
> 			LWLockRelease(ReplicationSlotControlLock);
>
> but afaics that is bogus, because releasing a slot doesn't take
> ReplicationSlotControlLock. That just protects against the slot being
> dropped, not against it being released.
>
> We can ConditionVariablePrepareToSleep() here, but we'd have to do it
> earlier, before the checks at the start of the while loop.

Not at the start of the while loop, outside of the while loop.

Doing it in the loop body doesn't make sense, even if it's at the top. Each
ConditionVariablePrepareToSleep() will unregister itself:

	/*
	 * If some other sleep is already prepared, cancel it; this is necessary
	 * because we have just one static variable tracking the prepared sleep,
	 * and also only one cvWaitLink in our PGPROC.  It's okay to do this
	 * because whenever control does return to the other test-and-sleep loop,
	 * its ConditionVariableSleep call will just re-establish that sleep as
	 * the prepared one.
	 */
	if (cv_sleep_target != NULL)
		ConditionVariableCancelSleep();

The intended use is documented in this comment:

 * This should be called in a predicate loop that tests for a specific exit
 * condition and otherwise sleeps, like so:
 *
 *	 ConditionVariablePrepareToSleep(cv);  // optional
 *	 while (condition for which we are waiting is not true)
 *		 ConditionVariableSleep(cv, wait_event_info);
 *	 ConditionVariableCancelSleep();

Greetings,

Andres Freund
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi,

On 2021-06-11 12:27:57 -0400, Álvaro Herrera wrote:
> On 2021-Jun-10, Álvaro Herrera wrote:
>
> > Here's a version that I feel is committable (0001).  There was an issue
> > when returning from the inner loop, if in a previous iteration we had
> > released the lock.  In that case we need to return with the lock not
> > held, so that the caller can acquire it again, but weren't.  This seems
> > pretty hard to hit in practice (I suppose somebody needs to destroy the
> > slot just as checkpointer killed the walsender, but before checkpointer
> > marks it as its own) ... but if it did happen, I think checkpointer
> > would block with no recourse.  Also added some comments and slightly
> > restructured the code.
> >
> > There are plenty of conflicts in pg13, but it's all easy to handle.
>
> Pushed, with additional minor changes.

I stared at this code, due to [1], and I think I found a bug. I think it's
not the cause of the failures in that thread, but we probably should still
do something about it.

I think the minor changes might unfortunately have introduced a race? Before
the patch just used ConditionVariableSleep(), but now it also has a
ConditionVariablePrepareToSleep(). Without re-checking the sleep condition
until

			/* Wait until the slot is released. */
			ConditionVariableSleep(&s->active_cv,
								   WAIT_EVENT_REPLICATION_SLOT_DROP);

which directly violates what ConditionVariablePrepareToSleep() documents:

 * This can optionally be called before entering a test/sleep loop.
 * Doing so is more efficient if we'll need to sleep at least once.
 * However, if the first test of the exit condition is likely to succeed,
 * it's more efficient to omit the ConditionVariablePrepareToSleep call.
 * See comments in ConditionVariableSleep for more detail.
 *
 * Caution: "before entering the loop" means you *must* test the exit
 * condition between calling ConditionVariablePrepareToSleep and calling
 * ConditionVariableSleep.  If that is inconvenient, omit calling
 * ConditionVariablePrepareToSleep.

Afaics this means we can potentially sleep forever if the prior owner of the
slot releases it before the ConditionVariablePrepareToSleep().

There's a comment that's mentioning this danger:

			/*
			 * Prepare the sleep on the slot's condition variable before
			 * releasing the lock, to close a possible race condition if the
			 * slot is released before the sleep below.
			 */
			ConditionVariablePrepareToSleep(&s->active_cv);

			LWLockRelease(ReplicationSlotControlLock);

but afaics that is bogus, because releasing a slot doesn't take
ReplicationSlotControlLock. That just protects against the slot being
dropped, not against it being released.

We can ConditionVariablePrepareToSleep() here, but we'd have to do it
earlier, before the checks at the start of the while loop.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20220218231415.c4plkp4i3reqcwip%40alap3.anarazel.de
Re: Race condition in InvalidateObsoleteReplicationSlots()
On Mon, Jul 5, 2021 at 10:30 PM Álvaro Herrera wrote:
>
> On 2021-Jul-05, vignesh C wrote:
>
> > On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera wrote:
> > >
> > > On 2021-Jun-20, Tom Lane wrote:
> > >
> > > > Actually ... isn't there a second race, in the opposite direction?
> > > > IIUC, the point of this is that once we force some WAL to be sent
> > > > to the frozen sender/receiver, they'll be killed for failure to
> > > > respond.  But the advance_wal call is not the only possible cause
> > > > of that; a background autovacuum for example could emit some WAL.
> > > > So I fear it's possible for the 'to release replication slot'
> > > > message to come out before we capture $logstart.  I think you
> > > > need to capture that value before the kill not after.
> > >
> > > I accounted for all those things and pushed again.
> >
> > I saw that this patch is pushed. If there is no pending work left for
> > this, can we change the commitfest entry to Committed.
>
> There is none that I'm aware of, please mark it committed.  Thanks

Thanks for confirming, I have marked it as committed.

Regards,
Vignesh
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jul-05, vignesh C wrote:

> On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera wrote:
> >
> > On 2021-Jun-20, Tom Lane wrote:
> >
> > > Actually ... isn't there a second race, in the opposite direction?
> > > IIUC, the point of this is that once we force some WAL to be sent
> > > to the frozen sender/receiver, they'll be killed for failure to
> > > respond.  But the advance_wal call is not the only possible cause
> > > of that; a background autovacuum for example could emit some WAL.
> > > So I fear it's possible for the 'to release replication slot'
> > > message to come out before we capture $logstart.  I think you
> > > need to capture that value before the kill not after.
> >
> > I accounted for all those things and pushed again.
>
> I saw that this patch is pushed. If there is no pending work left for
> this, can we change the commitfest entry to Committed.

There is none that I'm aware of, please mark it committed.  Thanks

-- 
Álvaro Herrera       PostgreSQL Developer — https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
Re: Race condition in InvalidateObsoleteReplicationSlots()
On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera wrote:
>
> On 2021-Jun-20, Tom Lane wrote:
>
> > Actually ... isn't there a second race, in the opposite direction?
> > IIUC, the point of this is that once we force some WAL to be sent
> > to the frozen sender/receiver, they'll be killed for failure to
> > respond.  But the advance_wal call is not the only possible cause
> > of that; a background autovacuum for example could emit some WAL.
> > So I fear it's possible for the 'to release replication slot'
> > message to come out before we capture $logstart.  I think you
> > need to capture that value before the kill not after.
>
> I accounted for all those things and pushed again.

I saw that this patch is pushed. If there is no pending work left for
this, can we change the commitfest entry to Committed.

Regards,
Vignesh
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jun-20, Tom Lane wrote:

> Actually ... isn't there a second race, in the opposite direction?
> IIUC, the point of this is that once we force some WAL to be sent
> to the frozen sender/receiver, they'll be killed for failure to
> respond.  But the advance_wal call is not the only possible cause
> of that; a background autovacuum for example could emit some WAL.
> So I fear it's possible for the 'to release replication slot'
> message to come out before we capture $logstart.  I think you
> need to capture that value before the kill not after.

I accounted for all those things and pushed again.

-- 
Álvaro Herrera       Valdivia, Chile
"I can see support will not be a problem.  10 out of 10."    (Simon Wittber)
      (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
Re: Race condition in InvalidateObsoleteReplicationSlots()
I wrote:
> I studied this failure a bit more, and I think the test itself has
> a race condition.  It's doing
>
> # freeze walsender and walreceiver. Slot will still be active, but walreceiver
> # won't get anything anymore.
> kill 'STOP', $senderpid, $receiverpid;
> $logstart = get_log_size($node_primary3);
> advance_wal($node_primary3, 4);
> ok(find_in_log($node_primary3, "to release replication slot", $logstart),
>	"walreceiver termination logged");

Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond.  But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart.  I think you
need to capture that value before the kill not after.

I also suggest that it wouldn't be a bad idea to make the find_in_log
check more specific, by including the expected PID and perhaps the
expected slot name in the string.  The full message in primary3's log
looks like

2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG:  terminating process 3548959 to release replication slot "rep3"

and I don't understand why we wouldn't match on the whole message text.
(I think doing so will also reveal that what we are looking for here is
the walsender pid, not the walreceiver pid, and thus that the description
in the ok() call is backwards.  Or maybe we do want to check the
walreceiver side, in which case we are searching the wrong postmaster's
log?)

			regards, tom lane
Re: Race condition in InvalidateObsoleteReplicationSlots()
I wrote:
> Hmm ... desmoxytes has failed this test once, out of four runs since
> it went in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I studied this failure a bit more, and I think the test itself has
a race condition.  It's doing

# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
$logstart = get_log_size($node_primary3);
advance_wal($node_primary3, 4);
ok(find_in_log($node_primary3, "to release replication slot", $logstart),
	"walreceiver termination logged");

The string it's looking for does show up in node_primary3's log, but not
for another second or so; we can see instances of the following
poll_query_until query before that happens.  So the problem is that
there is no interlock to ensure that the walreceiver terminates before
this find_in_log check looks for it.

You should be able to fix this by adding a retry loop around the
find_in_log check (which would likely mean that you don't need to do
multiple advance_wal iterations here).  However, I agree with reverting
the test for now and then trying again after beta2.

			regards, tom lane
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hah, desmoxytes failed once:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I'll revert it and investigate later.  There have been no other failures.

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)
Re: Race condition in InvalidateObsoleteReplicationSlots()
Álvaro Herrera writes:
> It occurred to me that this could be made better by sigstopping both
> walreceiver and walsender, then letting only the latter run; AFAICS this
> makes the test stable.  I'll register this on the upcoming commitfest to
> let cfbot run it, and if it looks good there I'll get it pushed to
> master.  If there's any problem I'll just remove it before beta2 is
> stamped.

Hmm ... desmoxytes has failed this test once, out of four runs since
it went in:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

None of the other animals that have reported in so far are unhappy.
Still, maybe that's not a track record we want to have for beta2?

I've just launched a run on gaur, which given its dinosaur status might
be the most likely animal to have an actual portability problem with
this test technique.  If you want to wait a few hours to see what it
says, that'd be fine with me.

			regards, tom lane
Re: Race condition in InvalidateObsoleteReplicationSlots()
Apologies, I inadvertently sent the version before I added a maximum
number of iterations in the final loop.

-- 
Álvaro Herrera       Valdivia, Chile
"La fuerza no está en los medios físicos sino que reside en una voluntad
indomable" (Gandhi)

>From 1492e9468ecd86167a1253c4a2e9b31139835649 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 10 Jun 2021 16:44:03 -0400
Subject: [PATCH v4] Add test case for obsoleting slot with active walsender
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The code to signal a running walsender when its reserved WAL size grows
too large is completely uncovered before this commit; this adds coverage
for that case.

This test involves sending SIGSTOP to walsender and walreceiver and
running a checkpoint while advancing WAL, then sending SIGCONT.  There's
no precedent for this coding in Perl tests, and my reading of relevant
manpages says it's likely to fail on Windows.  Because of this, this
test is always skipped on that platform.

Author: Álvaro Herrera
Discussion: https://postgr.es/m/202106102202.mjw4huiix7lo@alvherre.pgsql
---
 src/test/recovery/t/019_replslot_limit.pl | 84 ++-
 1 file changed, 81 insertions(+), 3 deletions(-)

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7094aa0704..52bc9f8cfd 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -11,7 +11,7 @@ use TestLib;
 use PostgresNode;
 use File::Path qw(rmtree);
 
-use Test::More tests => 14;
+use Test::More tests => $TestLib::windows_os ? 14 : 18;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -211,8 +211,8 @@ for (my $i = 0; $i < 1; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_primary->stop('immediate');
-$node_standby->stop('immediate');
+$node_primary->stop;
+$node_standby->stop;
 
 my $node_primary2 = get_new_node('primary2');
 $node_primary2->init(allows_streaming => 1);
@@ -253,6 +253,84 @@ my @result =
 		timeout => '60'));
 is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
+$node_primary2->stop;
+$node_standby->stop;
+
+# The next test depends on Perl's `kill`, which apparently is not
+# portable to Windows.  (It would be nice to use Test::More's `subtest`,
+# but that's not in the ancient version we require.)
+if ($TestLib::windows_os)
+{
+	done_testing();
+	exit;
+}
+
+# Get a slot terminated while the walsender is active
+# We do this by sending SIGSTOP to the walsender.  Skip this on Windows.
+my $node_primary3 = get_new_node('primary3');
+$node_primary3->init(allows_streaming => 1, extra => ['--wal-segsize=1']);
+$node_primary3->append_conf(
+	'postgresql.conf', qq(
+	min_wal_size = 2MB
+	max_wal_size = 2MB
+	log_checkpoints = yes
+	max_slot_wal_keep_size = 1MB
+	));
+$node_primary3->start;
+$node_primary3->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep3')");
+# Take backup
+$backup_name = 'my_backup';
+$node_primary3->backup($backup_name);
+# Create standby
+my $node_standby3 = get_new_node('standby_3');
+$node_standby3->init_from_backup($node_primary3, $backup_name,
+	has_streaming => 1);
+$node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
+$node_standby3->start;
+$node_primary3->wait_for_catchup($node_standby3->name, 'replay');
+my $senderpid = $node_primary3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
+like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");
+my $receiverpid = $node_standby3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walreceiver'");
+like($receiverpid, qr/^[0-9]+$/, "have walreceiver pid $receiverpid");
+
+# freeze walsender and walreceiver. Slot will still be active, but walreceiver
+# won't get anything anymore.
+kill 'STOP', $senderpid, $receiverpid;
+$logstart = get_log_size($node_primary3);
+advance_wal($node_primary3, 4);
+ok(find_in_log($node_primary3, "to release replication slot", $logstart),
+	"walreceiver termination logged");
+
+# Now let the walsender continue; slot should be killed now.
+# (Must not let walreceiver run yet; otherwise the standby could start another
+# one before the slot can be killed)
+kill 'CONT', $senderpid;
+$node_primary3->poll_query_until('postgres',
+	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'",
+	"lost")
+  or die "timed out waiting for slot to be lost";
+
+my $max_attempts = 180;
+while ($max_attempts-- > 0)
+{
+	if (find_in_log($node_primary3,
+		'invalidating slot "rep3" because its restart_lsn', $logstart))
+	{
+		ok(1, "slot invalidation logged");
+		last;
+	}
+	sleep 1;
+}
+
+# Now let the walreceiver continue, so that the node can be stopped cleanly
+kill 'CONT', $receiverpid;
+
+$node_primary3->stop;
+$node_standby3->stop;
+
 #
 # Advance WAL of $node by $n segments
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jun-11, Álvaro Herrera wrote:

> I tried hard to make this stable, but it just isn't (it works fine one
> thousand runs, then I grab some coffee and run it once more and that one
> fails.  Why? that's not clear to me).  Attached is the last one I have,
> in case somebody wants to make it better.  Maybe there's some completely
> different approach that works better, but I'm out of ideas for now.

It occurred to me that this could be made better by sigstopping both
walreceiver and walsender, then letting only the latter run; AFAICS this
makes the test stable.  I'll register this on the upcoming commitfest to
let cfbot run it, and if it looks good there I'll get it pushed to
master.  If there's any problem I'll just remove it before beta2 is
stamped.

-- 
Álvaro Herrera       Valdivia, Chile

>From 8080ced6b3c807039ff9a66810fa661fae19b347 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 10 Jun 2021 16:44:03 -0400
Subject: [PATCH v3] Add test case for obsoleting slot with active walsender
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The code to signal a running walsender when its reserved WAL size grows
too large is completely uncovered before this commit; this adds coverage
for that case.

This test involves sending SIGSTOP to walsender and walreceiver and
running a checkpoint while advancing WAL, then sending SIGCONT.  There's
no precedent for this coding in Perl tests, and my reading of relevant
manpages says it's likely to fail on Windows.  Because of this, this
test is always skipped on that platform.

Author: Álvaro Herrera
Discussion: https://postgr.es/m/202106102202.mjw4huiix7lo@alvherre.pgsql
---
 src/test/recovery/t/019_replslot_limit.pl | 82 ++-
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7094aa0704..2e6c5a229e 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -11,7 +11,7 @@ use TestLib;
 use PostgresNode;
 use File::Path qw(rmtree);
 
-use Test::More tests => 14;
+use Test::More tests => $TestLib::windows_os ? 14 : 18;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -211,8 +211,8 @@ for (my $i = 0; $i < 1; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_primary->stop('immediate');
-$node_standby->stop('immediate');
+$node_primary->stop;
+$node_standby->stop;
 
 my $node_primary2 = get_new_node('primary2');
 $node_primary2->init(allows_streaming => 1);
@@ -253,6 +253,82 @@ my @result =
 		timeout => '60'));
 is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
+$node_primary2->stop;
+$node_standby->stop;
+
+# The next test depends on Perl's `kill`, which apparently is not
+# portable to Windows.  (It would be nice to use Test::More's `subtest`,
+# but that's not in the ancient version we require.)
+if ($TestLib::windows_os)
+{
+	done_testing();
+	exit;
+}
+
+# Get a slot terminated while the walsender is active
+# We do this by sending SIGSTOP to the walsender.  Skip this on Windows.
+my $node_primary3 = get_new_node('primary3');
+$node_primary3->init(allows_streaming => 1, extra => ['--wal-segsize=1']);
+$node_primary3->append_conf(
+	'postgresql.conf', qq(
+	min_wal_size = 2MB
+	max_wal_size = 2MB
+	log_checkpoints = yes
+	max_slot_wal_keep_size = 1MB
+	));
+$node_primary3->start;
+$node_primary3->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep3')");
+# Take backup
+$backup_name = 'my_backup';
+$node_primary3->backup($backup_name);
+# Create standby
+my $node_standby3 = get_new_node('standby_3');
+$node_standby3->init_from_backup($node_primary3, $backup_name,
+	has_streaming => 1);
+$node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
+$node_standby3->start;
+$node_primary3->wait_for_catchup($node_standby3->name, 'replay');
+my $senderpid = $node_primary3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
+like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");
+my $receiverpid = $node_standby3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walreceiver'");
+like($receiverpid, qr/^[0-9]+$/, "have walreceiver pid $receiverpid");
+
+# freeze walsender and walreceiver. Slot will still be active, but walreceiver
+# won't get anything anymore.
+kill 'STOP', $senderpid, $receiverpid;
+$logstart = get_log_size($node_primary3);
+advance_wal($node_primary3, 4);
+ok(find_in_log($node_primary3, "to release replication slot", $logstart),
+	"walreceiver termination logged");
+
+# Now let the walsender continue; slot should be killed now.
+# (Must not let walreceiver run yet; otherwise the standby could start another
+# one before the slot can be killed)
+kill 'CONT', $senderpid;
+$node_primary3->poll_query_until('postgres',
+	"SELECT wal_status FROM
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-06-11 15:52:21 -0400, Álvaro Herrera wrote:
> On 2021-Apr-07, Andres Freund wrote:
>
> > After this I don't see a reason to have SAB_Inquire - as far as I can
> > tell it's practically impossible to use without race conditions? Except
> > for raising an error - which is "builtin"...
>
> Pushed 0002.
>
> Thanks!

Thank you for your work on this!
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Apr-07, Andres Freund wrote:

> After this I don't see a reason to have SAB_Inquire - as far as I can
> tell it's practically impossible to use without race conditions? Except
> for raising an error - which is "builtin"...

Pushed 0002.

Thanks!

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"La persona que no quería pecar / estaba obligada a sentarse
en duras y empinadas sillas / desprovistas, por cierto
de blandos atenuantes" (Patricio Vogel)
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jun-10, Álvaro Herrera wrote:

> Here's a version that I feel is committable (0001).  There was an issue
> when returning from the inner loop, if in a previous iteration we had
> released the lock.  In that case we need to return with the lock not
> held, so that the caller can acquire it again, but weren't.  This seems
> pretty hard to hit in practice (I suppose somebody needs to destroy the
> slot just as checkpointer killed the walsender, but before checkpointer
> marks it as its own) ... but if it did happen, I think checkpointer
> would block with no recourse.  Also added some comments and slightly
> restructured the code.
>
> There are plenty of conflicts in pg13, but it's all easy to handle.

Pushed, with additional minor changes.

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running).  There are some sharp edges in this code -- I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously sent a SIGSTOP to the walreceiver.  Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is.  Anyway, the code
> becomes covered with this.  I would like to at least see it in master,
> to gather some reactions from buildfarm.

I tried hard to make this stable, but it just isn't (it works fine one
thousand runs, then I grab some coffee and run it once more and that one
fails.  Why? that's not clear to me).  Attached is the last one I have,
in case somebody wants to make it better.  Maybe there's some completely
different approach that works better, but I'm out of ideas for now.
-- 
Álvaro Herrera       Valdivia, Chile
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

>From 6b6cb174452c553437ba7949aa25f305f599d6b7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Fri, 11 Jun 2021 12:21:45 -0400
Subject: [PATCH v3] Add test case for invalidating an active replication slot
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The code to signal a running walsender was completely uncovered before.
This test involves sending SIGSTOP to a walreceiver and running a
checkpoint while advancing WAL.

I'm not very certain that this test is stable, so it's in master only,
and separate from the code-fix commit so that it can be reverted easily
if need be.

Author: Álvaro Herrera
Discussion: https://postgr.es/m/202106102202.mjw4huiix7lo@alvherre.pgsql
---
 src/test/recovery/t/019_replslot_limit.pl | 79 ++-
 1 file changed, 76 insertions(+), 3 deletions(-)

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7094aa0704..dcadfe1252 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -11,7 +11,7 @@ use TestLib;
 use PostgresNode;
 use File::Path qw(rmtree);
 
-use Test::More tests => 14;
+use Test::More tests => $TestLib::windows_os ? 14 : 17;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -211,8 +211,8 @@ for (my $i = 0; $i < 1; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_primary->stop('immediate');
-$node_standby->stop('immediate');
+$node_primary->stop;
+$node_standby->stop;
 
 my $node_primary2 = get_new_node('primary2');
 $node_primary2->init(allows_streaming => 1);
@@ -253,6 +253,79 @@ my @result =
 		timeout => '60'));
 is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
+$node_primary2->stop;
+$node_standby->stop;
+
+# The next test depends on Perl's `kill`, which apparently is not
+# portable to Windows.  (It would be nice to use Test::More's `subtest`,
+# but that's not in the ancient version we require.)
+if ($TestLib::windows_os)
+{
+	done_testing();
+	exit;
+}
+
+# Get a slot terminated while the walsender is active
+# We do this by sending SIGSTOP to the walreceiver.  Skip this on Windows.
+my $node_primary3 = get_new_node('primary3');
+$node_primary3->init(allows_streaming => 1, extra => ['--wal-segsize=1']);
+$node_primary3->append_conf(
+	'postgresql.conf', qq(
+	min_wal_size = 2MB
+	max_wal_size = 2MB
+	log_checkpoints = yes
+	max_slot_wal_keep_size = 1MB
+	));
+$node_primary3->start;
+$node_primary3->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep3')");
+# Take backup
+$backup_name = 'my_backup';
+$node_primary3->backup($backup_name);
+# Create standby
+my $node_standby3 = get_new_node('standby_3');
+$node_standby3->init_from_backup($node_primary3, $backup_name,
+	has_streaming => 1);
+$node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
+$node_standby3->start;
+$node_primary3->wait_for_catchup($node_standby3->name, 'replay');
+my
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jun-10, Álvaro Herrera wrote: > I wrote a test (0002) to cover the case of signalling a walsender, which > is currently not covered (we only deal with the case of a standby that's > not running). There are some sharp edges in this code -- I had to make > it use background_psql() to send a CHECKPOINT, which hangs, because I > previously send a SIGSTOP to the walreceiver. Maybe there's a better > way to achieve a walreceiver that remains connected but doesn't consume > input from the primary, but I don't know what it is. Anyway, the code > becomes covered with this. I would like to at least see it in master, > to gather some reactions from buildfarm. Small fixup to the test one, so that skipping it on Windows works correctly. -- Álvaro Herrera39°49'30"S 73°17'W Voy a acabar con todos los humanos / con los humanos yo acabaré voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender) >From b5909d87a2252181fe55caecaf46d9544e05e47f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 10 Jun 2021 16:43:43 -0400 Subject: [PATCH v2 1/2] the code fix --- src/backend/replication/slot.c | 236 + 1 file changed, 149 insertions(+), 87 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index c88b803e5d..a8bda68b21 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1160,6 +1160,152 @@ ReplicationSlotReserveWal(void) } } +/* + * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot + * and mark it invalid, if necessary and possible. + * + * Returns whether ReplicationSlotControlLock was released (and in that case, + * we're not holding the lock at return; otherwise we are). + * + * This is inherently racy, because we want to release the + * LWLock for syscalls, so caller must restart if we return true. 
+ */
+static bool
+InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
+{
+	int			last_signaled_pid = 0;
+	bool		released_lock = false;
+
+	for (;;)
+	{
+		XLogRecPtr	restart_lsn;
+		NameData	slotname;
+		int			active_pid = 0;
+
+		Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+		if (!s->in_use)
+		{
+			if (released_lock)
+				LWLockRelease(ReplicationSlotControlLock);
+			break;
+		}
+
+		/*
+		 * Check if the slot needs to be invalidated. If it needs to be
+		 * invalidated, and is not currently acquired, acquire it and mark it
+		 * as having been invalidated.  We do this with the spinlock held to
+		 * avoid race conditions -- for example the restart_lsn could move
+		 * forward, or the slot could be dropped.
+		 */
+		SpinLockAcquire(&s->mutex);
+
+		restart_lsn = s->data.restart_lsn;
+
+		/*
+		 * If the slot is already invalid or is fresh enough, we don't need to
+		 * do anything.
+		 */
+		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
+		{
+			SpinLockRelease(&s->mutex);
+			if (released_lock)
+				LWLockRelease(ReplicationSlotControlLock);
+			break;
+		}
+
+		slotname = s->data.name;
+		active_pid = s->active_pid;
+
+		/*
+		 * If the slot can be acquired, do so and mark it invalidated
+		 * immediately.  Otherwise we'll signal the owning process, below, and
+		 * retry.
+		 */
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+			s->data.invalidated_at = restart_lsn;
+			s->data.restart_lsn = InvalidXLogRecPtr;
+		}
+
+		SpinLockRelease(&s->mutex);
+
+		if (active_pid != 0)
+		{
+			LWLockRelease(ReplicationSlotControlLock);
+			released_lock = true;
+
+			/*
+			 * Signal to terminate the process that owns the slot, if we
+			 * haven't already signalled it.  (Avoidance of repeated
+			 * signalling is the only reason for there to be a loop in this
+			 * routine; otherwise we could rely on caller's restart loop.)
+			 *
+			 * There is the race condition that other process may own the slot
+			 * after its current owner process is terminated and before this
+			 * process owns it.  To handle that, we signal only if the PID of
+			 * the owning process has changed from the previous time.  (This
+			 * logic assumes that the same PID is not reused very quickly.)
+			 */
+			if (last_signaled_pid != active_pid)
+			{
+				ereport(LOG,
+						(errmsg("terminating process %d to release replication slot \"%s\"",
+								active_pid, NameStr(slotname))));
+
+				(void) kill(active_pid, SIGTERM);
+				last_signaled_pid = active_pid;
+			}
+
+			/*
+			 * Wait until the slot is released.
+			 *
+			 * Will immediately return in the first iteration, so we can
+			 * recheck the condition before sleeping.  That addresses the
+			 * otherwise possible race of the slot already having been
+			 * released.
+			 */
+			ConditionVariableSleep(&s->active_cv,
+								   WAIT_EVENT_REPLICATION_SLOT_DROP);
+
+			/* we're done; have caller restart */
+			break;
+		}
+		else
+		{
+			/*
+			 * We hold the slot now and have already invalidated
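The retry protocol in the patch above -- check under a lock, signal the owner at most once per owner PID, wait on the slot's condition variable with the condition rechecked before sleeping, then have the caller restart -- can be modeled outside of PostgreSQL. The following is a Python sketch of that protocol, not the actual C implementation; the names `Slot`, `invalidate_or_signal`, and the `signalled` list (standing in for `kill(active_pid, SIGTERM)`) are illustrative.

```python
import threading

class Slot:
    """Toy stand-in for ReplicationSlot: a spinlock-like mutex, an
    'active_cv'-like condition, and the fields the patch touches."""
    def __init__(self):
        self.lock = threading.Lock()           # models the slot spinlock
        self.released = threading.Condition()  # models s->active_cv
        self.active_pid = 0                    # 0 means "not acquired"
        self.restart_lsn = 100                 # None models InvalidXLogRecPtr
        self.invalidated_at = None

def invalidate_or_signal(slot, oldest_lsn, my_pid, signalled):
    """One pass of the loop body.  Returns True when the slot is fresh
    enough or has been acquired-and-invalidated; returns False after
    signalling/waiting, mirroring the 'have caller restart' exit."""
    with slot.lock:
        if slot.restart_lsn is None or slot.restart_lsn >= oldest_lsn:
            return True                        # already invalid or fresh enough
        active_pid = slot.active_pid
        if active_pid == 0:
            # acquire and invalidate in the same critical section
            slot.active_pid = my_pid
            slot.invalidated_at = slot.restart_lsn
            slot.restart_lsn = None
            return True
    # Signal only if the owner PID changed since the last signal
    # (signalled[-1] models last_signaled_pid; append models SIGTERM).
    if signalled[-1] != active_pid:
        signalled.append(active_pid)
    # Wait for release, but recheck the condition before sleeping.
    with slot.released:
        if slot.active_pid == active_pid:
            slot.released.wait(timeout=0.1)
    return False

slot = Slot()
slot.active_pid = 42                  # some other process owns the slot
signalled = [0]
first = invalidate_or_signal(slot, 200, 7, signalled)   # signals pid 42, waits
slot.active_pid = 0                   # owner releases in response
second = invalidate_or_signal(slot, 200, 7, signalled)  # acquires + invalidates
```

On the first pass the owner is signalled exactly once and the function returns False (caller restarts); on the second pass the slot is acquired and invalidated atomically under the lock.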
Re: Race condition in InvalidateObsoleteReplicationSlots()
Here's a version that I feel is committable (0001).  There was an issue
when returning from the inner loop, if in a previous iteration we had
released the lock.  In that case we need to return with the lock not
held, so that the caller can acquire it again, but we weren't doing that.
This seems pretty hard to hit in practice (I suppose somebody needs to
destroy the slot just as checkpointer killed the walsender, but before
checkpointer marks it as its own) ... but if it did happen, I think
checkpointer would block with no recourse.

Also added some comments and slightly restructured the code.

There are plenty of conflicts in pg13, but it's all easy to handle.

I wrote a test (0002) to cover the case of signalling a walsender, which
is currently not covered (we only deal with the case of a standby that's
not running).  There are some sharp edges in this code -- I had to make
it use background_psql() to send a CHECKPOINT, which hangs, because I
previously sent a SIGSTOP to the walreceiver.  Maybe there's a better
way to achieve a walreceiver that remains connected but doesn't consume
input from the primary, but I don't know what it is.  Anyway, the code
becomes covered with this.  I would like to at least see it in master,
to gather some reactions from buildfarm.

-- 
Álvaro Herrera               Valdivia, Chile
"It does it in a really, really complicated way
why does it need to be complicated?
Because it's MakeMaker."
>From f296a5b880a5688656d308300d1015ba0562c1b7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 10 Jun 2021 16:43:43 -0400
Subject: [PATCH 1/2] the code fix

---
 src/backend/replication/slot.c | 234 +
 1 file changed, 147 insertions(+), 87 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803e5d..c36b16d070 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1160,6 +1160,150 @@ ReplicationSlotReserveWal(void)
 	}
 }
 
+/*
+ * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
+ * and marks it invalid, if necessary and possible.
+ *
+ * Returns whether ReplicationSlotControlLock was released (and in that case,
+ * we're not holding the lock at return; otherwise we are).
+ *
+ * This is inherently racy, because we want to release the
+ * LWLock for syscalls, so caller must restart if we return true.
+ */
+static bool
+InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
+{
+	int			last_signaled_pid = 0;
+	bool		released_lock = false;
+
+	for (;;)
+	{
+		XLogRecPtr	restart_lsn;
+		NameData	slotname;
+		int			active_pid = 0;
+
+		Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+		if (!s->in_use)
+		{
+			if (released_lock)
+				LWLockRelease(ReplicationSlotControlLock);
+			break;
+		}
+
+		/*
+		 * Check if the slot needs to be invalidated. If it needs to be
+		 * invalidated, and is not currently acquired, acquire it and mark it
+		 * as having been invalidated.  We do this with the spinlock held to
+		 * avoid race conditions -- for example the restart_lsn could move
+		 * forward, or the slot could be dropped.
+		 */
+		SpinLockAcquire(&s->mutex);
+
+		restart_lsn = s->data.restart_lsn;
+
+		/*
+		 * If the slot is already invalid or is fresh enough, we don't need to
+		 * do anything.
+		 */
+		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
+		{
+			SpinLockRelease(&s->mutex);
+			if (released_lock)
+				LWLockRelease(ReplicationSlotControlLock);
+			break;
+		}
+
+		slotname = s->data.name;
+		active_pid = s->active_pid;
+
+		/*
+		 * If the slot can be acquired, do so and mark it invalidated
+		 * immediately.  Otherwise we'll signal the owning process, below, and
+		 * retry.
+		 */
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+			s->data.invalidated_at = restart_lsn;
+			s->data.restart_lsn = InvalidXLogRecPtr;
+		}
+
+		SpinLockRelease(&s->mutex);
+
+		if (active_pid != 0)
+		{
+			LWLockRelease(ReplicationSlotControlLock);
+			released_lock = true;
+
+			/*
+			 * Signal to terminate the process that owns the slot.
+			 *
+			 * There is the race condition that other process may own the slot
+			 * after its current owner process is terminated and before this
+			 * process owns it.  To handle that, we signal only if the PID of
+			 * the owning process has changed from the previous time.  (This
+			 * logic assumes that the same PID is not reused very quickly.)
+			 */
+			if (last_signaled_pid != active_pid)
+			{
+				ereport(LOG,
+						(errmsg("terminating process %d to release replication slot \"%s\"",
+								active_pid, NameStr(slotname))));
+
+				(void) kill(active_pid, SIGTERM);
+				last_signaled_pid = active_pid;
+			}
+
+			/*
+			 * Wait until the slot is released.
+			 *
+			 * Will immediately return in the first iteration, so we can
+			 * recheck the condition before sleeping.  That addresses the
+			 * otherwise possible race of the slot already
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi, On 2021-04-29 13:28:20 -0400, Álvaro Herrera wrote: > On 2021-Apr-07, Andres Freund wrote: > > > I'm also confused by the use of ConditionVariableTimedSleep(timeout = > > 10). Why do we need a timed sleep here in the first place? And why with > > such a short sleep? > > I was scared of the possibility that a process would not set the CV for > whatever reason, causing checkpointing to become stuck. Maybe that's > misguided thinking if CVs are reliable enough. They better be, or we have bigger problems. And if it's an escape hatch we surely ought not to use 10ms as the timeout. That's an appropriate time for something *not* using condition variables... > I attach a couple of changes to your 0001. It's all cosmetic; what > looks not so cosmetic is the change of "continue" to "break" in helper > routine; if !s->in_use, we'd loop infinitely. The other routine > already checks that before calling the helper; since you hold > ReplicationSlotControlLock at that point, it should not be possible to > drop it in between. Anyway, it's a trivial change to make, so it should > be correct. > I also added a "continue" at the bottom of one block; currently that > doesn't change any behavior, but if we add code at the other block, it > might not be what's intended. Seems sane. > Are you getting this set pushed, or would you like me to handle it? > (There seems to be some minor conflict in 13) I'd be quite happy for you to handle it... Greetings, Andres Freund
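The contract being debated here -- a condition variable is reliable only when used in a test-and-sleep loop, and a timeout is an escape hatch rather than the synchronization mechanism -- can be illustrated with Python's `threading.Condition` as a stand-in for the PostgreSQL CV API. This is a model, not PostgreSQL code; entering the `with` block loosely plays the role of `ConditionVariablePrepareToSleep`, each `wait` that of `ConditionVariableTimedSleep`, and leaving the block that of `ConditionVariableCancelSleep`.

```python
import threading

def wait_until(cond, predicate, timeout_per_sleep=0.05, max_tries=100):
    """Correct predicate loop: test the exit condition before every sleep.
    Returns True once predicate() holds, False if we give up (the escape
    hatch a timed sleep provides against a lost wakeup)."""
    with cond:                             # ~ ConditionVariablePrepareToSleep
        tries = 0
        while not predicate():             # recheck between prepare and sleep
            cond.wait(timeout_per_sleep)   # ~ ConditionVariableTimedSleep
            tries += 1
            if tries >= max_tries:
                return False
    return True                            # ~ ConditionVariableCancelSleep

state = {"released": False}
cond = threading.Condition()

# Case 1: the condition is already true before we sleep.  The recheck
# makes this return immediately -- sleeping first here would be the race.
state["released"] = True
already = wait_until(cond, lambda: state["released"])

# Case 2: another thread releases and notifies while we wait.
state["released"] = False
def releaser():
    with cond:
        state["released"] = True
        cond.notify_all()
t = threading.Thread(target=releaser)
t.start()
woken = wait_until(cond, lambda: state["released"])
t.join()
```

Because the predicate is tested before each sleep, a wakeup delivered before the waiter starts sleeping is never lost; the timeout only bounds how long a genuinely lost notification could stall the loop.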
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi, On 2021-04-08 17:03:41 +0530, Amit Kapila wrote: > I haven't tested the patch but I couldn't spot any problems while > reading it. A minor point, don't we need to use > ConditionVariableCancelSleep() at some point after doing > ConditionVariableTimedSleep? It's not really necessary - unless the CV could get deallocated as part of dynamic shared memory or such. Greetings, Andres Freund
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Apr-07, Andres Freund wrote: > I'm also confused by the use of ConditionVariableTimedSleep(timeout = > 10). Why do we need a timed sleep here in the first place? And why with > such a short sleep? I was scared of the possibility that a process would not set the CV for whatever reason, causing checkpointing to become stuck. Maybe that's misguided thinking if CVs are reliable enough. > I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); - > but is aware it's running in checkpointer. I don't think CFI does much > there? If we are worried about needing to check for interrupts, more > work is needed. Hmm .. yeah, doing CFI seems pretty useless. I think that should just be removed. If checkpointer gets USR2 (request for shutdown) it's not going to affect the behavior of CFI anyway. I attach a couple of changes to your 0001. It's all cosmetic; what looks not so cosmetic is the change of "continue" to "break" in helper routine; if !s->in_use, we'd loop infinitely. The other routine already checks that before calling the helper; since you hold ReplicationSlotControlLock at that point, it should not be possible to drop it in between. Anyway, it's a trivial change to make, so it should be correct. I also added a "continue" at the bottom of one block; currently that doesn't change any behavior, but if we add code at the other block, it might not be what's intended. > After this I don't see a reason to have SAB_Inquire - as far as I can > tell it's practically impossible to use without race conditions? Except > for raising an error - which is "builtin"... Hmm, interesting ... If not needed, yeah let's get rid of that. Are you getting this set pushed, or would you like me to handle it? 
(There seems to be some minor conflict in 13)

-- 
Álvaro Herrera               Valdivia, Chile
"Pido que me den el Nobel por razones humanitarias"
(Nicanor Parra)

>From 7f31b0ec12e52b6c967047384353895538161840 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 29 Apr 2021 13:19:54 -0400
Subject: [PATCH] Alvaro's edits

---
 src/backend/replication/slot.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d28330cbd8..cd6f75b3e9 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1172,19 +1172,17 @@ InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
 
 	while (true)
 	{
-		XLogRecPtr	restart_lsn = InvalidXLogRecPtr;
+		XLogRecPtr	restart_lsn;
 		bool		slot_conflicts;
 		NameData	slotname;
 		int			active_pid = 0;
 
 		Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
 
-		CHECK_FOR_INTERRUPTS();
-
 		slot_conflicts = false;
 
 		if (!s->in_use)
-			continue;
+			break;
 
 		/*
 		 * Check if the slot needs to be invalidated. If it needs to be
@@ -1205,12 +1203,16 @@ InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
 			slotname = s->data.name;
 			active_pid = s->active_pid;
 
-			/* check if we can acquire it */
+			/*
+			 * If the slot can be acquired, do so and mark it invalidated
+			 * immediately.  Otherwise we'll signal the owning process, below,
+			 * and retry.
+			 */
 			if (active_pid == 0)
 			{
 				MyReplicationSlot = s;
 				s->active_pid = MyProcPid;
-				s->data.invalidated_at = s->data.restart_lsn;
+				s->data.invalidated_at = restart_lsn;
 				s->data.restart_lsn = InvalidXLogRecPtr;
 			}
 		}
@@ -1262,6 +1264,7 @@ InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
 
 			/* re-acquire for next loop iteration */
 			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+			continue;
 		}
 		else
 		{
@@ -1286,7 +1289,6 @@ InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
 
 			break;
 		}
-
 	}
 
 	Assert(!released_lock ==
 		   LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
@@ -1313,8 +1315,6 @@ restart:
 	{
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 
-		CHECK_FOR_INTERRUPTS();
-
 		if (!s->in_use)
 			continue;
-- 
2.20.1
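The `continue`-to-`break` change above is the interesting one: with `continue`, a slot found with `!s->in_use` would spin forever, because nothing inside the loop body can make the slot in-use again; `break` hands control back to the caller instead. A minimal Python sketch of the two control flows (function names are illustrative, not PostgreSQL code) makes the difference testable:

```python
def invalidate_loop(slot_in_use, max_iters=1000):
    """The fixed control flow: a not-in-use slot exits the loop.
    Returns the number of iterations taken, or None if the loop spins out."""
    iters = 0
    while True:
        iters += 1
        if iters > max_iters:
            return None          # would mean an infinite loop in real code
        if not slot_in_use:
            break                # the fix: give control back to the caller
        # ... invalidation / signalling work would happen here ...
        break
    return iters

def invalidate_loop_buggy(slot_in_use, max_iters=1000):
    """The pre-fix control flow: 'continue' on a dropped slot re-tests a
    condition that can never change, so the loop never terminates."""
    iters = 0
    while True:
        iters += 1
        if iters > max_iters:
            return None          # stand-in for "spins forever"
        if not slot_in_use:
            continue             # nothing below re-checks or re-sets in_use
        break
    return iters
```

As the email notes, the caller already holds ReplicationSlotControlLock and checks `in_use` before calling the helper, so the dropped-slot case should be unreachable in practice; `break` just makes the helper safe on its own terms.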
Re: Race condition in InvalidateObsoleteReplicationSlots()
On Thu, Apr 8, 2021 at 7:39 AM Andres Freund wrote: > > On 2021-04-07 17:10:37 -0700, Andres Freund wrote: > > I think this can be solved in two different ways: > > > > 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of > >InvalidateObsoleteReplicationSlots(). That way nobody could re-create a > > new > >slot in the to-be-obsoleted-slot's place. > > > > 2) Atomically check whether the slot needs to be invalidated and try to > >acquire if needed. Don't release ReplicationSlotControlLock between those > >two steps. Signal the owner to release the slot iff we couldn't acquire > > the > >slot. In the latter case wait and then recheck if the slot still needs to > >be dropped. > > > > To me 2) seems better, because we then can also be sure that the slot still > > needs to be obsoleted, rather than potentially doing so unnecessarily. > > +1. > > > > It looks to me like several of the problems here stem from trying to reuse > > code from ReplicationSlotAcquireInternal() (which before this was just named > > ReplicationSlotAcquire()). I don't think that makes sense, because cases > > like > > this want to check if a condition is true, and acquire it only if so. > > > > IOW, I think this basically needs to look like > > ReplicationSlotsDropDBSlots(), > > except that a different condition is checked, and the if (active_pid) case > > needs to prepare a condition variable, signal the owner and then wait on the > > condition variable, to restart after. > > I'm also confused by the use of ConditionVariableTimedSleep(timeout = > 10). Why do we need a timed sleep here in the first place? And why with > such a short sleep? > > I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); - > but is aware it's running in checkpointer. I don't think CFI does much > there? If we are worried about needing to check for interrupts, more > work is needed. > > > Sketch for a fix attached. 
I did leave the odd > ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's > there... > I haven't tested the patch but I couldn't spot any problems while reading it. A minor point, don't we need to use ConditionVariableCancelSleep() at some point after doing ConditionVariableTimedSleep? -- With Regards, Amit Kapila.
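The answer to Amit's question (elsewhere in the thread) is that `ConditionVariableCancelSleep()` is mostly process-local cleanup: each backend tracks at most one "prepared" CV, and preparing a sleep on another CV implicitly cancels the previous registration. The following Python toy model of that bookkeeping is illustrative only -- `CVRegistry`, `CV`, and the method names are not the PostgreSQL API, which keeps this state in `cv_sleep_target` and the backend's PGPROC.

```python
class CV:
    """Toy condition variable: just the set of registered waiters."""
    def __init__(self):
        self.waiters = set()

class CVRegistry:
    """Per-process bookkeeping: at most one prepared sleep at a time."""
    def __init__(self):
        self.cv_sleep_target = None      # the one prepared sleep, or None

    def prepare_to_sleep(self, cv):
        # Preparing on a new CV implicitly cancels any previous sleep,
        # which is why an explicit cancel is often not strictly needed.
        if self.cv_sleep_target is not None:
            self.cancel_sleep()
        self.cv_sleep_target = cv
        cv.waiters.add(self)

    def cancel_sleep(self):
        # Explicit cleanup: deregister from the CV's waiter list.
        if self.cv_sleep_target is not None:
            self.cv_sleep_target.waiters.discard(self)
            self.cv_sleep_target = None

proc = CVRegistry()
a, b = CV(), CV()
proc.prepare_to_sleep(a)
proc.prepare_to_sleep(b)    # auto-cancels the prepared sleep on a
```

The caveat in the reply still applies: if the CV lives in memory that can go away (e.g. dynamic shared memory), leaving a stale registration behind is no longer harmless, and an explicit cancel becomes necessary.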
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi, On 2021-04-07 17:10:37 -0700, Andres Freund wrote: > I think this can be solved in two different ways: > > 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of >InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new >slot in the to-be-obsoleted-slot's place. > > 2) Atomically check whether the slot needs to be invalidated and try to >acquire if needed. Don't release ReplicationSlotControlLock between those >two steps. Signal the owner to release the slot iff we couldn't acquire the >slot. In the latter case wait and then recheck if the slot still needs to >be dropped. > > To me 2) seems better, because we then can also be sure that the slot still > needs to be obsoleted, rather than potentially doing so unnecessarily. > > > It looks to me like several of the problems here stem from trying to reuse > code from ReplicationSlotAcquireInternal() (which before this was just named > ReplicationSlotAcquire()). I don't think that makes sense, because cases like > this want to check if a condition is true, and acquire it only if so. > > IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(), > except that a different condition is checked, and the if (active_pid) case > needs to prepare a condition variable, signal the owner and then wait on the > condition variable, to restart after. I'm also confused by the use of ConditionVariableTimedSleep(timeout = 10). Why do we need a timed sleep here in the first place? And why with such a short sleep? I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); - but is aware it's running in checkpointer. I don't think CFI does much there? If we are worried about needing to check for interrupts, more work is needed. Sketch for a fix attached. I did leave the odd ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's there... 
After this I don't see a reason to have SAB_Inquire - as far as I can tell
it's practically impossible to use without race conditions? Except for
raising an error - which is "builtin"...

Greetings,

Andres Freund

>From bb2c2c4ecca51d2675f2e5c4d6ac3490995be2b0 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Wed, 7 Apr 2021 19:01:03 -0700
Subject: [PATCH 1/2] WIP: Sketch for a fix for
 InvalidateObsoleteReplicationSlots().

---
 src/backend/replication/slot.c | 226 -
 1 file changed, 139 insertions(+), 87 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c2f9d..5864b9b0139 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1153,6 +1153,140 @@ ReplicationSlotReserveWal(void)
 	}
 }
 
+/*
+ * Helper for InvalidateObsoleteReplicationSlots.  Returns whether
+ * ReplicationSlotControlLock was released.
+ */
+static bool
+InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
+{
+	int			last_signaled_pid = 0;
+	bool		released_lock = false;
+
+	while (true)
+	{
+		XLogRecPtr	restart_lsn = InvalidXLogRecPtr;
+		bool		slot_conflicts;
+		NameData	slotname;
+		int			active_pid = 0;
+
+		Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+		CHECK_FOR_INTERRUPTS();
+
+		slot_conflicts = false;
+
+		if (!s->in_use)
+			continue;
+
+		/*
+		 * Check if the slot needs to be invalidated. If it needs to be
+		 * invalidated, and is not currently acquired, acquire it and mark it
+		 * as having been invalidated.  We do all of this with the spinlock
+		 * held - otherwise there would be race conditions (e.g. the slot's
+		 * restart_lsn moving ahead, the slot concurrently being dropped after
+		 * we release ReplicationSlotControlLock, ...).
+		 */
+		SpinLockAcquire(&s->mutex);
+
+		restart_lsn = s->data.restart_lsn;
+
+		/* check if slot needs to be invalidated */
+		if (!XLogRecPtrIsInvalid(restart_lsn) && restart_lsn < oldestLSN)
+		{
+			slot_conflicts = true;
+			slotname = s->data.name;
+			active_pid = s->active_pid;
+
+			/* check if we can acquire it */
+			if (active_pid == 0)
+			{
+				MyReplicationSlot = s;
+				s->active_pid = MyProcPid;
+				s->data.invalidated_at = s->data.restart_lsn;
+				s->data.restart_lsn = InvalidXLogRecPtr;
+			}
+		}
+
+		SpinLockRelease(&s->mutex);
+
+		if (!slot_conflicts)
+		{
+			Assert(active_pid == 0);
+
+			break;
+		}
+		else if (active_pid != 0)
+		{
+			LWLockRelease(ReplicationSlotControlLock);
+			released_lock = true;
+
+			/*
+			 * Signal to terminate the process that owns the slot.
+			 *
+			 * There is the race condition where other process may own
+			 * the slot after the process using it was terminated and before
+			 * this process owns it.  To handle this case, we signal again
+			 * if the PID of the owning process is changed than the last.
+			 *
+			 * XXX This logic assumes that the same PID is not reused
+			 * very quickly.
+			 */
+			if (last_signaled_pid != active_pid)
+			{
+				ereport(LOG,
+						(errmsg("terminating process %d
Race condition in InvalidateObsoleteReplicationSlots()
Hi,

I was looking at InvalidateObsoleteReplicationSlots() while reviewing /
polishing the logical decoding on standby patch.  Which led me to notice
that I think there's a race in InvalidateObsoleteReplicationSlots()
(because ResolveRecoveryConflictWithLogicalSlots has a closely related
one).

void
InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
{
...
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (int i = 0; i < max_replication_slots; i++)
	{
...
		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
			continue;

		LWLockRelease(ReplicationSlotControlLock);
...
		for (;;)
		{
...
			wspid = ReplicationSlotAcquireInternal(s, NULL, SAB_Inquire);
...
			SpinLockAcquire(&s->mutex);
			s->data.invalidated_at = s->data.restart_lsn;
			s->data.restart_lsn = InvalidXLogRecPtr;
			SpinLockRelease(&s->mutex);
...

As far as I can tell there's no guarantee that the slot wasn't
concurrently dropped and another replication slot created at the same
offset in ReplicationSlotCtl->replication_slots.  Which we then promptly
would invalidate, regardless of the slot not actually needing to be
invalidated.

Note that this is different from the race mentioned in a comment:

			/*
			 * Signal to terminate the process that owns the slot.
			 *
			 * There is the race condition where other process may own
			 * the slot after the process using it was terminated and before
			 * this process owns it. To handle this case, we signal again
			 * if the PID of the owning process is changed than the last.
			 *
			 * XXX This logic assumes that the same PID is not reused
			 * very quickly.
			 */

It's one thing to terminate a connection erroneously - permanently
breaking a replica due to invalidating the wrong slot or such imo is
different.
Interestingly this problem seems to have been present both in

commit c6550776394e25c1620bc8258427c8f1d448080d
Author: Alvaro Herrera
Date:   2020-04-07 18:35:00 -0400

    Allow users to limit storage reserved by replication slots

commit f9e9704f09daf882f5a1cf1fbe3f5a3150ae2bb9
Author: Fujii Masao
Date:   2020-06-19 17:15:52 +0900

    Fix issues in invalidation of obsolete replication slots.

I think this can be solved in two different ways:

1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
   InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new
   slot in the to-be-obsoleted-slot's place.

2) Atomically check whether the slot needs to be invalidated and try to
   acquire if needed. Don't release ReplicationSlotControlLock between those
   two steps. Signal the owner to release the slot iff we couldn't acquire the
   slot. In the latter case wait and then recheck if the slot still needs to
   be dropped.

To me 2) seems better, because we then can also be sure that the slot still
needs to be obsoleted, rather than potentially doing so unnecessarily.


It looks to me like several of the problems here stem from trying to reuse
code from ReplicationSlotAcquireInternal() (which before this was just named
ReplicationSlotAcquire()).  I don't think that makes sense, because cases like
this want to check if a condition is true, and acquire it only if so.

IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
except that a different condition is checked, and the if (active_pid) case
needs to prepare a condition variable, signal the owner and then wait on the
condition variable, to restart after.

Greetings,

Andres Freund
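Option 2) above can be sketched outside of PostgreSQL: the point is that the invalidation check and the acquisition happen inside one critical section, so the slot that was checked is guaranteed to be the slot that gets acquired -- a concurrently dropped-and-recreated slot can no longer be invalidated by mistake. This Python model is illustrative only; `Slot`, `check_and_acquire`, and the return values are not the PostgreSQL API.

```python
import threading

class Slot:
    """Toy stand-in for a replication slot's invalidation-relevant fields."""
    def __init__(self, restart_lsn):
        self.restart_lsn = restart_lsn   # None models InvalidXLogRecPtr
        self.active_pid = 0              # 0 means "not acquired"
        self.invalidated_at = None

def check_and_acquire(slot, oldest_lsn, my_pid, control_lock):
    """Atomic check-and-acquire (option 2).  Returns:
      'ok'       -- slot is fresh enough or already invalid, nothing to do;
      'acquired' -- we acquired and invalidated it in one critical section;
      owner PID  -- someone else holds it; caller signals, waits, rechecks."""
    with control_lock:                   # models ReplicationSlotControlLock
        if slot.restart_lsn is None or slot.restart_lsn >= oldest_lsn:
            return "ok"
        if slot.active_pid != 0:
            return slot.active_pid       # decision deferred, but still atomic
        # The check above and the acquisition below happen under the same
        # lock -- no window for a drop-and-recreate in between.
        slot.active_pid = my_pid
        slot.invalidated_at = slot.restart_lsn
        slot.restart_lsn = None
        return "acquired"

lock = threading.Lock()

fresh = Slot(restart_lsn=100)
r1 = check_and_acquire(fresh, 50, 7, lock)     # 100 >= 50: fresh enough

owned = Slot(restart_lsn=100)
owned.active_pid = 42
r2 = check_and_acquire(owned, 200, 7, lock)    # conflicting but held by 42
owned.active_pid = 0                           # owner releases
r3 = check_and_acquire(owned, 200, 7, lock)    # now acquired + invalidated
```

Contrast with the racy original: releasing the control lock between the check and `ReplicationSlotAcquireInternal()` leaves a window in which the slot array entry can be reused by an unrelated new slot.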