Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi,

On 2022-02-22 17:48:55 -0800, Andres Freund wrote:
> I think the minor changes might unfortunately have introduced a race? Before
> the patch just used ConditionVariableSleep(), but now it also has a
> ConditionVariablePrepareToSleep(). Without re-checking the sleep condition
> until
>
> 			/* Wait until the slot is released. */
> 			ConditionVariableSleep(&s->active_cv,
> 								   WAIT_EVENT_REPLICATION_SLOT_DROP);
>
> which directly violates what ConditionVariablePrepareToSleep() documents:
>
>  * This can optionally be called before entering a test/sleep loop.
>  * Doing so is more efficient if we'll need to sleep at least once.
>  * However, if the first test of the exit condition is likely to succeed,
>  * it's more efficient to omit the ConditionVariablePrepareToSleep call.
>  * See comments in ConditionVariableSleep for more detail.
>  *
>  * Caution: "before entering the loop" means you *must* test the exit
>  * condition between calling ConditionVariablePrepareToSleep and calling
>  * ConditionVariableSleep.  If that is inconvenient, omit calling
>  * ConditionVariablePrepareToSleep.
>
> Afaics this means we can potentially sleep forever if the prior owner of the
> slot releases it before the ConditionVariablePrepareToSleep().
>
> There's a comment that's mentioning this danger:
>
> 			/*
> 			 * Prepare the sleep on the slot's condition variable before
> 			 * releasing the lock, to close a possible race condition if the
> 			 * slot is released before the sleep below.
> 			 */
> 			ConditionVariablePrepareToSleep(&s->active_cv);
>
> 			LWLockRelease(ReplicationSlotControlLock);
>
> but afaics that is bogus, because releasing a slot doesn't take
> ReplicationSlotControlLock. That just protects against the slot being
> dropped, not against it being released.
>
> We can ConditionVariablePrepareToSleep() here, but we'd have to do it
> earlier, before the checks at the start of the while loop.

Not at the start of the while loop, outside of the while loop.

Doing it in the loop body doesn't make sense, even if it's at the top. Each
ConditionVariablePrepareToSleep() will unregister itself:

	/*
	 * If some other sleep is already prepared, cancel it; this is necessary
	 * because we have just one static variable tracking the prepared sleep,
	 * and also only one cvWaitLink in our PGPROC.  It's okay to do this
	 * because whenever control does return to the other test-and-sleep loop,
	 * its ConditionVariableSleep call will just re-establish that sleep as
	 * the prepared one.
	 */
	if (cv_sleep_target != NULL)
		ConditionVariableCancelSleep();

The intended use is documented in this comment:

 * This should be called in a predicate loop that tests for a specific exit
 * condition and otherwise sleeps, like so:
 *
 *	 ConditionVariablePrepareToSleep(cv);  // optional
 *	 while (condition for which we are waiting is not true)
 *		 ConditionVariableSleep(cv, wait_event_info);
 *	 ConditionVariableCancelSleep();

Greetings,

Andres Freund
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi,

On 2021-06-11 12:27:57 -0400, Álvaro Herrera wrote:
> On 2021-Jun-10, Álvaro Herrera wrote:
>
> > Here's a version that I feel is committable (0001).  There was an issue
> > when returning from the inner loop, if in a previous iteration we had
> > released the lock.  In that case we need to return with the lock not
> > held, so that the caller can acquire it again, but weren't.  This seems
> > pretty hard to hit in practice (I suppose somebody needs to destroy the
> > slot just as checkpointer killed the walsender, but before checkpointer
> > marks it as its own) ... but if it did happen, I think checkpointer
> > would block with no recourse.  Also added some comments and slightly
> > restructured the code.
> >
> > There are plenty of conflicts in pg13, but it's all easy to handle.
>
> Pushed, with additional minor changes.

I stared at this code, due to [1], and I think I found a bug. I think it's
not the cause of the failures in that thread, but we probably should still
do something about it.

I think the minor changes might unfortunately have introduced a race? Before
the patch just used ConditionVariableSleep(), but now it also has a
ConditionVariablePrepareToSleep(). Without re-checking the sleep condition
until

			/* Wait until the slot is released. */
			ConditionVariableSleep(&s->active_cv,
								   WAIT_EVENT_REPLICATION_SLOT_DROP);

which directly violates what ConditionVariablePrepareToSleep() documents:

 * This can optionally be called before entering a test/sleep loop.
 * Doing so is more efficient if we'll need to sleep at least once.
 * However, if the first test of the exit condition is likely to succeed,
 * it's more efficient to omit the ConditionVariablePrepareToSleep call.
 * See comments in ConditionVariableSleep for more detail.
 *
 * Caution: "before entering the loop" means you *must* test the exit
 * condition between calling ConditionVariablePrepareToSleep and calling
 * ConditionVariableSleep.  If that is inconvenient, omit calling
 * ConditionVariablePrepareToSleep.

Afaics this means we can potentially sleep forever if the prior owner of the
slot releases it before the ConditionVariablePrepareToSleep().

There's a comment that's mentioning this danger:

			/*
			 * Prepare the sleep on the slot's condition variable before
			 * releasing the lock, to close a possible race condition if the
			 * slot is released before the sleep below.
			 */
			ConditionVariablePrepareToSleep(&s->active_cv);

			LWLockRelease(ReplicationSlotControlLock);

but afaics that is bogus, because releasing a slot doesn't take
ReplicationSlotControlLock. That just protects against the slot being
dropped, not against it being released.

We can ConditionVariablePrepareToSleep() here, but we'd have to do it
earlier, before the checks at the start of the while loop.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20220218231415.c4plkp4i3reqcwip%40alap3.anarazel.de
Re: Race condition in InvalidateObsoleteReplicationSlots()
On Mon, Jul 5, 2021 at 10:30 PM Álvaro Herrera wrote:
>
> On 2021-Jul-05, vignesh C wrote:
>
> > On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera wrote:
> > >
> > > On 2021-Jun-20, Tom Lane wrote:
> > >
> > > > Actually ... isn't there a second race, in the opposite direction?
> > > > IIUC, the point of this is that once we force some WAL to be sent
> > > > to the frozen sender/receiver, they'll be killed for failure to
> > > > respond.  But the advance_wal call is not the only possible cause
> > > > of that; a background autovacuum for example could emit some WAL.
> > > > So I fear it's possible for the 'to release replication slot'
> > > > message to come out before we capture $logstart.  I think you
> > > > need to capture that value before the kill not after.
> > >
> > > I accounted for all those things and pushed again.
> >
> > I saw that this patch is pushed. If there is no pending work left for
> > this, can we change the commitfest entry to Committed.
>
> There is none that I'm aware of, please mark it committed.  Thanks

Thanks for confirming, I have marked it as committed.

Regards,
Vignesh
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jul-05, vignesh C wrote:

> On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera wrote:
> >
> > On 2021-Jun-20, Tom Lane wrote:
> >
> > > Actually ... isn't there a second race, in the opposite direction?
> > > IIUC, the point of this is that once we force some WAL to be sent
> > > to the frozen sender/receiver, they'll be killed for failure to
> > > respond.  But the advance_wal call is not the only possible cause
> > > of that; a background autovacuum for example could emit some WAL.
> > > So I fear it's possible for the 'to release replication slot'
> > > message to come out before we capture $logstart.  I think you
> > > need to capture that value before the kill not after.
> >
> > I accounted for all those things and pushed again.
>
> I saw that this patch is pushed. If there is no pending work left for
> this, can we change the commitfest entry to Committed.

There is none that I'm aware of, please mark it committed.  Thanks

-- 
Álvaro Herrera       PostgreSQL Developer — https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
Re: Race condition in InvalidateObsoleteReplicationSlots()
On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera wrote:
>
> On 2021-Jun-20, Tom Lane wrote:
>
> > Actually ... isn't there a second race, in the opposite direction?
> > IIUC, the point of this is that once we force some WAL to be sent
> > to the frozen sender/receiver, they'll be killed for failure to
> > respond.  But the advance_wal call is not the only possible cause
> > of that; a background autovacuum for example could emit some WAL.
> > So I fear it's possible for the 'to release replication slot'
> > message to come out before we capture $logstart.  I think you
> > need to capture that value before the kill not after.
>
> I accounted for all those things and pushed again.

I saw that this patch is pushed. If there is no pending work left for
this, can we change the commitfest entry to Committed.

Regards,
Vignesh
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jun-20, Tom Lane wrote:

> Actually ... isn't there a second race, in the opposite direction?
> IIUC, the point of this is that once we force some WAL to be sent
> to the frozen sender/receiver, they'll be killed for failure to
> respond.  But the advance_wal call is not the only possible cause
> of that; a background autovacuum for example could emit some WAL.
> So I fear it's possible for the 'to release replication slot'
> message to come out before we capture $logstart.  I think you
> need to capture that value before the kill not after.

I accounted for all those things and pushed again.

-- 
Álvaro Herrera       Valdivia, Chile
"I can see support will not be a problem.  10 out of 10."    (Simon Wittber)
      (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
Re: Race condition in InvalidateObsoleteReplicationSlots()
I wrote:
> I studied this failure a bit more, and I think the test itself has
> a race condition.  It's doing
>
> # freeze walsender and walreceiver. Slot will still be active, but walreceiver
> # won't get anything anymore.
> kill 'STOP', $senderpid, $receiverpid;
> $logstart = get_log_size($node_primary3);
> advance_wal($node_primary3, 4);
> ok(find_in_log($node_primary3, "to release replication slot", $logstart),
>	"walreceiver termination logged");

Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond.  But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart.  I think you
need to capture that value before the kill not after.

I also suggest that it wouldn't be a bad idea to make the find_in_log
check more specific, by including the expected PID and perhaps the
expected slot name in the string.  The full message in primary3's log
looks like

2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG:  terminating process 3548959 to release replication slot "rep3"

and I don't understand why we wouldn't match on the whole message text.
(I think doing so will also reveal that what we are looking for here is
the walsender pid, not the walreceiver pid, and thus that the description
in the ok() call is backwards.  Or maybe we do want to check the
walreceiver side, in which case we are searching the wrong postmaster's
log?)

			regards, tom lane
Re: Race condition in InvalidateObsoleteReplicationSlots()
I wrote:
> Hmm ... desmoxytes has failed this test once, out of four runs since
> it went in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I studied this failure a bit more, and I think the test itself has
a race condition.  It's doing

# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
$logstart = get_log_size($node_primary3);
advance_wal($node_primary3, 4);
ok(find_in_log($node_primary3, "to release replication slot", $logstart),
	"walreceiver termination logged");

The string it's looking for does show up in node_primary3's log, but not
for another second or so; we can see instances of the following
poll_query_until query before that happens.  So the problem is that
there is no interlock to ensure that the walreceiver terminates before
this find_in_log check looks for it.

You should be able to fix this by adding a retry loop around the
find_in_log check (which would likely mean that you don't need to do
multiple advance_wal iterations here).  However, I agree with reverting
the test for now and then trying again after beta2.

			regards, tom lane
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hah, desmoxytes failed once:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I'll revert it and investigate later.  There have been no other failures.

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)
Re: Race condition in InvalidateObsoleteReplicationSlots()
Álvaro Herrera writes:
> It occurred to me that this could be made better by sigstopping both
> walreceiver and walsender, then letting only the latter run; AFAICS this
> makes the test stable.  I'll register this on the upcoming commitfest to
> let cfbot run it, and if it looks good there I'll get it pushed to
> master.  If there's any problem I'll just remove it before beta2 is
> stamped.

Hmm ... desmoxytes has failed this test once, out of four runs since
it went in:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

None of the other animals that have reported in so far are unhappy.
Still, maybe that's not a track record we want to have for beta2?

I've just launched a run on gaur, which given its dinosaur status might
be the most likely animal to have an actual portability problem with
this test technique.  If you want to wait a few hours to see what it
says, that'd be fine with me.

			regards, tom lane
Re: Race condition in InvalidateObsoleteReplicationSlots()
Apologies, I inadvertently sent the version before I added a maximum
number of iterations in the final loop.

-- 
Álvaro Herrera       Valdivia, Chile
"La fuerza no está en los medios físicos sino que reside en una voluntad
indomable" (Gandhi)

>From 1492e9468ecd86167a1253c4a2e9b31139835649 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 10 Jun 2021 16:44:03 -0400
Subject: [PATCH v4] Add test case for obsoleting slot with active walsender
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The code to signal a running walsender when its reserved WAL size grows
too large is completely uncovered before this commit; this adds coverage
for that case.

This test involves sending SIGSTOP to walsender and walreceiver and
running a checkpoint while advancing WAL, then sending SIGCONT.  There's
no precedent for this coding in Perl tests, and my reading of relevant
manpages says it's likely to fail on Windows.  Because of this, this
test is always skipped on that platform.

Author: Álvaro Herrera
Discussion: https://postgr.es/m/202106102202.mjw4huiix7lo@alvherre.pgsql
---
 src/test/recovery/t/019_replslot_limit.pl | 84 ++-
 1 file changed, 81 insertions(+), 3 deletions(-)

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7094aa0704..52bc9f8cfd 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -11,7 +11,7 @@ use TestLib;
 use PostgresNode;
 use File::Path qw(rmtree);
 
-use Test::More tests => 14;
+use Test::More tests => $TestLib::windows_os ? 14 : 18;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -211,8 +211,8 @@ for (my $i = 0; $i < 1; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_primary->stop('immediate');
-$node_standby->stop('immediate');
+$node_primary->stop;
+$node_standby->stop;
 
 my $node_primary2 = get_new_node('primary2');
 $node_primary2->init(allows_streaming => 1);
@@ -253,6 +253,84 @@ my @result =
 		timeout => '60'));
 is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
+$node_primary2->stop;
+$node_standby->stop;
+
+# The next test depends on Perl's `kill`, which apparently is not
+# portable to Windows.  (It would be nice to use Test::More's `subtest`,
+# but that's not in the ancient version we require.)
+if ($TestLib::windows_os)
+{
+	done_testing();
+	exit;
+}
+
+# Get a slot terminated while the walsender is active
+# We do this by sending SIGSTOP to the walsender.  Skip this on Windows.
+my $node_primary3 = get_new_node('primary3');
+$node_primary3->init(allows_streaming => 1, extra => ['--wal-segsize=1']);
+$node_primary3->append_conf(
+	'postgresql.conf', qq(
+	min_wal_size = 2MB
+	max_wal_size = 2MB
+	log_checkpoints = yes
+	max_slot_wal_keep_size = 1MB
+	));
+$node_primary3->start;
+$node_primary3->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep3')");
+# Take backup
+$backup_name = 'my_backup';
+$node_primary3->backup($backup_name);
+# Create standby
+my $node_standby3 = get_new_node('standby_3');
+$node_standby3->init_from_backup($node_primary3, $backup_name,
+	has_streaming => 1);
+$node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
+$node_standby3->start;
+$node_primary3->wait_for_catchup($node_standby3->name, 'replay');
+my $senderpid = $node_primary3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
+like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");
+my $receiverpid = $node_standby3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walreceiver'");
+like($receiverpid, qr/^[0-9]+$/, "have walreceiver pid $receiverpid");
+
+# freeze walsender and walreceiver. Slot will still be active, but walreceiver
+# won't get anything anymore.
+kill 'STOP', $senderpid, $receiverpid;
+$logstart = get_log_size($node_primary3);
+advance_wal($node_primary3, 4);
+ok(find_in_log($node_primary3, "to release replication slot", $logstart),
+	"walreceiver termination logged");
+
+# Now let the walsender continue; slot should be killed now.
+# (Must not let walreceiver run yet; otherwise the standby could start another
+# one before the slot can be killed)
+kill 'CONT', $senderpid;
+$node_primary3->poll_query_until('postgres',
+	"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'",
+	"lost")
+  or die "timed out waiting for slot to be lost";
+
+my $max_attempts = 180;
+while ($max_attempts-- > 0)
+{
+	if (find_in_log($node_primary3,
+		'invalidating slot "rep3" because its restart_lsn', $logstart))
+	{
+		ok(1, "slot invalidation logged");
+		last;
+	}
+	sleep 1;
+}
+
+# Now let the walreceiver continue, so that the node can be stopped cleanly
+kill 'CONT', $receiverpid;
+
+$node_primary3->stop;
+$node_standby3->stop;
+
 #
 # Advance WAL of $node by $n segments
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jun-11, Álvaro Herrera wrote:

> I tried hard to make this stable, but it just isn't (it works fine one
> thousand runs, then I grab some coffee and run it once more and that one
> fails.  Why? that's not clear to me).  Attached is the last one I have,
> in case somebody wants to make it better.  Maybe there's some completely
> different approach that works better, but I'm out of ideas for now.

It occurred to me that this could be made better by sigstopping both
walreceiver and walsender, then letting only the latter run; AFAICS this
makes the test stable.  I'll register this on the upcoming commitfest to
let cfbot run it, and if it looks good there I'll get it pushed to
master.  If there's any problem I'll just remove it before beta2 is
stamped.

-- 
Álvaro Herrera       Valdivia, Chile

>From 8080ced6b3c807039ff9a66810fa661fae19b347 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 10 Jun 2021 16:44:03 -0400
Subject: [PATCH v3] Add test case for obsoleting slot with active walsender
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The code to signal a running walsender when its reserved WAL size grows
too large is completely uncovered before this commit; this adds coverage
for that case.

This test involves sending SIGSTOP to walsender and walreceiver and
running a checkpoint while advancing WAL, then sending SIGCONT.  There's
no precedent for this coding in Perl tests, and my reading of relevant
manpages says it's likely to fail on Windows.  Because of this, this
test is always skipped on that platform.

Author: Álvaro Herrera
Discussion: https://postgr.es/m/202106102202.mjw4huiix7lo@alvherre.pgsql
---
 src/test/recovery/t/019_replslot_limit.pl | 82 ++-
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7094aa0704..2e6c5a229e 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -11,7 +11,7 @@ use TestLib;
 use PostgresNode;
 use File::Path qw(rmtree);
 
-use Test::More tests => 14;
+use Test::More tests => $TestLib::windows_os ? 14 : 18;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -211,8 +211,8 @@ for (my $i = 0; $i < 1; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_primary->stop('immediate');
-$node_standby->stop('immediate');
+$node_primary->stop;
+$node_standby->stop;
 
 my $node_primary2 = get_new_node('primary2');
 $node_primary2->init(allows_streaming => 1);
@@ -253,6 +253,82 @@ my @result =
 		timeout => '60'));
 is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
+$node_primary2->stop;
+$node_standby->stop;
+
+# The next test depends on Perl's `kill`, which apparently is not
+# portable to Windows.  (It would be nice to use Test::More's `subtest`,
+# but that's not in the ancient version we require.)
+if ($TestLib::windows_os)
+{
+	done_testing();
+	exit;
+}
+
+# Get a slot terminated while the walsender is active
+# We do this by sending SIGSTOP to the walsender.  Skip this on Windows.
+my $node_primary3 = get_new_node('primary3');
+$node_primary3->init(allows_streaming => 1, extra => ['--wal-segsize=1']);
+$node_primary3->append_conf(
+	'postgresql.conf', qq(
+	min_wal_size = 2MB
+	max_wal_size = 2MB
+	log_checkpoints = yes
+	max_slot_wal_keep_size = 1MB
+	));
+$node_primary3->start;
+$node_primary3->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep3')");
+# Take backup
+$backup_name = 'my_backup';
+$node_primary3->backup($backup_name);
+# Create standby
+my $node_standby3 = get_new_node('standby_3');
+$node_standby3->init_from_backup($node_primary3, $backup_name,
+	has_streaming => 1);
+$node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
+$node_standby3->start;
+$node_primary3->wait_for_catchup($node_standby3->name, 'replay');
+my $senderpid = $node_primary3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
+like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");
+my $receiverpid = $node_standby3->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walreceiver'");
+like($receiverpid, qr/^[0-9]+$/, "have walreceiver pid $receiverpid");
+
+# freeze walsender and walreceiver. Slot will still be active, but walreceiver
+# won't get anything anymore.
+kill 'STOP', $senderpid, $receiverpid;
+$logstart = get_log_size($node_primary3);
+advance_wal($node_primary3, 4);
+ok(find_in_log($node_primary3, "to release replication slot", $logstart),
+	"walreceiver termination logged");
+
+# Now let the walsender continue; slot should be killed now.
+# (Must not let walreceiver run yet; otherwise the standby could start another
+# one before the slot can be killed)
+kill 'CONT', $senderpid;
+$node_primary3->poll_query_until('postgres',
+	"SELECT wal_status FROM
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-06-11 15:52:21 -0400, Álvaro Herrera wrote:
> On 2021-Apr-07, Andres Freund wrote:
>
> > After this I don't see a reason to have SAB_Inquire - as far as I can
> > tell it's practically impossible to use without race conditions? Except
> > for raising an error - which is "builtin"...
>
> Pushed 0002.
>
> Thanks!

Thank you for your work on this!
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Apr-07, Andres Freund wrote:

> After this I don't see a reason to have SAB_Inquire - as far as I can
> tell it's practically impossible to use without race conditions? Except
> for raising an error - which is "builtin"...

Pushed 0002.

Thanks!

-- 
Álvaro Herrera       39°49'30"S 73°17'W
"La persona que no quería pecar / estaba obligada a sentarse
en duras y empinadas sillas / desprovistas, por cierto
de blandos atenuantes" (Patricio Vogel)
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jun-10, Álvaro Herrera wrote:

> Here's a version that I feel is committable (0001).  There was an issue
> when returning from the inner loop, if in a previous iteration we had
> released the lock.  In that case we need to return with the lock not
> held, so that the caller can acquire it again, but weren't.  This seems
> pretty hard to hit in practice (I suppose somebody needs to destroy the
> slot just as checkpointer killed the walsender, but before checkpointer
> marks it as its own) ... but if it did happen, I think checkpointer
> would block with no recourse.  Also added some comments and slightly
> restructured the code.
>
> There are plenty of conflicts in pg13, but it's all easy to handle.

Pushed, with additional minor changes.

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running).  There are some sharp edges in this code -- I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously sent a SIGSTOP to the walreceiver.  Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is.  Anyway, the code
> becomes covered with this.  I would like to at least see it in master,
> to gather some reactions from buildfarm.

I tried hard to make this stable, but it just isn't (it works fine one
thousand runs, then I grab some coffee and run it once more and that one
fails.  Why? that's not clear to me).  Attached is the last one I have,
in case somebody wants to make it better.  Maybe there's some completely
different approach that works better, but I'm out of ideas for now.
-- 
Álvaro Herrera       Valdivia, Chile
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

>From 6b6cb174452c553437ba7949aa25f305f599d6b7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Fri, 11 Jun 2021 12:21:45 -0400
Subject: [PATCH v3] Add test case for invalidating an active replication slot
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The code to signal a running walsender was completely uncovered before.
This test involves sending SIGSTOP to a walreceiver and running a
checkpoint while advancing WAL.

I'm not very certain that this test is stable, so it's in master only,
and separate from the code-fix commit so that it can be reverted easily
if need be.

Author: Álvaro Herrera
Discussion: https://postgr.es/m/202106102202.mjw4huiix7lo@alvherre.pgsql
---
 src/test/recovery/t/019_replslot_limit.pl | 79 ++-
 1 file changed, 76 insertions(+), 3 deletions(-)

diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7094aa0704..dcadfe1252 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -11,7 +11,7 @@ use TestLib;
 use PostgresNode;
 use File::Path qw(rmtree);
 
-use Test::More tests => 14;
+use Test::More tests => $TestLib::windows_os ? 14 : 17;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -211,8 +211,8 @@ for (my $i = 0; $i < 1; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_primary->stop('immediate');
-$node_standby->stop('immediate');
+$node_primary->stop;
+$node_standby->stop;
 
 my $node_primary2 = get_new_node('primary2');
 $node_primary2->init(allows_streaming => 1);
@@ -253,6 +253,79 @@ my @result =
 		timeout => '60'));
 is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
+$node_primary2->stop;
+$node_standby->stop;
+
+# The next test depends on Perl's `kill`, which apparently is not
+# portable to Windows.  (It would be nice to use Test::More's `subtest`,
+# but that's not in the ancient version we require.)
+if ($TestLib::windows_os)
+{
+	done_testing();
+	exit;
+}
+
+# Get a slot terminated while the walsender is active
+# We do this by sending SIGSTOP to the walreceiver.  Skip this on Windows.
+my $node_primary3 = get_new_node('primary3');
+$node_primary3->init(allows_streaming => 1, extra => ['--wal-segsize=1']);
+$node_primary3->append_conf(
+	'postgresql.conf', qq(
+	min_wal_size = 2MB
+	max_wal_size = 2MB
+	log_checkpoints = yes
+	max_slot_wal_keep_size = 1MB
+	));
+$node_primary3->start;
+$node_primary3->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep3')");
+# Take backup
+$backup_name = 'my_backup';
+$node_primary3->backup($backup_name);
+# Create standby
+my $node_standby3 = get_new_node('standby_3');
+$node_standby3->init_from_backup($node_primary3, $backup_name,
+	has_streaming => 1);
+$node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
+$node_standby3->start;
+$node_primary3->wait_for_catchup($node_standby3->name, 'replay');
+my
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Jun-10, Álvaro Herrera wrote: > I wrote a test (0002) to cover the case of signalling a walsender, which > is currently not covered (we only deal with the case of a standby that's > not running). There are some sharp edges in this code -- I had to make > it use background_psql() to send a CHECKPOINT, which hangs, because I > previously send a SIGSTOP to the walreceiver. Maybe there's a better > way to achieve a walreceiver that remains connected but doesn't consume > input from the primary, but I don't know what it is. Anyway, the code > becomes covered with this. I would like to at least see it in master, > to gather some reactions from buildfarm. Small fixup to the test one, so that skipping it on Windows works correctly. -- Álvaro Herrera39°49'30"S 73°17'W Voy a acabar con todos los humanos / con los humanos yo acabaré voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender) >From b5909d87a2252181fe55caecaf46d9544e05e47f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 10 Jun 2021 16:43:43 -0400 Subject: [PATCH v2 1/2] the code fix --- src/backend/replication/slot.c | 236 + 1 file changed, 149 insertions(+), 87 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index c88b803e5d..a8bda68b21 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1160,6 +1160,152 @@ ReplicationSlotReserveWal(void) } } +/* + * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot + * and mark it invalid, if necessary and possible. + * + * Returns whether ReplicationSlotControlLock was released (and in that case, + * we're not holding the lock at return; otherwise we are). + * + * This is inherently racy, because we want to release the + * LWLock for syscalls, so caller must restart if we return true. 
+ */
+static bool
+InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
+{
+	int			last_signaled_pid = 0;
+	bool		released_lock = false;
+
+	for (;;)
+	{
+		XLogRecPtr	restart_lsn;
+		NameData	slotname;
+		int			active_pid = 0;
+
+		Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+		if (!s->in_use)
+		{
+			if (released_lock)
+				LWLockRelease(ReplicationSlotControlLock);
+			break;
+		}
+
+		/*
+		 * Check if the slot needs to be invalidated. If it needs to be
+		 * invalidated, and is not currently acquired, acquire it and mark it
+		 * as having been invalidated.  We do this with the spinlock held to
+		 * avoid race conditions -- for example the restart_lsn could move
+		 * forward, or the slot could be dropped.
+		 */
+		SpinLockAcquire(&s->mutex);
+
+		restart_lsn = s->data.restart_lsn;
+
+		/*
+		 * If the slot is already invalid or is fresh enough, we don't need to
+		 * do anything.
+		 */
+		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
+		{
+			SpinLockRelease(&s->mutex);
+			if (released_lock)
+				LWLockRelease(ReplicationSlotControlLock);
+			break;
+		}
+
+		slotname = s->data.name;
+		active_pid = s->active_pid;
+
+		/*
+		 * If the slot can be acquired, do so and mark it invalidated
+		 * immediately.  Otherwise we'll signal the owning process, below, and
+		 * retry.
+		 */
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+			s->data.invalidated_at = restart_lsn;
+			s->data.restart_lsn = InvalidXLogRecPtr;
+		}
+
+		SpinLockRelease(&s->mutex);
+
+		if (active_pid != 0)
+		{
+			LWLockRelease(ReplicationSlotControlLock);
+			released_lock = true;
+
+			/*
+			 * Signal to terminate the process that owns the slot, if we
+			 * haven't already signalled it.  (Avoidance of repeated
+			 * signalling is the only reason for there to be a loop in this
+			 * routine; otherwise we could rely on caller's restart loop.)
+			 *
+			 * There is the race condition that other process may own the slot
+			 * after its current owner process is terminated and before this
+			 * process owns it.  To handle that, we signal only if the PID of
+			 * the owning process has changed from the previous time.  (This
+			 * logic assumes that the same PID is not reused very quickly.)
+			 */
+			if (last_signaled_pid != active_pid)
+			{
+				ereport(LOG,
+						(errmsg("terminating process %d to release replication slot \"%s\"",
+								active_pid, NameStr(slotname))));
+
+				(void) kill(active_pid, SIGTERM);
+				last_signaled_pid = active_pid;
+			}
+
+			/*
+			 * Wait until the slot is released.
+			 *
+			 * Will immediately return in the first iteration, so we can
+			 * recheck the condition before sleeping.  That addresses the
+			 * otherwise possible race of the slot already having been
+			 * released.
+			 */
+			ConditionVariableSleep(&s->active_cv,
+								   WAIT_EVENT_REPLICATION_SLOT_DROP);
+
+			/* we're done; have caller restart */
+			break;
+		}
+		else
+		{
+			/*
+			 * We hold the slot now and have already invalidated
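The retry protocol in the patch above -- check under a lock, signal the owner at most once per owner PID, wait on the slot's condition variable with the condition rechecked before sleeping, then have the caller restart -- can be modeled outside of PostgreSQL. The following is a Python sketch of that protocol, not the actual C implementation; the names `Slot`, `invalidate_or_signal`, and the `signalled` list (standing in for `kill(active_pid, SIGTERM)`) are illustrative.

```python
import threading

class Slot:
    """Toy stand-in for ReplicationSlot: a spinlock-like mutex, an
    'active_cv'-like condition, and the fields the patch touches."""
    def __init__(self):
        self.lock = threading.Lock()           # models the slot spinlock
        self.released = threading.Condition()  # models s->active_cv
        self.active_pid = 0                    # 0 means "not acquired"
        self.restart_lsn = 100                 # None models InvalidXLogRecPtr
        self.invalidated_at = None

def invalidate_or_signal(slot, oldest_lsn, my_pid, signalled):
    """One pass of the loop body.  Returns True when the slot is fresh
    enough or has been acquired-and-invalidated; returns False after
    signalling/waiting, mirroring the 'have caller restart' exit."""
    with slot.lock:
        if slot.restart_lsn is None or slot.restart_lsn >= oldest_lsn:
            return True                        # already invalid or fresh enough
        active_pid = slot.active_pid
        if active_pid == 0:
            # acquire and invalidate in the same critical section
            slot.active_pid = my_pid
            slot.invalidated_at = slot.restart_lsn
            slot.restart_lsn = None
            return True
    # Signal only if the owner PID changed since the last signal
    # (signalled[-1] models last_signaled_pid; append models SIGTERM).
    if signalled[-1] != active_pid:
        signalled.append(active_pid)
    # Wait for release, but recheck the condition before sleeping.
    with slot.released:
        if slot.active_pid == active_pid:
            slot.released.wait(timeout=0.1)
    return False

slot = Slot()
slot.active_pid = 42                  # some other process owns the slot
signalled = [0]
first = invalidate_or_signal(slot, 200, 7, signalled)   # signals pid 42, waits
slot.active_pid = 0                   # owner releases in response
second = invalidate_or_signal(slot, 200, 7, signalled)  # acquires + invalidates
```

On the first pass the owner is signalled exactly once and the function returns False (caller restarts); on the second pass the slot is acquired and invalidated atomically under the lock.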
Re: Race condition in InvalidateObsoleteReplicationSlots()
Here's a version that I feel is committable (0001).  There was an issue
when returning from the inner loop, if in a previous iteration we had
released the lock.  In that case we need to return with the lock not
held, so that the caller can acquire it again, but we weren't doing that.
This seems pretty hard to hit in practice (I suppose somebody needs to
destroy the slot just as checkpointer killed the walsender, but before
checkpointer marks it as its own) ... but if it did happen, I think
checkpointer would block with no recourse.

Also added some comments and slightly restructured the code.

There are plenty of conflicts in pg13, but it's all easy to handle.

I wrote a test (0002) to cover the case of signalling a walsender, which
is currently not covered (we only deal with the case of a standby that's
not running).  There are some sharp edges in this code -- I had to make
it use background_psql() to send a CHECKPOINT, which hangs, because I
previously sent a SIGSTOP to the walreceiver.  Maybe there's a better
way to achieve a walreceiver that remains connected but doesn't consume
input from the primary, but I don't know what it is.  Anyway, the code
becomes covered with this.  I would like to at least see it in master,
to gather some reactions from buildfarm.

-- 
Álvaro Herrera               Valdivia, Chile
"It does it in a really, really complicated way
why does it need to be complicated?
Because it's MakeMaker."
>From f296a5b880a5688656d308300d1015ba0562c1b7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 10 Jun 2021 16:43:43 -0400
Subject: [PATCH 1/2] the code fix

---
 src/backend/replication/slot.c | 234 +
 1 file changed, 147 insertions(+), 87 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803e5d..c36b16d070 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1160,6 +1160,150 @@ ReplicationSlotReserveWal(void)
 	}
 }
 
+/*
+ * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
+ * and marks it invalid, if necessary and possible.
+ *
+ * Returns whether ReplicationSlotControlLock was released (and in that case,
+ * we're not holding the lock at return; otherwise we are).
+ *
+ * This is inherently racy, because we want to release the
+ * LWLock for syscalls, so caller must restart if we return true.
+ */
+static bool
+InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
+{
+	int			last_signaled_pid = 0;
+	bool		released_lock = false;
+
+	for (;;)
+	{
+		XLogRecPtr	restart_lsn;
+		NameData	slotname;
+		int			active_pid = 0;
+
+		Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+		if (!s->in_use)
+		{
+			if (released_lock)
+				LWLockRelease(ReplicationSlotControlLock);
+			break;
+		}
+
+		/*
+		 * Check if the slot needs to be invalidated. If it needs to be
+		 * invalidated, and is not currently acquired, acquire it and mark it
+		 * as having been invalidated.  We do this with the spinlock held to
+		 * avoid race conditions -- for example the restart_lsn could move
+		 * forward, or the slot could be dropped.
+		 */
+		SpinLockAcquire(&s->mutex);
+
+		restart_lsn = s->data.restart_lsn;
+
+		/*
+		 * If the slot is already invalid or is fresh enough, we don't need to
+		 * do anything.
+		 */
+		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
+		{
+			SpinLockRelease(&s->mutex);
+			if (released_lock)
+				LWLockRelease(ReplicationSlotControlLock);
+			break;
+		}
+
+		slotname = s->data.name;
+		active_pid = s->active_pid;
+
+		/*
+		 * If the slot can be acquired, do so and mark it invalidated
+		 * immediately.  Otherwise we'll signal the owning process, below, and
+		 * retry.
+		 */
+		if (active_pid == 0)
+		{
+			MyReplicationSlot = s;
+			s->active_pid = MyProcPid;
+			s->data.invalidated_at = restart_lsn;
+			s->data.restart_lsn = InvalidXLogRecPtr;
+		}
+
+		SpinLockRelease(&s->mutex);
+
+		if (active_pid != 0)
+		{
+			LWLockRelease(ReplicationSlotControlLock);
+			released_lock = true;
+
+			/*
+			 * Signal to terminate the process that owns the slot.
+			 *
+			 * There is the race condition that other process may own the slot
+			 * after its current owner process is terminated and before this
+			 * process owns it.  To handle that, we signal only if the PID of
+			 * the owning process has changed from the previous time.  (This
+			 * logic assumes that the same PID is not reused very quickly.)
+			 */
+			if (last_signaled_pid != active_pid)
+			{
+				ereport(LOG,
+						(errmsg("terminating process %d to release replication slot \"%s\"",
+								active_pid, NameStr(slotname))));
+
+				(void) kill(active_pid, SIGTERM);
+				last_signaled_pid = active_pid;
+			}
+
+			/*
+			 * Wait until the slot is released.
+			 *
+			 * Will immediately return in the first iteration, so we can
+			 * recheck the condition before sleeping.  That addresses the
+			 * otherwise possible race of the slot already
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi, On 2021-04-29 13:28:20 -0400, Álvaro Herrera wrote: > On 2021-Apr-07, Andres Freund wrote: > > > I'm also confused by the use of ConditionVariableTimedSleep(timeout = > > 10). Why do we need a timed sleep here in the first place? And why with > > such a short sleep? > > I was scared of the possibility that a process would not set the CV for > whatever reason, causing checkpointing to become stuck. Maybe that's > misguided thinking if CVs are reliable enough. They better be, or we have bigger problems. And if it's an escape hatch we surely ought not to use 10ms as the timeout. That's an appropriate time for something *not* using condition variables... > I attach a couple of changes to your 0001. It's all cosmetic; what > looks not so cosmetic is the change of "continue" to "break" in helper > routine; if !s->in_use, we'd loop infinitely. The other routine > already checks that before calling the helper; since you hold > ReplicationSlotControlLock at that point, it should not be possible to > drop it in between. Anyway, it's a trivial change to make, so it should > be correct. > I also added a "continue" at the bottom of one block; currently that > doesn't change any behavior, but if we add code at the other block, it > might not be what's intended. Seems sane. > Are you getting this set pushed, or would you like me to handle it? > (There seems to be some minor conflict in 13) I'd be quite happy for you to handle it... Greetings, Andres Freund
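The contract being debated here -- a condition variable is reliable only when used in a test-and-sleep loop, and a timeout is an escape hatch rather than the synchronization mechanism -- can be illustrated with Python's `threading.Condition` as a stand-in for the PostgreSQL CV API. This is a model, not PostgreSQL code; entering the `with` block loosely plays the role of `ConditionVariablePrepareToSleep`, each `wait` that of `ConditionVariableTimedSleep`, and leaving the block that of `ConditionVariableCancelSleep`.

```python
import threading

def wait_until(cond, predicate, timeout_per_sleep=0.05, max_tries=100):
    """Correct predicate loop: test the exit condition before every sleep.
    Returns True once predicate() holds, False if we give up (the escape
    hatch a timed sleep provides against a lost wakeup)."""
    with cond:                             # ~ ConditionVariablePrepareToSleep
        tries = 0
        while not predicate():             # recheck between prepare and sleep
            cond.wait(timeout_per_sleep)   # ~ ConditionVariableTimedSleep
            tries += 1
            if tries >= max_tries:
                return False
    return True                            # ~ ConditionVariableCancelSleep

state = {"released": False}
cond = threading.Condition()

# Case 1: the condition is already true before we sleep.  The recheck
# makes this return immediately -- sleeping first here would be the race.
state["released"] = True
already = wait_until(cond, lambda: state["released"])

# Case 2: another thread releases and notifies while we wait.
state["released"] = False
def releaser():
    with cond:
        state["released"] = True
        cond.notify_all()
t = threading.Thread(target=releaser)
t.start()
woken = wait_until(cond, lambda: state["released"])
t.join()
```

Because the predicate is tested before each sleep, a wakeup delivered before the waiter starts sleeping is never lost; the timeout only bounds how long a genuinely lost notification could stall the loop.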
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi, On 2021-04-08 17:03:41 +0530, Amit Kapila wrote: > I haven't tested the patch but I couldn't spot any problems while > reading it. A minor point, don't we need to use > ConditionVariableCancelSleep() at some point after doing > ConditionVariableTimedSleep? It's not really necessary - unless the CV could get deallocated as part of dynamic shared memory or such. Greetings, Andres Freund
Re: Race condition in InvalidateObsoleteReplicationSlots()
On 2021-Apr-07, Andres Freund wrote: > I'm also confused by the use of ConditionVariableTimedSleep(timeout = > 10). Why do we need a timed sleep here in the first place? And why with > such a short sleep? I was scared of the possibility that a process would not set the CV for whatever reason, causing checkpointing to become stuck. Maybe that's misguided thinking if CVs are reliable enough. > I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); - > but is aware it's running in checkpointer. I don't think CFI does much > there? If we are worried about needing to check for interrupts, more > work is needed. Hmm .. yeah, doing CFI seems pretty useless. I think that should just be removed. If checkpointer gets USR2 (request for shutdown) it's not going to affect the behavior of CFI anyway. I attach a couple of changes to your 0001. It's all cosmetic; what looks not so cosmetic is the change of "continue" to "break" in helper routine; if !s->in_use, we'd loop infinitely. The other routine already checks that before calling the helper; since you hold ReplicationSlotControlLock at that point, it should not be possible to drop it in between. Anyway, it's a trivial change to make, so it should be correct. I also added a "continue" at the bottom of one block; currently that doesn't change any behavior, but if we add code at the other block, it might not be what's intended. > After this I don't see a reason to have SAB_Inquire - as far as I can > tell it's practically impossible to use without race conditions? Except > for raising an error - which is "builtin"... Hmm, interesting ... If not needed, yeah let's get rid of that. Are you getting this set pushed, or would you like me to handle it? 
(There seems to be some minor conflict in 13)

-- 
Álvaro Herrera               Valdivia, Chile
"Pido que me den el Nobel por razones humanitarias"
(Nicanor Parra)

>From 7f31b0ec12e52b6c967047384353895538161840 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 29 Apr 2021 13:19:54 -0400
Subject: [PATCH] Alvaro's edits

---
 src/backend/replication/slot.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d28330cbd8..cd6f75b3e9 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1172,19 +1172,17 @@ InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
 
 	while (true)
 	{
-		XLogRecPtr	restart_lsn = InvalidXLogRecPtr;
+		XLogRecPtr	restart_lsn;
 		bool		slot_conflicts;
 		NameData	slotname;
 		int			active_pid = 0;
 
 		Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
 
-		CHECK_FOR_INTERRUPTS();
-
 		slot_conflicts = false;
 
 		if (!s->in_use)
-			continue;
+			break;
 
 		/*
 		 * Check if the slot needs to be invalidated. If it needs to be
@@ -1205,12 +1203,16 @@ InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
 			slotname = s->data.name;
 			active_pid = s->active_pid;
 
-			/* check if we can acquire it */
+			/*
+			 * If the slot can be acquired, do so and mark it invalidated
+			 * immediately.  Otherwise we'll signal the owning process, below,
+			 * and retry.
+			 */
 			if (active_pid == 0)
 			{
 				MyReplicationSlot = s;
 				s->active_pid = MyProcPid;
-				s->data.invalidated_at = s->data.restart_lsn;
+				s->data.invalidated_at = restart_lsn;
 				s->data.restart_lsn = InvalidXLogRecPtr;
 			}
 		}
@@ -1262,6 +1264,7 @@ InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
 
 			/* re-acquire for next loop iteration */
 			LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+			continue;
 		}
 		else
 		{
@@ -1286,7 +1289,6 @@ InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
 
 			break;
 		}
-
 	}
 
 	Assert(!released_lock ==
 		   LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
@@ -1313,8 +1315,6 @@ restart:
 	{
 		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
 
-		CHECK_FOR_INTERRUPTS();
-
 		if (!s->in_use)
 			continue;
-- 
2.20.1
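The `continue`-to-`break` change above is the interesting one: with `continue`, a slot found with `!s->in_use` would spin forever, because nothing inside the loop body can make the slot in-use again; `break` hands control back to the caller instead. A minimal Python sketch of the two control flows (function names are illustrative, not PostgreSQL code) makes the difference testable:

```python
def invalidate_loop(slot_in_use, max_iters=1000):
    """The fixed control flow: a not-in-use slot exits the loop.
    Returns the number of iterations taken, or None if the loop spins out."""
    iters = 0
    while True:
        iters += 1
        if iters > max_iters:
            return None          # would mean an infinite loop in real code
        if not slot_in_use:
            break                # the fix: give control back to the caller
        # ... invalidation / signalling work would happen here ...
        break
    return iters

def invalidate_loop_buggy(slot_in_use, max_iters=1000):
    """The pre-fix control flow: 'continue' on a dropped slot re-tests a
    condition that can never change, so the loop never terminates."""
    iters = 0
    while True:
        iters += 1
        if iters > max_iters:
            return None          # stand-in for "spins forever"
        if not slot_in_use:
            continue             # nothing below re-checks or re-sets in_use
        break
    return iters
```

As the email notes, the caller already holds ReplicationSlotControlLock and checks `in_use` before calling the helper, so the dropped-slot case should be unreachable in practice; `break` just makes the helper safe on its own terms.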
Re: Race condition in InvalidateObsoleteReplicationSlots()
On Thu, Apr 8, 2021 at 7:39 AM Andres Freund wrote: > > On 2021-04-07 17:10:37 -0700, Andres Freund wrote: > > I think this can be solved in two different ways: > > > > 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of > >InvalidateObsoleteReplicationSlots(). That way nobody could re-create a > > new > >slot in the to-be-obsoleted-slot's place. > > > > 2) Atomically check whether the slot needs to be invalidated and try to > >acquire if needed. Don't release ReplicationSlotControlLock between those > >two steps. Signal the owner to release the slot iff we couldn't acquire > > the > >slot. In the latter case wait and then recheck if the slot still needs to > >be dropped. > > > > To me 2) seems better, because we then can also be sure that the slot still > > needs to be obsoleted, rather than potentially doing so unnecessarily. > > +1. > > > > It looks to me like several of the problems here stem from trying to reuse > > code from ReplicationSlotAcquireInternal() (which before this was just named > > ReplicationSlotAcquire()). I don't think that makes sense, because cases > > like > > this want to check if a condition is true, and acquire it only if so. > > > > IOW, I think this basically needs to look like > > ReplicationSlotsDropDBSlots(), > > except that a different condition is checked, and the if (active_pid) case > > needs to prepare a condition variable, signal the owner and then wait on the > > condition variable, to restart after. > > I'm also confused by the use of ConditionVariableTimedSleep(timeout = > 10). Why do we need a timed sleep here in the first place? And why with > such a short sleep? > > I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); - > but is aware it's running in checkpointer. I don't think CFI does much > there? If we are worried about needing to check for interrupts, more > work is needed. > > > Sketch for a fix attached. 
I did leave the odd > ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's > there... > I haven't tested the patch but I couldn't spot any problems while reading it. A minor point, don't we need to use ConditionVariableCancelSleep() at some point after doing ConditionVariableTimedSleep? -- With Regards, Amit Kapila.
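The answer to Amit's question (elsewhere in the thread) is that `ConditionVariableCancelSleep()` is mostly process-local cleanup: each backend tracks at most one "prepared" CV, and preparing a sleep on another CV implicitly cancels the previous registration. The following Python toy model of that bookkeeping is illustrative only -- `CVRegistry`, `CV`, and the method names are not the PostgreSQL API, which keeps this state in `cv_sleep_target` and the backend's PGPROC.

```python
class CV:
    """Toy condition variable: just the set of registered waiters."""
    def __init__(self):
        self.waiters = set()

class CVRegistry:
    """Per-process bookkeeping: at most one prepared sleep at a time."""
    def __init__(self):
        self.cv_sleep_target = None      # the one prepared sleep, or None

    def prepare_to_sleep(self, cv):
        # Preparing on a new CV implicitly cancels any previous sleep,
        # which is why an explicit cancel is often not strictly needed.
        if self.cv_sleep_target is not None:
            self.cancel_sleep()
        self.cv_sleep_target = cv
        cv.waiters.add(self)

    def cancel_sleep(self):
        # Explicit cleanup: deregister from the CV's waiter list.
        if self.cv_sleep_target is not None:
            self.cv_sleep_target.waiters.discard(self)
            self.cv_sleep_target = None

proc = CVRegistry()
a, b = CV(), CV()
proc.prepare_to_sleep(a)
proc.prepare_to_sleep(b)    # auto-cancels the prepared sleep on a
```

The caveat in the reply still applies: if the CV lives in memory that can go away (e.g. dynamic shared memory), leaving a stale registration behind is no longer harmless, and an explicit cancel becomes necessary.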
Re: Race condition in InvalidateObsoleteReplicationSlots()
Hi, On 2021-04-07 17:10:37 -0700, Andres Freund wrote: > I think this can be solved in two different ways: > > 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of >InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new >slot in the to-be-obsoleted-slot's place. > > 2) Atomically check whether the slot needs to be invalidated and try to >acquire if needed. Don't release ReplicationSlotControlLock between those >two steps. Signal the owner to release the slot iff we couldn't acquire the >slot. In the latter case wait and then recheck if the slot still needs to >be dropped. > > To me 2) seems better, because we then can also be sure that the slot still > needs to be obsoleted, rather than potentially doing so unnecessarily. > > > It looks to me like several of the problems here stem from trying to reuse > code from ReplicationSlotAcquireInternal() (which before this was just named > ReplicationSlotAcquire()). I don't think that makes sense, because cases like > this want to check if a condition is true, and acquire it only if so. > > IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(), > except that a different condition is checked, and the if (active_pid) case > needs to prepare a condition variable, signal the owner and then wait on the > condition variable, to restart after. I'm also confused by the use of ConditionVariableTimedSleep(timeout = 10). Why do we need a timed sleep here in the first place? And why with such a short sleep? I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); - but is aware it's running in checkpointer. I don't think CFI does much there? If we are worried about needing to check for interrupts, more work is needed. Sketch for a fix attached. I did leave the odd ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's there... 
After this I don't see a reason to have SAB_Inquire - as far as I can tell
it's practically impossible to use without race conditions? Except for
raising an error - which is "builtin"...

Greetings,

Andres Freund

>From bb2c2c4ecca51d2675f2e5c4d6ac3490995be2b0 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Wed, 7 Apr 2021 19:01:03 -0700
Subject: [PATCH 1/2] WIP: Sketch for a fix for
 InvalidateObsoleteReplicationSlots().

---
 src/backend/replication/slot.c | 226 -
 1 file changed, 139 insertions(+), 87 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c2f9d..5864b9b0139 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1153,6 +1153,140 @@ ReplicationSlotReserveWal(void)
 	}
 }
 
+/*
+ * Helper for InvalidateObsoleteReplicationSlots.  Returns whether
+ * ReplicationSlotControlLock was released.
+ */
+static bool
+InvalidateObsoleteReplicationSlot(ReplicationSlot *s, XLogRecPtr oldestLSN)
+{
+	int			last_signaled_pid = 0;
+	bool		released_lock = false;
+
+	while (true)
+	{
+		XLogRecPtr	restart_lsn = InvalidXLogRecPtr;
+		bool		slot_conflicts;
+		NameData	slotname;
+		int			active_pid = 0;
+
+		Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+		CHECK_FOR_INTERRUPTS();
+
+		slot_conflicts = false;
+
+		if (!s->in_use)
+			continue;
+
+		/*
+		 * Check if the slot needs to be invalidated. If it needs to be
+		 * invalidated, and is not currently acquired, acquire it and mark it
+		 * as having been invalidated.  We do all of this with the spinlock
+		 * held - otherwise there would be race conditions (e.g. the slot's
+		 * restart_lsn moving ahead, the slot concurrently being dropped after
+		 * we release ReplicationSlotControlLock, ...).
+		 */
+		SpinLockAcquire(&s->mutex);
+
+		restart_lsn = s->data.restart_lsn;
+
+		/* check if slot needs to be invalidated */
+		if (!XLogRecPtrIsInvalid(restart_lsn) && restart_lsn < oldestLSN)
+		{
+			slot_conflicts = true;
+			slotname = s->data.name;
+			active_pid = s->active_pid;
+
+			/* check if we can acquire it */
+			if (active_pid == 0)
+			{
+				MyReplicationSlot = s;
+				s->active_pid = MyProcPid;
+				s->data.invalidated_at = s->data.restart_lsn;
+				s->data.restart_lsn = InvalidXLogRecPtr;
+			}
+		}
+
+		SpinLockRelease(&s->mutex);
+
+		if (!slot_conflicts)
+		{
+			Assert(active_pid == 0);
+
+			break;
+		}
+		else if (active_pid != 0)
+		{
+			LWLockRelease(ReplicationSlotControlLock);
+			released_lock = true;
+
+			/*
+			 * Signal to terminate the process that owns the slot.
+			 *
+			 * There is the race condition where other process may own
+			 * the slot after the process using it was terminated and before
+			 * this process owns it.  To handle this case, we signal again
+			 * if the PID of the owning process is changed than the last.
+			 *
+			 * XXX This logic assumes that the same PID is not reused
+			 * very quickly.
+			 */
+			if (last_signaled_pid != active_pid)
+			{
+				ereport(LOG,
+						(errmsg("terminating process %d
Race condition in InvalidateObsoleteReplicationSlots()
Hi,

I was looking at InvalidateObsoleteReplicationSlots() while reviewing /
polishing the logical decoding on standby patch.  Which led me to notice
that I think there's a race in InvalidateObsoleteReplicationSlots()
(because ResolveRecoveryConflictWithLogicalSlots has a closely related
one).

void
InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
{
...
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (int i = 0; i < max_replication_slots; i++)
	{
...
		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
			continue;

		LWLockRelease(ReplicationSlotControlLock);
...
		for (;;)
		{
...
			wspid = ReplicationSlotAcquireInternal(s, NULL, SAB_Inquire);
...
			SpinLockAcquire(&s->mutex);
			s->data.invalidated_at = s->data.restart_lsn;
			s->data.restart_lsn = InvalidXLogRecPtr;
			SpinLockRelease(&s->mutex);
...

As far as I can tell there's no guarantee that the slot wasn't
concurrently dropped and another replication slot created at the same
offset in ReplicationSlotCtl->replication_slots.  Which we then promptly
would invalidate, regardless of the slot not actually needing to be
invalidated.

Note that this is different from the race mentioned in a comment:

			/*
			 * Signal to terminate the process that owns the slot.
			 *
			 * There is the race condition where other process may own
			 * the slot after the process using it was terminated and before
			 * this process owns it. To handle this case, we signal again
			 * if the PID of the owning process is changed than the last.
			 *
			 * XXX This logic assumes that the same PID is not reused
			 * very quickly.
			 */

It's one thing to terminate a connection erroneously - permanently
breaking a replica due to invalidating the wrong slot or such imo is
different.
Interestingly this problem seems to have been present both in

commit c6550776394e25c1620bc8258427c8f1d448080d
Author: Alvaro Herrera
Date:   2020-04-07 18:35:00 -0400

    Allow users to limit storage reserved by replication slots

commit f9e9704f09daf882f5a1cf1fbe3f5a3150ae2bb9
Author: Fujii Masao
Date:   2020-06-19 17:15:52 +0900

    Fix issues in invalidation of obsolete replication slots.

I think this can be solved in two different ways:

1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
   InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new
   slot in the to-be-obsoleted-slot's place.

2) Atomically check whether the slot needs to be invalidated and try to
   acquire if needed. Don't release ReplicationSlotControlLock between those
   two steps. Signal the owner to release the slot iff we couldn't acquire the
   slot. In the latter case wait and then recheck if the slot still needs to
   be dropped.

To me 2) seems better, because we then can also be sure that the slot still
needs to be obsoleted, rather than potentially doing so unnecessarily.


It looks to me like several of the problems here stem from trying to reuse
code from ReplicationSlotAcquireInternal() (which before this was just named
ReplicationSlotAcquire()).  I don't think that makes sense, because cases like
this want to check if a condition is true, and acquire it only if so.

IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
except that a different condition is checked, and the if (active_pid) case
needs to prepare a condition variable, signal the owner and then wait on the
condition variable, to restart after.

Greetings,

Andres Freund
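Option 2) above can be sketched outside of PostgreSQL: the point is that the invalidation check and the acquisition happen inside one critical section, so the slot that was checked is guaranteed to be the slot that gets acquired -- a concurrently dropped-and-recreated slot can no longer be invalidated by mistake. This Python model is illustrative only; `Slot`, `check_and_acquire`, and the return values are not the PostgreSQL API.

```python
import threading

class Slot:
    """Toy stand-in for a replication slot's invalidation-relevant fields."""
    def __init__(self, restart_lsn):
        self.restart_lsn = restart_lsn   # None models InvalidXLogRecPtr
        self.active_pid = 0              # 0 means "not acquired"
        self.invalidated_at = None

def check_and_acquire(slot, oldest_lsn, my_pid, control_lock):
    """Atomic check-and-acquire (option 2).  Returns:
      'ok'       -- slot is fresh enough or already invalid, nothing to do;
      'acquired' -- we acquired and invalidated it in one critical section;
      owner PID  -- someone else holds it; caller signals, waits, rechecks."""
    with control_lock:                   # models ReplicationSlotControlLock
        if slot.restart_lsn is None or slot.restart_lsn >= oldest_lsn:
            return "ok"
        if slot.active_pid != 0:
            return slot.active_pid       # decision deferred, but still atomic
        # The check above and the acquisition below happen under the same
        # lock -- no window for a drop-and-recreate in between.
        slot.active_pid = my_pid
        slot.invalidated_at = slot.restart_lsn
        slot.restart_lsn = None
        return "acquired"

lock = threading.Lock()

fresh = Slot(restart_lsn=100)
r1 = check_and_acquire(fresh, 50, 7, lock)     # 100 >= 50: fresh enough

owned = Slot(restart_lsn=100)
owned.active_pid = 42
r2 = check_and_acquire(owned, 200, 7, lock)    # conflicting but held by 42
owned.active_pid = 0                           # owner releases
r3 = check_and_acquire(owned, 200, 7, lock)    # now acquired + invalidated
```

Contrast with the racy original: releasing the control lock between the check and `ReplicationSlotAcquireInternal()` leaves a window in which the slot array entry can be reused by an unrelated new slot.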