Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 06:40:23PM +1300, Thomas Munro wrote:
> Pushed and back-patched (it's slightly different before 12).  Thanks!

Thank you!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-03-15 Thread Thomas Munro
On Sat, Feb 19, 2022 at 7:01 AM Nathan Bossart  wrote:
> Here is a new patch.  This is essentially the same as v1, but I've spruced
> up the comments and the commit message.

Hi Nathan,

Pushed and back-patched (it's slightly different before 12).  Thanks!




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-18 Thread Nathan Bossart
On Wed, Feb 16, 2022 at 01:00:53PM -0800, Nathan Bossart wrote:
> On Wed, Feb 16, 2022 at 11:27:31AM -0800, Andres Freund wrote:
>> That doesn't strike me as great architecturally. E.g. in theory the same
>> problem could exist in single user mode. I think it doesn't today, because
>> RegisterSyncRequest() will effectively "absorb" it immediately, but that kind
>> of feels like an implementation detail?
> 
> Yeah, maybe that is a reason to add an absorb somewhere within
> CreateCheckPoint() instead, like v1 [0] does.  Then the extra absorb would
> be sufficient for single-user mode if the requests were not absorbed
> immediately.

Here is a new patch.  This is essentially the same as v1, but I've spruced
up the comments and the commit message.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From adf7f7c9717897cc72a3c76d90ad0db082a7a432 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 15 Feb 2022 22:53:04 -0800
Subject: [PATCH v3 1/1] Fix race condition between DROP TABLESPACE and
 checkpointing.

Commands like ALTER TABLE SET TABLESPACE may leave files for the next
checkpoint to clean up.  If such files are not removed by the time DROP
TABLESPACE is called, we request a checkpoint so that they are deleted.
However, there is presently a window before checkpoint start where new unlink
requests won't be scheduled until the following checkpoint.  This means that
the checkpoint forced by DROP TABLESPACE might not remove the files we expect
it to remove, and the following ERROR will be emitted:

	ERROR:  tablespace "mytblspc" is not empty

To fix, add a call to AbsorbSyncRequests() just before advancing the unlink
cycle counter.  This ensures that any unlink requests forwarded prior to
checkpoint start (i.e., when ckpt_started is incremented) will be processed by
the current checkpoint.  Since AbsorbSyncRequests() performs memory
allocations, it cannot be called within a critical section, so we also need to
move SyncPreCheckpoint() to before CreateCheckPoint()'s critical section.

This is an old bug, so back-patch to all supported versions.

Author: Nathan Bossart
Reported-by: Nathan Bossart
Diagnosed-by: Thomas Munro, Nathan Bossart
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20220215235845.GA2665318%40nathanxps13
---
 src/backend/access/transam/xlog.c | 15 ---
 src/backend/storage/sync/sync.c   | 14 +-
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce78ac413e..174aab5ae4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6294,6 +6294,14 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	/*
+	 * Let smgr prepare for checkpoint; this has to happen outside the critical
+	 * section and before we determine the REDO pointer.  Note that smgr must
+	 * not do anything that'd have to be undone if we decide no checkpoint is
+	 * needed.
+	 */
+	SyncPreCheckpoint();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -6307,13 +6315,6 @@ CreateCheckPoint(int flags)
 		LWLockRelease(ControlFileLock);
 	}
 
-	/*
-	 * Let smgr prepare for checkpoint; this has to happen before we determine
-	 * the REDO pointer.  Note that smgr must not do anything that'd have to
-	 * be undone if we decide no checkpoint is needed.
-	 */
-	SyncPreCheckpoint();
-
 	/* Begin filling in the checkpoint WAL record */
 	MemSet(&checkPoint, 0, sizeof(checkPoint));
 	checkPoint.time = (pg_time_t) time(NULL);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..f16b25faa1 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -173,7 +173,9 @@ InitSync(void)
  * counter is incremented here.
  *
  * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
+ * That ensures that we won't delete files too soon.  Since this calls
+ * AbsorbSyncRequests(), which performs memory allocations, it cannot be
+ * called within a critical section.
  *
  * Note that we can't do anything here that depends on the assumption
  * that the checkpoint will be completed.
@@ -181,6 +183,16 @@ InitSync(void)
 void
 SyncPreCheckpoint(void)
 {
+	/*
+	 * Operations such as DROP TABLESPACE assume that the next checkpoint will
+	 * process all recently forwarded unlink requests, but if they aren't
+	 * absorbed prior to advancing the cycle counter, they won't be processed
+	 * until a future checkpoint.  The following absorb ensures that any unlink
+	 * requests forwarded before the checkpoint began will be processed in the
+	 * current checkpoint.
+	 */
+	AbsorbSyncRequests();
+
 	/*
 	 * Any unlink requests arriving after this point will be assigned the next
 	 * cycle counter, and won't be unlinked until next chec

Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-16 Thread Nathan Bossart
On Wed, Feb 16, 2022 at 11:27:31AM -0800, Andres Freund wrote:
> Did you check whether this is a problem recently introduced or long-lived?

I've reproduced it back to v9.3.  I'm assuming it's much older than that.

> Does USE_BARRIER_SMGRRELEASE actually prevent this problem? Or was it just
> that it influences the timing in a way that makes it less likely?

I think it just influences the timing.  I believe the WAL pre-allocation
stuff triggers the issue because it adds a step between
AbsorbSyncRequests() and incrementing the started counter.

> ISTM that the problem is partially caused by having multiple "checkpoint"
> counters that are used in different ways, but then only waiting for one of
> them. I wonder whether a better approach could be to either unify the
> different counters, or to request / wait for the sync counter specifically?
> 
> Historically the sync stuff was something in md.c that the global system
> didn't really know anything about, but now it's a "proper" subsystem, so we
> can change the separation of concerns a bit more.

If the counters were unified, I think we might still need to do an extra
aborb after incrementing it, and we'd need to make sure that all of those
requests were tagged with the previous counter value so that they are
processed in the current checkpoint.  If callers requested with a specific
counter value, they might need to lock the counter (ckpt_lck) when making
requests.  Maybe that is okay.

>> Here's a patch that adds a call to AbsorbSyncRequests() in
>> CheckpointerMain() instead of SyncPreCheckpoint().
> 
> That doesn't strike me as great architecturally. E.g. in theory the same
> problem could exist in single user mode. I think it doesn't today, because
> RegisterSyncRequest() will effectively "absorb" it immediately, but that kind
> of feels like an implementation detail?

Yeah, maybe that is a reason to add an absorb somewhere within
CreateCheckPoint() instead, like v1 [0] does.  Then the extra absorb would
be sufficient for single-user mode if the requests were not absorbed
immediately.

[0] 
https://www.postgresql.org/message-id/attachment/130994/v1-0001-call-AbsorbSyncRequests-before-advancing-checkpoi.patch

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-16 Thread Andres Freund
Hi,

On 2022-02-16 09:37:21 -0800, Nathan Bossart wrote:
> I've also figured out a
> way to reproduce the issue without the pre-allocation patches applied:
> 
> 1. In checkpointer.c, add a 30 second sleep before acquiring ckpt_lck to
>increment ckpt_started.
> 2. In session 1, run the following commands:
>   a. CREATE TABLESPACE test LOCATION '/path/to/dir';
>   b. CREATE TABLE test TABLESPACE test AS SELECT 1;
> 3. In session 2, start a checkpoint.
> 4. In session 1, run these commands:
>   a. ALTER TABLE test SET TABLESPACE pg_default;
>   b. DROP TABLESPACE test;  -- fails
>   c. DROP TABLESPACE test;  -- succeeds

Did you check whether this is a problem recently introduced or long-lived?

Does USE_BARRIER_SMGRRELEASE actually prevent this problem? Or was it just
that it influences the timing in a way that makes it less likely?


ISTM that the problem is partially caused by having multiple "checkpoint"
counters that are used in different ways, but then only waiting for one of
them. I wonder whether a better approach could be to either unify the
different counters, or to request / wait for the sync counter specifically?

Historically the sync stuff was something in md.c that the global system
didn't really know anything about, but now it's a "proper" subsystem, so we
can change the separation of concerns a bit more.


> Here's a patch that adds a call to AbsorbSyncRequests() in
> CheckpointerMain() instead of SyncPreCheckpoint().

That doesn't strike me as great architecturally. E.g. in theory the same
problem could exist in single user mode. I think it doesn't today, because
RegisterSyncRequest() will effectively "absorb" it immediately, but that kind
of feels like an implementation detail?


Greetings,

Andres Freund




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-16 Thread Nathan Bossart
On Wed, Feb 16, 2022 at 08:44:42AM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:57:32PM -0800, Nathan Bossart wrote:
>> On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
>>> It looks like register_unlink_segment() is called prior to the checkpoint,
>>> but the checkpointer is not calling RememberSyncRequest() until after
>>> SyncPreCheckpoint().  This means that the requests are registered with the
>>> next checkpoint cycle count, so they aren't processed until the next
>>> checkpoint.
>> 
>> Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
>> seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
>> out of the critical section in CreateCheckPoint().  Patch attached.
> 
> An alternative fix might be to call AbsorbSyncRequests() after increasing
> the ckpt_started counter in CheckpointerMain().  AFAICT there is a window
> just before checkpointing where new requests are registered for the
> checkpoint following the one about to begin.

Here's a patch that adds a call to AbsorbSyncRequests() in
CheckpointerMain() instead of SyncPreCheckpoint().  I've also figured out a
way to reproduce the issue without the pre-allocation patches applied:

1. In checkpointer.c, add a 30 second sleep before acquiring ckpt_lck to
   increment ckpt_started.
2. In session 1, run the following commands:
a. CREATE TABLESPACE test LOCATION '/path/to/dir';
b. CREATE TABLE test TABLESPACE test AS SELECT 1;
3. In session 2, start a checkpoint.
4. In session 1, run these commands:
a. ALTER TABLE test SET TABLESPACE pg_default;
b. DROP TABLESPACE test;  -- fails
c. DROP TABLESPACE test;  -- succeeds

With the attached patch applied, the first attempt at dropping the
tablespace no longer fails.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e9707dfde25eaa9c447032c4b5a61e3011141dc9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 16 Feb 2022 09:26:08 -0800
Subject: [PATCH v2 1/1] call AbsorbSyncRequests() after indicating checkpoint
 start

---
 src/backend/postmaster/checkpointer.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..e93d34b71f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -401,6 +401,18 @@ CheckpointerMain(void)
 
 			ConditionVariableBroadcast(&CheckpointerShmem->start_cv);
 
+			/*
+			 * Pick up any last minute requests.  DROP TABLESPACE schedules a
+			 * checkpoint to clean up any lingering files that are scheduled for
+			 * deletion.  If we don't absorb those requests now, they might not
+			 * be absorbed until after incrementing the checkpoint cycle
+			 * counter, so the files won't be deleted until the following
+			 * checkpoint.  By absorbing requests after indicating the
+			 * checkpoint has started, operations like DROP TABLESPACE can be
+			 * sure that the next checkpoint will clean up any such files.
+			 */
+			AbsorbSyncRequests();
+
 			/*
 			 * The end-of-recovery checkpoint is a real checkpoint that's
 			 * performed while we're still in recovery.
-- 
2.25.1



Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-16 Thread Nathan Bossart
On Tue, Feb 15, 2022 at 10:57:32PM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
>> It looks like register_unlink_segment() is called prior to the checkpoint,
>> but the checkpointer is not calling RememberSyncRequest() until after
>> SyncPreCheckpoint().  This means that the requests are registered with the
>> next checkpoint cycle count, so they aren't processed until the next
>> checkpoint.
> 
> Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
> seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
> out of the critical section in CreateCheckPoint().  Patch attached.

An alternative fix might be to call AbsorbSyncRequests() after increasing
the ckpt_started counter in CheckpointerMain().  AFAICT there is a window
just before checkpointing where new requests are registered for the
checkpoint following the one about to begin.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Nathan Bossart
On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
> It looks like register_unlink_segment() is called prior to the checkpoint,
> but the checkpointer is not calling RememberSyncRequest() until after
> SyncPreCheckpoint().  This means that the requests are registered with the
> next checkpoint cycle count, so they aren't processed until the next
> checkpoint.

Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
out of the critical section in CreateCheckPoint().  Patch attached.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 188fbe8ab12387c423081edfb85b26e43b342f4e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 15 Feb 2022 22:53:04 -0800
Subject: [PATCH v1 1/1] call AbsorbSyncRequests() before advancing checkpoint
 cycle counter

---
 src/backend/access/transam/xlog.c | 14 +++---
 src/backend/storage/sync/sync.c   |  6 ++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60d0f3d99e..c72d6faa88 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9042,6 +9042,13 @@ CreateCheckPoint(int flags)
 	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
+	/*
+	 * Let smgr prepare for checkpoint; this has to happen before we determine
+	 * the REDO pointer.  Note that smgr must not do anything that'd have to
+	 * be undone if we decide no checkpoint is needed.
+	 */
+	SyncPreCheckpoint();
+
 	/*
 	 * Use a critical section to force system panic if we have trouble.
 	 */
@@ -9055,13 +9062,6 @@ CreateCheckPoint(int flags)
 		LWLockRelease(ControlFileLock);
 	}
 
-	/*
-	 * Let smgr prepare for checkpoint; this has to happen before we determine
-	 * the REDO pointer.  Note that smgr must not do anything that'd have to
-	 * be undone if we decide no checkpoint is needed.
-	 */
-	SyncPreCheckpoint();
-
 	/* Begin filling in the checkpoint WAL record */
 	MemSet(&checkPoint, 0, sizeof(checkPoint));
 	checkPoint.time = (pg_time_t) time(NULL);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 543f691f2d..8ffb3b8279 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -180,6 +180,12 @@ InitSync(void)
 void
 SyncPreCheckpoint(void)
 {
+	/*
+	 * Some operations assume that the next checkpoint will process recently
+	 * forwarded requests, so absorb them before we advance the cycle counter.
+	 */
+	AbsorbSyncRequests();
+
 	/*
 	 * Any unlink requests arriving after this point will be assigned the next
 	 * cycle counter, and won't be unlinked until next checkpoint.
-- 
2.25.1



Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Nathan Bossart
On Tue, Feb 15, 2022 at 09:06:45PM -0800, Andres Freund wrote:
> On 2022-02-15 20:59:11 -0800, Nathan Bossart wrote:
>> On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
>> > So it's not getting unlinked until the *next* checkpoint cycle.  I
>> > don't know why.
>> 
>> On my machine (5.11.0-43), it looks like the test starts failing after
>> cc50080.  That commit adjusted some regression tests, so I'm assuming it's
>> not the root cause.
> 
> That makes some sense. Before that we didn't create database objects before
> tablespace.sql. A scheduling like this that could make some pre-existing
> issues much more visible.
> 
> I think we also saw the windows drop tablespace issues less commonly than
> recently (outside the aio tree at least)...

It looks like register_unlink_segment() is called prior to the checkpoint,
but the checkpointer is not calling RememberSyncRequest() until after
SyncPreCheckpoint().  This means that the requests are registered with the
next checkpoint cycle count, so they aren't processed until the next
checkpoint.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Andres Freund
Hi,

On 2022-02-15 20:59:11 -0800, Nathan Bossart wrote:
> On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
> > So it's not getting unlinked until the *next* checkpoint cycle.  I
> > don't know why.
> 
> On my machine (5.11.0-43), it looks like the test starts failing after
> cc50080.  That commit adjusted some regression tests, so I'm assuming it's
> not the root cause.

That makes some sense. Before that we didn't create database objects before
tablespace.sql. A scheduling like this that could make some pre-existing
issues much more visible.

I think we also saw the windows drop tablespace issues less commonly than
recently (outside the aio tree at least)...

Greetings,

Andres Freund




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Andres Freund
Hi,

On 2022-02-16 15:34:08 +1300, Thomas Munro wrote:
> So it's not getting unlinked until the *next* checkpoint cycle.  I
> don't know why.

It might be helpful to know what the relfilenode maps to, so we know the
operations done to it. Perhaps logging in heap_create() and
RelationSetNewRelfilenode() would do the trick?

Greetings,

Andres Freund




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Nathan Bossart
On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
> So it's not getting unlinked until the *next* checkpoint cycle.  I
> don't know why.

On my machine (5.11.0-43), it looks like the test starts failing after
cc50080.  That commit adjusted some regression tests, so I'm assuming it's
not the root cause.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Andres Freund
Hi,

On 2022-02-16 15:09:15 +1300, Thomas Munro wrote:
> I pushed a hack to log the name of the file that's getting in the way:

FWIW, I think we should just have that in core. Right now it's just
pointlessly hard to debug. I think just reporting the first file would be good
enough...

Greetings,

Andres Freund




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Thomas Munro
On Wed, Feb 16, 2022 at 3:09 PM Thomas Munro  wrote:
> I tried adding another log line so I could see the unlink() happening,
> but with that it doesn't fail (though I'm still trying).  We may be in
> heisenbug territory.

This time I was luckier:

https://api.cirrus-ci.com/v1/artifact/task/4915815005814784/log/src/test/regress/log/postmaster.log

Interesting lines:

2022-02-16 02:08:50.933 UTC [14496][client backend]
[pg_regress/tablespace][3/682:918] LOG:   found 16748

2022-02-16 02:08:50.935 UTC [14089][checkpointer] LOG:  checkpoint
starting: immediate force wait
2022-02-16 02:08:50.937 UTC [14089][checkpointer] LOG:  checkpoint
complete: wrote 37 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0
recycled; write=0.002 s, sync=0.001 s, total=0.002 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=274 kB, estimate=7214 kB
2022-02-16 02:08:50.937 UTC [14496][client backend]
[pg_regress/tablespace][3/682:918] LOG:   found 16748
...
2022-02-16 02:10:10.010 UTC [14089][checkpointer] LOG:  shutting down
2022-02-16 02:10:10.010 UTC [14089][checkpointer] LOG:  checkpoint
starting: shutdown immediate
...
2022-02-16 02:10:10.080 UTC [14089][checkpointer] LOG:  
mdunlinkfiletag "pg_tblspc/16501/PG_15_202202141/16384/16748"

So it's not getting unlinked until the *next* checkpoint cycle.  I
don't know why.




Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Thomas Munro
Hi Nathan,

On Wed, Feb 16, 2022 at 12:58 PM Nathan Bossart
 wrote:
> I've noticed that my WAL pre-allocation patches [0] routinely fail with
> "tablespace is not empty" errors on Linux [1]:
>
>  DROP TABLESPACE regress_tblspace_renamed;
> +ERROR:  tablespace "regress_tblspace_renamed" is not empty
>
> This appears to have been discussed elsewhere (and fixed) for Windows [2]
> [3] [4].  However, I'm able to reliably reproduce the issue on Linux when
> the WAL pre-allocation patches are applied.  Adding a short sleep before
> dropping the tablespace or defining USE_BARRIER_SMGRRELEASE seems to clear
> up the issue.
>
> My current thinking is that the performance boost from pre-allocating WAL
> is revealing this issue, but I'm not 100% certain it's not due to a bug in
> my code.  I'm continuing to investigate.

Huh.  I applied your patches and saw no failure on my local Linux
hacking station.  I noticed that it failed a couple of times recently
on Linux CI as you mentioned:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3158

But it didn't fail before (sorry, cfbot branch names needlessly
include the commitfest number, which is stupid and breaks the history;
I'll fix that one day):

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3158

It also hasn't failed on those other Unices (FreeBSD, macOS).

I pushed a hack to log the name of the file that's getting in the way:

2022-02-16 01:10:54.180 UTC [22286][client backend]
[pg_regress/tablespace][3/682:918] LOG:   found 16748
2022-02-16 01:10:54.180 UTC [22286][client backend]
[pg_regress/tablespace][3/682:918] STATEMENT:  DROP TABLESPACE
regress_tblspace_renamed;
2022-02-16 01:10:54.187 UTC [21880][checkpointer] LOG:  checkpoint
starting: immediate force wait
2022-02-16 01:10:54.188 UTC [21880][checkpointer] LOG:  checkpoint
complete: wrote 37 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0
recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=274 kB, estimate=7214 kB
2022-02-16 01:10:54.189 UTC [22286][client backend]
[pg_regress/tablespace][3/682:918] LOG:   found 16748
2022-02-16 01:10:54.189 UTC [22286][client backend]
[pg_regress/tablespace][3/682:918] STATEMENT:  DROP TABLESPACE
regress_tblspace_renamed;
2022-02-16 01:10:54.189 UTC [22286][client backend]
[pg_regress/tablespace][3/682:918] ERROR:  tablespace
"regress_tblspace_renamed" is not empty

So we can see that the checkpoint didn't get rid of (presumed)
tombstone file 16748.  At least in unpatched PostgreSQL, this
shouldn't be possible, if 16748 is on the pending delete list (since
CreateCheckPoint() runs SyncPostCheckpoint() before returning, and
only after that do we kick the condition variable and release the
backend), and USE_BARRIER_SMGRRELEASE shouldn't really make a
difference on a Unix, since all it does it close file descriptors, it
doesn't unlink anything.

I tried adding another log line so I could see the unlink() happening,
but with that it doesn't fail (though I'm still trying).  We may be in
heisenbug territory.

A conspiracy theorist with a siege mentality might wonder if Linux has
developed a name cache invalidation bug so we're seeing a slightly out
of date directory list (hmm, that CI image's kernel did go from
5.10.84-1 to 5.10.92-1 ~24 days ago), but I don't really understand
what your patch has to do with that...  Hrmph.




USE_BARRIER_SMGRRELEASE on Linux?

2022-02-15 Thread Nathan Bossart
I've noticed that my WAL pre-allocation patches [0] routinely fail with
"tablespace is not empty" errors on Linux [1]:

 DROP TABLESPACE regress_tblspace_renamed;
+ERROR:  tablespace "regress_tblspace_renamed" is not empty

This appears to have been discussed elsewhere (and fixed) for Windows [2]
[3] [4].  However, I'm able to reliably reproduce the issue on Linux when
the WAL pre-allocation patches are applied.  Adding a short sleep before
dropping the tablespace or defining USE_BARRIER_SMGRRELEASE seems to clear
up the issue.

My current thinking is that the performance boost from pre-allocating WAL
is revealing this issue, but I'm not 100% certain it's not due to a bug in
my code.  I'm continuing to investigate.

[0] https://commitfest.postgresql.org/37/3158/
[1] 
https://api.cirrus-ci.com/v1/artifact/task/4678319084929024/log/src/test/regress/regression.diffs
[2] 
https://postgr.es/m/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com
[3] 
https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com
[4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4eb2176

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com