Re: DROP DATABASE is interruptible
On Tue, Mar 12, 2024 at 9:00 PM Alexander Lakhin wrote: > I see two backends waiting: > law 2420132 2420108 0 09:05 ?00:00:00 postgres: node: law > postgres [local] DROP DATABASE waiting > law 2420135 2420108 0 09:05 ?00:00:00 postgres: node: law > postgres [local] startup waiting Ugh.
Re: DROP DATABASE is interruptible
Hi, 13.07.2023 23:52, Andres Freund wrote: Backpatching indeed was no fun. Not having BackgroundPsql.pm was the worst part. But also a lot of other conflicts in tests... Took me 5-6 hours or so. But I now finally pushed the fixes. Hope the buildfarm agrees with it... Thanks for the review! I've discovered that the test 037_invalid_database, introduced with c66a7d75e, hangs when a server built with -DCLOBBER_CACHE_ALWAYS or with debug_discard_caches = 1 set via TEMP_CONFIG: echo "debug_discard_caches = 1" >/tmp/extra.config TEMP_CONFIG=/tmp/extra.config make -s check -C src/test/recovery/ PROVE_TESTS="t/037*" # +++ tap check in src/test/recovery +++ [09:05:48] t/037_invalid_database.pl .. 6/? regress_log_037_invalid_database ends with: [09:05:51.622](0.021s) # issuing query via background psql: # CREATE DATABASE regression_invalid_interrupt; # BEGIN; # LOCK pg_tablespace; # PREPARE TRANSACTION 'lock_tblspc'; [09:05:51.684](0.062s) ok 8 - blocked DROP DATABASE completion I see two backends waiting: law 2420132 2420108 0 09:05 ? 00:00:00 postgres: node: law postgres [local] DROP DATABASE waiting law 2420135 2420108 0 09:05 ? 00:00:00 postgres: node: law postgres [local] startup waiting and the latter's stack trace: #0 0x7f65c8fd3f9a in epoll_wait (epfd=9, events=0x563c40e15478, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x563c3fa9a9fa in WaitEventSetWaitBlock (set=0x563c40e15410, cur_timeout=-1, occurred_events=0x7fff579dda80, nevents=1) at latch.c:1570 #2 0x563c3fa9a8e4 in WaitEventSetWait (set=0x563c40e15410, timeout=-1, occurred_events=0x7fff579dda80, nevents=1, wait_event_info=50331648) at latch.c:1516 #3 0x563c3fa99b14 in WaitLatch (latch=0x7f65c5e112e4, wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:538 #4 0x563c3fac7dee in ProcSleep (locallock=0x563c40e41e80, lockMethodTable=0x563c4007cba0 ) at proc.c:1339 #5 0x563c3fab4160 in WaitOnLock (locallock=0x563c40e41e80, owner=0x563c40ea5af8) at lock.c:1816 #6 0x563c3fab2c80 in LockAcquireExtended (locktag=0x7fff579dde30, lockmode=1, sessionLock=false, dontWait=false, reportMemoryError=true, locallockp=0x7fff579dde28) at lock.c:1080 #7 0x563c3faaf86d in LockRelationOid (relid=1213, lockmode=1) at lmgr.c:116 #8 0x563c3f537aff in relation_open (relationId=1213, lockmode=1) at relation.c:55 #9 0x563c3f5efde9 in table_open (relationId=1213, lockmode=1) at table.c:44 #10 0x563c3fca2227 in CatalogCacheInitializeCache (cache=0x563c40e8fe80) at catcache.c:980 #11 0x563c3fca255e in InitCatCachePhase2 (cache=0x563c40e8fe80, touch_index=true) at catcache.c:1083 #12 0x563c3fcc0556 in InitCatalogCachePhase2 () at syscache.c:184 #13 0x563c3fcb7db3 in RelationCacheInitializePhase3 () at relcache.c:4317 #14 0x563c3fce2748 in InitPostgres (in_dbname=0x563c40e54000 "postgres", dboid=5, username=0x563c40e53fe8 "law", useroid=0, flags=1, out_dbname=0x0) at postinit.c:1177 #15 0x563c3fad90a7 in PostgresMain (dbname=0x563c40e54000 "postgres", username=0x563c40e53fe8 "law") at postgres.c:4229 #16 0x563c3f9f01e4 in BackendRun (port=0x563c40e45360) at postmaster.c:4475 It looks like no new backend can be started due to the pg_tablespace lock, when a new relcache file is needed during the backend initialization. Best regards, Alexander
Re: DROP DATABASE is interruptible
Hi, On 2023-09-25 01:48:31 +0100, Peter Eisentraut wrote: > I noticed that this patch set introduced this pg_dump test: > > On 12.07.23 03:59, Andres Freund wrote: > > + 'CREATE DATABASE invalid...' => { > > + create_order => 1, > > + create_sql => q(CREATE DATABASE invalid; UPDATE pg_database SET > > datconnlimit = -2 WHERE datname = 'invalid'), > > + regexp => qr/^CREATE DATABASE invalid/m, > > + not_like => { > > + pg_dumpall_dbprivs => 1, > > + }, > > + }, > > But the key "not_like" isn't used for anything by that test suite. Maybe > "unlike" was meant? It's not clear to me either. Invalid databases shouldn't *ever* be dumped, so explicitly listing pg_dumpall_dbprivs is odd. TBH, I find this testsuite the most opaque in postgres... > But even then it would be useless because the "like" key is empty, so there > is nothing that "unlike" can subtract from. Was there something expected > from the mention of "pg_dumpall_dbprivs"? Not that I can figure out... > Perhaps it would be better to write out > > like => {}, > > explicitly, with a comment, like some other tests are doing. Yea, that looks like the right direction. I'll go and backpatch the adjustment. Greetings, Andres Freund
Re: DROP DATABASE is interruptible
I noticed that this patch set introduced this pg_dump test: On 12.07.23 03:59, Andres Freund wrote: + 'CREATE DATABASE invalid...' => { + create_order => 1, + create_sql => q(CREATE DATABASE invalid; UPDATE pg_database SET datconnlimit = -2 WHERE datname = 'invalid'), + regexp => qr/^CREATE DATABASE invalid/m, + not_like => { + pg_dumpall_dbprivs => 1, + }, + }, But the key "not_like" isn't used for anything by that test suite. Maybe "unlike" was meant? But even then it would be useless because the "like" key is empty, so there is nothing that "unlike" can subtract from. Was there something expected from the mention of "pg_dumpall_dbprivs"? Perhaps it would be better to write out like => {}, explicitly, with a comment, like some other tests are doing.
Re: DROP DATABASE is interruptible
> On 13 Jul 2023, at 22:52, Andres Freund wrote: > On 2023-07-12 11:54:18 +0200, Daniel Gustafsson wrote: >> Looking more at this I wonder if we in HEAD should make this a bit nicer by >> extending the --check phase to catch this? I did a quick hack along these >> lines in the 0003 commit attached here (0001 and 0002 are your unchanged >> patches, just added for consistency and to be CFBot compatible). If done it >> could be a separate commit to make the 0002 patch backport cleaner of course. > > I don't really have an opinion on that, tbh... Fair enough. Thinking more on it I think it has merits, so I will submit that patch in its own thread on -hackers. -- Daniel Gustafsson
Re: DROP DATABASE is interruptible
Hi, On 2023-07-12 11:54:18 +0200, Daniel Gustafsson wrote: > > On 12 Jul 2023, at 03:59, Andres Freund wrote: > > On 2023-07-07 14:09:08 +0200, Daniel Gustafsson wrote: > >>> On 25 Jun 2023, at 19:03, Andres Freund wrote: > >>> On 2023-06-21 12:02:04 -0700, Andres Freund wrote: > > >>> There don't need to be explict checks, because pg_upgrade will fail, > >>> because > >>> it connects to every database. Obviously the error could be nicer, but it > >>> seems ok for something hopefully very rare. I did add a test ensuring > >>> that the > >>> behaviour is caught. > >> > >> I don't see any pg_upgrade test in the patch? > > > > Oops, I stashed them alongside some unrelated changes... Included this time. > > Looking more at this I wonder if we in HEAD should make this a bit nicer by > extending the --check phase to catch this? I did a quick hack along these > lines in the 0003 commit attached here (0001 and 0002 are your unchanged > patches, just added for consistency and to be CFBot compatible). If done it > could be a separate commit to make the 0002 patch backport cleaner of course. I don't really have an opinion on that, tbh... > >> + errhint("Use DROP DATABASE to drop invalid databases")); > >> Should end with a period as a complete sentence? > > > > I get confused about this every time. It's not helped by this example in > > sources.sgml: > > > > > > Primary:could not create shared memory segment: %m > > Detail: Failed syscall was shmget(key=%d, size=%u, 0%o). > > Hint: the addendum > > > > > > Which notably does not use punctuation for the hint. But indeed, later we > > say: > > > >Detail and hint messages: Use complete sentences, and end each with > >a period. Capitalize the first word of sentences. Put two spaces after > >the period if another sentence follows (for English text; might be > >inappropriate in other languages). > > > > That's not a very helpful example, and one which may give the wrong impression > unless the entire page is read. I've raised this with a small diff to improve > it on -docs. Thanks for doing that! > > Updated patches attached. > > This version of the patchset LGTM. Backpatching indeed was no fun. Not having BackgroundPsql.pm was the worst part. But also a lot of other conflicts in tests... Took me 5-6 hours or so. But I now finally pushed the fixes. Hope the buildfarm agrees with it... Thanks for the review! Greetings, Andres Freund
Re: DROP DATABASE is interruptible
> On 12 Jul 2023, at 03:59, Andres Freund wrote: > On 2023-07-07 14:09:08 +0200, Daniel Gustafsson wrote: >>> On 25 Jun 2023, at 19:03, Andres Freund wrote: >>> On 2023-06-21 12:02:04 -0700, Andres Freund wrote: >>> There don't need to be explict checks, because pg_upgrade will fail, because >>> it connects to every database. Obviously the error could be nicer, but it >>> seems ok for something hopefully very rare. I did add a test ensuring that >>> the >>> behaviour is caught. >> >> I don't see any pg_upgrade test in the patch? > > Oops, I stashed them alongside some unrelated changes... Included this time. Looking more at this I wonder if we in HEAD should make this a bit nicer by extending the --check phase to catch this? I did a quick hack along these lines in the 0003 commit attached here (0001 and 0002 are your unchanged patches, just added for consistency and to be CFBot compatible). If done it could be a separate commit to make the 0002 patch backport cleaner of course. I'm not sure what should be done for psql. It's probably not a good idea to change tab completion, that'd just make it appear the database is gone. But \l could probably show dropped databases more prominently? >>> >>> I have not done that. I wonder if this is something that should be done in >>> the >>> back branches? >> >> Possibly, I'm not sure where we usually stand on changing the output format >> of >> \ commands in psql in minor revisions. > > I'd normally be quite careful, people do script psql. > > While breaking things when encountering an invalid database doesn't actually > sound like a bad thing, I don't think it fits into any of the existing column > output by psql for \l. Agreed, it doesn't, it would have to be a new column. >> +errhint("Use DROP DATABASE to drop invalid databases")); >> Should end with a period as a complete sentence? > > I get confused about this every time. It's not helped by this example in > sources.sgml: > > > Primary:could not create shared memory segment: %m > Detail: Failed syscall was shmget(key=%d, size=%u, 0%o). > Hint: the addendum > > > Which notably does not use punctuation for the hint. But indeed, later we say: > >Detail and hint messages: Use complete sentences, and end each with >a period. Capitalize the first word of sentences. Put two spaces after >the period if another sentence follows (for English text; might be >inappropriate in other languages). > That's not a very helpful example, and one which may give the wrong impression unless the entire page is read. I've raised this with a small diff to improve it on -docs. > Updated patches attached. This version of the patchset LGTM. > Does anybody have an opinion about whether we should add a dedicated field to > pg_database for representing invalid databases in HEAD? I'm inclined to think > that it's not really worth the cross-version complexity at this point, and > it's not that bad a misuse to use pg_database.datconnlimit. FWIW I think we should use pg_database.datconnlimit for this, it doesn't seem like a common enough problem to warrant the added complexity and cost. -- Daniel Gustafsson v4-0003-pg_upgrade-Extend-check-for-database-not-allowing.patch Description: Binary data v4-0002-Handle-DROP-DATABASE-getting-interrupted.patch Description: Binary data v4-0001-Release-lock-after-encountering-bogs-row-in-vac_t.patch Description: Binary data
Re: DROP DATABASE is interruptible
Hi, On 2023-07-07 14:09:08 +0200, Daniel Gustafsson wrote: > > On 25 Jun 2023, at 19:03, Andres Freund wrote: > > On 2023-06-21 12:02:04 -0700, Andres Freund wrote: > >> I'm hacking on this bugfix again, > > This patch LGTM from reading through and testing (manually and with your > supplied tests in the patch), I think this is a sound approach to deal with > this. Thanks! > >> I haven't yet added checks to pg_upgrade, even though that's clearly > >> needed. I'm waffling a bit between erroring out and just ignoring the > >> database? pg_upgrade already fails when datallowconn is set "wrongly", see > >> check_proper_datallowconn(). Any opinions? > > > > There don't need to be explict checks, because pg_upgrade will fail, because > > it connects to every database. Obviously the error could be nicer, but it > > seems ok for something hopefully very rare. I did add a test ensuring that > > the > > behaviour is caught. > > I don't see any pg_upgrade test in the patch? Oops, I stashed them alongside some unrelated changes... Included this time. > > It's somewhat odd that pg_upgrade prints errors on stdout... > > There are many odd things about pg_upgrade logging, updating it to use the > common logging framework of other utils would be nice. Indeed. > >> I'm not sure what should be done for psql. It's probably not a good idea to > >> change tab completion, that'd just make it appear the database is gone. > >> But \l > >> could probably show dropped databases more prominently? > > > > I have not done that. I wonder if this is something that should be done in > > the > > back branches? > > Possibly, I'm not sure where we usually stand on changing the output format of > \ commands in psql in minor revisions. I'd normally be quite careful, people do script psql. While breaking things when encountering an invalid database doesn't actually sound like a bad thing, I don't think it fits into any of the existing column output by psql for \l. > A few small comments on the patch: > > + * Max connections allowed (-1=no limit, -2=invalid database). A database > + * is set to invalid partway through eing dropped. Using datconnlimit=-2 > + * for this purpose isn't particularly clean, but is backpatchable. > Typo: s/eing/being/. Fixed. > A limit of -1 makes sense, but the meaning of -2 is less intuitive IMO. > Would it make sense to add a #define with a more descriptive name to save > folks reading this having to grep around and figure out what -2 means? I went back and forth about this one. We don't use defines for such things in all the frontend code today, so the majority of places won't be improved by adding this. I added them now, which required touching a few otherwise untouched places, but not too bad. > + errhint("Use DROP DATABASE to drop invalid databases")); > Should end with a period as a complete sentence? I get confused about this every time. It's not helped by this example in sources.sgml: Primary:could not create shared memory segment: %m Detail: Failed syscall was shmget(key=%d, size=%u, 0%o). Hint: the addendum Which notably does not use punctuation for the hint. But indeed, later we say: Detail and hint messages: Use complete sentences, and end each with a period. Capitalize the first word of sentences. Put two spaces after the period if another sentence follows (for English text; might be inappropriate in other languages). > + errmsg("cannot alter invalid database \"%s\"", stmt->dbname), > + errdetail("Use DROP DATABASE to drop invalid databases")); > Shouldn't this be an errhint() instead? Also ending with a period. Yep. > + if (database_is_invalid_form((Form_pg_database) dbform)) > + continue; > Would it make sense to stick a DEBUG2 log entry in there to signal that such a > database exist? (The same would apply for the similar hunk in autovacuum.c.) I don't really have an opinion on it. Added. elog(DEBUG2, "skipping invalid database \"%s\" while computing relfrozenxid", NameStr(dbform->datname)); and elog(DEBUG2, "autovacuum: skipping invalid database \"%s\"", NameStr(pgdatabase->datname)); Updated patches attached. Not looking forward to fixing all the conflicts. Does anybody have an opinion about whether we should add a dedicated field to pg_database for representing invalid databases in HEAD? I'm inclined to think that it's not really worth the cross-version complexity at this point, and it's not that bad a misuse to use pg_database.datconnlimit. Greetings, Andres Freund >From 6cdbabf5f243d5227588df5bfd3f83018bcefb9a Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 22 Jun 2023 17:27:54 -0700 Subject: [PATCH v3 1/2] Release lock after encountering bogs row in vac_truncate_clog() When
Re: DROP DATABASE is interruptible
> On 25 Jun 2023, at 19:03, Andres Freund wrote: > On 2023-06-21 12:02:04 -0700, Andres Freund wrote: >> I'm hacking on this bugfix again, This patch LGTM from reading through and testing (manually and with your supplied tests in the patch), I think this is a sound approach to deal with this. >> I've been adding checks for partiall-dropped databases to the following >> places >> so far: >> - vac_truncate_clog(), as autovacuum can't process it anymore. Otherwise a >> partially dropped database could easily lead to shutdown-due-to-wraparound. >> - get_database_list() - so autovacuum workers don't error out when connecting >> - template database used by CREATE DATABASE >> - pg_dumpall, so we don't try to connect to the database >> - vacuumdb, clusterdb, reindexdb, same > > Also pg_amcheck. That seems like an exhaustive list to me, I was unable to think of any other place which would need the same treatment. pg_checksums does come to mind but it can clearly not see the required info so there doesn't seem like theres a lot to do about that. >> I haven't yet added checks to pg_upgrade, even though that's clearly >> needed. I'm waffling a bit between erroring out and just ignoring the >> database? pg_upgrade already fails when datallowconn is set "wrongly", see >> check_proper_datallowconn(). Any opinions? > > There don't need to be explict checks, because pg_upgrade will fail, because > it connects to every database. Obviously the error could be nicer, but it > seems ok for something hopefully very rare. I did add a test ensuring that the > behaviour is caught. I don't see any pg_upgrade test in the patch? > It's somewhat odd that pg_upgrade prints errors on stdout... There are many odd things about pg_upgrade logging, updating it to use the common logging framework of other utils would be nice. >> I'm not sure what should be done for psql. It's probably not a good idea to >> change tab completion, that'd just make it appear the database is gone. But >> \l >> could probably show dropped databases more prominently? > > I have not done that. I wonder if this is something that should be done in the > back branches? Possibly, I'm not sure where we usually stand on changing the output format of \ commands in psql in minor revisions. A few small comments on the patch: + * Max connections allowed (-1=no limit, -2=invalid database). A database + * is set to invalid partway through eing dropped. Using datconnlimit=-2 + * for this purpose isn't particularly clean, but is backpatchable. Typo: s/eing/being/. A limit of -1 makes sense, but the meaning of -2 is less intuitive IMO. Would it make sense to add a #define with a more descriptive name to save folks reading this having to grep around and figure out what -2 means? + errhint("Use DROP DATABASE to drop invalid databases")); Should end with a period as a complete sentence? + errmsg("cannot alter invalid database \"%s\"", stmt->dbname), + errdetail("Use DROP DATABASE to drop invalid databases")); Shouldn't this be an errhint() instead? Also ending with a period. + if (database_is_invalid_form((Form_pg_database) dbform)) + continue; Would it make sense to stick a DEBUG2 log entry in there to signal that such a database exist? (The same would apply for the similar hunk in autovacuum.c.) -- Daniel Gustafsson
Re: DROP DATABASE is interruptible
Hi, On 2023-06-21 12:02:04 -0700, Andres Freund wrote: > I'm hacking on this bugfix again, thanks to Evgeny's reminder on the other > thread [1]. > > > I've been adding checks for partiall-dropped databases to the following places > so far: > - vac_truncate_clog(), as autovacuum can't process it anymore. Otherwise a > partially dropped database could easily lead to shutdown-due-to-wraparound. > - get_database_list() - so autovacuum workers don't error out when connecting > - template database used by CREATE DATABASE > - pg_dumpall, so we don't try to connect to the database > - vacuumdb, clusterdb, reindexdb, same Also pg_amcheck. > It's somewhat annoying that there is no shared place for the relevant query > for the client-side cases. Still the case, I looked around, and it doesn't look we do anything smart anywhere :/ > I haven't yet added checks to pg_upgrade, even though that's clearly > needed. I'm waffling a bit between erroring out and just ignoring the > database? pg_upgrade already fails when datallowconn is set "wrongly", see > check_proper_datallowconn(). Any opinions? There don't need to be explict checks, because pg_upgrade will fail, because it connects to every database. Obviously the error could be nicer, but it seems ok for something hopefully very rare. I did add a test ensuring that the behaviour is caught. It's somewhat odd that pg_upgrade prints errors on stdout... > I'm not sure what should be done for psql. It's probably not a good idea to > change tab completion, that'd just make it appear the database is gone. But \l > could probably show dropped databases more prominently? I have not done that. I wonder if this is something that should be done in the back branches? Greetings, Andres Freund >From 2c8baf448fddacd14c478da0abe30aa45391dff9 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 22 Jun 2023 17:27:54 -0700 Subject: [PATCH v2 1/2] Add missing lock releases to vac_truncate_clog() --- src/backend/commands/vacuum.c | 4 1 file changed, 4 insertions(+) diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index bb79de4da6a..984c98a5df1 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -1902,12 +1902,16 @@ vac_truncate_clog(TransactionId frozenXID, ereport(WARNING, (errmsg("some databases have not been vacuumed in over 2 billion transactions"), errdetail("You might have already suffered transaction-wraparound data loss."))); + LWLockRelease(WrapLimitsVacuumLock); return; } /* chicken out if data is bogus in any other way */ if (bogus) + { + LWLockRelease(WrapLimitsVacuumLock); return; + } /* * Advance the oldest value for commit timestamps before truncating, so -- 2.38.0 >From e1c67f6a1c028ae111dacb0867b8bc6a478678ab Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 8 May 2023 18:28:33 -0700 Subject: [PATCH v2 2/2] Handle interrupted DROP DATABASE Author: Reviewed-by: Discussion: https://postgr.es/m/20230509004637.cgvmfwrbht7xm...@awork3.anarazel.de Backpatch: --- src/include/catalog/pg_database.h | 10 +- src/backend/commands/dbcommands.c | 102 --- src/backend/commands/vacuum.c | 10 ++ src/backend/postmaster/autovacuum.c | 7 ++ src/backend/utils/init/postinit.c | 10 ++ src/bin/pg_amcheck/pg_amcheck.c | 2 +- src/bin/pg_amcheck/t/002_nonesuch.pl| 34 + src/bin/pg_dump/pg_dumpall.c| 4 +- src/bin/pg_dump/t/002_pg_dump.pl| 17 +++ src/bin/scripts/clusterdb.c | 2 +- src/bin/scripts/reindexdb.c | 2 +- src/bin/scripts/t/011_clusterdb_all.pl | 14 +++ src/bin/scripts/t/050_dropdb.pl | 9 ++ src/bin/scripts/t/091_reindexdb_all.pl | 14 +++ src/bin/scripts/t/101_vacuumdb_all.pl | 14 +++ src/bin/scripts/vacuumdb.c | 2 +- src/test/recovery/meson.build | 1 + src/test/recovery/t/037_invalid_database.pl | 130 18 files changed, 361 insertions(+), 23 deletions(-) create mode 100644 src/test/recovery/t/037_invalid_database.pl diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h index d004f4dc8aa..93206bafaca 100644 --- a/src/include/catalog/pg_database.h +++ b/src/include/catalog/pg_database.h @@ -49,7 +49,11 @@ CATALOG(pg_database,1262,DatabaseRelationId) BKI_SHARED_RELATION BKI_ROWTYPE_OID /* new connections allowed? */ bool datallowconn; - /* max connections allowed (-1=no limit) */ + /* + * Max connections allowed (-1=no limit, -2=invalid database). A database + * is set to invalid partway through eing dropped. Using datconnlimit=-2 + * for this purpose isn't particularly clean, but is backpatchable. + */ int32 datconnlimit; /* all Xids < this are frozen in this DB */ @@ -103,4 +107,8 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_database_oid_index, 2672,
Re: DROP DATABASE is interruptible
Hi, On 2023-05-09 15:41:36 +1200, Thomas Munro wrote: > +# FIXME: It'd be good to test the actual interruption path. But it's not > +# immediately obvious how. > > I wonder if there is some way to incorporate something based on > SIGSTOP signals into the test, but I don't know how to do it on > Windows and maybe that's a bit weird anyway. For a non-OS-specific > way to do it, I was wondering about having a test module function that > has a wait loop that accepts ^C but deliberately ignores > ProcSignalBarrier, and leaving that running in a background psql for a > similar effect? I found a way to test it reliably, albeit partially. However, I'm not sure where to do so / if it's worth doing so. The problem occurs once remove_dbtablespaces() starts working. The fix does a heap_inplace_update() before that. So to reproduce the problem one session can lock pg_tablespace, another can drop a database. Then the second session can be cancelled by the first. Waiting for locks to be acquired etc is somewhat cumbersome in a tap test. It'd be easier in an isolation test. But I don't think we want to do this as part of the normal isolation schedule? So just open coding it in a tap test seems to be the best way? Is it worth doing? Greetings, Andres Freund
Re: DROP DATABASE is interruptible
Hi, I'm hacking on this bugfix again, thanks to Evgeny's reminder on the other thread [1]. I've been adding checks for partiall-dropped databases to the following places so far: - vac_truncate_clog(), as autovacuum can't process it anymore. Otherwise a partially dropped database could easily lead to shutdown-due-to-wraparound. - get_database_list() - so autovacuum workers don't error out when connecting - template database used by CREATE DATABASE - pg_dumpall, so we don't try to connect to the database - vacuumdb, clusterdb, reindexdb, same It's somewhat annoying that there is no shared place for the relevant query for the client-side cases. I haven't yet added checks to pg_upgrade, even though that's clearly needed. I'm waffling a bit between erroring out and just ignoring the database? pg_upgrade already fails when datallowconn is set "wrongly", see check_proper_datallowconn(). Any opinions? I'm not sure what should be done for psql. It's probably not a good idea to change tab completion, that'd just make it appear the database is gone. But \l could probably show dropped databases more prominently? We don't really have a good place to for database specific code. dbcommands.[ch] are for commands (duh), but already contain a bunch of functions that don't really belong there. Seems we should add a catalog/pg_database.c or catalog/database.c (tbh, I don't really know which we use for what). But that's probably for HEAD only. dbcommands.c's get_db_info() seems to have gone completely off the deep end. It returns information in 14 separate out parameters, and the variables for that need to all exhaustively be declared. And of course that differs heavily between releases, making it a pain to backpatch any change. ISTM we should just define a struct for the parameters - alternatively we could just return a copy of the pg_database tuple, but it looks like the variable-width attributes would make that *just* about a loss. I guess that's once more something better dealt with on HEAD, but damn, I'm not relishing having to deal with backpatching anything touching it - I think it might be reasonable to just open-code fetching datconnlimit :/. This patch is starting to be a bit big, particularly once adding tests for all the checks mentioned above - but I haven't heard of or thought of a better proposal :(. Greetings, Andres Freund [1] https://postgr.es/m/01020188d31d0a86-16af92c0-4466-4cb6-a2e8-0e5898aab800-00%40eu-west-1.amazonses.com
Re: DROP DATABASE is interruptible
Hi, On 2023-05-09 15:41:36 +1200, Thomas Munro wrote: > I tried out the patch you posted over at [1]. Thanks! > $ psql db2 > psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" > failed: FATAL: database "db2" is invalid > DETAIL: Use DROP DATABASE to drop invalid databases > > I suppose it should be a HINT? Yup. > +# FIXME: It'd be good to test the actual interruption path. But it's not > +# immediately obvious how. > > I wonder if there is some way to incorporate something based on > SIGSTOP signals into the test, but I don't know how to do it on > Windows and maybe that's a bit weird anyway. For a non-OS-specific > way to do it, I was wondering about having a test module function that > has a wait loop that accepts ^C but deliberately ignores > ProcSignalBarrier, and leaving that running in a background psql for a > similar effect? Seems a bit too complicated. We really need to work at a framework for this kind of thing. > Not sure why the test is under src/test/recovery. Where else? We don't really have a place to put backend specific tests that aren't about logical replication or recovery right now... It partially is about dealing with crashes etc in the middle of DROP DATABASE, so it doesn't seem unreasonble to me anyway. > While a database exists in this state, we get periodic autovacuum > noise, which I guess we should actually skip? Yes, good catch. Also should either reset datfrozenxid et al when invalidating, or ignore it when computing horizons. > I suppose someone might eventually wonder if autovacuum could complete the > drop, but it seems a bit of a sudden weird leap in duties and might be > confusing (perhaps it'd make more sense if 'invalid because creating' and > 'invalid because dropping' were distinguished). I'm bit hesitant to do so for now. Once it's a bit more settled, maybe? Although I wonder if there's something better suited to the task than autovacuum. Greetings, Andres Freund
Re: DROP DATABASE is interruptible
On Tue, May 9, 2023 at 3:41 PM Thomas Munro wrote: > I tried out the patch you posted over at [1]. I forgot to add, +1, I think this is a good approach. (I'm still a little embarrassed at how long we spent trying to debug this in the other thread from the supplied clues, when you'd already pointed this failure mechanism out including the exact error message a couple of months ago. One thing I've noticed is that new threads posted in the middle of commitfests are hard to see :-D We were getting pretty close, though.)
Re: DROP DATABASE is interruptible
I tried out the patch you posted over at [1]. For those wanting an easy way to test it, or test the buggy behaviour in master without this patch, you can simply kill -STOP the checkpointer, so that DROP DATABASE hangs in RequestCheckpoint() (or you could SIGSTOP any other backend so it hangs in the barrier thing instead), and then you can just press ^C like this: postgres=# create database db2; CREATE DATABASE postgres=# drop database db2; ^CCancel request sent ERROR: canceling statement due to user request After that you get: $ psql db2 psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "db2" is invalid DETAIL: Use DROP DATABASE to drop invalid databases I suppose it should be a HINT? +# FIXME: It'd be good to test the actual interruption path. But it's not +# immediately obvious how. I wonder if there is some way to incorporate something based on SIGSTOP signals into the test, but I don't know how to do it on Windows and maybe that's a bit weird anyway. For a non-OS-specific way to do it, I was wondering about having a test module function that has a wait loop that accepts ^C but deliberately ignores ProcSignalBarrier, and leaving that running in a background psql for a similar effect? Not sure why the test is under src/test/recovery. While a database exists in this state, we get periodic autovacuum noise, which I guess we should actually skip? I suppose someone might eventually wonder if autovacuum could complete the drop, but it seems a bit of a sudden weird leap in duties and might be confusing (perhaps it'd make more sense if 'invalid because creating' and 'invalid because dropping' were distinguished). 2023-05-09 15:24:10.860 NZST [523191] FATAL: database "db2" is invalid 2023-05-09 15:24:10.860 NZST [523191] DETAIL: Use DROP DATABASE to drop invalid databases 2023-05-09 15:25:10.883 NZST [523279] FATAL: database "db2" is invalid 2023-05-09 15:25:10.883 NZST [523279] DETAIL: Use DROP DATABASE to drop invalid databases 2023-05-09 15:26:10.899 NZST [523361] FATAL: database "db2" is invalid 2023-05-09 15:26:10.899 NZST [523361] DETAIL: Use DROP DATABASE to drop invalid databases 2023-05-09 15:27:10.919 NZST [523408] FATAL: database "db2" is invalid 2023-05-09 15:27:10.919 NZST [523408] DETAIL: Use DROP DATABASE to drop invalid databases 2023-05-09 15:28:10.938 NZST [523456] FATAL: database "db2" is invalid 2023-05-09 15:28:10.938 NZST [523456] DETAIL: Use DROP DATABASE to drop invalid databases [1] https://www.postgresql.org/message-id/20230509013255.fjrlpitnj3ltur76%40awork3.anarazel.de