Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Dilip Kumar
On Tue, May 9, 2023 at 3:15 AM Michael Paquier  wrote:
>
> On Mon, May 08, 2023 at 07:15:20PM +0530, Dilip Kumar wrote:
> > I am able to reproduce this using the steps given above, I am also
> > trying to analyze this further.  I will send the update once I get
> > some clue.
>
> Have you been able to reproduce this on HEAD or at the top of
> REL_15_STABLE, or is that 15.2-ish without fa5dd46?
>

I am able to reproduce this on REL_15_STABLE as well with your test case.
The only difference is that without fa5dd46 the issue reproduces within a
couple of seconds and very consistently, whereas on REL_15_STABLE it takes
2-3 minutes to reproduce and is not very consistent.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Andres Freund
Hi,

On 2023-05-08 17:46:37 -0700, Andres Freund wrote:
> My current gut feeling is that we should use datconnlimit == -2 to prevent
> connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
> column in 16, for both createdb() and dropdb().

Attached is a rough prototype of that idea (only using datconnlimit == -2 for
now).

This would need a fair bit more polish. The tests are crappy and output stuff
to stderr and don't validate enough. The error messages are bad. No docs
changes. More comments about why datconnlimit == -2 is used. Etc.

But I think it should be sufficient to discuss whether this is a viable path.


I guess we need to move this to -hackers. Perhaps I'll post subsequent
versions below
https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?

Greetings,

Andres Freund
From 5784ba0b21eaf05c2989c57ce1f73c13edf0111e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 8 May 2023 18:28:33 -0700
Subject: [PATCH v1] Handle interrupted DROP DATABASE

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230509004637.cgvmfwrbht7xm...@awork3.anarazel.de
Backpatch:
---
 src/include/catalog/pg_database.h   |  2 +-
 src/backend/commands/dbcommands.c   | 50 ++---
 src/backend/utils/init/postinit.c   | 11 +
 src/test/recovery/meson.build   |  1 +
 src/test/recovery/t/037_invalid_database.pl | 48 
 5 files changed, 95 insertions(+), 17 deletions(-)
 create mode 100644 src/test/recovery/t/037_invalid_database.pl

diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h
index d004f4dc8aa..5cd972da324 100644
--- a/src/include/catalog/pg_database.h
+++ b/src/include/catalog/pg_database.h
@@ -49,7 +49,7 @@ CATALOG(pg_database,1262,DatabaseRelationId) BKI_SHARED_RELATION BKI_ROWTYPE_OID
 	/* new connections allowed? */
 	bool		datallowconn;
 
-	/* max connections allowed (-1=no limit) */
+	/* max connections allowed (-1=no limit, -2=invalid database) */
 	int32		datconnlimit;
 
 	/* all Xids < this are frozen in this DB */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e242eeff24..a0360d9ad80 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1569,6 +1569,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	bool		db_istemplate;
 	Relation	pgdbrel;
 	HeapTuple	tup;
+	Form_pg_database datform;
 	int			notherbackends;
 	int			npreparedxacts;
 	int			nslots,
@@ -1684,17 +1685,6 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 						dbname),
 				 errdetail_busy_db(notherbackends, npreparedxacts)));
 
-	/*
-	 * Remove the database's tuple from pg_database.
-	 */
-	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(db_id));
-	if (!HeapTupleIsValid(tup))
-		elog(ERROR, "cache lookup failed for database %u", db_id);
-
-	CatalogTupleDelete(pgdbrel, &tup->t_self);
-
-	ReleaseSysCache(tup);
-
 	/*
 	 * Delete any comments or security labels associated with the database.
 	 */
@@ -1711,6 +1701,31 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 */
 	dropDatabaseDependencies(db_id);
 
+	/*
+	 * Tell the cumulative stats system to forget it immediately, too.
+	 */
+	pgstat_drop_database(db_id);
+
+	tup = SearchSysCacheCopy1(DATABASEOID, ObjectIdGetDatum(db_id));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR, "cache lookup failed for database %u", db_id);
+	datform = (Form_pg_database) GETSTRUCT(tup);
+
+	/*
+	 * Subsequent actions are not transactional (consider
+	 * DropDatabaseBuffers() discarding modified buffers). But we might crash
+	 * or get interrupted below. To prevent accessing a database with invalid
+	 * contents, mark the database as invalid using an in-place update.
+	 */
+	datform->datconnlimit = -2;
+	heap_inplace_update(pgdbrel, tup);
+
+	/*
+	 * Also delete the tuple - transactionally. If this transaction commits,
+	 * the row will be gone, but if we fail, dropdb() can be invoked again.
+	 */
+	CatalogTupleDelete(pgdbrel, &tup->t_self);
+
 	/*
 	 * Drop db-specific replication slots.
 	 */
@@ -1723,11 +1738,6 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 */
 	DropDatabaseBuffers(db_id);
 
-	/*
-	 * Tell the cumulative stats system to forget it immediately, too.
-	 */
-	pgstat_drop_database(db_id);
-
 	/*
 	 * Tell checkpointer to forget any pending fsync and unlink requests for
 	 * files in the database; else the fsyncs will fail at next checkpoint, or
@@ -2339,6 +2349,14 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
 	datform = (Form_pg_database) GETSTRUCT(tuple);
 	dboid = datform->oid;
 
+	if (datform->datconnlimit == -2)
+	{
+		ereport(FATAL,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("database \"%s\" is invalid", stmt->dbname),
+				errdetail("Use DROP DATABASE to drop invalid databases."));
+	}
+
 	if (!object_ownercheck(DatabaseRelationId, dboid, GetUserId()))
 

Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Andres Freund
Hi,

On 2023-05-08 14:04:00 -0700, Andres Freund wrote:
> But perhaps a similar approach could be the solution? My gut says that the
> rough direction might allow us to keep dropdb() a single transaction.

I started to hack on the basic approach of committing after the catalog
changes. But then I started wondering if we're not tackling this all wrong.


We don't actually want to drop the catalog contents early as that prevents the
database from being dropped again, while potentially leaving behind contents
in the case of a crash / interrupt.

Similarly, the fact that we only commit the transaction at the end of
createdb() leads to interrupts / crashes orphaning the contents of that
[partial] database.  We also have similar issues with movedb(), I think.


This is a non-transactional operation. I think we should copy the approach of
the CONCURRENTLY operations. Namely add a new column to pg_database,
indicating whether the database contents are valid. An invalid database can be
dropped, but not connected to.

Then we could have createdb() commit before starting to create the target
database directory (with invalid = true, of course). After all the filesystem
level stuff is complete, set invalid = false.

For dropping a database we'd use heap_inplace_update() to set invalid = true
just before the DropDatabaseBuffers(), preventing any connections after that
point.

Making movedb() safe is probably a bit harder - I think it'd temporarily
require two pg_database entries?


Of course we can't add a new column in the back branches. IIRC we had a
similar issue with CIC at some point, and just ended up misusing some other
column for the back branches?  We could e.g. use datconnlimit == -2 for that
purpose (but would need to make sure that ALTER DATABASE can't unset it).
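
A minimal sketch of what that back-branch route would look like from the
user's side, assuming datconnlimit == -2 ends up as the invalid-database
marker (the database name is hypothetical):

-- list databases left invalid by an interrupted DROP DATABASE
SELECT datname FROM pg_database WHERE datconnlimit = -2;
-- the only supported action on such a database would be dropping it again
DROP DATABASE some_invalid_db;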


My current gut feeling is that we should use datconnlimit == -2 to prevent
connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
column in 16, for both createdb() and dropdb().  In some ways handling
createdb() properly is a new feature, but it's also arguably a bug that we
leak the space - and I think the code will be better if we work on both
together.

Greetings,

Andres Freund




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Michael Paquier
On Mon, May 08, 2023 at 06:04:23PM -0400, Tom Lane wrote:
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

And the odds do indeed get much better once I use these two toys:
CREATE OR REPLACE FUNCTION create_tables(num_tables int)
  RETURNS VOID AS
  $func$
  BEGIN
  FOR i IN 1..num_tables LOOP
EXECUTE format('
  CREATE TABLE IF NOT EXISTS %I (id int)', 't_' || i);
  END LOOP;
END
$func$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION drop_tables(num_tables int)
  RETURNS VOID AS
  $func$
  BEGIN
  FOR i IN 1..num_tables LOOP
EXECUTE format('
  DROP TABLE IF EXISTS %I', 't_' || i);
  END LOOP;
END
$func$ LANGUAGE plpgsql;

And then use this loop with the others I have mentioned upthread (just
create origindb and the functions in it):
while true;
  do psql -c 'select create_tables(1000)' origindb > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
  psql -c 'select drop_tables(1000)' origindb > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
done;

On top of that, I have also been able to reproduce the issue on HEAD,
and luckily some pg_class file remains around, full of zeros:
$ hexdump ./base/199634/1259
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*

The contents of 2662, though, looked OK.

Echoing Alvaro..  Could we, err, revisit the choice of making WAL_LOG
the default strategy even for this set of minor releases?  FWIW, I've
mentioned that this choice was too aggressive in the thread of
8a86618..
--
Michael




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Thomas Munro
On Tue, May 9, 2023 at 10:04 AM Tom Lane  wrote:
> Michael Paquier  writes:
> > One thing I was wondering about to improve the odds of a hit is to
> > be more aggressive with the number of relations created at once, so that
> > we are much more aggressive with the number of pages extended in
> > pg_class in the origin database.
>
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

Here's an easy way:

@@ -1689,6 +1689,14 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	/* Close all smgr fds in all backends. */
 	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
 
+	/* XXX pretend one of the above steps got interrupted by a statement
+	 * timeout or ^C */
+	if (random() % 2 == 0)
+	{
+		QueryCancelPending = true;
+		InterruptPending = true;
+		CHECK_FOR_INTERRUPTS();
+	}

postgres=# create database db2;
CREATE DATABASE
postgres=# drop database db2;
ERROR:  canceling statement due to user request

$ psql db2
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432"
failed: PANIC:  could not open critical system index 2662

$ od -t x1 base/17/2662
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000
$ od -t x1 base/17/2837
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000
$ od -t x1 base/17/2840
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Tom Lane
Michael Paquier  writes:
> One thing I was wondering about to improve the odds of a hit is to
> be more aggressive with the number of relations created at once, so that
> we are much more aggressive with the number of pages extended in
> pg_class in the origin database.

Andres seems to think it's a problem with aborting a DROP DATABASE.
Adding more data might serve to make the window wider, perhaps.

regards, tom lane




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Michael Paquier
On Mon, May 08, 2023 at 07:15:20PM +0530, Dilip Kumar wrote:
> I am able to reproduce this using the steps given above, I am also
> trying to analyze this further.  I will send the update once I get
> some clue.

Have you been able to reproduce this on HEAD or at the top of
REL_15_STABLE, or is that 15.2-ish without fa5dd46?

One thing I was wondering about to improve the odds of a hit is to
be more aggressive with the number of relations created at once, so that
we are much more aggressive with the number of pages extended in
pg_class in the origin database.
--
Michael




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Andres Freund
Hi,

On 2023-05-08 20:27:14 +, Evgeny Morozov wrote:
> On 8/05/2023 9:47 pm, Andres Freund wrote:
> > Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> > Either due the connection being terminated or a crash?
>
> I've uploaded an edited version of the PG log for the time as
> https://objective.realityexists.net/temp/log-extract-2023-05-02.txt
> (test_behavior_638186279733138190 and test_behavior_638186280406544656
> are the two DBs that got corrupted).
>
> I cannot see any crash in the test logs or the PG logs, but whether it
> was interrupted is less clear. I don't know whether the tests ran
> successfully up to the point where they tried to drop the DBs (I've
> since added logging to show that next time), but DROP DATABASE did not
> return after 30 seconds and the client library (Npgsql) then tried to
> cancel the requests. We then tried to drop the DB again, with the same
> results in both cases. After the second attempts timed out we closed the
> connections anyway - so maybe that was the interruption?

Yep, that is the interruption.  I suspect that something was not processing
interrupts, which then led the WaitForProcSignalBarrier() in dropdb() to
block.

Are you using any extensions? Do you have any chance to figure out what
statements were running concurrently with the DROP DATABASE?


Seems we figured out what the problem is... Just need to figure out how to fix
it.

I think the minimal fix will entail moving the commit of the transaction into
dropdb(). We need to commit before executing DropDatabaseBuffers(), as the
database is inconsistent after that point.  Luckily DROP DATABASE already
can't be executed in a transaction, so that's easily possible.

That means we'd lose track of the files for the database in case of a crash,
but that's surely better than a corrupt database. And it's already the case
for CREATE DATABASE.


Additionally we probably need to prevent processing cancellations between
DropDatabaseBuffers() and remove_dbtablespaces(). But we can't just stop
accepting interrupts, because that'd break WaitForProcSignalBarrier(). Ah,
what fun.

Unfortunately QueryCancelHoldoffCount is not sufficient for this purpose - we
don't just need to prevent cancellations, we also can't accept terminations
either...

For a bit I thought we might be able to just reorder operations, moving the
WaitForProcSignalBarrier() earlier. Unfortunately I don't think that works,
because until DropDatabaseBuffers() has executed, backends might write such
buffers back; we need to do WaitForProcSignalBarrier() after that.


I've toyed with the idea of splitting DropDatabaseBuffers() into two. In a
first pass, mark all buffers as IO_IN_PROGRESS (plus perhaps some other
flag). That'd prevent new IO from being started for those buffers. Then
WaitForProcSignalBarrier(), to make sure nobody has an open FD for those
files. Then actually drop all those buffers and unlink the files. That now can
happen with interrupts held, without any chance of being blocked, afaict.  In
case of being cancelled during WaitForProcSignalBarrier(), AbortBufferIO()
would remove IO_IN_PROGRESS, and everything would be fine.

I don't like the idea of WaitForProcSignalBarrier() while having buffers
marked as in-progress. I don't *immediately* see a deadlock danger, but I'd be
unsurprised if there were some.

But perhaps a similar approach could be the solution? My gut says that the
rough direction might allow us to keep dropdb() a single transaction.

Greetings,

Andres Freund




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Evgeny Morozov
On 8/05/2023 9:47 pm, Andres Freund wrote:
> Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> Either due the connection being terminated or a crash?

I've uploaded an edited version of the PG log for the time as
https://objective.realityexists.net/temp/log-extract-2023-05-02.txt
(test_behavior_638186279733138190 and test_behavior_638186280406544656
are the two DBs that got corrupted).

I cannot see any crash in the test logs or the PG logs, but whether it
was interrupted is less clear. I don't know whether the the tests ran
successfully up to the point where they tried to drop the DBs (I've
since added logging to show that next time), but DROP DATABASE did not
return after 30 seconds and the client library (Npgsql) then tried to
cancel the requests. We then tried to drop the DB again, with the same
results in both cases. After the second attempts timed out we closed the
connections anyway - so maybe that was the interruption?


> As described in
> https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
> we don't handle that correctly for DROP DATABASE.
>
> I think that might actually fit the symptoms - the DropDatabaseBuffers() will
> throw away the dirty buffer contents from the WAL strategy CREATE DATABASE,
> but if you then get cancelled at a point before all the files are removed,
> the on-disk files with all-zeroes would remain.
Oooh, that does seem to fit! Thank you for digging that up.




Re: ICU, locale and collation question

2023-05-08 Thread Laurenz Albe
On Mon, 2023-05-08 at 16:35 +0200, Oscar Carlberg wrote:
> We have a bunch of existing Postgres 10 clusters running on CentOS 7, 
> which have been initialized (initdb) with these collation options;
> 
> -E 'UTF-8'
> --lc-collate=sv_SE.UTF-8
> --lc-ctype=sv_SE.UTF-8
> --lc-monetary=sv_SE.UTF-8
> --lc-numeric=sv_SE.UTF-8
> --lc-time=sv_SE.UTF-8
> --lc-messages=en_US.UTF-8
> 
> And createdb were provided with these locale options when setting up 
> databases:
> --lc-collate=sv_SE.UTF-8
> --lc-ctype=sv_SE.UTF-8
> 
> We're upgrading the servers using logical replication, and would like to 
> take the opportunity to switch to ICU rather than relying on glibc, to 
> avoid future problems with index corruption if using physical 
> replication between servers with different versions of glibc.
> 
> We're trying to figure out the most correct way to configure postgres to 
> achieve this. Currently we have:
> 
> -E 'UTF-8'
> --locale-provider=icu
> --icu-locale=sv-SE-x-icu
> 
> And createdb are provided with locale options:
> --lc-collate=C
> --lc-ctype=C
> 
> Is this a safe configuration to avoid index corruption, and other 
> problems, while still being compatible with the previous locale 
> settings?

Yes, that is safe.

But it is not compatible with the old setup when it comes to lc_time,
lc_messages and the others.  You should use sv_SE.UTF-8 for these
locale categories.

Note that that won't cause problems with upgrading the C library.
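
A sketch of that combination (the database name is hypothetical), keeping
ICU for collation while restoring the old libc settings for the other
locale categories; the per-database settings take effect for new sessions:

CREATE DATABASE mydb TEMPLATE template0 ENCODING 'UTF8'
    LOCALE_PROVIDER icu ICU_LOCALE 'sv-SE-x-icu'
    LC_COLLATE 'C' LC_CTYPE 'C';
ALTER DATABASE mydb SET lc_time = 'sv_SE.UTF-8';
ALTER DATABASE mydb SET lc_monetary = 'sv_SE.UTF-8';
ALTER DATABASE mydb SET lc_numeric = 'sv_SE.UTF-8';
ALTER DATABASE mydb SET lc_messages = 'en_US.UTF-8';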

Yours,
Laurenz Albe




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Andres Freund
Hi,

On 2023-05-07 16:10:28 +, Evgeny Morozov wrote:
> Yes, kind of. We have a test suite that creates one test DB and runs a
> bunch of tests on it. Two of these tests, however, create another DB
> each (also by cloning the same template DB) in order to test copying
> data between DBs. It's only these "extra" DBs that were corrupted, at
> least on this occasion.

Did you have any occasions where CREATE or DROP DATABASE was interrupted?
Either due the connection being terminated or a crash?

As described in
https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
we don't handle that correctly for DROP DATABASE.

I think that might actually fit the symptoms - the DropDatabaseBuffers() will
throw away the dirty buffer contents from the WAL strategy CREATE DATABASE,
but if you then get cancelled at a point before all the files are removed,
the on-disk files with all-zeroes would remain.

Greetings,

Andres Freund




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Tom Lane
Alvaro Herrera  writes:
> Maybe it would be sensible to make STRATEGY=FILE_COPY the default
> again, for branch 15, before today's release.

If we had more than one such report, I'd be in favor of that.
But I think it's a bit premature to conclude that the copy
strategy is to blame.

regards, tom lane




ICU, locale and collation question

2023-05-08 Thread Oscar Carlberg

Hello,

We have a bunch of existing Postgres 10 clusters running on CentOS 7, 
which have been initialized (initdb) with these collation options;


-E 'UTF-8'
--lc-collate=sv_SE.UTF-8
--lc-ctype=sv_SE.UTF-8
--lc-monetary=sv_SE.UTF-8
--lc-numeric=sv_SE.UTF-8
--lc-time=sv_SE.UTF-8
--lc-messages=en_US.UTF-8

And createdb were provided with these locale options when setting up 
databases:

--lc-collate=sv_SE.UTF-8
--lc-ctype=sv_SE.UTF-8

\l in psql gives:

   Name   |   Owner   | Encoding |   Collate   |    Ctype    |
----------+-----------+----------+-------------+-------------+
 test-db  | test-user | UTF8     | sv_SE.UTF-8 | sv_SE.UTF-8 |



We're upgrading the servers using logical replication, and would like to 
take the opportunity to switch to ICU rather than relying on glibc, to 
avoid future problems with index corruption if using physical 
replication between servers with different versions of glibc.


We're trying to figure out the most correct way to configure postgres to 
achieve this. Currently we have:


-E 'UTF-8'
--locale-provider=icu
--icu-locale=sv-SE-x-icu

And createdb are provided with locale options:
--lc-collate=C
--lc-ctype=C

\l in psql now gives:

  Name   |   Owner   | Encoding | Collate | Ctype | ICU Locale  | Locale Provider |
---------+-----------+----------+---------+-------+-------------+-----------------+
 test-db | test-user | UTF8     | C       | C     | sv-SE-x-icu | icu             |


Is this a safe configuration to avoid index corruption, and other 
problems, while still being compatible with the previous locale 
settings? We have done some testing, and it appears ORDER BY does sort
rows according to Swedish collation rules in the ICU-configured test-db.


We're uncertain since this blogpost -> 
https://peter.eisentraut.org/blog/2022/09/26/icu-features-in-postgresql-15
mentions that there is still some Postgres code relying on libc locale
facilities. Should we set lc-collate and lc-ctype to sv_SE.UTF-8 when
creating databases, in addition to the ICU options provided to initdb, due
to this? Will we still be safe from glibc-related corruption as long as
--locale-provider=icu --icu-locale=sv-SE-x-icu is set?


Best Regards,

Oscar


--
The content of this e-mail message is confidential and intended solely for
the addressee. Any distribution, copying or use of the content is
prohibited without the sender's permission. If this message has reached the
wrong addressee by mistake, please delete the original message and notify
the sender by e-mail.




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Dilip Kumar
On Mon, May 8, 2023 at 7:55 AM Michael Paquier  wrote:
>
> On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> > Bug-in-PostgreSQL explanations could include that we forgot it was
> > dirty, or some backend wrote it out to the wrong file; but if we were
> > forgetting something like permanent or dirty, would there be a more
> > systematic failure?  Oh, it could require special rare timing if it is
> > similar to 8a8661828's confusion about permanence level or otherwise
> > somehow not setting BM_PERMANENT, but in the target blocks, so I think
> > that'd require a checkpoint AND a crash.  It doesn't reproduce for me,
> > but perhaps more unlucky ingredients are needed.
> >
> > Bug-in-OS/FS explanations could include that a whole lot of writes
> > were mysteriously lost in some time window, so all those files still
> > contain the zeroes we write first in smgrextend().  I guess this
> > previously rare (previously limited to hash indexes?) use of sparse
> > file hole-punching could be a factor in an it's-all-ZFS's-fault
> > explanation:
>
> Yes, you would need a bit of all that.
>
> I can reproduce the same backtrace here.  That's just my usual laptop
> with ext4, so this would be a Postgres bug.  First, here are the four
> things running in parallel so that I can get a failure in loading a
> critical index when connecting:
> 1) Create and drop a database with WAL_LOG as strategy and the
> regression database as template:
> while true; do
>   createdb --template=regression --strategy=wal_log testdb;
>   dropdb testdb;
> done
> 2) Feeding more data to pg_class in the middle, while testing the
> connection to the database created:
> while true;
>   do psql -c 'create table popo as select 1 as a;' regression > /dev/null 
> 2>&1 ;
>   psql testdb -c "select 1" > /dev/null 2>&1 ;
>   psql -c 'drop table popo' regression > /dev/null 2>&1 ;
>   psql testdb -c "select 1" > /dev/null 2>&1 ;
> done;
> 3) Force some checkpoints:
> while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done
> 4) Force a few crashes and recoveries:
> while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done
>

I am able to reproduce this using the steps given above, I am also
trying to analyze this further.  I will send the update once I get
some clue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: huge discrepancy between EXPLAIN cost and actual time (but the table has just been ANALYZED)

2023-05-08 Thread Tom Lane
Kent Tong  writes:
> I have a complex query involving over 15 joins and a CTE query and it takes
> over 17s to complete.

If you want constructive comments about that, you're going to need to show
us the whole thing (and the underlying tables' schemas).  Little bits in
isolation are not very intelligible.

https://wiki.postgresql.org/wiki/Slow_Query_Questions

One thought is that you might need to raise from_collapse_limit and/or
join_collapse_limit to at least 15 to get the planner to do a complete
search of the join-order space.
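
For example, assuming the session is still at the defaults of 8 (16 here is
just an illustrative value):

SET from_collapse_limit = 16;
SET join_collapse_limit = 16;
-- then re-run EXPLAIN ANALYZE on the problem query in the same session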

regards, tom lane




Re: huge discrepancy between EXPLAIN cost and actual time (but the table has just been ANALYZED)

2023-05-08 Thread Ron

On 5/8/23 07:29, Kent Tong wrote:

Hi,

I have a complex query involving over 15 joins and a CTE query and it 
takes over 17s to complete. The output of EXPLAIN ANALYZE includes 
(somewhere deep inside):


Index Scan using document_pkey on document document0_
 (cost=0.29..8.31 rows=1 width=3027) (actual time=16243.959..16243.961
rows=1 loops=1)


This shows an index scan with a very small cost but a very large actual 
time. The strange thing is, all the tables have just been analyzed with 
the ANALYZE command (it is not a foreign table). Furthermore, if I run a 
simple query using that index, both the cost and the actual time are small.


Another snippet is:

  -> CTE Scan on all_related_document p  (cost=1815513.32..3030511.77
     rows=241785 width=16) (actual time=203.969..203.976 rows=0 loops=1)


I think the cost-actual time discrepancy is fine as it is a recursive CTE 
so postgresql can't estimate the cost well. It is materialized and a full 
table scan is performed. However, the actual time is not that bad.  Also, 
the estimated rows and the actual rows are also vastly different, but I 
guess this is fine, isn't it?


Any idea how I should check further?


ANALYZE just samples the table.  If the data within the relevant indexed
columns isn't evenly distributed, then the statistics might not show the
true data distribution.
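
If that turns out to be the case, one knob worth trying is a larger
per-column sample before re-analyzing (the column name is hypothetical):

ALTER TABLE document ALTER COLUMN some_column SET STATISTICS 1000;
ANALYZE document;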


--
Born in Arizona, moved to Babylonia.

Re: Additive backup and restore?

2023-05-08 Thread Ron

On 5/8/23 05:24, Age Apache wrote:
I am designing a database for a web application. In the near future I will 
require past data for Audit, Security and Analysis purposes. But storing
all that data will increase the database size. What are some well known 
approaches to archiving data for later use without increasing the database 
size?


One approach I am thinking of is compressed additive backup


Additive?

Why not one compressed archive file per month per table?

and restore i.e. copy the rows of the tables that will be required later 
and store them in a compressed form, and then delete those rows from the 
table. And when the data is required then restore them from the backup 
files in an additive way.


Is there an easy way to achieve this, say via pg_dump?


If the tables are partitioned on the relevant date field, then archiving old 
data will be "trivially" easy.


If not, then it'll still be "easy":
- COPY TO a file
- compress it
- store it somewhere.

Reverse to "de-archive" the data.
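
A sketch of that manual path, with hypothetical table, column, and file
names:

-- archive: export the old rows, then remove them (server-side COPY needs a
-- role in pg_write_server_files; psql's \copy writes client-side instead)
COPY (SELECT * FROM audit_log WHERE created_at < DATE '2023-01-01')
    TO '/archive/audit_log_2022.csv' WITH (FORMAT csv);
DELETE FROM audit_log WHERE created_at < DATE '2023-01-01';

-- de-archive later (after decompressing the file):
COPY audit_log FROM '/archive/audit_log_2022.csv' WITH (FORMAT csv);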

--
Born in Arizona, moved to Babylonia.




huge discrepancy between EXPLAIN cost and actual time (but the table has just been ANALYZED)

2023-05-08 Thread Kent Tong
Hi,

I have a complex query involving over 15 joins and a CTE query and it takes
over 17s to complete. The output of EXPLAIN ANALYZE includes (somewhere
deep inside):

Index Scan using document_pkey on document document0_  (cost=0.29..8.31
rows=1 width=3027) (actual time=16243.959..16243.961 rows=1 loops=1)


This shows an index scan with a very small cost but a very large actual
time. The strange thing is, all the tables have just been analyzed with the
ANALYZE command (it is not a foreign table). Furthermore, if I run a simple
query using that index, both the cost and the actual time are small.

Another snippet is:

-> CTE Scan on all_related_document p  (cost=1815513.32..3030511.77
   rows=241785 width=16) (actual time=203.969..203.976 rows=0 loops=1)


I think the cost-actual time discrepancy is fine as it is a recursive CTE
so postgresql can't estimate the cost well. It is materialized and a full
table scan is performed. However, the actual time is not that bad.  Also,
the estimated rows and the actual rows are also vastly different, but I
guess this is fine, isn't it?

Any idea how I should check further?

Many thanks in advance

-- 
Kent Tong
IT author and consultant, child education coach


Re: Additive backup and restore?

2023-05-08 Thread Andreas Kretschmer
Consider table partitioning. You can detach, save and delete partitions, and 
you can restore and attach partitions.
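
For example (table and partition names are hypothetical, assuming range
partitioning by date):

ALTER TABLE audit_log DETACH PARTITION audit_log_2022;
-- dump and compress audit_log_2022, then: DROP TABLE audit_log_2022;
-- later, restore the table from the dump and re-attach it:
ALTER TABLE audit_log ATTACH PARTITION audit_log_2022
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');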

On 8 May 2023 12:24:06 CEST, Age Apache  wrote:
>I am designing a database for a web application. In the near future I will
>require past data for Audit, Security and Analysis purposes. But storing all
>that data will increase the database size. What are some well known
>approaches to archiving data for later use without increasing the database
>size?
>
>One approach I am thinking of is compressed additive backup and restore
>i.e. copy the rows of the tables that will be required later and store them
>in a compressed form, and then delete those rows from the table. And when
>the data is required then restore them from the backup files in an additive
>way.
>
>Is there an easy way to achieve this, say via pg_dump?


Additive backup and restore?

2023-05-08 Thread Age Apache
I am designing a database for a web application. In the near future I will
require past data for Audit, Security and Analysis purposes. But storing all
that data will increase the database size. What are some well known
approaches to archiving data for later use without increasing the database
size?

One approach I am thinking of is compressed additive backup and restore
i.e. copy the rows of the tables that will be required later and store them
in a compressed form, and then delete those rows from the table. And when
the data is required then restore them from the backup files in an additive
way.

Is there an easy way to achieve this, say via pg_dump?


Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Alvaro Herrera
On 2023-May-07, Thomas Munro wrote:

> Did you previously run this same workload on versions < 15 and never
> see any problem?  15 gained a new feature CREATE DATABASE ...
> STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
> bug somewhere near that, though I have no specific idea.  If you
> explicitly added STRATEGY=FILE_COPY to your CREATE DATABASE commands,
> you'll get the traditional behaviour.

Maybe it would be sensible to make STRATEGY=FILE_COPY the default
again, for branch 15, before today's release.
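
In the meantime, the traditional behaviour can be requested explicitly per
database, as Thomas notes above (database and template names here are
hypothetical):

CREATE DATABASE testdb TEMPLATE mytemplate STRATEGY file_copy;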

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Evgeny Morozov
On 8/05/2023 4:24 am, Michael Paquier wrote:
> here are the four things running in parallel so that I can get a failure
> in loading a critical index when connecting

Wow, that is some amazing detective work! We do indeed create tables
during our tests, specifically partitions of tables copied from the
template DB. Checkpoints seem to be happening every few minutes (we
don't force them, but there is a big DB with more writes on the same
instance - it's probably due to that). PG is not crashing in our case,
though - not this time.

Do you have any idea why the "drop database" command would have timed
out (not completed after 30 seconds) for the corrupted DBs?

On 8/05/2023 4:17 am, Thomas Munro wrote:
> Maybe you could do some one way and some the other, so that we try to
> learn more?
Do you still want me to try this given what Michael has found? Or
anything else to help narrow this down?




Re: Hung Query with No Locking Issues

2023-05-08 Thread Michael P. McDonnell
Okay - that worked.
How did you know that would work? That's incredible.

On Sun, May 7, 2023 at 4:25 PM Tom Lane  wrote:

> "Michael P. McDonnell"  writes:
> > I have 2 stored procedures that need to run back to back. It could
> > convert to a single one - but it's easier from a maintenance perspective
> > to keep them separated.
>
> > The first procedure effectively is
> > INSERT INTO table_b () SELECT  FROM _table_a_;
> > COMMIT;
>
> > Total execution time - about 180s. Nothing in the pg_locks table and
> > nothing in the pg_stat_activity table suggests anything is hung over.
>
> > The second procedure mutates table_b data into table_b_collapsed
> > INSERT INTO table_c () SELECT _ FROM _table_b_ JOIN _table_b as b1_
> > JOIN _table_b as b2_ JOIN _table_b as b3_, etc...;
> > COMMIT;
>
> > The first time I run my second stored procedure - it hangs for up to 8
> > hours.
> > If I immediately cancel and re-run the second stored procedure it runs
> > in 2 seconds.
>
> Perhaps an "ANALYZE table_b" in between would help.
>
> regards, tom lane
>
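
The working sequence, with hypothetical procedure names:

CALL load_table_b();
ANALYZE table_b;   -- refresh planner statistics before the big self-joins
CALL build_table_c();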


Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Michael Paquier
On Mon, May 08, 2023 at 02:46:37PM +1200, Thomas Munro wrote:
> That sounds like good news, but I'm still confused: do you see all 0s
> in the target database (popo)'s catalogs, as reported (and if so can
> you explain how they got there?), or is it regression that is
> corrupted in more subtle ways also involving 1s?

Nope, I have not been able to confirm that yet without 8a86618.  The
test runs at a high frequency, so that's kind of hard to catch.  I
have not been able to get things in a state where I could look at a
FPW for pg_class or its index, either, in a way similar to Evgeny.
--
Michael

