Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 01, 2022 at 05:35:23PM +1200, Thomas Munro wrote: > So it *looks* like it finished early (and without the expected > error?). But it also looks like it replayed that record, according to > the page LSN. So which is it? Could you recompile with WAL_DEBUG > defined in pg_config_manual.h, and run recovery with wal_debug = on, > and see if it replays 1CAF84B0? This is with 6672d79139 un-reverted. $ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on 2>&1 |grep 1CAF84B0 || echo not found not found $ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on -c recovery_prefetch=no 2>&1 |grep 1CAF84B0 || echo not found < 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF8478; LSN 1201/1CAF84B0: prev 1201/1CAF8438; xid 0; len 2; blkref #0: rel 1663/16881/2840, blk 53 - Heap2/VACUUM: nunused 4 < 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF84B0; LSN 1201/1CAF84F0: prev 1201/1CAF8478; xid 0; len 5; blkref #0: rel 1663/16881/2840, fork 2, blk 0; blkref #1: rel 1663/16881/2840, blk 53 - Heap2/VISIBLE: cutoff xid 3678741092 flags 0x01 < 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF84F0; LSN 1201/1CAF8AC0: prev 1201/1CAF84B0; xid 0; len 2; blkref #0: rel 1663/16881/1259, blk 1 FPW, compression method: zstd - Heap/INPLACE: off 33 (Note that "compression method: zstd" is a local change to xlog_block_info() which I just extracted from my original patch for wal_compression, after forgetting to compile --with-zstd. I'll mail about that at a later time...). -- Justin
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 5:18 PM Kyotaro Horiguchi wrote: > At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby > wrote in > > On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote: > > > Hmm. Justin, when you built from source, which commit were you at? > > > If it's REL_15_BETA3, > > > > No - it's: > > commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, > > origin/REL_15_STABLE) > > It's newer than eb29fa3889 (6672d79139 on master) so it is fixed at > that commit. Yeah. > > I wasn't sure what mean by "without that" , so here's a bunch of logs to > > sift through: So it *looks* like it finished early (and without the expected error?). But it also looks like it replayed that record, according to the page LSN. So which is it? Could you recompile with WAL_DEBUG defined in pg_config_manual.h, and run recovery with wal_debug = on, and see if it replays 1CAF84B0?
Re: FOR EACH ROW triggers, on partitioned tables, with indexes?
On Thu, Sep 01, 2022 at 04:19:37PM +1200, David Rowley wrote: > On Sat, 20 Aug 2022 at 09:18, Justin Pryzby wrote: > > Is it somwhow possible to call CreateTrigger() to create a FOR EACH ROW > > trigger, with an index, and not internally ? > > I've been looking over this and I very much agree that the code looks > very broken. As for whether this is dead code or not, I've been > looking at that too... > > At trigger.c:1147 we have: if (partition_recurse). partition_recurse > can only ever be true if isInternal == false per trigger.c:367's > "partition_recurse = !isInternal && stmt->row &&". isInternal is a > parameter to the function. Also, the code in question only triggers > when the indexOid parameter is a valid oid. So it should just be a > matter of looking for usages of CreateTriggerFiringOn() which pass > isInternal as false and pass a valid indexOid. > > There seems to be no direct calls doing this, but we do also call this > function via CreateTrigger() and I can see only 1 call to > CreateTrigger() that passes isInternal as false, but that explicitly > passes indexOid as InvalidOid, so this code looks very much dead to > me. > > Alvaro, any objections to just ripping this out? aka, the attached. It's possible that extensions or 3rd party code or forks use this, no ? In that case, it might be "not dead" .. > + * that ever changes then we'll need to quite code here to find > the quite? write? quire? acquire? quine? -- Justin
Re: question about access custom enum type from C
Hi David as you suggested create type first_type as enum ('red', 'green', 'brown', 'yellow', 'blue'); SELECT oid,typname,typlen,typtype from pg_type where typname='first_type' returns everything I was looking for thanks again, I think I’m all set dm > On Sep 1, 2022, at 12:49 AM, David Rowley wrote: > > (I think this is a better question for the general mailing list) > > On Thu, 1 Sept 2022 at 16:28, Dmitry Markman wrote: >> >> Hi, when I’m trying to access values of my custom enum type I created with >> >> create type colors as enum ('red', 'green', 'brown', 'yellow', 'blue'); >> >> I’m getting oid as 16387 and I can see it stored as a chars > > You might see the names if you query the table, but all that's stored > in the table is the numerical value. > > https://www.postgresql.org/docs/current/datatype-enum.html states "An > enum value occupies four bytes on disk.". > >> is number 16387 is always OID for enum type? > > I'm not sure where you got that number from. Perhaps it's the oid for > the pg_type record? The following would show it. > > select oid,typname from pg_type where typname = 'colors'; > >> if not how I can get information about type of the result if it’s custom >> enum type > > I'm not sure what you mean by "the result". Maybe pg_typeof(column) > might be what you want? You can do: SELECT pg_typeof(myenumcol) FROM > mytable; > > David
Re: pg15b3: recovery fails with wal prefetch enabled
At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby wrote in > On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote: > > On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi > > wrote: > > > Just for information, there was a fixed bug about > > > overwrite-aborted-contrecord feature, which causes this kind of > > > failure (xlog flush request exceeds insertion bleeding edge). If it is > > > that, it has been fixed by 6672d79139 two days ago. > > > > Hmm. Justin, when you built from source, which commit were you at? > > If it's REL_15_BETA3, > > No - it's: > commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, > origin/REL_15_STABLE) It's newer than eb29fa3889 (6672d79139 on master), so it is fixed as of that commit. > > If it's REL_15_BETA3, any chance you could cherry pick that change and > > check what happens? And without that, could you show what this logs > > for good and bad recovery settings? > > I wasn't sure what mean by "without that" , so here's a bunch of logs to > sift through: There's no need for cherry-picking. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: question about access custom enum type from C
Hi David, thanks a lot for your answer I got that number from PQparamtype I already see that 16387 is not a ‘constant’, if I have few custom types I got different numbers for them thanks dm > On Sep 1, 2022, at 12:49 AM, David Rowley wrote: > > (I think this is a better question for the general mailing list) > > On Thu, 1 Sept 2022 at 16:28, Dmitry Markman wrote: >> >> Hi, when I’m trying to access values of my custom enum type I created with >> >> create type colors as enum ('red', 'green', 'brown', 'yellow', 'blue'); >> >> I’m getting oid as 16387 and I can see it stored as a chars > > You might see the names if you query the table, but all that's stored > in the table is the numerical value. > > https://www.postgresql.org/docs/current/datatype-enum.html states "An > enum value occupies four bytes on disk.". > >> is number 16387 is always OID for enum type? > > I'm not sure where you got that number from. Perhaps it's the oid for > the pg_type record? The following would show it. > > select oid,typname from pg_type where typname = 'colors'; > >> if not how I can get information about type of the result if it’s custom >> enum type > > I'm not sure what you mean by "the result". Maybe pg_typeof(column) > might be what you want? You can do: SELECT pg_typeof(myenumcol) FROM > mytable; > > David
Re: question about access custom enum type from C
(I think this is a better question for the general mailing list) On Thu, 1 Sept 2022 at 16:28, Dmitry Markman wrote: > > Hi, when I’m trying to access values of my custom enum type I created with > > create type colors as enum ('red', 'green', 'brown', 'yellow', 'blue'); > > I’m getting oid as 16387 and I can see it stored as a chars You might see the names if you query the table, but all that's stored in the table is the numerical value. https://www.postgresql.org/docs/current/datatype-enum.html states "An enum value occupies four bytes on disk.". > is number 16387 is always OID for enum type? I'm not sure where you got that number from. Perhaps it's the oid for the pg_type record? The following would show it. select oid,typname from pg_type where typname = 'colors'; > if not how I can get information about type of the result if it’s custom enum > type I'm not sure what you mean by "the result". Maybe pg_typeof(column) might be what you want? You can do: SELECT pg_typeof(myenumcol) FROM mytable; David
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote: > On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi > wrote: > > At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro > > wrote in > > > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > > > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: > > > > end-of-recovery immediate wait > > > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of > > > > generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > > > base/16881/2840_vm > > > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request > > > > 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730 > > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > > > base/16881/2840_vm > > > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > > > prefetch tried to do too much. The VM runs centos7 under qemu. > > > > I'm making a copy of the data dir in cases it's needed. > > > > Just for information, there was a fixed bug about > > overwrite-aborted-contrecord feature, which causes this kind of > > failure (xlog flush request exceeds insertion bleeding edge). If it is > > that, it has been fixed by 6672d79139 two-days ago. > > Hmm. Justin, when you built from source, which commit were you at? > If it's REL_15_BETA3, No - it's: commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE) > If it's REL_15_BETA3, any chance you could cherry pick that change and > check what happens? And without that, could you show what this logs > And without that, could you show what this logs > for good and bad recovery settings? 
I wasn't sure what you meant by "without that", so here's a bunch of logs to sift through: At a203, with #define XLOGPREFETCHER_DEBUG_LEVEL NOTICE: [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 ... < 2022-08-31 23:31:38.690 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:31:40.204 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation < 2022-08-31 23:31:40.307 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation < 2022-08-31 23:31:40.493 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small < 2022-08-31 23:31:40.721 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.23 s, elapsed: 2.03 s < 2022-08-31 23:31:41.452 CDT >LOG: checkpoint starting: end-of-recovery immediate wait < 2022-08-31 23:31:41.698 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 < 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:31:41.698 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730 < 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:31:41.699 CDT >FATAL: checkpoint request failed < 2022-08-31 23:31:41.699 CDT >HINT: Consult recent messages in the server log for details. 
< 2022-08-31 23:31:41.704 CDT >LOG: startup process (PID 25046) exited with exit code 1 < 2022-08-31 23:31:41.704 CDT >LOG: terminating any other active server processes < 2022-08-31 23:31:41.705 CDT >LOG: shutting down due to startup process failure < 2022-08-31 23:31:41.731 CDT >LOG: database system is shut down With your patch: [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 ... < 2022-08-31 23:34:22.897 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:34:23.146 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation < 2022-08-31 23:34:23.147 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation < 2022-08-31 23:34:23.268 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small < 2022-08-31 23:34:23.323 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.29 s, system: 0.12 s, elapsed: 0.42 s < 2022-08-31 23:34:23.323 CDT >LOG: point 0: lastRec = 12011cadb300 < 2022-08-31 23:34:23.323 CDT >LOG: point 0: endOfLog = 12011cadb730 < 2022-08-31 23:34:23.323 CDT >LOG: XXX point 1: EndOfLog
Re: Add tracking of backend memory allocated to pg_stat_activity
At Wed, 31 Aug 2022 12:03:06 -0400, Reid Thompson wrote in > Attached is a patch to > Add tracking of backend memory allocated to pg_stat_activity > @@ -916,6 +930,7 @@ AllocSetAlloc(MemoryContext context, Size size) > return NULL; > > context->mem_allocated += blksize; > + pgstat_report_backend_mem_allocated_increase(blksize); I'm not sure this is acceptable. The function adds a branch even when the feature is turned off, which I think may cause a certain extent of performance degradation. A past threads [1], [2] and [3] might be informative. [1] https://www.postgresql.org/message-id/1434311039.4369.39.camel%40jeff-desktop [2] https://www.postgresql.org/message-id/72a656e0f71d0860161e0b3f67e4d771%40oss.nttdata.com [3] https://www.postgresql.org/message-id/0271f440ac77f2a4180e0e56ebd944d1%40oss.nttdata.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
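The objection above is about adding a branch to AllocSetAlloc's hot path even when the feature is disabled. A common way around that, used by other pgstat counters, is to accumulate into a plain local variable unconditionally and only copy the total to shared stats at report time. A minimal standalone sketch of that pattern (illustrative names only, not the actual patch or pgstat code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: the hot path does a single unconditional add into a
 * process-local counter (no enable/disable branch), and a separate report
 * step publishes the total.  "reported_bytes_allocated" stands in for the
 * shared-memory stats slot.
 */
static uint64_t local_bytes_allocated = 0;    /* hot-path accumulator */
static uint64_t reported_bytes_allocated = 0; /* stand-in for shared stats */

static void
track_alloc(size_t size)
{
    local_bytes_allocated += size;  /* unconditional: one add, no branch */
}

static void
flush_alloc_stats(void)
{
    reported_bytes_allocated = local_bytes_allocated;
}
```

Whether this is cheap enough for AllocSetAlloc would still need benchmarking, as the cited threads discuss.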
Re: [PATCH] Add native windows on arm64 support
On Wed, Aug 31, 2022 at 10:22:06PM -0400, Tom Lane wrote: > Michael Paquier writes: > > The input object could also be reworked so as we would not have any > > ordering issues, say "predeeppost". > > Changing only the path, you could use "/e/n2" instead of "node()", so > > as we'd always get the most internal member. > > Done that way. Cool, thanks. bowerbird looks happy after its first run. An argument in favor of backpatching is if one decides to build the code with MSVC and patch the scripts to enable ASLR. However, nobody has complained about that yet, so fine by me to leave this change only on HEAD for now. -- Michael signature.asc Description: PGP signature
question about access custom enum type from C
Hi, when I’m trying to access values of my custom enum type, which I created with create type colors as enum ('red', 'green', 'brown', 'yellow', 'blue'); I’m getting the oid as 16387, and I can see it stored as chars. Is 16387 always the OID for an enum type? If not, how can I get information about the type of the result if it’s a custom enum type? thanks in advance dm
Re: Solaris "sed" versus pre-v13 plpython tests
On Wed, Aug 31, 2022 at 06:25:01PM -0400, Tom Lane wrote: > I am not completely sure why buildfarm member wrasse isn't failing > similarly wrasse disabled plpython in v12-, from day one, due to this and a crash bug that I shelved. I will be interested to see how margay reacts to your fix.
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi wrote: > At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro > wrote in > > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: > > > end-of-recovery immediate wait > > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of > > > generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > > base/16881/2840_vm > > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 > > > is not satisfied --- flushed only to 1201/1CADB730 > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > > base/16881/2840_vm > > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > > prefetch tried to do too much. The VM runs centos7 under qemu. > > > I'm making a copy of the data dir in cases it's needed. > > Just for information, there was a fixed bug about > overwrite-aborted-contrecord feature, which causes this kind of > failure (xlog flush request exceeds insertion bleeding edge). If it is > that, it has been fixed by 6672d79139 two-days ago. Hmm. Justin, when you built from source, which commit were you at? If it's REL_15_BETA3, any chance you could cherry pick that change and check what happens? And without that, could you show what this logs for good and bad recovery settings? 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 248a40e8fa..1e435e81d5 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5308,6 +5308,7 @@ StartupXLOG(void) */ endOfRecoveryInfo = FinishWalRecovery(); EndOfLog = endOfRecoveryInfo->endOfLog; +elog(LOG, "XXX point 1: EndOfLog = %lx", EndOfLog); EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI; abortedRecPtr = endOfRecoveryInfo->abortedRecPtr; missingContrecPtr = endOfRecoveryInfo->missingContrecPtr; @@ -5446,7 +5447,9 @@ StartupXLOG(void) { Assert(!XLogRecPtrIsInvalid(abortedRecPtr)); EndOfLog = missingContrecPtr; +elog(LOG, "XXX clobbering EndOfLog"); } +elog(LOG, "XXX point 2: EndOfLog = %lx", EndOfLog); /* * Prepare to write WAL starting at EndOfLog location, and init xlog @@ -5456,6 +5459,7 @@ StartupXLOG(void) Insert = >Insert; Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec); Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog); +elog(LOG, "XXX point 3: Insert->CurrBytePos = %lx", Insert->CurrBytePos); /* * Tricky point here: lastPage contains the *last* block that the LastRec @@ -5554,6 +5558,7 @@ StartupXLOG(void) Insert->fullPageWrites = lastFullPageWrites; UpdateFullPageWrites(); +elog(LOG, "XXX point 4: Insert->CurrBytePos = %lx", Insert->CurrBytePos); /* * Emit checkpoint or end-of-recovery record in XLOG, if required. */ diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 141af04388..8c6753a8a8 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1434,6 +1434,8 @@ FinishWalRecovery(void) XLogPrefetcherBeginRead(xlogprefetcher, lastRec); (void) ReadRecord(xlogprefetcher, PANIC, false, lastRecTLI); endOfLog = xlogreader->EndRecPtr; +elog(LOG, "point 0: lastRec = %lx", lastRec); +elog(LOG, "point 0: endOfLog = %lx", endOfLog); /* * Remember the TLI in the filename of the XLOG segment containing the
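The debug elogs in the patch above print LSNs as one 64-bit hex value (e.g. 12011cadb730), while the regular log lines show the conventional hi/lo form (1201/1CADB730). For cross-checking the two, here is a standalone sketch of the split, following the convention of PostgreSQL's LSN_FORMAT_ARGS macro (high 32 bits / low 32 bits):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Split a 64-bit LSN into the "hi/lo" form used in server log messages. */
static void
lsn_to_parts(XLogRecPtr lsn, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t) (lsn >> 32);
    *lo = (uint32_t) lsn;
}

/* Recombine, e.g. to compare against the raw %lx values in the elog output. */
static XLogRecPtr
parts_to_lsn(uint32_t hi, uint32_t lo)
{
    return ((XLogRecPtr) hi << 32) | lo;
}
```

So "endOfLog = 12011cadb730" in the debug output is the same position as 1201/1CADB730 in the checkpoint error.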
Re: FOR EACH ROW triggers, on partitioned tables, with indexes?
On Sat, 20 Aug 2022 at 09:18, Justin Pryzby wrote: > Is it somwhow possible to call CreateTrigger() to create a FOR EACH ROW > trigger, with an index, and not internally ? I've been looking over this and I very much agree that the code looks very broken. As for whether this is dead code or not, I've been looking at that too... At trigger.c:1147 we have: if (partition_recurse). partition_recurse can only ever be true if isInternal == false per trigger.c:367's "partition_recurse = !isInternal && stmt->row &&". isInternal is a parameter to the function. Also, the code in question only triggers when the indexOid parameter is a valid oid. So it should just be a matter of looking for usages of CreateTriggerFiringOn() which pass isInternal as false and pass a valid indexOid. There seems to be no direct calls doing this, but we do also call this function via CreateTrigger() and I can see only 1 call to CreateTrigger() that passes isInternal as false, but that explicitly passes indexOid as InvalidOid, so this code looks very much dead to me. Alvaro, any objections to just ripping this out? aka, the attached. I've left an Assert() in there to ensure we notice if we're ever to start calling CreateTriggerFiringOn() with isInternal == false with a valid indexOid. 
David diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index 7661e004a9..08f9a307fd 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -1147,8 +1147,6 @@ CreateTriggerFiringOn(CreateTrigStmt *stmt, const char *queryString, if (partition_recurse) { PartitionDesc partdesc = RelationGetPartitionDesc(rel, true); - List *idxs = NIL; - List *childTbls = NIL; int i; MemoryContext oldcxt, perChildCxt; @@ -1158,53 +1156,23 @@ CreateTriggerFiringOn(CreateTrigStmt *stmt, const char *queryString, ALLOCSET_SMALL_SIZES); /* -* When a trigger is being created associated with an index, we'll -* need to associate the trigger in each child partition with the -* corresponding index on it. +* We don't currently expect to be called with a valid indexOid. If +* that ever changes then we'll need to quite code here to find the +* corresponding child index. */ - if (OidIsValid(indexOid)) - { - ListCell *l; - List *idxs = NIL; - - idxs = find_inheritance_children(indexOid, ShareRowExclusiveLock); - foreach(l, idxs) - childTbls = lappend_oid(childTbls, - IndexGetRelation(lfirst_oid(l), - false)); - } + Assert(!OidIsValid(indexOid)); oldcxt = MemoryContextSwitchTo(perChildCxt); /* Iterate to create the trigger on each existing partition */ for (i = 0; i < partdesc->nparts; i++) { - Oid indexOnChild = InvalidOid; - ListCell *l; - ListCell *l2; CreateTrigStmt *childStmt; RelationchildTbl; Node *qual; childTbl = table_open(partdesc->oids[i], ShareRowExclusiveLock); - /* Find which of the child indexes is the one on this partition */ - if (OidIsValid(indexOid)) - { - forboth(l, idxs, l2, childTbls) - { - if (lfirst_oid(l2) == partdesc->oids[i]) - { - indexOnChild = lfirst_oid(l); - break; - } - } - if (!OidIsValid(indexOnChild)) - elog(ERROR, "failed to find index matching index \"%s\" in partition \"%s\"", -get_rel_name(indexOid), - get_rel_name(partdesc->oids[i])); - } - /* * Initialize our fabricated parse node by copying the 
original * one, then resetting fields that we pass separately.
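The code being removed above paired two lists (child indexes and their owning tables, kept in step) and linearly searched them for the entry matching each partition. As a standalone distillation of that matching step (hypothetical names and array form, not the trigger.c code):

```c
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/*
 * Hypothetical sketch of the dead code's lookup: idxs[i] is a child index
 * whose table is childTbls[i] (the lists are kept in lockstep, like the
 * forboth() loop did).  Return the index on the given partition, or
 * InvalidOid if none matches -- the case the removed elog(ERROR) guarded.
 */
static Oid
find_index_on_partition(const Oid *idxs, const Oid *childTbls,
                        size_t n, Oid partitionOid)
{
    for (size_t i = 0; i < n; i++)
        if (childTbls[i] == partitionOid)
            return idxs[i];
    return InvalidOid;
}
```

Since no caller ever passes a valid indexOid with isInternal == false, this lookup was never reachable, which is what justifies replacing it with the Assert.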
Re: Reducing the chunk header sizes on all memory context types
David Rowley writes: > On Thu, 1 Sept 2022 at 16:06, Tom Lane wrote: >> It doesn't seem to be fixing any live bug in the back branches, >> but by the same token it's harmless. > I considered that an extension might use the Slab allocator with a > non-MAXALIGNED chunksize and might run into some troubles during > SlabCheck(). Oh, yeah, the sentinel_ok() change is a live bug. Extensions have no control over sizeof(SlabBlock) though. regards, tom lane
Re: Reducing the chunk header sizes on all memory context types
On Thu, 1 Sept 2022 at 16:06, Tom Lane wrote: > > David Rowley writes: > > Does anyone have any objections to d5ee4db0e in its entirety being > > backpatched? > > It doesn't seem to be fixing any live bug in the back branches, > but by the same token it's harmless. I considered that an extension might use the Slab allocator with a non-MAXALIGNED chunksize and might run into some troubles during SlabCheck(). David
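For context on "non-MAXALIGNED chunksize": MAXALIGN rounds a length up to the platform's maximum alignment. A sketch of the macro pair, assuming MAXIMUM_ALIGNOF == 8 (typical on 64-bit platforms; the real value is configure-detected):

```c
#include <stddef.h>

/*
 * Sketch of PostgreSQL's TYPEALIGN/MAXALIGN, hardcoding an 8-byte maximum
 * alignment for illustration.  ALIGNVAL must be a power of two.
 */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((size_t) (LEN) + ((ALIGNVAL) - 1)) & ~((size_t) ((ALIGNVAL) - 1)))
#define MAXALIGN(LEN) TYPEALIGN(8, LEN)
```

An extension passing, say, chunkSize = 12 to the Slab allocator would leave chunks at non-MAXALIGN'd offsets, which is the situation the sentinel_ok() fix in d5ee4db0e is concerned with.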
Re: Reducing the chunk header sizes on all memory context types
David Rowley writes: > Does anyone have any objections to d5ee4db0e in its entirety being > backpatched? It doesn't seem to be fixing any live bug in the back branches, but by the same token it's harmless. regards, tom lane
Re: Handle infinite recursion in logical replication setup
On Wed, Aug 31, 2022 at 11:35 AM Peter Smith wrote: > > Here are my review comments for v43-0001. > > == > > 1. Commit message > > The initial copy phase has no way to know the origin of the row data, > so if 'copy_data = true' in the step 4 below, log a warning to notify user > that potentially non-local data might have been copied. > > 1a. > "in the step 4" -> "in step 4" > > ~ > > 1b. > "notify user" -> "notify the user" > > == > > 2. doc/src/sgml/ref/create_subscription.sgml > > + > + If the subscription is created with origin = none and > + copy_data = true, it will check if the publisher has > + subscribed to the same table from other publishers and, if so, log a > + warning to notify the user to check the publisher tables. Before > continuing > + with other operations the user should check that publisher tables did not > + have data with different origins to prevent data inconsistency issues on > the > + subscriber. > + > > I am not sure about that wording saying "to prevent data inconsistency > issues" because I thought maybe is already too late because any > unwanted origin data is already copied during the initial sync. I > think the user can do it try to clean up any problems caused before > things become even worse (??) > > ~~~ > > You asked for my thoughts (see [1] 6b): > > > There is nothing much a user can do in this case. Only option would be > > to take a backup before the operation and restore it and then recreate > > the replication setup. I was not sure if we should document these > > steps as similar inconsistent data could get created without the > > origin option . Thoughts? > > I think those cases are a bit different: > > - For a normal scenario (i.e. origin=ANY) then inconsistent data just > refers to problems like maybe PK violations so I think you just get > what you get and have to deal with the errors... 
> > - But when the user said origin=NONE they are specifically saying they > do NOT want any non-local data, yet they still might end up getting > data contrary to their wishes. So I think maybe there ought to be > something documented to warn about this potentially happening. My > first impression is that all this seems like a kind of terrible > problem because if it happens then cleanup could be difficult - if > that subscriber in turn was also a publisher then this bad data might > have propagated all over the place already! Anyway, I guess all this > could be mentioned in some (new ?) section of the > logical-replication.sgml file (i.e. patch 0002). > > IDEA: I was wondering if the documentation should recommend (or maybe > we can even enforce this) that when using origin=NONE the user should > always create the subscription with enabled=FALSE. That way if there > are some warnings about potential bad origin data then there it might > be possible to take some action or change their mind *before* any > unwanted data pollutes everything. > > == > > 3. src/backend/commands/subscriptioncmds.c - check_pub_table_subscribed > > + if (count == 0) > + tablenames = makeStringInfo(); > + else > + appendStringInfoString(tablenames, ", "); > + > + appendStringInfo(tablenames, "%s.%s", nspname, relname); > + count++; > > I think the quotes are not correct - each separate table name should be > quoted. > > ~~~ > > 4. > > + /* > + * There might be many tables present in this case, we will display a > + * maximum of 100 tables followed by "..." to indicate there might be > + * more tables. > + */ > + if (count == 100) > + { > + appendStringInfoString(tablenames, ", ..."); > + break; > + } > > 4a. > IMO this code should be part of the previous if/else > > e.g. > if (count == 0) ... makeStringInfo() > else if (count > 100) ... append ellipsis "..." and break out of the loop > else ... append table name to the message > > Doing it in this way means "..." 
definitely means there are more than > 100 bad tables. > > ~ > > 4b. > Changing like above (#4a) simplifies the comment because it removes > doubts like "might be…" since you already know you found the 101st > table. > > SUGGESTION (comment) > The message displays a maximum of 100 tables having potentially > non-local data. Any more than that will be indicated by "...". > > ~ > Are we using this style of an error message anywhere else in the code? If not, then I think we should avoid it here as well as it appears a bit adhoc to me in the sense that why restrict at 100 tables. Can we instead think of displaying the publications list in the message? So errdetail would be something like: Subscribed publications \"%s\" has been subscribed from some other publisher. Then we can probably give a SQL query for a user to find tables that may data from multiple origins especially if that is not complicated. Also, then we can change the function name to check_publications_origin(). Few other minor comments on v43-0001* 1. + ereport(WARNING, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("subscription %s
Re: Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)
On Wed, Aug 31, 2022 at 7:49 PM Jeff Janes wrote: > I like this addition, but I would also like to see how many pages got newly > set to all frozen by the vacuum. I'd say that that's independent work. Though I'm happy to discuss it now. It would be fairly straightforward to show something about the VM itself, but it's not entirely clear what aspects of the VM should be emphasized. Are we reporting on the state of the table, or work performed by VACUUM? You said you were interested in the latter already, but why prefer that over a summary of the contents of the VM at the end of the VACUUM? Are you concerned about the cost of setting pages all-visible? Do you have an interest in how VACUUM manages to set VM pages over time? Something else? We already call visibilitymap_count() at the end of every VACUUM, which scans the authoritative VM to produce a more-or-less consistent summary of the VM at that point in time. This information is then used to update pg_class.relallvisible (we don't do anything with the all-frozen number at all). Why not show that information in VERBOSE/autovacuum's log message? Does it really matter *when* a page became all-visible/all-frozen/unset? > Also, isn't all of vacuuming about XID processing? I think "frozen:" would > be a more suitable line prefix. That also works for me. I have no strong feelings here. -- Peter Geoghegan
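On the visibilitymap_count() point: the visibility map keeps two bits per heap page, all-visible and all-frozen, and the count is just a tally over those bits. A standalone sketch of that tally (bit values follow PostgreSQL's VISIBILITYMAP_ALL_VISIBLE / VISIBILITYMAP_ALL_FROZEN flags; the four-pages-per-byte packing here is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define VM_ALL_VISIBLE 0x01
#define VM_ALL_FROZEN  0x02

/*
 * Sketch of what visibilitymap_count() computes: scan a packed bitmap with
 * two bits per heap page and count pages marked all-visible / all-frozen.
 */
static void
vm_count(const uint8_t *map, size_t npages,
         size_t *all_visible, size_t *all_frozen)
{
    *all_visible = 0;
    *all_frozen = 0;
    for (size_t blk = 0; blk < npages; blk++)
    {
        uint8_t bits = (map[blk / 4] >> ((blk % 4) * 2)) & 0x03;

        if (bits & VM_ALL_VISIBLE)
            (*all_visible)++;
        if (bits & VM_ALL_FROZEN)
            (*all_frozen)++;
    }
}
```

A consistent snapshot of these two totals is what already feeds pg_class.relallvisible at the end of VACUUM, so surfacing both in VERBOSE output would not require new bookkeeping.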
Re: Reducing the chunk header sizes on all memory context types
On Tue, 30 Aug 2022 at 13:16, David Rowley wrote: > I'm also wondering if this should also be backpatched back to v10, > providing the build farm likes it well enough on master. Does anyone have any objections to d5ee4db0e in its entirety being backpatched? David
Re: pg15b3: recovery fails with wal prefetch enabled
At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro wrote in > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery > > immediate wait > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of > > generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > base/16881/2840_vm > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is > > not satisfied --- flushed only to 1201/1CADB730 > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > base/16881/2840_vm > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > prefetch tried to do too much. The VM runs centos7 under qemu. > > I'm making a copy of the data dir in cases it's needed. Just for information, there was a fixed bug about overwrite-aborted-contrecord feature, which causes this kind of failure (xlog flush request exceeds insertion bleeding edge). If it is that, it has been fixed by 6672d79139 two-days ago. http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=moue2vnhhml6b3t7w7qqvva...@mail.gmail.com http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Instrumented pages/tuples frozen in autovacuum's server log out (and VACUUM VERBOSE)
On Sat, Aug 20, 2022 at 7:29 PM Peter Geoghegan wrote: > XIDs processed: 45 pages from table (100.00% of total) had 1 tuples > frozen > I like this addition, but I would also like to see how many pages got newly set to all frozen by the vacuum. Would that be a convenient thing to also report here? Also, isn't all of vacuuming about XID processing? I think "frozen:" would be a more suitable line prefix. Cheers, Jeff
Re: Add the ability to limit the amount of memory that can be allocated to backends.
At Wed, 31 Aug 2022 12:50:19 -0400, Reid Thompson wrote in > Hi Hackers, > > Add the ability to limit the amount of memory that can be allocated to > backends. The patch seems to cap both memory-context allocations and DSM allocations made by a specific process under the same budget. In the first place, I don't think it's sensible to cap the amount of DSM allocations by a per-process budget. DSM is used by the pgstats subsystem. There can be cases where pgstat complains about denial of a DSM allocation after the budget has been exhausted by memory-context allocations, or where every command complains about denial of a memory-context allocation once the per-process budget has been exhausted by DSM allocations. That doesn't seem reasonable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg15b3: recovery fails with wal prefetch enabled
Some more details, in case they're important: First: the server has wal_compression=zstd (I wonder if something doesn't allow/accommodate compressed FPI?) I thought to mention that after compiling pg15 locally and forgetting to use --with-zstd. I compiled it to enable your debug logging, which wrote these during recovery: < 2022-08-31 21:17:01.807 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation < 2022-08-31 21:17:01.903 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation < 2022-08-31 21:17:02.029 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small Also, pg_waldump seems to fail early with -w: [pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -w -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00011201001C rmgr: Heap2 len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54 pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0 Also, the VM has crashed with OOM before, while running pg15, with no issue in recovery. I haven't been able to track down the cause. The VM is running: kernel-3.10.0-1160.66.1.el7.x86_64 pgsql is an ext4 FS (no tablespaces), which is a qemu block device exposed like: It's nowhere near full: /dev/vdc 96G 51G 46G 53% /var/lib/pgsql
Re: num_sa_scans in genericcostestimate
On Sun, Aug 21, 2022 at 2:45 PM Jeff Janes wrote: > ... > > The context for this is that I was looking at cases where btree indexes > were not using all the columns they could, but rather shoving some of the > conditions down into a Filter unnecessarily/unhelpfully. This change > doesn't fix that, but it does seem to be moving in the right direction. > Added to commitfest. > This does cause a regression test failure due to an (apparently?) > uninteresting plan change. > Looking more at the regression test plan change, it points up an interesting question which is only tangentially related to this patch. With patch applied: [local] 417536 regression=# explain analyze SELECT thousand, tenthous FROM tenk1 WHERE thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand; QUERY PLAN --- Sort (cost=4.55..4.56 rows=1 width=8) (actual time=0.100..0.101 rows=2 loops=1) Sort Key: thousand Sort Method: quicksort Memory: 25kB -> Index Only Scan using tenk1_thous_tenthous on tenk1 (cost=0.29..4.50 rows=1 width=8) (actual time=0.044..0.048 rows=2 loops=1) Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[]))) Heap Fetches: 0 Planning Time: 1.040 ms Execution Time: 0.149 ms (8 rows) [local] 417536 regression=# set enable_sort TO off ; [local] 417536 regression=# explain analyze SELECT thousand, tenthous FROM tenk1 WHERE thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand; QUERY PLAN - Index Only Scan using tenk1_thous_tenthous on tenk1 (cost=0.29..4.71 rows=1 width=8) (actual time=0.021..0.024 rows=2 loops=1) Index Cond: (thousand < 2) Filter: (tenthous = ANY ('{1001,3000}'::integer[])) Rows Removed by Filter: 18 Heap Fetches: 0 Planning Time: 0.156 ms Execution Time: 0.039 ms (7 rows) Why does having the =ANY in the "Index Cond:" rather than the "Filter:" inhibit it from understanding that the rows will still be delivered in order by "thousand"? Cheers, Jeff >
RE: Handle infinite recursion in logical replication setup
On Wed, Aug 31, 2022 1:06 AM vignesh C wrote: > > The attached v43 patch has the changes for the same. > Thanks for updating the patch. Here is a comment on the 0001 patch.

+		if (!isnewtable)
+		{
+			pfree(nspname);
+			pfree(relname);
+			continue;
+		}

In the case where it is a new table (the case where a warning is logged), should we also call pfree()? Regards, Shi yu
Re: [PATCH] Add native windows on arm64 support
Michael Paquier writes: > The input object could also be reworked so as we would not have any > ordering issues, say "predeeppost". > Changing only the path, you could use "/e/n2" instead of "node()", so > as we'd always get the most internal member. Done that way. regards, tom lane
Re: SELECT documentation
On Mon, Aug 15, 2022 at 10:53:18PM -0400, Bruce Momjian wrote: > On Sat, Aug 13, 2022 at 10:21:26PM -0400, Tom Lane wrote: > > Bruce Momjian writes: > > > Hi, I agree we should show the more modern JOIN syntax. However, this is > > > just an example, so one example should be sufficient. I went with the > > > first one in the attached patch. > > > > You should not remove the CROSS JOIN mention at l. 604, first because > > the references to it just below would become odd, and second because > > then it's not explained anywhere on the page. Perhaps you could > > put back a definition of CROSS JOIN just below the entry for NATURAL, > > but you'll still have to do something with the references at l. 614, > > 628, 632. > > Good point. I restructured the docs to move CROSS JOIN to a separate > section like NATURAL and adjusted the text, patch attached. Patch applied back to PG 11. PG 10 was different enough and old enough that I skipped it. This is a big improvement. Thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 12:53 PM Justin Pryzby wrote: > Yes, I have a copy that reproduces the issue: That's good news. So the last record touching that page was: > rmgr: Heap2 len (rec/tot): 59/59, tx: 0, lsn: > 1201/1CAF84B0, prev 1201/1CAF8478, desc: VISIBLE cutoff xid 3678741092 flags > 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel > 1663/16881/2840 blk 53 I think the expected LSN for that page is past the end of that record, so 0x1CAF84B0 + 59 = 0x1caf84eb which rounds up to 0x1CAF84F0, and indeed we see that in the restored page when recovery succeeds. Next question: why do we think the WAL finishes at 1201/1CADB730 while running that checkpoint? Looking...
Re: Add tracking of backend memory allocated to pg_stat_activity
At Wed, 31 Aug 2022 12:05:55 -0500, Justin Pryzby wrote in > On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: > > Hi Hackers, > > > > Attached is a patch to > > Add tracking of backend memory allocated to pg_stat_activity > > > + proargmodes => > > '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', > > In the past, there was concern about making pg_stat_activity wider by > adding information that's less-essential than what's been there for > years. This is only an int64, so it's not "wide", but I wonder if > there's another way to expose this information? Like adding backends to The view already looks too wide to me. I don't want numbers for metrics like this added to the view. > pg_get_backend_memory_contexts() , maybe with another view on top of > that ? +1 > +* shown allocated in pgstat_activity when the creator > destroys the > > > pg_stat > > > +* Posix creation calls dsm_impl_posix_resize implying that > > resizing > > +* occurs or may be added in the future. As implemented > > +* dsm_impl_posix_resize utilizes fallocate or truncate, > > passing the > > +* whole new size as input, growing the allocation as needed * > > (only > > +* truncate supports shrinking). We update by replacing the * > > old > > wrapping caused extraneous stars > > > +* Do not allow backend_mem_allocated to go below zero. ereport if we > > +* would have. There's no need for a lock around the read here asit's > > as it's > > > + ereport(LOG, (errmsg("decrease reduces reported backend memory > > allocated below zero; setting reported to 0"))); > > errmsg() hasn't required the outer parentheses since a couple of years > ago. +1 > > + /* > > +* Until header allocation is included in context->mem_allocated cast to > > +* slab and decrement the headerSize > > add a comma before "cast" ? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Avoid overhead open-close indexes (catalog updates)
At Wed, 31 Aug 2022 08:16:55 -0300, Ranier Vilela wrote in > Hi, > > The commit > https://github.com/postgres/postgres/commit/b17ff07aa3eb142d2cde2ea00e4a4e8f63686f96 > introduced the CopyStatistics function. > > To do the work, CopyStatistics uses a less efficient function > to update/insert tuples in the system catalogs. > > The comment at indexing.c says: > "Avoid using it for multiple tuples, since opening the indexes > * and building the index info structures is moderately expensive. > * (Use CatalogTupleInsertWithInfo in such cases.)" > > So inspired by the comment, I changed, in a few places, > CatalogInsert/CatalogUpdate to the more efficient functions > CatalogInsertWithInfo/CatalogUpdateWithInfo. > > Quick tests show a small performance gain. Considering that the whole operation usually takes far longer, I'm not sure whether that amount of performance gain is meaningful, but I like the change as a matter of tidiness and as an example for later code. > There are other places where this could be useful, > but a careful analysis is necessary. What kind of concern do you have in mind? By the way, there is another similar function, CatalogTupleMultiInsertWithInfo(), which would be more time-efficient (but not space-efficient) and is used in InsertPgAttributeTuples. I don't see a clear criterion for choosing between the two, though. I think the overhead of opening the catalog indexes is significant when no other time-consuming tasks are involved in the whole operation. In that sense, in terms of performance, storeOperators and storeProcedures (called under DefineOpClass) might benefit more from this, disregarding the rareness of the command being used. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Reducing the chunk header sizes on all memory context types
Tomas Vondra writes: > You're probably right we'll notice the clobber cases due to corruption > of the next chunk header. The annoying thing is having a corrupted > header only tells you there's a corruption somewhere, but it may be hard > to know which part of the code caused it. Same's true of a sentinel, though. > OTOH we have platforms where valgrind is either not supported or no one > runs tests with (e.g. on rpi4 it'd take insane amounts of code). According to https://valgrind.org/info/platforms.html valgrind supports a pretty respectable set of platforms. It might be too slow to be useful on ancient hardware, of course. I've had some success in identifying clobber perpetrators by putting a hardware watchpoint on the clobbered word, which IIRC does work on recent ARM hardware. It's tedious and far more manual than valgrind, but it's possible. regards, tom lane
Re: Support tls-exporter as channel binding for TLSv1.3
On Wed, Aug 31, 2022 at 04:16:29PM -0700, Jacob Champion wrote: > For protocols less than 1.3 we'll need to ensure that the extended > master secret is in use: > >This channel binding mechanism is defined only when the TLS handshake >results in unique master secrets. This is true of TLS versions prior >to 1.3 when the extended master secret extension of [RFC7627] is in >use, and it is always true for TLS 1.3 (see Appendix D of [RFC8446]). > > OpenSSL should have an API for that (SSL_get_extms_support); I don't > know when it was introduced. This is only available from 1.1.0, meaning that we'd better disable tls-exporter when building with something older than that :( With 1.0.2 already not supported by upstream, even if a bunch of vendors keep it around for compatibility, I guess that's fine as long as the default setting is tls-server-end-point. It would not be complex to switch to tls-exporter by default when using TLSv1.3 and tls-server-end-point for TLS <= v1.2, though, but that makes the code more complicated, and OpenSSL does not complain about tls-exporter when using < 1.3. If we switch the default on the fly, we could drop channel_binding_type and control which one gets used depending on ssl_max/min_server_protocol. I don't like that much, TBH, as this creates more dependencies across our internal code with the initial choice of the connection parameters, making it more complex to maintain in the long term. At least that's my impression. > If we want to cross all our T's, we should also disallow tls-exporter > if the server was unable to set SSL_OP_NO_RENEGOTIATION. Hmm. Okay. I had not considered that. But TLSv1.3 has no support for renegotiation, does it? And you mean to fail hard when using TLS <= v1.2 with tls-exporter on the backend's SSL_CTX_set_options() call? We cannot do that as the backend's SSL context is initialized before authentication, but we could re-check the state of the SSL options afterwards, during authentication, and force a failure.
> Did you have any thoughts about contributing the Python tests (or > porting them to Perl, bleh) so that we could test failure modes as > well? Unfortunately those Python tests were also OpenSSL-based, so > they're less powerful than an independent implementation... No. I have not been able to look at that with the time I had, unfortunately. -- Michael signature.asc Description: PGP signature
Re: Solaris "sed" versus pre-v13 plpython tests
I wrote: > I confirmed on the gcc farm's Solaris 11 box that the pattern > doesn't work as expected with /usr/bin/sed: > tgl@gcc-solaris11:~$ echo except foo, bar: | sed -e 's/except > \([[:alpha:]][[:alpha:].]*\), *\([[:alpha:]][[:alpha:]]*\):/except \1 as > \2:/g' > except foo, bar: > We could perhaps do this instead: > $ echo except foo, bar: | sed -e '/^ *except.*,.*: *$/s/, / as /g' > except foo as bar: > It's a little bit more brittle, but Python doesn't allow any other > commands on the same line does it? Oh ... after a bit more experimentation, there's an easier way. Apparently the real problem is that Solaris' sed doesn't handle [[:alpha:]] (I wonder if this is locale-dependent?). I get correct results after expanding it manually, eg tgl@gcc-solaris11:~$ echo except foo, bar: | sed -e 's/except \([a-z][a-z.]*\), *\([a-z][a-zA-Z]*\):/except \1 as \2:/g' except foo as bar: We aren't likely to need anything beyond a-zA-Z and maybe 0-9, so I'll go fix it that way. regards, tom lane
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 01, 2022 at 12:05:36PM +1200, Thomas Munro wrote: > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery > > immediate wait > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of > > generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > base/16881/2840_vm > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is > > not satisfied --- flushed only to 1201/1CADB730 > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > base/16881/2840_vm > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > prefetch tried to do too much. The VM runs centos7 under qemu. > > I'm making a copy of the data dir in cases it's needed. > > Hmm, a page with an LSN set 118208 bytes past the end of WAL. It's a > vm fork page (which recovery prefetch should ignore completely). Did > you happen to get a copy before the successful recovery? After the > successful recovery, what LSN does that page have, and can you find > the references to it in the WAL with eg pg_waldump -R 1663/16681/2840 > -F vm? Have you turned FPW off (perhaps this is on ZFS?)? 
Yes, I have a copy that reproduces the issue: #1 0x009a0df4 in errfinish (filename=, filename@entry=0xa15535 "xlog.c", lineno=lineno@entry=2671, funcname=funcname@entry=0xa1da27 <__func__.22763> "XLogFlush") at elog.c:588 #2 0x0055f1cf in XLogFlush (record=19795985532144) at xlog.c:2668 #3 0x00813b24 in FlushBuffer (buf=0x7fffdf1f8900, reln=) at bufmgr.c:2889 #4 0x00817a5b in SyncOneBuffer (buf_id=buf_id@entry=7796, skip_recently_used=skip_recently_used@entry=false, wb_context=wb_context@entry=0x7fffcdf0) at bufmgr.c:2576 #5 0x008181c2 in BufferSync (flags=flags@entry=358) at bufmgr.c:2164 #6 0x008182f5 in CheckPointBuffers (flags=flags@entry=358) at bufmgr.c:2743 #7 0x005587b2 in CheckPointGuts (checkPointRedo=19795985413936, flags=flags@entry=358) at xlog.c:6855 #8 0x0055feb3 in CreateCheckPoint (flags=flags@entry=358) at xlog.c:6534 #9 0x007aceaa in CheckpointerMain () at checkpointer.c:455 #10 0x007aac52 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at auxprocess.c:153 #11 0x007b0bd8 in StartChildProcess (type=) at postmaster.c:5430 #12 0x007b388f in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0xf139e0) at postmaster.c:1463 #13 0x004986a6 in main (argc=7, argv=0xf139e0) at main.c:202 It's not on zfs, and FPW have never been turned off. I should add that this instance has been pg_upgraded since v10. BTW, base/16881 is the postgres DB )which has 43GB of logfiles imported from CSV, plus 2GB of snapshots of pg_control_checkpoint, pg_settings, pg_stat_bgwriter, pg_stat_database, pg_stat_wal). 
postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'main', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+------------
 1201/1CDD1F98 |    -6200 |     1 |    44 |   424 |    8192 |     8192 |       4 | 3681043287
(1 row)

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'vm', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1201/1CAF84F0 |    -6010 |     0 |    24 |  8192 |    8192 |     8192 |       4 |         0
(1 row)

I found this in waldump (note that you had a typo - it's 16881).

[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00011201001C
rmgr: Heap2 len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
rmgr: Heap2 len (rec/tot): 59/59, tx: 0, lsn: 1201/1CAF3AF8, prev 1201/1CAF2788, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 0
rmgr: Heap2 len (rec/tot): 59/59, tx: 0, lsn: 1201/1CAF3B70, prev 1201/1CAF3B38, desc: VISIBLE cutoff xid 3671427998 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 2
rmgr: Heap2 len (rec/tot): 59/59, tx: 0, lsn: 1201/1CAF4DC8, prev 1201/1CAF3BB0, desc: VISIBLE cutoff xid 3672889900 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 4
rmgr: Heap2 len
Re: POC: Better infrastructure for automated testing of concurrency issues
Hi! On Tue, Feb 23, 2021 at 3:09 AM Peter Geoghegan wrote: > On Tue, Dec 8, 2020 at 2:42 AM Alexander Korotkov > wrote: > > Thank you for your feedback! > > It would be nice to use this patch to test things that are important > but untested inside vacuumlazy.c, such as the rare > HEAPTUPLE_DEAD/tupgone case (grep for "Ordinarily, DEAD tuples would > have been removed by..."). Same is true of the closely related > heap_prepare_freeze_tuple()/heap_tuple_needs_freeze() code. I'll continue work on this patch. The rebased patch is attached. It implements stop events as configure option (not runtime GUC option). -- Regards, Alexander Korotkov 0001-Stopevents-v3.patch Description: Binary data
Re: Reducing the chunk header sizes on all memory context types
On 9/1/22 02:23, Tom Lane wrote: > Tomas Vondra writes: >> Focusing on the aset, vast majority of allocations (60M out of 64M) is >> small enough to use power-of-2 logic, and we go from 6.3GB to 8.2GB, so >> ~30%. Not great, not terrible. > > Not sure why this escaped me before, but I remembered another argument > for not forcibly adding space for a sentinel: if you don't have room, > that means the chunk end is up against the header for the next chunk, > which means that any buffer overrun will clobber that header. So we'll > detect the problem anyway if we validate the headers to a reasonable > extent. > > The hole in this argument is that the very last chunk allocated in a > block might have no following chunk to validate. But we could probably > special-case that somehow, maybe by laying down a sentinel in the free > space, where it will get overwritten by the next chunk when that does > get allocated. > > 30% memory bloat seems like a high price to pay if it's adding negligible > detection ability, which it seems is true if this argument is valid. > Is there reason to think we can't validate headers enough to catch > clobbers? > I'm not quite convinced the 30% figure is correct - it might be if you ignore cases exceeding allocChunkLimit, but that also makes it pretty bogus (because large allocations represent ~2x as much space). You're probably right we'll notice the clobber cases due to corruption of the next chunk header. The annoying thing is having a corrupted header only tells you there's a corruption somewhere, but it may be hard to know which part of the code caused it. I was hoping the sentinel would make it easier, because we mark it as NOACCESS for valgrind. But now I see we mark the first part of a MemoryChunk too, so maybe that's enough. OTOH we have platforms where valgrind is either not supported or no one runs tests with (e.g. on rpi4 it'd take insane amounts of code). 
In that case the sentinel might be helpful, especially considering alignment on those platforms can cause funny memory issues, as evidenced by this thread. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Reducing the chunk header sizes on all memory context types
David Rowley writes: > Maybe we should just consider always making room for a sentinel for > chunks that are on dedicated blocks. At most that's an extra 8 bytes > in some allocation that's either over 1024 or 8192 (depending on > maxBlockSize). Agreed, if we're not doing that already then we should. regards, tom lane
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
On Wed, Aug 31, 2022 at 01:09:22PM -0700, Andres Freund wrote: > On 2022-08-31 16:05:03 -0400, Tom Lane wrote: >> Even if there's only a visible leak in v15/HEAD, I'm inclined >> to back-patch this all the way, because it's certainly buggy >> on its own terms. It's just the merest accident that neither >> caller is leaking other stuff into TopMemoryContext, because >> they both think they are using a short-lived context. > > +1 Ouch. Thanks for the quick fix and the backpatch. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Add native windows on arm64 support
On Wed, Aug 31, 2022 at 01:33:53PM -0400, Tom Lane wrote: > I did not check the other XMLTABLE infrastructure, so what probably > is best to do is keep the two output columns but change 'node()' > to something with a more stable result; any preferences? The input object could also be reworked so as we would not have any ordering issues, say "predeeppost". Changing only the path, you could use "/e/n2" instead of "node()", so as we'd always get the most internal member. -- Michael signature.asc Description: PGP signature
Re: Reducing the chunk header sizes on all memory context types
On Thu, 1 Sept 2022 at 12:23, Tom Lane wrote: > Is there reason to think we can't validate headers enough to catch > clobbers? For non-sentinel chunks, the next byte after the end of the chunk will be storing the block offset for the following chunk. I think:

if (block != MemoryChunkGetBlock(chunk))
	elog(WARNING, "problem in alloc set %s: bad block offset for chunk %p in block %p",
		 name, chunk, block);

should catch those. Maybe we should just consider always making room for a sentinel for chunks that are on dedicated blocks. At most that's an extra 8 bytes in some allocation that's either over 1024 or 8192 (depending on maxBlockSize). David
Re: Reducing the chunk header sizes on all memory context types
Tomas Vondra writes: > Focusing on the aset, vast majority of allocations (60M out of 64M) is > small enough to use power-of-2 logic, and we go from 6.3GB to 8.2GB, so > ~30%. Not great, not terrible. Not sure why this escaped me before, but I remembered another argument for not forcibly adding space for a sentinel: if you don't have room, that means the chunk end is up against the header for the next chunk, which means that any buffer overrun will clobber that header. So we'll detect the problem anyway if we validate the headers to a reasonable extent. The hole in this argument is that the very last chunk allocated in a block might have no following chunk to validate. But we could probably special-case that somehow, maybe by laying down a sentinel in the free space, where it will get overwritten by the next chunk when that does get allocated. 30% memory bloat seems like a high price to pay if it's adding negligible detection ability, which it seems is true if this argument is valid. Is there reason to think we can't validate headers enough to catch clobbers? regards, tom lane
Re: Handle infinite recursion in logical replication setup
On Wed, Aug 31, 2022 at 8:36 PM Amit Kapila wrote: > > On Wed, Aug 31, 2022 at 11:35 AM Peter Smith wrote: > > > > Here are my review comments for v43-0001. > > ... > > == > > > > 2. doc/src/sgml/ref/create_subscription.sgml > > > > + > > + If the subscription is created with origin = none and > > + copy_data = true, it will check if the publisher has > > + subscribed to the same table from other publishers and, if so, log a > > + warning to notify the user to check the publisher tables. Before > > continuing > > + with other operations the user should check that publisher tables did > > not > > + have data with different origins to prevent data inconsistency issues > > on the > > + subscriber. > > + > > > > I am not sure about that wording saying "to prevent data inconsistency > > issues" because I thought maybe it is already too late because any > > unwanted origin data is already copied during the initial sync. I > > think all the user can do is try to clean up any problems caused before > > things become even worse (??) > > > > Oh no, it is not too late in all cases. The problem can only occur if > the user will try to subscribe from all the different publications > with copy_data = true. We are anyway trying to clarify in the second > patch the right way to accomplish this. So, I am not sure what better > we can do here. The only bulletproof way is to provide some APIs where > users don't need to bother about all these cases, basically something > similar to what Kuroda-San is working on in the thread [1]. > The point of my review comment was only about the wording of the note - specifically, you cannot "prevent" something (e.g., data inconsistency) if it has already happened. Maybe modify the wording like below? BEFORE Before continuing with other operations the user should check that publisher tables did not have data with different origins to prevent data inconsistency issues on the subscriber.
AFTER If a publisher table with data from different origins was found to have been copied to the subscriber, then some corrective action may be necessary before continuing with other operations. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: PostgreSQL 15 release announcement draft
On Tue, Aug 30, 2022 at 03:58:48PM -0400, Jonathan S. Katz wrote: > In this latest release, PostgreSQL improves on its in-memory and on-disk > sorting > algorithms, with benchmarks showing speedups of 25% - 400% based on sort > types. rather than "based on": "depending on the data types being sorted" > Building on work from the previous PostgreSQL release for allowing async > remote > queries, the PostgreSQL foreign data wrapper, `postgres_fdw`, can now commit > transactions in parallel. asynchronous > benefits for certain workloads. On certain operating systems, PostgreSQL 15 s/certain/some ? > supports the ability to prefetch WAL file contents and speed up recovery > times. > PostgreSQL's built-in backup command, `pg_basebackup`, now supports > server-side > compression of backup files with a choice of gzip, LZ4, and zstd. remove "server-side", since they're also supported on the client-side. > PostgreSQL 15 lets user create views that query data using the permissions of users > the caller, not the view creator. This option, called `security_invoker`, adds > an additional layer of protection to ensure view callers have the correct > permissions for working with the underlying data. ensure *that ? > alter server-level configuration parameters. Additionally, users can now > search > for information about configuration using the `\dconfig` command from the > `psql` > command-line tool. rather than "search for information about configuration", say "list configuration information" ? > PostgreSQL server-level statistics are now collected in shared memory, > eliminating the statistics collector process and writing these stats to disk. and *the need to periodically* write these stats to disk -- Justin
Re: Reducing the chunk header sizes on all memory context types
On 8/31/22 23:46, David Rowley wrote: > On Thu, 1 Sept 2022 at 08:53, Tomas Vondra > wrote: >> So the raw size (what we asked for) is ~23.5GB, but in practice we >> allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra >> 1B we end up allocating 31.5GB. That doesn't seem like a huge increase, >> and it's far from the +60% you got. >> >> I wonder where does the difference come - I did make installcheck too, >> so how come you get 10/16GB, and I get 28/31GB? My patch is attached, >> maybe I did something silly. > > The reason my reported results were lower is because I ignored size > > allocChunkLimit allocations. These are not raised to the next power of > 2, so I didn't think they should be included. If I differentiate the large chunks allocated separately (v2 patch attached), I get this:

        f        |    t     |  count   |  s1   |  s2   |  s3
-----------------+----------+----------+-------+-------+-------
 AllocSetAlloc   | normal   | 60714914 |  4982 |  6288 |  8185
 AllocSetAlloc   | separate |   824390 | 18245 | 18245 | 18251
 AllocSetRelloc  | normal   |   182070 |   763 |   826 |  1423
 GenerationAlloc | normal   |  2118115 |    68 |    90 |   102
 GenerationAlloc | separate |       28 |     0 |     0 |     0
(5 rows)

Where s1 is the sum of requested sizes, s2 is the sum of allocated chunks, and s3 is chunks allocated with a 1B sentinel. Focusing on the aset, the vast majority of allocations (60M out of 64M) are small enough to use the power-of-2 logic, and we go from 6.3GB to 8.2GB, so ~30%. Not great, not terrible. For the large allocations, there's almost no increase - in the last query I used the power-of-2 logic in the calculations, but that was incorrect, of course. > > I'm not sure why you're seeing only a 3GB additional overhead. I > noticed a logic error in my query where I was checking > maxaligned_size=pow2_size and doubling that to give sentinel space.
> That really should have been "case size=pow2_size then pow2_size * 2 > else pow2_size end", However, after adjusting the query, it does not > seem to change the results much: > > postgres=# select > postgres-# round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size, > postgres-# round(sum(case when size=pow2_size then pow2_size*2 else > pow2_size end)::numeric/1024/1024/1024,3) as method1, > postgres-# round(sum(case when size=pow2_size then pow2_size+8 else > pow2_size end)::numeric/1024/1024/1024,3) as method2 > postgres-# from memstats > postgres-# where pow2_size > 0; > pow2_size | method1 | method2 > ---+-+- > 10.269 | 16.322 | 10.476 > > I've attached the crude patch I came up with for this. For some > reason it was crashing on Linux, but it ran ok on Windows, so I used > the results from that instead. Maybe that accounts for some > differences as e.g sizeof(long) == 4 on 64-bit windows. I'd be > surprised if that accounted for so many GBs though. > I tried to use that patch, but "make installcheck" never completes for me, for some reason. It seems to get stuck in infinite_recurse.sql, but I haven't looked into the details. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Companydiff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index b6eeb8abab..b215940062 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -674,6 +674,15 @@ AllocSetDelete(MemoryContext context) free(set); } +static Size pow2_size(Size size) +{ + Size s = 1; + while (s < size) + s *= 2; + + return s; +} + /* * AllocSetAlloc * Returns pointer to allocated memory of given size or NULL if @@ -703,9 +712,12 @@ AllocSetAlloc(MemoryContext context, Size size) * If requested size exceeds maximum for chunks, allocate an entire block * for this request. 
*/ - if (size > set->allocChunkLimit) + if (size + 1 > set->allocChunkLimit) { - chunk_size = MAXALIGN(size); + fprintf(stderr, "AllocSetAlloc separate %ld %ld %ld\n", size, MAXALIGN(size), MAXALIGN(size+1)); + fflush(stderr); + + chunk_size = MAXALIGN(size+1); blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; block = (AllocBlock) malloc(blksize); if (block == NULL) @@ -726,6 +738,11 @@ AllocSetAlloc(MemoryContext context, Size size) /* set mark to catch clobber of "unused" space */ if (size < chunk_size) set_sentinel(MemoryChunkGetPointer(chunk), size); + else + { + fprintf(stderr, "sentinel not added\n"); + fflush(stderr); + } #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -761,13 +778,16 @@ AllocSetAlloc(MemoryContext context, Size size) return MemoryChunkGetPointer(chunk); } + fprintf(stderr, "AllocSetAlloc normal %ld %ld %ld\n", size, pow2_size(size), pow2_size(size+1)); + fflush(stderr); + /* * Request is small enough to be treated as a chunk. Look in the * corresponding
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote:
> < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
> < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed
>
> I was able to start it with -c recovery_prefetch=no, so it seems like
> prefetch tried to do too much. The VM runs centos7 under qemu.
> I'm making a copy of the data dir in case it's needed.

Hmm, a page with an LSN set 118208 bytes past the end of WAL. It's a vm
fork page (which recovery prefetch should ignore completely). Did you
happen to get a copy before the successful recovery? After the
successful recovery, what LSN does that page have, and can you find the
references to it in the WAL with eg pg_waldump -R 1663/16881/2840 -F vm?
Have you turned FPW off (perhaps this is on ZFS?)?
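The 118208-byte figure can be checked directly from the two LSNs in the log. A quick sketch of the arithmetic (an LSN is a 64-bit byte position printed as two slash-separated 32-bit hex halves):

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL 'X/X' LSN string into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Failed flush request vs. the actual end of generated WAL:
gap = parse_lsn("1201/1CAF84F0") - parse_lsn("1201/1CADB730")
assert gap == 118208  # matches the figure quoted in the reply
```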
Re: Inconsistent error message for varchar(n)
On Tue, Aug 16, 2022 at 09:56:17PM -0400, Bruce Momjian wrote:
> On Sun, Nov 14, 2021 at 10:33:19AM +0800, Japin Li wrote:
> > Oh! I didn't consider this situation. Since the max size of varchar cannot
> > exceed 10485760, however, I cannot find this in documentation [1]. Is there
> > something I missed? Should we mention this in the documentation?
> >
> > [1] https://www.postgresql.org/docs/devel/datatype-character.html
>
> Sorry for my long delay in reviewing this issue. You are correct this
> should be documented --- patch attached.

Patch applied back to PG 10. Thanks for the report.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Transparent column encryption
On Tue, Aug 30, 2022 at 4:53 AM Peter Eisentraut wrote: > I would be interested in learning more about such padding systems. I > have done a lot of reading for this development project, and I have > never come across a cryptographic approach to hide length differences by > padding. Of course, padding to the block cipher's block size is already > part of the process, but that is done out of necessity, not because you > want to disguise the length. Are there any other methods? I'm > interested to learn more. TLS 1.3 has one example. Here is a description from GnuTLS: https://gnutls.org/manual/html_node/On-Record-Padding.html (Note the option to turn on constant-time padding; that may not be a good tradeoff for us if we're focusing on offline attacks.) Here's a recent paper that claims to formally characterize length hiding, but it's behind a wall and I haven't read it: https://dl.acm.org/doi/abs/10.1145/3460120.3484590 I'll try to find more when I get the chance. --Jacob
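For illustration of the length-hiding idea raised above, here is a toy bucket-padding scheme. It is not taken from TLS 1.3 or any cited standard; the bucket size and the 0x80 terminator byte are arbitrary choices for the sketch. Padding every plaintext up to a fixed bucket means ciphertext length reveals only the bucket, not the exact length:

```python
def pad_to_bucket(plaintext: bytes, bucket: int = 256) -> bytes:
    """Append 0x80 then zeros up to the next multiple of `bucket` bytes."""
    padded_len = ((len(plaintext) + 1 + bucket - 1) // bucket) * bucket
    return plaintext + b"\x80" + b"\x00" * (padded_len - len(plaintext) - 1)

def unpad(padded: bytes) -> bytes:
    """Strip the padding by locating the last 0x80 marker."""
    return padded[: padded.rindex(b"\x80")]

# Two plaintexts of very different lengths produce the same padded length:
assert len(pad_to_bucket(b"short")) == len(pad_to_bucket(b"x" * 200)) == 256
assert unpad(pad_to_bucket(b"hello")) == b"hello"
```

The obvious tradeoff, as with the GnuTLS constant-padding option mentioned above, is wasted space versus how much length information leaks.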
Re: Doc patch
On Fri, Aug 19, 2022 at 12:04:54PM -0400, Bruce Momjian wrote:
> You are right about the above two changes. The existing syntax shows
> FROM/IN is only possible if a direction is specified, while
> src/parser/gram.y says that FROM/IN with no direction is supported.
>
> I plan to apply this second part of the patch soon.

Patch applied back to PG 10. Thanks for the research on this.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Small cleanups to tuplesort.c and a bonus small performance improvement
On Wed, 31 Aug 2022 at 22:39, David Rowley wrote: > My current thoughts are that this is a very trivial patch and unless > there's any objections I plan to push it soon. Pushed. David
Re: Support tls-exporter as channel binding for TLSv1.3
On Sun, Aug 28, 2022 at 11:02 PM Michael Paquier wrote: > RFC9266, that has been released not so long ago, has added > tls-exporter as a new channel binding type: > https://www.rfc-editor.org/rfc/rfc5929.html Hi Michael, thank you for sending this! > Note also that tls-exporter is aimed for > TLSv1.3 and newer protocols, but OpenSSL allows the thing to work with > older protocols (testable with ssl_max_protocol_version, for example), > and I don't see a need to prevent this scenario. For protocols less than 1.3 we'll need to ensure that the extended master secret is in use: This channel binding mechanism is defined only when the TLS handshake results in unique master secrets. This is true of TLS versions prior to 1.3 when the extended master secret extension of [RFC7627] is in use, and it is always true for TLS 1.3 (see Appendix D of [RFC8446]). OpenSSL should have an API for that (SSL_get_extms_support); I don't know when it was introduced. If we want to cross all our T's, we should also disallow tls-exporter if the server was unable to set SSL_OP_NO_RENEGOTIATION. > An extra thing is > that attempting to use tls-exporter with a backend <= 15 and a client > >= 16 causes a failure during the SASL exchange, where the backend > complains about tls-exporter being unsupported. Yep. -- Did you have any thoughts about contributing the Python tests (or porting them to Perl, bleh) so that we could test failure modes as well? Unfortunately those Python tests were also OpenSSL-based, so they're less powerful than an independent implementation... Thanks, --Jacob
Re: cataloguing NOT NULL constraints
On Wed, Aug 31, 2022 at 4:08 PM Zhihong Yu wrote: > > > On Wed, Aug 31, 2022 at 3:19 PM Alvaro Herrera > wrote: > >> So I was wrong in thinking that "this case was simple to implement" as I >> replied upthread. Doing that actually required me to rewrite large >> parts of the patch. I think it ended up being a good thing, because in >> hindsight the approach I was using was somewhat bogus anyway, and the >> current one should be better. Please find it attached. >> >> There are still a few problems, sadly. Most notably, I ran out of time >> trying to fix a pg_upgrade issue with pg_dump in binary-upgrade mode. >> I have to review that again, but I think it'll need a deeper rethink of >> how we pg_upgrade inherited constraints. So the pg_upgrade tests are >> known to fail. I'm not aware of any other tests failing, but I'm sure >> the cfbot will prove me wrong. >> >> I reluctantly added a new ALTER TABLE subcommand type, AT_SetAttNotNull, >> to allow setting pg_attribute.attnotnull without adding a CHECK >> constraint (only used internally). I would like to find a better way to >> go about this, so I may remove it again, therefore it's not fully >> implemented. >> >> There are *many* changed regress expect files and I didn't carefully vet >> all of them. Mostly it's the addition of CHECK constraints in the >> footers of many \d listings and stuff like that. At a quick glance they >> appear valid, but I need to review them more carefully still. >> >> We've had pg_constraint.conparentid for a while now, but for some >> constraints we continue to use conislocal/coninhcount. I think we >> should get rid of that and rely on conparentid completely. >> >> An easily fixed issue is that of constraint naming. >> ChooseConstraintName has an argument for passing known constraint names, >> but this patch doesn't use it and it must. 
>> >> One issue that I don't currently know how to fix, is the fact that we >> need to know whether a column is a row type or not (because they need a >> different null test). At table creation time that's easy to know, >> because we have the descriptor already built by the time we add the >> constraints; but if you do ALTER TABLE .. ADD COLUMN .., ADD CONSTRAINT >> then we don't. >> >> Some ancient code comments suggest that allowing a child table's NOT >> NULL constraint acquired from parent shouldn't be independently >> droppable. This patch doesn't change that, but it's easy to do if we >> decide to. However, that'd be a compatibility break, so I'd rather not >> do it in the same patch that introduces the feature. >> >> Overall, there's a lot more work required to get this to a good shape. >> That said, I think it's the right direction. >> >> -- >> Álvaro Herrera 48°01'N 7°57'E — >> https://www.EnterpriseDB.com/ >> "La primera ley de las demostraciones en vivo es: no trate de usar el >> sistema. >> Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) >> > > Hi, > For findNotNullConstraintAttnum(): > > + if (multiple == NULL) > + break; > > Shouldn't `pfree(arr)` be called before breaking ? > > +static Constraint *makeNNCheckConstraint(Oid nspid, char *constraint_name, > > You used `NN` because there is method makeCheckNotNullConstraint, right ? > I think it would be better to expand `NN` so that its meaning is easy to > understand. > > Cheers > Hi, For tryExtractNotNullFromNode , in the block for `if (rel == NULL)`: + return false; I think you meant returning NULL since false is for boolean. Cheers
Re: cataloguing NOT NULL constraints
On Wed, Aug 31, 2022 at 3:19 PM Alvaro Herrera wrote: > So I was wrong in thinking that "this case was simple to implement" as I > replied upthread. Doing that actually required me to rewrite large > parts of the patch. I think it ended up being a good thing, because in > hindsight the approach I was using was somewhat bogus anyway, and the > current one should be better. Please find it attached. > > There are still a few problems, sadly. Most notably, I ran out of time > trying to fix a pg_upgrade issue with pg_dump in binary-upgrade mode. > I have to review that again, but I think it'll need a deeper rethink of > how we pg_upgrade inherited constraints. So the pg_upgrade tests are > known to fail. I'm not aware of any other tests failing, but I'm sure > the cfbot will prove me wrong. > > I reluctantly added a new ALTER TABLE subcommand type, AT_SetAttNotNull, > to allow setting pg_attribute.attnotnull without adding a CHECK > constraint (only used internally). I would like to find a better way to > go about this, so I may remove it again, therefore it's not fully > implemented. > > There are *many* changed regress expect files and I didn't carefully vet > all of them. Mostly it's the addition of CHECK constraints in the > footers of many \d listings and stuff like that. At a quick glance they > appear valid, but I need to review them more carefully still. > > We've had pg_constraint.conparentid for a while now, but for some > constraints we continue to use conislocal/coninhcount. I think we > should get rid of that and rely on conparentid completely. > > An easily fixed issue is that of constraint naming. > ChooseConstraintName has an argument for passing known constraint names, > but this patch doesn't use it and it must. > > One issue that I don't currently know how to fix, is the fact that we > need to know whether a column is a row type or not (because they need a > different null test). 
At table creation time that's easy to know, > because we have the descriptor already built by the time we add the > constraints; but if you do ALTER TABLE .. ADD COLUMN .., ADD CONSTRAINT > then we don't. > > Some ancient code comments suggest that allowing a child table's NOT > NULL constraint acquired from parent shouldn't be independently > droppable. This patch doesn't change that, but it's easy to do if we > decide to. However, that'd be a compatibility break, so I'd rather not > do it in the same patch that introduces the feature. > > Overall, there's a lot more work required to get this to a good shape. > That said, I think it's the right direction. > > -- > Álvaro Herrera 48°01'N 7°57'E — > https://www.EnterpriseDB.com/ > "La primera ley de las demostraciones en vivo es: no trate de usar el > sistema. > Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) > Hi, For findNotNullConstraintAttnum(): + if (multiple == NULL) + break; Shouldn't `pfree(arr)` be called before breaking ? +static Constraint *makeNNCheckConstraint(Oid nspid, char *constraint_name, You used `NN` because there is method makeCheckNotNullConstraint, right ? I think it would be better to expand `NN` so that its meaning is easy to understand. Cheers
Solaris "sed" versus pre-v13 plpython tests
I noticed that buildfarm member margay, which just recently started running tests on REL_12_STABLE, is failing the plpython tests in that branch [1], though it's happy in v13 and later. The failures appear due to syntax errors in Python "except" statements, and it's visible in some of the tests that we are failing to convert ancient Python "except" syntax to what Python 3 wants: diff -U3 /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/expected/python3/plpython_types_3.out /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/results/python3/plpython_types.out --- /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/expected/python3/plpython_types_3.out 2022-08-31 22:29:51.070597370 +0200 +++ /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/results/python3/plpython_types.out 2022-08-31 22:29:53.053544840 +0200 @@ -400,15 +400,16 @@ import marshal try: return marshal.loads(x) -except ValueError as e: +except ValueError, e: return 'FAILED: ' + str(e) $$ LANGUAGE plpython3u; +ERROR: could not compile PL/Python function "test_type_unmarshal" +DETAIL: SyntaxError: invalid syntax (, line 6) So it would seem that regress-python3-mangle.mk's sed command to perform this transformation isn't working on margay's sed. We've had to hack regress-python3-mangle.mk for Solaris "sed" before (cf commit c3556f6fa). This seems to be another instance of that same crummy implementation of '*' patterns. I suppose Noah missed this angle at the time because the problematic pattern had already been removed in v13 and up (45223fd9c). But it's still there in v12. I am not completely sure why buildfarm member wrasse isn't failing similarly, but a likely theory is that Noah has got some other sed in his search path there. 
I confirmed on the gcc farm's Solaris 11 box that the pattern doesn't work as expected with /usr/bin/sed: tgl@gcc-solaris11:~$ echo except foo, bar: | sed -e 's/except \([[:alpha:]][[:alpha:].]*\), *\([[:alpha:]][[:alpha:]]*\):/except \1 as \2:/g' except foo, bar: We could perhaps do this instead: $ echo except foo, bar: | sed -e '/^ *except.*,.*: *$/s/, / as /g' except foo as bar: It's a little bit more brittle, but Python doesn't allow any other commands on the same line does it? Another idea is to try to find some other sed to use. I see that configure already does that on most platforms, because ax_pthread.m4 has "AC_REQUIRE([AC_PROG_SED])". But if there's still someone out there using --disable-thread-safety, they might think this is an annoying movement of the build-requirements goalposts. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay=2022-08-31%2020%3A00%3A05
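The proposed single-line rule can be emulated to sanity-check its behavior. A rough Python model of the substitution (it shares the brittleness noted above; e.g. an `except (A, B), e:` line would have both commas rewritten, exactly as the sed /g flag would):

```python
import re

def mangle(line: str) -> str:
    """Mimic the sed rule: /^ *except.*,.*: *$/ s/, / as /g"""
    if re.match(r"^ *except.*,.*: *$", line):
        return line.replace(", ", " as ")
    return line

assert mangle("except ValueError, e:") == "except ValueError as e:"
assert mangle("except foo, bar:") == "except foo as bar:"
# Lines that are not bare 'except' statements are left alone:
assert mangle("d = {1: 'a', 2: 'b'}") == "d = {1: 'a', 2: 'b'}"
```

Since Python does not allow other statements on the same line as the `except ...:` clause header in ordinary code, anchoring on the whole line is reasonably safe for the regression-test sources in question.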
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan wrote: > Attached patch series is a completely overhauled version of earlier > work on freezing. Related work from the Postgres 15 cycle became > commits 0b018fab, f3c15cbe, and 44fa8488. Attached is v2. This is just to keep CFTester happy, since v1 now has conflicts when applied against HEAD. There are no notable changes in this v2 compared to v1. -- Peter Geoghegan v2-0001-Add-page-level-freezing-to-VACUUM.patch Description: Binary data v2-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch Description: Binary data v2-0003-Add-eager-freezing-strategy-to-VACUUM.patch Description: Binary data v2-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch Description: Binary data
Re: Reducing the chunk header sizes on all memory context types
On Thu, 1 Sept 2022 at 08:53, Tomas Vondra wrote:
> So the raw size (what we asked for) is ~23.5GB, but in practice we
> allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra
> 1B we end up allocating 31.5GB. That doesn't seem like a huge increase,
> and it's far from the +60% you got.
>
> I wonder where does the difference come - I did make installcheck too,
> so how come you get 10/16GB, and I get 28/31GB? My patch is attached,
> maybe I did something silly.

The reason my reported results were lower is because I ignored size >
allocChunkLimit allocations. These are not raised to the next power of
2, so I didn't think they should be included.

I'm not sure why you're seeing only a 3GB additional overhead. I
noticed a logic error in my query where I was checking
maxaligned_size=pow2_size and doubling that to give sentinel space.
That really should have been "case size=pow2_size then pow2_size * 2
else pow2_size end". However, after adjusting the query, it does not
seem to change the results much:

postgres=# select
postgres-# round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size,
postgres-# round(sum(case when size=pow2_size then pow2_size*2 else
pow2_size end)::numeric/1024/1024/1024,3) as method1,
postgres-# round(sum(case when size=pow2_size then pow2_size+8 else
pow2_size end)::numeric/1024/1024/1024,3) as method2
postgres-# from memstats
postgres-# where pow2_size > 0;
 pow2_size | method1 | method2
-----------+---------+---------
    10.269 |  16.322 |  10.476

I've attached the crude patch I came up with for this. For some
reason it was crashing on Linux, but it ran ok on Windows, so I used
the results from that instead. Maybe that accounts for some
differences as e.g sizeof(long) == 4 on 64-bit windows. I'd be
surprised if that accounted for so many GBs though.
I also forgot to add code to GenerationRealloc and AllocSetRealloc David diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index b6eeb8abab..f4977f9bcc 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -46,6 +46,7 @@ #include "postgres.h" +#include "miscadmin.h" #include "port/pg_bitutils.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -696,9 +697,22 @@ AllocSetAlloc(MemoryContext context, Size size) int fidx; Sizechunk_size; Sizeblksize; + static int rlevel = 1; AssertArg(AllocSetIsValid(set)); + if (rlevel == 1 && GetProcessingMode() == NormalProcessing) + { + rlevel++; + elog(LOG, "AllocSetAlloc,%s,%zu,%zu,%zu,%zu", +context->name, +size, +MAXALIGN(size), +size > set->allocChunkLimit ? 0 : GetChunkSizeFromFreeListIdx(AllocSetFreeIndex(size)), +set->allocChunkLimit); + rlevel--; + } + /* * If requested size exceeds maximum for chunks, allocate an entire block * for this request. diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c index b39894ec94..9c5ff3c095 100644 --- a/src/backend/utils/mmgr/generation.c +++ b/src/backend/utils/mmgr/generation.c @@ -36,6 +36,7 @@ #include "postgres.h" #include "lib/ilist.h" +#include "miscadmin.h" #include "port/pg_bitutils.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -344,6 +345,18 @@ GenerationAlloc(MemoryContext context, Size size) MemoryChunk *chunk; Sizechunk_size = MAXALIGN(size); Sizerequired_size = chunk_size + Generation_CHUNKHDRSZ; + static int rlevel = 1; + + if (rlevel == 1 && GetProcessingMode() == NormalProcessing) + { + rlevel++; + elog(LOG, "GenerationAlloc,%s,%zu,%zu,0,%zu", +context->name, +size, +MAXALIGN(size), +set->allocChunkLimit); + rlevel--; + } /* is it an over-sized chunk? if yes, allocate special block */ if (chunk_size > set->allocChunkLimit)
Re: Trivial doc patch
On Fri, Aug 19, 2022 at 10:42:45AM -0400, Bruce Momjian wrote: > > Given that 'optional, optional' has no independent meaning from > > 'optional'; it requires one to scan the entire set looking for > > the non-optional embedded in the option. So no gain. > > I originally had the same reaction Michael Paquier did, that having one > big optional block is nice, but seeing that 'CREATE DATABASE name WITH' > actually works, I can see the point in having our syntax be accurate, > and removing the outer optional brackets now does seem like an > improvement to me. Backpatched to PG 10. Thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Reducing the chunk header sizes on all memory context types
On 8/31/22 00:40, David Rowley wrote:
> On Wed, 31 Aug 2022 at 02:17, Tom Lane wrote:
>>
>> I wrote:
>>> So maybe we should revisit the question. It'd be worth collecting some
>>> stats about how much extra space would be needed if we force there
>>> to be room for a sentinel.
>>
>> Actually, after ingesting more caffeine, the problem with this for aset.c
>> is that the only way to add space for a sentinel that didn't fit already
>> is to double the space allocation. That's a little daunting, especially
>> remembering how many places deliberately allocate power-of-2-sized
>> arrays.
>
> I decided to try and quantify that by logging the size, MAXALIGN(size)
> and the power of 2 size during AllocSetAlloc and GenerationAlloc. I
> made the pow2_size 0 in GenerationAlloc and in AllocSetAlloc when
> size > allocChunkLimit.
>
> After running make installcheck, grabbing the records out the log and
> loading them into Postgres, I see that if we did double the pow2_size
> when there's no space for the sentinel byte then we'd go from
> allocating a total of 10.2GB all the way to 16.4GB (!) of
> non-dedicated block aset.c allocations.
>
> select
> round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size,
> round(sum(case when maxalign_size=pow2_size then pow2_size*2 else
> pow2_size end)::numeric/1024/1024/1024,3) as method1,
> round(sum(case when maxalign_size=pow2_size then pow2_size+8 else
> pow2_size end)::numeric/1024/1024/1024,3) as method2
> from memstats
> where pow2_size > 0;
>
>  pow2_size | method1 | method2
> -----------+---------+---------
>     10.194 |  16.382 |  10.463
>
> if we did just add on an extra 8 bytes (or MAXALIGN(size+1) at
> least), then that would take the size up to 10.5GB.

I've been experimenting with this a bit too, and my results are similar,
but not exactly the same.
I've logged all Alloc/Realloc calls for the two memory contexts, and
when I aggregated the results I get this:

        f        | size  | pow2(size) | pow2(size+1)
-----------------+-------+------------+--------------
 AllocSetAlloc   | 23528 |      28778 |        31504
 AllocSetRelloc  |   761 |        824 |         1421
 GenerationAlloc |    68 |         90 |          102

So the raw size (what we asked for) is ~23.5GB, but in practice we
allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra
1B we end up allocating 31.5GB. That doesn't seem like a huge increase,
and it's far from the +60% you got.

I wonder where does the difference come - I did make installcheck too,
so how come you get 10/16GB, and I get 28/31GB? My patch is attached,
maybe I did something silly.

I also did a quick hack to see if always having the sentinel detects
any pre-existing issues, but that didn't happen. I guess valgrind would
find those, but not sure?

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index b6eeb8abab..e6d86a54d7 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -674,6 +674,15 @@ AllocSetDelete(MemoryContext context)
 	free(set);
 }
 
+static Size pow2_size(Size size)
+{
+	Size s = 1;
+	while (s < size)
+		s *= 2;
+
+	return s;
+}
+
 /*
  * AllocSetAlloc
  *		Returns pointer to allocated memory of given size or NULL if
@@ -699,13 +708,16 @@ AllocSetAlloc(MemoryContext context, Size size)
 
 	AssertArg(AllocSetIsValid(set));
 
+	fprintf(stderr, "AllocSetAlloc %ld %ld %ld\n", size, pow2_size(size), pow2_size(size+1));
+	fflush(stderr);
+
 	/*
 	 * If requested size exceeds maximum for chunks, allocate an entire block
 	 * for this request.
*/ - if (size > set->allocChunkLimit) + if (size + 1 > set->allocChunkLimit) { - chunk_size = MAXALIGN(size); + chunk_size = MAXALIGN(size + 1); blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; block = (AllocBlock) malloc(blksize); if (block == NULL) @@ -725,7 +737,16 @@ AllocSetAlloc(MemoryContext context, Size size) chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ if (size < chunk_size) + { set_sentinel(MemoryChunkGetPointer(chunk), size); + fprintf(stderr, "sentinel added\n"); + fflush(stderr); + } + else + { + fprintf(stderr, "A sentinel not added\n"); + fflush(stderr); + } #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -767,7 +788,7 @@ AllocSetAlloc(MemoryContext context, Size size) * If one is found, remove it from the free list, make it again a member * of the alloc set and return its data address. */ - fidx = AllocSetFreeIndex(size); + fidx = AllocSetFreeIndex(size+1); chunk = set->freelist[fidx]; if (chunk != NULL) { @@ -784,7 +805,16 @@ AllocSetAlloc(MemoryContext context, Size size) chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ if (size < GetChunkSizeFromFreeListIdx(fidx)) + {
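The `AllocSetFreeIndex(size+1)` change in the patch above can be modeled with a simplified freelist calculation. The real code uses bit tricks from pg_bitutils.h; this linear-scan version is only illustrative:

```python
ALLOC_MINBITS = 3  # smallest aset chunk is 8 bytes

def freelist_index(size: int) -> int:
    """Freelist bucket whose chunk size is the next power of two >= size
    (a simplified model of AllocSetFreeIndex)."""
    idx = 0
    while (1 << (idx + ALLOC_MINBITS)) < size:
        idx += 1
    return idx

def chunk_size(idx: int) -> int:
    return 1 << (idx + ALLOC_MINBITS)

# Asking the freelist for size+1, as the patch does, guarantees at least
# one spare byte for the sentinel even when size is itself a power of two:
assert chunk_size(freelist_index(64)) == 64       # no room for a sentinel
assert chunk_size(freelist_index(64 + 1)) == 128  # sentinel always fits
```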
Re: SQL/JSON features for v15
On 31.08.2022 20:14, Tom Lane wrote: Robert Haas writes: On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: The currently proposed patchset hacks up a relatively small number of core datatypes to be able to do that. But it's just a hack and there's no prospect of extension types being able to join in the fun. I think where we need to start, for v16, is making an API design that will let any datatype have this functionality. (I don't say that we'd convert every datatype to do so right away; in the long run we should, but I'm content to start with just the same core types touched here.) Beside the JSON stuff, there is another even more pressing application for such behavior, namely the often-requested COPY functionality to be able to shunt bad data off somewhere without losing the entire transfer. In the COPY case I think we'd want to be able to capture the error message that would have been issued, which means the current patches are not at all appropriate as a basis for that API design: they're just returning a bool without any details. I would be in favor of making more of an effort than just a few token data types. The initial patch could just touch a few, but once the infrastructure is in place we should really make a sweep through the tree and tidy up. Sure, but my point is that we can do that in a time-extended fashion rather than having a flag day where everything must be updated. The initial patch just needs to update a few types as proof of concept. And here is a quick POC patch with an example for COPY and float4: =# CREATE TABLE test (i int, f float4); CREATE TABLE =# COPY test (f) FROM stdin WITH (null_on_error (f)); 1 err 2 \. COPY 3 =# SELECT f FROM test; f --- 1 2 (3 rows) =# COPY test (i) FROM stdin WITH (null_on_error (i)); ERROR: input function for datatype "integer" does not support error handling PG_RETURN_ERROR() is a reincarnation of ereport_safe() macro for returning ErrorData, which was present in older versions (~v18) of SQL/JSON patches. 
Later it was replaced with `bool *have_error` and less magical `if (have_error) ... else ereport(...)`. Obviously, this needs a separate thread. -- Nikita Glukhov Postgres Professional:http://www.postgrespro.com The Russian Postgres Company From 537e65e1ebcccdf4a760d3aa743c88611e7c Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Wed, 31 Aug 2022 22:24:33 +0300 Subject: [PATCH 1/5] Introduce PG_RETURN_ERROR and FuncCallError node --- src/include/nodes/execnodes.h | 7 +++ src/include/utils/elog.h | 25 + 2 files changed, 32 insertions(+) diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 01b1727fc09..b401d354656 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -2729,4 +2729,11 @@ typedef struct LimitState TupleTableSlot *last_slot; /* slot for evaluation of ties */ } LimitState; +typedef struct FuncCallError +{ + NodeTag type; + ErrorData *error; /* place where function should put + * error data instead of throwing it */ +} FuncCallError; + #endif /* EXECNODES_H */ diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 56398176901..5fd3deed61f 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -159,6 +159,31 @@ #define ereport(elevel, ...) \ ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__) +/* + * PG_RETURN_ERROR() -- special macro for copying error info into the + * FuncCallError node, if it was as FunctionCallInfo.context, + * instead of throwing it with ordinary ereport(). + * + * This is intended for handling of errors of categories like + * ERRCODE_DATA_EXCEPTION without PG_TRY/PG_CATCH and substransactions, + * but not for errors like ERRCODE_OUT_OF_MEMORY. + */ +#define PG_RETURN_ERROR(...) 
\ + do { \ + if (fcinfo->context && IsA(fcinfo->context, FuncCallError)) { \ + pg_prevent_errno_in_scope(); \ + \ + errstart(ERROR, TEXTDOMAIN); \ + (__VA_ARGS__); \ + castNode(FuncCallError, fcinfo->context)->error = CopyErrorData(); \ + FlushErrorState(); \ + \ + PG_RETURN_NULL(); \ + } else { \ + ereport(ERROR, (__VA_ARGS__)); \ + } \ + } while (0) + #define TEXTDOMAIN NULL extern bool message_level_is_interesting(int elevel); -- 2.25.1 From 7831b2c571784a614dd4ee4b48730187b0c0a8d1 Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Wed, 31 Aug 2022 23:02:06 +0300 Subject: [PATCH 2/5] Introduce pg_proc.proissafe and func_is_safe() --- src/backend/utils/cache/lsyscache.c | 19 +++ src/include/catalog/pg_proc.h | 3 +++ src/include/utils/lsyscache.h | 1 + 3 files changed, 23 insertions(+) diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c index a16a63f4957..ff3005077b2 100644 --- a/src/backend/utils/cache/lsyscache.c +++ b/src/backend/utils/cache/lsyscache.c @@ -1830,6 +1830,25 @@ get_func_leakproof(Oid funcid) return result; } +/* + * func_is_safe +
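The COPY `null_on_error` behavior shown in the example above can be thought of as a soft-error input function. This Python sketch is only an analogy for what PG_RETURN_ERROR enables; the function names are invented for illustration:

```python
def input_float(text: str):
    """Soft-error input function: return (value, None) on success or
    (None, error message) instead of raising -- the idea behind
    returning ErrorData rather than throwing it."""
    try:
        return float(text), None
    except ValueError as e:
        return None, str(e)

def copy_null_on_error(rows):
    """COPY-style loader that turns per-field input errors into NULLs."""
    out = []
    for text in rows:
        value, err = input_float(text)
        out.append(None if err else value)
    return out

# Mirrors the example: '1', 'err', '2' loads three rows, one NULL.
assert copy_null_on_error(["1", "err", "2"]) == [1.0, None, 2.0]
```

The key design point, as in the patch, is that only input functions declared safe participate; everything else keeps the ordinary throwing behavior.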
Re: SQL/JSON features for v15
On 8/31/22 3:08 PM, Andrew Dunstan wrote: On 2022-08-31 We 14:45, Tom Lane wrote: To the extent that there was a management failure here, it was that we didn't press for a resolution sooner. Given the scale of the concerns raised in June, I kind of agree with Andres' opinion that fixing them post-freeze was doomed to failure. It was definitely doomed once we reached August with no real work done towards it. I'm not going to comment publicly in general about this, you might imagine what my reaction is. The decision is the RMT's to make and I have no quarrel with that. But I do want it understood that there was work being done right from the time in June when Andres' complaints were published. These were difficult issues, and we didn't let the grass grow looking for a fix. I concede that might not have been visible until later. June was a bit of a rough month too -- we had the issues that spawned the out-of-cycle release at the top of the month, which started almost right after Beta 1, and then almost immediately into Beta 2 after 14.4. I know that consumed a lot of my cycles. At that point in the v15 release process I was primarily focused on monitoring open items, so I missed the June comments. Jonathan
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Hi, On 2022-08-31 16:05:03 -0400, Tom Lane wrote: > Even if there's only a visible leak in v15/HEAD, I'm inclined > to back-patch this all the way, because it's certainly buggy > on its own terms. It's just the merest accident that neither > caller is leaking other stuff into TopMemoryContext, because > they both think they are using a short-lived context. +1 > Thanks for the report! +1 Greetings, Andres Freund
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Reid Thompson writes: > This patch ensures get_database_list() switches back to the memory > context in use upon entry rather than returning with TopMemoryContext > as the context. > This will address memory allocations in autovacuum.c being associated > with TopMemoryContext when they should not be. > autovacuum.c do_start_worker() with current context > 'Autovacuum start worker (tmp)' invokes get_database_list(). Upon > return, the current context has been changed to TopMemoryContext by > AtCommit_Memory() as part of an internal transaction. Further down > in the do_start_worker(), pgstat_fetch_stat_dbentry() is invoked. > Previously this didn't pose an issue; however, recent changes altered > how pgstat_fetch_stat_dbentry() is implemented. The new > implementation has a branch utilizing palloc. The patch ensures these > allocations are associated with the 'Autovacuum start worker (tmp)' > context rather than the TopMemoryContext. Prior to the change, > leaving an idle laptop PG instance running just shy of 3 days saw the > autovacuum launcher process grow to 42MB with most of that growth in > TopMemoryContext due to the palloc allocations issued with autovacuum > worker startup. Yeah, I can reproduce noticeable growth within a couple minutes after setting autovacuum_naptime to 1s, and I confirm that the launcher's space consumption stays static after applying this. Even if there's only a visible leak in v15/HEAD, I'm inclined to back-patch this all the way, because it's certainly buggy on its own terms. It's just the merest accident that neither caller is leaking other stuff into TopMemoryContext, because they both think they are using a short-lived context. Thanks for the report! regards, tom lane
Re: First draft of the PG 15 release notes
On Wed, Aug 31, 2022 at 11:38:33AM +0200, Peter Eisentraut wrote: > On 30.08.22 22:42, Bruce Momjian wrote: > > Good question --- the full text is: > > > > > > > > Record and check the collation of each > linkend="sql-createdatabase">database (Peter Eisentraut) > > > > > > > > This is designed to detect collation > > mismatches to avoid data corruption. Function > > pg_database_collation_actual_version() > > reports the underlying operating system collation version, and > > ALTER DATABASE ... REFRESH sets the database > > to match the operating system collation version. DETAILS? > > > > > > > > I just can't figure out what the user needs to understand about this, > > and I understand very little of it. > > We already had this feature for (schema-level) collations, now we have it on > the level of the database collation. Okay, I figured out the interplay between OS collation version support, collation libraries, and collation levels. Here is an updated patch for the release notes. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson diff --git a/doc/src/sgml/release-15.sgml b/doc/src/sgml/release-15.sgml index e3ab13e53c..dca4550c03 100644 --- a/doc/src/sgml/release-15.sgml +++ b/doc/src/sgml/release-15.sgml @@ -578,17 +578,17 @@ Author: Peter Eisentraut - Record and check the collation of each database (Peter Eisentraut) - This is designed to detect collation + This is designed to detect collation version mismatches to avoid data corruption. Function pg_database_collation_actual_version() reports the underlying operating system collation version, and ALTER DATABASE ... REFRESH sets the database - to match the operating system collation version. DETAILS? + to match the operating system collation version. 
@@ -605,9 +605,11 @@ Author: Peter Eisentraut - Previously, ICU collations could only be - specified in CREATE - COLLATION and used with the + Previously, only libc-based + collations could be set at the cluster and database levels. + ICU collations were previously limited + to CREATE + COLLATION and referenced by the COLLATE clause.
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Hi, On 2022-08-31 12:24:28 -0700, Nathan Bossart wrote: > On Wed, Aug 31, 2022 at 12:14:33PM -0700, Andres Freund wrote: > > On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: > > It continues to make no sense to me to add behaviour changes around > > error-handling as part of a conversion to get_dirent_type(). I don't at all > > understand why e.g. the above change to make copydir() silently skip over > > files it can't stat is ok? > > In this example, the call to get_dirent_type() should ERROR if the call to > lstat() fails (the "elevel" argument is set to ERROR). Oh, oops. Skimmed code too quickly... Greetings, Andres Freund
Re: pg_stat_have_stats() returns true for dropped indexes (or for index creation transaction rolled back)
Hi, On 2022-08-25 11:44:34 +0200, Drouvot, Bertrand wrote: > For REPLSLOT, I agree that we can add one test: I added it in > contrib/test_decoding/sql/stats.sql. It relies on pg_stat_have_stats() (as > relying on pg_stat_replication_slots and/or pg_stat_get_replication_slot() > would not help that much for this test given that the slot has been removed > from ReplicationSlotCtl) As Horiguchi-san noted, we can't rely on specific indexes being used. I feel ok with the current coverage in that area, but if we *really* feel we need to test it, we'd need to count the number of indexes with slots before dropping the slot and after the drop. > +-- pg_stat_have_stats returns true for committed index creation Maybe another test for an uncommitted index creation would be good too? Could you try running this test with debug_discard_caches = 1 - it's pretty easy to write tests in this area that aren't reliable timing wise. > +CREATE table stats_test_tab1 as select generate_series(1,10) a; > +CREATE index stats_test_idx1 on stats_test_tab1(a); > +SELECT oid AS dboid from pg_database where datname = current_database() \gset Since you introduced this, maybe convert the other instance of this query at the end of the file as well? Greetings, Andres Freund
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Wed, Aug 31, 2022 at 12:14:33PM -0700, Andres Freund wrote: > On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: >> -if (lstat(fromfile, &fst) < 0) >> -ereport(ERROR, >> -(errcode_for_file_access(), >> - errmsg("could not stat file \"%s\": >> %m", fromfile))); >> +xlde_type = get_dirent_type(fromfile, xlde, false, ERROR); >> >> -if (S_ISDIR(fst.st_mode)) >> +if (xlde_type == PGFILETYPE_DIR) >> { >> /* recurse to handle subdirectories */ >> if (recurse) >> copydir(fromfile, tofile, true); >> } >> -else if (S_ISREG(fst.st_mode)) >> +else if (xlde_type == PGFILETYPE_REG) >> copy_file(fromfile, tofile); >> } >> FreeDir(xldir); > > It continues to make no sense to me to add behaviour changes around > error-handling as part of a conversion to get_dirent_type(). I don't at all > understand why e.g. the above change to make copydir() silently skip over > files it can't stat is ok? In this example, the call to get_dirent_type() should ERROR if the call to lstat() fails (the "elevel" argument is set to ERROR). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 11:56:29AM -0700, Andres Freund wrote: > Hi, > > On 2022-08-31 19:52:49 +0200, Matthias van de Meent wrote: > > As for having a lower granularity and preventing the > > one-syscall-per-Relation issue, can't we reuse the query_start or > > state_change timestamps that appear in pg_stat_activity (potentially > > updated immediately before this stat flush), or some other per-backend > > timestamp that is already maintained and considered accurate enough > > for this use? > > The problem is that it won't change at all for a query that runs for a week - > and we'll report the timestamp from a week ago when it finally ends. > > But given this is done when stats are flushed, which only happens after the > transaction ended, we can just use GetCurrentTransactionStopTimestamp() - if > we got to flushing the transaction stats we'll already have computed that. Oh, good point --- it is safer to show a more recent time than a too-old time. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: [PATCH] Add sortsupport for range types and btree_gist
Hi! Sorry for the long delay. This fixes the crashes, now all tests run fine w/ and w/o debug configuration. That shouldn't even have slipped through as such. Notable changes from v1: - gbt_enum_sortsupport() now passes on fcinfo->flinfo enum_cmp_internal() needs a place to cache the typcache entry. - inet sortsupport now uses network_cmp() directly Thanks, Christoph HeissFrom 667779bc0761c1356141722181c5a54ac46d96b9 Mon Sep 17 00:00:00 2001 From: Christoph Heiss Date: Wed, 31 Aug 2022 19:20:43 +0200 Subject: [PATCH v2 1/1] Add sortsupport for range types and btree_gist Incrementally building up a GiST index can result in a sub-optimal index structure for range types. By sorting the data before inserting it into the index will result in a much better index. This can provide sizeable improvements in query execution time, I/O read time and shared block hits/reads. Signed-off-by: Christoph Heiss --- contrib/btree_gist/Makefile | 3 +- contrib/btree_gist/btree_bit.c | 19 contrib/btree_gist/btree_bool.c | 22 contrib/btree_gist/btree_cash.c | 22 contrib/btree_gist/btree_enum.c | 26 + contrib/btree_gist/btree_gist--1.7--1.8.sql | 110 contrib/btree_gist/btree_gist.control | 2 +- contrib/btree_gist/btree_inet.c | 19 contrib/btree_gist/btree_interval.c | 19 contrib/btree_gist/btree_macaddr8.c | 19 contrib/btree_gist/btree_time.c | 19 src/backend/utils/adt/rangetypes_gist.c | 70 + src/include/catalog/pg_amproc.dat | 3 + src/include/catalog/pg_proc.dat | 3 + 14 files changed, 354 insertions(+), 2 deletions(-) create mode 100644 contrib/btree_gist/btree_gist--1.7--1.8.sql diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile index 48997c75f6..4096de73ea 100644 --- a/contrib/btree_gist/Makefile +++ b/contrib/btree_gist/Makefile @@ -33,7 +33,8 @@ EXTENSION = btree_gist DATA = btree_gist--1.0--1.1.sql \ btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \ btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \ - btree_gist--1.5--1.6.sql 
btree_gist--1.6--1.7.sql + btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql \ + btree_gist--1.7--1.8.sql PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes" REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \ diff --git a/contrib/btree_gist/btree_bit.c b/contrib/btree_gist/btree_bit.c index 5b246bcde4..bf32bfd628 100644 --- a/contrib/btree_gist/btree_bit.c +++ b/contrib/btree_gist/btree_bit.c @@ -7,6 +7,7 @@ #include "btree_utils_var.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/sortsupport.h" #include "utils/varbit.h" @@ -19,10 +20,17 @@ PG_FUNCTION_INFO_V1(gbt_bit_picksplit); PG_FUNCTION_INFO_V1(gbt_bit_consistent); PG_FUNCTION_INFO_V1(gbt_bit_penalty); PG_FUNCTION_INFO_V1(gbt_bit_same); +PG_FUNCTION_INFO_V1(gbt_bit_sortsupport); /* define for comparison */ +static int +bit_fast_cmp(Datum x, Datum y, SortSupport ssup) +{ + return DatumGetInt32(DirectFunctionCall2(byteacmp, x, y)); +} + static bool gbt_bitgt(const void *a, const void *b, Oid collation, FmgrInfo *flinfo) { @@ -208,3 +216,14 @@ gbt_bit_penalty(PG_FUNCTION_ARGS) PG_RETURN_POINTER(gbt_var_penalty(result, o, n, PG_GET_COLLATION(), , fcinfo->flinfo)); } + +Datum +gbt_bit_sortsupport(PG_FUNCTION_ARGS) +{ + SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); + + ssup->comparator = bit_fast_cmp; + ssup->ssup_extra = NULL; + + PG_RETURN_VOID(); +} diff --git a/contrib/btree_gist/btree_bool.c b/contrib/btree_gist/btree_bool.c index 8b2af129b5..3a9f230e68 100644 --- a/contrib/btree_gist/btree_bool.c +++ b/contrib/btree_gist/btree_bool.c @@ -6,6 +6,7 @@ #include "btree_gist.h" #include "btree_utils_num.h" #include "common/int.h" +#include "utils/sortsupport.h" typedef struct boolkey { @@ -23,6 +24,16 @@ PG_FUNCTION_INFO_V1(gbt_bool_picksplit); PG_FUNCTION_INFO_V1(gbt_bool_consistent); PG_FUNCTION_INFO_V1(gbt_bool_penalty); PG_FUNCTION_INFO_V1(gbt_bool_same); +PG_FUNCTION_INFO_V1(gbt_bool_sortsupport); + +static int +bool_fast_cmp(Datum x, 
Datum y, SortSupport ssup) +{ + bool arg1 = DatumGetBool(x); + bool arg2 = DatumGetBool(y); + + return arg1 - arg2; +} static bool gbt_boolgt(const void *a, const void *b, FmgrInfo *flinfo) @@ -167,3 +178,14 @@ gbt_bool_same(PG_FUNCTION_ARGS) *result = gbt_num_same((void *) b1, (void *) b2, , fcinfo->flinfo); PG_RETURN_POINTER(result); } + +Datum +gbt_bool_sortsupport(PG_FUNCTION_ARGS) +{ + SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); + + ssup->comparator = bool_fast_cmp; + ssup->ssup_extra = NULL; + + PG_RETURN_VOID(); +} diff --git a/contrib/btree_gist/btree_cash.c b/contrib/btree_gist/btree_cash.c index 6ac97e2b12..389b725130 100644 ---
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Hi, On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: > @@ -50,7 +51,7 @@ copydir(char *fromdir, char *todir, bool recurse) > > while ((xlde = ReadDir(xldir, fromdir)) != NULL) > { > - struct stat fst; > + PGFileType xlde_type; > > /* If we got a cancel signal during the copy of the directory, > quit */ > CHECK_FOR_INTERRUPTS(); > @@ -62,18 +63,15 @@ copydir(char *fromdir, char *todir, bool recurse) > snprintf(fromfile, sizeof(fromfile), "%s/%s", fromdir, > xlde->d_name); > snprintf(tofile, sizeof(tofile), "%s/%s", todir, xlde->d_name); > > - if (lstat(fromfile, &fst) < 0) > - ereport(ERROR, > - (errcode_for_file_access(), > - errmsg("could not stat file \"%s\": > %m", fromfile))); > + xlde_type = get_dirent_type(fromfile, xlde, false, ERROR); > > - if (S_ISDIR(fst.st_mode)) > + if (xlde_type == PGFILETYPE_DIR) > { > /* recurse to handle subdirectories */ > if (recurse) > copydir(fromfile, tofile, true); > } > - else if (S_ISREG(fst.st_mode)) > + else if (xlde_type == PGFILETYPE_REG) > copy_file(fromfile, tofile); > } > FreeDir(xldir); It continues to make no sense to me to add behaviour changes around error-handling as part of a conversion to get_dirent_type(). I don't at all understand why e.g. the above change to make copydir() silently skip over files it can't stat is ok? Greetings, Andres Freund
Re: SQL/JSON features for v15
On 2022-08-31 We 14:45, Tom Lane wrote: > To the extent that there was a management failure here, it was that > we didn't press for a resolution sooner. Given the scale of the > concerns raised in June, I kind of agree with Andres' opinion that > fixing them post-freeze was doomed to failure. It was definitely > doomed once we reached August with no real work done towards it. I'm not going to comment publicly in general about this, you might imagine what my reaction is. The decision is the RMT's to make and I have no quarrel with that. But I do want it understood that there was work being done right from the time in June when Andres' complaints were published. These were difficult issues, and we didn't let the grass grow looking for a fix. I concede that might not have been visible until later. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [PATCH] Query Jumbling for CALL and SET utility statements
Hi, On 2022-08-31 11:00:05 -0700, Jeremy Schneider wrote: > On 8/31/22 9:08 AM, Andres Freund wrote: > > > > I suspect we should carve out things like CALL, PREPARE, EXECUTE from > > track_utility - it's more or less an architectural accident that they're > > utility statements. It's a bit less clear that SET should be dealt with > > that > > way. > > Regarding SET, the compelling use case was around "application_name" > whose purpose is to provide a label in pg_stat_activity and on log > lines, which can be used to improve observability and connect queries to > their source in application code. I wasn't saying that SET shouldn't be jumbled, just that it seems more reasonable to track it only when track_utility is enabled, rather than doing so even when that's disabled. Which I do think makes sense for executing a prepared statement and calling a procedure, since they're really only utility statements by accident. > Personally, at this point, I think pg_stat_statements is critical > infrastructure for anyone running PostgreSQL at scale. The information > it provides is indispensable. I don't think it's really defensible to > tell people that if they want to scale, then they need to fly blind on > any utility statements. I wasn't suggesting doing so at all. Greetings, Andres Freund
Re: Tracking last scan time
Hi, On 2022-08-31 19:52:49 +0200, Matthias van de Meent wrote: > As for having a lower granularity and preventing the > one-syscall-per-Relation issue, can't we reuse the query_start or > state_change timestamps that appear in pg_stat_activity (potentially > updated immediately before this stat flush), or some other per-backend > timestamp that is already maintained and considered accurate enough > for this use? The problem is that it won't change at all for a query that runs for a week - and we'll report the timestamp from a week ago when it finally ends. But given this is done when stats are flushed, which only happens after the transaction ended, we can just use GetCurrentTransactionStopTimestamp() - if we got to flushing the transaction stats we'll already have computed that. > tabentry->numscans += lstats->t_counts.t_numscans; > + if (pgstat_track_scans && lstats->t_counts.t_numscans) > + tabentry->lastscan = GetCurrentTimestamp(); Besides replacing GetCurrentTimestamp() with GetCurrentTransactionStopTimestamp(), this should then also check if tabentry->lastscan is already newer than the new timestamp. Greetings, Andres Freund
Re: SQL/JSON features for v15
Bruce Momjian writes: > I guess you are saying that setting a cut-off was a bad idea, or that > the cut-off was too close to the final release date. For me, I think > there were three questions: > 1. Were subtransactions acceptable, consensus no > 2. Could trapping errors work for PG 15, consensus no > 3. Could the feature be trimmed back for PG 15 to avoid these, consensus ? We could probably have accomplished #3 if there was more time, but we're out of time. (I'm not entirely convinced that spending effort towards #3 was productive anyway, given that we're now thinking about a much differently-scoped patch with API changes.) > I don't think our community works well when there are three issues in > play at once. To the extent that there was a management failure here, it was that we didn't press for a resolution sooner. Given the scale of the concerns raised in June, I kind of agree with Andres' opinion that fixing them post-freeze was doomed to failure. It was definitely doomed once we reached August with no real work done towards it. regards, tom lane
Re: effective_multixact_freeze_max_age issue
On Tue, Aug 30, 2022 at 8:56 PM Nathan Bossart wrote: > LGTM Pushed, thanks. -- Peter Geoghegan
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 12:04:44PM -0400, Tom Lane wrote: > Andres Freund writes: > > From my POV the only real discussion is whether we'd want to revert this in > > 15 > > and HEAD or just 15. There's imo a decent point to be made to just revert in > > 15 and aggressively press forward with the changes posted in this thread. > > I'm not for that. Code that we don't think is ready to ship > has no business being in the common tree, nor does it make > review any easier to be looking at one bulky set of > already-committed patches and another bulky set of deltas. Agreed on removing from PG 15 and master --- it would be confusing to have lots of incomplete code in master that is not in PG 15. > I'm okay with making an exception for the include/nodes/ and > backend/nodes/ files in HEAD, since the recent changes in that > area mean it'd be a lot of error-prone work to produce a reverting > patch there. We can leave those in as dead code temporarily, I think. I don't have an opinion on this. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 12:26:29PM -0400, Robert Haas wrote: > someone else to make the decision. But that's not how it works. The > RMT concept was invented precisely to solve problems like this one, > where the patch authors don't really want to revert it but other > people think it's pretty busted. If such problems were best addressed > by waiting for a long time to see whether anything changes, we > wouldn't need an RMT. That's exactly how we used to handle these kinds > of problems, and it sucked. I saw the RMT/Jonathan stated August 28 as the cut-off date for a decision, which was later changed to September 1: > The RMT is still inclined to revert, but will give folks until Sep 1 0:00 > AoE[1] to reach consensus on if SQL/JSON can be included in v15. This matches > up to Andrew's availability timeline for a revert, and gives enough time to > get through the buildfarm prior to the Beta 4 release[2]. I guess you are saying that setting a cut-off was a bad idea, or that the cut-off was too close to the final release date. For me, I think there were three questions: 1. Were subtransactions acceptable, consensus no 2. Could trapping errors work for PG 15, consensus no 3. Could the feature be trimmed back for PG 15 to avoid these, consensus ? I don't think our community works well when there are three issues in play at once. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: SQL/JSON features for v15
On 2022-08-31 We 12:48, Jonathan S. Katz wrote: > > > With RMT hat on -- Andrew can you please revert the patchset? :-( Yes, I'll do it, starting with the v15 branch. Might take a day or so. cheers (kinda) andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [RFC] building postgres with meson - v12
Hi, On 2022-08-31 10:28:05 +0200, Peter Eisentraut wrote: > I found that the perl test modules are not installed. See attached patch to > correct this. > > To the patches: > > 4e15ee0e24 Don't hardcode tmp_check/ as test directory for tap tests > 1a3169bc3f Split TESTDIR into TESTLOGDIR and TESTDATADIR > > It's a bit weird that the first patch changes the meaning of TESTDIR > and the second patch removes it. Maybe these patches should be > squashed together? Hm, to me they seem topically separate enough, but I don't have a strong opinion on it. > 581721fa99 meson: prereq: Add src/tools/gen_export.pl > > Still wondering about the whitespace changes I reported recently, but > that can also be fine-tuned later. I'll look into it in a bit. > a1fb97a81b meson: Add meson based buildsystem > > I'm not a fan of all this business to protect the two build systems > from each other. I don't like the build process touching a file under > version control every time. How necessary is this? What happens > otherwise? I added it after just about everyone trying meson hit problems due to conflicts between (past) in-tree configure builds and meson, due to files left in tree (picking up the wrong .h files, cannot entirely be fixed with -I arguments, due to the "" includes). By adding the relevant check to the meson configure phase, and by triggering meson re-configure whenever an in-tree configure build is done, these issues can be detected. It'd of course be nicer to avoid the potential for such conflicts, but that appears to be a huge chunk of work, see the bison/flex subthread. So I don't really see an alternative. > conversion_helpers.txt: should probably be removed now. Done. > doc/src/sgml/resolv.xsl: I don't understand what this is doing. Maybe > at least add a comment in the file. It's only used for building epubs. Perhaps I should extract that into a separate patch as well? 
The relevant section is: > # > # epub > # > > # This was previously implemented using dbtoepub - but that doesn't seem to > # support running in build != source directory (i.e. VPATH builds already > # weren't supported). > if pandoc.found() and xsltproc.found() > # XXX: Wasn't able to make pandoc successfully resolve entities > # XXX: Perhaps we should just make all targets use this, to avoid repeatedly > # building whole thing? It's comparatively fast though. > postgres_full_xml = custom_target('postgres-full.xml', > input: ['resolv.xsl', 'postgres.sgml'], > output: ['postgres-full.xml'], > depends: doc_generated + [postgres_sgml_valid], > command: [xsltproc, '--path', '@OUTDIR@/', xsltproc_flags, > '-o', '@OUTPUT@', '@INPUT@'], > build_by_default: false, > ) A noted, I couldn't make pandoc resolve our entities, so I used resolv.xsl them, before calling pandoc. I'll rename it to resolve-entities.xsl and add a comment. > src/common/unicode/meson.build: The comment at the top of the file > should be moved next to the files it is describing (similar to how it > is in the makefile). Done. > I don't see CLDR_VERSION set anywhere. Is that part implemented? No, I didn't implement the generation parts of contrib/unaccent. I started tackling the src/common/unicode bits after John Naylor asked whether that could be done, but considered that good enough... > src/port/win32ver.rc.in: This is redundant with src/port/win32ver.rc. > (Note that the latter is also used as an input file for text > substitution. So having another file named *.in next to it would be > super confusing.) Yea, this stuff isn't great. I think the better solution, both for meson and for configure, would be to move to do all the substitution to the C preprocessor. > src/tools/find_meson: Could use a brief comment what it does. Added. > src/tools/pgflex: Could use a not-brief comment about what it does, > why it's needed. Also a comment where it's used. Also run this > through pycodestyle. Working on that. 
> cd193eb3e8 meson: ci: Build both with meson and as before > > I haven't reviewed this one in detail. Maybe add a summary in the > commit message, like these are the new jobs, these are the changes to > existing jobs. It looks like there is more in there than just adding > a few meson jobs. I don't think we want to commit this as-is. It contains CI for a lot of platforms - that's very useful for working on meson, but too much for in-tree. I guess I'll split it into two, one patch for converting a reasonable subset of the current CI tasks to meson and another to add (back) the current array of tested platforms. > If the above are addressed, I think this will be just about at the > point where the above patches can be committed. Woo! > Everything past these patches I'm mentally postponing right now. Makes sense. Greetings, Andres Freund
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 07:52:49PM +0200, Matthias van de Meent wrote: > As for having a lower granularity and preventing the > one-syscall-per-Relation issue, can't we reuse the query_start or > state_change timestamps that appear in pg_stat_activity (potentially Yeah, query start should be fine, but not transaction start time. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Tracking last scan time
On Wed, 31 Aug 2022 at 18:21, Andres Freund wrote: > > Hi, > > On 2022-08-23 10:55:09 +0100, Dave Page wrote: > > Often it is beneficial to review one's schema with a view to removing > > indexes (and sometimes tables) that are no longer required. It's very > > difficult to understand when that is the case by looking at the number of > > scans of a relation as, for example, an index may be used infrequently but > > may be critical in those times when it is used. > > > > The attached patch against HEAD adds optional tracking of the last scan > > time for relations. It updates pg_stat_*_tables with new last_seq_scan and > > last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to > > help with this. > > > > Due to the use of gettimeofday(), those values are only maintained if a new > > GUC, track_scans, is set to on. By default, it is off. > > > > I did run a 12 hour test to see what the performance impact is. pgbench was > > run with scale factor 1 and 75 users across 4 identical bare metal > > machines running Rocky 8 in parallel which showed roughly a -2% average > > performance penalty against HEAD with track_scans enabled. Machines were > > PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data > > directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source > > is tsc. > > > > HEAD track_scans Penalty (%) > > box1 19582.49735 19341.8881 -1.22869541 > > box2 19936.55513 19928.07479 -0.04253664659 > > box3 19631.78895 18649.64379 -5.002830696 > > box4 19810.86767 19420.67192 -1.969604525 > > Average 19740.42728 19335.06965 -2.05343896 > > Based on the size of those numbers this was a r/w pgbench. If it has this > noticeable an impact for r/w, with a pretty low number of scans/sec, how's the > overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It > must be quite bad.
> > I don't think we should accept this feature with this overhead - but I also > think we can do better, by accepting a bit less accuracy. For this to be > useful we don't need a perfectly accurate timestamp. The statement start time > is probably not accurate enough, but we could just have bgwriter or such > update one in shared memory every time we wake up? Or perhaps we could go to > an even lower granularity, by putting in the current LSN or such? I don't think that LSN is precise enough. For example, if you're in a (mostly) read-only system, the system may go long times without any meaningful records being written. As for having a lower granularity and preventing the one-syscall-per-Relation issue, can't we reuse the query_start or state_change timestamps that appear in pg_stat_activity (potentially updated immediately before this stat flush), or some other per-backend timestamp that is already maintained and considered accurate enough for this use? Regardless, with this patch as it is we get a new timestamp for each relation processed, which I think is a waste of time (heh) even in VDSO-enabled systems. Apart from the above, I don't have any other meaningful opinion on this patch - it might be a good addition, but I don't consume stats often enough to make a good cost / benefit comparison. Kind regards, Matthias van de Meent
Re: PostgreSQL 15 release announcement draft
Hi,

On Tue, Aug 30, 2022 at 03:58:48PM -0400, Jonathan S. Katz wrote:
> ### Other Notable Changes
>
> PostgreSQL server-level statistics are now collected in shared memory,
> eliminating the statistics collector process and writing these stats to disk.
> PostgreSQL 15 also revokes the `CREATE` permission from all users except a
> database owner from the `public` (or default) schema.

It's a bit weird to lump those two in the same paragraph, but ok. However, I think the "and writing these stats to disk." might not be very clear to people not familiar with the feature; they might think writing stats to disk is part of the new feature. So I propose "as well as writing these stats to disk" instead, or something.

Michael
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On Wed, Aug 31, 2022 at 12:50:19PM -0400, Reid Thompson wrote: > Hi Hackers, > > Add the ability to limit the amount of memory that can be allocated to > backends. > > This builds on the work that adds backend memory allocated to > pg_stat_activity > https://www.postgresql.org/message-id/67bb5c15c0489cb499723b0340f16e10c22485ec.camel%40crunchydata.com > Both patches are attached. You should name the patches with different prefixes, like 001,002,003 Otherwise, cfbot may try to apply them in the wrong order. git format-patch is the usual tool for that. > +Specifies a limit to the amount of memory (MB) that may be allocated > to MB are just the default unit, right ? The user should be allowed to write max_total_backend_memory='2GB' > +backends in total (i.e. this is not a per user or per backend limit). > +If unset, or set to 0 it is disabled. A backend request that would > push > +the total over the limit will be denied with an out of memory error > +causing that backends current query/transaction to fail. Due to the > dynamic backend's > +nature of memory allocations, this limit is not exact. If within > 1.5MB of > +the limit and two backends request 1MB each at the same time both > may be > +allocated exceeding the limit. Further requests will not be > allocated until allocated, and exceed the limit > +bool > +exceeds_max_total_bkend_mem(uint64 allocation_request) > +{ > + bool result = false; > + > + if (MyAuxProcType != NotAnAuxProcess) > + return result; The double negative is confusing, so could use a comment. > + /* Convert max_total_bkend_mem to bytes for comparison */ > + if (max_total_bkend_mem && > + pgstat_get_all_backend_memory_allocated() + > + allocation_request > (uint64)max_total_bkend_mem * 1024 * 1024) > + { > + /* > + * Explicitely identify the OOM being a result of this > + * configuration parameter vs a system failure to allocate OOM. 
> + */ > + elog(WARNING, > + "request will exceed postgresql.conf defined > max_total_backend_memory limit (%lu > %lu)", > + pgstat_get_all_backend_memory_allocated() + > + allocation_request, (uint64)max_total_bkend_mem * 1024 > * 1024); I think it should be ereport() rather than elog(), which is internal-only, and not-translated. > + {"max_total_backend_memory", PGC_SIGHUP, RESOURCES_MEM, > + gettext_noop("Restrict total backend memory > allocations to this max."), > + gettext_noop("0 turns this feature off."), > + GUC_UNIT_MB > + }, > + _total_bkend_mem, > + 0, 0, INT_MAX, > + NULL, NULL, NULL I think this needs a maximum like INT_MAX/1024/1024 > +uint64 > +pgstat_get_all_backend_memory_allocated(void) > +{ ... > + for (i = 1; i <= NumBackendStatSlots; i++) > + { It's looping over every backend for each allocation. Do you know if there's any performance impact of that ? I think it may be necessary to track the current allocation size in shared memory (with atomic increments?). Maybe decrements would need to be exactly accounted for, or otherwise Assert() that the value is not negative. I don't know how expensive it'd be to have conditionals for each decrement, but maybe the value would only be decremented at strategic times, like at transaction commit or backend shutdown. -- Justin
Re: [PATCH] Add native windows on arm64 support
Michael Paquier writes: > At the end, I'd like to think that it would be wiser to just remove > entirely the test for node() or reduce the contents of the string to > be able to strictly control the output order (say a simple > 'prepost' would do the trick). I cannot think > about a better idea now. Note that removing the test case where we > have node() has no impact on the coverage of xml.c. Yeah, I confirm that: local code-coverage testing shows no change in the number of lines reached in xml.c when I remove that column: -SELECT * FROM XMLTABLE('*' PASSING 'predeeppost' COLUMNS x xml PATH 'node()', y xml PATH '/'); +SELECT * FROM XMLTABLE('*' PASSING 'predeeppost' COLUMNS y xml PATH '/'); Dropping the query altogether does result in a reduction in the number of lines hit. I did not check the other XMLTABLE infrastructure, so what probably is best to do is keep the two output columns but change 'node()' to something with a more stable result; any preferences? regards, tom lane
Re: SQL/JSON features for v15
Robert Haas writes: > On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: >> (I don't say that we'd convert every datatype to do so right away; >> in the long run we should, but I'm content to start with just the >> same core types touched here.) > I would be in favor of making more of an effort than just a few token > data types. The initial patch could just touch a few, but once the > infrastructure is in place we should really make a sweep through the > tree and tidy up. Sure, but my point is that we can do that in a time-extended fashion rather than having a flag day where everything must be updated. The initial patch just needs to update a few types as proof of concept. regards, tom lane
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: > The currently proposed patchset hacks up a relatively small number > of core datatypes to be able to do that. But it's just a hack > and there's no prospect of extension types being able to join > in the fun. I think where we need to start, for v16, is making > an API design that will let any datatype have this functionality. This would be really nice to have. > (I don't say that we'd convert every datatype to do so right away; > in the long run we should, but I'm content to start with just the > same core types touched here.) I would be in favor of making more of an effort than just a few token data types. The initial patch could just touch a few, but once the infrastructure is in place we should really make a sweep through the tree and tidy up. > Beside the JSON stuff, there is > another even more pressing application for such behavior, namely > the often-requested COPY functionality to be able to shunt bad data > off somewhere without losing the entire transfer. In the COPY case > I think we'd want to be able to capture the error message that > would have been issued, which means the current patches are not > at all appropriate as a basis for that API design: they're just > returning a bool without any details. Fully agreed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SQL/JSON features for v15
I wrote: > Andres Freund writes: >> From my POV the only real discussion is whether we'd want to revert this in >> 15 >> and HEAD or just 15. There's imo a decent point to be made to just revert in >> 15 and aggressively press forward with the changes posted in this thread. > I'm not for that. Code that we don't think is ready to ship > has no business being in the common tree, nor does it make > review any easier to be looking at one bulky set of > already-committed patches and another bulky set of deltas. To enlarge on that a bit: it seems to me that the really fundamental issue here is how to catch datatype-specific input and conversion errors without using subtransactions, because those are too expensive and can mask errors we'd rather not be masking, such as OOM. (Andres had some additional, more localized concerns, but I think this is the one with big-picture implications.) The currently proposed patchset hacks up a relatively small number of core datatypes to be able to do that. But it's just a hack and there's no prospect of extension types being able to join in the fun. I think where we need to start, for v16, is making an API design that will let any datatype have this functionality. (I don't say that we'd convert every datatype to do so right away; in the long run we should, but I'm content to start with just the same core types touched here.) Beside the JSON stuff, there is another even more pressing application for such behavior, namely the often-requested COPY functionality to be able to shunt bad data off somewhere without losing the entire transfer. In the COPY case I think we'd want to be able to capture the error message that would have been issued, which means the current patches are not at all appropriate as a basis for that API design: they're just returning a bool without any details. So that's why I'm in favor of reverting and starting over. 
There are probably big chunks of what's been done that can be re-used, but it all needs to be re-examined with this sort of design in mind. As a really quick sketch of what such an API might look like: we could invent a new node type, say IOCallContext, which is intended to be passed as FunctionCallInfo.context to type input functions and perhaps type conversion functions. Call sites wishing to have no-thrown-error functionality would initialize one of these to show "no error" and then pass it to the data type's usual input function. Old-style input functions would ignore this and just throw errors as usual; sorry, you don't get the no-error functionality you wanted. But I/O functions that had been updated would know to store the report of a relevant error into that node and then return NULL. (Although I think there may be assumptions somewhere that I/O functions don't return NULL, so maybe "just return any dummy value" is a better idea? Although likely it wouldn't be hard to remove such assumptions from callers using this functionality.) The caller would detect the presence of an error by examining the node contents and then do whatever it needs to do. regards, tom lane
Re: Add tracking of backend memory allocated to pg_stat_activity
On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote:
> Hi Hackers,
>
> Attached is a patch to
> Add tracking of backend memory allocated to pg_stat_activity

> + proargmodes =>
> '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',

In the past, there was concern about making pg_stat_activity wider by adding information that's less-essential than what's been there for years. This is only an int64, so it's not "wide", but I wonder if there's another way to expose this information? Like adding backends to pg_get_backend_memory_contexts(), maybe with another view on top of that ?

+ * shown allocated in pgstat_activity when the creator destroys the pg_stat

> + * Posix creation calls dsm_impl_posix_resize implying that resizing
> + * occurs or may be added in the future. As implemented
> + * dsm_impl_posix_resize utilizes fallocate or truncate, passing the
> + * whole new size as input, growing the allocation as needed * (only
> + * truncate supports shrinking). We update by replacing the * old

wrapping caused extraneous stars

> + * Do not allow backend_mem_allocated to go below zero. ereport if we
> + * would have. There's no need for a lock around the read here asit's

as it's

> + ereport(LOG, (errmsg("decrease reduces reported backend memory
> allocated below zero; setting reported to 0")));

errmsg() doesn't require the outside parentheses since a couple of years ago.

> + /*
> + * Until header allocation is included in context->mem_allocated cast to
> + * slab and decrement the headerSize

add a comma before "cast" ?

-- Justin
Re: [PATCH] Query Jumbling for CALL and SET utility statements
st 31. 8. 2022 v 17:50 odesílatel Pavel Stehule napsal: > Hi > > > st 31. 8. 2022 v 17:34 odesílatel Drouvot, Bertrand > napsal: > >> Hi hackers, >> >> While query jumbling is provided for function calls that’s currently not >> the case for procedures calls. >> The reason behind this is that all utility statements are currently >> discarded for jumbling. >> >> We’ve recently seen performance impacts (LWLock contention) due to the >> lack of jumbling on procedure calls with pg_stat_statements and >> pg_stat_statements.track_utility enabled (think an application with a high >> rate of procedure calls with unique parameters for each call). >> >> Jeremy has had this conversation on twitter (see >> https://twitter.com/jer_s/status/1560003560116342785) and Nikolay >> reported that he also had to work on a similar performance issue with SET >> being used. >> >> That’s why we think it would make sense to allow jumbling for those 2 >> utility statements: CALL and SET. >> >> Please find attached a patch proposal for doing so. >> >> With the attached patch we would get things like: >> CALL MINUS_TWO(3); >> CALL MINUS_TWO(7); >> CALL SUM_TWO(3, 8); >> CALL SUM_TWO(7, 5); >> set enable_seqscan=false; >> set enable_seqscan=true; >> set seq_page_cost=2.0; >> set seq_page_cost=1.0; >> >> postgres=# SELECT query, calls, rows FROM pg_stat_statements; >>query | calls | rows >> ---+---+-- >> set seq_page_cost=$1 | 2 |0 >> CALL MINUS_TWO($1)| 2 |0 >> set enable_seqscan=$1 | 2 |0 >> CALL SUM_TWO($1, $2) | 2 |0 >> >> Looking forward to your feedback, >> > The idea is good, but I think you should use pg_stat_functions instead. > Maybe it is supported already (I didn't test it). I am not sure so SET > statement should be traced in pg_stat_statements - it is usually pretty > fast, and without context it says nothing. It looks like just overhead. > I was wrong - there is an analogy with SELECT fx, and the statistics are in pg_stat_statements, and in pg_stat_function too. 
Regards Pavel > > Regards > > Pavel > > >> Thanks, >> >> Jeremy & Bertrand >> >> -- >> Bertrand Drouvot >> Amazon Web Services: https://aws.amazon.com >> >>
Add the ability to limit the amount of memory that can be allocated to backends.
Hi Hackers, Add the ability to limit the amount of memory that can be allocated to backends. This builds on the work that adds backend memory allocated to pg_stat_activity https://www.postgresql.org/message-id/67bb5c15c0489cb499723b0340f16e10c22485ec.camel%40crunchydata.com Both patches are attached. Add GUC variable max_total_backend_memory. Specifies a limit to the amount of memory (MB) that may be allocated to backends in total (i.e. this is not a per user or per backend limit). If unset, or set to 0 it is disabled. It is intended as a resource to help avoid the OOM killer. A backend request that would push the total over the limit will be denied with an out of memory error causing that backends current query/transaction to fail. Due to the dynamic nature of memory allocations, this limit is not exact. If within 1.5MB of the limit and two backends request 1MB each at the same time both may be allocated exceeding the limit. Further requests will not be allocated until dropping below the limit. Keep this in mind when setting this value to avoid the OOM killer. Currently, this limit does not affect auxiliary backend processes, this list of non-affected backend processes is open for discussion as to what should/should not be included. Backend memory allocations are displayed in the pg_stat_activity view. -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thomp...@crunchydata.com www.crunchydata.com diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a5cd4e44c7..caf958310a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2079,6 +2079,32 @@ include_dir 'conf.d' + + max_total_backend_memory (integer) + + max_total_backend_memory configuration parameter + + + + +Specifies a limit to the amount of memory (MB) that may be allocated to +backends in total (i.e. this is not a per user or per backend limit). +If unset, or set to 0 it is disabled. 
A backend request that would push +the total over the limit will be denied with an out of memory error +causing that backends current query/transaction to fail. Due to the dynamic +nature of memory allocations, this limit is not exact. If within 1.5MB of +the limit and two backends request 1MB each at the same time both may be +allocated exceeding the limit. Further requests will not be allocated until +dropping below the limit. Keep this in mind when setting this value. This +limit does not affect auxiliary backend processes + . Backend memory allocations +(backend_mem_allocated) are displayed in the +pg_stat_activity +view. + + + + diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c index 269ad2fe53..808ffe75f2 100644 --- a/src/backend/storage/ipc/dsm_impl.c +++ b/src/backend/storage/ipc/dsm_impl.c @@ -253,6 +253,10 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size, return true; } + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* * Create new segment or open an existing one for attach. * @@ -524,6 +528,10 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size, int flags = IPCProtection; size_t segsize; + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* * Allocate the memory BEFORE acquiring the resource, so that we don't * leak the resource if memory allocation fails. @@ -718,6 +726,10 @@ dsm_impl_windows(dsm_op op, dsm_handle handle, Size request_size, return true; } + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* Create new segment or open an existing one for attach. 
*/ if (op == DSM_OP_CREATE) { diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c index 17a00587f8..9137a000ae 100644 --- a/src/backend/utils/activity/backend_status.c +++ b/src/backend/utils/activity/backend_status.c @@ -44,6 +44,8 @@ */ bool pgstat_track_activities = false; int pgstat_track_activity_query_size = 1024; +/* Max backend memory allocation allowed (MB). 0 = disabled */ +int max_total_bkend_mem = 0; /* exposed so that backend_progress.c can access it */ @@ -1253,3 +1255,107 @@ pgstat_report_backend_mem_allocated_decrease(uint64 deallocation) beentry->backend_mem_allocated -= deallocation; PGSTAT_END_WRITE_ACTIVITY(beentry); } + +/* -- + * pgstat_get_all_backend_memory_allocated() - + * + * Return a uint64 representing the current shared memory allocated to all + * backends. This looks directly at the BackendStatusArray, and so will + *
Re: SQL/JSON features for v15
On 8/31/22 12:26 PM, Robert Haas wrote: On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz wrote: Andres, Robert, Tom: With this recent work, have any of your opinions changed on including SQL/JSON in v15? No. Nothing's been committed, and there's no time to review anything in detail, and there was never going to be. OK. Based on this feedback, the RMT is going to request that this is reverted. With RMT hat on -- Andrew can you please revert the patchset? Nikita said he was ready to start hacking in mid-August. That's good of him, but feature freeze was in April. We don't start hacking on a feature 4 months after the freeze. I'm unwilling to drop everything I'm working on to review patches that were written in a last minute rush. Even if these patches were more important to me than my own work, which they are not, I couldn't possibly do a good job reviewing complex patches on top of other complex patches in an area that I haven't studied in years. And if I could do a good job, no doubt I'd find a bunch of problems - whether they would be large or small, I don't know - and then that would lead to more changes even closer to the intended release date. I just don't understand what the RMT thinks it is doing here. When a concern is raised about whether a feature is anywhere close to being in a releasable state in August, "hack on it some more and then see where we're at" seems like an obviously impractical way forward. It seemed clear to me from the moment that Andres raised his concerns that the only two viable strategies were (1) revert the feature and be sad or (2) decide to ship it anyway and hope that Andres is incorrect in thinking that it will become an embarrassment to the project. The RMT has chosen neither of these, and in fact, really seems to want someone else to make the decision. But that's not how it works. 
The RMT concept was invented precisely to solve problems like this one, where the patch authors don't really want to revert it but other people think it's pretty busted. If such problems were best addressed by waiting for a long time to see whether anything changes, we wouldn't need an RMT. That's exactly how we used to handle these kinds of problems, and it sucked.

This is fair feedback. However, there are a few things to consider here:

1. When Andres raised his initial concerns, the RMT did recommend to revert but did not force it. Part of the RMT charter is to try to get consensus before doing so and after we've exhausted the community process. As we moved closer, the patch authors proposed some suggestions which other folks were amenable to trying. Unfortunately, time has run out. However,

2. One of the other main goals of the RMT is to ensure the release ships "on time" which we define to be late Q3/early Q4. We factored that into the decision making process around this. We are still on time for the release.

I take responsibility for the decision making. I would be open to discussing this further around what worked / what didn't with the RMT and where we can improve in the future.

Thanks,

Jonathan
Re: SQL/JSON features for v15
Hi, On 2022-08-31 12:26:29 -0400, Robert Haas wrote: > On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz > wrote: > > Andres, Robert, Tom: With this recent work, have any of your opinions > > changed on including SQL/JSON in v15? > > No. Nothing's been committed, and there's no time to review anything > in detail, and there was never going to be. Nikita said he was ready > to start hacking in mid-August. That's good of him, but feature freeze > was in April. As additional context: I had started raising those concerns mid June. Greetings, Andres Freund
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz wrote: > Andres, Robert, Tom: With this recent work, have any of your opinions > changed on including SQL/JSON in v15? No. Nothing's been committed, and there's no time to review anything in detail, and there was never going to be. Nikita said he was ready to start hacking in mid-August. That's good of him, but feature freeze was in April. We don't start hacking on a feature 4 months after the freeze. I'm unwilling to drop everything I'm working on to review patches that were written in a last minute rush. Even if these patches were more important to me than my own work, which they are not, I couldn't possibly do a good job reviewing complex patches on top of other complex patches in an area that I haven't studied in years. And if I could do a good job, no doubt I'd find a bunch of problems - whether they would be large or small, I don't know - and then that would lead to more changes even closer to the intended release date. I just don't understand what the RMT thinks it is doing here. When a concern is raised about whether a feature is anywhere close to being in a releasable state in August, "hack on it some more and then see where we're at" seems like an obviously impractical way forward. It seemed clear to me from the moment that Andres raised his concerns that the only two viable strategies were (1) revert the feature and be sad or (2) decide to ship it anyway and hope that Andres is incorrect in thinking that it will become an embarrassment to the project. The RMT has chosen neither of these, and in fact, really seems to want someone else to make the decision. But that's not how it works. The RMT concept was invented precisely to solve problems like this one, where the patch authors don't really want to revert it but other people think it's pretty busted. If such problems were best addressed by waiting for a long time to see whether anything changes, we wouldn't need an RMT. 
That's exactly how we used to handle these kinds of problems, and it sucked. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Tracking last scan time
Hi,

On 2022-08-23 10:55:09 +0100, Dave Page wrote:
> Often it is beneficial to review one's schema with a view to removing
> indexes (and sometimes tables) that are no longer required. It's very
> difficult to understand when that is the case by looking at the number of
> scans of a relation as, for example, an index may be used infrequently but
> may be critical in those times when it is used.
>
> The attached patch against HEAD adds optional tracking of the last scan
> time for relations. It updates pg_stat_*_tables with new last_seq_scan and
> last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to
> help with this.
>
> Due to the use of gettimeofday(), those values are only maintained if a new
> GUC, track_scans, is set to on. By default, it is off.
>
> I did run a 12 hour test to see what the performance impact is. pgbench was
> run with scale factor 1 and 75 users across 4 identical bare metal
> machines running Rocky 8 in parallel which showed roughly a -2% average
> performance penalty against HEAD with track_scans enabled. Machines were
> PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data
> directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source
> is tsc.
>
>            HEAD         track_scans   Penalty (%)
> box1       19582.49735  19341.8881    -1.22869541
> box2       19936.55513  19928.07479   -0.04253664659
> box3       19631.78895  18649.64379   -5.002830696
> box4       19810.86767  19420.67192   -1.969604525
> Average    19740.42728  19335.06965   -2.05343896

Based on the size of those numbers this was a r/w pgbench. If it has this noticable an impact for r/w, with a pretty low number of scans/sec, how's the overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It must be quite bad.

I don't think we should accept this feature with this overhead - but I also think we can do better, by accepting a bit less accuracy. For this to be useful we don't need a perfectly accurate timestamp.
The statement start time is probably not accurate enough, but we could just have bgwriter or such update one in shared memory every time we wake up? Or perhaps we could go to an even lower granularity, by putting in the current LSN or such? Greetings, Andres Freund
Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Hi Hackers,

This patch ensures get_database_list() switches back to the memory context in use upon entry rather than returning with TopMemoryContext as the context. This will address memory allocations in autovacuum.c being associated with TopMemoryContext when they should not be.

autovacuum.c do_start_worker() with current context 'Autovacuum start worker (tmp)' invokes get_database_list(). Upon return, the current context has been changed to TopMemoryContext by AtCommit_Memory() as part of an internal transaction. Further down in do_start_worker(), pgstat_fetch_stat_dbentry() is invoked. Previously this didn't pose an issue, however recent changes altered how pgstat_fetch_stat_dbentry() is implemented. The new implementation has a branch utilizing palloc. The patch ensures these allocations are associated with the 'Autovacuum start worker (tmp)' context rather than the TopMemoryContext.

Prior to the change, leaving an idle laptop PG instance running just shy of 3 days saw the autovacuum launcher process grow to 42MB, with most of that growth in TopMemoryContext due to the palloc allocations issued with autovacuum worker startup.

-- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thomp...@crunchydata.com www.crunchydata.com

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index b3b1afba86..92b1ef45c1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1941,6 +1941,13 @@ get_database_list(void)
 	CommitTransactionCommand();
 
+	/*
+	 * CommitTransactionCommand returns with CurrentMemoryContext set
+	 * to TopMemoryContext. Switch back to the context that we entered
+	 * with.
+	 */
+	MemoryContextSwitchTo(resultcxt);
+
 	return dblist;
 }
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 05:02:33PM +0100, Dave Page wrote: > > > On Tue, 30 Aug 2022 at 19:46, Bruce Momjian wrote: > > On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote: > > On Thu, 25 Aug 2022 at 01:44, David Rowley wrote: > > I don't have a particular opinion about the patch, I'm just pointing > > out that there are other ways. Even just writing down the numbers on > a > > post-it note and coming back in a month to see if they've changed is > > enough to tell if the table or index has been used. > > > > > > There are usually other ways to perform monitoring tasks, but there is > > something to be said for the convenience of having functionality built > in > and > > not having to rely on tools, scripts, or post-it notes :-) > > Should we consider using something cheaper like time() so we don't need > a GUC to enable this? > > > Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls > takes about 2 seconds, whilst 100,000,000 time() calls takes 14(!) seconds. Wow. I was just thinking you need second-level accuracy, which must be cheap somewhere. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: [PATCH] Query Jumbling for CALL and SET utility statements
Hi, On 2022-08-31 17:33:44 +0200, Drouvot, Bertrand wrote: > While query jumbling is provided for function calls that’s currently not the > case for procedures calls. > The reason behind this is that all utility statements are currently > discarded for jumbling. > [...] > That’s why we think it would make sense to allow jumbling for those 2 > utility statements: CALL and SET. Yes, I've seen this be an issue repeatedly. Although more heavily for PREPARE / EXECUTE than either of the two cases you handle here. IME not tracking PREPARE / EXECUTE can distort statistics substantially - there's appears to be a surprising number of applications / frameworks resorting to them. Basically requiring that track utility is turned on. I suspect we should carve out things like CALL, PREPARE, EXECUTE from track_utility - it's more or less an architectural accident that they're utility statements. It's a bit less clear that SET should be dealt with that way. > @@ -383,6 +384,30 @@ JumbleExpr(JumbleState *jstate, Node *node) > APP_JUMB(var->varlevelsup); > } > break; > + case T_CallStmt: > + { > + CallStmt *stmt = (CallStmt *) node; > + FuncExpr *expr = stmt->funcexpr; > + > + APP_JUMB(expr->funcid); > + JumbleExpr(jstate, (Node *) expr->args); > + } > + break; Why do we need to take the arguments into account? > + case T_VariableSetStmt: > + { > + VariableSetStmt *stmt = (VariableSetStmt *) > node; > + > + APP_JUMB_STRING(stmt->name); > + JumbleExpr(jstate, (Node *) stmt->args); > + } > + break; Same? > + case T_A_Const: > + { > + int loc = ((const A_Const > *) node)->location; > + > + RecordConstLocation(jstate, loc); > + } > + break; I suspect we only need this because of the jumbling of unparsed arguments I questioned above? If we do end up needing it, shouldn't we include the type in the jumbling? Greetings, Andres Freund
Re: SQL/JSON features for v15
Andres Freund writes:
> On 2022-08-31 10:20:24 -0400, Jonathan S. Katz wrote:
>> Andres, Robert, Tom: With this recent work, have any of your opinions
>> changed on including SQL/JSON in v15?

> I don't really know what to do here. It feels blatantly obvious that this
> code isn't even remotely close to being releasable. I'm worried about the
> impact of the big revert at this stage of the release cycle, and that's
> not getting better by delaying further. And I'm getting weary of being
> asked to make the obvious call that the authors of this feature as well
> as the RMT should have made a while ago.

I have to agree.  There is a large amount of code at stake here.  We're
being asked to review a bunch of hastily-produced patches to that code
on an even more hasty schedule (and personally I have other things I
need to do today...)  I think the odds of a favorable end result are
small.

> From my POV the only real discussion is whether we'd want to revert this
> in 15 and HEAD or just 15. There's imo a decent point to be made to just
> revert in 15 and aggressively press forward with the changes posted in
> this thread.

I'm not for that.  Code that we don't think is ready to ship has no
business being in the common tree, nor does it make review any easier to
be looking at one bulky set of already-committed patches and another
bulky set of deltas.

I'm okay with making an exception for the include/nodes/ and
backend/nodes/ files in HEAD, since the recent changes in that area mean
it'd be a lot of error-prone work to produce a reverting patch there.
We can leave those in as dead code temporarily, I think.

			regards, tom lane
Add tracking of backend memory allocated to pg_stat_activity
Hi Hackers,

Attached is a patch to add tracking of backend memory allocated to
pg_stat_activity.

This new field displays the current bytes of memory allocated to the
backend process. It is updated as memory for the process is
malloc'd/free'd. Memory allocated to items on the freelist is included
in the displayed value. Dynamic shared memory allocations are included
only in the value displayed for the backend that created them; they are
not included in the value for backends that are attached to them, to
avoid double counting. On occasion, orphaned memory segments may be
cleaned up on postmaster startup. This may result in decreasing the sum
without a prior increment, so we limit the floor of
backend_mem_allocated to zero.

--
Reid Thompson
Senior Software Engineer
Crunchy Data, Inc.

reid.thomp...@crunchydata.com
www.crunchydata.com


diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1d9509a2f6..40ae638f25 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -947,6 +947,18 @@ postgres  27093  0.0  0.0  30096  2752 ?  Ss  11:34  0:00 postgres: ser
+
+
+      backend_mem_allocated bigint
+
+
+      The byte count of memory allocated to this backend. Dynamic shared
+      memory allocations are included only in the value displayed for the
+      backend that created them, they are not included in the value for
+      backends that are attached to them to avoid double counting.
+
+
+
      query text

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5a844b63a1..d23f0e9dbb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -863,6 +863,7 @@ CREATE VIEW pg_stat_activity AS
             S.backend_xid,
             s.backend_xmin,
             S.query_id,
+            S.backend_mem_allocated,
             S.query,
             S.backend_type
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index e1b90c5de4..269ad2fe53 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -66,6 +66,7 @@
 #include "postmaster/postmaster.h"
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
+#include "utils/backend_status.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -232,6 +233,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 						name)));
 			return false;
 		}
+		/*
+		 * Detach and destroy pass through here, only decrease the memory
+		 * shown allocated in pgstat_activity when the creator destroys the
+		 * allocation.
+		 */
+		if (op == DSM_OP_DESTROY)
+			pgstat_report_backend_mem_allocated_decrease(*mapped_size);
 		*mapped_address = NULL;
 		*mapped_size = 0;
 		if (op == DSM_OP_DESTROY && shm_unlink(name) != 0)
@@ -332,6 +340,36 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 						name)));
 			return false;
 		}
+
+		/*
+		 * Attach and create pass through here, only update backend memory
+		 * allocated in pgstat_activity for the creator process.
+		 */
+		if (op == DSM_OP_CREATE)
+		{
+			/*
+			 * Posix creation calls dsm_impl_posix_resize implying that
+			 * resizing occurs or may be added in the future. As implemented
+			 * dsm_impl_posix_resize utilizes fallocate or truncate, passing
+			 * the whole new size as input, growing the allocation as needed
+			 * (only truncate supports shrinking). We update by replacing the
+			 * old allocation with the new.
+			 */
+#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
+			/*
+			 * posix_fallocate does not shrink allocations, adjust only on
+			 * allocation increase.
+			 */
+			if (request_size > *mapped_size)
+			{
+				pgstat_report_backend_mem_allocated_decrease(*mapped_size);
+				pgstat_report_backend_mem_allocated_increase(request_size);
+			}
+#else
+			pgstat_report_backend_mem_allocated_decrease(*mapped_size);
+			pgstat_report_backend_mem_allocated_increase(request_size);
+#endif
+		}
 		*mapped_address = address;
 		*mapped_size = request_size;
 		close(fd);
@@ -537,6 +575,14 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size,
 						name)));
 			return false;
 		}
+
+		/*
+		 * Detach and destroy pass through here, only decrease the memory
+		 * shown allocated in pgstat_activity when the creator destroys the
+		 * allocation.
+		 */
+		if (op == DSM_OP_DESTROY)
+			pgstat_report_backend_mem_allocated_decrease(*mapped_size);
 		*mapped_address = NULL;
 		*mapped_size = 0;
 		if (op == DSM_OP_DESTROY && shmctl(ident, IPC_RMID, NULL) < 0)
@@ -584,6 +630,13 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size,
 						name)));
 			return false;
 		}
+
+		/*
+		 * Attach and create pass through here, only update backend memory
+		 * allocated in pgstat_activity for the creator process.
+		 */
+		if (op == DSM_OP_CREATE)
+
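The accounting rules the patch describes - increase on create, decrease on destroy, resize as decrease-then-increase, and a floor of zero for orphaned-segment cleanup - can be modeled in a few lines. This is a toy sketch whose helper names merely echo the patch's, not the patch itself:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the patch's per-backend accounting.  The counter is
 * clamped at zero because orphaned DSM segments may be destroyed at
 * postmaster startup, producing a decrease with no prior increase.
 */
static uint64_t backend_mem_allocated = 0;

static void
mem_allocated_increase(uint64_t bytes)
{
	backend_mem_allocated += bytes;
}

static void
mem_allocated_decrease(uint64_t bytes)
{
	if (bytes >= backend_mem_allocated)
		backend_mem_allocated = 0;	/* floor at zero */
	else
		backend_mem_allocated -= bytes;
}

/* Resize as the patch does it: drop the old size, add the new. */
static void
mem_allocated_resize(uint64_t old_size, uint64_t new_size)
{
	mem_allocated_decrease(old_size);
	mem_allocated_increase(new_size);
}
```

The decrease-then-increase ordering matters with an unsigned counter: subtracting the old size first means the transient value can never exceed the true total, and the clamp absorbs any unmatched decrease.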
Re: Tracking last scan time
On Tue, 30 Aug 2022 at 19:46, Bruce Momjian wrote:
> On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote:
> > On Thu, 25 Aug 2022 at 01:44, David Rowley wrote:
> > > I don't have a particular opinion about the patch, I'm just pointing
> > > out that there are other ways. Even just writing down the numbers on a
> > > post-it note and coming back in a month to see if they've changed is
> > > enough to tell if the table or index has been used.
> >
> > There are usually other ways to perform monitoring tasks, but there is
> > something to be said for the convenience of having functionality built
> > in and not having to rely on tools, scripts, or post-it notes :-)
>
> Should we consider using something cheaper like time() so we don't need
> a GUC to enable this?

Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls
take about 2 seconds, whilst 100,000,000 time() calls take 14(!) seconds.

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com
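A micro-benchmark along these lines reproduces the comparison (a sketch, not the code Dave ran; results vary widely by platform - on Linux with a vDSO both calls are typically fast, and macOS evidently behaves differently):

```c
#include <sys/time.h>
#include <time.h>

/* Difference between two gettimeofday() readings, in seconds. */
static double
elapsed_sec(struct timeval a, struct timeval b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

/* Seconds spent on `iters` back-to-back gettimeofday() calls. */
static double
bench_gettimeofday(long iters)
{
	struct timeval start, stop, tv;

	gettimeofday(&start, NULL);
	for (long i = 0; i < iters; i++)
		gettimeofday(&tv, NULL);
	gettimeofday(&stop, NULL);
	return elapsed_sec(start, stop);
}

/* Seconds spent on `iters` back-to-back time() calls. */
static double
bench_time(long iters)
{
	struct timeval start, stop;
	volatile time_t t;			/* volatile: keep the calls in the loop */

	gettimeofday(&start, NULL);
	for (long i = 0; i < iters; i++)
		t = time(NULL);
	gettimeofday(&stop, NULL);
	(void) t;
	return elapsed_sec(start, stop);
}
```

Running both with the same iteration count makes the per-call cost comparison direct, which is the basis of the numbers quoted above.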
Re: SQL/JSON features for v15
On 2022-08-30 Tu 17:25, Nikita Glukhov wrote:
>>>> Patches 0001-0006: Yeah, these add the overhead of an extra function
>>>> call (typin() -> typin_opt_error()) in possibly very common paths.
>>>> Other than refactoring *all* places that call typin() to use the new
>>>> API, the only other option seems to be to leave the typin() functions
>>>> alone and duplicate their code in typin_opt_error() versions for all
>>>> the types that this patch cares about. Though maybe, that's not
>>>> necessarily a better compromise than accepting the extra function
>>>> call overhead.
>>> I think another possibility is to create a static inline function in
>>> the corresponding .c module (say boolin_impl() in bool.c), which is
>>> called by both the opt_error variant as well as the regular one. This
>>> would avoid the duplicate code as well as the added function-call
>>> overhead.
>> +1
> I always thought about such internal inline functions, I've added them
> in v10.

A couple of questions about these:

1. Patch 5 changes the API of DecodeDateTime() and DecodeTimeOnly() by
adding an extra parameter bool *error. Would it be better to provide
_opt_error flavors of these?

2. Patch 6 changes jsonb_from_cstring so that it's no longer static
inline. Shouldn't we have a static inline function that can be called
from inside jsonb.c and is called by the extern function?

Changing both of these things would be quite trivial and should not hold
anything up.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
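The static-inline pattern proposed in the quoted text can be sketched in miniature (simplified and self-contained, not the actual bool.c code; the abort stands in for ereport()): one inline worker does the parsing, the historical interface throws on bad input, and the _opt_error variant reports failure through a flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Shared worker, inlined into both callers, so neither the code nor the
 * function-call overhead is duplicated.  Returns false on bad input.
 */
static inline bool
boolin_impl(const char *s, bool *result)
{
	if (strcmp(s, "true") == 0 || strcmp(s, "t") == 0)
	{
		*result = true;
		return true;
	}
	if (strcmp(s, "false") == 0 || strcmp(s, "f") == 0)
	{
		*result = false;
		return true;
	}
	return false;
}

/* Historical interface: aborts on bad input (stand-in for ereport()). */
static bool
boolin(const char *s)
{
	bool		result;

	if (!boolin_impl(s, &result))
		assert(!"invalid input syntax for type boolean");
	return result;
}

/* Soft-error interface: sets *error instead of throwing. */
static bool
boolin_opt_error(const char *s, bool *error)
{
	bool		result = false;

	*error = !boolin_impl(s, &result);
	return result;
}
```

Because the worker is static inline within one module, the compiler can fold it into both entry points, which is exactly the overhead argument made above.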