Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 01, 2022 at 05:35:23PM +1200, Thomas Munro wrote: > So it *looks* like it finished early (and without the expected > error?). But it also looks like it replayed that record, according to > the page LSN. So which is it? Could you recompile with WAL_DEBUG > defined in pg_config_manual.h, and run recovery with wal_debug = on, > and see if it replays 1CAF84B0? This is with 6672d79139 un-reverted. $ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on 2>&1 |grep 1CAF84B0 || echo not found not found $ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on -c recovery_prefetch=no 2>&1 |grep 1CAF84B0 || echo not found < 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF8478; LSN 1201/1CAF84B0: prev 1201/1CAF8438; xid 0; len 2; blkref #0: rel 1663/16881/2840, blk 53 - Heap2/VACUUM: nunused 4 < 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF84B0; LSN 1201/1CAF84F0: prev 1201/1CAF8478; xid 0; len 5; blkref #0: rel 1663/16881/2840, fork 2, blk 0; blkref #1: rel 1663/16881/2840, blk 53 - Heap2/VISIBLE: cutoff xid 3678741092 flags 0x01 < 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF84F0; LSN 1201/1CAF8AC0: prev 1201/1CAF84B0; xid 0; len 2; blkref #0: rel 1663/16881/1259, blk 1 FPW, compression method: zstd - Heap/INPLACE: off 33 (Note that "compression method: zstd" is a local change to xlog_block_info() which I just extracted from my original patch for wal_compression, after forgetting to compile --with-zstd. I'll mail about that at a later time...). -- Justin
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 5:18 PM Kyotaro Horiguchi wrote: > At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby > wrote in > > On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote: > > > Hmm. Justin, when you built from source, which commit were you at? > > > If it's REL_15_BETA3, > > > > No - it's: > > commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, > > origin/REL_15_STABLE) > > It's newer than eb29fa3889 (6672d79139 on master) so it is fixed at > that commit. Yeah. > > I wasn't sure what mean by "without that" , so here's a bunch of logs to > > sift through: So it *looks* like it finished early (and without the expected error?). But it also looks like it replayed that record, according to the page LSN. So which is it? Could you recompile with WAL_DEBUG defined in pg_config_manual.h, and run recovery with wal_debug = on, and see if it replays 1CAF84B0?
Re: FOR EACH ROW triggers, on partitioned tables, with indexes?
On Thu, Sep 01, 2022 at 04:19:37PM +1200, David Rowley wrote: > On Sat, 20 Aug 2022 at 09:18, Justin Pryzby wrote: > > Is it somwhow possible to call CreateTrigger() to create a FOR EACH ROW > > trigger, with an index, and not internally ? > > I've been looking over this and I very much agree that the code looks > very broken. As for whether this is dead code or not, I've been > looking at that too... > > At trigger.c:1147 we have: if (partition_recurse). partition_recurse > can only ever be true if isInternal == false per trigger.c:367's > "partition_recurse = !isInternal && stmt->row &&". isInternal is a > parameter to the function. Also, the code in question only triggers > when the indexOid parameter is a valid oid. So it should just be a > matter of looking for usages of CreateTriggerFiringOn() which pass > isInternal as false and pass a valid indexOid. > > There seems to be no direct calls doing this, but we do also call this > function via CreateTrigger() and I can see only 1 call to > CreateTrigger() that passes isInternal as false, but that explicitly > passes indexOid as InvalidOid, so this code looks very much dead to > me. > > Alvaro, any objections to just ripping this out? aka, the attached. It's possible that extensions or 3rd party code or forks use this, no ? In that case, it might be "not dead" .. > + * that ever changes then we'll need to quite code here to find > the quite? write? quire? acquire? quine? -- Justin
Re: question about access custom enum type from C
Hi David as you suggested create type first_type as enum ('red', 'green', 'brown', 'yellow', 'blue'); SELECT oid,typname,typlen,typtype from pg_type where typname='first_type' returns everything I was looking for thanks again, I think I’m all set dm > On Sep 1, 2022, at 12:49 AM, David Rowley wrote: > > (I think this is a better question for the general mailing list) > > On Thu, 1 Sept 2022 at 16:28, Dmitry Markman wrote: >> >> Hi, when I’m trying to access values of my custom enum type I created with >> >> create type colors as enum ('red', 'green', 'brown', 'yellow', 'blue'); >> >> I’m getting oid as 16387 and I can see it stored as a chars > > You might see the names if you query the table, but all that's stored > in the table is the numerical value. > > https://www.postgresql.org/docs/current/datatype-enum.html states "An > enum value occupies four bytes on disk.". > >> is number 16387 is always OID for enum type? > > I'm not sure where you got that number from. Perhaps it's the oid for > the pg_type record? The following would show it. > > select oid,typname from pg_type where typname = 'colors'; > >> if not how I can get information about type of the result if it’s custom >> enum type > > I'm not sure what you mean by "the result". Maybe pg_typeof(column) > might be what you want? You can do: SELECT pg_typeof(myenumcol) FROM > mytable; > > David
Re: pg15b3: recovery fails with wal prefetch enabled
At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby wrote in > On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote: > > On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi > > wrote: > > > Just for information, there was a fixed bug about > > > overwrite-aborted-contrecord feature, which causes this kind of > > > failure (xlog flush request exceeds insertion bleeding edge). If it is > > > that, it has been fixed by 6672d79139 two days ago. > > > > Hmm. Justin, when you built from source, which commit were you at? > > If it's REL_15_BETA3, > > No - it's: > commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, > origin/REL_15_STABLE) It's newer than eb29fa3889 (6672d79139 on master), so it is fixed as of that commit. > > If it's REL_15_BETA3, any chance you could cherry pick that change and > > check what happens? And without that, could you show what this logs > > for good and bad recovery settings? > > I wasn't sure what mean by "without that" , so here's a bunch of logs to > sift through: There's no need for cherry-picking. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: question about access custom enum type from C
Hi David, thanks a lot for your answer I got that number from PQparamtype I already see that 16387 is not a ‘constant’, if I have few custom types I got different numbers for them thanks dm > On Sep 1, 2022, at 12:49 AM, David Rowley wrote: > > (I think this is a better question for the general mailing list) > > On Thu, 1 Sept 2022 at 16:28, Dmitry Markman wrote: >> >> Hi, when I’m trying to access values of my custom enum type I created with >> >> create type colors as enum ('red', 'green', 'brown', 'yellow', 'blue'); >> >> I’m getting oid as 16387 and I can see it stored as a chars > > You might see the names if you query the table, but all that's stored > in the table is the numerical value. > > https://www.postgresql.org/docs/current/datatype-enum.html states "An > enum value occupies four bytes on disk.". > >> is number 16387 is always OID for enum type? > > I'm not sure where you got that number from. Perhaps it's the oid for > the pg_type record? The following would show it. > > select oid,typname from pg_type where typname = 'colors'; > >> if not how I can get information about type of the result if it’s custom >> enum type > > I'm not sure what you mean by "the result". Maybe pg_typeof(column) > might be what you want? You can do: SELECT pg_typeof(myenumcol) FROM > mytable; > > David
Re: question about access custom enum type from C
(I think this is a better question for the general mailing list) On Thu, 1 Sept 2022 at 16:28, Dmitry Markman wrote: > > Hi, when I’m trying to access values of my custom enum type I created with > > create type colors as enum ('red', 'green', 'brown', 'yellow', 'blue'); > > I’m getting oid as 16387 and I can see it stored as a chars You might see the names if you query the table, but all that's stored in the table is the numerical value. https://www.postgresql.org/docs/current/datatype-enum.html states "An enum value occupies four bytes on disk.". > is number 16387 is always OID for enum type? I'm not sure where you got that number from. Perhaps it's the oid for the pg_type record? The following would show it. select oid,typname from pg_type where typname = 'colors'; > if not how I can get information about type of the result if it’s custom enum > type I'm not sure what you mean by "the result". Maybe pg_typeof(column) might be what you want? You can do: SELECT pg_typeof(myenumcol) FROM mytable; David
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote: > On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi > wrote: > > At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro > > wrote in > > > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > > > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: > > > > end-of-recovery immediate wait > > > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of > > > > generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > > > base/16881/2840_vm > > > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request > > > > 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730 > > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > > > base/16881/2840_vm > > > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > > > prefetch tried to do too much. The VM runs centos7 under qemu. > > > > I'm making a copy of the data dir in cases it's needed. > > > > Just for information, there was a fixed bug about > > overwrite-aborted-contrecord feature, which causes this kind of > > failure (xlog flush request exceeds insertion bleeding edge). If it is > > that, it has been fixed by 6672d79139 two-days ago. > > Hmm. Justin, when you built from source, which commit were you at? > If it's REL_15_BETA3, No - it's: commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE) > If it's REL_15_BETA3, any chance you could cherry pick that change and > check what happens? And without that, could you show what this logs > And without that, could you show what this logs > for good and bad recovery settings? 
I wasn't sure what you meant by "without that", so here's a bunch of logs to sift through: At a203, with #define XLOGPREFETCHER_DEBUG_LEVEL NOTICE: [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 ... < 2022-08-31 23:31:38.690 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:31:40.204 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation < 2022-08-31 23:31:40.307 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation < 2022-08-31 23:31:40.493 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small < 2022-08-31 23:31:40.721 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.23 s, elapsed: 2.03 s < 2022-08-31 23:31:41.452 CDT >LOG: checkpoint starting: end-of-recovery immediate wait < 2022-08-31 23:31:41.698 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 < 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:31:41.698 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730 < 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:31:41.699 CDT >FATAL: checkpoint request failed < 2022-08-31 23:31:41.699 CDT >HINT: Consult recent messages in the server log for details. 
< 2022-08-31 23:31:41.704 CDT >LOG: startup process (PID 25046) exited with exit code 1 < 2022-08-31 23:31:41.704 CDT >LOG: terminating any other active server processes < 2022-08-31 23:31:41.705 CDT >LOG: shutting down due to startup process failure < 2022-08-31 23:31:41.731 CDT >LOG: database system is shut down With your patch: [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 ... < 2022-08-31 23:34:22.897 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:34:23.146 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation < 2022-08-31 23:34:23.147 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation < 2022-08-31 23:34:23.268 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small < 2022-08-31 23:34:23.323 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.29 s, system: 0.12 s, elapsed: 0.42 s < 2022-08-31 23:34:23.323 CDT >LOG: point 0: lastRec = 12011cadb300 < 2022-08-31 23:34:23.323 CDT >LOG: point 0: endOfLog = 12011cadb730 < 2022-08-31 23:34:23.323 CDT >LOG: XXX point 1: EndOfLog
Re: Add tracking of backend memory allocated to pg_stat_activity
At Wed, 31 Aug 2022 12:03:06 -0400, Reid Thompson wrote in > Attached is a patch to > Add tracking of backend memory allocated to pg_stat_activity > @@ -916,6 +930,7 @@ AllocSetAlloc(MemoryContext context, Size size) > return NULL; > > context->mem_allocated += blksize; > + pgstat_report_backend_mem_allocated_increase(blksize); I'm not sure this is acceptable. The function adds a branch even when the feature is turned off, which I think may cause a certain extent of performance degradation. A past threads [1], [2] and [3] might be informative. [1] https://www.postgresql.org/message-id/1434311039.4369.39.camel%40jeff-desktop [2] https://www.postgresql.org/message-id/72a656e0f71d0860161e0b3f67e4d771%40oss.nttdata.com [3] https://www.postgresql.org/message-id/0271f440ac77f2a4180e0e56ebd944d1%40oss.nttdata.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
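The objection above is about adding a branch to AllocSetAlloc's hot path even when the feature is disabled. A common way around that, used by other pgstat counters, is to accumulate into a plain local variable unconditionally and only copy the total to shared stats at report time. A minimal standalone sketch of that pattern (illustrative names only, not the actual patch or pgstat code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: the hot path does a single unconditional add into a
 * process-local counter (no enable/disable branch), and a separate report
 * step publishes the total.  "reported_bytes_allocated" stands in for the
 * shared-memory stats slot.
 */
static uint64_t local_bytes_allocated = 0;    /* hot-path accumulator */
static uint64_t reported_bytes_allocated = 0; /* stand-in for shared stats */

static void
track_alloc(size_t size)
{
    local_bytes_allocated += size;  /* unconditional: one add, no branch */
}

static void
flush_alloc_stats(void)
{
    reported_bytes_allocated = local_bytes_allocated;
}
```

Whether this is cheap enough for AllocSetAlloc would still need benchmarking, as the cited threads discuss.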
Re: [PATCH] Add native windows on arm64 support
On Wed, Aug 31, 2022 at 10:22:06PM -0400, Tom Lane wrote: > Michael Paquier writes: > > The input object could also be reworked so as we would not have any > > ordering issues, say "predeeppost". > > Changing only the path, you could use "/e/n2" instead of "node()", so > > as we'd always get the most internal member. > > Done that way. Cool, thanks. bowerbird looks happy after its first run. An argument in favor of backpatching is if one decides to build the code with MSVC and patch the scripts to enable ASLR. However, nobody has complained about that yet, so fine by me to leave this change only on HEAD for now. -- Michael signature.asc Description: PGP signature
question about access custom enum type from C
Hi, when I’m trying to access values of my custom enum type, which I created with create type colors as enum ('red', 'green', 'brown', 'yellow', 'blue'); I’m getting the oid as 16387, and I can see it stored as chars. Is 16387 always the OID for an enum type? If not, how can I get information about the type of the result if it’s a custom enum type? thanks in advance dm
Re: Solaris "sed" versus pre-v13 plpython tests
On Wed, Aug 31, 2022 at 06:25:01PM -0400, Tom Lane wrote: > I am not completely sure why buildfarm member wrasse isn't failing > similarly wrasse disabled plpython in v12-, from day one, due to this and a crash bug that I shelved. I will be interested to see how margay reacts to your fix.
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi wrote: > At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro > wrote in > > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: > > > end-of-recovery immediate wait > > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of > > > generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > > base/16881/2840_vm > > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 > > > is not satisfied --- flushed only to 1201/1CADB730 > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > > base/16881/2840_vm > > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > > prefetch tried to do too much. The VM runs centos7 under qemu. > > > I'm making a copy of the data dir in cases it's needed. > > Just for information, there was a fixed bug about > overwrite-aborted-contrecord feature, which causes this kind of > failure (xlog flush request exceeds insertion bleeding edge). If it is > that, it has been fixed by 6672d79139 two-days ago. Hmm. Justin, when you built from source, which commit were you at? If it's REL_15_BETA3, any chance you could cherry pick that change and check what happens? And without that, could you show what this logs for good and bad recovery settings? 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 248a40e8fa..1e435e81d5 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5308,6 +5308,7 @@ StartupXLOG(void) */ endOfRecoveryInfo = FinishWalRecovery(); EndOfLog = endOfRecoveryInfo->endOfLog; +elog(LOG, "XXX point 1: EndOfLog = %lx", EndOfLog); EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI; abortedRecPtr = endOfRecoveryInfo->abortedRecPtr; missingContrecPtr = endOfRecoveryInfo->missingContrecPtr; @@ -5446,7 +5447,9 @@ StartupXLOG(void) { Assert(!XLogRecPtrIsInvalid(abortedRecPtr)); EndOfLog = missingContrecPtr; +elog(LOG, "XXX clobbering EndOfLog"); } +elog(LOG, "XXX point 2: EndOfLog = %lx", EndOfLog); /* * Prepare to write WAL starting at EndOfLog location, and init xlog @@ -5456,6 +5459,7 @@ StartupXLOG(void) Insert = >Insert; Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec); Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog); +elog(LOG, "XXX point 3: Insert->CurrBytePos = %lx", Insert->CurrBytePos); /* * Tricky point here: lastPage contains the *last* block that the LastRec @@ -5554,6 +5558,7 @@ StartupXLOG(void) Insert->fullPageWrites = lastFullPageWrites; UpdateFullPageWrites(); +elog(LOG, "XXX point 4: Insert->CurrBytePos = %lx", Insert->CurrBytePos); /* * Emit checkpoint or end-of-recovery record in XLOG, if required. */ diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 141af04388..8c6753a8a8 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1434,6 +1434,8 @@ FinishWalRecovery(void) XLogPrefetcherBeginRead(xlogprefetcher, lastRec); (void) ReadRecord(xlogprefetcher, PANIC, false, lastRecTLI); endOfLog = xlogreader->EndRecPtr; +elog(LOG, "point 0: lastRec = %lx", lastRec); +elog(LOG, "point 0: endOfLog = %lx", endOfLog); /* * Remember the TLI in the filename of the XLOG segment containing the
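The debug elogs in the patch above print LSNs as one 64-bit hex value (e.g. 12011cadb730), while the regular log lines show the conventional hi/lo form (1201/1CADB730). For cross-checking the two, here is a standalone sketch of the split, following the convention of PostgreSQL's LSN_FORMAT_ARGS macro (high 32 bits / low 32 bits):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Split a 64-bit LSN into the "hi/lo" form used in server log messages. */
static void
lsn_to_parts(XLogRecPtr lsn, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t) (lsn >> 32);
    *lo = (uint32_t) lsn;
}

/* Recombine, e.g. to compare against the raw %lx values in the elog output. */
static XLogRecPtr
parts_to_lsn(uint32_t hi, uint32_t lo)
{
    return ((XLogRecPtr) hi << 32) | lo;
}
```

So "endOfLog = 12011cadb730" in the debug output is the same position as 1201/1CADB730 in the checkpoint error.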
Re: FOR EACH ROW triggers, on partitioned tables, with indexes?
On Sat, 20 Aug 2022 at 09:18, Justin Pryzby wrote: > Is it somwhow possible to call CreateTrigger() to create a FOR EACH ROW > trigger, with an index, and not internally ? I've been looking over this and I very much agree that the code looks very broken. As for whether this is dead code or not, I've been looking at that too... At trigger.c:1147 we have: if (partition_recurse). partition_recurse can only ever be true if isInternal == false per trigger.c:367's "partition_recurse = !isInternal && stmt->row &&". isInternal is a parameter to the function. Also, the code in question only triggers when the indexOid parameter is a valid oid. So it should just be a matter of looking for usages of CreateTriggerFiringOn() which pass isInternal as false and pass a valid indexOid. There seems to be no direct calls doing this, but we do also call this function via CreateTrigger() and I can see only 1 call to CreateTrigger() that passes isInternal as false, but that explicitly passes indexOid as InvalidOid, so this code looks very much dead to me. Alvaro, any objections to just ripping this out? aka, the attached. I've left an Assert() in there to ensure we notice if we're ever to start calling CreateTriggerFiringOn() with isInternal == false with a valid indexOid. 
David diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index 7661e004a9..08f9a307fd 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -1147,8 +1147,6 @@ CreateTriggerFiringOn(CreateTrigStmt *stmt, const char *queryString, if (partition_recurse) { PartitionDesc partdesc = RelationGetPartitionDesc(rel, true); - List *idxs = NIL; - List *childTbls = NIL; int i; MemoryContext oldcxt, perChildCxt; @@ -1158,53 +1156,23 @@ CreateTriggerFiringOn(CreateTrigStmt *stmt, const char *queryString, ALLOCSET_SMALL_SIZES); /* -* When a trigger is being created associated with an index, we'll -* need to associate the trigger in each child partition with the -* corresponding index on it. +* We don't currently expect to be called with a valid indexOid. If +* that ever changes then we'll need to quite code here to find the +* corresponding child index. */ - if (OidIsValid(indexOid)) - { - ListCell *l; - List *idxs = NIL; - - idxs = find_inheritance_children(indexOid, ShareRowExclusiveLock); - foreach(l, idxs) - childTbls = lappend_oid(childTbls, - IndexGetRelation(lfirst_oid(l), - false)); - } + Assert(!OidIsValid(indexOid)); oldcxt = MemoryContextSwitchTo(perChildCxt); /* Iterate to create the trigger on each existing partition */ for (i = 0; i < partdesc->nparts; i++) { - Oid indexOnChild = InvalidOid; - ListCell *l; - ListCell *l2; CreateTrigStmt *childStmt; RelationchildTbl; Node *qual; childTbl = table_open(partdesc->oids[i], ShareRowExclusiveLock); - /* Find which of the child indexes is the one on this partition */ - if (OidIsValid(indexOid)) - { - forboth(l, idxs, l2, childTbls) - { - if (lfirst_oid(l2) == partdesc->oids[i]) - { - indexOnChild = lfirst_oid(l); - break; - } - } - if (!OidIsValid(indexOnChild)) - elog(ERROR, "failed to find index matching index \"%s\" in partition \"%s\"", -get_rel_name(indexOid), - get_rel_name(partdesc->oids[i])); - } - /* * Initialize our fabricated parse node by copying the 
original * one, then resetting fields that we pass separately.
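The code being removed above paired two lists (child indexes and their owning tables, kept in step) and linearly searched them for the entry matching each partition. As a standalone distillation of that matching step (hypothetical names and array form, not the trigger.c code):

```c
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/*
 * Hypothetical sketch of the dead code's lookup: idxs[i] is a child index
 * whose table is childTbls[i] (the lists are kept in lockstep, like the
 * forboth() loop did).  Return the index on the given partition, or
 * InvalidOid if none matches -- the case the removed elog(ERROR) guarded.
 */
static Oid
find_index_on_partition(const Oid *idxs, const Oid *childTbls,
                        size_t n, Oid partitionOid)
{
    for (size_t i = 0; i < n; i++)
        if (childTbls[i] == partitionOid)
            return idxs[i];
    return InvalidOid;
}
```

Since no caller ever passes a valid indexOid with isInternal == false, this lookup was never reachable, which is what justifies replacing it with the Assert.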
Re: Reducing the chunk header sizes on all memory context types
David Rowley writes: > On Thu, 1 Sept 2022 at 16:06, Tom Lane wrote: >> It doesn't seem to be fixing any live bug in the back branches, >> but by the same token it's harmless. > I considered that an extension might use the Slab allocator with a > non-MAXALIGNED chunksize and might run into some troubles during > SlabCheck(). Oh, yeah, the sentinel_ok() change is a live bug. Extensions have no control over sizeof(SlabBlock) though. regards, tom lane
Re: Reducing the chunk header sizes on all memory context types
On Thu, 1 Sept 2022 at 16:06, Tom Lane wrote: > > David Rowley writes: > > Does anyone have any objections to d5ee4db0e in its entirety being > > backpatched? > > It doesn't seem to be fixing any live bug in the back branches, > but by the same token it's harmless. I considered that an extension might use the Slab allocator with a non-MAXALIGNED chunksize and might run into some troubles during SlabCheck(). David
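For context on "non-MAXALIGNED chunksize": MAXALIGN rounds a length up to the platform's maximum alignment. A sketch of the macro pair, assuming MAXIMUM_ALIGNOF == 8 (typical on 64-bit platforms; the real value is configure-detected):

```c
#include <stddef.h>

/*
 * Sketch of PostgreSQL's TYPEALIGN/MAXALIGN, hardcoding an 8-byte maximum
 * alignment for illustration.  ALIGNVAL must be a power of two.
 */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((size_t) (LEN) + ((ALIGNVAL) - 1)) & ~((size_t) ((ALIGNVAL) - 1)))
#define MAXALIGN(LEN) TYPEALIGN(8, LEN)
```

An extension passing, say, chunkSize = 12 to the Slab allocator would leave chunks at non-MAXALIGN'd offsets, which is the situation the sentinel_ok() fix in d5ee4db0e is concerned with.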
Re: Reducing the chunk header sizes on all memory context types
David Rowley writes: > Does anyone have any objections to d5ee4db0e in its entirety being > backpatched? It doesn't seem to be fixing any live bug in the back branches, but by the same token it's harmless. regards, tom lane
Re: Handle infinite recursion in logical replication setup
On Wed, Aug 31, 2022 at 11:35 AM Peter Smith wrote: > > Here are my review comments for v43-0001. > > == > > 1. Commit message > > The initial copy phase has no way to know the origin of the row data, > so if 'copy_data = true' in the step 4 below, log a warning to notify user > that potentially non-local data might have been copied. > > 1a. > "in the step 4" -> "in step 4" > > ~ > > 1b. > "notify user" -> "notify the user" > > == > > 2. doc/src/sgml/ref/create_subscription.sgml > > + > + If the subscription is created with origin = none and > + copy_data = true, it will check if the publisher has > + subscribed to the same table from other publishers and, if so, log a > + warning to notify the user to check the publisher tables. Before > continuing > + with other operations the user should check that publisher tables did not > + have data with different origins to prevent data inconsistency issues on > the > + subscriber. > + > > I am not sure about that wording saying "to prevent data inconsistency > issues" because I thought maybe is already too late because any > unwanted origin data is already copied during the initial sync. I > think the user can do it try to clean up any problems caused before > things become even worse (??) > > ~~~ > > You asked for my thoughts (see [1] 6b): > > > There is nothing much a user can do in this case. Only option would be > > to take a backup before the operation and restore it and then recreate > > the replication setup. I was not sure if we should document these > > steps as similar inconsistent data could get created without the > > origin option . Thoughts? > > I think those cases are a bit different: > > - For a normal scenario (i.e. origin=ANY) then inconsistent data just > refers to problems like maybe PK violations so I think you just get > what you get and have to deal with the errors... 
> > - But when the user said origin=NONE they are specifically saying they > do NOT want any non-local data, yet they still might end up getting > data contrary to their wishes. So I think maybe there ought to be > something documented to warn about this potentially happening. My > first impression is that all this seems like a kind of terrible > problem because if it happens then cleanup could be difficult - if > that subscriber in turn was also a publisher then this bad data might > have propagated all over the place already! Anyway, I guess all this > could be mentioned in some (new ?) section of the > logical-replication.sgml file (i.e. patch 0002). > > IDEA: I was wondering if the documentation should recommend (or maybe > we can even enforce this) that when using origin=NONE the user should > always create the subscription with enabled=FALSE. That way if there > are some warnings about potential bad origin data then there it might > be possible to take some action or change their mind *before* any > unwanted data pollutes everything. > > == > > 3. src/backend/commands/subscriptioncmds.c - check_pub_table_subscribed > > + if (count == 0) > + tablenames = makeStringInfo(); > + else > + appendStringInfoString(tablenames, ", "); > + > + appendStringInfo(tablenames, "%s.%s", nspname, relname); > + count++; > > I think the quotes are not correct - each separate table name should be > quoted. > > ~~~ > > 4. > > + /* > + * There might be many tables present in this case, we will display a > + * maximum of 100 tables followed by "..." to indicate there might be > + * more tables. > + */ > + if (count == 100) > + { > + appendStringInfoString(tablenames, ", ..."); > + break; > + } > > 4a. > IMO this code should be part of the previous if/else > > e.g. > if (count == 0) ... makeStringInfo() > else if (count > 100) ... append ellipsis "..." and break out of the loop > else ... append table name to the message > > Doing it in this way means "..." 
definitely means there are more than > 100 bad tables. > > ~ > > 4b. > Changing like above (#4a) simplifies the comment because it removes > doubts like "might be…" since you already know you found the 101st > table. > > SUGGESTION (comment) > The message displays a maximum of 100 tables having potentially > non-local data. Any more than that will be indicated by "...". > > ~ > Are we using this style of an error message anywhere else in the code? If not, then I think we should avoid it here as well as it appears a bit adhoc to me in the sense that why restrict at 100 tables. Can we instead think of displaying the publications list in the message? So errdetail would be something like: Subscribed publications \"%s\" has been subscribed from some other publisher. Then we can probably give a SQL query for a user to find tables that may data from multiple origins especially if that is not complicated. Also, then we can change the function name to check_publications_origin(). Few other minor comments on v43-0001* 1. + ereport(WARNING, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("subscription %s
Re: Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)
On Wed, Aug 31, 2022 at 7:49 PM Jeff Janes wrote: > I like this addition, but I would also like to see how many pages got newly > set to all frozen by the vacuum. I'd say that that's independent work. Though I'm happy to discuss it now. It would be fairly straightforward to show something about the VM itself, but it's not entirely clear what aspects of the VM should be emphasized. Are we reporting on the state of the table, or work performed by VACUUM? You said you were interested in the latter already, but why prefer that over a summary of the contents of the VM at the end of the VACUUM? Are you concerned about the cost of setting pages all-visible? Do you have an interest in how VACUUM manages to set VM pages over time? Something else? We already call visibilitymap_count() at the end of every VACUUM, which scans the authoritative VM to produce a more-or-less consistent summary of the VM at that point in time. This information is then used to update pg_class.relallvisible (we don't do anything with the all-frozen number at all). Why not show that information in VERBOSE/autovacuum's log message? Does it really matter *when* a page became all-visible/all-frozen/unset? > Also, isn't all of vacuuming about XID processing? I think "frozen:" would > be a more suitable line prefix. That also works for me. I have no strong feelings here. -- Peter Geoghegan
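On the visibilitymap_count() point: the visibility map keeps two bits per heap page, all-visible and all-frozen, and the count is just a tally over those bits. A standalone sketch of that tally (bit values follow PostgreSQL's VISIBILITYMAP_ALL_VISIBLE / VISIBILITYMAP_ALL_FROZEN flags; the four-pages-per-byte packing here is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define VM_ALL_VISIBLE 0x01
#define VM_ALL_FROZEN  0x02

/*
 * Sketch of what visibilitymap_count() computes: scan a packed bitmap with
 * two bits per heap page and count pages marked all-visible / all-frozen.
 */
static void
vm_count(const uint8_t *map, size_t npages,
         size_t *all_visible, size_t *all_frozen)
{
    *all_visible = 0;
    *all_frozen = 0;
    for (size_t blk = 0; blk < npages; blk++)
    {
        uint8_t bits = (map[blk / 4] >> ((blk % 4) * 2)) & 0x03;

        if (bits & VM_ALL_VISIBLE)
            (*all_visible)++;
        if (bits & VM_ALL_FROZEN)
            (*all_frozen)++;
    }
}
```

A consistent snapshot of these two totals is what already feeds pg_class.relallvisible at the end of VACUUM, so surfacing both in VERBOSE output would not require new bookkeeping.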
Re: Reducing the chunk header sizes on all memory context types
On Tue, 30 Aug 2022 at 13:16, David Rowley wrote: > I'm also wondering if this should also be backpatched back to v10, > providing the build farm likes it well enough on master. Does anyone have any objections to d5ee4db0e in its entirety being backpatched? David
Re: pg15b3: recovery fails with wal prefetch enabled
At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro wrote in > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery > > immediate wait > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of > > generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > base/16881/2840_vm > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is > > not satisfied --- flushed only to 1201/1CADB730 > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > base/16881/2840_vm > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > prefetch tried to do too much. The VM runs centos7 under qemu. > > I'm making a copy of the data dir in cases it's needed. Just for information, there was a fixed bug about overwrite-aborted-contrecord feature, which causes this kind of failure (xlog flush request exceeds insertion bleeding edge). If it is that, it has been fixed by 6672d79139 two-days ago. http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=moue2vnhhml6b3t7w7qqvva...@mail.gmail.com http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Instrumented pages/tuples frozen in autovacuum's server log out (and VACUUM VERBOSE)
On Sat, Aug 20, 2022 at 7:29 PM Peter Geoghegan wrote: > XIDs processed: 45 pages from table (100.00% of total) had 1 tuples > frozen > I like this addition, but I would also like to see how many pages got newly set to all frozen by the vacuum. Would that be a convenient thing to also report here? Also, isn't all of vacuuming about XID processing? I think "frozen:" would be a more suitable line prefix. Cheers, Jeff
Re: Add the ability to limit the amount of memory that can be allocated to backends.
At Wed, 31 Aug 2022 12:50:19 -0400, Reid Thompson wrote in > Hi Hackers, > > Add the ability to limit the amount of memory that can be allocated to > backends. The patch seems to cap both memory-context allocations and DSM allocations made by a specific process under the same budget. In the first place, I don't think it's sensible to cap the amount of DSM allocations by a per-process budget. DSM is used by the pgstats subsystem. There can be cases where pgstat complains about denial of a DSM allocation after the budget has been exhausted by memory-context allocations, or where every command complains about denial of a memory-context allocation once the per-process budget has been exhausted by DSM allocations. That doesn't seem reasonable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg15b3: recovery fails with wal prefetch enabled
Some more details, in case they're important: First: the server has wal_compression=zstd (I wonder if something doesn't allow/accommodate compressed FPI?) I thought to mention that after compiling pg15 locally and forgetting to use --with-zstd. I compiled it to enable your debug logging, which wrote these during recovery: < 2022-08-31 21:17:01.807 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation < 2022-08-31 21:17:01.903 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation < 2022-08-31 21:17:02.029 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small Also, pg_waldump seems to fail early with -w: [pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -w -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00011201001C rmgr: Heap2 len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54 pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0 Also, the VM has crashed with OOM before, while running pg15, with no issue in recovery. I haven't been able to track down the cause. The VM is running: kernel-3.10.0-1160.66.1.el7.x86_64 pgsql is an ext4 FS (no tablespaces), which is a qemu block device exposed like: It's nowhere near full: /dev/vdc 96G 51G 46G 53% /var/lib/pgsql
Re: num_sa_scans in genericcostestimate
On Sun, Aug 21, 2022 at 2:45 PM Jeff Janes wrote: > ... > > The context for this is that I was looking at cases where btree indexes > were not using all the columns they could, but rather shoving some of the > conditions down into a Filter unnecessarily/unhelpfully. This change > doesn't fix that, but it does seem to be moving in the right direction. > Added to commitfest. > This does cause a regression test failure due to an (apparently?) > uninteresting plan change. > Looking more at the regression test plan change, it points up an interesting question which is only tangentially related to this patch. With patch applied: [local] 417536 regression=# explain analyze SELECT thousand, tenthous FROM tenk1 WHERE thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand; QUERY PLAN --- Sort (cost=4.55..4.56 rows=1 width=8) (actual time=0.100..0.101 rows=2 loops=1) Sort Key: thousand Sort Method: quicksort Memory: 25kB -> Index Only Scan using tenk1_thous_tenthous on tenk1 (cost=0.29..4.50 rows=1 width=8) (actual time=0.044..0.048 rows=2 loops=1) Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[]))) Heap Fetches: 0 Planning Time: 1.040 ms Execution Time: 0.149 ms (8 rows) [local] 417536 regression=# set enable_sort TO off ; [local] 417536 regression=# explain analyze SELECT thousand, tenthous FROM tenk1 WHERE thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand; QUERY PLAN - Index Only Scan using tenk1_thous_tenthous on tenk1 (cost=0.29..4.71 rows=1 width=8) (actual time=0.021..0.024 rows=2 loops=1) Index Cond: (thousand < 2) Filter: (tenthous = ANY ('{1001,3000}'::integer[])) Rows Removed by Filter: 18 Heap Fetches: 0 Planning Time: 0.156 ms Execution Time: 0.039 ms (7 rows) Why does having the =ANY in the "Index Cond:" rather than the "Filter:" inhibit it from understanding that the rows will still be delivered in order by "thousand"? Cheers, Jeff >
RE: Handle infinite recursion in logical replication setup
On Wed, Aug 31, 2022 1:06 AM vignesh C wrote: > > The attached v43 patch has the changes for the same. > Thanks for updating the patch. Here is a comment on the 0001 patch.

+		if (!isnewtable)
+		{
+			pfree(nspname);
+			pfree(relname);
+			continue;
+		}

In the case where it is a new table (the case where a warning is logged), should we also call pfree()? Regards, Shi yu
Re: [PATCH] Add native windows on arm64 support
Michael Paquier writes: > The input object could also be reworked so as we would not have any > ordering issues, say "predeeppost". > Changing only the path, you could use "/e/n2" instead of "node()", so > as we'd always get the most internal member. Done that way. regards, tom lane
Re: SELECT documentation
On Mon, Aug 15, 2022 at 10:53:18PM -0400, Bruce Momjian wrote: > On Sat, Aug 13, 2022 at 10:21:26PM -0400, Tom Lane wrote: > > Bruce Momjian writes: > > > Hi, I agree we should show the more modern JOIN syntax. However, this is > > > just an example, so one example should be sufficient. I went with the > > > first one in the attached patch. > > > > You should not remove the CROSS JOIN mention at l. 604, first because > > the references to it just below would become odd, and second because > > then it's not explained anywhere on the page. Perhaps you could > > put back a definition of CROSS JOIN just below the entry for NATURAL, > > but you'll still have to do something with the references at l. 614, > > 628, 632. > > Good point. I restructured the docs to move CROSS JOIN to a separate > section like NATURAL and adjusted the text, patch attached. Patch applied back to PG 11. PG 10 was different enough and old enough that I skipped it. This is a big improvement. Thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 12:53 PM Justin Pryzby wrote: > Yes, I have a copy that reproduces the issue: That's good news. So the last record touching that page was: > rmgr: Heap2 len (rec/tot): 59/59, tx: 0, lsn: > 1201/1CAF84B0, prev 1201/1CAF8478, desc: VISIBLE cutoff xid 3678741092 flags > 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel > 1663/16881/2840 blk 53 I think the expected LSN for that page is past the end of that record, so 0x1CAF84B0 + 59 = 0x1caf84eb which rounds up to 0x1CAF84F0, and indeed we see that in the restored page when recovery succeeds. Next question: why do we think the WAL finishes at 1201/1CADB730 while running that checkpoint? Looking...
Re: Add tracking of backend memory allocated to pg_stat_activity
At Wed, 31 Aug 2022 12:05:55 -0500, Justin Pryzby wrote in > On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: > > Hi Hackers, > > > > Attached is a patch to > > Add tracking of backend memory allocated to pg_stat_activity > > > + proargmodes => > > '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', > > In the past, there was concern about making pg_stat_activity wider by > adding information that's less-essential than what's been there for > years. This is only an int64, so it's not "wide", but I wonder if > there's another way to expose this information? Like adding backends to The view already looks too wide to me. I don't want numbers for metrics like this added to the view. > pg_get_backend_memory_contexts() , maybe with another view on top of > that ? +1 > +* shown allocated in pgstat_activity when the creator > destroys the > > > pg_stat > > > +* Posix creation calls dsm_impl_posix_resize implying that > > resizing > > +* occurs or may be added in the future. As implemented > > +* dsm_impl_posix_resize utilizes fallocate or truncate, > > passing the > > +* whole new size as input, growing the allocation as needed * > > (only > > +* truncate supports shrinking). We update by replacing the * > > old > > wrapping caused extraneous stars > > > +* Do not allow backend_mem_allocated to go below zero. ereport if we > > +* would have. There's no need for a lock around the read here asit's > > as it's > > > + ereport(LOG, (errmsg("decrease reduces reported backend memory > > allocated below zero; setting reported to 0"))); > > errmsg() hasn't required the outer parentheses since a couple of years > ago. +1 > > + /* > > +* Until header allocation is included in context->mem_allocated cast to > > +* slab and decrement the headerSize > > add a comma before "cast" ? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Avoid overhead open-close indexes (catalog updates)
At Wed, 31 Aug 2022 08:16:55 -0300, Ranier Vilela wrote in > Hi, > > The commit > https://github.com/postgres/postgres/commit/b17ff07aa3eb142d2cde2ea00e4a4e8f63686f96 > introduced the CopyStatistics function. > > To do the work, CopyStatistics uses a less efficient function > to update/insert tuples in the system catalogs. > > The comment at indexing.c says: > "Avoid using it for multiple tuples, since opening the indexes > * and building the index info structures is moderately expensive. > * (Use CatalogTupleInsertWithInfo in such cases.)" > > So inspired by the comment, I changed, in a few places, > CatalogInsert/CatalogUpdate to the more efficient functions > CatalogInsertWithInfo/CatalogUpdateWithInfo. > > Quick tests show a small performance gain. Considering that the whole operation usually takes far longer, I'm not sure whether that amount of performance gain is meaningful, but I like the change as a matter of tidiness and as an example for later code. > There are other places where this could be useful, > but a careful analysis is necessary. What kind of concern do you have in mind? By the way, there is another similar function, CatalogTupleMultiInsertWithInfo(), which would be more time-efficient (but not space-efficient) and is used in InsertPgAttributeTuples. I don't see a clear criterion for choosing between the two, though. I think the overhead of opening the catalog indexes is significant when no other time-consuming tasks are involved in the whole operation. In that sense, in terms of performance, storeOperators and storeProcedures (called under DefineOpClass) might benefit more from this, disregarding the rareness of the command being used. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Reducing the chunk header sizes on all memory context types
Tomas Vondra writes: > You're probably right we'll notice the clobber cases due to corruption > of the next chunk header. The annoying thing is having a corrupted > header only tells you there's a corruption somewhere, but it may be hard > to know which part of the code caused it. Same's true of a sentinel, though. > OTOH we have platforms where valgrind is either not supported or no one > runs tests with (e.g. on rpi4 it'd take insane amounts of code). According to https://valgrind.org/info/platforms.html valgrind supports a pretty respectable set of platforms. It might be too slow to be useful on ancient hardware, of course. I've had some success in identifying clobber perpetrators by putting a hardware watchpoint on the clobbered word, which IIRC does work on recent ARM hardware. It's tedious and far more manual than valgrind, but it's possible. regards, tom lane
Re: Support tls-exporter as channel binding for TLSv1.3
On Wed, Aug 31, 2022 at 04:16:29PM -0700, Jacob Champion wrote: > For protocols less than 1.3 we'll need to ensure that the extended > master secret is in use: > >This channel binding mechanism is defined only when the TLS handshake >results in unique master secrets. This is true of TLS versions prior >to 1.3 when the extended master secret extension of [RFC7627] is in >use, and it is always true for TLS 1.3 (see Appendix D of [RFC8446]). > > OpenSSL should have an API for that (SSL_get_extms_support); I don't > know when it was introduced. This is only available from 1.1.0, meaning that we'd better disable tls-exporter when building with something older than that :( With 1.0.2 already not supported by upstream, even if a bunch of vendors keep it around for compatibility, I guess that's fine as long as the default setting is tls-server-end-point. It would not be complex to switch to tls-exporter by default when using TLSv1.3 and tls-server-end-point for TLS <= v1.2, though, but that makes the code more complicated, and OpenSSL does not complain about tls-exporter when using < 1.3. If we switch the default on the fly, we could drop channel_binding_type and control which one gets used depending on ssl_max/min_server_protocol. I don't like that much, TBH, as this creates more dependencies across our internal code with the initial choice of the connection parameters, making it more complex to maintain in the long term. At least that's my impression. > If we want to cross all our T's, we should also disallow tls-exporter > if the server was unable to set SSL_OP_NO_RENEGOTIATION. Hmm. Okay. I had not considered that. But TLSv1.3 has no support for renegotiation, does it? And you mean to fail hard when using TLS <= v1.2 with tls-exporter on the backend's SSL_CTX_set_options() call? We cannot do that as the backend's SSL context is initialized before authentication, but we could re-check the state of the SSL options afterwards, during authentication, and force a failure.
> Did you have any thoughts about contributing the Python tests (or > porting them to Perl, bleh) so that we could test failure modes as > well? Unfortunately those Python tests were also OpenSSL-based, so > they're less powerful than an independent implementation... No. I have not been able to look at that with the time I had, unfortunately. -- Michael signature.asc Description: PGP signature
Re: Solaris "sed" versus pre-v13 plpython tests
I wrote: > I confirmed on the gcc farm's Solaris 11 box that the pattern > doesn't work as expected with /usr/bin/sed: > tgl@gcc-solaris11:~$ echo except foo, bar: | sed -e 's/except > \([[:alpha:]][[:alpha:].]*\), *\([[:alpha:]][[:alpha:]]*\):/except \1 as > \2:/g' > except foo, bar: > We could perhaps do this instead: > $ echo except foo, bar: | sed -e '/^ *except.*,.*: *$/s/, / as /g' > except foo as bar: > It's a little bit more brittle, but Python doesn't allow any other > commands on the same line does it? Oh ... after a bit more experimentation, there's an easier way. Apparently the real problem is that Solaris' sed doesn't handle [[:alpha:]] (I wonder if this is locale-dependent?). I get correct results after expanding it manually, eg tgl@gcc-solaris11:~$ echo except foo, bar: | sed -e 's/except \([a-z][a-z.]*\), *\([a-z][a-zA-Z]*\):/except \1 as \2:/g' except foo as bar: We aren't likely to need anything beyond a-zA-Z and maybe 0-9, so I'll go fix it that way. regards, tom lane
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 01, 2022 at 12:05:36PM +1200, Thomas Munro wrote: > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery > > immediate wait > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of > > generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > base/16881/2840_vm > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is > > not satisfied --- flushed only to 1201/1CADB730 > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > > base/16881/2840_vm > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > prefetch tried to do too much. The VM runs centos7 under qemu. > > I'm making a copy of the data dir in cases it's needed. > > Hmm, a page with an LSN set 118208 bytes past the end of WAL. It's a > vm fork page (which recovery prefetch should ignore completely). Did > you happen to get a copy before the successful recovery? After the > successful recovery, what LSN does that page have, and can you find > the references to it in the WAL with eg pg_waldump -R 1663/16681/2840 > -F vm? Have you turned FPW off (perhaps this is on ZFS?)? 
Yes, I have a copy that reproduces the issue: #1 0x009a0df4 in errfinish (filename=, filename@entry=0xa15535 "xlog.c", lineno=lineno@entry=2671, funcname=funcname@entry=0xa1da27 <__func__.22763> "XLogFlush") at elog.c:588 #2 0x0055f1cf in XLogFlush (record=19795985532144) at xlog.c:2668 #3 0x00813b24 in FlushBuffer (buf=0x7fffdf1f8900, reln=) at bufmgr.c:2889 #4 0x00817a5b in SyncOneBuffer (buf_id=buf_id@entry=7796, skip_recently_used=skip_recently_used@entry=false, wb_context=wb_context@entry=0x7fffcdf0) at bufmgr.c:2576 #5 0x008181c2 in BufferSync (flags=flags@entry=358) at bufmgr.c:2164 #6 0x008182f5 in CheckPointBuffers (flags=flags@entry=358) at bufmgr.c:2743 #7 0x005587b2 in CheckPointGuts (checkPointRedo=19795985413936, flags=flags@entry=358) at xlog.c:6855 #8 0x0055feb3 in CreateCheckPoint (flags=flags@entry=358) at xlog.c:6534 #9 0x007aceaa in CheckpointerMain () at checkpointer.c:455 #10 0x007aac52 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at auxprocess.c:153 #11 0x007b0bd8 in StartChildProcess (type=) at postmaster.c:5430 #12 0x007b388f in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0xf139e0) at postmaster.c:1463 #13 0x004986a6 in main (argc=7, argv=0xf139e0) at main.c:202 It's not on zfs, and FPW have never been turned off. I should add that this instance has been pg_upgraded since v10. BTW, base/16881 is the postgres DB )which has 43GB of logfiles imported from CSV, plus 2GB of snapshots of pg_control_checkpoint, pg_settings, pg_stat_bgwriter, pg_stat_database, pg_stat_wal). 
postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'main', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+------------
 1201/1CDD1F98 |    -6200 |     1 |    44 |   424 |    8192 |     8192 |       4 | 3681043287
(1 row)

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'vm', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1201/1CAF84F0 |    -6010 |     0 |    24 |  8192 |    8192 |     8192 |       4 |         0
(1 row)

I found this in waldump (note that you had a typo - it's 16881).

[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00011201001C
rmgr: Heap2 len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
rmgr: Heap2 len (rec/tot): 59/59, tx: 0, lsn: 1201/1CAF3AF8, prev 1201/1CAF2788, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 0
rmgr: Heap2 len (rec/tot): 59/59, tx: 0, lsn: 1201/1CAF3B70, prev 1201/1CAF3B38, desc: VISIBLE cutoff xid 3671427998 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 2
rmgr: Heap2 len (rec/tot): 59/59, tx: 0, lsn: 1201/1CAF4DC8, prev 1201/1CAF3BB0, desc: VISIBLE cutoff xid 3672889900 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 4
rmgr: Heap2 len
Re: POC: Better infrastructure for automated testing of concurrency issues
Hi! On Tue, Feb 23, 2021 at 3:09 AM Peter Geoghegan wrote: > On Tue, Dec 8, 2020 at 2:42 AM Alexander Korotkov > wrote: > > Thank you for your feedback! > > It would be nice to use this patch to test things that are important > but untested inside vacuumlazy.c, such as the rare > HEAPTUPLE_DEAD/tupgone case (grep for "Ordinarily, DEAD tuples would > have been removed by..."). Same is true of the closely related > heap_prepare_freeze_tuple()/heap_tuple_needs_freeze() code. I'll continue work on this patch. The rebased patch is attached. It implements stop events as configure option (not runtime GUC option). -- Regards, Alexander Korotkov 0001-Stopevents-v3.patch Description: Binary data
Re: Reducing the chunk header sizes on all memory context types
On 9/1/22 02:23, Tom Lane wrote: > Tomas Vondra writes: >> Focusing on the aset, vast majority of allocations (60M out of 64M) is >> small enough to use power-of-2 logic, and we go from 6.3GB to 8.2GB, so >> ~30%. Not great, not terrible. > > Not sure why this escaped me before, but I remembered another argument > for not forcibly adding space for a sentinel: if you don't have room, > that means the chunk end is up against the header for the next chunk, > which means that any buffer overrun will clobber that header. So we'll > detect the problem anyway if we validate the headers to a reasonable > extent. > > The hole in this argument is that the very last chunk allocated in a > block might have no following chunk to validate. But we could probably > special-case that somehow, maybe by laying down a sentinel in the free > space, where it will get overwritten by the next chunk when that does > get allocated. > > 30% memory bloat seems like a high price to pay if it's adding negligible > detection ability, which it seems is true if this argument is valid. > Is there reason to think we can't validate headers enough to catch > clobbers? > I'm not quite convinced the 30% figure is correct - it might be if you ignore cases exceeding allocChunkLimit, but that also makes it pretty bogus (because large allocations represent ~2x as much space). You're probably right we'll notice the clobber cases due to corruption of the next chunk header. The annoying thing is having a corrupted header only tells you there's a corruption somewhere, but it may be hard to know which part of the code caused it. I was hoping the sentinel would make it easier, because we mark it as NOACCESS for valgrind. But now I see we mark the first part of a MemoryChunk too, so maybe that's enough. OTOH we have platforms where valgrind is either not supported or no one runs tests with (e.g. on rpi4 it'd take insane amounts of code). 
In that case the sentinel might be helpful, especially considering alignment on those platforms can cause funny memory issues, as evidenced by this thread. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Reducing the chunk header sizes on all memory context types
David Rowley writes: > Maybe we should just consider always making room for a sentinel for > chunks that are on dedicated blocks. At most that's an extra 8 bytes > in some allocation that's either over 1024 or 8192 (depending on > maxBlockSize). Agreed, if we're not doing that already then we should. regards, tom lane
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
On Wed, Aug 31, 2022 at 01:09:22PM -0700, Andres Freund wrote: > On 2022-08-31 16:05:03 -0400, Tom Lane wrote: >> Even if there's only a visible leak in v15/HEAD, I'm inclined >> to back-patch this all the way, because it's certainly buggy >> on its own terms. It's just the merest accident that neither >> caller is leaking other stuff into TopMemoryContext, because >> they both think they are using a short-lived context. > > +1 Ouch. Thanks for the quick fix and the backpatch. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Add native windows on arm64 support
On Wed, Aug 31, 2022 at 01:33:53PM -0400, Tom Lane wrote: > I did not check the other XMLTABLE infrastructure, so what probably > is best to do is keep the two output columns but change 'node()' > to something with a more stable result; any preferences? The input object could also be reworked so as we would not have any ordering issues, say "predeeppost". Changing only the path, you could use "/e/n2" instead of "node()", so as we'd always get the most internal member. -- Michael signature.asc Description: PGP signature
Re: Reducing the chunk header sizes on all memory context types
On Thu, 1 Sept 2022 at 12:23, Tom Lane wrote: > Is there reason to think we can't validate headers enough to catch > clobbers? For non-sentinel chunks, the next byte after the end of the chunk will be storing the block offset for the following chunk. I think:

if (block != MemoryChunkGetBlock(chunk))
	elog(WARNING, "problem in alloc set %s: bad block offset for chunk %p in block %p",
		 name, chunk, block);

should catch those. Maybe we should just consider always making room for a sentinel for chunks that are on dedicated blocks. At most that's an extra 8 bytes in some allocation that's either over 1024 or 8192 (depending on maxBlockSize). David
Re: Reducing the chunk header sizes on all memory context types
Tomas Vondra writes: > Focusing on the aset, vast majority of allocations (60M out of 64M) is > small enough to use power-of-2 logic, and we go from 6.3GB to 8.2GB, so > ~30%. Not great, not terrible. Not sure why this escaped me before, but I remembered another argument for not forcibly adding space for a sentinel: if you don't have room, that means the chunk end is up against the header for the next chunk, which means that any buffer overrun will clobber that header. So we'll detect the problem anyway if we validate the headers to a reasonable extent. The hole in this argument is that the very last chunk allocated in a block might have no following chunk to validate. But we could probably special-case that somehow, maybe by laying down a sentinel in the free space, where it will get overwritten by the next chunk when that does get allocated. 30% memory bloat seems like a high price to pay if it's adding negligible detection ability, which it seems is true if this argument is valid. Is there reason to think we can't validate headers enough to catch clobbers? regards, tom lane
Re: Handle infinite recursion in logical replication setup
On Wed, Aug 31, 2022 at 8:36 PM Amit Kapila wrote: > > On Wed, Aug 31, 2022 at 11:35 AM Peter Smith wrote: > > > > Here are my review comments for v43-0001. > > ... > > == > > > > 2. doc/src/sgml/ref/create_subscription.sgml > > > > + > > + If the subscription is created with origin = none and > > + copy_data = true, it will check if the publisher has > > + subscribed to the same table from other publishers and, if so, log a > > + warning to notify the user to check the publisher tables. Before > > continuing > > + with other operations the user should check that publisher tables did > > not > > + have data with different origins to prevent data inconsistency issues > > on the > > + subscriber. > > + > > > > I am not sure about that wording saying "to prevent data inconsistency > > issues" because I thought maybe it is already too late because any > > unwanted origin data is already copied during the initial sync. I > > think all the user can do is try to clean up any problems caused before > > things become even worse (??) > > > > Oh no, it is not too late in all cases. The problem can only occur if > the user will try to subscribe from all the different publications > with copy_data = true. We are anyway trying to clarify in the second > patch the right way to accomplish this. So, I am not sure what better > we can do here. The only bulletproof way is to provide some APIs where > users don't need to bother about all these cases, basically something > similar to what Kuroda-San is working on in the thread [1]. > The point of my review comment was only about the wording of the note - specifically, you cannot "prevent" something (e.g., data inconsistency) if it has already happened. Maybe modify the wording like below? BEFORE Before continuing with other operations the user should check that publisher tables did not have data with different origins to prevent data inconsistency issues on the subscriber.
AFTER If a publisher table with data from different origins was found to have been copied to the subscriber, then some corrective action may be necessary before continuing with other operations. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: PostgreSQL 15 release announcement draft
On Tue, Aug 30, 2022 at 03:58:48PM -0400, Jonathan S. Katz wrote: > In this latest release, PostgreSQL improves on its in-memory and on-disk > sorting > algorithms, with benchmarks showing speedups of 25% - 400% based on sort > types. rather than "based on": "depending on the data types being sorted" > Building on work from the previous PostgreSQL release for allowing async > remote > queries, the PostgreSQL foreign data wrapper, `postgres_fdw`, can now commit > transactions in parallel. asynchronous > benefits for certain workloads. On certain operating systems, PostgreSQL 15 s/certain/some ? > supports the ability to prefetch WAL file contents and speed up recovery > times. > PostgreSQL's built-in backup command, `pg_basebackup`, now supports > server-side > compression of backup files with a choice of gzip, LZ4, and zstd. remove "server-side", since they're also supported on the client-side. > PostgreSQL 15 lets user create views that query data using the permissions of users > the caller, not the view creator. This option, called `security_invoker`, adds > an additional layer of protection to ensure view callers have the correct > permissions for working with the underlying data. ensure *that ? > alter server-level configuration parameters. Additionally, users can now > search > for information about configuration using the `\dconfig` command from the > `psql` > command-line tool. rather than "search for information about configuration", say "list configuration information" ? > PostgreSQL server-level statistics are now collected in shared memory, > eliminating the statistics collector process and writing these stats to disk. and *the need to periodically* write these stats to disk -- Justin
Re: Reducing the chunk header sizes on all memory context types
On 8/31/22 23:46, David Rowley wrote: > On Thu, 1 Sept 2022 at 08:53, Tomas Vondra > wrote: >> So the raw size (what we asked for) is ~23.5GB, but in practice we >> allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra >> 1B we end up allocating 31.5GB. That doesn't seem like a huge increase, >> and it's far from the +60% you got. >> >> I wonder where does the difference come - I did make installcheck too, >> so how come you get 10/16GB, and I get 28/31GB? My patch is attached, >> maybe I did something silly. > > The reason my reported results were lower is because I ignored size > > allocChunkLimit allocations. These are not raised to the next power of > 2, so I didn't think they should be included. If I differentiate the large chunks allocated separately (v2 patch attached), I get this:

        f        |    t     |  count   |  s1   |  s2   |  s3
-----------------+----------+----------+-------+-------+-------
 AllocSetAlloc   | normal   | 60714914 |  4982 |  6288 |  8185
 AllocSetAlloc   | separate |   824390 | 18245 | 18245 | 18251
 AllocSetRelloc  | normal   |   182070 |   763 |   826 |  1423
 GenerationAlloc | normal   |  2118115 |    68 |    90 |   102
 GenerationAlloc | separate |       28 |     0 |     0 |     0
(5 rows)

Where s1 is the sum of requested sizes, s2 is the sum of allocated chunks, and s3 is chunks allocated with a 1B sentinel. Focusing on the aset, the vast majority of allocations (60M out of 64M) are small enough to use the power-of-2 logic, and we go from 6.3GB to 8.2GB, so ~30%. Not great, not terrible. For the large allocations, there's almost no increase - in the last query I used the power-of-2 logic in the calculations, but that was incorrect, of course. > > I'm not sure why you're seeing only a 3GB additional overhead. I > noticed a logic error in my query where I was checking > maxaligned_size=pow2_size and doubling that to give sentinel space.
> That really should have been "case size=pow2_size then pow2_size * 2 > else pow2_size end", However, after adjusting the query, it does not > seem to change the results much: > > postgres=# select > postgres-# round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size, > postgres-# round(sum(case when size=pow2_size then pow2_size*2 else > pow2_size end)::numeric/1024/1024/1024,3) as method1, > postgres-# round(sum(case when size=pow2_size then pow2_size+8 else > pow2_size end)::numeric/1024/1024/1024,3) as method2 > postgres-# from memstats > postgres-# where pow2_size > 0; > pow2_size | method1 | method2 > ---+-+- > 10.269 | 16.322 | 10.476 > > I've attached the crude patch I came up with for this. For some > reason it was crashing on Linux, but it ran ok on Windows, so I used > the results from that instead. Maybe that accounts for some > differences as e.g sizeof(long) == 4 on 64-bit windows. I'd be > surprised if that accounted for so many GBs though. > I tried to use that patch, but "make installcheck" never completes for me, for some reason. It seems to get stuck in infinite_recurse.sql, but I haven't looked into the details. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Companydiff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index b6eeb8abab..b215940062 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -674,6 +674,15 @@ AllocSetDelete(MemoryContext context) free(set); } +static Size pow2_size(Size size) +{ + Size s = 1; + while (s < size) + s *= 2; + + return s; +} + /* * AllocSetAlloc * Returns pointer to allocated memory of given size or NULL if @@ -703,9 +712,12 @@ AllocSetAlloc(MemoryContext context, Size size) * If requested size exceeds maximum for chunks, allocate an entire block * for this request. 
*/ - if (size > set->allocChunkLimit) + if (size + 1 > set->allocChunkLimit) { - chunk_size = MAXALIGN(size); + fprintf(stderr, "AllocSetAlloc separate %ld %ld %ld\n", size, MAXALIGN(size), MAXALIGN(size+1)); + fflush(stderr); + + chunk_size = MAXALIGN(size+1); blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; block = (AllocBlock) malloc(blksize); if (block == NULL) @@ -726,6 +738,11 @@ AllocSetAlloc(MemoryContext context, Size size) /* set mark to catch clobber of "unused" space */ if (size < chunk_size) set_sentinel(MemoryChunkGetPointer(chunk), size); + else + { + fprintf(stderr, "sentinel not added\n"); + fflush(stderr); + } #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -761,13 +778,16 @@ AllocSetAlloc(MemoryContext context, Size size) return MemoryChunkGetPointer(chunk); } + fprintf(stderr, "AllocSetAlloc normal %ld %ld %ld\n", size, pow2_size(size), pow2_size(size+1)); + fflush(stderr); + /* * Request is small enough to be treated as a chunk. Look in the * corresponding
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote:
> < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
> < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed
>
> I was able to start it with -c recovery_prefetch=no, so it seems like
> prefetch tried to do too much. The VM runs centos7 under qemu.
> I'm making a copy of the data dir in case it's needed.

Hmm, a page with an LSN set 118208 bytes past the end of WAL. It's a vm
fork page (which recovery prefetch should ignore completely). Did you
happen to get a copy before the successful recovery? After the
successful recovery, what LSN does that page have, and can you find the
references to it in the WAL with eg pg_waldump -R 1663/16881/2840 -F vm?
Have you turned FPW off (perhaps this is on ZFS?)?
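The 118208-byte figure can be checked directly from the two LSNs in the log. A quick sketch of the arithmetic (an LSN is a 64-bit byte position printed as two slash-separated 32-bit hex halves):

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL 'X/X' LSN string into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Failed flush request vs. the actual end of generated WAL:
gap = parse_lsn("1201/1CAF84F0") - parse_lsn("1201/1CADB730")
assert gap == 118208  # matches the figure quoted in the reply
```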
Re: Inconsistent error message for varchar(n)
On Tue, Aug 16, 2022 at 09:56:17PM -0400, Bruce Momjian wrote:
> On Sun, Nov 14, 2021 at 10:33:19AM +0800, Japin Li wrote:
> > Oh! I didn't consider this situation. Since the max size of varchar cannot
> > exceed 10485760, however, I cannot find this in documentation [1]. Is there
> > something I missed? Should we mention this in the documentation?
> >
> > [1] https://www.postgresql.org/docs/devel/datatype-character.html
>
> Sorry for my long delay in reviewing this issue. You are correct this
> should be documented --- patch attached.

Patch applied back to PG 10. Thanks for the report.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Transparent column encryption
On Tue, Aug 30, 2022 at 4:53 AM Peter Eisentraut wrote: > I would be interested in learning more about such padding systems. I > have done a lot of reading for this development project, and I have > never come across a cryptographic approach to hide length differences by > padding. Of course, padding to the block cipher's block size is already > part of the process, but that is done out of necessity, not because you > want to disguise the length. Are there any other methods? I'm > interested to learn more. TLS 1.3 has one example. Here is a description from GnuTLS: https://gnutls.org/manual/html_node/On-Record-Padding.html (Note the option to turn on constant-time padding; that may not be a good tradeoff for us if we're focusing on offline attacks.) Here's a recent paper that claims to formally characterize length hiding, but it's behind a wall and I haven't read it: https://dl.acm.org/doi/abs/10.1145/3460120.3484590 I'll try to find more when I get the chance. --Jacob
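For illustration of the length-hiding idea raised above, here is a toy bucket-padding scheme. It is not taken from TLS 1.3 or any cited standard; the bucket size and the 0x80 terminator byte are arbitrary choices for the sketch. Padding every plaintext up to a fixed bucket means ciphertext length reveals only the bucket, not the exact length:

```python
def pad_to_bucket(plaintext: bytes, bucket: int = 256) -> bytes:
    """Append 0x80 then zeros up to the next multiple of `bucket` bytes."""
    padded_len = ((len(plaintext) + 1 + bucket - 1) // bucket) * bucket
    return plaintext + b"\x80" + b"\x00" * (padded_len - len(plaintext) - 1)

def unpad(padded: bytes) -> bytes:
    """Strip the padding by locating the last 0x80 marker."""
    return padded[: padded.rindex(b"\x80")]

# Two plaintexts of very different lengths produce the same padded length:
assert len(pad_to_bucket(b"short")) == len(pad_to_bucket(b"x" * 200)) == 256
assert unpad(pad_to_bucket(b"hello")) == b"hello"
```

The obvious tradeoff, as with the GnuTLS constant-padding option mentioned above, is wasted space versus how much length information leaks.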
Re: Doc patch
On Fri, Aug 19, 2022 at 12:04:54PM -0400, Bruce Momjian wrote:
> You are right about the above two changes. The existing syntax shows
> FROM/IN is only possible if a direction is specified, while
> src/parser/gram.y says that FROM/IN with no direction is supported.
>
> I plan to apply this second part of the patch soon.

Patch applied back to PG 10. Thanks for the research on this.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Small cleanups to tuplesort.c and a bonus small performance improvement
On Wed, 31 Aug 2022 at 22:39, David Rowley wrote: > My current thoughts are that this is a very trivial patch and unless > there's any objections I plan to push it soon. Pushed. David
Re: Support tls-exporter as channel binding for TLSv1.3
On Sun, Aug 28, 2022 at 11:02 PM Michael Paquier wrote: > RFC9266, that has been released not so long ago, has added > tls-exporter as a new channel binding type: > https://www.rfc-editor.org/rfc/rfc5929.html Hi Michael, thank you for sending this! > Note also that tls-exporter is aimed for > TLSv1.3 and newer protocols, but OpenSSL allows the thing to work with > older protocols (testable with ssl_max_protocol_version, for example), > and I don't see a need to prevent this scenario. For protocols less than 1.3 we'll need to ensure that the extended master secret is in use: This channel binding mechanism is defined only when the TLS handshake results in unique master secrets. This is true of TLS versions prior to 1.3 when the extended master secret extension of [RFC7627] is in use, and it is always true for TLS 1.3 (see Appendix D of [RFC8446]). OpenSSL should have an API for that (SSL_get_extms_support); I don't know when it was introduced. If we want to cross all our T's, we should also disallow tls-exporter if the server was unable to set SSL_OP_NO_RENEGOTIATION. > An extra thing is > that attempting to use tls-exporter with a backend <= 15 and a client > >= 16 causes a failure during the SASL exchange, where the backend > complains about tls-exporter being unsupported. Yep. -- Did you have any thoughts about contributing the Python tests (or porting them to Perl, bleh) so that we could test failure modes as well? Unfortunately those Python tests were also OpenSSL-based, so they're less powerful than an independent implementation... Thanks, --Jacob
Re: cataloguing NOT NULL constraints
On Wed, Aug 31, 2022 at 4:08 PM Zhihong Yu wrote: > > > On Wed, Aug 31, 2022 at 3:19 PM Alvaro Herrera > wrote: > >> So I was wrong in thinking that "this case was simple to implement" as I >> replied upthread. Doing that actually required me to rewrite large >> parts of the patch. I think it ended up being a good thing, because in >> hindsight the approach I was using was somewhat bogus anyway, and the >> current one should be better. Please find it attached. >> >> There are still a few problems, sadly. Most notably, I ran out of time >> trying to fix a pg_upgrade issue with pg_dump in binary-upgrade mode. >> I have to review that again, but I think it'll need a deeper rethink of >> how we pg_upgrade inherited constraints. So the pg_upgrade tests are >> known to fail. I'm not aware of any other tests failing, but I'm sure >> the cfbot will prove me wrong. >> >> I reluctantly added a new ALTER TABLE subcommand type, AT_SetAttNotNull, >> to allow setting pg_attribute.attnotnull without adding a CHECK >> constraint (only used internally). I would like to find a better way to >> go about this, so I may remove it again, therefore it's not fully >> implemented. >> >> There are *many* changed regress expect files and I didn't carefully vet >> all of them. Mostly it's the addition of CHECK constraints in the >> footers of many \d listings and stuff like that. At a quick glance they >> appear valid, but I need to review them more carefully still. >> >> We've had pg_constraint.conparentid for a while now, but for some >> constraints we continue to use conislocal/coninhcount. I think we >> should get rid of that and rely on conparentid completely. >> >> An easily fixed issue is that of constraint naming. >> ChooseConstraintName has an argument for passing known constraint names, >> but this patch doesn't use it and it must. 
>> >> One issue that I don't currently know how to fix, is the fact that we >> need to know whether a column is a row type or not (because they need a >> different null test). At table creation time that's easy to know, >> because we have the descriptor already built by the time we add the >> constraints; but if you do ALTER TABLE .. ADD COLUMN .., ADD CONSTRAINT >> then we don't. >> >> Some ancient code comments suggest that allowing a child table's NOT >> NULL constraint acquired from parent shouldn't be independently >> droppable. This patch doesn't change that, but it's easy to do if we >> decide to. However, that'd be a compatibility break, so I'd rather not >> do it in the same patch that introduces the feature. >> >> Overall, there's a lot more work required to get this to a good shape. >> That said, I think it's the right direction. >> >> -- >> Álvaro Herrera 48°01'N 7°57'E — >> https://www.EnterpriseDB.com/ >> "La primera ley de las demostraciones en vivo es: no trate de usar el >> sistema. >> Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) >> > > Hi, > For findNotNullConstraintAttnum(): > > + if (multiple == NULL) > + break; > > Shouldn't `pfree(arr)` be called before breaking ? > > +static Constraint *makeNNCheckConstraint(Oid nspid, char *constraint_name, > > You used `NN` because there is method makeCheckNotNullConstraint, right ? > I think it would be better to expand `NN` so that its meaning is easy to > understand. > > Cheers > Hi, For tryExtractNotNullFromNode , in the block for `if (rel == NULL)`: + return false; I think you meant returning NULL since false is for boolean. Cheers
Re: cataloguing NOT NULL constraints
On Wed, Aug 31, 2022 at 3:19 PM Alvaro Herrera wrote: > So I was wrong in thinking that "this case was simple to implement" as I > replied upthread. Doing that actually required me to rewrite large > parts of the patch. I think it ended up being a good thing, because in > hindsight the approach I was using was somewhat bogus anyway, and the > current one should be better. Please find it attached. > > There are still a few problems, sadly. Most notably, I ran out of time > trying to fix a pg_upgrade issue with pg_dump in binary-upgrade mode. > I have to review that again, but I think it'll need a deeper rethink of > how we pg_upgrade inherited constraints. So the pg_upgrade tests are > known to fail. I'm not aware of any other tests failing, but I'm sure > the cfbot will prove me wrong. > > I reluctantly added a new ALTER TABLE subcommand type, AT_SetAttNotNull, > to allow setting pg_attribute.attnotnull without adding a CHECK > constraint (only used internally). I would like to find a better way to > go about this, so I may remove it again, therefore it's not fully > implemented. > > There are *many* changed regress expect files and I didn't carefully vet > all of them. Mostly it's the addition of CHECK constraints in the > footers of many \d listings and stuff like that. At a quick glance they > appear valid, but I need to review them more carefully still. > > We've had pg_constraint.conparentid for a while now, but for some > constraints we continue to use conislocal/coninhcount. I think we > should get rid of that and rely on conparentid completely. > > An easily fixed issue is that of constraint naming. > ChooseConstraintName has an argument for passing known constraint names, > but this patch doesn't use it and it must. > > One issue that I don't currently know how to fix, is the fact that we > need to know whether a column is a row type or not (because they need a > different null test). 
At table creation time that's easy to know, > because we have the descriptor already built by the time we add the > constraints; but if you do ALTER TABLE .. ADD COLUMN .., ADD CONSTRAINT > then we don't. > > Some ancient code comments suggest that allowing a child table's NOT > NULL constraint acquired from parent shouldn't be independently > droppable. This patch doesn't change that, but it's easy to do if we > decide to. However, that'd be a compatibility break, so I'd rather not > do it in the same patch that introduces the feature. > > Overall, there's a lot more work required to get this to a good shape. > That said, I think it's the right direction. > > -- > Álvaro Herrera 48°01'N 7°57'E — > https://www.EnterpriseDB.com/ > "La primera ley de las demostraciones en vivo es: no trate de usar el > sistema. > Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) > Hi, For findNotNullConstraintAttnum(): + if (multiple == NULL) + break; Shouldn't `pfree(arr)` be called before breaking ? +static Constraint *makeNNCheckConstraint(Oid nspid, char *constraint_name, You used `NN` because there is method makeCheckNotNullConstraint, right ? I think it would be better to expand `NN` so that its meaning is easy to understand. Cheers
Solaris "sed" versus pre-v13 plpython tests
I noticed that buildfarm member margay, which just recently started running tests on REL_12_STABLE, is failing the plpython tests in that branch [1], though it's happy in v13 and later. The failures appear due to syntax errors in Python "except" statements, and it's visible in some of the tests that we are failing to convert ancient Python "except" syntax to what Python 3 wants: diff -U3 /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/expected/python3/plpython_types_3.out /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/results/python3/plpython_types.out --- /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/expected/python3/plpython_types_3.out 2022-08-31 22:29:51.070597370 +0200 +++ /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/results/python3/plpython_types.out 2022-08-31 22:29:53.053544840 +0200 @@ -400,15 +400,16 @@ import marshal try: return marshal.loads(x) -except ValueError as e: +except ValueError, e: return 'FAILED: ' + str(e) $$ LANGUAGE plpython3u; +ERROR: could not compile PL/Python function "test_type_unmarshal" +DETAIL: SyntaxError: invalid syntax (, line 6) So it would seem that regress-python3-mangle.mk's sed command to perform this transformation isn't working on margay's sed. We've had to hack regress-python3-mangle.mk for Solaris "sed" before (cf commit c3556f6fa). This seems to be another instance of that same crummy implementation of '*' patterns. I suppose Noah missed this angle at the time because the problematic pattern had already been removed in v13 and up (45223fd9c). But it's still there in v12. I am not completely sure why buildfarm member wrasse isn't failing similarly, but a likely theory is that Noah has got some other sed in his search path there. 
I confirmed on the gcc farm's Solaris 11 box that the pattern doesn't work as expected with /usr/bin/sed: tgl@gcc-solaris11:~$ echo except foo, bar: | sed -e 's/except \([[:alpha:]][[:alpha:].]*\), *\([[:alpha:]][[:alpha:]]*\):/except \1 as \2:/g' except foo, bar: We could perhaps do this instead: $ echo except foo, bar: | sed -e '/^ *except.*,.*: *$/s/, / as /g' except foo as bar: It's a little bit more brittle, but Python doesn't allow any other commands on the same line does it? Another idea is to try to find some other sed to use. I see that configure already does that on most platforms, because ax_pthread.m4 has "AC_REQUIRE([AC_PROG_SED])". But if there's still someone out there using --disable-thread-safety, they might think this is an annoying movement of the build-requirements goalposts. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay=2022-08-31%2020%3A00%3A05
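The proposed single-line rule can be emulated to sanity-check its behavior. A rough Python model of the substitution (it shares the brittleness noted above; e.g. an `except (A, B), e:` line would have both commas rewritten, exactly as the sed /g flag would):

```python
import re

def mangle(line: str) -> str:
    """Mimic the sed rule: /^ *except.*,.*: *$/ s/, / as /g"""
    if re.match(r"^ *except.*,.*: *$", line):
        return line.replace(", ", " as ")
    return line

assert mangle("except ValueError, e:") == "except ValueError as e:"
assert mangle("except foo, bar:") == "except foo as bar:"
# Lines that are not bare 'except' statements are left alone:
assert mangle("d = {1: 'a', 2: 'b'}") == "d = {1: 'a', 2: 'b'}"
```

Since Python does not allow other statements on the same line as the `except ...:` clause header in ordinary code, anchoring on the whole line is reasonably safe for the regression-test sources in question.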
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan wrote: > Attached patch series is a completely overhauled version of earlier > work on freezing. Related work from the Postgres 15 cycle became > commits 0b018fab, f3c15cbe, and 44fa8488. Attached is v2. This is just to keep CFTester happy, since v1 now has conflicts when applied against HEAD. There are no notable changes in this v2 compared to v1. -- Peter Geoghegan v2-0001-Add-page-level-freezing-to-VACUUM.patch Description: Binary data v2-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch Description: Binary data v2-0003-Add-eager-freezing-strategy-to-VACUUM.patch Description: Binary data v2-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch Description: Binary data
Re: Reducing the chunk header sizes on all memory context types
On Thu, 1 Sept 2022 at 08:53, Tomas Vondra wrote:
> So the raw size (what we asked for) is ~23.5GB, but in practice we
> allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra
> 1B we end up allocating 31.5GB. That doesn't seem like a huge increase,
> and it's far from the +60% you got.
>
> I wonder where does the difference come - I did make installcheck too,
> so how come you get 10/16GB, and I get 28/31GB? My patch is attached,
> maybe I did something silly.

The reason my reported results were lower is because I ignored size >
allocChunkLimit allocations. These are not raised to the next power of
2, so I didn't think they should be included.

I'm not sure why you're seeing only a 3GB additional overhead. I
noticed a logic error in my query where I was checking
maxaligned_size=pow2_size and doubling that to give sentinel space.
That really should have been "case size=pow2_size then pow2_size * 2
else pow2_size end". However, after adjusting the query, it does not
seem to change the results much:

postgres=# select
postgres-# round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size,
postgres-# round(sum(case when size=pow2_size then pow2_size*2 else
pow2_size end)::numeric/1024/1024/1024,3) as method1,
postgres-# round(sum(case when size=pow2_size then pow2_size+8 else
pow2_size end)::numeric/1024/1024/1024,3) as method2
postgres-# from memstats
postgres-# where pow2_size > 0;
 pow2_size | method1 | method2
-----------+---------+---------
    10.269 |  16.322 |  10.476

I've attached the crude patch I came up with for this. For some
reason it was crashing on Linux, but it ran ok on Windows, so I used
the results from that instead. Maybe that accounts for some
differences as e.g sizeof(long) == 4 on 64-bit windows. I'd be
surprised if that accounted for so many GBs though.
I also forgot to add code to GenerationRealloc and AllocSetRealloc David diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index b6eeb8abab..f4977f9bcc 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -46,6 +46,7 @@ #include "postgres.h" +#include "miscadmin.h" #include "port/pg_bitutils.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -696,9 +697,22 @@ AllocSetAlloc(MemoryContext context, Size size) int fidx; Sizechunk_size; Sizeblksize; + static int rlevel = 1; AssertArg(AllocSetIsValid(set)); + if (rlevel == 1 && GetProcessingMode() == NormalProcessing) + { + rlevel++; + elog(LOG, "AllocSetAlloc,%s,%zu,%zu,%zu,%zu", +context->name, +size, +MAXALIGN(size), +size > set->allocChunkLimit ? 0 : GetChunkSizeFromFreeListIdx(AllocSetFreeIndex(size)), +set->allocChunkLimit); + rlevel--; + } + /* * If requested size exceeds maximum for chunks, allocate an entire block * for this request. diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c index b39894ec94..9c5ff3c095 100644 --- a/src/backend/utils/mmgr/generation.c +++ b/src/backend/utils/mmgr/generation.c @@ -36,6 +36,7 @@ #include "postgres.h" #include "lib/ilist.h" +#include "miscadmin.h" #include "port/pg_bitutils.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -344,6 +345,18 @@ GenerationAlloc(MemoryContext context, Size size) MemoryChunk *chunk; Sizechunk_size = MAXALIGN(size); Sizerequired_size = chunk_size + Generation_CHUNKHDRSZ; + static int rlevel = 1; + + if (rlevel == 1 && GetProcessingMode() == NormalProcessing) + { + rlevel++; + elog(LOG, "GenerationAlloc,%s,%zu,%zu,0,%zu", +context->name, +size, +MAXALIGN(size), +set->allocChunkLimit); + rlevel--; + } /* is it an over-sized chunk? if yes, allocate special block */ if (chunk_size > set->allocChunkLimit)
Re: Trivial doc patch
On Fri, Aug 19, 2022 at 10:42:45AM -0400, Bruce Momjian wrote: > > Given that 'optional, optional' has no independent meaning from > > 'optional'; it requires one to scan the entire set looking for > > the non-optional embedded in the option. So no gain. > > I originally had the same reaction Michael Paquier did, that having one > big optional block is nice, but seeing that 'CREATE DATABASE name WITH' > actually works, I can see the point in having our syntax be accurate, > and removing the outer optional brackets now does seem like an > improvement to me. Backpatched to PG 10. Thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Reducing the chunk header sizes on all memory context types
On 8/31/22 00:40, David Rowley wrote:
> On Wed, 31 Aug 2022 at 02:17, Tom Lane wrote:
>>
>> I wrote:
>>> So maybe we should revisit the question. It'd be worth collecting some
>>> stats about how much extra space would be needed if we force there
>>> to be room for a sentinel.
>>
>> Actually, after ingesting more caffeine, the problem with this for aset.c
>> is that the only way to add space for a sentinel that didn't fit already
>> is to double the space allocation. That's a little daunting, especially
>> remembering how many places deliberately allocate power-of-2-sized
>> arrays.
>
> I decided to try and quantify that by logging the size, MAXALIGN(size)
> and the power of 2 size during AllocSetAlloc and GenerationAlloc. I
> made the pow2_size 0 in GenerationAlloc and in AllocSetAlloc when
> size > allocChunkLimit.
>
> After running make installcheck, grabbing the records out the log and
> loading them into Postgres, I see that if we did double the pow2_size
> when there's no space for the sentinel byte then we'd go from
> allocating a total of 10.2GB all the way to 16.4GB (!) of
> non-dedicated block aset.c allocations.
>
> select
> round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size,
> round(sum(case when maxalign_size=pow2_size then pow2_size*2 else
> pow2_size end)::numeric/1024/1024/1024,3) as method1,
> round(sum(case when maxalign_size=pow2_size then pow2_size+8 else
> pow2_size end)::numeric/1024/1024/1024,3) as method2
> from memstats
> where pow2_size > 0;
>
>  pow2_size | method1 | method2
> -----------+---------+---------
>     10.194 |  16.382 |  10.463
>
> if we did just add on an extra 8 bytes (or MAXALIGN(size+1) at
> least), then that would take the size up to 10.5GB.

I've been experimenting with this a bit too, and my results are similar,
but not exactly the same.
I've logged all Alloc/Realloc calls for the two memory contexts, and
when I aggregated the results I get this:

        f        | size  | pow2(size) | pow2(size+1)
-----------------+-------+------------+--------------
 AllocSetAlloc   | 23528 |      28778 |        31504
 AllocSetRelloc  |   761 |        824 |         1421
 GenerationAlloc |    68 |         90 |          102

So the raw size (what we asked for) is ~23.5GB, but in practice we
allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra
1B we end up allocating 31.5GB. That doesn't seem like a huge increase,
and it's far from the +60% you got.

I wonder where does the difference come - I did make installcheck too,
so how come you get 10/16GB, and I get 28/31GB? My patch is attached,
maybe I did something silly.

I also did a quick hack to see if always having the sentinel detects
any pre-existing issues, but that didn't happen. I guess valgrind would
find those, but not sure?

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index b6eeb8abab..e6d86a54d7 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -674,6 +674,15 @@ AllocSetDelete(MemoryContext context)
 	free(set);
 }
 
+static Size pow2_size(Size size)
+{
+	Size s = 1;
+	while (s < size)
+		s *= 2;
+
+	return s;
+}
+
 /*
  * AllocSetAlloc
  *		Returns pointer to allocated memory of given size or NULL if
@@ -699,13 +708,16 @@ AllocSetAlloc(MemoryContext context, Size size)
 
 	AssertArg(AllocSetIsValid(set));
 
+	fprintf(stderr, "AllocSetAlloc %ld %ld %ld\n", size, pow2_size(size), pow2_size(size+1));
+	fflush(stderr);
+
 	/*
 	 * If requested size exceeds maximum for chunks, allocate an entire block
 	 * for this request.
*/ - if (size > set->allocChunkLimit) + if (size + 1 > set->allocChunkLimit) { - chunk_size = MAXALIGN(size); + chunk_size = MAXALIGN(size + 1); blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; block = (AllocBlock) malloc(blksize); if (block == NULL) @@ -725,7 +737,16 @@ AllocSetAlloc(MemoryContext context, Size size) chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ if (size < chunk_size) + { set_sentinel(MemoryChunkGetPointer(chunk), size); + fprintf(stderr, "sentinel added\n"); + fflush(stderr); + } + else + { + fprintf(stderr, "A sentinel not added\n"); + fflush(stderr); + } #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -767,7 +788,7 @@ AllocSetAlloc(MemoryContext context, Size size) * If one is found, remove it from the free list, make it again a member * of the alloc set and return its data address. */ - fidx = AllocSetFreeIndex(size); + fidx = AllocSetFreeIndex(size+1); chunk = set->freelist[fidx]; if (chunk != NULL) { @@ -784,7 +805,16 @@ AllocSetAlloc(MemoryContext context, Size size) chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ if (size < GetChunkSizeFromFreeListIdx(fidx)) + {
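The `AllocSetFreeIndex(size+1)` change in the patch above can be modeled with a simplified freelist calculation. The real code uses bit tricks from pg_bitutils.h; this linear-scan version is only illustrative:

```python
ALLOC_MINBITS = 3  # smallest aset chunk is 8 bytes

def freelist_index(size: int) -> int:
    """Freelist bucket whose chunk size is the next power of two >= size
    (a simplified model of AllocSetFreeIndex)."""
    idx = 0
    while (1 << (idx + ALLOC_MINBITS)) < size:
        idx += 1
    return idx

def chunk_size(idx: int) -> int:
    return 1 << (idx + ALLOC_MINBITS)

# Asking the freelist for size+1, as the patch does, guarantees at least
# one spare byte for the sentinel even when size is itself a power of two:
assert chunk_size(freelist_index(64)) == 64       # no room for a sentinel
assert chunk_size(freelist_index(64 + 1)) == 128  # sentinel always fits
```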
Re: SQL/JSON features for v15
On 31.08.2022 20:14, Tom Lane wrote: Robert Haas writes: On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: The currently proposed patchset hacks up a relatively small number of core datatypes to be able to do that. But it's just a hack and there's no prospect of extension types being able to join in the fun. I think where we need to start, for v16, is making an API design that will let any datatype have this functionality. (I don't say that we'd convert every datatype to do so right away; in the long run we should, but I'm content to start with just the same core types touched here.) Beside the JSON stuff, there is another even more pressing application for such behavior, namely the often-requested COPY functionality to be able to shunt bad data off somewhere without losing the entire transfer. In the COPY case I think we'd want to be able to capture the error message that would have been issued, which means the current patches are not at all appropriate as a basis for that API design: they're just returning a bool without any details. I would be in favor of making more of an effort than just a few token data types. The initial patch could just touch a few, but once the infrastructure is in place we should really make a sweep through the tree and tidy up. Sure, but my point is that we can do that in a time-extended fashion rather than having a flag day where everything must be updated. The initial patch just needs to update a few types as proof of concept. And here is a quick POC patch with an example for COPY and float4: =# CREATE TABLE test (i int, f float4); CREATE TABLE =# COPY test (f) FROM stdin WITH (null_on_error (f)); 1 err 2 \. COPY 3 =# SELECT f FROM test; f --- 1 2 (3 rows) =# COPY test (i) FROM stdin WITH (null_on_error (i)); ERROR: input function for datatype "integer" does not support error handling PG_RETURN_ERROR() is a reincarnation of ereport_safe() macro for returning ErrorData, which was present in older versions (~v18) of SQL/JSON patches. 
Later it was replaced with `bool *have_error` and less magical `if (have_error) ... else ereport(...)`. Obviously, this needs a separate thread. -- Nikita Glukhov Postgres Professional:http://www.postgrespro.com The Russian Postgres Company From 537e65e1ebcccdf4a760d3aa743c88611e7c Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Wed, 31 Aug 2022 22:24:33 +0300 Subject: [PATCH 1/5] Introduce PG_RETURN_ERROR and FuncCallError node --- src/include/nodes/execnodes.h | 7 +++ src/include/utils/elog.h | 25 + 2 files changed, 32 insertions(+) diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 01b1727fc09..b401d354656 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -2729,4 +2729,11 @@ typedef struct LimitState TupleTableSlot *last_slot; /* slot for evaluation of ties */ } LimitState; +typedef struct FuncCallError +{ + NodeTag type; + ErrorData *error; /* place where function should put + * error data instead of throwing it */ +} FuncCallError; + #endif /* EXECNODES_H */ diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 56398176901..5fd3deed61f 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -159,6 +159,31 @@ #define ereport(elevel, ...) \ ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__) +/* + * PG_RETURN_ERROR() -- special macro for copying error info into the + * FuncCallError node, if it was as FunctionCallInfo.context, + * instead of throwing it with ordinary ereport(). + * + * This is intended for handling of errors of categories like + * ERRCODE_DATA_EXCEPTION without PG_TRY/PG_CATCH and substransactions, + * but not for errors like ERRCODE_OUT_OF_MEMORY. + */ +#define PG_RETURN_ERROR(...) 
\ + do { \ + if (fcinfo->context && IsA(fcinfo->context, FuncCallError)) { \ + pg_prevent_errno_in_scope(); \ + \ + errstart(ERROR, TEXTDOMAIN); \ + (__VA_ARGS__); \ + castNode(FuncCallError, fcinfo->context)->error = CopyErrorData(); \ + FlushErrorState(); \ + \ + PG_RETURN_NULL(); \ + } else { \ + ereport(ERROR, (__VA_ARGS__)); \ + } \ + } while (0) + #define TEXTDOMAIN NULL extern bool message_level_is_interesting(int elevel); -- 2.25.1 From 7831b2c571784a614dd4ee4b48730187b0c0a8d1 Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Wed, 31 Aug 2022 23:02:06 +0300 Subject: [PATCH 2/5] Introduce pg_proc.proissafe and func_is_safe() --- src/backend/utils/cache/lsyscache.c | 19 +++ src/include/catalog/pg_proc.h | 3 +++ src/include/utils/lsyscache.h | 1 + 3 files changed, 23 insertions(+) diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c index a16a63f4957..ff3005077b2 100644 --- a/src/backend/utils/cache/lsyscache.c +++ b/src/backend/utils/cache/lsyscache.c @@ -1830,6 +1830,25 @@ get_func_leakproof(Oid funcid) return result; } +/* + * func_is_safe +
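The COPY `null_on_error` behavior shown in the example above can be thought of as a soft-error input function. This Python sketch is only an analogy for what PG_RETURN_ERROR enables; the function names are invented for illustration:

```python
def input_float(text: str):
    """Soft-error input function: return (value, None) on success or
    (None, error message) instead of raising -- the idea behind
    returning ErrorData rather than throwing it."""
    try:
        return float(text), None
    except ValueError as e:
        return None, str(e)

def copy_null_on_error(rows):
    """COPY-style loader that turns per-field input errors into NULLs."""
    out = []
    for text in rows:
        value, err = input_float(text)
        out.append(None if err else value)
    return out

# Mirrors the example: '1', 'err', '2' loads three rows, one NULL.
assert copy_null_on_error(["1", "err", "2"]) == [1.0, None, 2.0]
```

The key design point, as in the patch, is that only input functions declared safe participate; everything else keeps the ordinary throwing behavior.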
Re: SQL/JSON features for v15
On 8/31/22 3:08 PM, Andrew Dunstan wrote: On 2022-08-31 We 14:45, Tom Lane wrote: To the extent that there was a management failure here, it was that we didn't press for a resolution sooner. Given the scale of the concerns raised in June, I kind of agree with Andres' opinion that fixing them post-freeze was doomed to failure. It was definitely doomed once we reached August with no real work done towards it. I'm not going to comment publicly in general about this, you might imagine what my reaction is. The decision is the RMT's to make and I have no quarrel with that. But I do want it understood that there was work being done right from the time in June when Andres' complaints were published. These were difficult issues, and we didn't let the grass grow looking for a fix. I concede that might not have been visible until later. June was a bit of a rough month too -- we had the issues that spawned the out-of-cycle release at the top of the month, which started almost right after Beta 1, and then almost immediately into Beta 2 after 14.4. I know that consumed a lot of my cycles. At that point in the v15 release process I was primarily focused on monitoring open items, so I missed the June comments. Jonathan
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Hi, On 2022-08-31 16:05:03 -0400, Tom Lane wrote: > Even if there's only a visible leak in v15/HEAD, I'm inclined > to back-patch this all the way, because it's certainly buggy > on its own terms. It's just the merest accident that neither > caller is leaking other stuff into TopMemoryContext, because > they both think they are using a short-lived context. +1 > Thanks for the report! +1 Greetings, Andres Freund
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Reid Thompson writes: > This patch ensures get_database_list() switches back to the memory > context in use upon entry rather than returning with TopMemoryContext > as the context. > This will address memory allocations in autovacuum.c being associated > with TopMemoryContext when they should not be. > autovacuum.c do_start_worker() with current context > 'Autovacuum start worker (tmp)' invokes get_database_list(). Upon > return, the current context has been changed to TopMemoryContext by > AtCommit_Memory() as part of an internal transaction. Further down > in the do_start_worker(), pgstat_fetch_stat_dbentry() is invoked. > Previously this didn't pose an issue; however, recent changes altered > how pgstat_fetch_stat_dbentry() is implemented. The new > implementation has a branch utilizing palloc. The patch ensures these > allocations are associated with the 'Autovacuum start worker (tmp)' > context rather than the TopMemoryContext. Prior to the change, > leaving an idle laptop PG instance running just shy of 3 days saw the > autovacuum launcher process grow to 42MB with most of that growth in > TopMemoryContext due to the palloc allocations issued with autovacuum > worker startup. Yeah, I can reproduce noticeable growth within a couple minutes after setting autovacuum_naptime to 1s, and I confirm that the launcher's space consumption stays static after applying this. Even if there's only a visible leak in v15/HEAD, I'm inclined to back-patch this all the way, because it's certainly buggy on its own terms. It's just the merest accident that neither caller is leaking other stuff into TopMemoryContext, because they both think they are using a short-lived context. Thanks for the report! regards, tom lane
Re: First draft of the PG 15 release notes
On Wed, Aug 31, 2022 at 11:38:33AM +0200, Peter Eisentraut wrote: > On 30.08.22 22:42, Bruce Momjian wrote: > > Good question --- the full text is: > > > > > > > > Record and check the collation of each > linkend="sql-createdatabase">database (Peter Eisentraut) > > > > > > > > This is designed to detect collation > > mismatches to avoid data corruption. Function > > pg_database_collation_actual_version() > > reports the underlying operating system collation version, and > > ALTER DATABASE ... REFRESH sets the database > > to match the operating system collation version. DETAILS? > > > > > > > > I just can't figure out what the user needs to understand about this, > > and I understand very little of it. > > We already had this feature for (schema-level) collations, now we have it on > the level of the database collation. Okay, I figured out the interplay between OS collation version support, collation libraries, and collation levels. Here is an updated patch for the release notes. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson diff --git a/doc/src/sgml/release-15.sgml b/doc/src/sgml/release-15.sgml index e3ab13e53c..dca4550c03 100644 --- a/doc/src/sgml/release-15.sgml +++ b/doc/src/sgml/release-15.sgml @@ -578,17 +578,17 @@ Author: Peter Eisentraut - Record and check the collation of each database (Peter Eisentraut) - This is designed to detect collation + This is designed to detect collation version mismatches to avoid data corruption. Function pg_database_collation_actual_version() reports the underlying operating system collation version, and ALTER DATABASE ... REFRESH sets the database - to match the operating system collation version. DETAILS? + to match the operating system collation version. 
@@ -605,9 +605,11 @@ Author: Peter Eisentraut - Previously, ICU collations could only be - specified in CREATE - COLLATION and used with the + Previously, only libc-based + collations could be set at the cluster and database levels. + ICU collations were previously limited + to CREATE + COLLATION and referenced by the COLLATE clause.
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Hi, On 2022-08-31 12:24:28 -0700, Nathan Bossart wrote: > On Wed, Aug 31, 2022 at 12:14:33PM -0700, Andres Freund wrote: > > On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: > > It continues to make no sense to me to add behaviour changes around > > error-handling as part of a conversion to get_dirent_type(). I don't at all > > understand why e.g. the above change to make copydir() silently skip over > > files it can't stat is ok? > > In this example, the call to get_dirent_type() should ERROR if the call to > lstat() fails (the "elevel" argument is set to ERROR). Oh, oops. Skimmed code too quickly... Greetings, Andres Freund
Re: pg_stat_have_stats() returns true for dropped indexes (or for index creation transaction rolled back)
Hi, On 2022-08-25 11:44:34 +0200, Drouvot, Bertrand wrote: > For REPLSLOT, I agree that we can add one test: I added it in > contrib/test_decoding/sql/stats.sql. It relies on pg_stat_have_stats() (as > relying on pg_stat_replication_slots and/or pg_stat_get_replication_slot() > would not help that much for this test given that the slot has been removed > from ReplicationSlotCtl) As Horiguchi-san noted, we can't rely on specific indexes being used. I feel ok with the current coverage in that area, but if we *really* feel we need to test it, we'd need to count the number of indexes with slots before dropping the slot and after the drop. > +-- pg_stat_have_stats returns true for committed index creation Maybe another test for an uncommitted index creation would be good too? Could you try running this test with debug_discard_caches = 1 - it's pretty easy to write tests in this area that aren't reliable timing wise. > +CREATE table stats_test_tab1 as select generate_series(1,10) a; > +CREATE index stats_test_idx1 on stats_test_tab1(a); > +SELECT oid AS dboid from pg_database where datname = current_database() \gset Since you introduced this, maybe convert the other instance of this query at the end of the file as well? Greetings, Andres Freund
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Wed, Aug 31, 2022 at 12:14:33PM -0700, Andres Freund wrote: > On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: >> -if (lstat(fromfile, &fst) < 0) >> -ereport(ERROR, >> -(errcode_for_file_access(), >> - errmsg("could not stat file \"%s\": >> %m", fromfile))); >> +xlde_type = get_dirent_type(fromfile, xlde, false, ERROR); >> >> -if (S_ISDIR(fst.st_mode)) >> +if (xlde_type == PGFILETYPE_DIR) >> { >> /* recurse to handle subdirectories */ >> if (recurse) >> copydir(fromfile, tofile, true); >> } >> -else if (S_ISREG(fst.st_mode)) >> +else if (xlde_type == PGFILETYPE_REG) >> copy_file(fromfile, tofile); >> } >> FreeDir(xldir); > > It continues to make no sense to me to add behaviour changes around > error-handling as part of a conversion to get_dirent_type(). I don't at all > understand why e.g. the above change to make copydir() silently skip over > files it can't stat is ok? In this example, the call to get_dirent_type() should ERROR if the call to lstat() fails (the "elevel" argument is set to ERROR). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 11:56:29AM -0700, Andres Freund wrote: > Hi, > > On 2022-08-31 19:52:49 +0200, Matthias van de Meent wrote: > > As for having a lower granularity and preventing the > > one-syscall-per-Relation issue, can't we reuse the query_start or > > state_change timestamps that appear in pg_stat_activity (potentially > > updated immediately before this stat flush), or some other per-backend > > timestamp that is already maintained and considered accurate enough > > for this use? > > The problem is that it won't change at all for a query that runs for a week - > and we'll report the timestamp from a week ago when it finally ends. > > But given this is done when stats are flushed, which only happens after the > transaction ended, we can just use GetCurrentTransactionStopTimestamp() - if > we got to flushing the transaction stats we'll already have computed that. Oh, good point --- it is safer to show a more recent time than a too-old time. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: [PATCH] Add sortsupport for range types and btree_gist
Hi! Sorry for the long delay. This fixes the crashes, now all tests run fine w/ and w/o debug configuration. That shouldn't even have slipped through as such. Notable changes from v1: - gbt_enum_sortsupport() now passes on fcinfo->flinfo enum_cmp_internal() needs a place to cache the typcache entry. - inet sortsupport now uses network_cmp() directly Thanks, Christoph HeissFrom 667779bc0761c1356141722181c5a54ac46d96b9 Mon Sep 17 00:00:00 2001 From: Christoph Heiss Date: Wed, 31 Aug 2022 19:20:43 +0200 Subject: [PATCH v2 1/1] Add sortsupport for range types and btree_gist Incrementally building up a GiST index can result in a sub-optimal index structure for range types. By sorting the data before inserting it into the index will result in a much better index. This can provide sizeable improvements in query execution time, I/O read time and shared block hits/reads. Signed-off-by: Christoph Heiss --- contrib/btree_gist/Makefile | 3 +- contrib/btree_gist/btree_bit.c | 19 contrib/btree_gist/btree_bool.c | 22 contrib/btree_gist/btree_cash.c | 22 contrib/btree_gist/btree_enum.c | 26 + contrib/btree_gist/btree_gist--1.7--1.8.sql | 110 contrib/btree_gist/btree_gist.control | 2 +- contrib/btree_gist/btree_inet.c | 19 contrib/btree_gist/btree_interval.c | 19 contrib/btree_gist/btree_macaddr8.c | 19 contrib/btree_gist/btree_time.c | 19 src/backend/utils/adt/rangetypes_gist.c | 70 + src/include/catalog/pg_amproc.dat | 3 + src/include/catalog/pg_proc.dat | 3 + 14 files changed, 354 insertions(+), 2 deletions(-) create mode 100644 contrib/btree_gist/btree_gist--1.7--1.8.sql diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile index 48997c75f6..4096de73ea 100644 --- a/contrib/btree_gist/Makefile +++ b/contrib/btree_gist/Makefile @@ -33,7 +33,8 @@ EXTENSION = btree_gist DATA = btree_gist--1.0--1.1.sql \ btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \ btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \ - btree_gist--1.5--1.6.sql 
btree_gist--1.6--1.7.sql + btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql \ + btree_gist--1.7--1.8.sql PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes" REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \ diff --git a/contrib/btree_gist/btree_bit.c b/contrib/btree_gist/btree_bit.c index 5b246bcde4..bf32bfd628 100644 --- a/contrib/btree_gist/btree_bit.c +++ b/contrib/btree_gist/btree_bit.c @@ -7,6 +7,7 @@ #include "btree_utils_var.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/sortsupport.h" #include "utils/varbit.h" @@ -19,10 +20,17 @@ PG_FUNCTION_INFO_V1(gbt_bit_picksplit); PG_FUNCTION_INFO_V1(gbt_bit_consistent); PG_FUNCTION_INFO_V1(gbt_bit_penalty); PG_FUNCTION_INFO_V1(gbt_bit_same); +PG_FUNCTION_INFO_V1(gbt_bit_sortsupport); /* define for comparison */ +static int +bit_fast_cmp(Datum x, Datum y, SortSupport ssup) +{ + return DatumGetInt32(DirectFunctionCall2(byteacmp, x, y)); +} + static bool gbt_bitgt(const void *a, const void *b, Oid collation, FmgrInfo *flinfo) { @@ -208,3 +216,14 @@ gbt_bit_penalty(PG_FUNCTION_ARGS) PG_RETURN_POINTER(gbt_var_penalty(result, o, n, PG_GET_COLLATION(), , fcinfo->flinfo)); } + +Datum +gbt_bit_sortsupport(PG_FUNCTION_ARGS) +{ + SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); + + ssup->comparator = bit_fast_cmp; + ssup->ssup_extra = NULL; + + PG_RETURN_VOID(); +} diff --git a/contrib/btree_gist/btree_bool.c b/contrib/btree_gist/btree_bool.c index 8b2af129b5..3a9f230e68 100644 --- a/contrib/btree_gist/btree_bool.c +++ b/contrib/btree_gist/btree_bool.c @@ -6,6 +6,7 @@ #include "btree_gist.h" #include "btree_utils_num.h" #include "common/int.h" +#include "utils/sortsupport.h" typedef struct boolkey { @@ -23,6 +24,16 @@ PG_FUNCTION_INFO_V1(gbt_bool_picksplit); PG_FUNCTION_INFO_V1(gbt_bool_consistent); PG_FUNCTION_INFO_V1(gbt_bool_penalty); PG_FUNCTION_INFO_V1(gbt_bool_same); +PG_FUNCTION_INFO_V1(gbt_bool_sortsupport); + +static int +bool_fast_cmp(Datum x, 
Datum y, SortSupport ssup) +{ + bool arg1 = DatumGetBool(x); + bool arg2 = DatumGetBool(y); + + return arg1 - arg2; +} static bool gbt_boolgt(const void *a, const void *b, FmgrInfo *flinfo) @@ -167,3 +178,14 @@ gbt_bool_same(PG_FUNCTION_ARGS) *result = gbt_num_same((void *) b1, (void *) b2, , fcinfo->flinfo); PG_RETURN_POINTER(result); } + +Datum +gbt_bool_sortsupport(PG_FUNCTION_ARGS) +{ + SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); + + ssup->comparator = bool_fast_cmp; + ssup->ssup_extra = NULL; + + PG_RETURN_VOID(); +} diff --git a/contrib/btree_gist/btree_cash.c b/contrib/btree_gist/btree_cash.c index 6ac97e2b12..389b725130 100644 ---
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Hi, On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: > @@ -50,7 +51,7 @@ copydir(char *fromdir, char *todir, bool recurse) > > while ((xlde = ReadDir(xldir, fromdir)) != NULL) > { > - struct stat fst; > + PGFileType xlde_type; > > /* If we got a cancel signal during the copy of the directory, > quit */ > CHECK_FOR_INTERRUPTS(); > @@ -62,18 +63,15 @@ copydir(char *fromdir, char *todir, bool recurse) > snprintf(fromfile, sizeof(fromfile), "%s/%s", fromdir, > xlde->d_name); > snprintf(tofile, sizeof(tofile), "%s/%s", todir, xlde->d_name); > > - if (lstat(fromfile, &fst) < 0) > - ereport(ERROR, > - (errcode_for_file_access(), > - errmsg("could not stat file \"%s\": > %m", fromfile))); > + xlde_type = get_dirent_type(fromfile, xlde, false, ERROR); > > - if (S_ISDIR(fst.st_mode)) > + if (xlde_type == PGFILETYPE_DIR) > { > /* recurse to handle subdirectories */ > if (recurse) > copydir(fromfile, tofile, true); > } > - else if (S_ISREG(fst.st_mode)) > + else if (xlde_type == PGFILETYPE_REG) > copy_file(fromfile, tofile); > } > FreeDir(xldir); It continues to make no sense to me to add behaviour changes around error-handling as part of a conversion to get_dirent_type(). I don't at all understand why e.g. the above change to make copydir() silently skip over files it can't stat is ok? Greetings, Andres Freund
Re: SQL/JSON features for v15
On 2022-08-31 We 14:45, Tom Lane wrote: > To the extent that there was a management failure here, it was that > we didn't press for a resolution sooner. Given the scale of the > concerns raised in June, I kind of agree with Andres' opinion that > fixing them post-freeze was doomed to failure. It was definitely > doomed once we reached August with no real work done towards it. I'm not going to comment publicly in general about this, you might imagine what my reaction is. The decision is the RMT's to make and I have no quarrel with that. But I do want it understood that there was work being done right from the time in June when Andres' complaints were published. These were difficult issues, and we didn't let the grass grow looking for a fix. I concede that might not have been visible until later. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [PATCH] Query Jumbling for CALL and SET utility statements
Hi, On 2022-08-31 11:00:05 -0700, Jeremy Schneider wrote: > On 8/31/22 9:08 AM, Andres Freund wrote: > > > > I suspect we should carve out things like CALL, PREPARE, EXECUTE from > > track_utility - it's more or less an architectural accident that they're > > utility statements. It's a bit less clear that SET should be dealt with > > that > > way. > > Regarding SET, the compelling use case was around "application_name" > whose purpose is to provide a label in pg_stat_activity and on log > lines, which can be used to improve observability and connect queries to > their source in application code. I wasn't saying that SET shouldn't be jumbled, just that it seems more reasonable to track it only when track_utility is enabled, rather than doing so even when that's disabled. Which I do think makes sense for executing a prepared statement and calling a procedure, since they're really only utility statements by accident. > Personally, at this point, I think pg_stat_statements is critical > infrastructure for anyone running PostgreSQL at scale. The information > it provides is indispensable. I don't think it's really defensible to > tell people that if they want to scale, then they need to fly blind on > any utility statements. I wasn't suggesting doing so at all. Greetings, Andres Freund
Re: Tracking last scan time
Hi, On 2022-08-31 19:52:49 +0200, Matthias van de Meent wrote: > As for having a lower granularity and preventing the > one-syscall-per-Relation issue, can't we reuse the query_start or > state_change timestamps that appear in pg_stat_activity (potentially > updated immediately before this stat flush), or some other per-backend > timestamp that is already maintained and considered accurate enough > for this use? The problem is that it won't change at all for a query that runs for a week - and we'll report the timestamp from a week ago when it finally ends. But given this is done when stats are flushed, which only happens after the transaction ended, we can just use GetCurrentTransactionStopTimestamp() - if we got to flushing the transaction stats we'll already have computed that. > tabentry->numscans += lstats->t_counts.t_numscans; > + if (pgstat_track_scans && lstats->t_counts.t_numscans) > + tabentry->lastscan = GetCurrentTimestamp(); Besides replacing GetCurrentTimestamp() with GetCurrentTransactionStopTimestamp(), this should then also check if tabentry->lastscan is already newer than the new timestamp. Greetings, Andres Freund
Re: SQL/JSON features for v15
Bruce Momjian writes: > I guess you are saying that setting a cut-off was a bad idea, or that > the cut-off was too close to the final release date. For me, I think > there were three questions: > 1. Were subtransactions acceptable, consensus no > 2. Could trapping errors work for PG 15, consensus no > 3. Could the feature be trimmed back for PG 15 to avoid these, consensus ? We could probably have accomplished #3 if there was more time, but we're out of time. (I'm not entirely convinced that spending effort towards #3 was productive anyway, given that we're now thinking about a much differently-scoped patch with API changes.) > I don't think our community works well when there are three issues in > play at once. To the extent that there was a management failure here, it was that we didn't press for a resolution sooner. Given the scale of the concerns raised in June, I kind of agree with Andres' opinion that fixing them post-freeze was doomed to failure. It was definitely doomed once we reached August with no real work done towards it. regards, tom lane
Re: effective_multixact_freeze_max_age issue
On Tue, Aug 30, 2022 at 8:56 PM Nathan Bossart wrote: > LGTM Pushed, thanks. -- Peter Geoghegan
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 12:04:44PM -0400, Tom Lane wrote: > Andres Freund writes: > > From my POV the only real discussion is whether we'd want to revert this in > > 15 > > and HEAD or just 15. There's imo a decent point to be made to just revert in > > 15 and aggressively press forward with the changes posted in this thread. > > I'm not for that. Code that we don't think is ready to ship > has no business being in the common tree, nor does it make > review any easier to be looking at one bulky set of > already-committed patches and another bulky set of deltas. Agreed on removing from PG 15 and master --- it would be confusing to have lots of incomplete code in master that is not in PG 15. > I'm okay with making an exception for the include/nodes/ and > backend/nodes/ files in HEAD, since the recent changes in that > area mean it'd be a lot of error-prone work to produce a reverting > patch there. We can leave those in as dead code temporarily, I think. I don't have an opinion on this. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 12:26:29PM -0400, Robert Haas wrote: > someone else to make the decision. But that's not how it works. The > RMT concept was invented precisely to solve problems like this one, > where the patch authors don't really want to revert it but other > people think it's pretty busted. If such problems were best addressed > by waiting for a long time to see whether anything changes, we > wouldn't need an RMT. That's exactly how we used to handle these kinds > of problems, and it sucked. I saw the RMT/Jonathan stated August 28 as the cut-off date for a decision, which was later changed to September 1: > The RMT is still inclined to revert, but will give folks until Sep 1 0:00 > AoE[1] to reach consensus on if SQL/JSON can be included in v15. This matches > up to Andrew's availability timeline for a revert, and gives enough time to > get through the buildfarm prior to the Beta 4 release[2]. I guess you are saying that setting a cut-off was a bad idea, or that the cut-off was too close to the final release date. For me, I think there were three questions: 1. Were subtransactions acceptable, consensus no 2. Could trapping errors work for PG 15, consensus no 3. Could the feature be trimmed back for PG 15 to avoid these, consensus ? I don't think our community works well when there are three issues in play at once. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: SQL/JSON features for v15
On 2022-08-31 We 12:48, Jonathan S. Katz wrote: > > > With RMT hat on -- Andrew can you please revert the patchset? :-( Yes, I'll do it, starting with the v15 branch. Might take a day or so. cheers (kinda) andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [RFC] building postgres with meson - v12
Hi, On 2022-08-31 10:28:05 +0200, Peter Eisentraut wrote: > I found that the perl test modules are not installed. See attached patch to > correct this. > > To the patches: > > 4e15ee0e24 Don't hardcode tmp_check/ as test directory for tap tests > 1a3169bc3f Split TESTDIR into TESTLOGDIR and TESTDATADIR > > It's a bit weird that the first patch changes the meaning of TESTDIR > and the second patch removes it. Maybe these patches should be > squashed together? Hm, to me they seem topically separate enough, but I don't have a strong opinion on it. > 581721fa99 meson: prereq: Add src/tools/gen_export.pl > > Still wondering about the whitespace changes I reported recently, but > that can also be fine-tuned later. I'll look into it in a bit. > a1fb97a81b meson: Add meson based buildsystem > > I'm not a fan of all this business to protect the two build systems > from each other. I don't like the build process touching a file under > version control every time. How necessary is this? What happens > otherwise? I added it after just about everyone trying meson hit problems due to conflicts between (past) in-tree configure builds and meson, due to files left in tree (picking up the wrong .h files, cannot entirely be fixed with -I arguments, due to the "" includes). By adding the relevant check to the meson configure phase, and by triggering meson re-configure whenever an in-tree configure build is done, these issues can be detected. It'd of course be nicer to avoid the potential for such conflicts, but that appears to be a huge chunk of work, see the bison/flex subthread. So I don't really see an alternative. > conversion_helpers.txt: should probably be removed now. Done. > doc/src/sgml/resolv.xsl: I don't understand what this is doing. Maybe > at least add a comment in the file. It's only used for building epubs. Perhaps I should extract that into a separate patch as well? 
The relevant section is: > # > # epub > # > > # This was previously implemented using dbtoepub - but that doesn't seem to > # support running in build != source directory (i.e. VPATH builds already > # weren't supported). > if pandoc.found() and xsltproc.found() > # XXX: Wasn't able to make pandoc successfully resolve entities > # XXX: Perhaps we should just make all targets use this, to avoid repeatedly > # building whole thing? It's comparatively fast though. > postgres_full_xml = custom_target('postgres-full.xml', > input: ['resolv.xsl', 'postgres.sgml'], > output: ['postgres-full.xml'], > depends: doc_generated + [postgres_sgml_valid], > command: [xsltproc, '--path', '@OUTDIR@/', xsltproc_flags, > '-o', '@OUTPUT@', '@INPUT@'], > build_by_default: false, > ) A noted, I couldn't make pandoc resolve our entities, so I used resolv.xsl them, before calling pandoc. I'll rename it to resolve-entities.xsl and add a comment. > src/common/unicode/meson.build: The comment at the top of the file > should be moved next to the files it is describing (similar to how it > is in the makefile). Done. > I don't see CLDR_VERSION set anywhere. Is that part implemented? No, I didn't implement the generation parts of contrib/unaccent. I started tackling the src/common/unicode bits after John Naylor asked whether that could be done, but considered that good enough... > src/port/win32ver.rc.in: This is redundant with src/port/win32ver.rc. > (Note that the latter is also used as an input file for text > substitution. So having another file named *.in next to it would be > super confusing.) Yea, this stuff isn't great. I think the better solution, both for meson and for configure, would be to move to do all the substitution to the C preprocessor. > src/tools/find_meson: Could use a brief comment what it does. Added. > src/tools/pgflex: Could use a not-brief comment about what it does, > why it's needed. Also a comment where it's used. Also run this > through pycodestyle. Working on that. 
> cd193eb3e8 meson: ci: Build both with meson and as before > > I haven't reviewed this one in detail. Maybe add a summary in the > commit message, like these are the new jobs, these are the changes to > existing jobs. It looks like there is more in there than just adding > a few meson jobs. I don't think we want to commit this as-is. It contains CI for a lot of platforms - that's very useful for working on meson, but too much for in-tree. I guess I'll split it into two, one patch for converting a reasonable subset of the current CI tasks to meson and another to add (back) the current array of tested platforms. > If the above are addressed, I think this will be just about at the > point where the above patches can be committed. Woo! > Everything past these patches I'm mentally postponing right now. Makes sense. Greetings, Andres Freund
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 07:52:49PM +0200, Matthias van de Meent wrote: > As for having a lower granularity and preventing the > one-syscall-per-Relation issue, can't we reuse the query_start or > state_change timestamps that appear in pg_stat_activity (potentially Yeah, query start should be fine, but not transaction start time. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Tracking last scan time
On Wed, 31 Aug 2022 at 18:21, Andres Freund wrote: > > Hi, > > On 2022-08-23 10:55:09 +0100, Dave Page wrote: > > Often it is beneficial to review one's schema with a view to removing > > indexes (and sometimes tables) that are no longer required. It's very > > difficult to understand when that is the case by looking at the number of > > scans of a relation as, for example, an index may be used infrequently but > > may be critical in those times when it is used. > > > > The attached patch against HEAD adds optional tracking of the last scan > > time for relations. It updates pg_stat_*_tables with new last_seq_scan and > > last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to > > help with this. > > > > Due to the use of gettimeofday(), those values are only maintained if a new > > GUC, track_scans, is set to on. By default, it is off. > > > > I did run a 12 hour test to see what the performance impact is. pgbench was > > run with scale factor 1 and 75 users across 4 identical bare metal > > machines running Rocky 8 in parallel which showed roughly a -2% average > > performance penalty against HEAD with track_scans enabled. Machines were > > PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data > > directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source > > is tsc. > > > > HEAD track_scans Penalty (%) > > box1 19582.49735 19341.8881 -1.22869541 > > box2 19936.55513 19928.07479 -0.04253664659 > > box3 19631.78895 18649.64379 -5.002830696 > > box4 19810.86767 19420.67192 -1.969604525 > > Average 19740.42728 19335.06965 -2.05343896 > > Based on the size of those numbers this was a r/w pgbench. If it has this > noticeable an impact for r/w, with a pretty low number of scans/sec, how's the > overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It > must be quite bad.
> > I don't think we should accept this feature with this overhead - but I also > think we can do better, by accepting a bit less accuracy. For this to be > useful we don't need a perfectly accurate timestamp. The statement start time > is probably not accurate enough, but we could just have bgwriter or such > update one in shared memory every time we wake up? Or perhaps we could go to > an even lower granularity, by putting in the current LSN or such? I don't think that LSN is precise enough. For example, if you're in a (mostly) read-only system, the system may go long times without any meaningful records being written. As for having a lower granularity and preventing the one-syscall-per-Relation issue, can't we reuse the query_start or state_change timestamps that appear in pg_stat_activity (potentially updated immediately before this stat flush), or some other per-backend timestamp that is already maintained and considered accurate enough for this use? Regardless, with this patch as it is we get a new timestamp for each relation processed, which I think is a waste of time (heh) even in VDSO-enabled systems. Apart from the above, I don't have any other meaningful opinion on this patch - it might be a good addition, but I don't consume stats often enough to make a good cost / benefit comparison. Kind regards, Matthias van de Meent
Re: PostgreSQL 15 release announcement draft
Hi,

On Tue, Aug 30, 2022 at 03:58:48PM -0400, Jonathan S. Katz wrote:
> ### Other Notable Changes
>
> PostgreSQL server-level statistics are now collected in shared memory,
> eliminating the statistics collector process and writing these stats to disk.
> PostgreSQL 15 also revokes the `CREATE` permission from all users except a
> database owner from the `public` (or default) schema.

It's a bit weird to lump those two in the same paragraph, but ok. However, I think the "and writing these stats to disk." might not be very clear to people not familiar with the feature; they might think writing stats to disk is part of the new feature. So I propose "as well as writing these stats to disk" instead, or something.

Michael
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On Wed, Aug 31, 2022 at 12:50:19PM -0400, Reid Thompson wrote: > Hi Hackers, > > Add the ability to limit the amount of memory that can be allocated to > backends. > > This builds on the work that adds backend memory allocated to > pg_stat_activity > https://www.postgresql.org/message-id/67bb5c15c0489cb499723b0340f16e10c22485ec.camel%40crunchydata.com > Both patches are attached. You should name the patches with different prefixes, like 001,002,003 Otherwise, cfbot may try to apply them in the wrong order. git format-patch is the usual tool for that. > +Specifies a limit to the amount of memory (MB) that may be allocated > to MB are just the default unit, right ? The user should be allowed to write max_total_backend_memory='2GB' > +backends in total (i.e. this is not a per user or per backend limit). > +If unset, or set to 0 it is disabled. A backend request that would > push > +the total over the limit will be denied with an out of memory error > +causing that backends current query/transaction to fail. Due to the > dynamic backend's > +nature of memory allocations, this limit is not exact. If within > 1.5MB of > +the limit and two backends request 1MB each at the same time both > may be > +allocated exceeding the limit. Further requests will not be > allocated until allocated, and exceed the limit > +bool > +exceeds_max_total_bkend_mem(uint64 allocation_request) > +{ > + bool result = false; > + > + if (MyAuxProcType != NotAnAuxProcess) > + return result; The double negative is confusing, so could use a comment. > + /* Convert max_total_bkend_mem to bytes for comparison */ > + if (max_total_bkend_mem && > + pgstat_get_all_backend_memory_allocated() + > + allocation_request > (uint64)max_total_bkend_mem * 1024 * 1024) > + { > + /* > + * Explicitely identify the OOM being a result of this > + * configuration parameter vs a system failure to allocate OOM. 
> + */ > + elog(WARNING, > + "request will exceed postgresql.conf defined > max_total_backend_memory limit (%lu > %lu)", > + pgstat_get_all_backend_memory_allocated() + > + allocation_request, (uint64)max_total_bkend_mem * 1024 > * 1024); I think it should be ereport() rather than elog(), which is internal-only, and not-translated. > + {"max_total_backend_memory", PGC_SIGHUP, RESOURCES_MEM, > + gettext_noop("Restrict total backend memory > allocations to this max."), > + gettext_noop("0 turns this feature off."), > + GUC_UNIT_MB > + }, > + _total_bkend_mem, > + 0, 0, INT_MAX, > + NULL, NULL, NULL I think this needs a maximum like INT_MAX/1024/1024 > +uint64 > +pgstat_get_all_backend_memory_allocated(void) > +{ ... > + for (i = 1; i <= NumBackendStatSlots; i++) > + { It's looping over every backend for each allocation. Do you know if there's any performance impact of that ? I think it may be necessary to track the current allocation size in shared memory (with atomic increments?). Maybe decrements would need to be exactly accounted for, or otherwise Assert() that the value is not negative. I don't know how expensive it'd be to have conditionals for each decrement, but maybe the value would only be decremented at strategic times, like at transaction commit or backend shutdown. -- Justin
Re: [PATCH] Add native windows on arm64 support
Michael Paquier writes: > At the end, I'd like to think that it would be wiser to just remove > entirely the test for node() or reduce the contents of the string to > be able to strictly control the output order (say a simple > 'prepost' would do the trick). I cannot think > about a better idea now. Note that removing the test case where we > have node() has no impact on the coverage of xml.c. Yeah, I confirm that: local code-coverage testing shows no change in the number of lines reached in xml.c when I remove that column: -SELECT * FROM XMLTABLE('*' PASSING 'predeeppost' COLUMNS x xml PATH 'node()', y xml PATH '/'); +SELECT * FROM XMLTABLE('*' PASSING 'predeeppost' COLUMNS y xml PATH '/'); Dropping the query altogether does result in a reduction in the number of lines hit. I did not check the other XMLTABLE infrastructure, so what probably is best to do is keep the two output columns but change 'node()' to something with a more stable result; any preferences? regards, tom lane
Re: SQL/JSON features for v15
Robert Haas writes: > On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: >> (I don't say that we'd convert every datatype to do so right away; >> in the long run we should, but I'm content to start with just the >> same core types touched here.) > I would be in favor of making more of an effort than just a few token > data types. The initial patch could just touch a few, but once the > infrastructure is in place we should really make a sweep through the > tree and tidy up. Sure, but my point is that we can do that in a time-extended fashion rather than having a flag day where everything must be updated. The initial patch just needs to update a few types as proof of concept. regards, tom lane
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: > The currently proposed patchset hacks up a relatively small number > of core datatypes to be able to do that. But it's just a hack > and there's no prospect of extension types being able to join > in the fun. I think where we need to start, for v16, is making > an API design that will let any datatype have this functionality. This would be really nice to have. > (I don't say that we'd convert every datatype to do so right away; > in the long run we should, but I'm content to start with just the > same core types touched here.) I would be in favor of making more of an effort than just a few token data types. The initial patch could just touch a few, but once the infrastructure is in place we should really make a sweep through the tree and tidy up. > Beside the JSON stuff, there is > another even more pressing application for such behavior, namely > the often-requested COPY functionality to be able to shunt bad data > off somewhere without losing the entire transfer. In the COPY case > I think we'd want to be able to capture the error message that > would have been issued, which means the current patches are not > at all appropriate as a basis for that API design: they're just > returning a bool without any details. Fully agreed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SQL/JSON features for v15
I wrote: > Andres Freund writes: >> From my POV the only real discussion is whether we'd want to revert this in >> 15 >> and HEAD or just 15. There's imo a decent point to be made to just revert in >> 15 and aggressively press forward with the changes posted in this thread. > I'm not for that. Code that we don't think is ready to ship > has no business being in the common tree, nor does it make > review any easier to be looking at one bulky set of > already-committed patches and another bulky set of deltas. To enlarge on that a bit: it seems to me that the really fundamental issue here is how to catch datatype-specific input and conversion errors without using subtransactions, because those are too expensive and can mask errors we'd rather not be masking, such as OOM. (Andres had some additional, more localized concerns, but I think this is the one with big-picture implications.) The currently proposed patchset hacks up a relatively small number of core datatypes to be able to do that. But it's just a hack and there's no prospect of extension types being able to join in the fun. I think where we need to start, for v16, is making an API design that will let any datatype have this functionality. (I don't say that we'd convert every datatype to do so right away; in the long run we should, but I'm content to start with just the same core types touched here.) Beside the JSON stuff, there is another even more pressing application for such behavior, namely the often-requested COPY functionality to be able to shunt bad data off somewhere without losing the entire transfer. In the COPY case I think we'd want to be able to capture the error message that would have been issued, which means the current patches are not at all appropriate as a basis for that API design: they're just returning a bool without any details. So that's why I'm in favor of reverting and starting over. 
There are probably big chunks of what's been done that can be re-used, but it all needs to be re-examined with this sort of design in mind. As a really quick sketch of what such an API might look like: we could invent a new node type, say IOCallContext, which is intended to be passed as FunctionCallInfo.context to type input functions and perhaps type conversion functions. Call sites wishing to have no-thrown-error functionality would initialize one of these to show "no error" and then pass it to the data type's usual input function. Old-style input functions would ignore this and just throw errors as usual; sorry, you don't get the no-error functionality you wanted. But I/O functions that had been updated would know to store the report of a relevant error into that node and then return NULL. (Although I think there may be assumptions somewhere that I/O functions don't return NULL, so maybe "just return any dummy value" is a better idea? Although likely it wouldn't be hard to remove such assumptions from callers using this functionality.) The caller would detect the presence of an error by examining the node contents and then do whatever it needs to do. regards, tom lane
Re: Add tracking of backend memory allocated to pg_stat_activity
On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote:
> Hi Hackers,
>
> Attached is a patch to
> Add tracking of backend memory allocated to pg_stat_activity

> + proargmodes =>
> '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',

In the past, there was concern about making pg_stat_activity wider by adding information that's less-essential than what's been there for years. This is only an int64, so it's not "wide", but I wonder if there's another way to expose this information? Like adding backends to pg_get_backend_memory_contexts(), maybe with another view on top of that ?

+ * shown allocated in pgstat_activity when the creator destroys the pg_stat

> + * Posix creation calls dsm_impl_posix_resize implying that resizing
> + * occurs or may be added in the future. As implemented
> + * dsm_impl_posix_resize utilizes fallocate or truncate, passing the
> + * whole new size as input, growing the allocation as needed * (only
> + * truncate supports shrinking). We update by replacing the * old

wrapping caused extraneous stars

> + * Do not allow backend_mem_allocated to go below zero. ereport if we
> + * would have. There's no need for a lock around the read here asit's

as it's

> + ereport(LOG, (errmsg("decrease reduces reported backend memory
> allocated below zero; setting reported to 0")));

errmsg() doesn't require the outside parentheses since a couple of years ago.

> + /*
> + * Until header allocation is included in context->mem_allocated cast to
> + * slab and decrement the headerSize

add a comma before "cast" ?

-- Justin
Re: [PATCH] Query Jumbling for CALL and SET utility statements
st 31. 8. 2022 v 17:50 odesílatel Pavel Stehule napsal: > Hi > > > st 31. 8. 2022 v 17:34 odesílatel Drouvot, Bertrand > napsal: > >> Hi hackers, >> >> While query jumbling is provided for function calls that’s currently not >> the case for procedures calls. >> The reason behind this is that all utility statements are currently >> discarded for jumbling. >> >> We’ve recently seen performance impacts (LWLock contention) due to the >> lack of jumbling on procedure calls with pg_stat_statements and >> pg_stat_statements.track_utility enabled (think an application with a high >> rate of procedure calls with unique parameters for each call). >> >> Jeremy has had this conversation on twitter (see >> https://twitter.com/jer_s/status/1560003560116342785) and Nikolay >> reported that he also had to work on a similar performance issue with SET >> being used. >> >> That’s why we think it would make sense to allow jumbling for those 2 >> utility statements: CALL and SET. >> >> Please find attached a patch proposal for doing so. >> >> With the attached patch we would get things like: >> CALL MINUS_TWO(3); >> CALL MINUS_TWO(7); >> CALL SUM_TWO(3, 8); >> CALL SUM_TWO(7, 5); >> set enable_seqscan=false; >> set enable_seqscan=true; >> set seq_page_cost=2.0; >> set seq_page_cost=1.0; >> >> postgres=# SELECT query, calls, rows FROM pg_stat_statements; >>query | calls | rows >> ---+---+-- >> set seq_page_cost=$1 | 2 |0 >> CALL MINUS_TWO($1)| 2 |0 >> set enable_seqscan=$1 | 2 |0 >> CALL SUM_TWO($1, $2) | 2 |0 >> >> Looking forward to your feedback, >> > The idea is good, but I think you should use pg_stat_functions instead. > Maybe it is supported already (I didn't test it). I am not sure so SET > statement should be traced in pg_stat_statements - it is usually pretty > fast, and without context it says nothing. It looks like just overhead. > I was wrong - there is an analogy with SELECT fx, and the statistics are in pg_stat_statements, and in pg_stat_function too. 
Regards Pavel > > Regards > > Pavel > > >> Thanks, >> >> Jeremy & Bertrand >> >> -- >> Bertrand Drouvot >> Amazon Web Services: https://aws.amazon.com >> >>
Add the ability to limit the amount of memory that can be allocated to backends.
Hi Hackers, Add the ability to limit the amount of memory that can be allocated to backends. This builds on the work that adds backend memory allocated to pg_stat_activity https://www.postgresql.org/message-id/67bb5c15c0489cb499723b0340f16e10c22485ec.camel%40crunchydata.com Both patches are attached. Add GUC variable max_total_backend_memory. Specifies a limit to the amount of memory (MB) that may be allocated to backends in total (i.e. this is not a per user or per backend limit). If unset, or set to 0 it is disabled. It is intended as a resource to help avoid the OOM killer. A backend request that would push the total over the limit will be denied with an out of memory error causing that backends current query/transaction to fail. Due to the dynamic nature of memory allocations, this limit is not exact. If within 1.5MB of the limit and two backends request 1MB each at the same time both may be allocated exceeding the limit. Further requests will not be allocated until dropping below the limit. Keep this in mind when setting this value to avoid the OOM killer. Currently, this limit does not affect auxiliary backend processes, this list of non-affected backend processes is open for discussion as to what should/should not be included. Backend memory allocations are displayed in the pg_stat_activity view. -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thomp...@crunchydata.com www.crunchydata.com diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a5cd4e44c7..caf958310a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2079,6 +2079,32 @@ include_dir 'conf.d' + + max_total_backend_memory (integer) + + max_total_backend_memory configuration parameter + + + + +Specifies a limit to the amount of memory (MB) that may be allocated to +backends in total (i.e. this is not a per user or per backend limit). +If unset, or set to 0 it is disabled. 
A backend request that would push +the total over the limit will be denied with an out of memory error +causing that backends current query/transaction to fail. Due to the dynamic +nature of memory allocations, this limit is not exact. If within 1.5MB of +the limit and two backends request 1MB each at the same time both may be +allocated exceeding the limit. Further requests will not be allocated until +dropping below the limit. Keep this in mind when setting this value. This +limit does not affect auxiliary backend processes + . Backend memory allocations +(backend_mem_allocated) are displayed in the +pg_stat_activity +view. + + + + diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c index 269ad2fe53..808ffe75f2 100644 --- a/src/backend/storage/ipc/dsm_impl.c +++ b/src/backend/storage/ipc/dsm_impl.c @@ -253,6 +253,10 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size, return true; } + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* * Create new segment or open an existing one for attach. * @@ -524,6 +528,10 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size, int flags = IPCProtection; size_t segsize; + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* * Allocate the memory BEFORE acquiring the resource, so that we don't * leak the resource if memory allocation fails. @@ -718,6 +726,10 @@ dsm_impl_windows(dsm_op op, dsm_handle handle, Size request_size, return true; } + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* Create new segment or open an existing one for attach. 
*/ if (op == DSM_OP_CREATE) { diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c index 17a00587f8..9137a000ae 100644 --- a/src/backend/utils/activity/backend_status.c +++ b/src/backend/utils/activity/backend_status.c @@ -44,6 +44,8 @@ */ bool pgstat_track_activities = false; int pgstat_track_activity_query_size = 1024; +/* Max backend memory allocation allowed (MB). 0 = disabled */ +int max_total_bkend_mem = 0; /* exposed so that backend_progress.c can access it */ @@ -1253,3 +1255,107 @@ pgstat_report_backend_mem_allocated_decrease(uint64 deallocation) beentry->backend_mem_allocated -= deallocation; PGSTAT_END_WRITE_ACTIVITY(beentry); } + +/* -- + * pgstat_get_all_backend_memory_allocated() - + * + * Return a uint64 representing the current shared memory allocated to all + * backends. This looks directly at the BackendStatusArray, and so will + *
Re: SQL/JSON features for v15
On 8/31/22 12:26 PM, Robert Haas wrote: On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz wrote: Andres, Robert, Tom: With this recent work, have any of your opinions changed on including SQL/JSON in v15? No. Nothing's been committed, and there's no time to review anything in detail, and there was never going to be. OK. Based on this feedback, the RMT is going to request that this is reverted. With RMT hat on -- Andrew can you please revert the patchset? Nikita said he was ready to start hacking in mid-August. That's good of him, but feature freeze was in April. We don't start hacking on a feature 4 months after the freeze. I'm unwilling to drop everything I'm working on to review patches that were written in a last minute rush. Even if these patches were more important to me than my own work, which they are not, I couldn't possibly do a good job reviewing complex patches on top of other complex patches in an area that I haven't studied in years. And if I could do a good job, no doubt I'd find a bunch of problems - whether they would be large or small, I don't know - and then that would lead to more changes even closer to the intended release date. I just don't understand what the RMT thinks it is doing here. When a concern is raised about whether a feature is anywhere close to being in a releasable state in August, "hack on it some more and then see where we're at" seems like an obviously impractical way forward. It seemed clear to me from the moment that Andres raised his concerns that the only two viable strategies were (1) revert the feature and be sad or (2) decide to ship it anyway and hope that Andres is incorrect in thinking that it will become an embarrassment to the project. The RMT has chosen neither of these, and in fact, really seems to want someone else to make the decision. But that's not how it works. 
The RMT concept was invented precisely to solve problems like this one, where the patch authors don't really want to revert it but other people think it's pretty busted. If such problems were best addressed by waiting for a long time to see whether anything changes, we wouldn't need an RMT. That's exactly how we used to handle these kinds of problems, and it sucked.

This is fair feedback. However, there are a few things to consider here:

1. When Andres raised his initial concerns, the RMT did recommend to revert but did not force it. Part of the RMT charter is to try to get consensus before doing so and after we've exhausted the community process. As we moved closer, the patch authors proposed some suggestions which other folks were amenable to trying. Unfortunately, time has run out. However,

2. One of the other main goals of the RMT is to ensure the release ships "on time" which we define to be late Q3/early Q4. We factored that into the decision making process around this. We are still on time for the release.

I take responsibility for the decision making. I would be open to discussing this further around what worked / what didn't with the RMT and where we can improve in the future.

Thanks,

Jonathan
Re: SQL/JSON features for v15
Hi, On 2022-08-31 12:26:29 -0400, Robert Haas wrote: > On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz > wrote: > > Andres, Robert, Tom: With this recent work, have any of your opinions > > changed on including SQL/JSON in v15? > > No. Nothing's been committed, and there's no time to review anything > in detail, and there was never going to be. Nikita said he was ready > to start hacking in mid-August. That's good of him, but feature freeze > was in April. As additional context: I had started raising those concerns mid June. Greetings, Andres Freund
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz wrote: > Andres, Robert, Tom: With this recent work, have any of your opinions > changed on including SQL/JSON in v15? No. Nothing's been committed, and there's no time to review anything in detail, and there was never going to be. Nikita said he was ready to start hacking in mid-August. That's good of him, but feature freeze was in April. We don't start hacking on a feature 4 months after the freeze. I'm unwilling to drop everything I'm working on to review patches that were written in a last minute rush. Even if these patches were more important to me than my own work, which they are not, I couldn't possibly do a good job reviewing complex patches on top of other complex patches in an area that I haven't studied in years. And if I could do a good job, no doubt I'd find a bunch of problems - whether they would be large or small, I don't know - and then that would lead to more changes even closer to the intended release date. I just don't understand what the RMT thinks it is doing here. When a concern is raised about whether a feature is anywhere close to being in a releasable state in August, "hack on it some more and then see where we're at" seems like an obviously impractical way forward. It seemed clear to me from the moment that Andres raised his concerns that the only two viable strategies were (1) revert the feature and be sad or (2) decide to ship it anyway and hope that Andres is incorrect in thinking that it will become an embarrassment to the project. The RMT has chosen neither of these, and in fact, really seems to want someone else to make the decision. But that's not how it works. The RMT concept was invented precisely to solve problems like this one, where the patch authors don't really want to revert it but other people think it's pretty busted. If such problems were best addressed by waiting for a long time to see whether anything changes, we wouldn't need an RMT. 
That's exactly how we used to handle these kinds of problems, and it sucked. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Tracking last scan time
Hi,

On 2022-08-23 10:55:09 +0100, Dave Page wrote:
> Often it is beneficial to review one's schema with a view to removing
> indexes (and sometimes tables) that are no longer required. It's very
> difficult to understand when that is the case by looking at the number of
> scans of a relation as, for example, an index may be used infrequently but
> may be critical in those times when it is used.
>
> The attached patch against HEAD adds optional tracking of the last scan
> time for relations. It updates pg_stat_*_tables with new last_seq_scan and
> last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to
> help with this.
>
> Due to the use of gettimeofday(), those values are only maintained if a new
> GUC, track_scans, is set to on. By default, it is off.
>
> I did run a 12 hour test to see what the performance impact is. pgbench was
> run with scale factor 1 and 75 users across 4 identical bare metal
> machines running Rocky 8 in parallel which showed roughly a -2% average
> performance penalty against HEAD with track_scans enabled. Machines were
> PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data
> directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source
> is tsc.
>
>            HEAD         track_scans   Penalty (%)
> box1       19582.49735  19341.8881    -1.22869541
> box2       19936.55513  19928.07479   -0.04253664659
> box3       19631.78895  18649.64379   -5.002830696
> box4       19810.86767  19420.67192   -1.969604525
> Average    19740.42728  19335.06965   -2.05343896

Based on the size of those numbers this was a r/w pgbench. If it has this noticable an impact for r/w, with a pretty low number of scans/sec, how's the overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It must be quite bad.

I don't think we should accept this feature with this overhead - but I also think we can do better, by accepting a bit less accuracy. For this to be useful we don't need a perfectly accurate timestamp.
The statement start time is probably not accurate enough, but we could just have bgwriter or such update one in shared memory every time we wake up? Or perhaps we could go to an even lower granularity, by putting in the current LSN or such? Greetings, Andres Freund
Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Hi Hackers,

This patch ensures get_database_list() switches back to the memory context in use upon entry rather than returning with TopMemoryContext as the context. This will address memory allocations in autovacuum.c being associated with TopMemoryContext when they should not be.

autovacuum.c do_start_worker() with current context 'Autovacuum start worker (tmp)' invokes get_database_list(). Upon return, the current context has been changed to TopMemoryContext by AtCommit_Memory() as part of an internal transaction. Further down in do_start_worker(), pgstat_fetch_stat_dbentry() is invoked. Previously this didn't pose an issue, however recent changes altered how pgstat_fetch_stat_dbentry() is implemented. The new implementation has a branch utilizing palloc. The patch ensures these allocations are associated with the 'Autovacuum start worker (tmp)' context rather than the TopMemoryContext.

Prior to the change, leaving an idle laptop PG instance running just shy of 3 days saw the autovacuum launcher process grow to 42MB, with most of that growth in TopMemoryContext due to the palloc allocations issued with autovacuum worker startup.

-- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thomp...@crunchydata.com www.crunchydata.com

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index b3b1afba86..92b1ef45c1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1941,6 +1941,13 @@ get_database_list(void)
 	CommitTransactionCommand();
 
+	/*
+	 * CommitTransactionCommand returns with CurrentMemoryContext set
+	 * to TopMemoryContext. Switch back to the context that we entered
+	 * with.
+	 */
+	MemoryContextSwitchTo(resultcxt);
+
 	return dblist;
 }
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 05:02:33PM +0100, Dave Page wrote: > > > On Tue, 30 Aug 2022 at 19:46, Bruce Momjian wrote: > > On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote: > > On Thu, 25 Aug 2022 at 01:44, David Rowley wrote: > > I don't have a particular opinion about the patch, I'm just pointing > > out that there are other ways. Even just writing down the numbers on > a > > post-it note and coming back in a month to see if they've changed is > > enough to tell if the table or index has been used. > > > > > > There are usually other ways to perform monitoring tasks, but there is > > something to be said for the convenience of having functionality built > in > and > > not having to rely on tools, scripts, or post-it notes :-) > > Should we consider using something cheaper like time() so we don't need > a GUC to enable this? > > > Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls > takes about 2 seconds, whilst 100,000,000 time() calls takes 14(!) seconds. Wow. I was just thinking you need second-level accuracy, which must be cheap somewhere. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: [PATCH] Query Jumbling for CALL and SET utility statements
Hi, On 2022-08-31 17:33:44 +0200, Drouvot, Bertrand wrote: > While query jumbling is provided for function calls that’s currently not the > case for procedures calls. > The reason behind this is that all utility statements are currently > discarded for jumbling. > [...] > That’s why we think it would make sense to allow jumbling for those 2 > utility statements: CALL and SET. Yes, I've seen this be an issue repeatedly. Although more heavily for PREPARE / EXECUTE than either of the two cases you handle here. IME not tracking PREPARE / EXECUTE can distort statistics substantially - there's appears to be a surprising number of applications / frameworks resorting to them. Basically requiring that track utility is turned on. I suspect we should carve out things like CALL, PREPARE, EXECUTE from track_utility - it's more or less an architectural accident that they're utility statements. It's a bit less clear that SET should be dealt with that way. > @@ -383,6 +384,30 @@ JumbleExpr(JumbleState *jstate, Node *node) > APP_JUMB(var->varlevelsup); > } > break; > + case T_CallStmt: > + { > + CallStmt *stmt = (CallStmt *) node; > + FuncExpr *expr = stmt->funcexpr; > + > + APP_JUMB(expr->funcid); > + JumbleExpr(jstate, (Node *) expr->args); > + } > + break; Why do we need to take the arguments into account? > + case T_VariableSetStmt: > + { > + VariableSetStmt *stmt = (VariableSetStmt *) > node; > + > + APP_JUMB_STRING(stmt->name); > + JumbleExpr(jstate, (Node *) stmt->args); > + } > + break; Same? > + case T_A_Const: > + { > + int loc = ((const A_Const > *) node)->location; > + > + RecordConstLocation(jstate, loc); > + } > + break; I suspect we only need this because of the jumbling of unparsed arguments I questioned above? If we do end up needing it, shouldn't we include the type in the jumbling? Greetings, Andres Freund
Re: SQL/JSON features for v15
Andres Freund writes:
> On 2022-08-31 10:20:24 -0400, Jonathan S. Katz wrote:
>> Andres, Robert, Tom: With this recent work, have any of your opinions
>> changed on including SQL/JSON in v15?

> I don't really know what to do here. It feels blatantly obvious that this
> code isn't even remotely close to being releasable. I'm worried about the
> impact of the big revert at this stage of the release cycle, and that's
> not getting better by delaying further. And I'm getting weary of being
> asked to make the obvious call that the authors of this feature as well
> as the RMT should have made a while ago.

I have to agree.  There is a large amount of code at stake here.  We're
being asked to review a bunch of hastily-produced patches to that code
on an even more hasty schedule (and personally I have other things I
need to do today...)  I think the odds of a favorable end result are
small.

> From my POV the only real discussion is whether we'd want to revert this
> in 15 and HEAD or just 15. There's imo a decent point to be made to just
> revert in 15 and aggressively press forward with the changes posted in
> this thread.

I'm not for that.  Code that we don't think is ready to ship has no
business being in the common tree, nor does it make review any easier to
be looking at one bulky set of already-committed patches and another
bulky set of deltas.

I'm okay with making an exception for the include/nodes/ and
backend/nodes/ files in HEAD, since the recent changes in that area mean
it'd be a lot of error-prone work to produce a reverting patch there.
We can leave those in as dead code temporarily, I think.

			regards, tom lane
Add tracking of backend memory allocated to pg_stat_activity
Hi Hackers,

Attached is a patch to add tracking of backend memory allocated to
pg_stat_activity.

This new field displays the current bytes of memory allocated to the
backend process. It is updated as memory for the process is
malloc'd/free'd. Memory allocated to items on the freelist is included
in the displayed value. Dynamic shared memory allocations are included
only in the value displayed for the backend that created them; they are
not included in the value for backends that are attached to them, to
avoid double counting. On occasion, orphaned memory segments may be
cleaned up on postmaster startup. This may result in decreasing the sum
without a prior increment, so we limit the floor of
backend_mem_allocated to zero.

--
Reid Thompson
Senior Software Engineer
Crunchy Data, Inc.

reid.thomp...@crunchydata.com
www.crunchydata.com


diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1d9509a2f6..40ae638f25 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -947,6 +947,18 @@ postgres  27093  0.0  0.0  30096  2752 ?  Ss  11:34  0:00 postgres: ser
+
+
+      backend_mem_allocated bigint
+
+
+      The byte count of memory allocated to this backend. Dynamic shared
+      memory allocations are included only in the value displayed for the
+      backend that created them, they are not included in the value for
+      backends that are attached to them to avoid double counting.
+
+
+
      query text

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5a844b63a1..d23f0e9dbb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -863,6 +863,7 @@ CREATE VIEW pg_stat_activity AS
             S.backend_xid,
             s.backend_xmin,
             S.query_id,
+            S.backend_mem_allocated,
             S.query,
             S.backend_type
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index e1b90c5de4..269ad2fe53 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -66,6 +66,7 @@
 #include "postmaster/postmaster.h"
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
+#include "utils/backend_status.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -232,6 +233,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 						name)));
 			return false;
 		}
+		/*
+		 * Detach and destroy pass through here, only decrease the memory
+		 * shown allocated in pgstat_activity when the creator destroys the
+		 * allocation.
+		 */
+		if (op == DSM_OP_DESTROY)
+			pgstat_report_backend_mem_allocated_decrease(*mapped_size);
 		*mapped_address = NULL;
 		*mapped_size = 0;
 		if (op == DSM_OP_DESTROY && shm_unlink(name) != 0)
@@ -332,6 +340,36 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 						name)));
 			return false;
 		}
+
+		/*
+		 * Attach and create pass through here, only update backend memory
+		 * allocated in pgstat_activity for the creator process.
+		 */
+		if (op == DSM_OP_CREATE)
+		{
+			/*
+			 * Posix creation calls dsm_impl_posix_resize implying that
+			 * resizing occurs or may be added in the future. As implemented
+			 * dsm_impl_posix_resize utilizes fallocate or truncate, passing
+			 * the whole new size as input, growing the allocation as needed
+			 * (only truncate supports shrinking). We update by replacing the
+			 * old allocation with the new.
+			 */
+#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
+			/*
+			 * posix_fallocate does not shrink allocations, adjust only on
+			 * allocation increase.
+			 */
+			if (request_size > *mapped_size)
+			{
+				pgstat_report_backend_mem_allocated_decrease(*mapped_size);
+				pgstat_report_backend_mem_allocated_increase(request_size);
+			}
+#else
+			pgstat_report_backend_mem_allocated_decrease(*mapped_size);
+			pgstat_report_backend_mem_allocated_increase(request_size);
+#endif
+		}
 		*mapped_address = address;
 		*mapped_size = request_size;
 		close(fd);
@@ -537,6 +575,14 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size,
 						name)));
 			return false;
 		}
+
+		/*
+		 * Detach and destroy pass through here, only decrease the memory
+		 * shown allocated in pgstat_activity when the creator destroys the
+		 * allocation.
+		 */
+		if (op == DSM_OP_DESTROY)
+			pgstat_report_backend_mem_allocated_decrease(*mapped_size);
 		*mapped_address = NULL;
 		*mapped_size = 0;
 		if (op == DSM_OP_DESTROY && shmctl(ident, IPC_RMID, NULL) < 0)
@@ -584,6 +630,13 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size,
 						name)));
 			return false;
 		}
+
+		/*
+		 * Attach and create pass through here, only update backend memory
+		 * allocated in pgstat_activity for the creator process.
+		 */
+		if (op == DSM_OP_CREATE)
+
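The accounting rules the patch describes - increase on create, decrease on destroy, resize as decrease-then-increase, and a floor of zero for orphaned-segment cleanup - can be modeled in a few lines. This is a toy sketch whose helper names merely echo the patch's, not the patch itself:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the patch's per-backend accounting.  The counter is
 * clamped at zero because orphaned DSM segments may be destroyed at
 * postmaster startup, producing a decrease with no prior increase.
 */
static uint64_t backend_mem_allocated = 0;

static void
mem_allocated_increase(uint64_t bytes)
{
	backend_mem_allocated += bytes;
}

static void
mem_allocated_decrease(uint64_t bytes)
{
	if (bytes >= backend_mem_allocated)
		backend_mem_allocated = 0;	/* floor at zero */
	else
		backend_mem_allocated -= bytes;
}

/* Resize as the patch does it: drop the old size, add the new. */
static void
mem_allocated_resize(uint64_t old_size, uint64_t new_size)
{
	mem_allocated_decrease(old_size);
	mem_allocated_increase(new_size);
}
```

The decrease-then-increase ordering matters with an unsigned counter: subtracting the old size first means the transient value can never exceed the true total, and the clamp absorbs any unmatched decrease.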
Re: Tracking last scan time
On Tue, 30 Aug 2022 at 19:46, Bruce Momjian wrote:
> On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote:
> > On Thu, 25 Aug 2022 at 01:44, David Rowley wrote:
> > > I don't have a particular opinion about the patch, I'm just pointing
> > > out that there are other ways. Even just writing down the numbers on a
> > > post-it note and coming back in a month to see if they've changed is
> > > enough to tell if the table or index has been used.
> >
> > There are usually other ways to perform monitoring tasks, but there is
> > something to be said for the convenience of having functionality built
> > in and not having to rely on tools, scripts, or post-it notes :-)
>
> Should we consider using something cheaper like time() so we don't need
> a GUC to enable this?

Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls
take about 2 seconds, whilst 100,000,000 time() calls take 14(!) seconds.

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com
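A micro-benchmark along these lines reproduces the comparison (a sketch, not the code Dave ran; results vary widely by platform - on Linux with a vDSO both calls are typically fast, and macOS evidently behaves differently):

```c
#include <sys/time.h>
#include <time.h>

/* Difference between two gettimeofday() readings, in seconds. */
static double
elapsed_sec(struct timeval a, struct timeval b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

/* Seconds spent on `iters` back-to-back gettimeofday() calls. */
static double
bench_gettimeofday(long iters)
{
	struct timeval start, stop, tv;

	gettimeofday(&start, NULL);
	for (long i = 0; i < iters; i++)
		gettimeofday(&tv, NULL);
	gettimeofday(&stop, NULL);
	return elapsed_sec(start, stop);
}

/* Seconds spent on `iters` back-to-back time() calls. */
static double
bench_time(long iters)
{
	struct timeval start, stop;
	volatile time_t t;			/* volatile: keep the calls in the loop */

	gettimeofday(&start, NULL);
	for (long i = 0; i < iters; i++)
		t = time(NULL);
	gettimeofday(&stop, NULL);
	(void) t;
	return elapsed_sec(start, stop);
}
```

Running both with the same iteration count makes the per-call cost comparison direct, which is the basis of the numbers quoted above.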
Re: SQL/JSON features for v15
On 2022-08-30 Tu 17:25, Nikita Glukhov wrote:
>>>> Patches 0001-0006: Yeah, these add the overhead of an extra function
>>>> call (typin() -> typin_opt_error()) in possibly very common paths.
>>>> Other than refactoring *all* places that call typin() to use the new
>>>> API, the only other option seems to be to leave the typin() functions
>>>> alone and duplicate their code in typin_opt_error() versions for all
>>>> the types that this patch cares about. Though maybe, that's not
>>>> necessarily a better compromise than accepting the extra function
>>>> call overhead.
>>> I think another possibility is to create a static inline function in
>>> the corresponding .c module (say boolin_impl() in bool.c), which is
>>> called by both the opt_error variant as well as the regular one. This
>>> would avoid the duplicate code as well as the added function-call
>>> overhead.
>> +1
> I always thought about such internal inline functions, I've added them
> in v10.

A couple of questions about these:

1. Patch 5 changes the API of DecodeDateTime() and DecodeTimeOnly() by
adding an extra parameter bool *error. Would it be better to provide
_opt_error flavors of these?

2. Patch 6 changes jsonb_from_cstring so that it's no longer static
inline. Shouldn't we have a static inline function that can be called
from inside jsonb.c and is called by the extern function?

Changing both of these things would be quite trivial and should not hold
anything up.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
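The static-inline pattern proposed in the quoted text can be sketched in miniature (simplified and self-contained, not the actual bool.c code; the abort stands in for ereport()): one inline worker does the parsing, the historical interface throws on bad input, and the _opt_error variant reports failure through a flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Shared worker, inlined into both callers, so neither the code nor the
 * function-call overhead is duplicated.  Returns false on bad input.
 */
static inline bool
boolin_impl(const char *s, bool *result)
{
	if (strcmp(s, "true") == 0 || strcmp(s, "t") == 0)
	{
		*result = true;
		return true;
	}
	if (strcmp(s, "false") == 0 || strcmp(s, "f") == 0)
	{
		*result = false;
		return true;
	}
	return false;
}

/* Historical interface: aborts on bad input (stand-in for ereport()). */
static bool
boolin(const char *s)
{
	bool		result;

	if (!boolin_impl(s, &result))
		assert(!"invalid input syntax for type boolean");
	return result;
}

/* Soft-error interface: sets *error instead of throwing. */
static bool
boolin_opt_error(const char *s, bool *error)
{
	bool		result = false;

	*error = !boolin_impl(s, &result);
	return result;
}
```

Because the worker is static inline within one module, the compiler can fold it into both entry points, which is exactly the overhead argument made above.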