Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On Friday, March 8, 2024 at 14:36:48 CET, Tomas Vondra wrote:
> > My guess would be 8af25652489, as it's the only storage-related commit.
> >
> > I'm currently running tests to verify this.
>
> Yup, the breakage starts with this commit. I haven't looked into the
> root cause, or whether the commit maybe just made some pre-existing
> issue easier to hit. Also, I haven't followed the discussion on the
> pgsql-bugs thread [1], maybe there are some interesting findings.

If that happens only on HEAD and not on 16, and doesn't involve WAL replay,
then it's not the same bug.

--
Ronan Dunklau
Re: Provide a pg_truncate_freespacemap function
On Wednesday, March 6, 2024 at 20:28:44 CET, Stephen Frost wrote:
> I agree that this would generally be a useful thing to have.

Thanks!

> > Does that seem correct?
>
> Definitely needs to have a 'REVOKE ALL ON FUNCTION' at the end of the
> upgrade script, similar to what you'll find at the bottom of
> pg_visibility--1.1.sql in the tree today, otherwise anyone could run it.
>
> Beyond that, I'd suggest a function-level comment above the definition
> of the function itself (which is where we tend to put those - not at the
> point where we declare the function).

Thank you for the review. Here is an updated patch for both of those.

Best regards,

--
Ronan

>From 812061b47443ce1052f65ba2867223f9db2cfa18 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Fri, 1 Mar 2024 08:57:49 +0100
Subject: [PATCH] Provide a pg_truncate_freespacemap function.

Similar to the pg_truncate_visibilitymap function, this one allows
avoiding downtime to fix an FSM in case a corruption event happens.
---
 contrib/pg_freespacemap/Makefile              |  5 +-
 .../pg_freespacemap--1.2--1.3.sql             | 12
 contrib/pg_freespacemap/pg_freespacemap.c     | 64 +++
 .../pg_freespacemap/pg_freespacemap.control   |  2 +-
 4 files changed, 80 insertions(+), 3 deletions(-)
 create mode 100644 contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql

diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile
index b48e4b255bc..0f4b52f5812 100644
--- a/contrib/pg_freespacemap/Makefile
+++ b/contrib/pg_freespacemap/Makefile
@@ -6,8 +6,9 @@ OBJS = \
 	pg_freespacemap.o

 EXTENSION = pg_freespacemap
-DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \
-	pg_freespacemap--1.0--1.1.sql
+DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.2--1.3.sql \
+	pg_freespacemap--1.1--1.2.sql \
+	pg_freespacemap--1.0--1.1.sql

 PGFILEDESC = "pg_freespacemap - monitoring of free space map"

 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
new file mode 100644
index 000..cc94eccf2f6
--- /dev/null
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
@@ -0,0 +1,12 @@
+/* contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pg_truncate_freespace_map(regclass)
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_truncate_freespace_map'
+LANGUAGE C STRICT
+PARALLEL UNSAFE;
+
+REVOKE ALL ON FUNCTION pg_truncate_freespace_map(regclass) FROM PUBLIC;
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index b82cab2d97e..fa492c18534 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -9,8 +9,12 @@
 #include "postgres.h"

 #include "access/relation.h"
+#include "access/xloginsert.h"
+#include "catalog/storage_xlog.h"
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"

 PG_MODULE_MAGIC;

@@ -20,6 +24,8 @@ PG_MODULE_MAGIC;
  */
 PG_FUNCTION_INFO_V1(pg_freespace);

+PG_FUNCTION_INFO_V1(pg_truncate_freespace_map);
+
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
 {
@@ -40,3 +46,61 @@ pg_freespace(PG_FUNCTION_ARGS)
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+
+/*
+ * Remove the free space map for a relation.
+ *
+ * This is useful in case of corruption of the FSM, as the
+ * only other options are either triggering a VACUUM FULL or
+ * shutting down the server and removing the FSM on the filesystem.
+ */
+Datum
+pg_truncate_freespace_map(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	ForkNumber	fork;
+	BlockNumber	block;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	/* Only some relkinds have a freespace map */
+	if (!RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" is of wrong relation kind",
+						RelationGetRelationName(rel)),
+				 errdetail_relkind_not_supported(rel->rd_rel->relkind)));
+
+	/* Forcibly reset cached file size */
+	RelationGetSmgr(rel)->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
+
+	/* Just pretend we're going to wipe out the whole rel */
+	block = FreeSpaceMapPrepareTruncateRel(rel, 0);
+
+	if (BlockNumberIsValid(block))
+	{
+		fork = FSM_FORKNUM;
+		smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block);
+	}
+
+	if (RelationNeedsWAL(rel))
+	{
+		xl_smgr_truncate	xlrec;
+
+		xlrec.blkno = 0;
+		xlrec.rlocator = rel->rd_locator;
+		xlrec.flags = SMGR_TRUNCATE
Provide a pg_truncate_freespacemap function
Hello,

As we are currently experiencing an FSM corruption issue [1], we need to
rebuild the FSM when we detect it.

I noticed we have something to truncate a visibility map, but nothing for
the free space map, so I propose the attached (liberally copied from the VM
counterpart) to allow truncating an FSM without incurring downtime, as
currently our only options are to either VACUUM FULL the table or stop the
cluster and remove the FSM manually.

Does that seem correct?

[1] https://www.postgresql.org/message-id/flat/1925490.taCxCBeP46%40aivenlaptop#7ace95c95cab17b6d92607e5362984ac

--
Ronan Dunklau

>From b3e28e4542311094ee7177b39cc9cdf3672d1b55 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Fri, 1 Mar 2024 08:57:49 +0100
Subject: [PATCH] Provide a pg_truncate_freespacemap function.

Similar to the pg_truncate_visibilitymap function, this one allows
avoiding downtime to fix an FSM in case a corruption event happens.
---
 contrib/pg_freespacemap/Makefile              |  5 +-
 .../pg_freespacemap--1.2--1.3.sql             | 13 +
 contrib/pg_freespacemap/pg_freespacemap.c     | 58 +++
 .../pg_freespacemap/pg_freespacemap.control   |  2 +-
 4 files changed, 75 insertions(+), 3 deletions(-)
 create mode 100644 contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql

diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile
index b48e4b255bc..0f4b52f5812 100644
--- a/contrib/pg_freespacemap/Makefile
+++ b/contrib/pg_freespacemap/Makefile
@@ -6,8 +6,9 @@ OBJS = \
 	pg_freespacemap.o

 EXTENSION = pg_freespacemap
-DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \
-	pg_freespacemap--1.0--1.1.sql
+DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.2--1.3.sql \
+	pg_freespacemap--1.1--1.2.sql \
+	pg_freespacemap--1.0--1.1.sql

 PGFILEDESC = "pg_freespacemap - monitoring of free space map"

 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
new file mode 100644
index 000..1f25877a175
--- /dev/null
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
@@ -0,0 +1,13 @@
+/* contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pg_truncate_freespace_map(regclass)
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_truncate_freespace_map'
+LANGUAGE C STRICT
+PARALLEL UNSAFE;
+
+
+
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index b82cab2d97e..17f6a631ad8 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -9,8 +9,12 @@
 #include "postgres.h"

 #include "access/relation.h"
+#include "access/xloginsert.h"
+#include "catalog/storage_xlog.h"
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"

 PG_MODULE_MAGIC;

@@ -20,6 +24,9 @@ PG_MODULE_MAGIC;
  */
 PG_FUNCTION_INFO_V1(pg_freespace);

+/* Truncate the free space map, in case of corruption */
+PG_FUNCTION_INFO_V1(pg_truncate_freespace_map);
+
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
 {
@@ -40,3 +47,54 @@ pg_freespace(PG_FUNCTION_ARGS)
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+
+Datum
+pg_truncate_freespace_map(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	ForkNumber	fork;
+	BlockNumber	block;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	/* Only some relkinds have a freespacemap */
+	if (!RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" is of wrong relation kind",
+						RelationGetRelationName(rel)),
+				 errdetail_relkind_not_supported(rel->rd_rel->relkind)));
+
+	/* Forcibly reset cached file size */
+	RelationGetSmgr(rel)->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
+
+	/* Just pretend we're going to wipe out the whole rel */
+	block = FreeSpaceMapPrepareTruncateRel(rel, 0);
+
+	if (BlockNumberIsValid(block))
+	{
+		fork = FSM_FORKNUM;
+		smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block);
+	}
+
+	if (RelationNeedsWAL(rel))
+	{
+		xl_smgr_truncate	xlrec;
+
+		xlrec.blkno = 0;
+		xlrec.rlocator = rel->rd_locator;
+		xlrec.flags = SMGR_TRUNCATE_FSM;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+		XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+	}
+
+	relation_close(rel, AccessExclusiveLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control
index ac8fc5050a9
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On Saturday, March 2, 2024 at 23:29:52 CET, Tomas Vondra wrote:
> These are "my" animals (running at a local university). There's a couple
> interesting details:

Hi Tomas, do you still have the failing cluster data? Noah pointed me to
this thread, and it looks a bit similar to the FSM corruption issue I'm
facing:
https://www.postgresql.org/message-id/1925490.taCxCBeP46%40aivenlaptop

So if you still have the data, it would be nice to see if you indeed have a
corrupted FSM, and if you have indications of when it happened.

Best regards,

--
Ronan Dunklau
Re: scalability bottlenecks with (many) partitions (and more)
On Monday, January 29, 2024 at 15:59:04 CET, Tomas Vondra wrote:
> I'm not sure work_mem is a good parameter to drive this. It doesn't say
> how much memory we expect the backend to use - it's a per-operation
> limit, so it doesn't work particularly well with partitioning (e.g. with
> 100 partitions, we may get 100 nodes, which is completely unrelated to
> what work_mem says). A backend running the join query with 1000
> partitions uses ~90MB (judging by data reported by the mempool), even
> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

I understand your point, I was basing my previous observations on what a
backend typically does during execution.

> The mempool could tell us how much memory we need (but we could track
> this in some other way too, probably). And we could even adjust the mmap
> parameters regularly, based on current workload.
>
> But then there's the problem that the mmap parameters don't tell us how
> much memory to keep, but how large chunks to release.
>
> Let's say we want to keep the 90MB (to allocate the memory once and then
> reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
> but then it takes just a little bit of extra memory to release all the
> memory, or something.

For doing this you can set M_TOP_PAD using glibc's malloc, which makes sure
a certain amount of memory is always kept.

But the way the dynamic adjustment works makes it sort of work like this
already. M_MMAP_THRESHOLD and M_TRIM_THRESHOLD start with low values,
meaning we don't expect to keep much memory around, so even "small" memory
allocations will be served using mmap at first. Once mmaped memory is
released, glibc considers it a benchmark for "normal" allocations that can
be routinely freed, and adjusts the mmap threshold to the released mmaped
region size, and the trim threshold to twice that.

It means over time the two values will converge either to the max value
(32MB for M_MMAP_THRESHOLD, 64MB for M_TRIM_THRESHOLD) or to something big
enough to accommodate your released memory, since anything bigger than half
the trim threshold will be allocated using mmap. Setting either parameter
explicitly disables that auto-adjustment.

But I'm not arguing against the mempool, just chiming in with glibc's
malloc tuning possibilities :-)
Re: scalability bottlenecks with (many) partitions (and more)
On Monday, January 29, 2024 at 13:17:07 CET, Tomas Vondra wrote:
> > Did you try running an strace on the process? That may give you some
> > insights into what malloc is doing. A more sophisticated approach would
> > be using stap and plugging it into the malloc probes, for example
> > memory_sbrk_more and memory_sbrk_less.
>
> No, I haven't tried that. In my experience strace is pretty expensive,
> and if the issue is in glibc itself (before it does the syscalls),
> strace won't really tell us much. Not sure, ofc.

It would tell you how malloc actually performs your allocations, and how
often they end up translated into syscalls. The main issue with glibc would
be that it releases the memory too aggressively to the OS, IMO.

> > An important part of glibc's malloc behaviour in that regard comes from
> > the adjustment of the mmap and trim thresholds. By default, malloc
> > adjusts them dynamically and you can poke into that using the
> > memory_mallopt_free_dyn_thresholds probe.
>
> Thanks, I'll take a look at that.
>
> >> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> >> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> >> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> >> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> >> way to tweak glibc to address this), and it also means systems using
> >> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> >> says the mempool has ~20% benefit on top of jemalloc.
> >
> > glibc's malloc offers some tuning for this. In particular, setting either
> > M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable
> > "auto adjustment" behaviour and allow you to control what it's doing.
> >
> > By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
> > using sbrk isn't freed as easily, and you don't run into a pattern of
> > moving the sbrk pointer up and down repeatedly. The automatic trade-off
> > between the mmap and trim thresholds is supposed to prevent that, but the
> > way it is incremented means you can end up in a bad place depending on
> > your particular allocation pattern.
>
> So, what values would you recommend for these parameters?
>
> My concern is increasing those values would lead to (much) higher memory
> usage, with little control over it. With the mempool we keep more
> blocks, ofc, but we have control over freeing the memory.

Right now, depending on your workload (especially if you use connection
pooling), you can end up with something like 32 or 64MB of dynamically
adjusted trim threshold, which will never be released back.

The first heuristic I had in mind was to set it to work_mem, up to a
"reasonable" limit I guess. One can argue that it is expected for a backend
to use work_mem frequently, and as such it shouldn't be released back. By
setting work_mem to a lower value, we could ask glibc at the same time to
trim the excess kept memory. That could be useful when a long-lived
connection is pooled, and sees a spike in memory usage only once. Currently
that could well end up with 32MB "wasted" permanently, but tuning it
ourselves could allow us to release it back.

Since it was last year I worked on this, I'm a bit fuzzy on the details,
but I hope this helps.
Re: scalability bottlenecks with (many) partitions (and more)
On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:

Hi Tomas!

I'll comment on the glibc-malloc part, as I studied it last year and
proposed some things here:
https://www.postgresql.org/message-id/3424675.QJadu78ljV%40aivenlaptop

> FWIW where does the malloc overhead come from? For one, while we do have
> some caching of malloc-ed memory in memory contexts, that doesn't quite
> work cross-query, because we destroy the contexts at the end of the
> query. We attempt to cache the memory contexts too, but in this case
> that can't help because the allocations come from btbeginscan() where we
> do this:
>
>     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>
> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
> thus always allocated using a separate malloc() call. Maybe we could
> break it into smaller/cacheable parts, but I haven't tried, and I doubt
> it's the only such allocation.

Did you try running an strace on the process? That may give you some
insights into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.

An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and trim thresholds. By default, malloc adjusts them
dynamically, and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.

> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> way to tweak glibc to address this), and it also means systems using
> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> says the mempool has ~20% benefit on top of jemalloc.

glibc's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
using sbrk isn't freed as easily, and you don't run into a pattern of
moving the sbrk pointer up and down repeatedly. The automatic trade-off
between the mmap and trim thresholds is supposed to prevent that, but the
way it is incremented means you can end up in a bad place depending on your
particular allocation pattern.

Best regards,

--
Ronan Dunklau
Re: LLVM 16 (opaque pointers)
On Friday, October 13, 2023 at 22:32:13 CEST, Thomas Munro wrote:
> On Wed, Oct 11, 2023 at 10:31 PM Ronan Dunklau wrote:
> > On Wednesday, October 11, 2023 at 10:59:50 CEST, Thomas Munro wrote:
> > > The back-patch to 12 was a little trickier than anticipated, but after
> > > taking a break and trying again I now have PG 12...17 patches that
> > > I've tested against LLVM 10...18 (that's 54 combinations), in every
> > > case only with the clang corresponding to LLVM.
> >
> > Thank you Thomas for those patches, and the extensive testing, I will run
> > my own and let you know.
>
> Thanks! No news is good news, I hope? I'm hoping to commit this today.
>
> > > I've attached only the patches for master, but the 12-16 versions are
> > > available at https://github.com/macdice/postgres/tree/llvm16-$N in
> > > case anyone has comments on those.
> >
> > For PG13 and PG12, it looks like the ExecEvalBoolSubroutineTemplate is
> > not used anywhere, as ExecEvalBoolSubroutine was introduced in PG14 if
> > I'm not mistaken.
>
> Right, looks like I can remove that in those branches.

Oh sorry, I thought I had followed up. I ran the same stress testing,
involving several hours of sqlsmith with all jit costs set to zero, and
didn't notice anything with LLVM 16.

Thank you!

--
Ronan Dunklau
Re: LLVM 16 (opaque pointers)
On Wednesday, October 11, 2023 at 10:59:50 CEST, Thomas Munro wrote:
> The back-patch to 12 was a little trickier than anticipated, but after
> taking a break and trying again I now have PG 12...17 patches that
> I've tested against LLVM 10...18 (that's 54 combinations), in every
> case only with the clang corresponding to LLVM.

Thank you Thomas for those patches, and the extensive testing. I will run
my own and let you know.

> I've attached only the patches for master, but the 12-16 versions are
> available at https://github.com/macdice/postgres/tree/llvm16-$N in
> case anyone has comments on those.

For PG13 and PG12, it looks like ExecEvalBoolSubroutineTemplate is not used
anywhere, as ExecEvalBoolSubroutine was introduced in PG14, if I'm not
mistaken.

Best regards,

--
Ronan Dunklau
Re: LLVM 16 (opaque pointers)
On Sunday, May 21, 2023 at 05:01:41 CEST, Thomas Munro wrote:
> Hi,
>
> Here is a draft version of the long awaited patch to support LLVM 16.
> It's mostly mechanical donkeywork, but it took some time: this donkey
> found it quite hard to understand the mighty getelementptr
> instruction[1] and the code generation well enough to figure out all
> the right types, and small mistakes took some debugging effort*. I
> now finally have a patch that passes all tests.
>
> Though it's not quite ready yet, I thought I should give this status
> update to report that the main task is more or less complete, since
> we're starting to get quite a few emails about it (mostly from Fedora
> users) and there is an entry for it on the Open Items for 16 wiki
> page. Comments/review/testing welcome.

Hello Thomas,

Thank you for this effort! I've tested it against LLVM 15 and 16, and found
no problem with it.

> 6. I need to go through the types again with a fine tooth comb, and
> check the test coverage to look out for eg GEP array arithmetic with
> the wrong type/size that isn't being exercised.

I haven't gone through the test coverage myself, but I exercised the
following things:
 - running make installcheck with jit_above_cost = 0
 - letting sqlsmith hammer random queries at it for a few hours.

This didn't show obvious issues.

> *For anyone working with this type of IR generation code and
> questioning their sanity, I can pass on some excellent advice I got
> from Andres: build LLVM yourself with assertions enabled, as they
> catch some classes of silly mistake that otherwise just segfault
> inscrutably on execution.

I tried my hand at backporting it to previous versions, and not knowing
anything about it made me indeed question my sanity. It's quite easy for PG
15, 14, and 13. PG 12 is nothing insurmountable either, but PG 11 is a bit
hairier, most notably due to the change in the fcinfo args representation.
But I guess that's also a topic for another day :-)

Best regards,

--
Ronan Dunklau
Re: Add GUC to tune glibc's malloc implementation.
On Tuesday, June 27, 2023 at 20:17:46 CEST, Andres Freund wrote:
> > Yes this is probably much more appropriate, but a much larger change with
> > greater risks of regression. Especially as we have to make sure we're not
> > overfitting our own code for a specific malloc implementation, to the
> > detriment of others.
>
> I think your approach is fundamentally overfitting our code to a specific
> malloc implementation, in a way that's not tunable by mere mortals. It just
> seems like a dead end to me.

I see it as a way to have *some* sort of control over the malloc
implementation we use, instead of tuning our allocation patterns on top of
it while treating it entirely as a black box. As for the tuning, I proposed
earlier to replace this parameter expressed in terms of size with a
"profile" (greedy / conservative) to make it easier to pick a sensible
value.

> > Except if you hinted we should write our own directly instead?
>
> I don't think we should write our own malloc - we don't rely on it much
> ourselves. And if we replace it, we need to care about malloc's performance
> characteristics a whole lot, because various libraries etc. do heavily rely
> on it.
>
> However, I do think we should eventually avoid using malloc() for aset.c
> et al. malloc() is a general allocator, but at least for allocations below
> maxBlockSize aset.c doesn't do allocations in a way that really benefits
> from that *at all*. It's not a lot of work to do such allocations on our
> own.
>
> > > We e.g. could keep a larger number of memory blocks reserved
> > > ourselves. Possibly by delaying the release of additionally held blocks
> > > until we have been idle for a few seconds or such.
> >
> > I think keeping work_mem around after it has been used a couple of times
> > makes sense. This is the memory a user is willing to dedicate to
> > operations, after all.
>
> The biggest overhead of returning pages to the kernel is that that
> triggers zeroing the data during the next allocation. Particularly on
> multi-node servers that's surprisingly slow. It's most commonly not the
> brk() or mmap() themselves that are the performance issue.
>
> Indeed, with your benchmark, I see that most of the time, on my dual Xeon
> Gold 5215 workstation, is spent zeroing newly allocated pages during page
> faults. That microarchitecture is worse at this than some others, but it's
> never free (or cache friendly).

I'm not sure I see the practical difference between those, but that's
interesting. Were you able to reproduce my results?

> FWIW, in my experience trimming the brk()ed region doesn't work reliably
> enough in real world postgres workloads to be worth relying on (from a
> memory usage POV). Sooner or later you're going to have longer lived
> allocations placed that will prevent it from happening.

I'm not sure I follow: given our workload is clearly split at query and
transaction boundaries, releasing memory at that time, I've assumed (and
noticed in practice, albeit not on a production system) that most memory at
the top of the heap would be trimmable, as we don't keep much in between
queries / transactions.

> I have played around with telling aset.c that certain contexts are long
> lived and using mmap() for those, to make it more likely that the libc
> malloc/free can actually return memory to the system. I think that can be
> quite worthwhile.

So if I understand your different suggestions, we should:
 - use mmap ourselves for what we deem to be "one-off" allocations, to make
   sure that memory is not hanging around after we no longer use it;
 - keep some pool allocated which will not be freed in between queries, but
   reused the next time we need it.

Thank you for looking at this problem.

Regards,

--
Ronan Dunklau
Re: Add GUC to tune glibc's malloc implementation.
On Tuesday, June 27, 2023 at 08:35:28 CEST, Ronan Dunklau wrote:
> I re-attached the simple script I used. I've run this script with different
> values for glibc_malloc_max_trim_threshold.

I forgot to add that it was using default parameters, except for work_mem,
set to 32MB, and max_parallel_workers_per_gather, set to zero.
Re: Add GUC to tune glibc's malloc implementation.
On Monday, June 26, 2023 at 23:03:48 CEST, Andres Freund wrote:
> Hi,
>
> On 2023-06-26 08:38:35 +0200, Ronan Dunklau wrote:
> > I hope what I'm trying to achieve is clearer that way. Maybe this patch
> > is not the best way to go about this, but since the memory allocator
> > behaviour can have such an impact it's a bit sad we have to leave half
> > the performance on the table because of it when there are easily
> > accessible knobs to avoid it.
>
> I'm *quite* doubtful this patch is the way to go. If we want to more
> tightly control memory allocation patterns, because we have more
> information than glibc, we should do that, rather than try to nudge
> glibc's malloc in random directions. In contrast to a generic malloc()
> implementation, we can have much more information about memory lifetimes
> etc. due to memory contexts.

Yes, this is probably much more appropriate, but a much larger change with
greater risks of regression. Especially as we have to make sure we're not
overfitting our own code for a specific malloc implementation, to the
detriment of others. Except if you hinted we should write our own directly
instead?

> We e.g. could keep a larger number of memory blocks reserved
> ourselves. Possibly by delaying the release of additionally held blocks
> until we have been idle for a few seconds or such.

I think keeping work_mem around after it has been used a couple of times
makes sense. This is the memory a user is willing to dedicate to
operations, after all.

> WRT the difference in TPS in the benchmark you mention - I suspect that
> we are doing something bad that needs to be improved regardless of the
> underlying memory allocator implementation. Due to the lack of detailed
> instructions I couldn't reproduce the results immediately.

I re-attached the simple script I used. I've run this script with different
values for glibc_malloc_max_trim_threshold.

Best regards,

--
Ronan Dunklau

Attachment: bench.sh (application/shellscript)
Re: Add GUC to tune glibc's malloc implementation.
On Friday, June 23, 2023 at 22:55:51 CEST, Peter Eisentraut wrote:
> On 22.06.23 15:35, Ronan Dunklau wrote:
> > The thing is, by default, those parameters are adjusted dynamically by
> > glibc itself. It starts with quite small thresholds, and raises them
> > when the program frees some memory, up to a certain limit. This patch
> > proposes a new GUC allowing the user to adjust those settings according
> > to their workload.
> >
> > This can cause problems. Let's take for example a table with 10k rows,
> > and 32 columns (as defined by a bench script David Rowley shared last
> > year when discussing the GenerationContext for tuplesort), and execute
> > the following query, with 32MB of work_mem:
>
> I don't follow what you are trying to achieve with this. The examples
> you show appear to work sensibly in my mind. Using this setting, you
> can save some of the adjustments that glibc does after the first query.
> But that seems only useful if your session only does one query. Is that
> what you are doing?

No, not at all: glibc does not do the right thing, we don't "save" it. I
will try to rephrase that.

In the first test case I showed, we see that glibc adjusts its threshold,
but to a suboptimal value, since repeated executions of a query needing the
same amount of memory will release it back to the kernel, move the brk
pointer again, and not adjust the threshold again. On the other hand, by
manually adjusting the thresholds, we can set them to a higher value, which
means that the memory will be kept in malloc's free lists for reuse by the
next queries. As shown in the benchmark results I posted, this can have
quite a dramatic effect, going from 396 tps to 894. For ease of
benchmarking, it is a single query being executed over and over again, but
the same thing would be true if different queries allocating memory were
executed by a single backend.

The worst part is that it is unpredictable: depending on past memory
allocation patterns, glibc will end up in different states, and exhibit
completely different performance for all subsequent queries. In fact, this
is what Tomas noticed last year (see [0]), which led to the investigation
into this.

I also tried to show that for certain cases glibc's behaviour can, on the
contrary, be too greedy, and hold on to too much memory if we just need the
memory once and never allocate it again.

I hope what I'm trying to achieve is clearer that way. Maybe this patch is
not the best way to go about this, but since the memory allocator behaviour
can have such an impact, it's a bit sad we have to leave half the
performance on the table because of it when there are easily accessible
knobs to avoid it.

[0] https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com
Re: Add GUC to tune glibc's malloc implementation.
Le jeudi 22 juin 2023, 15:49:36 CEST Tom Lane a écrit : > This seems like a pretty awful idea, mainly because there's no way > to have such a GUC mean anything on non-glibc platforms, which is > going to cause confusion or worse.

I named the GUC glibc_malloc_max_trim_threshold; I hope this is enough to clear up the confusion. We already have at least event_source, which is Windows-specific even if it's not clear from the name.

> > Aren't these same settings controllable via environment variables? > I could see adding some docs suggesting that you set thus-and-such > values in the postmaster's startup script. Admittedly, the confusion > argument is perhaps still raisable; but we have a similar docs section > discussing controlling Linux OOM behavior, and I've not heard much > complaints about that.

Yes they are, but controlling them via an environment variable for the whole cluster defeats the point: different backends have different workloads, and being able to make sure for example the OLAP user is memory-greedy while the OLTP one is as conservative as possible is a worthwhile goal. Or even a specific backend may want to raise its work_mem and adapt glibc behaviour accordingly, then get back to being conservative with memory until the next such transaction.

Regards,

--
Ronan Dunklau
Add GUC to tune glibc's malloc implementation.
, and a second run will allocate it on the heap instead. So if we run the query twice, we end up with some memory in malloc's free lists that we may never use again. Using the new GUC, we can actually control whether it will be given back to the OS by setting a small value for the threshold. I attached the results of the 10k rows / 32 columns / 32MB work_mem benchmark with different values for glibc_malloc_max_trim_threshold.

I don't know how to write a test for this new feature, so let me know if you have suggestions. Documentation is not written yet, as I expect discussion on this thread to lead to significant changes on the user-visible GUC or GUCs:
- should we provide one for trim which also adjusts mmap_threshold (current patch), or several GUCs ?
- should this be simplified to only offer the default behaviour (glibc takes care of the threshold) and some presets ("greedy", to set trim_threshold to work_mem, "frugal" to set it to a really small value) ?

Best regards,

--
Ronan Dunklau

>From 3686e660446facfb2d64683286176887914cd9fd Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 20 Jun 2023 16:17:32 +0200 Subject: [PATCH v1] Add options to tune malloc. Add a new GUC glibc_malloc_max_trim_threshold which is used only when being compiled against glibc. This GUC adjusts the glibc's malloc options M_MMAP_THRESHOLD and M_TRIM_THRESHOLD. The M_MMAP_THRESHOLD will be set to half M_TRIM_THRESHOLD. We cap the value to the current work_mem, as it doesn't make much sense to reserve more memory than that. This new GUC allows the user to control how aggressively malloc will actually return memory from the heap to the kernel, and when it should start using mmap for bigger allocations. Depending on the workload, a user can reduce the residual footprint of a long session (by reducing the trim threshold to a fairly low value) or on the contrary make sure that malloc keeps a sizable free chunk at the end of the heap for future allocations.
On some benchmarks heavily dependent on memory allocations, like sorts, this can dramatically influence performance. --- src/backend/utils/init/postinit.c | 6 + src/backend/utils/misc/guc_tables.c | 16 ++- src/backend/utils/misc/postgresql.conf.sample | 1 + src/backend/utils/mmgr/Makefile | 1 + src/backend/utils/mmgr/malloc_tuning.c| 122 ++ src/backend/utils/mmgr/meson.build| 1 + src/include/utils/guc_hooks.h | 2 + src/include/utils/memutils.h | 10 ++ 8 files changed, 158 insertions(+), 1 deletion(-) create mode 100644 src/backend/utils/mmgr/malloc_tuning.c diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 561bd13ed2..707f20606d 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -808,6 +808,12 @@ InitPostgres(const char *in_dbname, Oid dboid, InitCatalogCache(); InitPlanCache(); + /* Adjust malloc options if needed. + * This is done here because the implementation can vary depending on the + * type of backend. + */ + MallocAdjustSettings(); + /* Initialize portal manager */ EnablePortalManager(); diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 71e27f8eb0..f1e03dc306 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2338,7 +2338,7 @@ struct config_int ConfigureNamesInt[] = }, _mem, 4096, 64, MAX_KILOBYTES, - NULL, NULL, NULL + NULL, _work_mem, NULL }, { @@ -2364,6 +2364,20 @@ struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"glibc_malloc_max_trim_threshold", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum value for glibc's M_TRIM_THRESHOLD option."), + gettext_noop("This controls how much memory glibc's will not return to the " + "OS once freed. An idle backend can thus consume that much memory " + "even if not in used. 
The default (-1) value disable static tuning " + "and relies on the default dynamic adjustment"), + GUC_UNIT_KB + }, + _malloc_max_trim_threshold, + -1, -1, MAX_KILOBYTES, + NULL, _glibc_trim_threshold, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e4c0269fa3..dc70a1dfa3 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -157,6 +157,7 @@ # windows # mmap # (change requires restart) +#glibc_malloc_max_trim_threshold = -1 # Only used with glibc's malloc. -1 disable it #min_dynamic_shared_memory = 0MB # (change requires restart) #vacuum_buffer_usage_limit = 256kB # size of
Re: Use generation context to speed up tuplesorts
Le dimanche 18 juin 2023, 20:22:17 CEST Tomas Vondra a écrit : > Hi Ronan, > > We briefly chatted about the glibc-tuning part of this thread at pgcon, > so I wonder if you're still planning to pursue that. If you do, I > suggest we start a fresh thread, so that it's not mixed with the already > committed improvements of generation context. > > I wonder what's the situation with the generation context improvements > already pushed - does setting the glibc thresholds still help, and if > yes how much? Hi Tomas ! I'm currently working on it, trying to evaluate the possible benefits on the current ASet and Generation contexts usecases. I'll make sure to use a new thread to post my findings. Thank you for remembering ! Best regards, -- Ronan Dunklau
Re: Exclusion constraints on partitioned tables
Le vendredi 17 mars 2023, 17:03:09 CET Paul Jungwirth a écrit : > I added the code about RTEqualStrategyNumber because that's what we need > to find an equals operator when the index is GiST (except if it's using > an opclass from btree_gist; then it needs to be BTEqual again). But then > I realized that for exclusion constraints we have already figured out > the operator (in RelationGetExclusionInfo) and put it in > indexInfo->ii_ExclusionOps. So we can just compare against that. This > works whether your index uses btree_gist or not. > > Here is an updated patch with that change (also rebased). Thanks ! This looks fine to me like this. > > I also included a more specific error message. If we find a matching > column in the index but with the wrong operator, we should say so, and > not say there is no matching column. > I agree that's a nicer improvement. Regards, -- Ronan Dunklau
Re: Allow ordered partition scans in more cases
Thank you for improving this optimization !

Le mardi 21 février 2023, 04:14:02 CET David Rowley a écrit : > I still need to look to see if there's some small amount of data that > can be loaded into the table to help coax the planner into producing > the ordered scan for this one. It works fine as-is for ORDER BY a,b > and ORDER BY a; so I've put tests in for that.

I haven't looked too deeply into it, but it seems reasonable that the whole sort would cost less than individual sorts on partitions + incremental sorts, except when the whole sort would spill to disk much more than the incremental ones. I find it quite difficult to reason about what that threshold should be, but I managed to find a case which could fit in a test:

create table range_parted (a int, b int, c int) partition by range(a, b);
create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
insert into range_parted(a, b, c) select i, j, k from generate_series(1, 19) i, generate_series(1, 19) j, generate_series(1, 5) k;
analyze range_parted;
set random_page_cost = 10;
set work_mem = '64kB';
explain (costs off) select * from range_parted order by a,b,c;

It's quite convoluted, because it needs the following:
- estimate the individual partition sorts to fit into work_mem (even if that's not the case here at runtime)
- estimate the whole table sort to not fit into work_mem
- the difference between the two should be big enough to compensate for the incremental sort penalty (hence raising random_page_cost).

This is completely tangential to the subject at hand, but maybe we have improvements to do with the way we estimate what type of sort will be performed ? It seems to underestimate the memory amount needed. I'm not sure it makes a real difference in real use cases though.

Regards,

--
Ronan Dunklau
SQLFunctionCache and generic plans
Hello,

It has been brought to my attention that SQL functions always use generic plans. Take this function for example:

create or replace function test_plpgsql(p1 oid) returns text as $$
BEGIN
RETURN (SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1);
END;
$$ language plpgsql;

As expected, the PlanCache takes care of generating parameter-specific plans, and correctly prunes the redundant OR depending on whether we call the function with a NULL value or not:

ro=# select test_plpgsql(NULL);
LOG: duration: 0.030 ms plan:
Query Text: (SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1)
Result (cost=0.04..0.05 rows=1 width=64)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.00..0.04 rows=1 width=64)
         -> Seq Scan on pg_class (cost=0.00..18.12 rows=412 width=64)
LOG: duration: 0.662 ms plan:
Query Text: select test_plpgsql(NULL);
Result (cost=0.00..0.26 rows=1 width=32)

ro=# select test_plpgsql(1);
LOG: duration: 0.075 ms plan:
Query Text: (SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1)
Result (cost=8.29..8.30 rows=1 width=64)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.27..8.29 rows=1 width=64)
         -> Index Scan using pg_class_oid_index on pg_class (cost=0.27..8.29 rows=1 width=64)
              Index Cond: (oid = '1'::oid)
LOG: duration: 0.675 ms plan:
Query Text: select test_plpgsql(1);
Result (cost=0.00..0.26 rows=1 width=32)

But writing the same function in SQL:

create or replace function test_sql(p1 oid) returns text as $$
SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1
$$ language sql;

we end up with a generic plan:

ro=# select test_sql(1);
LOG: duration: 0.287 ms plan:
Query Text: SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1
Query Parameters: $1 = '1'
Limit (cost=0.00..6.39 rows=1 width=32)
  -> Seq Scan on pg_class (cost=0.00..19.16 rows=3 width=32)
       Filter: ((oid = $1) OR ($1 IS NULL))

This is due to the fact that SQL functions are planned once for the whole query using a specific SQLFunctionCache instead
of using the whole PlanCache machinery. The following comment can be found in functions.c, about the SQLFunctionCache:

 * Note that currently this has only the lifespan of the calling query.
 * Someday we should rewrite this code to use plancache.c to save parse/plan
 * results for longer than that.

I would be interested in working on this, primarily to avoid this problem of having generic query plans for SQL functions, but maybe having a longer-lived cache as well would be nice to have. Is there any reason not to, or pitfalls we would like to avoid ?

Best regards,

--
Ronan Dunklau
Re: Exclusion constraints on partitioned tables
Le vendredi 16 décembre 2022, 06:11:49 CET Paul Jungwirth a écrit : > On 12/15/22 16:12, Tom Lane wrote: > >> This patch also requires the matching constraint columns to use equality > >> comparisons (`(foo WITH =)`), so it is really equivalent to the existing > >> b-tree rule. > > > > That's not quite good enough: you'd better enforce that it's the same > > equality operator (and same collation, if relevant) as is being used > > in the partition key. > > [snip] > > It might work better to consider the operator itself and ask if > > it's equality in the same btree opfamily that's used by the > > partition key. > > Thank you for taking a look! Here is a comparison on just the operator > itself. > I've taken a look at the patch, and I'm not sure why you keep the restriction on the Gist operator being of the RTEqualStrategyNumber strategy. I don't think we have any other place where we expect those strategy numbers to match. For hash it's different, as the hash-equality is the only operator strategy and as such there is no other way to look at it. Can't we just enforce partition_operator == exclusion_operator without adding the RTEqualStrategyNumber for the opfamily into the mix ?
Re: Ordering behavior for aggregates
Le mardi 13 décembre 2022, 16:13:34 CET Tom Lane a écrit : > Accordingly, I find nothing at all attractive in this proposal. > I think the main thing it'd accomplish is to drive users back to > the bad old days of ordering-by-subquery, if they have a requirement > we failed to account for. I think the ability to mark certain aggregates as being able to completely ignore the ordering because they produce exactly the same results is still a useful optimization. -- Ronan Dunklau
Re: Ordering behavior for aggregates
Le mardi 13 décembre 2022, 14:05:10 CET Vik Fearing a écrit : > On 12/13/22 13:55, Magnus Hagander wrote: > > On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing wrote: > >> However, it is completely useless for things like AVG() or SUM(). If > >> you include it, the aggregate will do the sort even though it is neither > >> required nor desired.

I'm not sure about this. For AVG and SUM, if you want reproducible results with floating point numbers, you may want it. And if you disallow it for most avg and sum implementations except for floating point types, it's not a very consistent user experience.

> >> > >> I am proposing something like pg_aggregate.aggordering which would be an > >> enum of behaviors such as f=Forbidden, a=Allowed, r=Required. Currently > >> all aggregates would have 'a' but I am thinking that a lot of them could > >> be switched to 'f'. In that case, if a user supplies an ordering, an > >> error is raised. > > > > Should there perhaps also be an option for "ignored" where we'd allow the > > user to specify it, but not actually do the sort because we know it's > > pointless? Or maybe that should be the behaviour of "forbidden", which > > should then perhaps have a different name? > > I did think about that but I can't think of any reason we would want to > silently ignore something the user has written. If the ordering doesn't > make sense, we should forbid it.

It is allowed as of now, and so it would be a compatibility issue for queries existing in the wild. Ignoring it is just an optimization, just as we optimize away some joins entirely.

--
Ronan Dunklau
Re: Fix gin index cost estimation
Le vendredi 2 décembre 2022, 13:58:27 CET Ronan Dunklau a écrit : > Le vendredi 2 décembre 2022, 12:33:33 CET Alexander Korotkov a écrit : > > Hi, Ronan! > > Thank you for your patch. Couple of quick questions. > > 1) What magic number 50.0 stands for? I think we at least should make > > it a macro. > > This is what is used in other tree-descending estimation functions, so I > used that too. Maybe a DEFAULT_PAGE_CPU_COST macro would work for both ? If > so I'll separate this into two patches, one introducing the macro for the > other estimation functions, and this patch for gin.

The 0001 patch does this.

> > > 2) "We only charge one data page for the startup cost" – should this > > be dependent on number of search entries?

In fact there was another problem. The current code estimates two different paths for fetching data pages: in the case of a partial match, it takes into account that all the data pages will have to be fetched. So this is now taken into account for the CPU cost as well. For the regular search, we scale the number of data pages by the number of search entries.

Best regards,

--
Ronan Dunklau

>From b3d03a96ecb3d03c4c212e3c434def6e0e77fcd4 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 6 Dec 2022 11:08:46 +0100 Subject: [PATCH v4 1/2] Extract the cpu page-process cost multiplier into a macro. Btree, GiST and SP-GiST all charge 50.0 * cpu_operator_cost for processing an index page. Extract this to a macro to avoid repeated magic numbers.
--- src/backend/utils/adt/selfuncs.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index f116924d3c..5baf8cf631 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -140,6 +140,7 @@ #include "utils/timestamp.h" #include "utils/typcache.h" +#define DEFAULT_PAGE_CPU_MULTIPLIER 50.0 /* Hooks for plugins to get control when we ask for stats */ get_relation_stats_hook_type get_relation_stats_hook = NULL; @@ -6858,7 +6859,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count, * touched. The number of such pages is btree tree height plus one (ie, * we charge for the leaf page too). As above, charge once per SA scan. */ - descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost; + descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost; costs.indexStartupCost += descentCost; costs.indexTotalCost += costs.num_sa_scans * descentCost; @@ -7053,7 +7054,7 @@ gistcostestimate(PlannerInfo *root, IndexPath *path, double loop_count, /* * Likewise add a per-page charge, calculated the same as for btrees. */ - descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost; + descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost; costs.indexStartupCost += descentCost; costs.indexTotalCost += costs.num_sa_scans * descentCost; @@ -7108,7 +7109,7 @@ spgcostestimate(PlannerInfo *root, IndexPath *path, double loop_count, /* * Likewise add a per-page charge, calculated the same as for btrees. 
*/ - descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost; + descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost; costs.indexStartupCost += descentCost; costs.indexTotalCost += costs.num_sa_scans * descentCost; -- 2.38.1 >From 817f02fa5ace98b343e585568165cc2aef84328a Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 12 Sep 2022 15:40:18 +0200 Subject: [PATCH v4 2/2] Fix gin costing. GIN index scans were not taking any descent CPU-based cost into account. That made them look cheaper than other types of indexes when they shouldn't be. We use the same heuristic as for btree indexes, but multiplying it by the number of searched entries. Additionnally, the cpu cost for the tree was based largely on genericcostestimate. For a GIN index, we should not charge index quals per tuple, but per entry. On top of this, charge cpu_index_tuple_cost per actual tuple. This should fix the cases where a GIN index is preferred over a btree, and the ones where a memoize node is not added on top of the GIN index scan because it seemed too cheap. Per report of Hung Nguyen. --- src/backend/utils/adt/selfuncs.c | 54 +--- 1 file changed, 50 insertions(+), 4 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 5baf8cf631..0b78fe450b 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -7446,6 +7446,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, qual_arg_cost, spc_random_page_cost, outer_scans; + Cost descentCost; Relation indexRel; GinStatsData ginStats; ListCell *lc; @@ -767
Re: Fix gin index cost estimation
Le vendredi 2 décembre 2022, 12:33:33 CET Alexander Korotkov a écrit : > Hi, Ronan! > Thank you for your patch. Couple of quick questions. > 1) What magic number 50.0 stands for? I think we at least should make > it a macro. This is what is used in other tree-descending estimation functions, so I used that too. Maybe a DEFAULT_PAGE_CPU_COST macro would work for both ? If so I'll separate this into two patches, one introducing the macro for the other estimation functions, and this patch for gin. > 2) "We only charge one data page for the startup cost" – should this > be dependent on number of search entries? Good point, multiplying it by the number of search entries would do the trick. Thank you for looking at this ! Regards, -- Ronan Dunklau
Re: Fix gin index cost estimation
Le mardi 25 octobre 2022, 16:18:57 CET Tom Lane a écrit : > Alexander Korotkov writes: > > I think Tom's point was that it's wrong to add a separate entry-tree CPU > > cost estimation to another estimation, which tries (very inadequately) to > > estimate the whole scan cost. Instead, I propose writing better > > estimations > > for both entry-tree CPU cost and data-trees CPU cost and replacing > > existing > > CPU estimation altogether. > > Great idea, if someone is willing to do it ... > > regards, tom lane

Hello,

Sorry for the delay, but here is an updated patch which changes the costing in the following way:
- a descent cost, similar to the btree one, is charged for the initial entry-tree descent
- additionally, a charge is applied per page in both the entry tree and posting trees / lists
- instead of charging the quals to each tuple, charge them per entry only. We still charge cpu_index_tuple_cost per tuple though.

With those changes, there is no need to tweak the magic number formula estimating the number of pages. Maybe we can come up with something better for estimating those later on ?

Best regards,

--
Ronan Dunklau

>From 49bc71f25967f9367196803be363509922aac8ab Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 12 Sep 2022 15:40:18 +0200 Subject: [PATCH v3] Fix gin costing. GIN index scans were not taking any descent CPU-based cost into account. That made them look cheaper than other types of indexes when they shouldn't be. We use the same heuristic as for btree indexes, but multiplying it by the number of searched entries. Additionally, the cpu cost for the tree was based largely on genericcostestimate. For a GIN index, we should not charge index quals per tuple, but per entry. On top of this, charge cpu_index_tuple_cost per actual tuple. This should fix the cases where a GIN index is preferred over a btree, and the ones where a memoize node is not added on top of the GIN index scan because it seemed too cheap. Per report of Hung Nguyen.
--- src/backend/utils/adt/selfuncs.c | 50 +--- 1 file changed, 46 insertions(+), 4 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index f116924d3c..1d07ae8e95 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -7445,6 +7445,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, qual_arg_cost, spc_random_page_cost, outer_scans; + Cost descentCost; Relation indexRel; GinStatsData ginStats; ListCell *lc; @@ -7669,6 +7670,41 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, */ dataPagesFetched = ceil(numDataPages * partialScale); + *indexStartupCost = 0; + *indexTotalCost = 0; + + /* + * Add a CPU-cost component to represent the costs of initial entry btree + * descent. We don't charge any I/O cost for touching upper btree levels, + * since they tend to stay in cache, but we still have to do about log2(N) + * comparisons to descend a btree of N leaf tuples. We charge one + * cpu_operator_cost per comparison. + * + * If there are ScalarArrayOpExprs, charge this once per SA scan. The + * ones after the first one are not startup cost so far as the overall + * plan is concerned, so add them only to "total" cost. + */ + if (numEntries > 1) /* avoid computing log(0) */ + { + descentCost = ceil(log(numEntries) / log(2.0)) * cpu_operator_cost; + *indexStartupCost += descentCost * counts.searchEntries; + *indexTotalCost += counts.arrayScans * descentCost * counts.searchEntries; + } + + /* + * Add a cpu cost per entry-page fetched. This is not amortized over a loop. + */ + *indexStartupCost += entryPagesFetched * 50.0 * cpu_operator_cost; + *indexTotalCost += entryPagesFetched * counts.arrayScans * 50.0 * cpu_operator_cost; + + /* + * Add a cpu cost per data-page fetched. This is also not amortized over a loop. + * We only charge one data page for the startup cost, and everything else to + * the total cost. 
+ */ + *indexStartupCost += 50.0 * cpu_operator_cost; + *indexTotalCost += dataPagesFetched * counts.arrayScans * 50.0 * cpu_operator_cost; + /* * Calculate cache effects if more than one scan due to nestloops or array * quals. The result is pro-rated per nestloop scan, but the array qual @@ -7692,7 +7728,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, * Here we use random page cost because logically-close pages could be far * apart on disk. */ - *indexStartupCost = (entryPagesFetched + dataPagesFetched) * spc_random_page_cost; + *indexStartupCost += (entryPagesFetched + dataPagesFetched) * spc_random_page_cost; /* * Now compute the number of data pages fetched during the scan. @@ -7720,6 +7756,9 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, if (dataPagesFetchedBySel > da
Re: Asynchronous execution support for Custom Scan
> IIUC, we already can use ctid in the where clause on the latest > PostgreSQL, can't we? Oh, sorry, I missed the TidRangeScan. My apologies for the noise. Best regards, -- Ronan Dunklau
Re: Asynchronous execution support for Custom Scan
Le mardi 6 septembre 2022, 11:29:55 CET Etsuro Fujita a écrit : > On Mon, Sep 5, 2022 at 10:32 PM Kazutaka Onishi wrote: > > I'm sorry for my error on your name... > > No problem. > > > > IIUC, it uses the proposed > > > > > > APIs, but actually executes ctidscans *synchronously*, so it does not > > > improve performance. Right? > > > > Exactly. > > The actual CustomScan that supports asynchronous execution will > > start processing in CustomScanAsyncRequest, > > configure to detect completion via file descriptor in > > CustomScanAsyncConfigureWait, > > and receive the result in CustomScanAsyncNotify. > > Ok, thanks!

Thanks for this patch; it seems like a useful addition to the CustomScan API. Just to nitpick: there are extraneous tabs in createplan.c on a blank line.

Sorry for the digression, but I know your ctidscan module had been proposed for inclusion in contrib a long time ago, and I wonder if the rationale for not including it could have changed. We still don't have tests which cover CustomScan, and I can think of at least a few use cases where this customscan is helpful and not merely testing code. One of those use cases is when performing migrations on a table, and one wants to update the whole table by filling a new column with a computed value. You obviously don't want to do it in a single transaction, so you end up batching updates using an index looking for null values. If you want to do this, it's much faster to update rows in a range of blocks, performing a first series of batches updating all such block ranges, and then finally update the ones we missed transactionally (inserted in a block we already processed while in the middle of the batch, or in new blocks resulting from a relation extension).

Best regards,

--
Ronan Dunklau
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Le mardi 8 novembre 2022, 02:31:12 CET David Rowley a écrit : > 1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path > when we can add an Incremental Sort path instead. This removes quite a > few redundant lines of code.

This seems sensible.

> 2. Removes the * 1.5 fuzz-factor in cost_incremental_sort() > 3. Does various other code tidy stuff in cost_incremental_sort(). > 4. Removes the test from incremental_sort.sql that was ensuring the > inferior Sort -> Sort plan was being used instead of the superior Sort > -> Incremental Sort plan. > > I'm not really that 100% confident in the removal of the * 1.5 thing. > I wonder if there's some reason we're not considering that might cause > a performance regression if we're to remove it.

I'm not sure about it either. It seems to me that we were afraid of regressions, and having this overcharged just made us miss a new optimization without changing existing plans. With ordered aggregates, the balance is a bit trickier, and we are at risk of either regressing on aggregate plans, or on more common ordered ones.

--
Ronan Dunklau
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Le lundi 7 novembre 2022, 17:58:50 CET Pavel Luzanov a écrit : > > I can't see why an incrementalsort could be more expensive than a full > > sort, using the same presorted path. > > The only reason I can see is the number of buffers to read. In the plan > with incremental sort we read the whole index, ~19 buffers. > And the plan with seq scan only required ~5000 (I think due to buffer > ring optimization). What I meant here is that disabling seqscans, the planner still chooses a full sort over a partial sort. The underlying index is the same, it is just a matter of choosing a Sort node over an IncrementalSort node. This, I think, is wrong: I can't see how it could be worse to use an incrementalsort in that case. It makes sense to prefer a SeqScan over an IndexScan if you are going to sort the whole table anyway. But in that case we shouldn't. What happened before is that some sort of incremental sort was always chosen, because it was hidden as an implementation detail of the agg node. But now it has to compete on a cost basis with the full sort, and that costing is wrong in that case. Maybe the original costing code for incremental sort was a bit too pessimistic. -- Ronan Dunklau
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Le samedi 5 novembre 2022, 09:51:23 CET Pavel Luzanov a écrit : > While playing with the patch I found a situation where the performance > may be degraded compared to previous versions. > > The test case below. > If you create a proper index for the query (a,c), version 16 wins. On my > notebook, the query runs ~50% faster. > But if there is no index (a,c), but only (a,b), in previous versions the > planner uses it, but with this patch a full table scan is selected.

Hello,

In your exact use case, the combo incremental sort + index scan is evaluated to cost more than doing a full sort + seqscan. If you try for example to create an index on (b, a) and group by b, you will get the expected behaviour:

ro=# create index on t (b, a);
CREATE INDEX
ro=# explain select b, array_agg(c order by c) from t group by b;
 QUERY PLAN
-
 GroupAggregate (cost=10.64..120926.80 rows=9970 width=36)
   Group Key: b
   -> Incremental Sort (cost=10.64..115802.17 rows=100 width=6)
        Sort Key: b, c
        Presorted Key: b
        -> Index Scan using t_b_a_idx on t (cost=0.42..47604.12 rows=100 width=6)
(6 rows)

I think we can trace that back to incremental sort being pessimistic about its performance. If you try the same query, but with set enable_seqscan = off, you will get a full sort over an index scan:

 QUERY PLAN
-
 GroupAggregate (cost=154944.94..162446.19 rows=100 width=34)
   Group Key: a
   -> Sort (cost=154944.94..157444.94 rows=100 width=4)
        Sort Key: a, c
        -> Index Scan using t_a_b_idx on t (cost=0.42..41612.60 rows=100 width=4)
(5 rows)

This probably comes from the overly pessimistic assumption that the number of tuples per group will be 1.5 times as much as we should estimate:

/*
 * Estimate average cost of sorting of one group where presorted keys are
 * equal. Incremental sort is sensitive to distribution of tuples to the
 * groups, where we're relying on quite rough assumptions. Thus, we're
 * pessimistic about incremental sort performance and increase its average
 * group size by half.
*/ I can't see why an incrementalsort could be more expensive than a full sort, using the same presorted path. It looks to me that in that case we should always prefer an incrementalsort. Maybe we should bound incremental sorts cost to make sure they are never more expensive than the full sort ? Also, prior to this commit I don't think it made a real difference, because worst case scenario we would have missed an incremental sort, which we didn't have beforehand. But with this patch, we may actually replace a "hidden" incremental sort which was done in the agg codepath by a full sort. Best regards, -- Ronan Dunklau
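[Editor's note] The effect of that 1.5x group-size inflation can be made concrete with a toy cost model. This is an illustration only: it uses plain n * log2(n) comparison counts instead of PostgreSQL's actual cost_tuplesort()/cost_incremental_sort() functions, but it shows how inflating the estimated group size can push incremental sort's estimated cost above a full sort of the very same input, even though sorting smaller batches is never truly more expensive:

```python
import math

def full_sort_cost(n):
    # toy comparison-sort cost: n * log2(n)
    return n * math.log2(n)

def incremental_sort_cost(n, num_groups, pessimism=1.5):
    # sort each presorted group separately, but inflate the estimated
    # group size by the pessimism factor, mimicking the quoted comment
    # ("increase its average group size by half")
    group_tuples = (n / num_groups) * pessimism
    per_group_cost = group_tuples * math.log2(group_tuples)
    return num_groups * per_group_cost

n, groups = 1_000_000, 10

# With the 1.5x pessimism, incremental sort is costed above a full sort...
assert incremental_sort_cost(n, groups, pessimism=1.5) > full_sort_cost(n)
# ...while without it, sorting 10 smaller batches is correctly cheaper.
assert incremental_sort_cost(n, groups, pessimism=1.0) < full_sort_cost(n)
```

The inversion needs few, large groups: with many small groups the log2 term shrinks enough that even the inflated per-group cost stays below the full sort.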
Re: Fix gin index cost estimation
> > You're right, I was too eager to try to raise the CPU cost proportionally
> > to the number of array scans (scalararrayop). I'd really like to understand
> > where this equation comes from though...
>
> So, what's the latest update here?

Thanks Michael for reviving this thread.

Before proceeding any further with this, I'd like to understand where we stand. Tom argued that my way of charging a cost per entry page visited boils down to charging per tuple, which I expressed disagreement with. If we can come to a consensus on whether that's a bogus way of thinking about it, I'll proceed with what we agree on.

--
Ronan Dunklau
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node
Le jeudi 29 septembre 2022, 16:10:03 CEST Tom Lane a écrit : > Ronan Dunklau writes: > >> Yeah, I think the same rules around scope apply as > >> tuplesort_gettupleslot() with copy==false. We could do it by adding a > >> copy flag to the existing function, but I'd rather not add the > >> branching to that function. It's probably just better to duplicate it > >> and adjust. > > > > For the record, I tried to see if gcc would optimize the function by > > generating two different versions when copy is true or false, thus getting rid > > of the branching while still having only one function to deal with. > > TBH, I think this is completely ridiculous over-optimization. > There's exactly zero evidence that a second copy of the function > would improve performance, or do anything but contribute to code > bloat (which does have a distributed performance cost). I wasn't commenting on the merit of the optimization, but just that I tried to get gcc to apply it itself, which it doesn't. Regards, -- Ronan Dunklau
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node
> I've just pushed the disable byref Datums patch I posted earlier. I > only made a small adjustment to make use of the TupleDescAttr() macro. > Önder, thank you for the report. Thank you David for taking care of this. > Yeah, I think the same rules around scope apply as > tuplesort_gettupleslot() with copy==false. We could do it by adding a > copy flag to the existing function, but I'd rather not add the > branching to that function. It's probably just better to duplicate it > and adjust. > For the record, I tried to see if gcc would optimize the function by generating two different versions when copy is true or false, thus getting rid of the branching while still having only one function to deal with. Using the -fipa-cp-clone (or even the whole set of additional flags coming with -O3), it does generate a special-case version of the function, but it seems to then only be used by heapam_index_validate_scan and percentile_cont_multi_final_common. This is from my investigation looking for references to the specialized version in the DWARF debug information. Regards, -- Ronan Dunklau
Re: Fix gin index cost estimation
Le vendredi 16 septembre 2022, 22:04:59 CEST Tom Lane a écrit : > Ronan Dunklau writes: > > The attached does that and is much simpler. I only took into account > > entryPagesFetched, not sure if we should also charge something for data pages. > > I think this is wrong, because there is already a CPU charge based on > the number of tuples visited, down at the very end of the routine: > > *indexTotalCost += (numTuples * *indexSelectivity) * (cpu_index_tuple_cost + qual_op_cost); > > It's certainly possible to argue that that's incorrectly computed, > but just adding another cost charge for the same topic can't be right. I don't think it's the same thing. The entryPagesFetched is computed independently of the selectivity and the number of tuples. As such, I think it makes sense to use it to compute the cost of descending the entry tree. As mentioned earlier, I don't really understand the formula for computing entryPagesFetched. If we were to estimate the tree height to compute the descent cost as I first proposed, I feel like we would use two different metrics for what is essentially the same cost: something proportional to the size of the entry tree. > > I do suspect that that calculation is bogus, because it looks like it's > based on the concept of "apply the quals at each index entry", which we > know is not how GIN operates. So maybe we should drop that bit in favor > of a per-page-ish cost like you have here. Not sure. In any case it > seems orthogonal to the question of startup/descent costs. Maybe we'd > better tackle just one thing at a time. Hum, good point. Maybe that should be revisited too. > > (BTW, given that that charge does exist and is not affected by > repeated-scan amortization, why do we have a problem in the first place? > Is it just too small? I guess that when we're only expecting one tuple > to be retrieved, it would only add about cpu_index_tuple_cost.) 
Because with a very low selectivity, we end up under-charging for the cost of walking the entry tree by a significant amount. As said above, I don't see how those two things are the same: that charge estimates the cost of applying index quals to the visited tuples, which is not the same as charging per entry page visited.

> > Is the power(0.15) used as an approximation for a log? If so why? Also
> > shouldn't we round that up?
>
> No idea, but I'm pretty hesitant to just randomly fool with that equation
> when (a) neither of us know where it came from and (b) exactly no evidence
> has been provided that it's wrong.
>
> I note for instance that the existing logic switches from charging 1 page
> per search to 2 pages per search at numEntryPages = 15 (1.5 ^ (1/0.15)).
> Your version would switch at 2 pages, as soon as the pow() result is even
> fractionally above 1.0. Maybe the existing logic is too optimistic there,
> but ceil() makes it too pessimistic I think. I'd sooner tweak the power
> constant than change from rint() to ceil().

You're right, I was too eager to try to raise the CPU cost proportionally to the number of array scans (scalararrayop). I'd really like to understand where this equation comes from though...

Best regards,

--
Ronan Dunklau
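[Editor's note] The two crossover points Tom mentions are easy to check directly. The sketch below is a plain-Python transcription of the quoted expression, with Python's banker's rounding round() standing in for C's rint() in its default round-to-nearest mode:

```python
import math

def rint(x):
    # round-half-to-even, matching C rint() in the default rounding mode
    return round(x)

def pages_charged(num_entry_pages, rounding):
    # the disputed estimate: pow(numEntryPages, 0.15), rounded
    return rounding(num_entry_pages ** 0.15)

# Existing rint() logic: switches from 1 to 2 pages per search at
# numEntryPages = 15, since 1.5 ** (1 / 0.15) ~= 15.
assert pages_charged(14, rint) == 1
assert pages_charged(15, rint) == 2

# ceil() switches as soon as pow() exceeds 1.0, i.e. already at 2 pages.
assert pages_charged(1, math.ceil) == 1
assert pages_charged(2, math.ceil) == 2
```

This is why Tom suggests tweaking the power constant rather than the rounding function: the exponent moves the crossover smoothly, while swapping rint() for ceil() jumps straight to the most pessimistic rounding.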
Re: Fix gin index cost estimation
Le lundi 12 septembre 2022, 16:41:16 CEST Ronan Dunklau a écrit :
> But I realised that another approach might be better suited: since we want
> to charge a cpu cost for every page visited, actually basing that on the
> already estimated entryPagesFetched and dataPagesFetched would be better,
> instead of copying what is done for other index types and estimating the
> tree height. It would be simpler, as we don't need to estimate the tree
> height anymore.
>
> I will submit a patch doing that.

The attached does that and is much simpler. I only took into account entryPagesFetched; I'm not sure if we should also charge something for data pages.

Instead of trying to estimate the height of the tree, we rely on the (imperfect) estimation of the number of entry pages fetched, and charge 50 times cpu_operator_cost per entry page fetched, in addition to the cpu_operator_cost charged per entry visited. I also adapted it to take into account multiple scans induced by scalar array operations.

As it is, I don't understand the following calculation:

/*
 * Estimate number of entry pages read. We need to do
 * counts.searchEntries searches. Use a power function as it should be,
 * but tuples on leaf pages usually is much greater. Here we include all
 * searches in entry tree, including search of first entry in partial
 * match algorithm
 */
entryPagesFetched += ceil(counts.searchEntries * rint(pow(numEntryPages, 0.15)));

Is the power(0.15) used as an approximation for a log? If so, why? Also, shouldn't we round that up? It seems to me it's unlikely to affect the total too much in normal cases (adding at worst random_page_cost), but if we start to charge cpu operator costs as proposed here it makes a big difference, and it is probably safer to overestimate a bit than the opposite.

With those changes, the gin cost (purely cpu-wise) stays above the btree one, as I think it should.
--
Ronan Dunklau

>From 8635a22ee5f90297756f072199c69a318bd17ea2 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Mon, 12 Sep 2022 15:40:18 +0200
Subject: [PATCH v2] Fix gin costing.

GIN index scans were not taking any descent CPU-based cost into account.
That made them look cheaper than other types of indexes when they
shouldn't be.

We use the same heuristic as for btree indexes, but multiplying it by
the number of searched entries.

Per report of Hung Nguyen.
---
 src/backend/utils/adt/selfuncs.c | 34 +---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c746759eef..881e470a07 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7419,6 +7419,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				qual_arg_cost,
 				spc_random_page_cost,
 				outer_scans;
+	Cost		descentCost;
 	Relation	indexRel;
 	GinStatsData ginStats;
 	ListCell   *lc;
@@ -7622,7 +7623,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * searches in entry tree, including search of first entry in partial
 	 * match algorithm
 	 */
-	entryPagesFetched += ceil(counts.searchEntries * rint(pow(numEntryPages, 0.15)));
+	entryPagesFetched += ceil(counts.searchEntries * ceil(pow(numEntryPages, 0.15)));

 	/*
 	 * Add an estimate of entry pages read by partial match algorithm. It's a
@@ -7643,6 +7644,33 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 */
 	dataPagesFetched = ceil(numDataPages * partialScale);

+	*indexStartupCost = 0;
+	*indexTotalCost = 0;
+
+	/*
+	 * Add a CPU-cost component to represent the costs of initial entry btree
+	 * descent. We don't charge any I/O cost for touching upper btree levels,
+	 * since they tend to stay in cache, but we still have to do about log2(N)
+	 * comparisons to descend a btree of N leaf tuples. We charge one
+	 * cpu_operator_cost per comparison.
+	 *
+	 * If there are ScalarArrayOpExprs, charge this once per SA scan. The
+	 * ones after the first one are not startup cost so far as the overall
+	 * plan is concerned, so add them only to "total" cost.
+	 */
+	if (numEntries > 1)			/* avoid computing log(0) */
+	{
+		descentCost = ceil(log(numEntries) / log(2.0)) * cpu_operator_cost;
+		*indexStartupCost += descentCost * counts.searchEntries;
+		*indexTotalCost += counts.arrayScans * descentCost * counts.searchEntries;
+	}
+
+	/*
+	 * Add a cpu cost per page fetched. This is not amortized over a loop.
+	 */
+	*indexStartupCost += entryPagesFetched * 50.0 * cpu_operator_cost;
+	*indexTotalCost += entryPagesFetched * counts.arrayScans * 50.0 * cpu_operator_cost;
+
 	/*
 	 * Calculate cache effects if more than one scan due to nestloops or array
 	 * quals. The result is pro-rated per nestloop scan, but the array qual
@@ -7666,7 +7694,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * Here we use random page cost because logicall
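[Editor's note] The arithmetic this v2 patch adds can be sketched in isolation. The following is a Python transcription for illustration only, using PostgreSQL's default cpu_operator_cost of 0.0025 and ignoring the I/O and cache-effect charges computed elsewhere in gincostestimate():

```python
import math

cpu_operator_cost = 0.0025      # PostgreSQL's default setting

def gin_descent_charges(num_entries, search_entries, array_scans,
                        entry_pages_fetched):
    """Mirror the startup/total additions made by the v2 patch above."""
    startup = total = 0.0
    if num_entries > 1:         # avoid computing log(0)
        # about log2(N) comparisons to descend an N-entry btree,
        # one cpu_operator_cost each, charged per searched entry
        descent = math.ceil(math.log2(num_entries)) * cpu_operator_cost
        startup += descent * search_entries
        # SA scans after the first are not startup cost, only total cost
        total += array_scans * descent * search_entries
    # a cpu cost per entry page fetched, not amortized over a loop
    startup += entry_pages_fetched * 50.0 * cpu_operator_cost
    total += entry_pages_fetched * array_scans * 50.0 * cpu_operator_cost
    return startup, total

# One search over a 1000-entry tree touching 3 entry pages:
startup, total = gin_descent_charges(1000, 1, 1, 3)
# ceil(log2(1000)) = 10 comparisons, plus 3 * 50 operator costs for pages
assert math.isclose(startup, (10 + 150) * cpu_operator_cost)

# A second ScalarArrayOpExpr scan doubles only the "total" component:
startup2, total2 = gin_descent_charges(1000, 1, 2, 3)
assert math.isclose(startup2, startup)
assert math.isclose(total2, 2 * total)
```

This illustrates the structure reviewers asked for: the startup cost is charged once, while everything that repeats per SA scan is scaled by arrayScans in the total cost.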
Re: Fix gin index cost estimation
Thank you for looking at it.

> I looked this over briefly. I think you are correct to charge an
> initial-search cost per searchEntries count, but don't we also need to
> scale up by arrayScans, similar to the "corrections for cache effects"?
>
> + * We model index descent costs similarly to those for btree, but we also
> + * need an idea of the tree_height.
> + * We use numEntries / numEntryPages as the fanout factor.
>
> I'm not following that calculation? It seems like it'd be correct
> only for a tree height of 1, although maybe I'm just misunderstanding
> this (overly terse, perhaps) comment.

I don't really understand why that would work only with a tree height of one? Every entry page contains a certain number of entries, and as such computing the average number of entries per page seems to be a good approximation for the fanout. But I may have misunderstood what was done in other index types.

For consistency, maybe we should just use a hard-coded value of 100 for the fanout factor, similarly to what we do for other index types.

But I realised that another approach might be better suited: since we want to charge a cpu cost for every page visited, actually basing that on the already estimated entryPagesFetched and dataPagesFetched would be better, instead of copying what is done for other index types and estimating the tree height. It would be simpler, as we don't need to estimate the tree height anymore.

I will submit a patch doing that.

> + * We charge descentCost once for every entry
> + */
> + if (numTuples > 1)
> + {
> +     descentCost = ceil(log(numTuples) / log(2.0)) * cpu_operator_cost;
> +     *indexStartupCost += descentCost * counts.searchEntries;
> + }
>
> I had to read this twice before absorbing the point of the numTuples
> test. Maybe help the reader a bit:
>
> + if (numTuples > 1)	/* ensure positive log() */

Ok. On second read, I think that part was actually wrong: what we care about here is not the number of tuples, but the number of entries.
> Personally I'd duplicate the comments from nbtreecostestimate rather
> than just assuming the reader will go consult them. For that matter,
> why didn't you duplicate nbtree's logic for charging for SA scans?
> This bit seems just as relevant for GIN:
>
>    * If there are ScalarArrayOpExprs, charge this once per SA scan. The
>    * ones after the first one are not startup cost so far as the overall
>    * plan is concerned, so add them only to "total" cost.

You're right. So what we need to do here is scale up whatever we charge for the startup cost by the number of arrayscans for the total cost.

> Keep in mind also that pgindent will have its own opinions about how to
> format these comments, and it can out-stubborn you. Either run the
> comments into single paragraphs, or if you really want them to be two
> paras then leave an empty comment line between. Another formatting
> nitpick is that you seem to have added a number of unnecessary blank
> lines.

Thanks, noted.

I'll submit a new patch soon, as soon as I've resolved some of the problems I have when accounting for scalararrayops.

Best regards,

--
Ronan Dunklau
Fix gin index cost estimation
Hello,

Following the bug report at [1], I sent the attached patch to the pgsql-bugs mailing list. I'm starting a thread here to add it to the next commitfest.

The problem I'm trying to solve is that, contrary to btree, gist and sp-gist indexes, gin indexes do not charge any cpu cost for descending the entry tree. This can be a problem in cases where the IO cost is very low. This can happen with manual tuning of course, but more surprisingly when the IO cost is amortized over a large number of iterations in a nested loop. In that case, we basically consider it free since everything should already be in the shared buffers.

This leads to some inefficient plans, as an equivalent btree index should be picked instead.

This has been discovered in PG14, as this release makes it possible to use a pg_trgm gin index with the equality operator. Before that, only the btree would have been considered, and as such the discrepancy in the way we charge cpu cost didn't have noticeable effects. However, I suspect users of btree_gin could have the same kind of problems in prior versions.

Best regards,

[1]: https://www.postgresql.org/message-id/flat/2187702.iZASKD2KPV%40aivenronan#0c2498c6a85e31a589b3e9a6a3616c52

--
Ronan Dunklau

>From 0aa1ff24e58234d759c07f4eeec163a82244be25 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Wed, 6 Jul 2022 17:29:01 +0200
Subject: [PATCH v1] Fix gin costing.

GIN index scans were not taking any descent CPU-based cost into account.
That made them look cheaper than other types of indexes when they
shouldn't be.

We use the same heuristic as for btree indexes, but multiplying it by
the number of searched entries.

Per report of Hung Nguyen.
---
 src/backend/utils/adt/selfuncs.c | 36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fa1f589fad..21407c3d38 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7425,6 +7425,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				qual_arg_cost,
 				spc_random_page_cost,
 				outer_scans;
+	Cost		descentCost;
 	Relation	indexRel;
 	GinStatsData ginStats;
 	ListCell   *lc;
@@ -7524,6 +7525,18 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 							  &spc_random_page_cost,
 							  NULL);
+
+	/*
+	 * We model index descent costs similarly to those for btree, but we also
+	 * need an idea of the tree_height.
+	 * We use numEntries / numEntryPages as the fanout factor.
+	 */
+	if (index->tree_height < 0)
+	{
+		index->tree_height = ceil(log(numEntries) / log(numEntries / numEntryPages));
+	}
+
+
 	/*
 	 * Generic assumption about index correlation: there isn't any.
 	 */
@@ -7649,6 +7662,27 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 */
 	dataPagesFetched = ceil(numDataPages * partialScale);
+
+	*indexStartupCost = 0;
+
+	/*
+	 * Add a CPU-cost component similar to btree to represent the costs of the
+	 * initial descent.
+	 * We charge descentCost once for every entry
+	 */
+	if (numTuples > 1)
+	{
+		descentCost = ceil(log(numTuples) / log(2.0)) * cpu_operator_cost;
+		*indexStartupCost += descentCost * counts.searchEntries;
+	}
+
+	/*
+	 * Add a similar per-page charge, depending on the tree heights.
+	 */
+	descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
+	*indexStartupCost += descentCost * counts.searchEntries;
+
 	/*
 	 * Calculate cache effects if more than one scan due to nestloops or array
 	 * quals. The result is pro-rated per nestloop scan, but the array qual
@@ -7672,7 +7706,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * Here we use random page cost because logically-close pages could be far
 	 * apart on disk.
 	 */
-	*indexStartupCost = (entryPagesFetched + dataPagesFetched) * spc_random_page_cost;
+	*indexStartupCost += (entryPagesFetched + dataPagesFetched) * spc_random_page_cost;

 	/*
 	 * Now compute the number of data pages fetched during the scan.
--
2.37.0
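[Editor's note] The tree-height heuristic from this first version of the patch (later dropped in favour of charging per estimated entry page, as discussed in the replies above) boils down to the following. This is a Python transcription for illustration only; the fanout factor is the average number of entries per entry page:

```python
import math

def gin_tree_height(num_entries, num_entry_pages):
    # v1 patch heuristic: fanout = numEntries / numEntryPages,
    # tree height = ceil(log_fanout(numEntries))
    fanout = num_entries / num_entry_pages
    return math.ceil(math.log(num_entries) / math.log(fanout))

# 500,000 entries spread over 5,000 entry pages -> fanout 100 -> height 3
assert gin_tree_height(500_000, 5_000) == 3
```

Tom's alternative suggestion of a hard-coded fanout of 100 would simply replace the first line of the function with a constant, which sidesteps the question of whether the average-entries-per-page ratio is a valid fanout estimate at heights greater than 1.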
Re: Support for grabbing multiple consecutive values with nextval()
> It is Friday here, so I would easily miss something.. It is possible > to use COPY FROM with a list of columns, so assuming that you could > use a default expression with nextval() or just a SERIAL column not > listed in the COPY FROM query to do the job, what do we gain with this > feature? In which aspect does the preallocation of a range handled > on the client side after being allocated in the backend make things > better? The problem the author wants to solve is the fact they don't have a way of returning the ids when using COPY FROM. Pre-allocating them and assigning them to the individual records before sending them via COPY FROM would solve that for them. -- Ronan Dunklau
Re: Support for grabbing multiple consecutive values with nextval()
Hello,

Reading the thread, I think the feature has value: it would basically transfer control of the sequence cache to the client application.

However, I don't think that returning only the last value is a sensible thing to do. The client will need to know the details of the sequence to do anything useful with it, especially its increment, minvalue, maxvalue and cycle options. As suggested upthread, returning a resultset would probably be better.

If the client application is concerned about the volume of data exchanged with the server, and is willing to deal with handling the knowledge of the sequence details themselves, they can always wrap it in an aggregate:

SELECT min(newvals), max(newvals) FROM nextvals() as newvals

Regards,

--
Ronan Dunklau
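[Editor's note] To make the objection concrete: suppose the hypothetical bulk call returned only the last allocated value. Reconstructing the full block client-side then needs at least the sequence's increment, and is simply wrong for CYCLE sequences, where the allocated block may wrap around and stop being contiguous. All names below are illustrative, not a proposed API:

```python
def allocated_ids(last_value, count, increment=1):
    # Reconstruct the block of sequence values from only the last one,
    # as a client would have to do if the server returned a single value.
    # This deliberately ignores minvalue/maxvalue wraparound: for a CYCLE
    # sequence that wrapped mid-block, this reconstruction is incorrect,
    # which is exactly why returning a resultset is the safer interface.
    first = last_value - (count - 1) * increment
    return list(range(first, last_value + increment, increment))

# 5 values ending at 50, default increment:
assert allocated_ids(50, 5) == [46, 47, 48, 49, 50]
# 3 values ending at 100 with INCREMENT BY 10:
assert allocated_ids(100, 3, increment=10) == [80, 90, 100]
```

With a resultset-returning function, none of this client-side bookkeeping is needed and the CYCLE case is handled correctly for free.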
Re: Removing unneeded self joins
Le jeudi 30 juin 2022, 16:11:51 CEST Andrey Lepikhov a écrit :
> On 19/5/2022 16:47, Ronan Dunklau wrote:
> > I'll take a look at that one.
> New version of the patch, rebased on current master:
> 1. pgindent over the patch have passed.
> 2. number of changed files is reduced.
> 3. Some documentation and comments is added.

Hello Andrey,

Thanks for the updates. The general approach seems sensible to me, so I'm going to focus on some details.

In a very recent thread [1], Tom Lane is proposing to add infrastructure to make Vars aware of their nullability by outer joins. I wonder if that would help with avoiding the need for adding "is not null" clauses when the column is known not null? If we have a precedent for adding a BitmapSet to the Var itself, maybe the whole discussion regarding keeping track of nullability can be extended to the original column nullability?

Also, I saw it was mentioned earlier in the thread, but how difficult would it be to process the transformed quals through the EquivalenceClass machinery and the qual simplification? For example, if the target audience of this patch is ORMs, or inlined views, it wouldn't surprise me to see queries of this kind in the wild, where the redundant quals could be avoided altogether:

postgres=# explain (costs off) select * from sj s1 join sj s2 on s1.a = s2.a where s1.b = 2 and s2.b = 3;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on sj s2
   Filter: ((a IS NOT NULL) AND (b = 3) AND (b = 2))
(2 lignes)

+	for (counter = 0; counter < list_length(*sources);)
+	{
+		ListCell   *cell = list_nth_cell(*sources, counter);
+		RestrictInfo *rinfo = castNode(RestrictInfo, lfirst(cell));
+		int			counter1;
+
+		ec->ec_members = list_delete_cell(ec->ec_members, cell);

Why don't you use the foreach() and foreach_delete_current() macros for iterating over and removing items in the lists, both in update_ec_members and update_ec_sources?
+	if ((bms_is_member(k, info->syn_lefthand) ^
+		 bms_is_member(r, info->syn_lefthand)) ||
+		(bms_is_member(k, info->syn_righthand) ^
+		 bms_is_member(r, info->syn_righthand)))

I think this is more compact and easier to follow than the previous version, but I'm not sure how common that kind of construction is in the postgres source code?

Some review of the comments:

I see you keep using the terms "the keeping relation" and "the removing relation" in reference to the relation that is kept and the one that is removed. Aside from the grammar (the kept relation or the removed relation), maybe it would make it clearer to call them something else. In other parts of the code, you used "the remaining relation / the removed relation", which makes sense.

 /*
  * Remove the target relid from the planner's data structures, having
- * determined that there is no need to include it in the query.
+ * determined that there is no need to include it in the query. Or replace
+ * with another relid.
+ * To reusability, this routine can work in two modes: delete relid from a plan
+ * or replace it. It is used in replace mode in a self-join removing process.

This could be rephrased: ", optionally replacing it with another relid. The latter is used by the self-join removing process."

[1] https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us

--
Ronan Dunklau
Re: Removing unneeded self joins
Le jeudi 19 mai 2022, 12:48:18 CEST Andrey Lepikhov a écrit : > On 5/17/22 19:14, Ronan Dunklau wrote: > > Le vendredi 13 mai 2022, 07:07:47 CEST Andrey Lepikhov a écrit : > >> New version of the feature. > >> Here a minor bug with RowMarks is fixed. A degenerated case is fixed, > >> when uniqueness of an inner deduced not from join quals, but from a > >> baserestrictinfo clauses 'x=const', where x - unique field. > >> Code, dedicated to solve second issue is controversial, so i attached > >> delta.txt for quick observation. > >> Maybe we should return to previous version of code, when we didn't split > >> restriction list into join quals and base quals? > > > > Hello, > > > > I tried to find problematic cases, which would make the planning time grow > > unacceptably, and couldn't devise it. > > > > The worst case scenario I could think of was as follows: > > - a query with many different self joins > > - an abundance of unique indexes on combinations of this table columns > > to > > > > consider > > > > - additional predicates on the where clause on columns. > > Looking into the patch I can imagine, that the most difficult case is > when a set of relations with the same OID is huge, but only small part > of them (or nothing) can be removed. > Also, removing a clause from restrictinfo list or from equivalence class > adds non-linear complexity. So, you can dig this way ). > > > The base table I used for this was a table with 40 integers. 39 unique > > indexes were defined on every combination of (c1, cX) with cX being > > columns c2 to c40. > > > > I turned geqo off, set from_collapse_limit and join_collapse_limit to > > unreasonably high values (30), and tried to run queries of the form: > > > > SELECT * FROM test_table t1 > > JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX > > ... > > JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX. > > > > So no self join can be eliminated in that case. 
> > I think, you should compare t1.cX with tX.cX to eliminate self-join. > Cross-unique-index proof isn't supported now. Yes, that's the point. I wanted to try to introduce as much complexity as I could, without actually performing any self join elimination. The idea was to try to come up with the worst case scenario. > > > The performance was very similar with or without the GUC enabled. I tested > > the same thing without the patch, since the test for uniqueness has been > > slightly altered and incurs a new allocation, but it doesn't seem to > > change. > > > > One interesting side effect of this patch, is that removing any unneeded > > self join cuts down the planification time very significantly, as we > > lower the number of combinations to consider. > > Even more - removing a join we improve cardinality estimation. > > > As for the code: > > - Comments on relation_has_unique_index_ext and > > relation_has_unique_index_for> > > should be rewritten, as relation_has_unique_index_for is now just a > > special > > case of relation_has_unique_index_ext. By the way, the comment would > > probably be better read as: "but if extra_clauses isn't NULL". > > > > - The whole thing about "extra_clauses", ie, baserestrictinfos which > > were > > > > used to determine uniqueness, is not very clear. Most functions where the > > new argument has been added have not seen an update in their comments, > > and the name itself doesn't really convey the intented meaning: perhaps > > required_non_join_clauses ? > > > > The way this works should be explained a bit more thoroughly, for example > > in remove_self_joins_one_group the purpose of uclauses should be > > explained. The fact that degenerate_case returns true when we don't have > > any additional base restrict info is also confusing, as well as the > > degenerate_case name. > Agree, > but after this case thoughts wander in my head: should we make one step > back to pre-[1] approach? 
It looks like we have quite similar changes, > but without special function for a 'degenerate case' detection and > restrictlist splitting. I'll take a look at that one. > > > I'll update if I think of more interesting things to add. > > Thank you for your efforts! > > See in attachment next version which fixes mistakes detected by > z...@yugabyte.com. > > [1] > https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW > 5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com -- Ronan Dunklau
Re: Limiting memory allocation
Le mercredi 18 mai 2022, 16:23:34 CEST Jan Wieck a écrit :
> On 5/17/22 18:30, Stephen Frost wrote:
> > Greetings,
> >
> > On Tue, May 17, 2022 at 18:12 Tom Lane wrote:
> >     Jan Wieck writes:
> >     > On 5/17/22 15:42, Stephen Frost wrote:
> >     >> Thoughts?
> >     >
> >     > Using cgroups one can actually force a certain process (or user, or
> >     > service) to use swap if and when that service is using more memory than
> >     > it was "expected" to use.
> >
> >     I wonder if we shouldn't just provide documentation pointing to OS-level
> >     facilities like that one. The kernel has a pretty trivial way to check
> >     the total memory used by a process. We don't: it'd require tracking total
> >     space used in all our memory contexts, and then extracting some number out
> >     of our rear ends for allocations made directly from malloc. In short,
> >     anything we do here will be slow and unreliable, unless you want to depend
> >     on platform-specific things like looking at /proc/self/maps.
> >
> > This isn't actually a solution though and that's the problem- you end up
> > using swap but if you use more than "expected" the OOM killer comes in
> > and happily blows you up anyway. Cgroups are containers and exactly what
> > kube is doing.
>
> Maybe I'm missing something, but what is it that you would actually
> consider a solution? Knowing your current memory consumption doesn't
> make the need for allocating some right now go away. What do you
> envision the response of PostgreSQL to be if we had that information
> about resource pressure? I don't see us using mallopt(3) or
> malloc_trim(3) anywhere in the code, so I don't think any of our
> processes give back unused heap at this point (please correct me if I'm
> wrong). This means that even if we knew about the memory pressure of the
> system, adjusting things like work_mem on the fly may not do much at
> all, unless there is a constant turnover of backends.

I'm not sure I understand your point: when we free() a pointer, malloc is allowed to release the corresponding memory to the kernel. In the case of glibc, it doesn't necessarily do so, but it trims the top of the heap if it is in excess of M_TRIM_THRESHOLD. In the default glibc configuration, this parameter is dynamically adjusted by malloc itself, to a maximum value of 64MB IIRC. So any memory freed at the top of the heap totalling more than that threshold ends up actually freed.

In another thread, I proposed to take control over this tuning instead of letting malloc do it itself, as we may have better knowledge of the memory allocation patterns than what malloc empirically discovers: in particular, we could lower work_mem, adjust the threshold, and maybe even call malloc_trim ourselves when work_mem is lowered, to reduce the padding we may keep.

> So what do you propose PostgreSQL's response to high memory pressure to be?
>
> Regards, Jan

--
Ronan Dunklau
Re: Removing unneeded self joins
Le vendredi 13 mai 2022, 07:07:47 CEST Andrey Lepikhov a écrit :
> New version of the feature.
> Here a minor bug with RowMarks is fixed. A degenerated case is fixed,
> when uniqueness of an inner deduced not from join quals, but from a
> baserestrictinfo clauses 'x=const', where x - unique field.
> Code, dedicated to solve second issue is controversial, so i attached
> delta.txt for quick observation.
> Maybe we should return to previous version of code, when we didn't split
> restriction list into join quals and base quals?

Hello,

I tried to find problematic cases which would make the planning time grow unacceptably, and couldn't devise any.

The worst-case scenario I could think of was as follows:
- a query with many different self joins
- an abundance of unique indexes on combinations of this table's columns to consider
- additional predicates in the WHERE clause on columns.

The base table I used for this was a table with 40 integers. 39 unique indexes were defined on every combination of (c1, cX), with cX being columns c2 to c40.

I turned geqo off, set from_collapse_limit and join_collapse_limit to unreasonably high values (30), and tried to run queries of the form:

SELECT * FROM test_table t1
JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX
...
JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX

So no self join can be eliminated in that case.

The performance was very similar with or without the GUC enabled. I tested the same thing without the patch, since the test for uniqueness has been slightly altered and incurs a new allocation, but it doesn't seem to change anything.

One interesting side effect of this patch is that removing any unneeded self join cuts down the planning time very significantly, as we lower the number of combinations to consider.
As for the code:
- Comments on relation_has_unique_index_ext and relation_has_unique_index_for should be rewritten, as relation_has_unique_index_for is now just a special case of relation_has_unique_index_ext. By the way, the comment would probably read better as: "but if extra_clauses isn't NULL".
- The whole thing about "extra_clauses", i.e., baserestrictinfos which were used to determine uniqueness, is not very clear. Most functions where the new argument has been added have not seen an update in their comments, and the name itself doesn't really convey the intended meaning: perhaps required_non_join_clauses? The way this works should be explained a bit more thoroughly, for example in remove_self_joins_one_group the purpose of uclauses should be explained. The fact that degenerate_case returns true when we don't have any additional base restrict info is also confusing, as is the degenerate_case name.
I'll update if I think of more interesting things to add. -- Ronan Dunklau
Re: ORDER BY pushdowns seem broken in postgres_fdw
Le jeudi 31 mars 2022, 01:41:37 CEST Tom Lane a écrit : > I looked through this patch. It's going in the right direction, > but I have a couple of nitpicks: Thank you Tom for taking a look at this. > I think that instead of doubling down on a wrong API, we should just > take that out and move the logic into postgres_fdw.c. This also has > the advantage of producing a patch that's much safer to backpatch, > because it doesn't rely on the core backend getting updated before > postgres_fdw.so is. It makes total sense. > If no objections, I think this is committable. No objections on my end. -- Ronan Dunklau
Re: Proposal: More structured logging
Le jeudi 27 janvier 2022, 08:15:01 CET Michael Paquier a écrit : > On Tue, Jan 18, 2022 at 06:46:03AM +0100, Ronan Dunklau wrote: > > Hum, there was a missing import in csvlog.c from the fix above. Sorry > > about > > that. > > + > > > You are also forgetting that the table listing all the jsonlog fields > needs a refresh with this new key called "tags", and that it has a > JSON object underneath. Done. > > If you want to test all the formats supported by logging_destination, > wouldn't this stuff be better if tested through TAP with > logging_destination set with csvlog, jsonlog and stderr? This is > lacking coverage for csvlog.c and jsonlog.c, paths touched by your > patch. Done, I also added coverage for log_verbosity = verbose while I was at it, and fixed a bug I introduced in csvlog while rebasing... -- Ronan Dunklau>From 8071760e2a826b4ec88d2580c7bfc76e9bd2ff6e Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v7 1/4] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 7402696986..4fb4c67c3f 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -460,6 +460,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -511,6 +512,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -616,7 +618,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); 
+ /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1187,6 +1200,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) +{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 3eb8de3966..615fae47ef 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include <setjmp.h> +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...)
pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error context callback functions, not * within an ereport() invocation. The callback function can be in a different @@ -395,11 +398,18 @@ typedef struct ErrorData int internalpos; /* cursor index into internalquery */ char *internalquery; /* text of internally-generated query */ int saved_errno; /* errno at entry */ + List *tags; /* List of error tags */ /* context containing associated non-constant strings */ struct MemoryContextData *assoc_context; } ErrorData; +typedef struct ErrorTag +{ + const char *tagname; + char *tagvalue; +} ErrorTag; + extern void EmitErrorReport(void); extern ErrorData *CopyErrorData(void); extern void FreeErrorData(ErrorData *edata); -- 2.35.0 >From 872770026fd90
Re: Use generation context to speed up tuplesorts
Le jeudi 20 janvier 2022, 08:36:46 CET Ronan Dunklau a écrit : > Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud wrote: > > > I'm also a bit confused about which patch(es) should actually be > > > reviewed > > > in that thread. My understanding is that the original patch might not > > > be > > > relevant anymore but I might be wrong. The CF entry should probably be > > > updated to clarify that, with an annotation/ title change / author > > > update > > > or something. > > > > > > In the meantime I will switch the entry to Waiting on Author. > > > > I think what's going on here is a bit confusing. There's quite a few > > patches on here that aim to do very different things. > > > > The patch I originally proposed was just to make tuplesort use > > generation memory contexts. This helps speed up tuplesort because > > generation contexts allow palloc() to be faster than the standard > > allocator. Additionally, there's no power-of-2 wastage with generation > > contexts like there is with the standard allocator. This helps fit > > more tuples on fewer CPU cache lines. > > > > I believe Andres then mentioned that the fixed block size for > > generation contexts might become a problem. With the standard > > allocator the actual size of the underlying malloc() grows up until a > > maximum size. With generation contexts this is always the same, so we > > could end up with a very large number of blocks which means lots of > > small mallocs which could end up slow. Tomas then posted a series of > > patches to address this. > > > > I then posted another patch that has the planner make a choice on the > > size of the blocks to use for the generation context based on the > > tuple width and number of tuple estimates. My hope there was to > > improve performance further by making a better choice for how large to > > make the blocks in the first place. 
This reduces the need for Tomas' > > patch, but does not eliminate the need for it. > > > > As of now, I still believe we'll need Tomas' patches to allow the > > block size to grow up to a maximum size. I think those patches are > > likely needed before we think about making tuplesort use generation > > contexts. The reason I believe this is that we don't always have good > > estimates for the number of tuples we're going to sort. nodeSort.c is > > fairly easy as it just fetches tuples once and then sorts them. The > > use of tuplesort for nodeIncrementalSort.c is much more complex as > > we'll likely be performing a series of smaller sorts which are much > > harder to get size estimates for. This means there's likely no magic > > block size that we can use for the generation context that's used to > > store the tuples in tuplesort. This is also the case for order by > > aggregates and probably most other usages of tuplesort. > > You left out the discussion I started about glibc's malloc-specific > behaviour. > > I tried to demonstrate that a big part of the performance gain you were > seeing was not in fact due to our allocator by itself, but to the way > allocations of different block sizes interact with this malloc implementation > and how it handles releasing memory back to the system. I also tried to > show that we can give DBAs more control over how much memory to "waste" as > the price for faster allocations. > > I agree that the generation context memory manager is useful in the > tuplesort context, if only for the fact that we fall back to disk less > quickly due to the reduced wastage of memory, but I would be wary of > drawing conclusions too quickly based on a specific malloc implementation > which shows threshold effects, which are compounded by our growing block > sizes code in general.
> > I have on my TODO list to run the same kind of benchmarks using jemalloc, to > better isolate the performance gains expected from the generation allocator > itself from the side effect of avoiding glibc's pattern of releasing memory > back to the kernel too quickly. > I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and jemalloc, with the standard aset memory manager or with David's generation v2 patch. To use jemalloc, I just use an LD_PRELOAD env variable before starting postgres. I have not tried to measure jemalloc's memory footprint like I did for glibc's malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a replacement for glibc's malloc. Please find the results attached. As I suspected, most of the gain observed with Davi
Re: Use generation context to speed up tuplesorts
Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud wrote: > > I'm also a bit confused about which patch(es) should actually be reviewed > > in that thread. My understanding is that the original patch might not be > > relevant anymore but I might be wrong. The CF entry should probably be > > updated to clarify that, with an annotation/ title change / author update > > or something. > > > > In the meantime I will switch the entry to Waiting on Author. > > I think what's going on here is a bit confusing. There's quite a few > patches on here that aim to do very different things. > > The patch I originally proposed was just to make tuplesort use > generation memory contexts. This helps speed up tuplesort because > generation contexts allow palloc() to be faster than the standard > allocator. Additionally, there's no power-of-2 wastage with generation > contexts like there is with the standard allocator. This helps fit > more tuples on fewer CPU cache lines. > > I believe Andres then mentioned that the fixed block size for > generation contexts might become a problem. With the standard > allocator the actual size of the underlying malloc() grows up until a > maximum size. With generation contexts this is always the same, so we > could end up with a very large number of blocks which means lots of > small mallocs which could end up slow. Tomas then posted a series of > patches to address this. > > I then posted another patch that has the planner make a choice on the > size of the blocks to use for the generation context based on the > tuple width and number of tuple estimates. My hope there was to > improve performance further by making a better choice for how large to > make the blocks in the first place. This reduces the need for Tomas' > patch, but does not eliminate the need for it. > > As of now, I still believe we'll need Tomas' patches to allow the > block size to grow up to a maximum size. 
I think those patches are > likely needed before we think about making tuplesort use generation > contexts. The reason I believe this is that we don't always have good > estimates for the number of tuples we're going to sort. nodeSort.c is > fairly easy as it just fetches tuples once and then sorts them. The > use of tuplesort for nodeIncrementalSort.c is much more complex as > we'll likely be performing a series of smaller sorts which are much > harder to get size estimates for. This means there's likely no magic > block size that we can use for the generation context that's used to > store the tuples in tuplesort. This is also the case for order by > aggregates and probably most other usages of tuplesort. > You left out the discussion I started about glibc's malloc-specific behaviour. I tried to demonstrate that a big part of the performance gain you were seeing was not in fact due to our allocator by itself, but to the way allocations of different block sizes interact with this malloc implementation and how it handles releasing memory back to the system. I also tried to show that we can give DBAs more control over how much memory to "waste" as the price for faster allocations. I agree that the generation context memory manager is useful in the tuplesort context, if only for the fact that we fall back to disk less quickly due to the reduced wastage of memory, but I would be wary of drawing conclusions too quickly based on a specific malloc implementation which shows threshold effects, which are compounded by our growing block sizes code in general. I have on my TODO list to run the same kind of benchmarks using jemalloc, to better isolate the performance gains expected from the generation allocator itself from the side effect of avoiding glibc's pattern of releasing memory back to the kernel too quickly. Julien, sorry for the confusion of adding that discussion and those patches to the same thread, but I don't really see a way of dissociating those two aspects.
-- Ronan Dunklau
Re: Proposal: More structured logging
Le lundi 17 janvier 2022, 09:18:04 CET Ronan Dunklau a écrit : > Le samedi 15 janvier 2022, 07:09:59 CET Julien Rouhaud a écrit : > > Hi, > > > > On Tue, Jan 11, 2022 at 11:05:26AM +0100, Ronan Dunklau wrote: > > > Done, and I added another commit per your suggestion to add this comment. > > > > The cfbot reports that the patchset doesn't apply anymore: > > > > http://cfbot.cputube.org/patch_36_3293.log > > === Applying patches on top of PostgreSQL commit ID > > 43c2175121c829c8591fc5117b725f1f22bfb670 === [...] > > === applying patch ./v4-0003-Output-error-tags-in-CSV-logs.patch > > [...] > > patching file src/backend/utils/error/elog.c > > Hunk #2 FAILED at 3014. > > 1 out of 2 hunks FAILED -- saving rejects to file > > src/backend/utils/error/elog.c.rej > > > > Could you send a rebased version? In the meantime I will switch the cf > > entry to Waiting on Author. > > Thank you for this notification! > > The csvlog has been refactored since I wrote the patch, and the new jsonlog > destination has been introduced. I rebased to fix the first patch, and added > a new patch to add support for tags in jsonlog output. > > Best regards, Hum, there was a missing import in csvlog.c from the fix above. Sorry about that.
-- Ronan Dunklau>From 8071760e2a826b4ec88d2580c7bfc76e9bd2ff6e Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v6 1/5] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 7402696986..4fb4c67c3f 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -460,6 +460,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -511,6 +512,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -616,7 +618,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1187,6 +1200,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 3eb8de3966..615fae47ef 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include <setjmp.h> +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...) pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error context callback functions, not * within an ereport() invocation. The callback function can be in a different @@ -395,11 +398,18 @@ typedef struct ErrorData int internalpos; /* cursor index into internalquery */ char *internalquery; /* text o
Re: Proposal: More structured logging
Le samedi 15 janvier 2022, 07:09:59 CET Julien Rouhaud a écrit : > Hi, > > On Tue, Jan 11, 2022 at 11:05:26AM +0100, Ronan Dunklau wrote: > > Done, and I added another commit per your suggestion to add this comment. > > The cfbot reports that the patchset doesn't apply anymore: > > http://cfbot.cputube.org/patch_36_3293.log > === Applying patches on top of PostgreSQL commit ID > 43c2175121c829c8591fc5117b725f1f22bfb670 === [...] > === applying patch ./v4-0003-Output-error-tags-in-CSV-logs.patch > [...] > patching file src/backend/utils/error/elog.c > Hunk #2 FAILED at 3014. > 1 out of 2 hunks FAILED -- saving rejects to file > src/backend/utils/error/elog.c.rej > > Could you send a rebased version? In the meantime I will switch the cf > entry to Waiting on Author. Thank you for this notification! The csvlog has been refactored since I wrote the patch, and the new jsonlog destination has been introduced. I rebased to fix the first patch, and added a new patch to add support for tags in jsonlog output.
Best regards, -- Ronan Dunklau>From 8071760e2a826b4ec88d2580c7bfc76e9bd2ff6e Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v5 1/5] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 7402696986..4fb4c67c3f 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -460,6 +460,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -511,6 +512,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -616,7 +618,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1187,6 +1200,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 3eb8de3966..615fae47ef 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include <setjmp.h> +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...) pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error context callback functions, not * within an ereport() invocation. The callback function can be in a different @@ -395,11 +398,18 @@ typedef struct ErrorData int internalpos; /* cursor index into internalquery */ char *internalquery; /* text of internally-generated query */ int saved_errno; /* errno at entry */ + List *tags; /* List of error tags */ /* context containing associated non-constant strings */ struct MemoryContextData *assoc_context; } ErrorData; +typedef struct ErrorTag +{ + const char *tagname
Re: Proposal: More structured logging
Le mercredi 29 décembre 2021, 14:59:16 CET Justin Pryzby a écrit : > > Subject: [PATCH v3 2/3] Add test module for the new tag functionality. > > ... > > > +test_logging(PG_FUNCTION_ARGS) > > +{ > > ... > > > +(errmsg("%s", message), > > + ({ > > + forboth(lk, keys, lv, values) > > + { > > + (errtag(lfirst(lk), "%s", (char *) lfirst(lv))); > > + }}) > > + )); > > The windows build fails with that syntax. > http://cfbot.cputube.org/ronan-dunklau.html > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.157923 Thank you. I switched to an explicit sequence of errstart / errfinish to avoid putting too many things in nested macro calls. As I don't have any Windows knowledge, I am very grateful for the new cirrus-ci integration which allowed me to build on Windows without hassle. > > Subject: [PATCH v3 3/3] Output error tags in CSV logs > > +++ b/doc/src/sgml/config.sgml > > @@ -7370,6 +7371,7 @@ CREATE TABLE postgres_log > > > >backend_type text, > >leader_pid integer, > >query_id bigint, > > > > + tags jsonb > > > >PRIMARY KEY (session_id, session_line_num) > > > > ); > > > > That's invalid sql due to missing a trailing ",". Thanks, fixed. > > You should also update file-fdw.sgml - which I only think of since we forgot > to update it before e568ed0eb and 0830d21f5. config.sgml should have a > comment as a reminder to do that. Done, and I added another commit per your suggestion to add this comment. Thank you for this review.
Regards, -- Ronan Dunklau>From 487c34465b34ef6ef8cb466247cbaaa8cf4bc534 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 11 Jan 2022 10:16:53 +0100 Subject: [PATCH v4 4/4] Add comment in config.sgml as a reminder to also update file_fdw.sgml --- doc/src/sgml/config.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4df79fcbcc..d4c20f40ac 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7375,7 +7375,7 @@ log_line_prefix = '%m [%p] %q%u@%d/%a ' query id, and optional tags added by the logger as JSON. Here is a sample table definition for storing CSV-format log output: - + CREATE TABLE postgres_log ( -- 2.34.1 >From a1c4ae874bf156a05da436838d6b2b73f6621905 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:32 +0200 Subject: [PATCH v4 3/4] Output error tags in CSV logs --- doc/src/sgml/config.sgml | 4 +++- doc/src/sgml/file-fdw.sgml | 1 + src/backend/utils/error/elog.c | 25 - 3 files changed, 28 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index afbb6c35e3..4df79fcbcc 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7372,7 +7372,8 @@ log_line_prefix = '%m [%p] %q%u@%d/%a ' location of the error in the PostgreSQL source code (if log_error_verbosity is set to verbose), application name, backend type, process ID of parallel group leader, -and query id. +query id, +and optional tags added by the logger as JSON. 
Here is a sample table definition for storing CSV-format log output: @@ -7404,6 +7405,7 @@ CREATE TABLE postgres_log backend_type text, leader_pid integer, query_id bigint, + tags jsonb, PRIMARY KEY (session_id, session_line_num) ); diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml index 5b98782064..ccb4e9d8dd 100644 --- a/doc/src/sgml/file-fdw.sgml +++ b/doc/src/sgml/file-fdw.sgml @@ -268,6 +268,7 @@ CREATE FOREIGN TABLE pglog ( backend_type text, leader_pid integer, query_id bigint + tags jsonb, ) SERVER pglog OPTIONS ( filename 'log/pglog.csv', format 'csv' ); diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index d43e1c2c31..1e6c7222c3 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -80,6 +80,7 @@ #include "storage/proc.h" #include "tcop/tcopprot.h" #include "utils/guc.h" +#include "utils/json.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -3013,10 +3014,32 @@ write_csvlog(ErrorData *edata) appendStringInfo(&buf, "%d", leader->pid); } appendStringInfoChar(&buf, ','); - /* query id */ appendStringInfo(&buf, "%lld", (long long) pgstat_get_my_query_id()); + appendStringInfoChar(&buf, ','); + if (edata->tags != NIL) + { + StringInfoData tagbuf; + ListCell *lc; + bool first = true; + initStringInfo(&tagbuf); + appendStringInfoChar(&tagbuf, '{'); + foreach(lc, edata
Matching domains-over-enums to anyenum types
Hello,

A colleague of mine was surprised to discover the following statements raised an error:

postgres=# CREATE TYPE abc_enum AS ENUM ('a', 'b', 'c');
CREATE TYPE
postgres=# CREATE DOMAIN abc_domain AS abc_enum;
CREATE DOMAIN
postgres=# SELECT 'a'::abc_domain = 'a'::abc_domain;
ERROR:  operator does not exist: abc_domain = abc_domain
LINE 1: SELECT 'a'::abc_domain = 'a'::abc_domain;
                               ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

This has already been discussed a long time ago, and the idea was rejected at the time since there was no demand for it: https://www.postgresql.org/message-id/flat/BANLkTi%3DaGxDbGPSF043V2K-C2vF2YzGz9w%40mail.gmail.com#da4826d2cbbaca20e3440aadb3093158

Given that we implemented that behaviour for domains over ranges and multiranges, I don't see the harm in doing the same for domains over enums. What do you think?

-- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le vendredi 7 janvier 2022, 13:03:28 CET Tomas Vondra a écrit : > On 1/7/22 12:03, Ronan Dunklau wrote: > > Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : > >> I've attached some benchmark results that I took recently. The > >> spreadsheet contains results from 3 versions. master, master + 0001 - > >> 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit > >> more conservative about the chunk sizes it allocates and also tries to > >> allocate the tuple array according to the number of tuples we expect > >> to be able to sort in a single batch for when the sort is not > >> estimated to fit inside work_mem. > > > > (Sorry for trying to merge back the discussion on the two sides of the > > thread) > > > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, > > I expressed the idea of being able to tune glibc's malloc behaviour. > > > > I implemented that (patch 0001) to provide a new hook which is called on > > backend startup, and anytime we set work_mem. This hook is # defined > > depending on the malloc implementation: currently a default, no-op > > implementation is provided as well as a glibc's malloc implementation. > > Not sure I'd call this a hook - that usually means a way to plug-in > custom code through a callback, and this is simply ifdefing a block of > code to pick the right implementation. Which may be a good way to do > that, just let's not call that a hook. > > There's a commented-out MallocTuneHook() call, probably not needed. Ok, I'll clean that up if we decide to proceed with this. > > I wonder if #ifdefing is sufficient solution, because it happens at > compile time, so what if someone overrides the allocator in LD_PRELOAD? > That was a fairly common way to use a custom allocator in an existing > application. But I don't know how many people do that with Postgres (I'm > not aware of anyone doing that) or if we support that (it'd probably > apply to other stuff too, not just malloc). 
So maybe it's OK, and I > can't think of a better way anyway. I couldn't think of a better way either; maybe there is something to be done with trying to dlsym something specific to glibc's malloc implementation? > > > The glibc's malloc implementation relies on a new GUC, > > glibc_malloc_max_trim_threshold. When set to its default value of -1, we > > don't tune malloc at all, exactly as in HEAD. If a different value is > > provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD > > to this value, capped by work_mem / 2 and work_mem respectively. > > > > The net result is that we can then allow to keep more unused memory at the > > top of the heap, and to use mmap less frequently, if the DBA chooses to. > > A possible other use case would be, on the contrary, to limit the > > allocated memory in idle backends to a minimum. > > > > The reasoning behind this is that glibc's malloc default way of handling > > those two thresholds is to adapt to the size of the last freed mmapped > > block. > > > > I've run the same "up to 32 columns" benchmark as you did, with this new > > patch applied on top of both HEAD and your v2 patchset incorporating > > planner estimates for the block sizes. Those are called "aset" and > > "generation" in the attached spreadsheet. For each, I've run it with > > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case > > > > I've measured two things: > > - query latency, as reported by pgbench > > - total memory allocated by malloc at backend exit after running each > > query > > > > three times. This represents the "idle" memory consumption, and thus what > > we waste in malloc instead of releasing it back to the system. This > > measurement has been performed using the very small module presented in > > patch 0002. Please note that I in no way propose that we include this > > module, it was just a convenient way for me to measure memory footprint.
> > > > My conclusion is that the impressive gains you see from using the > > generation context with bigger blocks mostly comes from the fact that we > > allocate bigger blocks, and that this moves the mmap thresholds > > accordingly. I wonder how much of a difference it would make on other > > malloc implementation: I'm afraid the optimisation presented here would > > in fact be specific to glibc's malloc, since we have almost the same > > gains with both allocators when tuning malloc to keep more memory. I > > still think both approaches are useful, and would be nec
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 14:39:10 CET Tomas Vondra a écrit : > I wasn't really suggesting to investigate those other allocators in this > patch - it seems like a task requiring a pretty significant amount of > work/time. My point was that we should make it reasonably easy to add > tweaks for those other environments, if someone is interested enough to > do the legwork. > > >> 2) In fact, I wonder if different glibc versions behave differently? > >> Hopefully it's not changing that much, though. Ditto kernel versions, > >> but the mmap/sbrk interface is likely more stable. We can test this. > > > > That could be tested, yes. As a matter of fact, a commit removing the > > upper > > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to > > glibc, > > which means we can service much bigger allocation without mmap. > > Yeah, I noticed that commit too. Most systems stick to one glibc > version, so it'll take time to reach most systems. Let's continue with > just one glibc version and then maybe test other versions. Yes, I also need to figure out how to detect we're using glibc as I'm not very familiar with configure. > > >> 3) If we bump the thresholds, won't that work against reusing the > >> memory? I mean, if we free a whole block (from any allocator we have), > >> glibc might return it to kernel, depending on mmap threshold value. It's > >> not guaranteed, but increasing the malloc thresholds will make that even > >> less likely. So we might just as well increase the minimum block size, > >> with about the same effect, no? > > > > It is my understanding that malloc will try to compact memory by moving it > > around. So the memory should be actually be released to the kernel at some > > point. In the meantime, malloc can reuse it for our next invocation (which > > can be in a different memory context on our side). 
> > If we increase the minimum block size, this is memory we will actually > > reserve, and it will not protect us against the ramping-up behaviour: > > - the first allocation of a big block may be over mmap_threshold, and serviced > > by an expensive mmap > > - when it's freed, the threshold is doubled > > - the next invocation is serviced by an sbrk call > > - freeing it will be above the trim threshold, and it will be returned. > > > > After several "big" allocations, the thresholds will rise to their > > maximum > > values (well, it used to, I need to check what happens with that latest > > patch of glibc...) > > > > This will typically happen several times as malloc doubles the threshold > > each time. This is probably the reason quadrupling the block sizes was > > more effective. > > Hmmm, OK. Can we benchmark the case with large initial block size, at > least for comparison? The benchmark I called "fixed" was with a fixed block size of ALLOCSET_DEFAULT_MAXSIZE (first proposed patch) and showed roughly the same performance profile as the growing blocks + malloc tuning. But if I understand correctly, you implemented the growing blocks logic after concerns about wasting memory with a constant large block size. If we tune malloc, that memory would not be wasted if we don't allocate it, just not released as eagerly when it is allocated. Or do you want a benchmark with an even bigger initial block size ? With the growing blocks patch with a large initial size ? I can run either, I just want to understand what is interesting to you. One thing that would be interesting would be to trace the total amount of memory allocated in the different cases. This is something I will need to do anyway when I propose that patch. Best regards, -- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 09:08:06 CET Ronan Dunklau a écrit : > It is my understanding that malloc will try to compact memory by moving it > around. So the memory should be actually be released to the kernel at some > point. In the meantime, malloc can reuse it for our next invocation (which > can be in a different memory context on our side). I've been told off-list this comment wasn't clear: I meant that it compacts *free* memory, consolidating into larger blocks that will eventually reach the threshold, and be released. -- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit : > On 12/16/21 17:03, Ronan Dunklau wrote: > > Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : > >> I will follow up with a benchmark of the test sorting a table with a > >> width > >> varying from 1 to 32 columns. > > > > So please find attached another benchmark for that case. > > > > The 3 different patchsets tested are: > > - master > > - fixed (David's original patch) > > - adjust (Thomas growing blocks patch) > > Presumably Thomas is me, right? I'm really sorry for this typo... Please accept my apologies. > > > So it looks like tuning malloc for this would be very beneficial for any > > kind of allocation, and by doing so we reduce the problems seen with the > > growing blocks patch to next to nothing, while keeping the ability to not > > allocate too much memory from the get go. > > Thanks for running those tests and investigating the glibc behavior! I > find those results very interesting. My conclusion from this is that > the interaction between "our" allocator and the allocator in > malloc (e.g. glibc) can be problematic. Which makes benchmarking and > optimization somewhat tricky because code changes may trigger behavior > change in glibc (or whatever allocator backs malloc). > > I think it's worth exploring if we can tune this in a reasonable way, > but I have a couple concerns related to that: > > 1) I wonder how glibc-specific this is - I'd bet it applies to other > allocators (either on another OS or just different allocator on Linux) > too. Tweaking glibc parameters won't affect those systems, of course, > but maybe we should allow tweaking those systems too ... I agree, finding their specific problems and seeing if we can work around them would be interesting. I suppose glibc's malloc is the most commonly used allocator in production, as it is the default for most Linux distributions. > > 2) In fact, I wonder if different glibc versions behave differently? 
> Hopefully it's not changing that much, though. Ditto kernel versions, > but the mmap/sbrk interface is likely more stable. We can test this. That could be tested, yes. As a matter of fact, a commit removing the upper limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, which means we can service much bigger allocations without mmap. > > 3) If we bump the thresholds, won't that work against reusing the > memory? I mean, if we free a whole block (from any allocator we have), > glibc might return it to kernel, depending on mmap threshold value. It's > not guaranteed, but increasing the malloc thresholds will make that even > less likely. So we might just as well increase the minimum block size, > with about the same effect, no? It is my understanding that malloc will try to compact memory by moving it around. So the memory should actually be released to the kernel at some point. In the meantime, malloc can reuse it for our next invocation (which can be in a different memory context on our side). If we increase the minimum block size, this is memory we will actually reserve, and it will not protect us against the ramping-up behaviour:

- the first allocation of a big block may be over mmap_threshold, and serviced by an expensive mmap
- when it's freed, the threshold is doubled
- the next invocation is serviced by an sbrk call
- freeing it will be above the trim threshold, and it will be returned.

After several "big" allocations, the thresholds will rise to their maximum values (well, it used to, I need to check what happens with that latest patch of glibc...) This will typically happen several times as malloc doubles the threshold each time. This is probably the reason quadrupling the block sizes was more effective. > > > I would like to try to implement some dynamic glibc malloc tuning, if that > > is something we don't reject on principle from the get go. > > +1 to that Ok, I'll work on a patch for this and submit a new thread. -- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : > I will follow up with a benchmark of the test sorting a table with a width > varying from 1 to 32 columns. > So please find attached another benchmark for that case.

The 3 different patchsets tested are:
- master
- fixed (David's original patch)
- adjust (Thomas growing blocks patch)

So it looks like tuning malloc for this would be very beneficial for any kind of allocation, and by doing so we reduce the problems seen with the growing blocks patch to next to nothing, while keeping the ability to not allocate too much memory from the get go. I would like to try to implement some dynamic glibc malloc tuning, if that is something we don't reject on principle from the get go. -- Ronan Dunklau bench_results.ods Description: application/vnd.oasis.opendocument.spreadsheet
Re: Use generation context to speed up tuplesorts
Le mercredi 8 décembre 2021, 22:07:12 CET Tomas Vondra a écrit : > On 12/8/21 16:51, Ronan Dunklau wrote: > > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit : > >> And now comes the funny part - if I run it in the same backend as the > >> > >> "full" benchmark, I get roughly the same results: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ++---+--+- > >> > >>32768 |512 | 806256640 |37159 | 76669 > >> > >> but if I reconnect and run it in the new backend, I get this: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ++---+--+- > >> > >>32768 |512 | 806158336 | 233909 | 100785 > >> > >> (1 row) > >> > >> It does not matter if I wait a bit before running the query, if I run it > >> repeatedly, etc. The machine is not doing anything else, the CPU is set > >> to use "performance" governor, etc. > > > > I've reproduced the behaviour you mention. > > I also noticed asm_exc_page_fault showing up in the perf report in that > > case. > > > > Running an strace on it shows that in one case, we have a lot of brk > > calls, > > while when we run in the same process as the previous tests, we don't. > > > > My suspicion is that the previous workload makes glibc malloc change it's > > trim_threshold and possibly other dynamic options, which leads to > > constantly moving the brk pointer in one case and not the other. > > > > Running your fifo test with absurd malloc options shows that indeed that > > might be the case (I needed to change several, because changing one > > disable the dynamic adjustment for every single one of them, and malloc > > would fall back to using mmap and freeing it on each iteration): > > > > mallopt(M_TOP_PAD, 1024 * 1024 * 1024); > > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024); > > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long)); > > > > I get the following results for your self contained test. 
> > I ran the query twice, in each case, seeing the importance of the first allocation and the subsequent ones:
> >
> > With default malloc options:
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   300156 |  207557
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   211942 |   77207
> >
> > With the oversized values above:
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   219000 |   36223
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |    75761 |   78082
> > (1 row)
> >
> > I can't tell how representative your benchmark extension would be of real > > life allocation / free patterns, but there is probably something we can > > improve here. > Thanks for looking at this. I think those allocation / free patterns are > fairly extreme, and there probably are no workloads doing exactly this. > The idea is the actual workloads are likely some combination of these > extreme cases. > > > I'll try to see if I can understand more precisely what is happening. > > Thanks, that'd be helpful. Maybe we can learn something about tuning > malloc parameters to get significantly better performance. Apologies for the long email, maybe what I will state here is obvious for others but I learnt a lot, so... I think I understood what the problem was in your generation tests: depending on the sequence of allocations we will raise a different maximum for mmap_threshold and trim_threshold. When an mmap block is freed, malloc will raise its threshold as it considers memory freed to be better served by regular moving of the sbrk pointer, instead of using mmap to map memory. This threshold is upped by multiplying it by two anytime we free a mmap'ed block
Re: Use generation context to speed up tuplesorts
Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit :
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     806256640 |    37159 |   76669
>
> but if I reconnect and run it in the new backend, I get this:
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     806158336 |   233909 |  100785
> (1 row)
>
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention. I also noticed asm_exc_page_fault showing up in the perf report in that case. Running an strace on it shows that in one case, we have a lot of brk calls, while when we run in the same process as the previous tests, we don't. My suspicion is that the previous workload makes glibc malloc change its trim_threshold and possibly other dynamic options, which leads to constantly moving the brk pointer in one case and not the other. Running your fifo test with absurd malloc options shows that indeed that might be the case (I needed to change several, because changing one disables the dynamic adjustment for every single one of them, and malloc would fall back to using mmap and freeing it on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));

I get the following results for your self-contained test. 
I ran the query twice, in each case, seeing the importance of the first allocation and the subsequent ones:

With default malloc options:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   300156 |  207557

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   211942 |   77207

With the oversized values above:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   219000 |   36223

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |    75761 |   78082
(1 row)

I can't tell how representative your benchmark extension would be of real life allocation / free patterns, but there is probably something we can improve here. I'll try to see if I can understand more precisely what is happening. -- Ronan Dunklau
Re: Non-superuser subscription owners
Le lundi 6 décembre 2021, 16:56:56 CET Mark Dilger a écrit : > > On Dec 6, 2021, at 2:19 AM, Amit Kapila wrote: > >>> If we want to maintain the property that subscriptions can only be > >>> owned by superuser > > We don't want to maintain such a property, or at least, that's not what I want. I don't think that's what Jeff wants, either. That's not what I want either: the ability to run and refresh subscriptions as a non-superuser is a desirable feature. The REFRESH part was possible before PG 14, when it was allowed to run REFRESH in a function, which could be made to run as security definer.

> As I perceive the roadmap:
> 1) Fix the current bug wherein subscription changes are applied with superuser force after the subscription owner has superuser privileges revoked.
> 2) Allow the transfer of subscriptions to non-superuser owners.
> 3) Allow the creation of subscriptions by non-superusers who are members of some as yet to be created predefined role, say "pg_create_subscriptions"

This roadmap seems sensible. -- Ronan Dunklau
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Le jeudi 22 juillet 2021, 10:42:49 CET Ronan Dunklau a écrit : > Le jeudi 22 juillet 2021, 09:38:50 CEST David Rowley a écrit : > > On Thu, 22 Jul 2021 at 02:01, Ronan Dunklau wrote: > > > I tested the 0001 patch against both HEAD and my proposed bugfix for > > > postgres_fdw. > > > > > > There is a problem that the ordered aggregate is not pushed down > > > anymore. > > > The underlying Sort node is correctly pushed down though. > > > > > > This comes from the fact that postgres_fdw grouping path never contains > > > any > > > pathkey. Since the cost is fuzzily the same between the pushed-down > > > aggregate and the locally performed one, the tie is broken against the > > > pathkeys. > > > > I think this might be more down to a lack of any penalty cost for > > fetching foreign tuples. Looking at create_foreignscan_path(), I don't > > see anything that adds anything to account for fetching the tuples > > from the foreign server. If there was something like that then there'd > > be more of a preference to perform the remote aggregation so that > > fewer rows must arrive from the remote server. > > > > I tested by adding: total_cost += cpu_tuple_cost * rows * 100; in > > create_foreignscan_path() and I got the plan with the remote > > aggregation. That's a fairly large penalty of 1.0 per row. Much bigger > > than parallel_tuple_cost's default value. > > > > I'm a bit undecided on how much this patch needs to get involved in > > adjusting foreign scan costs. The problem is that we've given the > > executor a new path to consider and nobody has done any proper > > costings for the foreign scan so that it properly prefers paths that > > have to pull fewer foreign tuples. This is a pretty similar problem > > to what parallel_tuple_cost aims to fix. Also similar to how we had to > > add APPEND_CPU_COST_MULTIPLIER to have partition-wise aggregates > > prefer grouping at the partition level rather than at the partitioned > > table level. 
> > > > > Ideally we would add the group pathkeys to the grouping path, but this > > > would add an additional ORDER BY expression matching the GROUP BY. > > > Moreover, some triaging of the pathkeys would be necessary since we now > > > mix the sort-in- aggref pathkeys with the group_pathkeys. > > > > I think you're talking about passing pathkeys into > > create_foreign_upper_path in add_foreign_grouping_paths. If so, I > > don't really see how it would be safe to add pathkeys to the foreign > > grouping path. What if the foreign server did a Hash Aggregate? The > > rows might come back in any random order. > > Yes, I was suggesting to add a new path with the pathkeys factored in, which > if chosen over the non-ordered path would result in an additional ORDER BY > clause to prevent a HashAggregate. But that doesn't seem a good idea after > all. > > > I kinda think that to fix this properly would need a new foreign > > server option such as foreign_tuple_cost. I'd feel better about > > something like that with some of the people with a vested interest in > > the FDW code were watching more closely. So far we've not managed to > > entice any of them with the bug report yet, but it's maybe early days > > yet. > > We already have that in the form of fdw_tuple_cost as a server option if I'm > not mistaken ? It works as expected when the number of tuples is notably > reduced by the foreign group by. > > The problem arise when the cardinality of the groups is equal to the input's > cardinality. I think even in that case we should try to use a remote > aggregate since it's a computation that will not happen on the local > server. I also think we're more likely to have up to date statistics > remotely than the ones collected locally on the foreign tables, and the > estimated number of groups would be more accurate on the remote side than > the local one. I took some time to toy with this again. 
At first I thought that charging a discount in foreign grouping paths for Aggref targets (since they are computed remotely) would be a good idea, similar to what is done for the grouping keys. But in the end, it's probably not something we would like to do. Yes, the group planning will generally be more accurate on the remote side (better statistics than locally for estimating the number of groups), but executing the grouping locally when the number of groups is close to the input's cardinality (ex: group by unique_key) gives us a form of parallelism which can actually perform better. For the other cases where there are fewer output than input tuples, that is, when an actual grouping takes place, adjusting fdw_tuple_cost might be enough to tune the behavior to what is desirable. -- Ronan Dunklau
Re: pg_receivewal starting position
Le vendredi 29 octobre 2021, 04:27:51 CEST Michael Paquier a écrit : > On Thu, Oct 28, 2021 at 03:55:12PM +0200, Ronan Dunklau wrote: > > Interesting ideas, thanks. For the record, the time drops from ~4.5s to 3s > > on average on my machine. > > I think if you reduce the size of the generate_series batches, this should > > probably be reduced everywhere. With what we do though, inserting a single > > line should work just as well, I wonder why we insist on inserting a > > hundred lines ? I updated your patch with that small modification, it > > also makes the code less verbose. > > Thanks for the extra numbers. I have added your suggestions, > switching the dummy table to use a primary key with different values, > while on it, as there is an argument that it makes debugging easier, > and applied the speedup patch. Thanks ! > > >> +$standby->psql('', > >> + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", > >> + replication => 1); > >> Here as well we could use a restart point to reduce the number of > >> segments archived. > > > > The restart point should be very close, as we don't generate any activity > > on the primary between the backup and the slot's creation. I'm not sure > > adding the complexity of triggering a checkpoint on the primary and > > waiting for the standby to catch up on it would be that useful. > > Yes, you are right here. The base backup taken from the primary > at this point ensures a fresh point. > > +# This test is split in two, using the same standby: one test check the > +# resume-from-folder case, the other the resume-from-slot one. > This comment needs a refresh, as the resume-from-folder case is no > more. > Done. > +$standby->psql( > + 'postgres', > + "SELECT pg_promote(wait_seconds => 300)"); > This could be $standby->promote. > Oh, didn't know about that. > +# Switch wal to make sure it is not a partial file but a complete > segment. 
> +$primary->psql('postgres', 'INSERT INTO test_table VALUES (1);'); > +$primary->psql('postgres', 'SELECT pg_switch_wal();'); > +$primary->wait_for_catchup($standby, 'replay', $primary->lsn('write')); > This INSERT needs a slight change to adapt to the primary key of the > table. This one is on me :p Done. > > Anyway, is this first segment switch really necessary? From the data > archived by pg_receivewal in the command testing the TLI jump, we > finish with the following contents (contents generated after fixing > the three INSERTs): > 0001000B > 0001000C > 0002000D > 0002000E.partial > 0002.history > > So, even if we don't do the first switch, we'd still have one > completed segment on the previous timeline, before switching to the > new timeline and the next segment (pg_receivewal is a bit inconsistent > with the backend here, by the way, as the first segment on the new > timeline would map with the last segment of the old timeline, but here > we have a clean switch as of stop_streaming in pg_receivewal.c). The first completed segment on the previous timeline comes from the fact we stream from the restart point. I removed the switch to use the walfilename of the replication slot's restart point instead. This means querying both the standby (to get the replication slot's restart_lsn) and the primary (to have access to pg_walfile_name). We could use a single query on the primary (using the primary's checkpoint LSN instead) but it feels a bit convoluted just to avoid a query on the standby. -- Ronan Dunklau>From eb65a8b5883d903dda78bb66c2e1795b48cafc24 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 26 Oct 2021 10:54:12 +0200 Subject: [PATCH v17] Add a test for pg_receivewal following timeline switch. pg_receivewal is able to follow a timeline switch, but this was not tested anywher. 
This test case verifies that it works as expected both when resuming from a replication slot and from the archive directory, which have different methods of retrieving the timeline it left off at. --- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 57 +++- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 2da200396e..1fe32758a0 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 31; +use Test::More tests => 35; program_help_ok('pg_receivewal'); program_version_ok('pg_receivewal'); @@ -208,3 +208,58 @@ $primary->command_
Re: pg_receivewal starting position
Le jeudi 28 octobre 2021, 14:31:36 CEST Michael Paquier a écrit : > On Wed, Oct 27, 2021 at 10:11:00AM +0200, Ronan Dunklau wrote: > > Sorry I sent an intermediary version of the patch, here is the correct > > one. > > While looking at this patch, I have figured out a simple way to make > the tests faster without impacting the coverage. The size of the WAL > segments archived is a bit of a bottleneck as they need to be zeroed > by pg_receivewal at creation. This finishes by being a waste of time, > even if we don't flush the data. So I'd like to switch the test so as > we use a segment size of 1MB, first. > > A second thing is that we copy too many segments than really needed > when using the slot's restart_lsn as starting point as RESERVE_WAL > would use the current redo location, so it seems to me that a > checkpoint is in order before the slot creation. A third thing is > that generating some extra data after the end LSN we want to use makes > the test much faster at the end. > > With those three methods combined, the test goes down from 11s to 9s > here. Attached is a patch I'd like to apply to make the test > cheaper. Interesting ideas, thanks. For the record, the time drops from ~4.5s to 3s on average on my machine. I think if you reduce the size of the generate_series batches, this should probably be reduced everywhere. With what we do though, inserting a single line should work just as well, I wonder why we insist on inserting a hundred lines ? I updated your patch with that small modification, it also makes the code less verbose. > > I also had a look at your patch. Here are some comments. > > +# Cleanup the previous stream directories to reuse them > +unlink glob "'${stream_dir}/*'"; > +unlink glob "'${slot_dir}/*'"; > I think that we'd better choose a different location for the > archives. Keeping the segments of the previous tests is useful for > debugging if a previous step of the test failed. Ok. 
> > +$standby->psql('', > + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", > + replication => 1); > Here as well we could use a restart point to reduce the number of > segments archived. The restart point should be very close, as we don't generate any activity on the primary between the backup and the slot's creation. I'm not sure adding the complexity of triggering a checkpoint on the primary and waiting for the standby to catch up on it would be that useful. > > +# Now, try to resume after the promotion, from the folder. > +$standby->command_ok( > + [ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, > +'--slot', $folder_slot, '--no-sync'], > + "Stream some wal after promoting, resuming from the folder's position"); > What is called the "resume-from-folder" test in the patch is IMO > too costly and brings little extra value, requiring two commands of > pg_receivewal (one to populate the folder and one to test the TLI jump > from the previously-populated point) to test basically the same thing > as when the starting point is taken from the slot. Except that > restarting from the slot is twice cheaper. The point of the > resume-from-folder case is to make sure that we restart from the point > of the archive folder rather than the slot's restart_lsn, but your > test fails this part, in fact, because the first command populating > the archive folder also uses "--slot $folder_slot", updating the > slot's restart_lsn before the second pg_receivewal uses this slot > again. > > I think, as a whole, that testing for the case where an archive folder > is populated (without a slot!), followed by a second command where we > use a slot that has a restart_lsn older than the archive's location, > to not be that interesting. If we include such a test, there is no > need to include that within the TLI jump part, in my opinion. 
So I > think that we had better drop this part of the patch, and keep only > the case where we resume from a slot for the TLI jump. You're right about the test not being that interesting in that case. I thought it would be worthwhile to test both cases, but resume_from_folder doesn't actually exercise any code that was not already called before (timeline computation from segment name). > The commands of pg_receivewal included in the test had better use -n > so as there is no infinite loop on failure. Ok. -- Ronan Dunklau>From ad09605b6f8b2d1cc91e67a255bf469363cad2b3 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Thu, 28 Oct 2021 15:12:04 +0200 Subject: [PATCH v16 2/2] Speed up pg_receivewal tests. Author: Michael Paquier --- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 21 ++-- 1 file changed, 11 insertions(+)
Re: pg_receivewal starting position
Le mercredi 27 octobre 2021, 10:00:40 CEST Ronan Dunklau a écrit : > Le mercredi 27 octobre 2021, 04:17:28 CEST Kyotaro Horiguchi a écrit : > > +my @walfiles = glob "$slot_dir/*"; > > > > This is not used. > > Sorry, fixed in attached version. > > > Each pg_receivewal run stalls for about 10 or more seconds before > > finishing, which is not great from the standpoint of recently > > increasing test run time. > > > > Maybe we want to advance LSN a bit, after taking $nextlsn then pass > > "-s 1" to pg_receivewal. > > I incorrectly assumed it was due to the promotion time without looking into > it. In fact, you're right the LSN was not incremented after we fetched the > end lsn, and thus we would wait for quite a while. I fixed that too. > > Thank you for the review ! Sorry I sent an intermediary version of the patch, here is the correct one. -- Ronan Dunklau>From 8fa22687e50f02539119568f98b6a11c077f1774 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 26 Oct 2021 10:54:12 +0200 Subject: [PATCH v15] Add a test for pg_receivewal following timeline switch. pg_receivewal is able to follow a timeline switch, but this was not tested anywher. This test case verify that it works as expected both when resuming from a replication slot and from the archive directory, which have different methods of retrieving the timeline it left off. 
--- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 88 +++- 1 file changed, 87 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 092c9b6f25..f561aca8f8 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 31; +use Test::More tests => 39; program_help_ok('pg_receivewal'); program_version_ok('pg_receivewal'); @@ -206,3 +206,89 @@ $primary->command_ok( "WAL streamed from the slot's restart_lsn"); ok(-e "$slot_dir/$walfile_streamed", "WAL from the slot's restart_lsn has been archived"); + +# Test a timeline switch. +# This test is split in two, using the same standby: one test check the +# resume-from-folder case, the other the resume-from-slot one. + +# Setup a standby for our tests +my $backup_name = "basebackup"; +$primary->backup($backup_name); +my $standby = PostgreSQL::Test::Cluster->new("standby"); +$standby->init_from_backup($primary, $backup_name, has_streaming => 1); +$standby->start; + +# Cleanup the previous stream directories to reuse them +unlink glob "'${stream_dir}/*'"; +unlink glob "'${slot_dir}/*'"; + +# Create two replication slots. +# First one is to make sure we keep the wal for the resume-from-folder case. 
+my $folder_slot = "folder_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Second one is for testing the resume-from-slot case +my $archive_slot = "archive_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $archive_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Get a walfilename from before the promotion to make sure it is archived +# after promotion +my $walfile_before_promotion = $primary->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +# Populate the stream_dir for the resume-from-folder case. +$nextlsn = $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); +# Switch wal to make sure it is not a partial file but a complete segment. +$primary->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$primary->psql('postgres', 'SELECT pg_switch_wal();'); + +$standby->run_log( +[ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, + '--slot', $folder_slot, '--no-sync'], +"Stream some wal before promoting"); + +# Everything is setup, promote the standby to trigger a timeline switch +$standby->psql( + 'postgres', + "SELECT pg_promote(wait_seconds => 300)"); + +# Force a wal switch to make sure at least one full WAL is archived on the new +# timeline, and fetch this walfilename. +my $walfile_after_promotion = $standby->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +$standby->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$standby->psql('postgres', 'SELECT pg_switch_wal();'); +$nextlsn = + $standby->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); +$standby->psql('postgres', + 'INSERT INTO test_table VALUES
Re: pg_receivewal starting position
Le mercredi 27 octobre 2021, 04:17:28 CEST Kyotaro Horiguchi a écrit :
> +my @walfiles = glob "$slot_dir/*";
>
> This is not used.

Sorry, fixed in attached version.

> Each pg_receivewal run stalls for about 10 or more seconds before
> finishing, which is not great from the standpoint of recently
> increasing test run time.
>
> Maybe we want to advance LSN a bit, after taking $nextlsn then pass
> "-s 1" to pg_receivewal.

I incorrectly assumed it was due to the promotion time without looking into it. In fact, you're right the LSN was not incremented after we fetched the end lsn, and thus we would wait for quite a while. I fixed that too.

Thank you for the review !

-- 
Ronan Dunklau

>From 04747199b7be896a62021e4f064b2342234427d5 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Tue, 26 Oct 2021 10:54:12 +0200
Subject: [PATCH v14] Add a test for pg_receivewal following timeline switch.

pg_receivewal is able to follow a timeline switch, but this was not tested
anywhere. This test case verifies that it works as expected both when
resuming from a replication slot and from the archive directory, which have
different methods of retrieving the timeline where it left off.
--- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 86 +++- 1 file changed, 85 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 092c9b6f25..28e4e9b3a4 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 31; +use Test::More tests => 39; program_help_ok('pg_receivewal'); program_version_ok('pg_receivewal'); @@ -206,3 +206,87 @@ $primary->command_ok( "WAL streamed from the slot's restart_lsn"); ok(-e "$slot_dir/$walfile_streamed", "WAL from the slot's restart_lsn has been archived"); + +# Test a timeline switch.
+# This test is split in two, using the same standby: one test check the +# resume-from-folder case, the other the resume-from-slot one. + +# Setup a standby for our tests +my $backup_name = "basebackup"; +$primary->backup($backup_name); +my $standby = PostgreSQL::Test::Cluster->new("standby"); +$standby->init_from_backup($primary, $backup_name, has_streaming => 1); +$standby->start; + +# Cleanup the previous stream directories to reuse them +unlink glob "'${stream_dir}/*'"; +unlink glob "'${slot_dir}/*'"; + +# Create two replication slots. +# First one is to make sure we keep the wal for the resume-from-folder case. +my $folder_slot = "folder_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Second one is for testing the resume-from-slot case +my $archive_slot = "archive_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $archive_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Get a walfilename from before the promotion to make sure it is archived +# after promotion +my $walfile_before_promotion = $primary->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +# Switch wal to make sure it is not a partial file but a complete segment. +$primary->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$primary->psql('postgres', 'SELECT pg_switch_wal();'); + +# Populate the stream_dir for the resume-from-folder case. 
+$nextlsn = $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); +$standby->run_log( +[ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, + '--slot', $folder_slot, '--no-sync'], +"Stream some wal before promoting"); + +# Everything is setup, promote the standby to trigger a timeline switch +$standby->psql( + 'postgres', + "SELECT pg_promote(wait_seconds => 300)"); + +# Force a wal switch to make sure at least one full WAL is archived on the new +# timeline, and fetch this walfilename. +my $walfile_after_promotion = $standby->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +$standby->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$standby->psql('postgres', 'SELECT pg_switch_wal();'); +$nextlsn = + $standby->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); + +# Now, try to resume after the promotion, from the folder. +$standby->command_ok( + [ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, +'--slot', $folder_slot, '--no-sync'], + "Stream some wal after promoting, resuming from the folder's position"); + +ok(-e "$stream_dir/$wal
Re: pg_receivewal starting position
Le mardi 26 octobre 2021, 08:27:47 CEST Ronan Dunklau a écrit :
> Yes, I will try to simplify the logic of the patch I sent last week. I'll
> keep you posted here soon.

I was able to simplify it quite a bit, by using only one standby for both test scenarios.

This test case verifies that after a timeline switch, if we resume from a previous state, we will archive:
- segments from the old timeline
- segments from the new timeline
- the timeline history file itself.

I chose to check against a full segment from the previous timeline, but it would have been possible to check that the latest timeline segment was partial. I chose not to, in the unlikely event we promote at an exact segment boundary. I don't think it matters much, since partial WAL files are already covered by other tests.

-- 
Ronan Dunklau

>From 0daf03e7e69aa11accb18df86d686004002115ff Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Tue, 26 Oct 2021 10:54:12 +0200
Subject: [PATCH v13] Add a test for pg_receivewal following timeline switch.

pg_receivewal is able to follow a timeline switch, but this was not tested
anywhere. This test case verifies that it works as expected both when
resuming from a replication slot and from the archive directory, which have
different methods of retrieving the timeline where it left off.
--- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 87 +++- 1 file changed, 86 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 092c9b6f25..f285e9b3f3 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 31; +use Test::More tests => 39; program_help_ok('pg_receivewal'); program_version_ok('pg_receivewal'); @@ -206,3 +206,88 @@ $primary->command_ok( "WAL streamed from the slot's restart_lsn"); ok(-e "$slot_dir/$walfile_streamed", "WAL from the slot's restart_lsn has been archived"); + +# Test a timeline switch. +# This test is split in two, using the same standby: one test check the +# resume-from-folder case, the other the resume-from-slot one. + +# Setup a standby for our tests +my $backup_name = "basebackup"; +$primary->backup($backup_name); +my $standby = PostgreSQL::Test::Cluster->new("standby"); +$standby->init_from_backup($primary, $backup_name, has_streaming => 1); +$standby->start; + +# Cleanup the previous stream directories to reuse them +unlink glob "'${stream_dir}/*'"; +unlink glob "'${slot_dir}/*'"; + +# Create two replication slots. +# First one is to make sure we keep the wal for the resume-from-folder case. 
+my $folder_slot = "folder_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Second one is for testing the resume-from-slot case +my $archive_slot = "archive_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $archive_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Get a walfilename from before the promotion to make sure it is archived +# after promotion +my $walfile_before_promotion = $primary->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +# Switch wal to make sure it is not a partial file but a complete segment. +$primary->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$primary->psql('postgres', 'SELECT pg_switch_wal();'); + +# Populate the stream_dir for the resume-from-folder case. +$nextlsn = $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); +$standby->run_log( +[ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, + '--slot', $folder_slot, '--no-sync'], +"Stream some wal before promoting"); + +# Everything is setup, promote the standby to trigger a timeline switch +$standby->psql( + 'postgres', + "SELECT pg_promote(wait_seconds => 300)"); + +# Force a wal switch to make sure at least one full WAL is archived on the new +# timeline, and fetch this walfilename. +my $walfile_after_promotion = $standby->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +$standby->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$standby->psql('postgres', 'SELECT pg_switch_wal();'); +$nextlsn = + $standby->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); + +# Now, try to resume after the promotion, from the folder. +$standby->command_ok( + [ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, +'--slot', $folder_slot, '--no-sync'], + "Stream some wal after
Re: pg_receivewal starting position
Le mardi 26 octobre 2021, 03:15:40 CEST Michael Paquier a écrit :
> I have changed the patch per Ronan's suggestion to have the version
> check out of GetSlotInformation(), addressed what you have reported,
> and the result looked good. So I have applied this part.

Thanks !

> What remains on this thread is the addition of new tests to make sure
> that pg_receivewal is able to follow a timeline switch. Now that we
> can restart from a slot, that should be a bit easier to implement as
> a test by creating a slot on a standby. Ronan, are you planning to
> send a new patch for this part?

Yes, I will try to simplify the logic of the patch I sent last week. I'll keep you posted here soon.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le lundi 25 octobre 2021, 10:10:13 CEST Michael Paquier a écrit :
> On Mon, Oct 25, 2021 at 09:50:01AM +0200, Ronan Dunklau wrote:
> > Le lundi 25 octobre 2021, 09:40:10 CEST Michael Paquier a écrit :
> > > On Mon, Oct 25, 2021 at 09:15:32AM +0200, Ronan Dunklau wrote:
> > > > Does it make sense though ? The NULL slot_name case handling is pretty
> > > > straightforward, as it will be handled by string formatting, but in the
> > > > case of a null restart_lsn, we have no way of knowing if the command was
> > > > issued at all.
> > >
> > > If I am following your point, I don't think that it matters much here,
> > > and it seems useful to me to be able to pass NULL for both of them, so
> > > as one can check if the slot exists or not with an API designed this
> > > way.
> >
> > You're right, but I'm afraid we would have to check the server version
> > twice in any case different from the basic pg_receivewal one (once in the
> > function itself, and once before calling it if we want a meaningful
> > result). Maybe we should move the version check outside the
> > GetSlotInformation function to avoid this, and let it fail with a syntax
> > error when the server doesn't support it ?
>
> With the approach taken by the patch, we fall down silently to the
> previous behavior if we connect to a server <= 14, and rely on the new
> behavior with a server >= 15, ensuring compatibility. Why would you
> want to make sure that the command is executed when we should just
> enforce that the old behavior is what happens when there is a slot
> defined and a backend <= 14?

Sorry, I haven't been clear. For the use case of this patch, the current approach is perfect (and we supply the restart_lsn). However, if we want to support the case of "just check if the slot exists", we need to make sure the command is actually executed, and check the version before calling the function, which would make the check executed twice.
What I'm proposing is just that we leave the responsibility of checking PQserverVersion() to the caller, and remove it from GetSlotInformation, i.e.:

-	if (replication_slot != NULL)
+	if (replication_slot != NULL && PQserverVersion(conn) >= 15)
	{
		if (!GetSlotInformation(conn, replication_slot, , ))

That way, if we introduce a caller wanting to use this function as an API to check that a slot exists, the usage of checking the server version beforehand will be consistent.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le lundi 25 octobre 2021, 09:40:10 CEST Michael Paquier a écrit :
> On Mon, Oct 25, 2021 at 09:15:32AM +0200, Ronan Dunklau wrote:
> > Does it make sense though ? The NULL slot_name case handling is pretty
> > straightforward, as it will be handled by string formatting, but in the
> > case of a null restart_lsn, we have no way of knowing if the command was
> > issued at all.
>
> If I am following your point, I don't think that it matters much here,
> and it seems useful to me to be able to pass NULL for both of them, so
> as one can check if the slot exists or not with an API designed this
> way.

You're right, but I'm afraid we would have to check the server version twice in any case different from the basic pg_receivewal one (once in the function itself, and once before calling it if we want a meaningful result). Maybe we should move the version check outside the GetSlotInformation function to avoid this, and let it fail with a syntax error when the server doesn't support it ?

-- 
Ronan Dunklau
Re: Non-superuser subscription owners
Le mercredi 20 octobre 2021, 20:40:39 CEST Mark Dilger a écrit :
> These patches have been split off the now deprecated monolithic "Delegating
> superuser tasks to new security roles" thread at [1].
>
> The purpose of these patches is to allow non-superuser subscription owners
> without risk of them overwriting tables they lack privilege to write
> directly. This both allows subscriptions to be managed by non-superusers,
> and protects servers with subscriptions from malicious activity on the
> publisher side.

Thank you Mark for splitting this. This patch looks good to me, and provides both better security (by closing the "dropping superuser role" loophole) and useful features.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le lundi 25 octobre 2021, 00:43:11 CEST Michael Paquier a écrit :
> On Sun, Oct 24, 2021 at 07:20:57PM +0530, Bharath Rupireddy wrote:
> > Thanks. I've no further comments on the v10 patch.
>
> Okay, thanks. I have applied this part, then.
> --
> Michael

Thank you all for your work on this patch.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le lundi 25 octobre 2021, 08:51:23 CEST Michael Paquier a écrit :
> On Thu, Oct 21, 2021 at 10:36:42AM +0200, Ronan Dunklau wrote:
> > Done. I haven't touched the timeline switch test patch for now, but I
> > still include it here for completeness.
>
> As 0001 and 0002 have been applied, I have put my hands on 0003, and
> found a couple of issues upon review.
>
> +	Assert(slot_name != NULL);
> +	Assert(restart_lsn != NULL);
> There is no need for those asserts, as we should support the case
> where the caller gives NULL for those variables.

Does it make sense though ? The NULL slot_name case handling is pretty straightforward, as it will be handled by string formatting, but in the case of a null restart_lsn, we have no way of knowing if the command was issued at all.

> +	if (PQserverVersion(conn) < 15)
> +		return false;
> Returning false is incorrect for older server versions as we won't
> fall back to the old method when streaming from an older server. What
> this needs to do is return true and set restart_lsn to
> InvalidXLogRecPtr, so as pg_receivewal would just stream from the
> current flush location. "false" should just be returned on error,
> with pg_log_error().

Thank you, this was an oversight when moving from the more complicated error handling code.

> +$primary->psql('postgres',
> +	'INSERT INTO test_table VALUES (generate_series(1,100));');
> +$primary->psql('postgres', 'SELECT pg_switch_wal();');
> +$nextlsn =
> +	$primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
> +chomp($nextlsn);
> There is no need to switch twice to a new WAL segment as we just need
> to be sure that the WAL segment of the restart_lsn is the one
> archived. Note that RESERVE_WAL uses the last redo point, so it is
> better to use a checkpoint and reduce the number of logs we stream
> into the new location.
>
> Better to add some --no-sync to the new commands of pg_receivewal, to
> not stress the I/O more than necessary. I have added some extra -n
> while on it to avoid loops on failure.
>
> Attached is the updated patch I am finishing with, which is rather
> clean now. I have tweaked a couple of things while on it, and
> documented better the new GetSlotInformation() in streamutil.c.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le jeudi 21 octobre 2021, 09:21:44 CEST Michael Paquier a écrit :
> On Thu, Oct 21, 2021 at 08:29:54AM +0200, Ronan Dunklau wrote:
> > Ok, do you want me to propose a different patch for previous versions ?
>
> That's not necessary. Thanks for proposing.
>
> > Do you mean restart_lsn as the pointer argument to the function, or
> > restart_lsn as the field returned by the command ? If it's the first, I'll
> > change it, but if it's the latter it is expected that we sometimes run this
> > on a slot where WAL has never been reserved yet.
>
> restart_lsn as the pointer of the function.

Done. I haven't touched the timeline switch test patch for now, but I still include it here for completeness.

-- 
Ronan Dunklau

>From b3703868d230f773378742a0deb4360f6cc7343a Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Thu, 21 Oct 2021 13:21:43 +0900
Subject: [PATCH v10 1/4] doc: Describe calculation method of streaming start
 for pg_receivewal

The documentation was imprecise about the fact that the current WAL flush
location is used if nothing can be found in the local archive directory,
independently of the compression used by each segment (ZLIB or
uncompressed).
---
 doc/src/sgml/ref/pg_receivewal.sgml | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml index 45b544cf49..9adbddb657 100644 --- a/doc/src/sgml/ref/pg_receivewal.sgml +++ b/doc/src/sgml/ref/pg_receivewal.sgml @@ -75,6 +75,29 @@ PostgreSQL documentation one session available for the stream. + + The starting point of the write-ahead log streaming is calculated when + pg_receivewal starts: + + + + First, scan the directory where the WAL segment files are written and + find the newest completed segment, using as starting point the beginning + of the next WAL segment file. This is calculated independently of the + compression method used to compress each segment.
+ + + + + + If a starting point cannot be calculated with the previous method, + the latest WAL flush location is used as reported by the server from + a IDENTIFY_SYSTEM command. + + + + + If the connection is lost, or if it cannot be initially established, with a non-fatal error, pg_receivewal will -- 2.33.0 >From 067353a88803e96b295f425167c1195ccf984352 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 21 Oct 2021 14:23:41 +0900 Subject: [PATCH v10 2/4] Add READ_REPLICATION_SLOT --- doc/src/sgml/protocol.sgml | 48 +++ src/backend/replication/repl_gram.y | 16 +++- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c | 93 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 11 +++ src/test/recovery/t/001_stream_rep.pl | 32 ++- src/test/recovery/t/006_logical_decoding.pl | 11 ++- src/tools/pgindent/typedefs.list| 1 + 9 files changed, 211 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index b95cc88599..37c26ec8ae 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2067,6 +2067,54 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read some information associated to a replication slot. Returns a tuple + with NULL values if the replication slot does not + exist. This command is currently only supported for physical replication + slots. + + + In response to this command, the server will return a one-row result set, + containing the following fields: + + +slot_type (text) + + + The replication slot's type, either physical or + NULL. + + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +restart_tli (int4) + + + The timeline ID associated to restart_lsn, + following the current timeline history. 
+ + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ] START_REPLICATION diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index 126380e2df..dcb1108579 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void); /* Keyword tokens. */ %token K_BASE_BACKUP %token K_IDENTIFY_SYSTEM +%token K_READ_REPLICATION_SLOT %token K_SHOW %token K_START_REPLICATION %token K_CREATE_REPLICATION_SLOT @@ -94,7 +95,7 @@ st
Re: pg_receivewal starting position
Le jeudi 21 octobre 2021, 07:35:08 CEST Michael Paquier a écrit :
> On Wed, Oct 20, 2021 at 02:58:26PM +0200, Ronan Dunklau wrote:
> > After sending the previous patch suite, I figured it would be worthwhile
> > to also have tests covering timeline switches, which was not covered
> > before.
>
> That seems independent to me. I'll take a look.
>
> > So please find attached a new version with an additional patch for those
> > tests, covering both "resume from last known archive" and "resume from
> > the replication slot's position" cases.
>
> So, taking things in order, I have looked at 0003 and 0001, and
> attached are refined versions for both of them.
>
> 0003 is an existing hole in the docs, which I think we had better
> address first and backpatch, taking into account that the starting
> point calculation considers compressed segments when looking for
> completed segments.

Ok, do you want me to propose a different patch for previous versions ?

> Regarding 0001, I have found the last test to check for NULL values
> returned by READ_REPLICATION_SLOT after dropping the slot overlaps
> with the first test, so I have removed that. I have expanded a bit
> the use of like(), and there was some confusion with
> PostgresNode::psql and some extra arguments (see DROP_REPLICATION_SLOT
> and CREATE_REPLICATION_SLOT, and no need for return values in the
> CREATE case either). Some comments, docs and code have been slightly
> tweaked.

Thank you for this.

> Here are some comments about 0002.
>
> +	/* The commpand should always return precisely one tuple */
> s/commpand/command/
>
> +	pg_log_error("could not fetch replication slot: got %d rows and %d fields, expected %d rows and %d or more fields",
> +				 PQntuples(res), PQnfields(res), 1, 3);
> Should this be "could not read" instead?
>
> +	if (sscanf(PQgetvalue(res, 0, 1), "%X/%X", , ) != 2)
> +	{
> +		pg_log_error("could not parse slot's restart_lsn \"%s\"",
> +					 PQgetvalue(res, 0, 1));
> +		PQclear(res);
> +		return false;
> +	}
> Wouldn't it be saner to initialize *restart_lsn and *restart_tli to
> some default values at the top of GetSlotInformation() instead, if
> they are specified by the caller?

Ok.

> And I think that we should still
> complain even if restart_lsn is NULL.

Do you mean restart_lsn as the pointer argument to the function, or restart_lsn as the field returned by the command ? If it's the first, I'll change it, but if it's the latter it is expected that we sometimes run this on a slot where WAL has never been reserved yet.

> On a quick read of 0004, I find the split of the logic with
> change_timeline() a bit hard to understand. It looks like we should
> be able to make a cleaner split, but I am not sure how that would
> look, though.

Thanks; at least if the proposal to test this seems sensible I can move forward. I wanted to avoid having a lot of code duplication since the test setup is a bit more complicated. My first approach was to split it into two functions, setup_standby and change_timeline, but then realized that what would happen between the two invocations would basically be the same for the two test cases, so I ended up with that patch. I'll try to see if I can find a better way of organizing that code.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le mercredi 20 octobre 2021 11:40:18 CEST, vous avez écrit :
> > +# Setup the slot, and connect to it a first time
> > +$primary->run_log(
> > +	[ 'pg_receivewal', '--slot', $slot_name, '--create-slot' ],
> > +	'creating a replication slot');
> > +$primary->psql('postgres',
> > +	'INSERT INTO test_table VALUES (generate_series(1,100));');
> > +$primary->psql('postgres', 'SELECT pg_switch_wal();');
> > +$nextlsn =
> > +	$primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
> > +chomp($nextlsn);
> > Wouldn't it be simpler to use CREATE_REPLICATION_SLOT with RESERVE_WAL
> > here, rather than going through pg_receivewal? It seems to me that
> > this would be cheaper without really impacting the coverage.
>
> You're right, we can skip two invocations of pg_receivewal like this (for
> the slot creation + for starting the slot a first time).

After sending the previous patch suite, I figured it would be worthwhile to also have tests covering timeline switches, which was not covered before. So please find attached a new version with an additional patch for those tests, covering both "resume from last known archive" and "resume from the replication slot's position" cases.
-- Ronan Dunklau>From 5a47f17a17594cc171f744ce383ba820d44b6446 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 2 Sep 2021 16:25:25 +0900 Subject: [PATCH v8 1/4] Add READ_REPLICATION_SLOT command --- doc/src/sgml/protocol.sgml | 47 +++ src/backend/replication/repl_gram.y | 16 +++- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c | 89 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 10 +++ src/test/recovery/t/001_stream_rep.pl | 47 ++- src/test/recovery/t/006_logical_decoding.pl | 9 ++- src/tools/pgindent/typedefs.list| 1 + 9 files changed, 218 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index b95cc88599..51a15cc3da 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2067,6 +2067,53 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read the information of a replication slot. Returns a tuple with + NULL values if the replication slot does not + exist. This command is currently only supported for physical slots. + + + In response to this command, the server will return a one-row result set, + containing the following fields: + + +slot_type (text) + + + The replication slot's type, either physical or + NULL. + + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +restart_tli (int4) + + + The timeline ID for the restart_lsn position, + when following the current timeline history. + + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ] START_REPLICATION diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index 126380e2df..913a99da5a 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void); /* Keyword tokens. 
*/ %token K_BASE_BACKUP %token K_IDENTIFY_SYSTEM +%token K_READ_REPLICATION_SLOT %token K_SHOW %token K_START_REPLICATION %token K_CREATE_REPLICATION_SLOT @@ -94,7 +95,7 @@ static SQLCmd *make_sqlcmd(void); %type command %type base_backup start_replication start_logical_replication create_replication_slot drop_replication_slot identify_system -timeline_history show sql_cmd +read_replication_slot timeline_history show sql_cmd %type base_backup_legacy_opt_list generic_option_list %type base_backup_legacy_opt generic_option %type opt_timeline @@ -120,6 +121,7 @@ opt_semicolon: ';' command: identify_system + | read_replication_slot | base_backup | start_replication | start_logical_replication @@ -140,6 +142,18 @@ identify_system: } ; +/* + * READ_REPLICATION_SLOT %s + */ +read_replication_slot: + K_READ_REPLICATION_SLOT var_name +{ + ReadReplicationSlotCmd *n = makeNode(ReadReplicationSlotCmd); + n->slotname = $2; + $$ = (Node *) n; +} + ; + /* * SHOW setting */ diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l index c038a636c3..1b599c255e 100644 --- a/src/backend/replication/repl_scanner.l +++ b/src/backend/replication/repl_scanne
Re: pg_receivewal starting position
Le mercredi 20 octobre 2021, 07:13:15 CEST Michael Paquier a écrit : > On Tue, Oct 19, 2021 at 05:32:55PM +0200, Ronan Dunklau wrote: > > Following recommendations, I stripped most of the features from the patch. > > For now we support only physical replication slots, and only provide the > > two fields of interest (restart_lsn, restart_tli) in addition to the slot > > type (fixed at physical) to not paint ourselves in a corner. > > > > I also removed the part about pg_basebackup since other fixes have been > > proposed for that case. > > Patch 0001 looks rather clean. I have a couple of comments. Thank you for the quick review ! > > + if (OidIsValid(slot_contents.data.database)) > + elog(ERROR, "READ_REPLICATION_SLOT is only supported for > physical slots"); > > elog() can only be used for internal errors. Errors that can be > triggered by a user should use ereport() instead. Ok. > > +ok($stdout eq '||', > + "READ_REPLICATION_SLOT returns NULL values if slot does not exist"); > [...] > +ok($stdout =~ 'physical\|[^|]*\|1', > + "READ_REPLICATION_SLOT returns tuple corresponding to the slot"); > Isn't result pattern matching something we usually test with like()? Ok. > > +($ret, $stdout, $stderr) = $node_primary->psql( > + 'postgres', > + "READ_REPLICATION_SLOT $slotname;", > + extra_params => [ '-d', $connstr_rep ]); > No need for extra_params in this test. You can just pass down > "replication => 1" instead, no? In that test file, every replication connection is obtained by using connstr_rep so I thought it would be best to use the same thing. > > --- a/src/test/recovery/t/006_logical_decoding.pl > +++ b/src/test/recovery/t/006_logical_decoding.pl > [...] > +ok($stderr=~ 'READ_REPLICATION_SLOT is only supported for physical slots', > + 'Logical replication slot is not supported'); > This one should use like(). Ok. > > + > +The slot's restart_lsn can also beused as a > starting +point if the target directory is empty. 
> + > I am not sure that there is a need for this addition as the same thing > is said when describing the lookup ordering. Ok, removed. > > + If nothing is found and a slot is specified, use the > + READ_REPLICATION_SLOT > + command. > It may be clearer to say that the position is retrieved from the > command. Ok, done. The doc also uses the active voice here now. > > +bool > +GetSlotInformation(PGconn *conn, const char *slot_name, XLogRecPtr > *restart_lsn, TimeLineID* restart_tli) > +{ > Could you extend that so as we still run the command but don't crash > if the caller specifies NULL for any of the result fields? This would > be handy. Done. > > + if (PQgetisnull(res, 0, 0)) > + { > + PQclear(res); > + pg_log_error("replication slot \"%s\" does not exist", > slot_name); > + return false; > + } > + if (PQntuples(res) != 1 || PQnfields(res) < 3) > + { > + pg_log_error("could not fetch replication slot: got %d rows > and %d fields, expected %d rows and %d or more fields", > +PQntuples(res), PQnfields(res), 1, 3); > + PQclear(res); > + return false; > + } > Wouldn't it be better to reverse the order of these two checks? Yes it is, and the PQntuples condition should be removed from the first error test. > > I don't mind the addition of the slot type being part of the result of > READ_REPLICATION_SLOT even if it is not mandatory (?), but at least > GetSlotInformation() should check after it. Ok. > > +# Setup the slot, and connect to it a first time > +$primary->run_log( > + [ 'pg_receivewal', '--slot', $slot_name, '--create-slot' ], > + 'creating a replication slot'); > +$primary->psql('postgres', > + 'INSERT INTO test_table VALUES (generate_series(1,100));'); > +$primary->psql('postgres', 'SELECT pg_switch_wal();'); > +$nextlsn = > + $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); > +chomp($nextlsn); > Wouldn't it be simpler to use CREATE_REPLICATION_SLOT with RESERVE_WAL > here, rather than going through pg_receivewal? 
It seems to me that > this would be cheaper without really impacting the coverage. You're right, we can skip two invocations of pg_receivewal like this (for the slot creation + for starting the slot a first time). -- Ronan Dunklau>From 845b2fdd33318f0d7528ef0ecbb69e4aecf43c46 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 6 Sep 2021 11:08:54 +0200 Subject: [PATCH v7 3/3] Add documentation for pg_receivewal --- doc/src/sgml/ref/pg_receivewal.sgml |
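The earlier review point about reversing the two result checks in GetSlotInformation — validate the shape of the result set before inspecting the row for NULLs — can be sketched outside of libpq as follows (a hedged illustration; the names and messages only mirror the quoted C snippet, this is not the patch itself):

```python
def check_slot_result(ntuples, nfields, first_field_is_null, slot_name):
    """Shape check first: a malformed result must not be read at all;
    only then inspect the row contents for a NULL slot_type."""
    if ntuples != 1 or nfields < 3:
        return ("could not fetch replication slot: got %d rows and %d fields,"
                " expected %d rows and %d or more fields"
                % (ntuples, nfields, 1, 3))
    if first_field_is_null:
        return 'replication slot "%s" does not exist' % slot_name
    return None  # result is usable

# A zero-row result now reports a malformed response rather than
# being misread as a nonexistent slot.
print(check_slot_result(0, 0, False, "phys_slot"))
```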
Re: pg_receivewal starting position
Hello, Following recommendations, I stripped most of the features from the patch. For now we support only physical replication slots, and only provide the two fields of interest (restart_lsn, restart_tli) in addition to the slot type (fixed at physical) to not paint ourselves into a corner. I also removed the part about pg_basebackup since other fixes have been proposed for that case. Le vendredi 1 octobre 2021, 09:05:18 CEST Michael Paquier a écrit : > On Mon, Sep 06, 2021 at 04:17:28PM +0900, Michael Paquier wrote: > > Using READ_REPLICATION_SLOT as the command name is fine, and it could > > be extended with more fields if necessary, implemented now with only > > what we think is useful. Returning errors on cases that are still not > > supported yet is fine, say for logical slots if we decide to reject > > the case for now, and restrictions can always be lifted in the > > future. > > And marked as RwF as this was three weeks ago. Please feel free to > register a new entry if this is being worked on. 
> -- > Michael -- Ronan Dunklau>From 4147665664164eb597fdcc86637ec9c497c36660 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 2 Sep 2021 16:25:25 +0900 Subject: [PATCH v6 1/3] Add READ_REPLICATION_SLOT command --- doc/src/sgml/protocol.sgml | 47 +++ src/backend/replication/repl_gram.y | 16 +++- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c | 87 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 10 +++ src/test/recovery/t/001_stream_rep.pl | 47 ++- src/test/recovery/t/006_logical_decoding.pl | 9 ++- src/tools/pgindent/typedefs.list| 1 + 9 files changed, 216 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index b95cc88599..51a15cc3da 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2067,6 +2067,53 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read the information of a replication slot. Returns a tuple with + NULL values if the replication slot does not + exist. This command is currently only supported for physical slots. + + + In response to this command, the server will return a one-row result set, + containing the following fields: + + +slot_type (text) + + + The replication slot's type, either physical or + NULL. + + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +restart_tli (int4) + + + The timeline ID for the restart_lsn position, + when following the current timeline history. + + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ] START_REPLICATION diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index 126380e2df..913a99da5a 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void); /* Keyword tokens. 
*/ %token K_BASE_BACKUP %token K_IDENTIFY_SYSTEM +%token K_READ_REPLICATION_SLOT %token K_SHOW %token K_START_REPLICATION %token K_CREATE_REPLICATION_SLOT @@ -94,7 +95,7 @@ static SQLCmd *make_sqlcmd(void); %type command %type base_backup start_replication start_logical_replication create_replication_slot drop_replication_slot identify_system -timeline_history show sql_cmd +read_replication_slot timeline_history show sql_cmd %type base_backup_legacy_opt_list generic_option_list %type base_backup_legacy_opt generic_option %type opt_timeline @@ -120,6 +121,7 @@ opt_semicolon: ';' command: identify_system + | read_replication_slot | base_backup | start_replication | start_logical_replication @@ -140,6 +142,18 @@ identify_system: } ; +/* + * READ_REPLICATION_SLOT %s + */ +read_replication_slot: + K_READ_REPLICATION_SLOT var_name +{ + ReadReplicationSlotCmd *n = makeNode(ReadReplicationSlotCmd); + n->slotname = $2; + $$ = (Node *) n; +} + ; + /* * SHOW setting */ diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l index c038a636c3..1b599c255e 100644 --- a/src/backend/replication/repl_scanner.l +++ b/src/backend/replication/repl_scanner.l @@ -85,6 +85,7 @@ identifier {ident_start}{ident_cont}* BASE_BACKUP { return K_BASE_BACKUP; } FAST { return K_FAST; } IDENTIFY_SYSTEM { return K_IDENTIFY_SYSTEM; } +READ_REPLICATION_SLOT { return K_READ_REPLICATION_SLOT; } S
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Le lundi 27 septembre 2021, 20:15:05 CEST Mark Dilger a écrit : > > On Sep 21, 2021, at 12:58 PM, Andrew Dunstan wrote: > > > > This patch set is failing to apply for me - it fails on patch 2. > > Thanks for looking! I have pulled together a new patch set which applies > cleanly against current master. > > I haven't dug terribly deeply into it yet, but I notice that there is a > > very large increase in test volume, which appears to account for much of > > the 44635 lines of the patch set. I think we're probably going to want > > to reduce that. We've had complaints in the past from prominent hackers > > about adding too much volume to the regression tests. > > The v8 patch set is much smaller, with the reduction being in the size of > regression tests covering which roles can perform SET, RESET, ALTER SYSTEM > SET, and ALTER SYSTEM RESET and on which GUCs. The v7 patch set did > exhaustive testing on this, which is why it was so big. The v8 set does > just a sampling of GUCs per role. The total number of lines for the patch > set drops from 44635 to 13026, with only 1960 lines total between the .sql > and .out tests of GUC privileges. > > I do like the basic thrust of reducing the power of CREATEROLE. There's > > an old legal maxim I learned in my distant youth that says "nemo dat > > quod non habet" - Nobody can give something they don't own. This seems > > to be in that spirit, and I approve :-) > > Great! I'm glad to hear the approach has some support. > > > Other changes in v8: > > Add a new test for subscriptions owned by non-superusers to verify that the > tablesync and apply workers replicating their subscription won't replicate > into schemas and tables that the subscription owner lacks privilege to > touch. The logic to prevent that existed in the v7 patch, but I overlooked > adding tests for it. Fixed. > > Allow non-superusers to create event triggers. 
The logic already existed > and is unchanged to handle event triggers owned by non-superusers and > conditioning those triggers firing on the (trigger-owner, > role-performing-event) pair. The v7 patch set didn't go quite so far as > allowing non-superusers to create event triggers, but that undercuts much > of the benefit of the changes for no obvious purpose. > > Not changed in v8, but worth discussing: > > Non-superusers are still prohibited from creating subscriptions, despite > improvements to the security around that circumstance. Improvements to the > security model around event triggers does not have to also include > permission for non-superusers to create event triggers, but v8 does both. > These could be viewed as inconsistent choices, but I struck the balance > this way because roles creating event triggers does not entail them doing > anything that they can't already do, whereas allowing arbitrary users to > create subscriptions would entail an ordinary user causing external network > connections being initiated. We likely need to create another privileged > role and require a non-superuser to be part of that role before they can > create subscriptions. That seems, however, like something to do as a > follow-on patch, since tightening up the security on subscriptions as done > in this patch doesn't depend on that. The changes proposed around subscription management make a lot of sense, especially considering that, now that running ALTER SUBSCRIPTION REFRESH in a function is no longer allowed, there is no way to delegate this to a non-superuser (using a security definer function). Since they don't involve the rest of the patchset, patches 16, 17 and 18 could be separated into another thread / patchset. -- Ronan Dunklau
Re: Patch to avoid orphaned dependencies
Hello Bertrand, Le mardi 4 mai 2021, 11:55:43 CEST Drouvot, Bertrand a écrit : > > Implementation overview: > > * A new catalog snapshot is added: DirtyCatalogSnapshot. > * This catalog snapshot is a dirty one to be able to look for > in-flight dependencies. > * Its usage is controlled by a new UseDirtyCatalogSnapshot variable. > * Any time this variable is being set to true, then the next call to > GetNonHistoricCatalogSnapshot() is returning the dirty snapshot. > * This snapshot is being used to check for in-flight dependencies and > also to get the objects description to generate the error messages. > I quickly tested the patch, it behaves as advertised, and passes tests. Isolation tests should be added to demonstrate the issues it is solving. However, I am a bit wary of this behaviour of setting the DirtyCatalogSnapshot global variable which is then reset after each snapshot acquisition: I'm having trouble understanding all the implications of that, and whether an unforeseen snapshot could be introduced before the one we actually want to be dirty. I don't want to derail this thread, but couldn't predicate locks on the pg_depend index pages corresponding to the dropped object / referenced objects work as a different approach ? I'm not familiar enough with them so maybe there is some fundamental misunderstanding on my end. -- Ronan Dunklau
Re: Proposal: More structured logging
Le mercredi 8 septembre 2021, 11:51:31 CEST Peter Eisentraut a écrit : > On 01.09.21 10:00, Ronan Dunklau wrote: > > In-core it would open up the possibility to split log messages into > > different fields, for example the different statistics reported in the > > logs by VACUUM / ANALYZE VERBOSE and make it easier to consume the output > > without having to parse the message. Parsing the message also means that > > any tool parsing it needs to either be aware of the locale, or to force > > the user to use a specific one. > > I think those messages would themselves have substructure. For example, > the current message > > "I/O timings: read: %.3f ms, write: %.3f ms\n" > > might be > > {"I/O timings": {"read": ..., "write": ...}} > > and that in turn is already part of a larger message. > > So just having a single level of tags wouldn't really work for this. I agree having a nested structure may feel more natural, but I don't think it would matter too much if we standardize on ini-style property names (i.e., iotimings.read, iotimings.write and so on...) Regards, -- Ronan Dunklau
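Going from the nested form Peter describes to the dotted, ini-style names suggested here is mechanical; a quick illustrative sketch (the key names are hypothetical, not a proposed logging API):

```python
def flatten(tags, prefix=""):
    """Flatten nested log tags into dotted key/value pairs,
    e.g. {"iotimings": {"read": 1.2}} -> {"iotimings.read": 1.2}."""
    out = {}
    for key, value in tags.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out

print(flatten({"iotimings": {"read": 0.132, "write": 0.045}}))
# {'iotimings.read': 0.132, 'iotimings.write': 0.045}
```

So a consumer standardizing on flat dotted names loses nothing that the nested representation carries, as long as the dot is reserved as a separator.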
Re: ORDER BY pushdowns seem broken in postgres_fdw
Le lundi 6 septembre 2021, 11:25:39 CEST Zhihong Yu a écrit : > On Mon, Sep 6, 2021 at 1:17 AM Ronan Dunklau wrote: > > Le vendredi 3 septembre 2021, 22:54:25 CEST David Zhang a écrit : > > > The following review has been posted through the commitfest application: > > > make installcheck-world: tested, failed > > > Implements feature: tested, passed > > > Spec compliant: not tested > > > Documentation:not tested > > > > > > Applied the v6 patch to master branch and ran regression test for > > > > contrib, > > > > > the result was "All tests successful." > > > > What kind of error did you get running make installcheck-world ? If it > > passed > > the make check for contrib, I can't see why it would fail running make > > installcheck-world. > > > > In any case, I just checked and running make installcheck-world doesn't > > produce any error. > > > > Since HEAD had moved a bit since the last version, I rebased the patch, > > resulting in the attached v7. > > > > Best regards, > > > > -- > > Ronan Dunklau > > Hi, > bq. a pushed-down order by could return wrong results. > > Can you briefly summarize the approach for fixing the bug in the > description ? Done, let me know what you think about it. > > + * Returns true if it's safe to push down a sort as described by 'pathkey' > to > + * the foreign server > + */ > +bool > +is_foreign_pathkey(PlannerInfo *root, > > It would be better to name the method which reflects whether pushdown is > safe. e.g. is_pathkey_safe_for_pushdown. The convention used here is the same one as in is_foreign_expr and is_foreign_param, which are also related to pushdown-safety. -- Ronan Dunklau>From 8d0ebd4df2e5e72a8d490a65001c9971516a4337 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 6 Sep 2021 09:54:43 +0200 Subject: [PATCH v8] Fix orderby handling in postgres_fdw The logic for pushing down order bys in postgres fdw didn't take into account the specific operator used, and as such a pushed-down order by could return wrong results. 
This patch looks up the original operator associated to the pathkey opfamily, and checks that it actually exists on the foreign side. If it does, the operator is then used to rebuild an equivalent USING clause. --- contrib/postgres_fdw/deparse.c| 156 +- .../postgres_fdw/expected/postgres_fdw.out| 23 +++ contrib/postgres_fdw/postgres_fdw.c | 28 ++-- contrib/postgres_fdw/postgres_fdw.h | 11 +- contrib/postgres_fdw/sql/postgres_fdw.sql | 8 + src/backend/optimizer/path/equivclass.c | 26 ++- src/include/optimizer/paths.h | 2 + 7 files changed, 187 insertions(+), 67 deletions(-) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index d98bd66681..fb1b5f9d9b 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -37,9 +37,11 @@ #include "access/sysattr.h" #include "access/table.h" #include "catalog/pg_aggregate.h" +#include "catalog/pg_amop.h" #include "catalog/pg_collation.h" #include "catalog/pg_namespace.h" #include "catalog/pg_operator.h" +#include "catalog/pg_opfamily.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" #include "commands/defrem.h" @@ -47,6 +49,7 @@ #include "nodes/nodeFuncs.h" #include "nodes/plannodes.h" #include "optimizer/optimizer.h" +#include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" @@ -182,6 +185,8 @@ static void deparseRangeTblRef(StringInfo buf, PlannerInfo *root, Index ignore_rel, List **ignore_conds, List **params_list); static void deparseAggref(Aggref *node, deparse_expr_cxt *context); static void appendGroupByClause(List *tlist, deparse_expr_cxt *context); +static void appendOrderBySuffix(Oid sortop, Oid sortcoltype, bool nulls_first, +deparse_expr_cxt *context); static void appendAggOrderBy(List *orderList, List *targetList, deparse_expr_cxt *context); static void appendFunctionName(Oid funcid, deparse_expr_cxt *context); @@ -1037,6 +1042,41 @@ is_foreign_param(PlannerInfo *root, return false; } +/* + * 
Returns true if it's safe to push down a sort as described by 'pathkey' to + * the foreign server + */ +bool +is_foreign_pathkey(PlannerInfo *root, + RelOptInfo *baserel, + PathKey *pathkey) +{ + EquivalenceClass *pathkey_ec = pathkey->pk_eclass; + PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) baserel->fdw_private; + Expr *em_expr; + + /*
Re: ORDER BY pushdowns seem broken in postgres_fdw
Le vendredi 3 septembre 2021, 22:54:25 CEST David Zhang a écrit : > The following review has been posted through the commitfest application: > make installcheck-world: tested, failed > Implements feature: tested, passed > Spec compliant: not tested > Documentation:not tested > > Applied the v6 patch to master branch and ran regression test for contrib, > the result was "All tests successful." What kind of error did you get running make installcheck-world ? If it passed the make check for contrib, I can't see why it would fail running make installcheck-world. In any case, I just checked and running make installcheck-world doesn't produce any error. Since HEAD had moved a bit since the last version, I rebased the patch, resulting in the attached v7. Best regards, -- Ronan Dunklau >From ccfb0a3d7dae42f56c699aac6f699c3c2f08c812 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 6 Sep 2021 09:54:43 +0200 Subject: [PATCH v7] Fix orderby handling in postgres_fdw The logic for pushing down order bys in postgres fdw didn't take into account the specific operator used, and as such a pushed-down order by could return wrong results. 
--- contrib/postgres_fdw/deparse.c| 156 +- .../postgres_fdw/expected/postgres_fdw.out| 23 +++ contrib/postgres_fdw/postgres_fdw.c | 28 ++-- contrib/postgres_fdw/postgres_fdw.h | 11 +- contrib/postgres_fdw/sql/postgres_fdw.sql | 8 + src/backend/optimizer/path/equivclass.c | 26 ++- src/include/optimizer/paths.h | 2 + 7 files changed, 187 insertions(+), 67 deletions(-) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index d98bd66681..fb1b5f9d9b 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -37,9 +37,11 @@ #include "access/sysattr.h" #include "access/table.h" #include "catalog/pg_aggregate.h" +#include "catalog/pg_amop.h" #include "catalog/pg_collation.h" #include "catalog/pg_namespace.h" #include "catalog/pg_operator.h" +#include "catalog/pg_opfamily.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" #include "commands/defrem.h" @@ -47,6 +49,7 @@ #include "nodes/nodeFuncs.h" #include "nodes/plannodes.h" #include "optimizer/optimizer.h" +#include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" @@ -182,6 +185,8 @@ static void deparseRangeTblRef(StringInfo buf, PlannerInfo *root, Index ignore_rel, List **ignore_conds, List **params_list); static void deparseAggref(Aggref *node, deparse_expr_cxt *context); static void appendGroupByClause(List *tlist, deparse_expr_cxt *context); +static void appendOrderBySuffix(Oid sortop, Oid sortcoltype, bool nulls_first, +deparse_expr_cxt *context); static void appendAggOrderBy(List *orderList, List *targetList, deparse_expr_cxt *context); static void appendFunctionName(Oid funcid, deparse_expr_cxt *context); @@ -1037,6 +1042,41 @@ is_foreign_param(PlannerInfo *root, return false; } +/* + * Returns true if it's safe to push down a sort as described by 'pathkey' to + * the foreign server + */ +bool +is_foreign_pathkey(PlannerInfo *root, + RelOptInfo *baserel, + PathKey *pathkey) +{ + EquivalenceClass 
*pathkey_ec = pathkey->pk_eclass; + PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) baserel->fdw_private; + Expr *em_expr; + + /* + * is_foreign_expr would detect volatile expressions as well, but checking + * ec_has_volatile here saves some cycles. + */ + if (pathkey_ec->ec_has_volatile) + return false; + + /* can't push down the sort if the pathkey's opfamily is not shippable */ + if (!is_shippable(pathkey->pk_opfamily, OperatorFamilyRelationId, fpinfo)) + return false; + + em_expr = find_em_expr_for_rel(pathkey_ec, baserel); + if (em_expr == NULL) + return false; + + /* + * Finally, determine if it's safe to evaluate the found expr on the + * foreign server. + */ + return is_foreign_expr(root, baserel, em_expr); +} + /* * Convert type OID + typmod info into a type name we can ship to the remote * server. Someplace else had better have verified that this type name is @@ -3331,6 +3371,45 @@ deparseAggref(Aggref *node, deparse_expr_cxt *context) appendStringInfoChar(buf, ')'); } +/* + * Append the ASC, DESC, USING and NULLS FIRST / NULLS LAST part + * of the ORDER BY clause + */ +static void +appendOrderBySuffix(Oid sortop, Oid sortcoltype, bool nulls_first, + deparse_expr_cxt *context) +{ + StringInfo buf = context->buf; + TypeCacheEntry *typentry; + + typentry = lookup_type_cache(sortcoltype, TYPECACHE_LT_OPR | TYPECACHE_GT_OPR); + + if (sortop == typentry-
Re: pg_receivewal starting position
Le vendredi 3 septembre 2021 17:49:34 CEST, vous avez écrit : > On Fri, Sep 3, 2021 at 3:28 PM Ronan Dunklau wrote: > > There is still the concern raised by Bharath about being able to select > > the > > way to fetch the replication slot information for the user, and what > > should or should not be included in the documentation. I am in favor of > > documenting the process of selecting the wal start, and maybe include a > > --start-lsn option for the user to override it, but that's maybe for > > another patch. > > Let's hear from others. > > Thanks for the patches. I have some quick comments on the v5 patch-set: > > 0001: > 1) Do you also want to MemSet values too in ReadReplicationSlot? > > 2) When if clause has single statement we don't generally use "{" and "}" > + if (slot == NULL || !slot->in_use) > + { > + LWLockRelease(ReplicationSlotControlLock); > + } > you can just have: > + if (slot == NULL || !slot->in_use) > + LWLockRelease(ReplicationSlotControlLock); > > 3) This is still not copying the slot contents, as I said earlier you > can just take the required info into some local variables instead of > taking the slot pointer to slot_contents pointer. > + /* Copy slot contents while holding spinlock */ > + SpinLockAcquire(>mutex); > + slot_contents = *slot; > + SpinLockRelease(>mutex); > + LWLockRelease(ReplicationSlotControlLock); > > 4) As I said earlier, why can't we have separate tli variables for > more readability instead of just one slots_position_timeline and > timeline_history variable? And you are not even resetting those 2 > variables after if > (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)), you might end > up sending the restart_lsn timelineid for confirmed_flush. So, better > use two separate variables. 
In fact you can use block local variables: > + if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)) > + { > + List*tl_history= NIL; > + TimeLineID tli; > + tl_history= readTimeLineHistory(ThisTimeLineID); > + tli = tliOfPointInHistory(slot_contents.data.restart_lsn, > + tl_history); > + values[i] = Int32GetDatum(tli); > + nulls[i] = false; > + } > similarly for confirmed_flush. > > 5) I still don't see the need for below test case: > + "READ_REPLICATION_SLOT does not produce an error with non existent slot"); > when we have > + "READ_REPLICATION_SLOT does not produce an error with dropped slot"); > Because for a user, dropped or non existent slot is same, it's just > that for dropped slot we internally don't delete its entry from the > shared memory. > Thank you for reiterating those concerns. As I said, I haven't touched Michael's version of the patch. I was incorporating those changes in my branch before he sent this, so I'll probably merge both before sending an updated patch. > 0002: > 1) Imagine GetSlotInformation always returns > READ_REPLICATION_SLOT_ERROR, don't you think StreamLog enters an > infinite loop there? Instead, why don't you just exit(1); instead of > return; and retry? Similarly for READ_REPLICATION_SLOT_NONEXISTENT? I > think, you can just do exit(1), no need to retry. > + case READ_REPLICATION_SLOT_ERROR: > + > + /* > + * Error has been logged by GetSlotInformation, return and > + * maybe retry > + */ > + return; This is the same behaviour we had before: if there is an error with pg_receivewal we retry the command. There was no special case for the replication slot not existing before, I don't see why we should change it now ? 
Eg: 2021-09-06 09:11:07.774 CEST [95853] ERROR: replication slot "nonexistent" does not exist 2021-09-06 09:11:07.774 CEST [95853] STATEMENT: START_REPLICATION SLOT "nonexistent" 0/100 TIMELINE 1 pg_receivewal: error: could not send replication command "START_REPLICATION": ERROR: replication slot "nonexistent" does not exist pg_receivewal: disconnected; waiting 5 seconds to try again Users may rely on it to keep retrying in the background until the slot has been created for example. > > 2) Why is it returning READ_REPLICATION_SLOT_OK when slot_info isn't > passed? And why are you having this check after you connect to the > server and fetch the data? > + /* If no slotinformation has been passed, we can return immediately */ > + if (slot_info == NULL) > + { > + PQclear(res); > + return READ_REPLICATION_SLOT_OK; > + } > Instead you can just have a single assert: > > + Assert(slot_name && slot_info ); At first it was so that we didn't have to fill in all required information if we don't need too, but it turns out pg_basebackup also has to check for the slot's type. I agree we should not support the null slot_in
Re: pg_receivewal starting position
Le lundi 6 septembre 2021, 06:22:30 CEST Michael Paquier a écrit : > On Fri, Sep 03, 2021 at 11:58:27AM +0200, Ronan Dunklau wrote: > > 0002 for pg_receivewal tried to simplify the logic of information to > > return, by using a dedicated struct for this. This accounts for Bharath's > > demands to return every possible field. > > In particular, an is_logical field simplifies the detection of the type of > > slot. In my opinion it makes sense to simplify it like this on the client > > side while being more open-minded on the server side if we ever need to > > provide a new type of slot. Also, GetSlotInformation now returns an enum > > to be able to handle the different modes of failure, which differ > > between pg_receivewal and pg_basebackup. > > + if (PQserverVersion(conn) < 15) > + return READ_REPLICATION_SLOT_UNSUPPORTED; > [...] > +typedef enum { > + READ_REPLICATION_SLOT_OK, > + READ_REPLICATION_SLOT_UNSUPPORTED, > + READ_REPLICATION_SLOT_ERROR, > + READ_REPLICATION_SLOT_NONEXISTENT > +} ReadReplicationSlotStatus; > > Do you really need this much complication? We could treat the > unsupported case and the non-existing case the same way: we don't know > so just assign InvalidXlogRecPtr or NULL to the fields of the > structure, and make GetSlotInformation() return false just on error, > with some pg_log_error() where adapted in its internals. I actually started with the implementation you propose, but changed my mind while writing it because I realised it's easier to reason about like this, instead of failing silently during READ_REPLICATION_SLOT only to fail a bit later when actually trying to start replication from the slot because it doesn't exist. Either the user expects the replication slot to exist, and in this case we should retry the whole loop in the hope of getting an interesting LSN, or the user doesn't and shouldn't even pass a replication_slot name to begin with. 
> > > There is still the concern raised by Bharath about being able to select > > the > > way to fetch the replication slot information for the user, and what > > should or should not be included in the documentation. I am in favor of > > documenting the process of selecting the wal start, and maybe include a > > --start-lsn option for the user to override it, but that's maybe for > > another patch. > > The behavior of pg_receivewal that you are implementing should be > documented. We don't say either how the start location is selected > when working on an existing directory, so I would recommend to add a > paragraph in the description section to detail all that, as of: > - First a scan of the existing archives is done. > - If nothing is found, and if there is a slot, request the slot > information. > - If still nothing (aka slot undefined, or command not supported), use > the last flush location. Sounds good, I will add another patch for the documentation of this. > > As a whole, I am not really convinced that we need a new option for > that as long as we rely on a slot with pg_receivewal as these are used > to make sure that we avoid holes in the WAL archives. > > Regarding pg_basebackup, Daniel has proposed a couple of days ago a > different solution to trap errors earlier, which would cover the case > dealt with here: > https://www.postgresql.org/message-id/0F69E282-97F9-4DB7-8D6D-F927AA6340C8@y > esql.se I will take a look. > We should not mimic in the frontend errors that are safely trapped > in the backend with the proper locks, in any case. I don't understand what you mean by this ? My original proposal was for the command to actually attach to the replication slot while reading it, thus keeping a lock on it to prevent dropping or concurrent access on the server. > > While on it, READ_REPLICATION_SLOT returns a confirmed LSN when > grabbing the data of a logical slot. 
We are not going to use that > with pg_recvlogical as by default START_STREAMING 0/0 would just use > the confirmed LSN. Do we have use cases where this information would > help? There is the argument of consistency with physical slots and > that this can be helpful to do sanity checks for clients, but that's > rather thin. If we don't, we should rename the command to READ_PHYSICAL_REPLICATION_SLOT and error out on the server if the slot is not actually a physical one, to spare the client from performing those checks. I still think it's better to support both cases as opposed to having two completely different APIs (READ_(PHYSICAL)_REPLICATION_SLOT for physical ones on a replication connection, pg_replication_slots view for logical ones) as it would enable more use cases for third-party clients at a relatively low maintenance cost for us. -- Ronan Dunklau
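The three-step start-position selection Michael describes earlier in this message (existing archives first, then the slot, then the server's flush location) can be sketched as a simple precedence function; the names are hypothetical and for illustration only, not the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr	((XLogRecPtr) 0)

/*
 * Sketch of the documented precedence: existing archives first, then
 * the slot's restart_lsn (when a slot is used and the server supports
 * READ_REPLICATION_SLOT), and the server's flush location last.  A
 * source that yielded nothing passes InvalidXLogRecPtr.
 */
static XLogRecPtr
choose_start_position(XLogRecPtr from_archives,
					  XLogRecPtr from_slot,
					  XLogRecPtr server_flush)
{
	if (from_archives != InvalidXLogRecPtr)
		return from_archives;	/* resume after existing segments */
	if (from_slot != InvalidXLogRecPtr)
		return from_slot;		/* slot's restart_lsn, if available */
	return server_flush;		/* last resort: current server position */
}
```

The fallbacks only kick in when the preceding source yields nothing, which is exactly the behaviour the documentation paragraph should spell out.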
Re: pg_receivewal starting position
Le jeudi 2 septembre 2021, 10:37:20 CEST Michael Paquier a écrit : > On Thu, Sep 02, 2021 at 10:08:26AM +0200, Ronan Dunklau wrote: > > I could see the use for sending active_pid for use within pg_basebackup: > > at > > least we could fail early if the slot is already in use. But at the same > > time, maybe it won't be in use anymore once we need it. > > There is no real concurrent protection with this design. You could > read that the slot is not active during READ_REPLICATION_SLOT just to > find out after in the process of pg_basebackup streaming WAL that it > became in use in-between. And the backend-side protections would kick > in at this stage. > > Hmm. The logic doing the decision-making with pg_receivewal may > become more tricky when it comes to pg_replication_slots.wal_status, > max_slot_wal_keep_size and pg_replication_slots.safe_wal_size. The > number of cases we'd like to consider impacts directly the amount of > data sent through READ_REPLICATION_SLOT. That's not really different > than deciding of a failure, a success or a retry with active_pid at an > earlier or a later stage of a base backup. pg_receivewal, on the > contrary, can just rely on what the backend tells when it begins > streaming. So I'd prefer keeping things simple and limit the number > of fields to a minimum for this command. Ok. Please find attached new patches rebased on master. 0001 is yours without any modification. 0002 for pg_receivewal tried to simplify the logic of information to return, by using a dedicated struct for this. This accounts for Bharath's demands to return every possible field. In particular, an is_logical field simplifies the detection of the type of slot. In my opinion it makes sense to simplify it like this on the client side while being more open-minded on the server side if we ever need to provide a new type of slot.
Also, GetSlotInformation now returns an enum to be able to handle the different modes of failures, which differ between pg_receivewal and pg_basebackup. 0003 is the pg_basebackup one, not much changed except for the concerns you had about the log message and handling of different failure modes. There is still the concern raised by Bharath about being able to select the way to fetch the replication slot information for the user, and what should or should not be included in the documentation. I am in favor of documenting the process of selecting the wal start, and maybe include a --start-lsn option for the user to override it, but that's maybe for another patch. -- Ronan Dunklau>From 60b8cedb196f5acdd75b489c1d2155c2698084a4 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 2 Sep 2021 16:25:25 +0900 Subject: [PATCH v5 1/3] Add READ_REPLICATION_SLOT command --- doc/src/sgml/protocol.sgml | 66 src/backend/replication/repl_gram.y | 16 ++- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c | 112 src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 10 ++ src/test/recovery/t/001_stream_rep.pl | 47 +++- src/test/recovery/t/006_logical_decoding.pl | 15 ++- src/tools/pgindent/typedefs.list| 1 + 9 files changed, 266 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index a232546b1d..8191f17137 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2052,6 +2052,72 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read the information of a replication slot. Returns a tuple with + NULL values if the replication slot does not + exist. + + + In response to this command, the server will return a one-row result set, + containing the following fields: + + +slot_type (text) + + + The replication slot's type, either physical or + logical. 
+ + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +confirmed_flush_lsn (text) + + + The replication slot's confirmed_flush_lsn. + + + + + +restart_tli (int4) + + + The timeline ID for the restart_lsn position, + when following the current timeline history. + + + + + +confirmed_flush_tli (int4) + + + The timeline ID for the confirmed_flush_lsn + position, when following the current timeline history. + + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHY
Re: improve pg_receivewal code
Le lundi 30 août 2021, 09:32:05 CEST Bharath Rupireddy a écrit : > Hi, > > I see there's scope to do the following improvements to pg_receivewal.c: Thank you Bharath for this patch. > > 1) Fetch the server system identifier in the first RunIdentifySystem > call and use it to identify (via pg_receivewal's ReceiveXlogStream) any > unexpected changes that may happen in the server while pg_receivewal > is connected to it. This can be helpful in scenarios when > pg_receivewal tries to reconnect to the server (see the code around > pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has > happened in the server that changed its system identifier. Once > the pg_receivewal establishes the connection to the server again, then the > ReceiveXlogStream has a code chunk to compare the system identifier > that we received in the initial connection. I'm not sure what kind of problem could be happening here: if you were somehow routed to another server ? Or if we "switched" the cluster listening on that port ? > 2) Move the RunIdentifySystem to identify timeline id and start LSN > from the server only if the pg_receivewal failed to get them from > FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is > avoided. That makes sense, even if we add another IDENTIFY_SYSTEM to check against the one set in the first place. > 3) Place the "replication connection shouldn't have any database name > associated" error check right after RunIdentifySystem so that we can > avoid fetching wal segment size with RetrieveWalSegSize if at all we > were to fail with that error. This change is similar to what > pg_recvlogical.c does. Makes sense. > 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters > main loop to get the wal from the server. This avoids an unnecessary > query for pg_receivewal with "--create-slot" or "--drop-slot".
> 5) Having an assertion after the pg_receivewal has done a good amount of > work to find start timeline and LSN might be helpful: > Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr); > > Attaching a patch that does take care of the above improvements. Thoughts? Overall I think it is good. However, you have some formatting issues; here it mixes tabs and spaces: + /* +* No valid data can be found in the existing output directory. +* Get start LSN position and current timeline ID from the server. + */ And here it is not formatted properly: +static char *server_sysid = NULL; > > Regards, > Bharath Rupireddy. -- Ronan Dunklau
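The reconnection hazard raised in point 1 above comes down to caching the system identifier reported by the first IDENTIFY_SYSTEM and comparing it on every reconnect. A self-contained sketch of that check, with a hypothetical helper name rather than pg_receivewal's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Sketch: remember the system identifier returned by the first
 * IDENTIFY_SYSTEM, then refuse to resume streaming after a reconnect
 * if the server now reports a different one (e.g. the cluster behind
 * the address was replaced or re-initialized in the meantime).
 */
static bool
sysid_matches(const char *saved_sysid, const char *current_sysid)
{
	if (saved_sysid == NULL)
		return true;			/* first connection: nothing to compare yet */
	return strcmp(saved_sysid, current_sysid) == 0;
}
```

Streaming WAL from a cluster with a different system identifier into the same archive directory would silently mix unrelated timelines, which is why failing hard on a mismatch is the safe choice.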
Re: pg_receivewal starting position
on the same thing concurrently in the future to avoid duplicate efforts. I'll rebase and send the updated version for patches 0002 and 0003 of my original proposal once we reach an agreement over the behaviour / options of pg_receivewal. Also considering the number of different fields to be filled by the GetSlotInformation function, my local branch groups them into a dedicated struct which is more convenient than having X possibly null arguments. -- Ronan Dunklau
Re: pg_receivewal starting position
Le jeudi 2 septembre 2021, 08:42:22 CEST Bharath Rupireddy a écrit : > On Thu, Sep 2, 2021 at 11:15 AM Kyotaro Horiguchi > > wrote: > > At Wed, 1 Sep 2021 10:30:05 +0530, Bharath Rupireddy > > wrote in> > > > On Mon, Aug 30, 2021 at 3:26 PM Ronan Dunklau wrote: > > > > > 7) How about we let pg_receivewal use READ_REPLICATION_SLOT as an > > > > > option? > > > > > > > > From my point of view, I already expected it to use something like > > > > that when using a replication slot. Maybe an option to turn it off > > > > could be useful ?> > > > > IMO, pg_receivewal should have a way to turn off/on using > > > READ_REPLICATION_SLOT. Imagine if the postgres server doesn't support > > > READ_REPLICATION_SLOT (a lower version) but for some reasons the > > > pg_receivewal(running separately) is upgraded to the version that uses > > > READ_REPLICATION_SLOT, knowing that the server doesn't support > > > READ_REPLICATION_SLOT why should user let pg_receivewal run an > > > unnecessary code? > > > > If I read the patch correctly the situation above is warned by the > > following message then continue to the next step giving up to identify > > start position from slot data. > > > > > "server does not suport fetching the slot's position, resuming from the > > > current server position instead"> > > (The message looks a bit too long, though..) > > > > However, if the operator doesn't know the server is old, pg_receivewal > > starts streaming from unexpected position, which is a kind of > > disaster. So I'm inclined to agree to Bharath, but rather I imagine of > > an option to explicitly specify how to determine the start position. > > > > --start-source=[server,wal,slot] specify starting-LSN source, default is > > > > trying all of them in the order of wal, slot, server. > > > > I don't think the option doesn't need to accept multiple values at once. 
> > If --start-source = 'wal' fails, then pg_receivewal should show up an > error saying "cannot find start position from <> > directory, try with "server" or "slot" for --start-source". We might > end up having similar errors for other options as well. Isn't this going > to create unnecessary complexity? > > The existing way the pg_receivewal fetches start pos i.e. first from > wal directory and then from server start position, isn't known to the > user at all, no verbose message or anything specified in the docs. Why > do we need to expose this with the --start-source option? IMO, we can > keep it that way and we can just have a way to turn off the new > behaviour that we are proposing here, i.e. fetching the start position > from the slot's restart_lsn. Then it should probably be documented. We write in the docs that it is strongly recommended to use a replication slot, but do not mention how we resume from what has already been processed. If someone really cares about having control over how the start position is defined instead of relying on the auto detection, it would be wiser to add a --startpos parameter similar to the --endpos one, which would override everything else, instead of different flags for different behaviours. Regards, -- Ronan Dunklau
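If such a --startpos option existed (it is only being proposed here, so this is purely hypothetical), its interaction with the autodetected position would be a trivial override, mirroring how --endpos caps streaming:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr	((XLogRecPtr) 0)

/*
 * Hypothetical sketch: an explicit user-provided start position
 * overrides every autodetection source (archives, slot, server).
 */
static XLogRecPtr
effective_startpos(XLogRecPtr user_startpos, XLogRecPtr autodetected)
{
	return (user_startpos != InvalidXLogRecPtr) ? user_startpos : autodetected;
}
```

A single override point keeps the option surface small, which is the argument being made against per-source --start-source flags.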
Re: Proposal: More structured logging
Le mercredi 1 septembre 2021, 09:36:50 CEST Peter Eisentraut a écrit : > On 13.08.21 15:23, Ronan Dunklau wrote: > > The logging system already captures a lot of information in the ErrorData. > > But at present there is no way for a log message authors to include more > > metadata about the log outside of the log message itself. For example, > > including the extension name which can be useful for filtering / > > dispatching log messages. This can be done in the log message itself, but > > that requires specialized parsing in the log output. > > > > Even though among the available loggers in core, only the csv logger would > > be able to make sense of it, I feel it would be beneficial to add a > > tagging system to logs, by adding different (tag, value) combinations to > > a log entry, with an API similar to what we do for other ErrorData > > fields: > > > > ereport(NOTICE, > > > > (errmsg("My log message")), > > (errtag("EMITTER", "MYEXTENSION")), > > (errtag("MSG-ID", "%d", error_message_id)) > > > > ); > > What are some more examples of where this could be used? The extension > name could be added more or less automatically to ereport() calls. I > don't know what the MSG-ID is supposed to be. Are there other use cases? Adding it automatically would be nice, but how would you go about that ? In-core it would open up the possibility to split log messages into different fields, for example the different statistics reported in the logs by VACUUM / ANALYZE VERBOSE and make it easier to consume the output without having to parse the message. Parsing the message also means that any tool parsing it needs to either be aware of the locale, or to force the user to use a specific one. Out-of-core, the same things could be done for extensions like pg_audit which logs structured information into the message itself, leaving the message format to be parsed at a later time. -- Ronan Dunklau
Re: pg_receivewal starting position
Le mardi 31 août 2021, 13:17:22 CEST Bharath Rupireddy a écrit : > On Mon, Aug 30, 2021 at 3:26 PM Ronan Dunklau wrote: > > Thank you for this review ! Please see the other side of the thread where > > I > > answered Michael Paquier with a new patchset, which includes some of your > > proposed modifications. > > Thanks for the updated patches. Here are some comments on v3-0001 > patch. I will continue to review 0002 and 0003. Thank you ! I will send a new version shortly, once I address your remarks concerning patch 0002 (and hopefully 0003 :-) ) > > 1) Missing full stops "." at the end. > + logical > + when following the current timeline history > + position, when following the current timeline history > Good catch, I will take care of it for the next version. > 2) Can we have the "type" column as "slot_type" just to keep in sync > with the pg_replication_slots view? You're right, it makes more sense like this. > > 3) Can we mention the link to pg_replication_slots view in the columns > - "type", "restart_lsn", "confirmed_flush_lsn"? > Something like: the "slot_type"/"restart_lsn"/"confirmed_flush_lsn" is > the same as in the <link linkend="view-pg-replication-slots"><structname>pg_replication_slots</structname></link> view. Same as above, thanks. > > 4) Can we use "read_replication_slot" instead of > "identify_replication_slot", just to be in sync with the actual > command? That must have been a leftover from an earlier version of the patch, I will fix it also. > > 5) Are you actually copying the slot contents into the slot_contents > variable here? Isn't it just taking a pointer to the shared memory?
> + /* Copy slot contents while holding spinlock */ > + SpinLockAcquire(&slot->mutex); > + slot_contents = *slot; > + SpinLockRelease(&slot->mutex); > + LWLockRelease(ReplicationSlotControlLock); > > You could just do: > + Oid dbid; > + XLogRecPtr restart_lsn; > + XLogRecPtr confirmed_flush; > > + /* Copy the required slot contents */ > + SpinLockAcquire(&slot->mutex); > + dbid = slot.data.database; > + restart_lsn = slot.data.restart_lsn; > + confirmed_flush = slot.data.confirmed_flush; > + SpinLockRelease(&slot->mutex); > + LWLockRelease(ReplicationSlotControlLock); It's probably simpler that way. > > 6) It looks like you are not sending anything as a response to the > READ_REPLICATION_SLOT command, if the slot specified doesn't exist. > You are just calling end_tup_output which just calls rShutdown (points > to donothingCleanup of printsimpleDR) > if (has_value) > do_tup_output(tstate, values, nulls); > end_tup_output(tstate); > > Can you test the use case where the pg_receivewal asks > READ_REPLICATION_SLOT with a non-existent replication slot and see > with your v3 patch how it behaves? > > Why don't you remove has_value flag and do this in ReadReplicationSlot: > Datum values[5]; > bool nulls[5]; > MemSet(values, 0, sizeof(values)); > MemSet(nulls, 0, sizeof(nulls)); > > + dest = CreateDestReceiver(DestRemoteSimple); > + tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual); > + do_tup_output(tstate, values, nulls); > + end_tup_output(tstate); As for the API, I think it's cleaner to just send an empty result set instead of a null tuple in that case but I won't fight over it if there is consensus on having an all-nulls tuple value instead. There is indeed a bug, but not here, in the second patch: I still test the slot type even when we didn't fetch anything. So I will add a test for that too.
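The "copy only the fields you need while the lock is held" pattern from point 5 above can be shown with a self-contained stand-in; a pthread mutex replaces PostgreSQL's spinlock and the struct is a toy, not ReplicationSlot, so this is illustrative only:

```c
#include <assert.h>
#include <pthread.h>

/* Toy stand-in for the shared-memory slot structure. */
typedef struct
{
	pthread_mutex_t mutex;
	unsigned long	restart_lsn;
	unsigned long	confirmed_flush;
} ToySlot;

/*
 * Copy the required fields under the lock, then release it.  The
 * caller ends up with plain values and keeps no pointer into the
 * shared structure, so nothing can change under its feet afterwards.
 */
static void
read_slot_fields(ToySlot *slot,
				 unsigned long *restart_lsn,
				 unsigned long *confirmed_flush)
{
	pthread_mutex_lock(&slot->mutex);
	*restart_lsn = slot->restart_lsn;	/* copy values, not pointers */
	*confirmed_flush = slot->confirmed_flush;
	pthread_mutex_unlock(&slot->mutex);	/* lock held only for the copies */
}
```

Copying individual fields rather than the whole struct also keeps the critical section as short as possible, which matters for a spinlock.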
> > 8) I still have the question like, how can a client (pg_receivewal for > instance) know that it is the current owner/user of the slot it is > requesting the info? As I said upthread, why don't we send "active" > and "active_pid" fields of the pg_replication_slots view? > Also, it would be good to send the "wal_status" field so that the > client can check if the "wal_status" is not "lost"? As for pg_receivewal, it can only check that it's not active at that time, since we only acquire the replication slot once we know the start_lsn. For the basebackup case it's the same thing as we only want to check if it exists. As such, I didn't add them as I didn't see the need, but if it can be useful why not ? I will do that in the next version. > > 9) There are 2 new lines at the end of ReadReplicationSlot. We give > only one new line after each function de
Re: pg_receivewal starting position
Le samedi 28 août 2021, 14:10:25 CEST Bharath Rupireddy a écrit : > On Thu, Aug 26, 2021 at 5:45 PM Ronan Dunklau wrote: > > order to fail early if the replication slot doesn't exist, so please find > > attached v2 for that. > > Thanks for the patches. Here are some comments: > Thank you for this review ! Please see the other side of the thread where I answered Michael Paquier with a new patchset, which includes some of your proposed modifications. > 1) While the intent of these patches looks good, I have following > concern with new replication command READ_REPLICATION_SLOT: what if > the pg_receivewal exits (because user issued a SIGINT or for some > reason) after flushing the received WAL to disk, before it sends > sendFeedback to postgres server's walsender so that it doesn't get a > chance to update the restart_lsn in the replication slot via > PhysicalConfirmReceivedLocation. If the pg_receivewal is started > again, isn't it going to get the previous restart_lsn and receive the > last chunk of flushed WAL again? I've kept the existing directory as the source of truth if we have any WAL there already. If we don't, we fallback to the restart_lsn on the replication slot. So in the event that we start it again from somewhere else where we don't have access to those WALs anymore, we could be receiving it again, which in my opinion is better than losing everything in between in that case. > > 2) What is the significance of READ_REPLICATION_SLOT for logical > replication slots? I read above that somebody suggested to restrict > the walsender to handle READ_REPLICATION_SLOT for physical replication > slots so that the callers will see a command failure. But I tend to > think that it is clean to have this command common for both physical > and logical replication slots and the callers can have an Assert(type > == 'physical'). I've updated the patch to make it easy for the caller to check the slot's type and added a verification for those cases. 
In general, I tried to implement the meaning of the different fields exactly as it's done in the pg_replication_slots view. > > 3) Isn't it useful to send active, active_pid info of the replication > slot via READ_REPLICATION_SLOT? pg_receivewal can use Assert(active == > true && active_pid == getpid()) as an assertion to ensure that it is > the sole owner of the replication slot? Also, is it good to send > wal_status info Typically we read the slot before attaching to it, since what we want to do is check if it exists. It may be worthwhile to check that it's not already used by another backend though. > > 4) I think below messages should start with lower case letter and also > there are some typos: > + pg_log_warning("Could not fetch the replication_slot \"%s\" information " > + pg_log_warning("Server does not suport fetching the slot's position, " > something like: > + pg_log_warning("could not fetch replication slot \"%s\" information, " +"resuming from current server position instead", replication_slot); > + pg_log_warning("server does not support fetching replication slot > information, " +"resuming from current server position instead"); > I've rephrased it a bit in v3, let me know if that's what you had in mind. > 5) How about emitting the above messages in case of "verbose"? I think it is useful to warn the user even if not in the verbose case, but if the consensus is to move it to verbose only output I can change it. > > 6) How about an assertion like below? > + if (stream.startpos == InvalidXLogRecPtr) > + { > + stream.startpos = serverpos; > + stream.timeline = servertli; > + } > + > + Assert(stream.startpos != InvalidXLogRecPtr); Good idea. > > 7) How about we let pg_receivewal use READ_REPLICATION_SLOT as an option? From my point of view, I already expected it to use something like that when using a replication slot. Maybe an option to turn it off could be useful ?
> > 8) Just an idea, how about we store pg_receivewal's lastFlushPosition > in a file before pg_receivewal exits and compare it with the > restart_lsn that it received from the replication slot, if > lastFlushPosition == received_restart_lsn well and good, if not, then > something would have happened and we always start at the > lastFlushPosition ? The patch idea originally came from the fact that some utility use pg_receivewal to fetch WALs, and move them elsewhere. In that sense I don't really see what value this brings compared to the existing (and unmodified) way of computing the restart position from the already stored files ? Best regards, -- Ronan Dunklau
Re: pg_receivewal starting position
Le vendredi 27 août 2021, 05:44:32 CEST Michael Paquier a écrit : > On Thu, Aug 26, 2021 at 02:14:27PM +0200, Ronan Dunklau wrote: > > Following the discussion at [1], I refactored the implementation into > > streamutil and added a third patch making use of it in pg_basebackup > > itself in order to fail early if the replication slot doesn't exist, so > > please find attached v2 for that. > > Thanks for the split. That helps a lot. > Thank you very much for the review, please find attached an updated patchset. I've also taken into account some remarks made by Bharath Rupireddy. > + > + > /* > * Run IDENTIFY_SYSTEM through a given connection and give back to caller > > The patch series has some noise diffs here and there, you may want to > clean up that for clarity. Ok, sorry about that. > > + if (slot == NULL || !slot->in_use) > + { > + LWLockRelease(ReplicationSlotControlLock); > + > + ereport(ERROR, > + (errcode(ERRCODE_UNDEFINED_OBJECT), > LWLocks are released on ERROR, so no need for LWLockRelease() here. > Following your suggestion of not erroring out on a nonexistent slot, this point is no longer relevant, but thanks for pointing this out anyway. > + > + > + Read information about the named replication slot. This is > useful to determine which WAL location we should be asking the server > to start streaming at. > > A nit. You may want to be more careful with the indentation of the > documentation. Things are usually limited in width for readability. > More markups would be nice for the field names used in the > descriptions. Ok.
> + if (PQntuples(res) == 0) > + { > + pg_log_error("replication slot %s does not exist", > slot_name); + PQclear(0); > + return false; > So, the backend and ReadReplicationSlot() report an ERROR if a slot > does not exist but pg_basebackup's GetSlotInformation() does the same > if there are no tuples returned. That's inconsistent. Wouldn't it be > more instinctive to return a NULL tuple instead if the slot does not > exist to be able to check after real ERRORs in frontends using this > interface? The attached patch returns no tuple at all when the replication slot doesn't exist. I'm not sure if that's what you meant by returning a NULL tuple ? > A slot in use exists, so the error is a bit confusing here > anyway, no? From my understanding, a slot *not* in use doesn't exist anymore, as such I don't really understand this point. Could you clarify ? > > +* XXX: should we allow the caller to specify which target timeline it > wants +* ? > +*/ > What are you thinking about here? I was thinking that maybe instead of walking back the timeline history from where we currently are on the server, we could allow an additional argument for the client to specify which timeline it wants. But I guess a replication slot can not be present for a past, divergent timeline ? I have removed that suggestion. > > -# restarts of pg_receivewal will see this segment as full.. > +# restarts of pg_receivewal will see this segment as full../ > Typo. Ok. > > + TupleDescInitBuiltinEntry(tupdesc, (AttrNumber) 4, > "restart_lsn_timeline", + INT4OID, -1, 0); > + TupleDescInitBuiltinEntry(tupdesc, (AttrNumber) 5, > "confirmed_flush_lsn_timeline", + INT4OID, -1, > 0); > I would call these restart_tli and confirmed_flush_tli., without the > "lsn" part. Ok. > > The patch for READ_REPLICATION_SLOT could have some tests using a > connection that has replication=1 in some TAP tests. We do that in > 001_stream_rep.pl with SHOW, as one example. Ok. 
I added the physical part to 001_stream_rep.pl, using the protocol interface directly for creating / dropping the slot, and some tests for logical replication slots to 006_logical_decoding.pl. > > - 'slot0' > + 'slot0', '-p', > + "$port" > Something we are missing here? The thing we're missing here is a wrapper for command_fails_like. I've added this to PostgresNode.pm. Best regards, -- Ronan Dunklau>From 9fa01789f663975b
Re: pg_receivewal starting position
Le jeudi 29 juillet 2021, 11:09:40 CEST Ronan Dunklau a écrit : > Patch 0001 adds the new READ_REPLICATION_SLOT command. > It returns for a given slot the type, restart_lsn, flush_lsn, > restart_lsn_timeline and flush_lsn_timeline. > The timelines are determined by reading the current timeline history, and > finding the timeline where we may find the record. I didn't find an explicit > test for e.g. IDENTIFY_SYSTEM, so I didn't write one either for this new > command, but it is tested indirectly in patch 0002. > > Patch 0002 makes pg_receivewal use that command if we use a replication slot > and the command is available, and use the restart_lsn and > restart_lsn_timeline as a starting point. It also adds a small test to > check that we start back from the previous restart_lsn instead of the > current flush position when our destination directory does not contain any > WAL file. > > I also noticed we don't test following a timeline switch. It would probably > be good to add that, both for the case where we determine the previous > timeline from the archived segments and when it comes from the new command. > What do you think ? Following the discussion at [1], I refactored the implementation into streamutil and added a third patch making use of it in pg_basebackup itself in order to fail early if the replication slot doesn't exist, so please find attached v2 for that. Best regards, [1]: https://www.postgresql.org/message-id/flat/CAD21AoDYmv0yJMQnWtCx_kZGwVZnkQSTQ1re2JNSgM0k37afYQ%40mail.gmail.com -- Ronan Dunklau>From fff8786049326864d3ef8fe4539e1829f933f32f Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Wed, 28 Jul 2021 16:34:54 +0200 Subject: [PATCH v2 1/3] Add READ_REPLICATION_SLOT command. This commit introduces a new READ_REPLICATION_SLOT command. This command is used to read information about a replication slot when using a physical replication connection.
In this first version it returns the slot type, restart_lsn, flush_lsn and the timeline of the restart_lsn and flush_lsn, which are obtained by following the current timeline history. --- doc/src/sgml/protocol.sgml | 62 +++ src/backend/replication/repl_gram.y| 18 - src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c| 106 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 10 +++ 6 files changed, 197 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index a232546b1d..6207171426 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2052,6 +2052,68 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read information about the named replication slot. This is useful to determine which WAL location we should be asking the server to start streaming at. + + + In response to this command, the server will return a one-row result set, containing the following fields: + + +type (text) + + + The replication slot's type, either physical or logical + + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +confirmed_flush_lsn (text) + + + The replication slot's confirmed_flush LSN. + + + + + +restart_lsn_timeline (int4) + + + The timeline ID for the restart_lsn position, when following the current timeline + history + + + + + +confirmed_flush_lsn_timeline (int4) + + + The timeline ID for the confirmed_flush_lsn position, when following the current timeline + history + + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ] START_REPLICATION diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index e1e8ec29cc..7298f44008 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void); /* Keyword tokens. 
*/ %token K_BASE_BACKUP %token K_IDENTIFY_SYSTEM +%token K_READ_REPLICATION_SLOT %token K_SHOW %token K_START_REPLICATION %token K_CREATE_REPLICATION_SLOT @@ -94,7 +95,7 @@ static SQLCmd *make_sqlcmd(void); %type command %type base_backup start_replication start_logical_replication create_replication_slot drop_replica
Re: Proposal: More structured logging
Le vendredi 20 août 2021, 11:31:21 CEST Ronan Dunklau a écrit : > Le jeudi 19 août 2021, 15:04:30 CEST Alvaro Herrera a écrit : > > On 2021-Aug-13, Ronan Dunklau wrote: > > > ereport(NOTICE, > > > > > > (errmsg("My log message")), > > > (errtag("EMITTER", "MYEXTENSION")), > > > (errtag("MSG-ID", "%d", error_message_id)) > > > > > > ); > > > > Interesting idea. I agree this would be useful. > > > > > Please find attached a very small POC patch to better demonstrate what I > > > propose. > > > > Seems like a good start. I think a further step towards a committable > > patch would include a test module that uses the new tagging > > functionality. > > Please find attached the original patch + a new one adding a test module. > The test module exposes a function for generating logs with tags, and a log > hook which format the tags in the DETAIL field for easy regression testing. > > Next step would be to emit those tags in the CSV logs. I'm not sure what > representation they should have though: a nested csv in it's own column ? A > key => value pairs list similar to hstore ? A json object ? I opted for a JSON representation in the CSV logs, please find attached v3 which is the same thing as v2 with another patch for CSV log output. > > Also we should probably make this available to the client too, but once > again the format of the tag field needs to be defined. Any opinion ? 
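For the CSV output, the (tag, value) pairs can be rendered as a small JSON object. A simplified sketch reusing the patch's ErrorTag field names, with string escaping deliberately omitted, so this is illustrative only and not the v3 patch's actual serializer:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Field names match the ErrorTag struct introduced by the patch. */
typedef struct
{
	const char *tagname;
	const char *tagvalue;
} ErrorTag;

/*
 * Render tags as {"NAME": "value", ...}.  Real code would escape
 * quotes and backslashes inside values; omitted here for brevity.
 */
static void
tags_to_json(char *buf, size_t buflen, const ErrorTag *tags, int ntags)
{
	size_t		off = 0;

	off += snprintf(buf + off, buflen - off, "{");
	for (int i = 0; i < ntags; i++)
		off += snprintf(buf + off, buflen - off, "%s\"%s\": \"%s\"",
						i > 0 ? ", " : "", tags[i].tagname, tags[i].tagvalue);
	snprintf(buf + off, buflen - off, "}");
}
```

A JSON object has the advantage over a nested CSV or hstore-like syntax that virtually every log-consuming tool can already parse it.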
-- Ronan Dunklau>From e5af5d1a07e65926eee90e0d18443a673d1fcba8 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v3 1/3] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index a3e1c59a82..5b9b1b8a72 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -465,6 +465,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -516,6 +517,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -621,7 +623,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1192,6 +1205,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f53607e12e..1c490d1b11 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...) pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error co
Re: Proposal: More structured logging
Le jeudi 19 août 2021, 16:50:10 CEST Alvaro Herrera a écrit : > On 2021-Aug-19, Magnus Hagander wrote: > > Another thing I've noticed in more and more other products is to be > > able to log as json, which is then later thrown into a central logging > > system somewhere. Basically like csv, but with the schema defined in > > each row. Yes, a lot more overhead, but some systems really do like to > > consume json So when we're on the topic of more structured > > logging... > > Yeah, I was thinking in json logging too -- specifically thinking about > Fluentbit and similar tools. Michael, your jsonlog module already fulfills this need. Is it something that should be merged into our tree ? -- Ronan Dunklau
Re: Proposal: More structured logging
Le jeudi 19 août 2021, 15:04:30 CEST Alvaro Herrera a écrit : > On 2021-Aug-13, Ronan Dunklau wrote: > > ereport(NOTICE, > > > > (errmsg("My log message")), > > (errtag("EMITTER", "MYEXTENSION")), > > (errtag("MSG-ID", "%d", error_message_id)) > > > > ); > > Interesting idea. I agree this would be useful. > > > Please find attached a very small POC patch to better demonstrate what I > > propose. > > Seems like a good start. I think a further step towards a committable > patch would include a test module that uses the new tagging > functionality. Please find attached the original patch + a new one adding a test module. The test module exposes a function for generating logs with tags, and a log hook which formats the tags in the DETAIL field for easy regression testing. The next step would be to emit those tags in the CSV logs. I'm not sure what representation they should have though: a nested csv in its own column ? A key => value pairs list similar to hstore ? A json object ? Also we should probably make this available to the client too, but once again the format of the tag field needs to be defined. Any opinion ? 
-- Ronan Dunklau>From e5af5d1a07e65926eee90e0d18443a673d1fcba8 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v2 1/3] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index a3e1c59a82..5b9b1b8a72 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -465,6 +465,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -516,6 +517,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -621,7 +623,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1192,6 +1205,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f53607e12e..1c490d1b11 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...) pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error context callback functions, not * within an ereport() invocation. The callback function can be in a different @@ -395,11 +398,18 @@ typedef struct ErrorData int internalpos; /* cursor index into internalquery */ char *internalquery; /* text of internally-generated query */ int saved_errno; /* errno at entry */ + List *tags; /* List of
Proposal: More structured logging
Hello, The logging system already captures a lot of information in the ErrorData. But at present there is no way for log message authors to include more metadata about the log outside of the log message itself. For example, including the extension name, which can be useful for filtering / dispatching log messages. This can be done in the log message itself, but that requires specialized parsing in the log output. Even though among the available loggers in core, only the csv logger would be able to make sense of it, I feel it would be beneficial to add a tagging system to logs, by adding different (tag, value) combinations to a log entry, with an API similar to what we do for other ErrorData fields: ereport(NOTICE, (errmsg("My log message")), (errtag("EMITTER", "MYEXTENSION")), (errtag("MSG-ID", "%d", error_message_id)) ); Please find attached a very small POC patch to better demonstrate what I propose. Third party logging hooks could then exploit those values to output them correctly. For example the json logger written by Michael Paquier here: https://github.com/michaelpq/pg_plugins/tree/master/jsonlog, or the seemingly-abandoned journald hook here: https://github.com/intgr/pg_journal could add this information in a structured way. I think the pgaudit extension (or something similar) could also make good use of such a feature instead of dumping a CSV entry in the log file. As for Postgres' own log messages, I'm sure we could find practical use cases for tagging messages like this. On a related note, the only structured logger we have in-core is the CSV logger. Many users nowadays end up feeding the logs to journald either by capturing the stderr output with systemd, or by having syslog implemented by journald itself. Would there be any interest in having native journald support as a logging_destination ? 
Best regards, -- Ronan Dunklau>From 270ffc5ed2fbe5f5076bddee14c5fb3555b87e4f Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v1 1/2] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index a3e1c59a82..5b9b1b8a72 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -465,6 +465,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -516,6 +517,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -621,7 +623,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1192,6 +1205,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f53607e12e..1c490d1b11 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_print
Re: Minimal logical decoding on standbys
Le lundi 2 août 2021, 17:31:46 CEST Drouvot, Bertrand a écrit : > > I think the behaviour when dropping a database on the primary should be > > documented, and proper procedures for handling it correctly should be > > suggested. > > > > Something along the lines of: > > > > "If a database is dropped on the primary server, the logical replication > > slot on the standby will be dropped as well. This means that you should > > ensure that the client usually connected to this slot has had the > > opportunity to stream the latest changes before the database is dropped." > > I am not sure we should highlight this as part of this patch. > > I mean, the same would currently occur on a non standby if you drop a > database that has a replication slot linked to it. The way I see it, the main difference is that an additional object is dropped on the standby, one that doesn't exist on the primary and that you don't necessarily have any knowledge of. As such, I thought it would be better to be explicit about it, to warn users of that possible case. Regards, -- Ronan Dunklau
Re: Minimal logical decoding on standbys
Le mardi 27 juillet 2021, 09:23:48 CEST Drouvot, Bertrand a écrit : > Thanks for the warning, rebase done and new v21 version attached. > > Bertrand Hello, I've taken a look at this patch, and it looks like you addressed every prior remark, including the race condition Andres was worried about. As for the basics: make check-world and make installcheck-world pass. I think the behaviour when dropping a database on the primary should be documented, and proper procedures for handling it correctly should be suggested. Something along the lines of: "If a database is dropped on the primary server, the logical replication slot on the standby will be dropped as well. This means that you should ensure that the client usually connected to this slot has had the opportunity to stream the latest changes before the database is dropped." As for the patches themselves, I only have two small comments to make. In patch 0002, in InvalidateConflictingLogicalReplicationSlots, I don't see the need to check for an InvalidOid since we already check SlotIsLogical: + /* We are only dealing with *logical* slot conflicts. */ + if (!SlotIsLogical(s)) + continue; + + /* not our database and we don't want all the database, skip */ + if ((s->data.database != InvalidOid && s->data.database != dboid) && TransactionIdIsValid(xid)) + continue; In patch 0004, small typo in the test file: +## + # Test standby promotion and logical decoding bheavior +# after the standby gets promoted. +## Thank you for working on this ! Regards, -- Ronan Dunklau
Re: pg_receivewal starting position
Le jeudi 29 juillet 2021, 07:32:37 CEST Kyotaro Horiguchi a écrit : > I didn't thought in details. But I forgot that ordinary SQL commands > have been prohibited in physical replication connection. So we need a > new replication command but it's not that a big deal. Thank you for your feedback ! > > > I don't see any reason not to make it work for logical replication > > connections / slots, but it wouldn't be that useful since we can query > > the database in that case. > > Ordinary SQL queries are usable on a logical replication slot so > I'm not sure how logical replication connection uses the command. > However, like you, I wouldn't bother restricting the command to > physical replication, but perhaps the new command should return the > slot type. > Ok done in the attached patch. > > I'm not sure it's worth adding complexity for such strictness. > START_REPLICATION safely fails if someone steals the slot meanwhile. > In the first place there's no means to protect a slot from others > while idle. One possible problem is the case where START_REPLICATION > successfully acquire the slot after the new command failed. But that > case doesn't seem worse than the case someone advances the slot while > absence. So I think READ_REPLICATION_SLOT is sufficient. > Ok, I implemented it like this. I tried to follow the pg_get_replication_slots approach with regards to how to prevent concurrent modification while reading the slot. > > From pg_receivewal point of view, this would amount to: > > - check if we currently have wal in the target directory. > > > >- if we do, proceed as currently done, by computing the start lsn and > > > > timeline from the last archived wal > > > > - if we don't, and we have a slot, run ACQUIRE_REPLICATION_SLOT. Use the > > > > restart_lsn as the start lsn if there is one, and don't provide a timeline > > > > - if we still don't have a start_lsn, fallback to using the current > > server > > > > wal position as is done. > > That's pretty much it. 
Great. > > What do you think ? Which information should we provide about the slot ? > We need the timeline id to start with when using restart_lsn. The > current timeline can be used in most cases but there's a case where > the LSN is historical. Ok, see below. > > pg_receivewal doesn't send a replication status report when a segment > is finished. So after pg_receivewal stops just after a segment is > finished, the slot stays at the beginning of the last segment. Thus > next time it will start from there, creating a duplicate segment. I'm not sure I see where the problem is here. If we don't keep the segments in pg_receivewal's target directory, then it would be the responsibility of whoever moved them to make sure we don't have duplicates, or to handle them gracefully. Even if we were forcing feedback after a segment is finished, there could still be a problem if the feedback never made it to the server but the segment was there. It might be interesting to send feedback anyway. Please find attached two patches implementing what we've been discussing. Patch 0001 adds the new READ_REPLICATION_SLOT command. It returns for a given slot the type, restart_lsn, flush_lsn, restart_lsn_timeline and flush_lsn_timeline. The timelines are determined by reading the current timeline history, and finding the timeline where we may find the record. I didn't find an explicit test for e.g. IDENTIFY_SYSTEM, so I didn't write one for this new command either; it is tested indirectly in patch 0002. Patch 0002 makes pg_receivewal use that command if we use a replication slot and the command is available, and use the restart_lsn and restart_lsn_timeline as a starting point. It also adds a small test to check that we start back from the previous restart_lsn instead of the current flush position when our destination directory does not contain any WAL file. I also noticed we don't test following a timeline switch. 
It would probably be good to add that, both for the case where we determine the previous timeline from the archived segments and when it comes from the new command. What do you think ? Regards, -- Ronan Dunklau>From 37d9545c05b9e36aafac751f9dc549e75798413c Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Wed, 28 Jul 2021 16:34:54 +0200 Subject: [PATCH v1 1/2] Add READ_REPLICATION_SLOT command. This commit introduces a new READ_REPLICATION_SLOT command. This command is used to read information about a replication slot when using a physical replication connection. In this first version it returns the slot type, restart_lsn, flush_lsn and the timeline of the restart_lsn and flush_lsn, which are obtained by following the current timeline history. --- doc/src/sgml/protocol.sg