Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On Friday, March 8, 2024 at 14:36:48 CET, Tomas Vondra wrote:
> > My guess would be 8af25652489, as it's the only storage-related commit.
> >
> > I'm currently running tests to verify this.
>
> Yup, the breakage starts with this commit. I haven't looked into the
> root cause, or whether the commit maybe just made some pre-existing
> issue easier to hit. Also, I haven't followed the discussion on the
> pgsql-bugs thread [1], maybe there are some interesting findings.

If that happens only on HEAD and not on 16, and doesn't involve WAL replay,
then it's not the same bug.

--
Ronan Dunklau
Re: Provide a pg_truncate_freespacemap function
On Wednesday, March 6, 2024 at 20:28:44 CET, Stephen Frost wrote:
> I agree that this would generally be a useful thing to have.

Thanks!

> > Does that seem correct?
>
> Definitely needs to have a 'REVOKE ALL ON FUNCTION' at the end of the
> upgrade script, similar to what you'll find at the bottom of
> pg_visibility--1.1.sql in the tree today, otherwise anyone could run it.
>
> Beyond that, I'd suggest a function-level comment above the definition
> of the function itself (which is where we tend to put those - not at the
> point where we declare the function).

Thank you for the review. Here is an updated patch for both of those.

Best regards,

--
Ronan

>From 812061b47443ce1052f65ba2867223f9db2cfa18 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Fri, 1 Mar 2024 08:57:49 +0100
Subject: [PATCH] Provide a pg_truncate_freespacemap function.

Similar to the pg_truncate_visibilitymap function, this one allows
avoiding downtime to fix an FSM in case a corruption event happens.
---
 contrib/pg_freespacemap/Makefile              |  5 +-
 .../pg_freespacemap--1.2--1.3.sql             | 12
 contrib/pg_freespacemap/pg_freespacemap.c     | 64 +++
 .../pg_freespacemap/pg_freespacemap.control   |  2 +-
 4 files changed, 80 insertions(+), 3 deletions(-)
 create mode 100644 contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql

diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile
index b48e4b255bc..0f4b52f5812 100644
--- a/contrib/pg_freespacemap/Makefile
+++ b/contrib/pg_freespacemap/Makefile
@@ -6,8 +6,9 @@ OBJS = \
 	pg_freespacemap.o

 EXTENSION = pg_freespacemap
-DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \
-	pg_freespacemap--1.0--1.1.sql
+DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.2--1.3.sql \
+	pg_freespacemap--1.1--1.2.sql \
+	pg_freespacemap--1.0--1.1.sql

 PGFILEDESC = "pg_freespacemap - monitoring of free space map"

 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
new file mode 100644
index 000..cc94eccf2f6
--- /dev/null
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
@@ -0,0 +1,12 @@
+/* contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pg_truncate_freespace_map(regclass)
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_truncate_freespace_map'
+LANGUAGE C STRICT
+PARALLEL UNSAFE;
+
+REVOKE ALL ON FUNCTION pg_truncate_freespace_map(regclass) FROM PUBLIC;
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index b82cab2d97e..fa492c18534 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -9,8 +9,12 @@
 #include "postgres.h"

 #include "access/relation.h"
+#include "access/xloginsert.h"
+#include "catalog/storage_xlog.h"
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"

 PG_MODULE_MAGIC;

@@ -20,6 +24,8 @@ PG_MODULE_MAGIC;
  */
 PG_FUNCTION_INFO_V1(pg_freespace);

+PG_FUNCTION_INFO_V1(pg_truncate_freespace_map);
+
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
 {
@@ -40,3 +46,61 @@ pg_freespace(PG_FUNCTION_ARGS)
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+
+/*
+ * Remove the free space map for a relation.
+ *
+ * This is useful in case of corruption of the FSM, as the
+ * only other options are either triggering a VACUUM FULL or
+ * shutting down the server and removing the FSM on the filesystem.
+ */
+Datum
+pg_truncate_freespace_map(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	ForkNumber	fork;
+	BlockNumber	block;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	/* Only some relkinds have a freespace map */
+	if (!RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" is of wrong relation kind",
+						RelationGetRelationName(rel)),
+				 errdetail_relkind_not_supported(rel->rd_rel->relkind)));
+
+	/* Forcibly reset cached file size */
+	RelationGetSmgr(rel)->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
+
+	/* Just pretend we're going to wipe out the whole rel */
+	block = FreeSpaceMapPrepareTruncateRel(rel, 0);
+
+	if (BlockNumberIsValid(block))
+	{
+		fork = FSM_FORKNUM;
+		smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block);
+	}
+
+	if (RelationNeedsWAL(rel))
+	{
+		xl_smgr_truncate	xlrec;
+
+		xlrec.blkno = 0;
+		xlrec.rlocator = rel->rd_locator;
+		xlrec.flags = SMGR_TRUNCATE
Provide a pg_truncate_freespacemap function
Hello,

As we are currently experiencing an FSM corruption issue [1], we need to
rebuild the FSM when we detect it.

I noticed we have something to truncate a visibility map, but nothing for
the free space map, so I propose the attached (liberally copied from the VM
counterpart) to allow truncating an FSM without incurring downtime, as
currently our only options are to either VACUUM FULL the table or stop the
cluster and remove the FSM manually.

Does that seem correct?

[1] https://www.postgresql.org/message-id/flat/1925490.taCxCBeP46%40aivenlaptop#7ace95c95cab17b6d92607e5362984ac

--
Ronan Dunklau

>From b3e28e4542311094ee7177b39cc9cdf3672d1b55 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Fri, 1 Mar 2024 08:57:49 +0100
Subject: [PATCH] Provide a pg_truncate_freespacemap function.

Similar to the pg_truncate_visibilitymap function, this one allows
avoiding downtime to fix an FSM in case a corruption event happens.
---
 contrib/pg_freespacemap/Makefile              |  5 +-
 .../pg_freespacemap--1.2--1.3.sql             | 13 +
 contrib/pg_freespacemap/pg_freespacemap.c     | 58 +++
 .../pg_freespacemap/pg_freespacemap.control   |  2 +-
 4 files changed, 75 insertions(+), 3 deletions(-)
 create mode 100644 contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql

diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile
index b48e4b255bc..0f4b52f5812 100644
--- a/contrib/pg_freespacemap/Makefile
+++ b/contrib/pg_freespacemap/Makefile
@@ -6,8 +6,9 @@ OBJS = \
 	pg_freespacemap.o

 EXTENSION = pg_freespacemap
-DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \
-	pg_freespacemap--1.0--1.1.sql
+DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.2--1.3.sql \
+	pg_freespacemap--1.1--1.2.sql \
+	pg_freespacemap--1.0--1.1.sql

 PGFILEDESC = "pg_freespacemap - monitoring of free space map"

 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
new file mode 100644
index 000..1f25877a175
--- /dev/null
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
@@ -0,0 +1,13 @@
+/* contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pg_truncate_freespace_map(regclass)
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_truncate_freespace_map'
+LANGUAGE C STRICT
+PARALLEL UNSAFE;
+
+
+
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index b82cab2d97e..17f6a631ad8 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -9,8 +9,12 @@
 #include "postgres.h"

 #include "access/relation.h"
+#include "access/xloginsert.h"
+#include "catalog/storage_xlog.h"
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"

 PG_MODULE_MAGIC;

@@ -20,6 +24,9 @@ PG_MODULE_MAGIC;
  */
 PG_FUNCTION_INFO_V1(pg_freespace);

+/* Truncate the free space map, in case of corruption */
+PG_FUNCTION_INFO_V1(pg_truncate_freespace_map);
+
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
 {
@@ -40,3 +47,54 @@ pg_freespace(PG_FUNCTION_ARGS)
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+
+Datum
+pg_truncate_freespace_map(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	ForkNumber	fork;
+	BlockNumber	block;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	/* Only some relkinds have a freespacemap */
+	if (!RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" is of wrong relation kind",
+						RelationGetRelationName(rel)),
+				 errdetail_relkind_not_supported(rel->rd_rel->relkind)));
+
+	/* Forcibly reset cached file size */
+	RelationGetSmgr(rel)->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
+
+	/* Just pretend we're going to wipe out the whole rel */
+	block = FreeSpaceMapPrepareTruncateRel(rel, 0);
+
+	if (BlockNumberIsValid(block))
+	{
+		fork = FSM_FORKNUM;
+		smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block);
+	}
+
+	if (RelationNeedsWAL(rel))
+	{
+		xl_smgr_truncate	xlrec;
+
+		xlrec.blkno = 0;
+		xlrec.rlocator = rel->rd_locator;
+		xlrec.flags = SMGR_TRUNCATE_FSM;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+		XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+	}
+
+	relation_close(rel, AccessExclusiveLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control
index ac8fc5050a9
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On Saturday, March 2, 2024 at 23:29:52 CET, Tomas Vondra wrote:
> These are "my" animals (running at a local university). There's a couple
> interesting details:

Hi Tomas, do you still have the failing cluster data? Noah pointed me to
this thread, and it looks a bit similar to the FSM corruption issue I'm
facing:
https://www.postgresql.org/message-id/1925490.taCxCBeP46%40aivenlaptop

So if you still have the data, it would be nice to see if you indeed have a
corrupted FSM, and if you have indications of when it happened.

Best regards,

--
Ronan Dunklau
Re: scalability bottlenecks with (many) partitions (and more)
On Monday, January 29, 2024 at 15:59:04 CET, Tomas Vondra wrote:
> I'm not sure work_mem is a good parameter to drive this. It doesn't say
> how much memory we expect the backend to use - it's a per-operation
> limit, so it doesn't work particularly well with partitioning (e.g. with
> 100 partitions, we may get 100 nodes, which is completely unrelated to
> what work_mem says). A backend running the join query with 1000
> partitions uses ~90MB (judging by data reported by the mempool), even
> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

I understand your point, I was basing my previous observations on what a
backend typically does during execution.

> The mempool could tell us how much memory we need (but we could track
> this in some other way too, probably). And we could even adjust the mmap
> parameters regularly, based on current workload.
>
> But then there's the problem that the mmap parameters don't tell us how
> much memory to keep, but how large chunks to release.
>
> Let's say we want to keep the 90MB (to allocate the memory once and then
> reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
> but then it takes just a little bit of extra memory to release all the
> memory, or something.

For doing this you can set M_TOP_PAD using glibc's malloc, which makes sure
a certain amount of memory is always kept.

But the way the dynamic adjustment works makes it sort of work like this
already. M_MMAP_THRESHOLD and M_TRIM_THRESHOLD start with low values,
meaning we don't expect to keep much memory around, so even "small" memory
allocations will be served using mmap at first. Once mmaped memory is
released, glibc considers it a benchmark for "normal" allocations that can
be routinely freed, and adjusts the mmap threshold to the released mmaped
region size, and the trim threshold to twice that.

It means over time the two values will converge either to the max value
(32MB for M_MMAP_THRESHOLD, 64MB for M_TRIM_THRESHOLD) or to something big
enough to accommodate your released memory, since anything bigger than half
the trim threshold will be allocated using mmap. Setting either parameter
explicitly disables that auto-adjustment.

But I'm not arguing against the mempool, just chiming in with glibc's
malloc tuning possibilities :-)
Re: scalability bottlenecks with (many) partitions (and more)
On Monday, January 29, 2024 at 13:17:07 CET, Tomas Vondra wrote:
> > Did you try running an strace on the process? That may give you some
> > insights into what malloc is doing. A more sophisticated approach would
> > be using stap and plugging it into the malloc probes, for example
> > memory_sbrk_more and memory_sbrk_less.
>
> No, I haven't tried that. In my experience strace is pretty expensive,
> and if the issue is in glibc itself (before it does the syscalls),
> strace won't really tell us much. Not sure, ofc.

It would tell you how malloc actually performs your allocations, and how
often they end up translated into syscalls. The main issue with glibc would
be that it releases the memory too aggressively to the OS, IMO.

> > An important part of glibc's malloc behaviour in that regard comes from
> > the adjustment of the mmap and trim thresholds. By default, malloc
> > adjusts them dynamically and you can poke into that using the
> > memory_mallopt_free_dyn_thresholds probe.
>
> Thanks, I'll take a look at that.
>
> >> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> >> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> >> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> >> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> >> way to tweak glibc to address this), and it also means systems using
> >> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> >> says the mempool has ~20% benefit on top of jemalloc.
> >
> > glibc's malloc offers some tuning for this. In particular, setting either
> > M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable
> > "auto adjustment" behaviour and allow you to control what it's doing.
> >
> > By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
> > using sbrk isn't freed as easily, and you don't run into a pattern of
> > moving the sbrk pointer up and down repeatedly. The automatic trade-off
> > between the mmap and trim thresholds is supposed to prevent that, but the
> > way it is incremented means you can end up in a bad place depending on
> > your particular allocation pattern.
>
> So, what values would you recommend for these parameters?
>
> My concern is increasing those values would lead to (much) higher memory
> usage, with little control over it. With the mempool we keep more
> blocks, ofc, but we have control over freeing the memory.

Right now, depending on your workload (especially if you use connection
pooling), you can end up with something like 32 or 64MB of dynamically
adjusted trim threshold, which will never be released back.

The first heuristic I had in mind was to set it to work_mem, up to a
"reasonable" limit I guess. One can argue that it is expected for a backend
to use work_mem frequently, and as such it shouldn't be released back. By
setting work_mem to a lower value, we could ask glibc at the same time to
trim the excess kept memory. That could be useful when a long-lived
connection is pooled, and sees a spike in memory usage only once. Currently
that could well end up with 32MB "wasted" permanently, but tuning it
ourselves could allow us to release it back.

Since it was last year I worked on this, I'm a bit fuzzy on the details,
but I hope this helps.
Re: scalability bottlenecks with (many) partitions (and more)
On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:

Hi Tomas!

I'll comment on the glibc-malloc part, as I studied it last year and
proposed some things here:
https://www.postgresql.org/message-id/3424675.QJadu78ljV%40aivenlaptop

> FWIW where does the malloc overhead come from? For one, while we do have
> some caching of malloc-ed memory in memory contexts, that doesn't quite
> work cross-query, because we destroy the contexts at the end of the
> query. We attempt to cache the memory contexts too, but in this case
> that can't help because the allocations come from btbeginscan() where we
> do this:
>
>     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>
> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
> thus always allocated using a separate malloc() call. Maybe we could
> break it into smaller/cacheable parts, but I haven't tried, and I doubt
> it's the only such allocation.

Did you try running an strace on the process? That may give you some
insights into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.

An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and trim thresholds. By default, malloc adjusts them
dynamically, and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.

> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> way to tweak glibc to address this), and it also means systems using
> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> says the mempool has ~20% benefit on top of jemalloc.

glibc's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
using sbrk isn't freed as easily, and you don't run into a pattern of
moving the sbrk pointer up and down repeatedly. The automatic trade-off
between the mmap and trim thresholds is supposed to prevent that, but the
way it is incremented means you can end up in a bad place depending on your
particular allocation pattern.

Best regards,

--
Ronan Dunklau
Re: LLVM 16 (opaque pointers)
On Friday, October 13, 2023 at 22:32:13 CEST, Thomas Munro wrote:
> On Wed, Oct 11, 2023 at 10:31 PM Ronan Dunklau wrote:
> > On Wednesday, October 11, 2023 at 10:59:50 CEST, Thomas Munro wrote:
> > > The back-patch to 12 was a little trickier than anticipated, but after
> > > taking a break and trying again I now have PG 12...17 patches that
> > > I've tested against LLVM 10...18 (that's 54 combinations), in every
> > > case only with the clang corresponding to LLVM.
> >
> > Thank you Thomas for those patches, and the extensive testing, I will run
> > my own and let you know.
>
> Thanks! No news is good news, I hope? I'm hoping to commit this today.
>
> > > I've attached only the patches for master, but the 12-16 versions are
> > > available at https://github.com/macdice/postgres/tree/llvm16-$N in
> > > case anyone has comments on those.
> >
> > For PG13 and PG12, it looks like the ExecEvalBoolSubroutineTemplate is
> > not used anywhere, as ExecEvalBoolSubroutine was introduced in PG14 if
> > I'm not mistaken.
>
> Right, looks like I can remove that in those branches.

Oh sorry, I thought I had followed up. I ran the same stress testing,
involving several hours of sqlsmith with all jit costs set to zero, and
didn't notice anything with LLVM 16.

Thank you!

--
Ronan Dunklau
Re: LLVM 16 (opaque pointers)
On Wednesday, October 11, 2023 at 10:59:50 CEST, Thomas Munro wrote:
> The back-patch to 12 was a little trickier than anticipated, but after
> taking a break and trying again I now have PG 12...17 patches that
> I've tested against LLVM 10...18 (that's 54 combinations), in every
> case only with the clang corresponding to LLVM.

Thank you Thomas for those patches, and the extensive testing. I will run
my own and let you know.

> I've attached only the patches for master, but the 12-16 versions are
> available at https://github.com/macdice/postgres/tree/llvm16-$N in
> case anyone has comments on those.

For PG13 and PG12, it looks like ExecEvalBoolSubroutineTemplate is not used
anywhere, as ExecEvalBoolSubroutine was introduced in PG14, if I'm not
mistaken.

Best regards,

--
Ronan Dunklau
Re: LLVM 16 (opaque pointers)
On Sunday, May 21, 2023 at 05:01:41 CEST, Thomas Munro wrote:
> Hi,
>
> Here is a draft version of the long awaited patch to support LLVM 16.
> It's mostly mechanical donkeywork, but it took some time: this donkey
> found it quite hard to understand the mighty getelementptr
> instruction[1] and the code generation well enough to figure out all
> the right types, and small mistakes took some debugging effort*. I
> now finally have a patch that passes all tests.
>
> Though it's not quite ready yet, I thought I should give this status
> update to report that the main task is more or less complete, since
> we're starting to get quite a few emails about it (mostly from Fedora
> users) and there is an entry for it on the Open Items for 16 wiki
> page. Comments/review/testing welcome.

Hello Thomas,

Thank you for this effort! I've tested it against LLVM 15 and 16, and found
no problem with it.

> 6. I need to go through the types again with a fine tooth comb, and
> check the test coverage to look out for eg GEP array arithmetic with
> the wrong type/size that isn't being exercised.

I haven't gone through the test coverage myself, but I exercised the
following things:
 - running make installcheck with jit_above_cost = 0
 - letting sqlsmith hammer random queries at it for a few hours.

This didn't show obvious issues.

> *For anyone working with this type of IR generation code and
> questioning their sanity, I can pass on some excellent advice I got
> from Andres: build LLVM yourself with assertions enabled, as they
> catch some classes of silly mistake that otherwise just segfault
> inscrutably on execution.

I tried my hand at backporting it to previous versions, and not knowing
anything about it made me indeed question my sanity. It's quite easy for PG
15, 14, and 13. PG 12 is nothing insurmountable either, but PG 11 is a bit
hairier, most notably due to the change in the fcinfo args representation.
But I guess that's also a topic for another day :-)

Best regards,

--
Ronan Dunklau
Re: Add GUC to tune glibc's malloc implementation.
On Tuesday, June 27, 2023 at 20:17:46 CEST, Andres Freund wrote:
> > Yes this is probably much more appropriate, but a much larger change with
> > greater risks of regression. Especially as we have to make sure we're not
> > overfitting our own code for a specific malloc implementation, to the
> > detriment of others.
>
> I think your approach is fundamentally overfitting our code to a specific
> malloc implementation, in a way that's not tunable by mere mortals. It just
> seems like a dead end to me.

I see it as a way to have *some* sort of control over the malloc
implementation we use, instead of tuning our allocation patterns on top of
it while treating it entirely as a black box. As for the tuning, I proposed
earlier to replace this parameter expressed in terms of size with a
"profile" (greedy / conservative) to make it easier to pick a sensible
value.

> > Except if you hinted we should write our own directly instead?
>
> I don't think we should write our own malloc - we don't rely on it much
> ourselves. And if we replace it, we need to care about malloc's performance
> characteristics a whole lot, because various libraries etc. do heavily rely
> on it.
>
> However, I do think we should eventually avoid using malloc() for aset.c
> et al. malloc() is a general allocator, but at least for allocations below
> maxBlockSize aset.c doesn't do allocations in a way that really benefits
> from that *at all*. It's not a lot of work to do such allocations on our
> own.
>
> > > We e.g. could keep a larger number of memory blocks reserved
> > > ourselves. Possibly by delaying the release of additionally held blocks
> > > until we have been idle for a few seconds or such.
> >
> > I think keeping work_mem around after it has been used a couple of times
> > makes sense. This is the memory a user is willing to dedicate to
> > operations, after all.
>
> The biggest overhead of returning pages to the kernel is that that
> triggers zeroing the data during the next allocation. Particularly on
> multi-node servers that's surprisingly slow. It's most commonly not the
> brk() or mmap() themselves that are the performance issue.
>
> Indeed, with your benchmark, I see that most of the time, on my dual Xeon
> Gold 5215 workstation, is spent zeroing newly allocated pages during page
> faults. That microarchitecture is worse at this than some others, but it's
> never free (or cache friendly).

I'm not sure I see the practical difference between those, but that's
interesting. Were you able to reproduce my results?

> FWIW, in my experience trimming the brk()ed region doesn't work reliably
> enough in real world postgres workloads to be worth relying on (from a
> memory usage POV). Sooner or later you're going to have longer lived
> allocations placed that will prevent it from happening.

I'm not sure I follow: given our workload is clearly split at query and
transaction boundaries, releasing memory at that time, I've assumed (and
noticed in practice, albeit not on a production system) that most memory at
the top of the heap would be trimmable, as we don't keep much in between
queries / transactions.

> I have played around with telling aset.c that certain contexts are long
> lived and using mmap() for those, to make it more likely that the libc
> malloc/free can actually return memory to the system. I think that can be
> quite worthwhile.

So if I understand your different suggestions, we should:
 - use mmap ourselves for what we deem to be "one-off" allocations, to make
   sure that memory is not hanging around after we no longer use it;
 - keep some pool allocated which will not be freed in between queries, but
   reused the next time we need it.

Thank you for looking at this problem.

Regards,

--
Ronan Dunklau
Re: Add GUC to tune glibc's malloc implementation.
On Tuesday, June 27, 2023 at 08:35:28 CEST, Ronan Dunklau wrote:
> I re-attached the simple script I used. I've run this script with different
> values for glibc_malloc_max_trim_threshold.

I forgot to add that it was using default parameters, except for work_mem,
set to 32MB, and max_parallel_workers_per_gather, set to zero.
Re: Add GUC to tune glibc's malloc implementation.
On Monday, June 26, 2023 at 23:03:48 CEST, Andres Freund wrote:
> Hi,
>
> On 2023-06-26 08:38:35 +0200, Ronan Dunklau wrote:
> > I hope what I'm trying to achieve is clearer that way. Maybe this patch
> > is not the best way to go about this, but since the memory allocator
> > behaviour can have such an impact it's a bit sad we have to leave half
> > the performance on the table because of it when there are easily
> > accessible knobs to avoid it.
>
> I'm *quite* doubtful this patch is the way to go. If we want to more
> tightly control memory allocation patterns, because we have more
> information than glibc, we should do that, rather than try to nudge
> glibc's malloc in random directions. In contrast to a generic malloc()
> implementation, we can have much more information about memory lifetimes
> etc. due to memory contexts.

Yes, this is probably much more appropriate, but a much larger change with
greater risks of regression. Especially as we have to make sure we're not
overfitting our own code for a specific malloc implementation, to the
detriment of others. Except if you hinted we should write our own directly
instead?

> We e.g. could keep a larger number of memory blocks reserved
> ourselves. Possibly by delaying the release of additionally held blocks
> until we have been idle for a few seconds or such.

I think keeping work_mem around after it has been used a couple of times
makes sense. This is the memory a user is willing to dedicate to
operations, after all.

> WRT the difference in TPS in the benchmark you mention - I suspect that
> we are doing something bad that needs to be improved regardless of the
> underlying memory allocator implementation. Due to the lack of detailed
> instructions I couldn't reproduce the results immediately.

I re-attached the simple script I used. I've run this script with different
values for glibc_malloc_max_trim_threshold.

Best regards,

--
Ronan Dunklau

Attachment: bench.sh (application/shellscript)
Re: Add GUC to tune glibc's malloc implementation.
On Friday, June 23, 2023 at 22:55:51 CEST, Peter Eisentraut wrote:
> On 22.06.23 15:35, Ronan Dunklau wrote:
> > The thing is, by default, those parameters are adjusted dynamically by
> > glibc itself. It starts with quite small thresholds, and raises them
> > when the program frees some memory, up to a certain limit. This patch
> > proposes a new GUC allowing the user to adjust those settings according
> > to their workload.
> >
> > This can cause problems. Let's take for example a table with 10k rows,
> > and 32 columns (as defined by a bench script David Rowley shared last
> > year when discussing the GenerationContext for tuplesort), and execute
> > the following query, with 32MB of work_mem:
>
> I don't follow what you are trying to achieve with this. The examples
> you show appear to work sensibly in my mind. Using this setting, you
> can save some of the adjustments that glibc does after the first query.
> But that seems only useful if your session only does one query. Is that
> what you are doing?

No, not at all: glibc does not do the right thing, we don't "save" it. I
will try to rephrase that.

In the first test case I showed, we see that glibc adjusts its threshold,
but to a suboptimal value, since repeated executions of a query needing the
same amount of memory will release it back to the kernel, move the brk
pointer again, and not adjust the threshold again. On the other hand, by
manually adjusting the thresholds, we can set them to a higher value, which
means that the memory will be kept in malloc's free lists for reuse by the
next queries. As shown in the benchmark results I posted, this can have
quite a dramatic effect, going from 396 tps to 894. For ease of
benchmarking, it is a single query being executed over and over again, but
the same thing would be true if different queries allocating memory were
executed by a single backend.

The worst part is that it is unpredictable: depending on past memory
allocation patterns, glibc will end up in different states, and exhibit
completely different performance for all subsequent queries. In fact, this
is what Tomas noticed last year (see [0]), which led to the investigation
into this.

I also tried to show that for certain cases glibc's behaviour can, on the
contrary, be too greedy, and hold on to too much memory if we just need the
memory once and never allocate it again.

I hope what I'm trying to achieve is clearer that way. Maybe this patch is
not the best way to go about this, but since the memory allocator behaviour
can have such an impact, it's a bit sad we have to leave half the
performance on the table because of it when there are easily accessible
knobs to avoid it.

[0] https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com
Re: Add GUC to tune glibc's malloc implementation.
Le jeudi 22 juin 2023, 15:49:36 CEST Tom Lane a écrit : > This seems like a pretty awful idea, mainly because there's no way > to have such a GUC mean anything on non-glibc platforms, which is > going to cause confusion or worse.

I named the GUC glibc_malloc_max_trim_threshold; I hope this is enough to clear up the confusion. We already have at least event_source, which is Windows-specific even if it's not clear from the name.

> > Aren't these same settings controllable via environment variables? > I could see adding some docs suggesting that you set thus-and-such > values in the postmaster's startup script. Admittedly, the confusion > argument is perhaps still raisable; but we have a similar docs section > discussing controlling Linux OOM behavior, and I've not heard much > complaints about that.

Yes they are, but controlling them via an environment variable for the whole cluster defeats the point: different backends have different workloads, and being able to make sure for example the OLAP user is memory-greedy while the OLTP one is as conservative as possible is a worthwhile goal. Or even a specific backend may want to raise its work_mem and adapt glibc behaviour accordingly, then get back to being conservative with memory until the next such transaction.

Regards,

--
Ronan Dunklau
Add GUC to tune glibc's malloc implementation.
, and a second run will allocate it on the heap instead. So if we run the query twice, we end up with some memory in malloc's free lists that we may never use again. Using the new GUC, we can actually control whether it will be given back to the OS by setting a small value for the threshold. I attached the results of the 10k rows / 32 columns / 32MB work_mem benchmark with different values for glibc_malloc_max_trim_threshold.

I don't know how to write a test for this new feature, so let me know if you have suggestions. Documentation is not written yet, as I expect discussion on this thread to lead to significant changes on the user-visible GUC or GUCs:
- should we provide one for trim which also adjusts mmap_threshold (current patch), or several GUCs ?
- should this be simplified to only offer the default behaviour (glibc takes care of the threshold) and some presets ("greedy", to set trim_threshold to work_mem, "frugal" to set it to a really small value) ?

Best regards,

--
Ronan Dunklau

>From 3686e660446facfb2d64683286176887914cd9fd Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 20 Jun 2023 16:17:32 +0200 Subject: [PATCH v1] Add options to tune malloc. Add a new GUC glibc_malloc_max_trim_threshold which is used only when being compiled against glibc. This GUC adjusts the glibc's malloc options M_MMAP_THRESHOLD and M_TRIM_THRESHOLD. The M_MMAP_THRESHOLD will be set to half M_TRIM_THRESHOLD. We cap the value to the current work_mem, as it doesn't make much sense to reserve more memory than that. This new GUC allows the user to control how aggressively malloc will actually return memory from the heap to the kernel, and when it should start using mmap for bigger allocations. Depending on the workload, a user can reduce the residual footprint of a long session (by reducing the trim threshold to a fairly low value) or on the contrary make sure that malloc keeps a sizable free chunk at the end of the heap for future allocations.
On some benchmarks heavily dependent on memory allocations, like sorts, this can dramatically influence performance. --- src/backend/utils/init/postinit.c | 6 + src/backend/utils/misc/guc_tables.c | 16 ++- src/backend/utils/misc/postgresql.conf.sample | 1 + src/backend/utils/mmgr/Makefile | 1 + src/backend/utils/mmgr/malloc_tuning.c| 122 ++ src/backend/utils/mmgr/meson.build| 1 + src/include/utils/guc_hooks.h | 2 + src/include/utils/memutils.h | 10 ++ 8 files changed, 158 insertions(+), 1 deletion(-) create mode 100644 src/backend/utils/mmgr/malloc_tuning.c diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 561bd13ed2..707f20606d 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -808,6 +808,12 @@ InitPostgres(const char *in_dbname, Oid dboid, InitCatalogCache(); InitPlanCache(); + /* Adjust malloc options if needed. + * This is done here because the implementation can vary depending on the + * type of backend. + */ + MallocAdjustSettings(); + /* Initialize portal manager */ EnablePortalManager(); diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 71e27f8eb0..f1e03dc306 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2338,7 +2338,7 @@ struct config_int ConfigureNamesInt[] = }, _mem, 4096, 64, MAX_KILOBYTES, - NULL, NULL, NULL + NULL, _work_mem, NULL }, { @@ -2364,6 +2364,20 @@ struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"glibc_malloc_max_trim_threshold", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum value for glibc's M_TRIM_THRESHOLD option."), + gettext_noop("This controls how much memory glibc's will not return to the " + "OS once freed. An idle backend can thus consume that much memory " + "even if not in used. 
The default (-1) value disable static tuning " + "and relies on the default dynamic adjustment"), + GUC_UNIT_KB + }, + _malloc_max_trim_threshold, + -1, -1, MAX_KILOBYTES, + NULL, _glibc_trim_threshold, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e4c0269fa3..dc70a1dfa3 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -157,6 +157,7 @@ # windows # mmap # (change requires restart) +#glibc_malloc_max_trim_threshold = -1 # Only used with glibc's malloc. -1 disable it #min_dynamic_shared_memory = 0MB # (change requires restart) #vacuum_buffer_usage_limit = 256kB # size of
Re: Use generation context to speed up tuplesorts
Le dimanche 18 juin 2023, 20:22:17 CEST Tomas Vondra a écrit : > Hi Ronan, > > We briefly chatted about the glibc-tuning part of this thread at pgcon, > so I wonder if you're still planning to pursue that. If you do, I > suggest we start a fresh thread, so that it's not mixed with the already > committed improvements of generation context. > > I wonder what's the situation with the generation context improvements > already pushed - does setting the glibc thresholds still help, and if > yes how much? Hi Tomas ! I'm currently working on it, trying to evaluate the possible benefits on the current ASet and Generation contexts usecases. I'll make sure to use a new thread to post my findings. Thank you for remembering ! Best regards, -- Ronan Dunklau
Re: Exclusion constraints on partitioned tables
Le vendredi 17 mars 2023, 17:03:09 CET Paul Jungwirth a écrit : > I added the code about RTEqualStrategyNumber because that's what we need > to find an equals operator when the index is GiST (except if it's using > an opclass from btree_gist; then it needs to be BTEqual again). But then > I realized that for exclusion constraints we have already figured out > the operator (in RelationGetExclusionInfo) and put it in > indexInfo->ii_ExclusionOps. So we can just compare against that. This > works whether your index uses btree_gist or not. > > Here is an updated patch with that change (also rebased). Thanks ! This looks fine to me like this. > > I also included a more specific error message. If we find a matching > column in the index but with the wrong operator, we should say so, and > not say there is no matching column. > I agree that's a nicer improvement. Regards, -- Ronan Dunklau
Re: Allow ordered partition scans in more cases
Thank you for improving this optimization !

Le mardi 21 février 2023, 04:14:02 CET David Rowley a écrit : > I still need to look to see if there's some small amount of data that > can be loaded into the table to help coax the planner into producing > the ordered scan for this one. It works fine as-is for ORDER BY a,b > and ORDER BY a; so I've put tests in for that.

I haven't looked too deeply into it, but it seems reasonable that the whole sort would cost less than individual sorts on partitions + incremental sorts, except when the whole sort would spill to disk much more than the incremental ones. I find it quite difficult to reason about what that threshold should be, but I managed to find a case which could fit in a test:

create table range_parted (a int, b int, c int) partition by range(a, b);
create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
insert into range_parted(a, b, c) select i, j, k from generate_series(1, 19) i, generate_series(1, 19) j, generate_series(1, 5) k;
analyze range_parted;
set random_page_cost = 10;
set work_mem = '64kB';
explain (costs off) select * from range_parted order by a,b,c;

It's quite convoluted, because it needs the following:
- estimate the individual partition sorts to fit into work_mem (even if that's not the case here at runtime)
- estimate the whole table sort to not fit into work_mem
- the difference between the two should be big enough to compensate for the incremental sort penalty (hence raising random_page_cost).

This is completely tangential to the subject at hand, but maybe we have improvements to do with the way we estimate what type of sort will be performed ? It seems to underestimate the memory amount needed. I'm not sure it makes a real difference in real use cases though.

Regards,

--
Ronan Dunklau
SQLFunctionCache and generic plans
Hello,

It has been brought to my attention that SQL functions always use generic plans. Take this function for example:

create or replace function test_plpgsql(p1 oid) returns text as $$
BEGIN
RETURN (SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1);
END;
$$ language plpgsql;

As expected, the PlanCache takes care of generating parameter-specific plans, and correctly prunes the redundant OR depending on whether we call the function with a NULL value or not:

ro=# select test_plpgsql(NULL);
LOG: duration: 0.030 ms plan:
Query Text: (SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1)
Result (cost=0.04..0.05 rows=1 width=64)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.00..0.04 rows=1 width=64)
         -> Seq Scan on pg_class (cost=0.00..18.12 rows=412 width=64)
LOG: duration: 0.662 ms plan:
Query Text: select test_plpgsql(NULL);
Result (cost=0.00..0.26 rows=1 width=32)

ro=# select test_plpgsql(1);
LOG: duration: 0.075 ms plan:
Query Text: (SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1)
Result (cost=8.29..8.30 rows=1 width=64)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.27..8.29 rows=1 width=64)
         -> Index Scan using pg_class_oid_index on pg_class (cost=0.27..8.29 rows=1 width=64)
              Index Cond: (oid = '1'::oid)
LOG: duration: 0.675 ms plan:
Query Text: select test_plpgsql(1);
Result (cost=0.00..0.26 rows=1 width=32)

But writing the same function in SQL:

create or replace function test_sql(p1 oid) returns text as $$
SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1
$$ language sql;

we end up with a generic plan:

ro=# select test_sql(1);
LOG: duration: 0.287 ms plan:
Query Text: SELECT relname FROM pg_class WHERE oid = p1 OR p1 IS NULL LIMIT 1
Query Parameters: $1 = '1'
Limit (cost=0.00..6.39 rows=1 width=32)
  -> Seq Scan on pg_class (cost=0.00..19.16 rows=3 width=32)
       Filter: ((oid = $1) OR ($1 IS NULL))

This is due to the fact that SQL functions are planned once for the whole query using a specific SQLFunctionCache instead
of using the whole PlanCache machinery. The following comment can be found in functions.c, about the SQLFunctionCache:

 * Note that currently this has only the lifespan of the calling query.
 * Someday we should rewrite this code to use plancache.c to save parse/plan
 * results for longer than that.

I would be interested in working on this, primarily to avoid this problem of having generic query plans for SQL functions, but maybe having a longer-lived cache as well would be nice to have. Is there any reason not to, or pitfalls we would like to avoid ?

Best regards,

--
Ronan Dunklau
Re: Exclusion constraints on partitioned tables
Le vendredi 16 décembre 2022, 06:11:49 CET Paul Jungwirth a écrit : > On 12/15/22 16:12, Tom Lane wrote: > >> This patch also requires the matching constraint columns to use equality > >> comparisons (`(foo WITH =)`), so it is really equivalent to the existing > >> b-tree rule. > > > > That's not quite good enough: you'd better enforce that it's the same > > equality operator (and same collation, if relevant) as is being used > > in the partition key. > > [snip] > > It might work better to consider the operator itself and ask if > > it's equality in the same btree opfamily that's used by the > > partition key. > > Thank you for taking a look! Here is a comparison on just the operator > itself. > I've taken a look at the patch, and I'm not sure why you keep the restriction on the Gist operator being of the RTEqualStrategyNumber strategy. I don't think we have any other place where we expect those strategy numbers to match. For hash it's different, as the hash-equality is the only operator strategy and as such there is no other way to look at it. Can't we just enforce partition_operator == exclusion_operator without adding the RTEqualStrategyNumber for the opfamily into the mix ?
Re: Ordering behavior for aggregates
Le mardi 13 décembre 2022, 16:13:34 CET Tom Lane a écrit : > Accordingly, I find nothing at all attractive in this proposal. > I think the main thing it'd accomplish is to drive users back to > the bad old days of ordering-by-subquery, if they have a requirement > we failed to account for. I think the ability to mark certain aggregates as being able to completely ignore the ordering because they produce exactly the same results is still a useful optimization. -- Ronan Dunklau
Re: Ordering behavior for aggregates
Le mardi 13 décembre 2022, 14:05:10 CET Vik Fearing a écrit : > On 12/13/22 13:55, Magnus Hagander wrote: > > On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing wrote: > >> However, it is completely useless for things like AVG() or SUM(). If > >> you include it, the aggregate will do the sort even though it is neither > >> required nor desired.

I'm not sure about this. For AVG and SUM, if you want reproducible results with floating point numbers, you may want it. And if you disallow it for most avg and sum implementations except for floating point types, it's not a very consistent user experience.

> >> > >> I am proposing something like pg_aggregate.aggordering which would be an > >> enum of behaviors such as f=Forbidden, a=Allowed, r=Required. Currently > >> all aggregates would have 'a' but I am thinking that a lot of them could > >> be switched to 'f'. In that case, if a user supplies an ordering, an > >> error is raised. > > > > Should there perhaps also be an option for "ignored" where we'd allow the > > user to specify it, but not actually do the sort because we know it's > > pointless? Or maybe that should be the behaviour of "forbidden", which > > should then perhaps have a different name? > > I did think about that but I can't think of any reason we would want to > silently ignore something the user has written. If the ordering doesn't > make sense, we should forbid it.

It is allowed as of now, and so it would be a compatibility issue for queries existing in the wild. Ignoring it is just an optimization, just as we optimize away some joins entirely.

--
Ronan Dunklau
Re: Fix gin index cost estimation
Le vendredi 2 décembre 2022, 13:58:27 CET Ronan Dunklau a écrit : > Le vendredi 2 décembre 2022, 12:33:33 CET Alexander Korotkov a écrit : > > Hi, Ronan! > > Thank you for your patch. Couple of quick questions. > > 1) What magic number 50.0 stands for? I think we at least should make > > it a macro. > > This is what is used in other tree-descending estimation functions, so I > used that too. Maybe a DEFAULT_PAGE_CPU_COST macro would work for both ? If > so I'll separate this into two patches, one introducing the macro for the > other estimation functions, and this patch for gin.

The 0001 patch does this.

> > > 2) "We only charge one data page for the startup cost" – should this > > be dependent on number of search entries?

In fact there was another problem. The current code estimates two different paths for fetching data pages: in the case of a partial match, it takes into account that all the data pages will have to be fetched. So this is now taken into account for the CPU cost as well. For the regular search, we scale the number of data pages by the number of search entries.

Best regards,

--
Ronan Dunklau

>From b3d03a96ecb3d03c4c212e3c434def6e0e77fcd4 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 6 Dec 2022 11:08:46 +0100 Subject: [PATCH v4 1/2] Extract the cpu page-process cost multiplier into a macro. Btree, GiST and SP-GiST all charge 50.0 * cpu_operator_cost for processing an index page. Extract this to a macro to avoid repeated magic numbers.
--- src/backend/utils/adt/selfuncs.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index f116924d3c..5baf8cf631 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -140,6 +140,7 @@ #include "utils/timestamp.h" #include "utils/typcache.h" +#define DEFAULT_PAGE_CPU_MULTIPLIER 50.0 /* Hooks for plugins to get control when we ask for stats */ get_relation_stats_hook_type get_relation_stats_hook = NULL; @@ -6858,7 +6859,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count, * touched. The number of such pages is btree tree height plus one (ie, * we charge for the leaf page too). As above, charge once per SA scan. */ - descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost; + descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost; costs.indexStartupCost += descentCost; costs.indexTotalCost += costs.num_sa_scans * descentCost; @@ -7053,7 +7054,7 @@ gistcostestimate(PlannerInfo *root, IndexPath *path, double loop_count, /* * Likewise add a per-page charge, calculated the same as for btrees. */ - descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost; + descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost; costs.indexStartupCost += descentCost; costs.indexTotalCost += costs.num_sa_scans * descentCost; @@ -7108,7 +7109,7 @@ spgcostestimate(PlannerInfo *root, IndexPath *path, double loop_count, /* * Likewise add a per-page charge, calculated the same as for btrees. 
*/ - descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost; + descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost; costs.indexStartupCost += descentCost; costs.indexTotalCost += costs.num_sa_scans * descentCost; -- 2.38.1 >From 817f02fa5ace98b343e585568165cc2aef84328a Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 12 Sep 2022 15:40:18 +0200 Subject: [PATCH v4 2/2] Fix gin costing. GIN index scans were not taking any descent CPU-based cost into account. That made them look cheaper than other types of indexes when they shouldn't be. We use the same heuristic as for btree indexes, but multiplying it by the number of searched entries. Additionnally, the cpu cost for the tree was based largely on genericcostestimate. For a GIN index, we should not charge index quals per tuple, but per entry. On top of this, charge cpu_index_tuple_cost per actual tuple. This should fix the cases where a GIN index is preferred over a btree, and the ones where a memoize node is not added on top of the GIN index scan because it seemed too cheap. Per report of Hung Nguyen. --- src/backend/utils/adt/selfuncs.c | 54 +--- 1 file changed, 50 insertions(+), 4 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 5baf8cf631..0b78fe450b 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -7446,6 +7446,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, qual_arg_cost, spc_random_page_cost, outer_scans; + Cost descentCost; Relation indexRel; GinStatsData ginStats; ListCell *lc; @@ -767
Re: Fix gin index cost estimation
Le vendredi 2 décembre 2022, 12:33:33 CET Alexander Korotkov a écrit : > Hi, Ronan! > Thank you for your patch. Couple of quick questions. > 1) What magic number 50.0 stands for? I think we at least should make > it a macro. This is what is used in other tree-descending estimation functions, so I used that too. Maybe a DEFAULT_PAGE_CPU_COST macro would work for both ? If so I'll separate this into two patches, one introducing the macro for the other estimation functions, and this patch for gin. > 2) "We only charge one data page for the startup cost" – should this > be dependent on number of search entries? Good point, multiplying it by the number of search entries would do the trick. Thank you for looking at this ! Regards, -- Ronan Dunklau
Re: Fix gin index cost estimation
Le mardi 25 octobre 2022, 16:18:57 CET Tom Lane a écrit : > Alexander Korotkov writes: > > I think Tom's point was that it's wrong to add a separate entry-tree CPU > > cost estimation to another estimation, which tries (very inadequately) to > > estimate the whole scan cost. Instead, I propose writing better > > estimations > > for both entry-tree CPU cost and data-trees CPU cost and replacing > > existing > > CPU estimation altogether. > > Great idea, if someone is willing to do it ... > > regards, tom lane

Hello,

Sorry for the delay, but here is an updated patch which changes the costing in the following way:
- a descent cost, similar to the btree one, is charged for the initial entry-tree descent
- additionally, a charge is applied per page in both the entry tree and posting trees / lists
- instead of charging the quals to each tuple, charge them per entry only. We still charge cpu_index_tuple_cost per tuple though.

With those changes, there is no need to tweak the magic number formula estimating the number of pages. Maybe we can come up with something better for estimating those later on ?

Best regards,

--
Ronan Dunklau

>From 49bc71f25967f9367196803be363509922aac8ab Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 12 Sep 2022 15:40:18 +0200 Subject: [PATCH v3] Fix gin costing. GIN index scans were not taking any descent CPU-based cost into account. That made them look cheaper than other types of indexes when they shouldn't be. We use the same heuristic as for btree indexes, but multiplying it by the number of searched entries. Additionally, the cpu cost for the tree was based largely on genericcostestimate. For a GIN index, we should not charge index quals per tuple, but per entry. On top of this, charge cpu_index_tuple_cost per actual tuple. This should fix the cases where a GIN index is preferred over a btree, and the ones where a memoize node is not added on top of the GIN index scan because it seemed too cheap. Per report of Hung Nguyen.
--- src/backend/utils/adt/selfuncs.c | 50 +--- 1 file changed, 46 insertions(+), 4 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index f116924d3c..1d07ae8e95 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -7445,6 +7445,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, qual_arg_cost, spc_random_page_cost, outer_scans; + Cost descentCost; Relation indexRel; GinStatsData ginStats; ListCell *lc; @@ -7669,6 +7670,41 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, */ dataPagesFetched = ceil(numDataPages * partialScale); + *indexStartupCost = 0; + *indexTotalCost = 0; + + /* + * Add a CPU-cost component to represent the costs of initial entry btree + * descent. We don't charge any I/O cost for touching upper btree levels, + * since they tend to stay in cache, but we still have to do about log2(N) + * comparisons to descend a btree of N leaf tuples. We charge one + * cpu_operator_cost per comparison. + * + * If there are ScalarArrayOpExprs, charge this once per SA scan. The + * ones after the first one are not startup cost so far as the overall + * plan is concerned, so add them only to "total" cost. + */ + if (numEntries > 1) /* avoid computing log(0) */ + { + descentCost = ceil(log(numEntries) / log(2.0)) * cpu_operator_cost; + *indexStartupCost += descentCost * counts.searchEntries; + *indexTotalCost += counts.arrayScans * descentCost * counts.searchEntries; + } + + /* + * Add a cpu cost per entry-page fetched. This is not amortized over a loop. + */ + *indexStartupCost += entryPagesFetched * 50.0 * cpu_operator_cost; + *indexTotalCost += entryPagesFetched * counts.arrayScans * 50.0 * cpu_operator_cost; + + /* + * Add a cpu cost per data-page fetched. This is also not amortized over a loop. + * We only charge one data page for the startup cost, and everything else to + * the total cost. 
+ */ + *indexStartupCost += 50.0 * cpu_operator_cost; + *indexTotalCost += dataPagesFetched * counts.arrayScans * 50.0 * cpu_operator_cost; + /* * Calculate cache effects if more than one scan due to nestloops or array * quals. The result is pro-rated per nestloop scan, but the array qual @@ -7692,7 +7728,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, * Here we use random page cost because logically-close pages could be far * apart on disk. */ - *indexStartupCost = (entryPagesFetched + dataPagesFetched) * spc_random_page_cost; + *indexStartupCost += (entryPagesFetched + dataPagesFetched) * spc_random_page_cost; /* * Now compute the number of data pages fetched during the scan. @@ -7720,6 +7756,9 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, if (dataPagesFetchedBySel > da
Re: Asynchronous execution support for Custom Scan
> IIUC, we already can use ctid in the where clause on the latest > PostgreSQL, can't we? Oh, sorry, I missed the TidRangeScan. My apologies for the noise. Best regards, -- Ronan Dunklau
Re: Asynchronous execution support for Custom Scan
Le mardi 6 septembre 2022, 11:29:55 CET Etsuro Fujita a écrit : > On Mon, Sep 5, 2022 at 10:32 PM Kazutaka Onishi wrote: > > I'm sorry for my error on your name... > > No problem. > > > > IIUC, it uses the proposed > > > > > > APIs, but actually executes ctidscans *synchronously*, so it does not > > > improve performance. Right? > > > > Exactly. > > The actual CustomScan that supports asynchronous execution will > > start processing in CustomScanAsyncRequest, > > configure to detect completion via file descriptor in > > CustomScanAsyncConfigureWait, > > and receive the result in CustomScanAsyncNotify. > > Ok, thanks!

Thanks for this patch; it seems like a useful addition to the CustomScan API. Just to nitpick: there are extraneous tabs in createplan.c on a blank line.

Sorry for the digression, but I know your ctidscan module had been proposed for inclusion in contrib a long time ago, and I wonder if the rationale for not including it could have changed. We still don't have tests which cover CustomScan, and I can think of at least a few use cases where this customscan is helpful and not merely testing code. One of those use cases is when performing migrations on a table, and one wants to update the whole table by filling a new column with a computed value. You obviously don't want to do it in a single transaction, so you end up batching updates using an index looking for null values. If you want to do this, it's much faster to update rows in a range of blocks, performing a first series of batches updating all such block ranges, and then finally update the ones we missed transactionally (inserted in a block we already processed while in the middle of the batch, or in new blocks resulting from a relation extension).

Best regards,

--
Ronan Dunklau
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Le mardi 8 novembre 2022, 02:31:12 CET David Rowley a écrit : > 1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path > when we can add an Incremental Sort path instead. This removes quite a > few redundant lines of code.

This seems sensible.

> 2. Removes the * 1.5 fuzz-factor in cost_incremental_sort() > 3. Does various other code tidy stuff in cost_incremental_sort(). > 4. Removes the test from incremental_sort.sql that was ensuring the > inferior Sort -> Sort plan was being used instead of the superior Sort > -> Incremental Sort plan. > > I'm not really that 100% confident in the removal of the * 1.5 thing. > I wonder if there's some reason we're not considering that might cause > a performance regression if we're to remove it.

I'm not sure about it either. It seems to me that we were afraid of regressions, and having this overcharged just made us miss a new optimization without changing existing plans. With ordered aggregates, the balance is a bit trickier, and we are at risk of either regressing on aggregate plans, or on more common ordered ones.

--
Ronan Dunklau
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Le lundi 7 novembre 2022, 17:58:50 CET Pavel Luzanov a écrit : > > I can't see why an incrementalsort could be more expensive than a full > > sort, using the same presorted path. > > The only reason I can see is the number of buffers to read. In the plan > with incremental sort we read the whole index, ~19 buffers. > And the plan with seq scan only required ~5000 (I think due to buffer > ring optimization). What I meant here is that disabling seqscans, the planner still chooses a full sort over a partial sort. The underlying index is the same, it is just a matter of choosing a Sort node over an IncrementalSort node. This, I think, is wrong: I can't see how it could be worse to use an incrementalsort in that case. It makes sense to prefer a SeqScan over an IndexScan if you are going to sort the whole table anyway. But in that case we shouldn't. What happened before is that some sort of incremental sort was always chosen, because it was hidden as an implementation detail of the agg node. But now it has to compete on a cost basis with the full sort, and that costing is wrong in that case. Maybe the original costing code for incremental sort was a bit too pessimistic. -- Ronan Dunklau
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Le samedi 5 novembre 2022, 09:51:23 CET Pavel Luzanov a écrit : > While playing with the patch I found a situation where the performance > may be degraded compared to previous versions. > > The test case below. > If you create a proper index for the query (a,c), version 16 wins. On my > notebook, the query runs ~50% faster. > But if there is no index (a,c), but only (a,b), in previous versions the > planner uses it, but with this patch a full table scan is selected.

Hello,

In your exact use case, the combo incremental sort + index scan is evaluated to cost more than doing a full sort + seqscan. If you try for example to create an index on (b, a) and group by b, you will get the expected behaviour:

ro=# create index on t (b, a);
CREATE INDEX
ro=# explain select b, array_agg(c order by c) from t group by b;
 QUERY PLAN
-
 GroupAggregate (cost=10.64..120926.80 rows=9970 width=36)
   Group Key: b
   -> Incremental Sort (cost=10.64..115802.17 rows=100 width=6)
        Sort Key: b, c
        Presorted Key: b
        -> Index Scan using t_b_a_idx on t (cost=0.42..47604.12 rows=100 width=6)
(6 rows)

I think we can trace that back to incremental sort being pessimistic about its performance. If you try the same query, but with set enable_seqscan = off, you will get a full sort over an index scan:

 QUERY PLAN
-
 GroupAggregate (cost=154944.94..162446.19 rows=100 width=34)
   Group Key: a
   -> Sort (cost=154944.94..157444.94 rows=100 width=4)
        Sort Key: a, c
        -> Index Scan using t_a_b_idx on t (cost=0.42..41612.60 rows=100 width=4)
(5 rows)

This probably comes from the overly pessimistic assumption that the number of tuples per group will be 1.5 times as much as we should estimate:

/*
 * Estimate average cost of sorting of one group where presorted keys are
 * equal. Incremental sort is sensitive to distribution of tuples to the
 * groups, where we're relying on quite rough assumptions. Thus, we're
 * pessimistic about incremental sort performance and increase its average
 * group size by half.
*/ I can't see why an incrementalsort could be more expensive than a full sort, using the same presorted path. It looks to me that in that case we should always prefer an incrementalsort. Maybe we should bound incremental sorts cost to make sure they are never more expensive than the full sort ? Also, prior to this commit I don't think it made a real difference, because worst case scenario we would have missed an incremental sort, which we didn't have beforehand. But with this patch, we may actually replace a "hidden" incremental sort which was done in the agg codepath by a full sort. Best regards, -- Ronan Dunklau
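[Editor's note] The effect of that 1.5x group-size inflation can be made concrete with a toy cost model. This is an illustration only: it uses plain n * log2(n) comparison counts instead of PostgreSQL's actual cost_tuplesort()/cost_incremental_sort() functions, but it shows how inflating the estimated group size can push incremental sort's estimated cost above a full sort of the very same input, even though sorting smaller batches is never truly more expensive:

```python
import math

def full_sort_cost(n):
    # toy comparison-sort cost: n * log2(n)
    return n * math.log2(n)

def incremental_sort_cost(n, num_groups, pessimism=1.5):
    # sort each presorted group separately, but inflate the estimated
    # group size by the pessimism factor, mimicking the quoted comment
    # ("increase its average group size by half")
    group_tuples = (n / num_groups) * pessimism
    per_group_cost = group_tuples * math.log2(group_tuples)
    return num_groups * per_group_cost

n, groups = 1_000_000, 10

# With the 1.5x pessimism, incremental sort is costed above a full sort...
assert incremental_sort_cost(n, groups, pessimism=1.5) > full_sort_cost(n)
# ...while without it, sorting 10 smaller batches is correctly cheaper.
assert incremental_sort_cost(n, groups, pessimism=1.0) < full_sort_cost(n)
```

The inversion needs few, large groups: with many small groups the log2 term shrinks enough that even the inflated per-group cost stays below the full sort.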
Re: Fix gin index cost estimation
> > You're right, I was too eager to try to raise the CPU cost proportionally
> > to the number of array scans (scalararrayop). I'd really like to understand
> > where this equation comes from though...
>
> So, what's the latest update here?

Thanks Michael for reviving this thread.

Before proceeding any further with this, I'd like to understand where we stand. Tom argued that my way of charging a cost per entry page visited boils down to charging per tuple, which I expressed disagreement with. If we can come to a consensus on whether that's a bogus way of thinking about it, I'll proceed with what we agree on.

--
Ronan Dunklau
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node
Le jeudi 29 septembre 2022, 16:10:03 CEST Tom Lane a écrit : > Ronan Dunklau writes: > >> Yeah, I think the same rules around scope apply as > >> tuplesort_gettupleslot() with copy==false. We could do it by adding a > >> copy flag to the existing function, but I'd rather not add the > >> branching to that function. It's probably just better to duplicate it > >> and adjust. > > > > For the record, I tried to see if gcc would optimize the function by > > generating two different versions when copy is true or false, thus getting rid > > of the branching while still having only one function to deal with. > > TBH, I think this is completely ridiculous over-optimization. > There's exactly zero evidence that a second copy of the function > would improve performance, or do anything but contribute to code > bloat (which does have a distributed performance cost). I wasn't commenting on the merit of the optimization, but just that I tried to get gcc to apply it itself, which it doesn't. Regards, -- Ronan Dunklau
Re: A potential memory leak on Merge Join when Sort node is not below Materialize node
> I've just pushed the disable byref Datums patch I posted earlier. I > only made a small adjustment to make use of the TupleDescAttr() macro. > Önder, thank you for the report. Thank you David for taking care of this. > Yeah, I think the same rules around scope apply as > tuplesort_gettupleslot() with copy==false. We could do it by adding a > copy flag to the existing function, but I'd rather not add the > branching to that function. It's probably just better to duplicate it > and adjust. > For the record, I tried to see if gcc would optimize the function by generating two different versions when copy is true or false, thus getting rid of the branching while still having only one function to deal with. Using the -fipa-cp-clone (or even the whole set of additional flags coming with -O3), it does generate a special-case version of the function, but it seems to then only be used by heapam_index_validate_scan and percentile_cont_multi_final_common. This is from my investigation looking for references to the specialized version in the DWARF debug information. Regards, -- Ronan Dunklau
Re: Fix gin index cost estimation
Le vendredi 16 septembre 2022, 22:04:59 CEST Tom Lane a écrit : > Ronan Dunklau writes: > > The attached does that and is much simpler. I only took into account > > entryPagesFetched, not sure if we should also charge something for data pages. > > I think this is wrong, because there is already a CPU charge based on > the number of tuples visited, down at the very end of the routine: > > *indexTotalCost += (numTuples * *indexSelectivity) * (cpu_index_tuple_cost + qual_op_cost); > > It's certainly possible to argue that that's incorrectly computed, > but just adding another cost charge for the same topic can't be right. I don't think it's the same thing. The entryPagesFetched is computed independently of the selectivity and the number of tuples. As such, I think it makes sense to use it to compute the cost of descending the entry tree. As mentioned earlier, I don't really understand the formula for computing entryPagesFetched. If we were to estimate the tree height to compute the descent cost as I first proposed, I feel like we would use two different metrics for what is essentially the same cost: something proportional to the size of the entry tree. > > I do suspect that that calculation is bogus, because it looks like it's > based on the concept of "apply the quals at each index entry", which we > know is not how GIN operates. So maybe we should drop that bit in favor > of a per-page-ish cost like you have here. Not sure. In any case it > seems orthogonal to the question of startup/descent costs. Maybe we'd > better tackle just one thing at a time. Hum, good point. Maybe that should be revisited too. > > (BTW, given that that charge does exist and is not affected by > repeated-scan amortization, why do we have a problem in the first place? > Is it just too small? I guess that when we're only expecting one tuple > to be retrieved, it would only add about cpu_index_tuple_cost.) 
Because with a very low selectivity, we end up under-charging for the cost of walking the entry tree by a significant amount. As said above, I don't see how those two things are the same: that charge estimates the cost of applying index quals to the visited tuples, which is not the same as charging per entry page visited.

> > Is the power(0.15) used as an approximation for a log? If so why? Also
> > shouldn't we round that up?
>
> No idea, but I'm pretty hesitant to just randomly fool with that equation
> when (a) neither of us know where it came from and (b) exactly no evidence
> has been provided that it's wrong.
>
> I note for instance that the existing logic switches from charging 1 page
> per search to 2 pages per search at numEntryPages = 15 (1.5 ^ (1/0.15)).
> Your version would switch at 2 pages, as soon as the pow() result is even
> fractionally above 1.0. Maybe the existing logic is too optimistic there,
> but ceil() makes it too pessimistic I think. I'd sooner tweak the power
> constant than change from rint() to ceil().

You're right, I was too eager to try to raise the CPU cost proportionally to the number of array scans (scalararrayop). I'd really like to understand where this equation comes from though...

Best regards,

--
Ronan Dunklau
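[Editor's note] The two crossover points Tom mentions are easy to check directly. The sketch below is a plain-Python transcription of the quoted expression, with Python's banker's rounding round() standing in for C's rint() in its default round-to-nearest mode:

```python
import math

def rint(x):
    # round-half-to-even, matching C rint() in the default rounding mode
    return round(x)

def pages_charged(num_entry_pages, rounding):
    # the disputed estimate: pow(numEntryPages, 0.15), rounded
    return rounding(num_entry_pages ** 0.15)

# Existing rint() logic: switches from 1 to 2 pages per search at
# numEntryPages = 15, since 1.5 ** (1 / 0.15) ~= 15.
assert pages_charged(14, rint) == 1
assert pages_charged(15, rint) == 2

# ceil() switches as soon as pow() exceeds 1.0, i.e. already at 2 pages.
assert pages_charged(1, math.ceil) == 1
assert pages_charged(2, math.ceil) == 2
```

This is why Tom suggests tweaking the power constant rather than the rounding function: the exponent moves the crossover smoothly, while swapping rint() for ceil() jumps straight to the most pessimistic rounding.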
Re: Fix gin index cost estimation
Le lundi 12 septembre 2022, 16:41:16 CEST Ronan Dunklau a écrit :
> But I realised that another approach might be better suited: since we want
> to charge a cpu cost for every page visited, actually basing that on the
> already estimated entryPagesFetched and dataPagesFetched would be better,
> instead of copying what is done for other index types and estimating the
> tree height. It would be simpler, as we don't need to estimate the tree
> height anymore.
>
> I will submit a patch doing that.

The attached does that and is much simpler. I only took into account entryPagesFetched; I'm not sure if we should also charge something for data pages.

Instead of trying to estimate the height of the tree, we rely on the (imperfect) estimation of the number of entry pages fetched, and charge 50 times cpu_operator_cost per entry page fetched, in addition to the cpu_operator_cost charged per entry visited. I also adapted it to take into account multiple scans induced by scalar array operations.

As it is, I don't understand the following calculation:

/*
 * Estimate number of entry pages read. We need to do
 * counts.searchEntries searches. Use a power function as it should be,
 * but tuples on leaf pages usually is much greater. Here we include all
 * searches in entry tree, including search of first entry in partial
 * match algorithm
 */
entryPagesFetched += ceil(counts.searchEntries * rint(pow(numEntryPages, 0.15)));

Is the power(0.15) used as an approximation for a log? If so, why? Also, shouldn't we round that up? It seems to me it's unlikely to affect the total too much in normal cases (adding at worst random_page_cost), but if we start to charge cpu operator costs as proposed here it makes a big difference, and it is probably safer to overestimate a bit than the opposite.

With those changes, the gin cost (purely cpu-wise) stays above the btree one, as I think it should.
--
Ronan Dunklau

>From 8635a22ee5f90297756f072199c69a318bd17ea2 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Mon, 12 Sep 2022 15:40:18 +0200
Subject: [PATCH v2] Fix gin costing.

GIN index scans were not taking any descent CPU-based cost into account.
That made them look cheaper than other types of indexes when they
shouldn't be.

We use the same heuristic as for btree indexes, but multiplying it by
the number of searched entries.

Per report of Hung Nguyen.
---
 src/backend/utils/adt/selfuncs.c | 34 +---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c746759eef..881e470a07 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7419,6 +7419,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				qual_arg_cost,
 				spc_random_page_cost,
 				outer_scans;
+	Cost		descentCost;
 	Relation	indexRel;
 	GinStatsData ginStats;
 	ListCell   *lc;
@@ -7622,7 +7623,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * searches in entry tree, including search of first entry in partial
 	 * match algorithm
 	 */
-	entryPagesFetched += ceil(counts.searchEntries * rint(pow(numEntryPages, 0.15)));
+	entryPagesFetched += ceil(counts.searchEntries * ceil(pow(numEntryPages, 0.15)));

 	/*
 	 * Add an estimate of entry pages read by partial match algorithm. It's a
@@ -7643,6 +7644,33 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 */
 	dataPagesFetched = ceil(numDataPages * partialScale);

+	*indexStartupCost = 0;
+	*indexTotalCost = 0;
+
+	/*
+	 * Add a CPU-cost component to represent the costs of initial entry btree
+	 * descent. We don't charge any I/O cost for touching upper btree levels,
+	 * since they tend to stay in cache, but we still have to do about log2(N)
+	 * comparisons to descend a btree of N leaf tuples. We charge one
+	 * cpu_operator_cost per comparison.
+	 *
+	 * If there are ScalarArrayOpExprs, charge this once per SA scan. The
+	 * ones after the first one are not startup cost so far as the overall
+	 * plan is concerned, so add them only to "total" cost.
+	 */
+	if (numEntries > 1)			/* avoid computing log(0) */
+	{
+		descentCost = ceil(log(numEntries) / log(2.0)) * cpu_operator_cost;
+		*indexStartupCost += descentCost * counts.searchEntries;
+		*indexTotalCost += counts.arrayScans * descentCost * counts.searchEntries;
+	}
+
+	/*
+	 * Add a cpu cost per page fetched. This is not amortized over a loop.
+	 */
+	*indexStartupCost += entryPagesFetched * 50.0 * cpu_operator_cost;
+	*indexTotalCost += entryPagesFetched * counts.arrayScans * 50.0 * cpu_operator_cost;
+
 	/*
 	 * Calculate cache effects if more than one scan due to nestloops or array
 	 * quals. The result is pro-rated per nestloop scan, but the array qual
@@ -7666,7 +7694,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * Here we use random page cost because logicall
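[Editor's note] The arithmetic this v2 patch adds can be sketched in isolation. The following is a Python transcription for illustration only, using PostgreSQL's default cpu_operator_cost of 0.0025 and ignoring the I/O and cache-effect charges computed elsewhere in gincostestimate():

```python
import math

cpu_operator_cost = 0.0025      # PostgreSQL's default setting

def gin_descent_charges(num_entries, search_entries, array_scans,
                        entry_pages_fetched):
    """Mirror the startup/total additions made by the v2 patch above."""
    startup = total = 0.0
    if num_entries > 1:         # avoid computing log(0)
        # about log2(N) comparisons to descend an N-entry btree,
        # one cpu_operator_cost each, charged per searched entry
        descent = math.ceil(math.log2(num_entries)) * cpu_operator_cost
        startup += descent * search_entries
        # SA scans after the first are not startup cost, only total cost
        total += array_scans * descent * search_entries
    # a cpu cost per entry page fetched, not amortized over a loop
    startup += entry_pages_fetched * 50.0 * cpu_operator_cost
    total += entry_pages_fetched * array_scans * 50.0 * cpu_operator_cost
    return startup, total

# One search over a 1000-entry tree touching 3 entry pages:
startup, total = gin_descent_charges(1000, 1, 1, 3)
# ceil(log2(1000)) = 10 comparisons, plus 3 * 50 operator costs for pages
assert math.isclose(startup, (10 + 150) * cpu_operator_cost)

# A second ScalarArrayOpExpr scan doubles only the "total" component:
startup2, total2 = gin_descent_charges(1000, 1, 2, 3)
assert math.isclose(startup2, startup)
assert math.isclose(total2, 2 * total)
```

This illustrates the structure reviewers asked for: the startup cost is charged once, while everything that repeats per SA scan is scaled by arrayScans in the total cost.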
Re: Fix gin index cost estimation
Thank you for looking at it.

> I looked this over briefly. I think you are correct to charge an
> initial-search cost per searchEntries count, but don't we also need to
> scale up by arrayScans, similar to the "corrections for cache effects"?
>
> + * We model index descent costs similarly to those for btree, but we also
> + * need an idea of the tree_height.
> + * We use numEntries / numEntryPages as the fanout factor.
>
> I'm not following that calculation? It seems like it'd be correct
> only for a tree height of 1, although maybe I'm just misunderstanding
> this (overly terse, perhaps) comment.

I don't really understand why that would work only with a tree height of one? Every entry page contains a certain number of entries, and as such computing the average number of entries per page seems to be a good approximation for the fanout. But I may have misunderstood what was done in other index types.

For consistency, maybe we should just use a hard-coded value of 100 for the fanout factor, similarly to what we do for other index types.

But I realised that another approach might be better suited: since we want to charge a cpu cost for every page visited, actually basing that on the already estimated entryPagesFetched and dataPagesFetched would be better, instead of copying what is done for other index types and estimating the tree height. It would be simpler, as we don't need to estimate the tree height anymore.

I will submit a patch doing that.

> + * We charge descentCost once for every entry
> + */
> + if (numTuples > 1)
> + {
> +     descentCost = ceil(log(numTuples) / log(2.0)) * cpu_operator_cost;
> +     *indexStartupCost += descentCost * counts.searchEntries;
> + }
>
> I had to read this twice before absorbing the point of the numTuples
> test. Maybe help the reader a bit:
>
> + if (numTuples > 1)	/* ensure positive log() */

Ok. On second read, I think that part was actually wrong: what we care about here is not the number of tuples, but the number of entries.
> Personally I'd duplicate the comments from nbtreecostestimate rather
> than just assuming the reader will go consult them. For that matter,
> why didn't you duplicate nbtree's logic for charging for SA scans?
> This bit seems just as relevant for GIN:
>
>    * If there are ScalarArrayOpExprs, charge this once per SA scan. The
>    * ones after the first one are not startup cost so far as the overall
>    * plan is concerned, so add them only to "total" cost.

You're right. So what we need to do here is scale up whatever we charge for the startup cost by the number of arrayscans for the total cost.

> Keep in mind also that pgindent will have its own opinions about how to
> format these comments, and it can out-stubborn you. Either run the
> comments into single paragraphs, or if you really want them to be two
> paras then leave an empty comment line between. Another formatting
> nitpick is that you seem to have added a number of unnecessary blank
> lines.

Thanks, noted.

I'll submit a new patch soon, as soon as I've resolved some of the problems I have when accounting for scalararrayops.

Best regards,

--
Ronan Dunklau
Fix gin index cost estimation
Hello,

Following the bug report at [1], I sent the attached patch to the pgsql-bugs mailing list. I'm starting a thread here to add it to the next commitfest.

The problem I'm trying to solve is that, contrary to btree, gist and sp-gist indexes, gin indexes do not charge any cpu cost for descending the entry tree. This can be a problem in cases where the IO cost is very low. This can happen with manual tuning of course, but more surprisingly when the IO cost is amortized over a large number of iterations in a nested loop. In that case, we basically consider it free since everything should already be in the shared buffers.

This leads to some inefficient plans, as an equivalent btree index should be picked instead.

This has been discovered in PG14, as this release makes it possible to use a pg_trgm gin index with the equality operator. Before that, only the btree would have been considered, and as such the discrepancy in the way we charge cpu cost didn't have noticeable effects. However, I suspect users of btree_gin could have the same kind of problems in prior versions.

Best regards,

[1]: https://www.postgresql.org/message-id/flat/2187702.iZASKD2KPV%40aivenronan#0c2498c6a85e31a589b3e9a6a3616c52

--
Ronan Dunklau

>From 0aa1ff24e58234d759c07f4eeec163a82244be25 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Wed, 6 Jul 2022 17:29:01 +0200
Subject: [PATCH v1] Fix gin costing.

GIN index scans were not taking any descent CPU-based cost into account.
That made them look cheaper than other types of indexes when they
shouldn't be.

We use the same heuristic as for btree indexes, but multiplying it by
the number of searched entries.

Per report of Hung Nguyen.
---
 src/backend/utils/adt/selfuncs.c | 36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fa1f589fad..21407c3d38 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7425,6 +7425,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				qual_arg_cost,
 				spc_random_page_cost,
 				outer_scans;
+	Cost		descentCost;
 	Relation	indexRel;
 	GinStatsData ginStats;
 	ListCell   *lc;
@@ -7524,6 +7525,18 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 							  &spc_random_page_cost,
 							  NULL);
+
+	/*
+	 * We model index descent costs similarly to those for btree, but we also
+	 * need an idea of the tree_height.
+	 * We use numEntries / numEntryPages as the fanout factor.
+	 */
+	if (index->tree_height < 0)
+	{
+		index->tree_height = ceil(log(numEntries) / log(numEntries / numEntryPages));
+	}
+
+
 	/*
 	 * Generic assumption about index correlation: there isn't any.
 	 */
@@ -7649,6 +7662,27 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 */
 	dataPagesFetched = ceil(numDataPages * partialScale);
+
+	*indexStartupCost = 0;
+
+	/*
+	 * Add a CPU-cost component similar to btree to represent the costs of the
+	 * initial descent.
+	 * We charge descentCost once for every entry
+	 */
+	if (numTuples > 1)
+	{
+		descentCost = ceil(log(numTuples) / log(2.0)) * cpu_operator_cost;
+		*indexStartupCost += descentCost * counts.searchEntries;
+	}
+
+	/*
+	 * Add a similar per-page charge, depending on the tree heights.
+	 */
+	descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
+	*indexStartupCost += descentCost * counts.searchEntries;
+
 	/*
 	 * Calculate cache effects if more than one scan due to nestloops or array
 	 * quals. The result is pro-rated per nestloop scan, but the array qual
@@ -7672,7 +7706,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * Here we use random page cost because logically-close pages could be far
 	 * apart on disk.
 	 */
-	*indexStartupCost = (entryPagesFetched + dataPagesFetched) * spc_random_page_cost;
+	*indexStartupCost += (entryPagesFetched + dataPagesFetched) * spc_random_page_cost;

 	/*
 	 * Now compute the number of data pages fetched during the scan.
--
2.37.0
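[Editor's note] The tree-height heuristic from this first version of the patch (later dropped in favour of charging per estimated entry page, as discussed in the replies above) boils down to the following. This is a Python transcription for illustration only; the fanout factor is the average number of entries per entry page:

```python
import math

def gin_tree_height(num_entries, num_entry_pages):
    # v1 patch heuristic: fanout = numEntries / numEntryPages,
    # tree height = ceil(log_fanout(numEntries))
    fanout = num_entries / num_entry_pages
    return math.ceil(math.log(num_entries) / math.log(fanout))

# 500,000 entries spread over 5,000 entry pages -> fanout 100 -> height 3
assert gin_tree_height(500_000, 5_000) == 3
```

Tom's alternative suggestion of a hard-coded fanout of 100 would simply replace the first line of the function with a constant, which sidesteps the question of whether the average-entries-per-page ratio is a valid fanout estimate at heights greater than 1.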
Re: Support for grabbing multiple consecutive values with nextval()
> It is Friday here, so I would easily miss something.. It is possible > to use COPY FROM with a list of columns, so assuming that you could > use a default expression with nextval() or just a SERIAL column not > listed in the COPY FROM query to do the job, what do we gain with this > feature? In which aspect does the preallocation of a range handled > on the client side after being allocated in the backend make things > better? The problem the author wants to solve is the fact they don't have a way of returning the ids when using COPY FROM. Pre-allocating them and assigning them to the individual records before sending them via COPY FROM would solve that for them. -- Ronan Dunklau
Re: Support for grabbing multiple consecutive values with nextval()
Hello,

Reading the thread, I think the feature has value: it would basically transfer control of the sequence cache to the client application.

However, I don't think that returning only the last value is a sensible thing to do. The client will need to know the details of the sequence to do anything useful with it, especially its increment, minvalue, maxvalue and cycle options. As suggested upthread, returning a resultset would probably be better.

If the client application is concerned about the volume of data exchanged with the server, and is willing to deal with handling the knowledge of the sequence details themselves, they can always wrap it in an aggregate:

SELECT min(newvals), max(newvals) FROM nextvals() as newvals

Regards,

--
Ronan Dunklau
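[Editor's note] To make the objection concrete: suppose the hypothetical bulk call returned only the last allocated value. Reconstructing the full block client-side then needs at least the sequence's increment, and is simply wrong for CYCLE sequences, where the allocated block may wrap around and stop being contiguous. All names below are illustrative, not a proposed API:

```python
def allocated_ids(last_value, count, increment=1):
    # Reconstruct the block of sequence values from only the last one,
    # as a client would have to do if the server returned a single value.
    # This deliberately ignores minvalue/maxvalue wraparound: for a CYCLE
    # sequence that wrapped mid-block, this reconstruction is incorrect,
    # which is exactly why returning a resultset is the safer interface.
    first = last_value - (count - 1) * increment
    return list(range(first, last_value + increment, increment))

# 5 values ending at 50, default increment:
assert allocated_ids(50, 5) == [46, 47, 48, 49, 50]
# 3 values ending at 100 with INCREMENT BY 10:
assert allocated_ids(100, 3, increment=10) == [80, 90, 100]
```

With a resultset-returning function, none of this client-side bookkeeping is needed and the CYCLE case is handled correctly for free.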
Re: Removing unneeded self joins
Le jeudi 30 juin 2022, 16:11:51 CEST Andrey Lepikhov a écrit :
> On 19/5/2022 16:47, Ronan Dunklau wrote:
> > I'll take a look at that one.
> New version of the patch, rebased on current master:
> 1. pgindent over the patch have passed.
> 2. number of changed files is reduced.
> 3. Some documentation and comments is added.

Hello Andrey,

Thanks for the updates. The general approach seems sensible to me, so I'm going to focus on some details.

In a very recent thread [1], Tom Lane is proposing to add infrastructure to make Vars aware of their nullability by outer joins. I wonder if that would help with avoiding the need for adding "is not null" clauses when the column is known not null? If we have a precedent for adding a BitmapSet to the Var itself, maybe the whole discussion regarding keeping track of nullability can be extended to the original column nullability?

Also, I saw it was mentioned earlier in the thread, but how difficult would it be to process the transformed quals through the EquivalenceClass machinery and the qual simplification? For example, if the target audience of this patch is ORMs, or inlined views, it wouldn't surprise me to see queries of this kind in the wild, where the redundant quals could be avoided altogether:

postgres=# explain (costs off) select * from sj s1 join sj s2 on s1.a = s2.a where s1.b = 2 and s2.b = 3;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on sj s2
   Filter: ((a IS NOT NULL) AND (b = 3) AND (b = 2))
(2 lignes)

+	for (counter = 0; counter < list_length(*sources);)
+	{
+		ListCell   *cell = list_nth_cell(*sources, counter);
+		RestrictInfo *rinfo = castNode(RestrictInfo, lfirst(cell));
+		int			counter1;
+
+		ec->ec_members = list_delete_cell(ec->ec_members, cell);

Why don't you use the foreach() and foreach_delete_current() macros for iterating over and removing items in the lists, both in update_ec_members and update_ec_sources?
+	if ((bms_is_member(k, info->syn_lefthand) ^
+		 bms_is_member(r, info->syn_lefthand)) ||
+		(bms_is_member(k, info->syn_righthand) ^
+		 bms_is_member(r, info->syn_righthand)))

I think this is more compact and easier to follow than the previous version, but I'm not sure how common that kind of construction is in the postgres source code?

Some review of the comments:

I see you keep using the terms "the keeping relation" and "the removing relation" in reference to the relation that is kept and the one that is removed. Aside from the grammar (the kept relation or the removed relation), maybe it would make it clearer to call them something else. In other parts of the code, you used "the remaining relation / the removed relation", which makes sense.

 /*
  * Remove the target relid from the planner's data structures, having
- * determined that there is no need to include it in the query.
+ * determined that there is no need to include it in the query. Or replace
+ * with another relid.
+ * To reusability, this routine can work in two modes: delete relid from a plan
+ * or replace it. It is used in replace mode in a self-join removing process.

This could be rephrased: ", optionally replacing it with another relid. The latter is used by the self-join removing process."

[1] https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us

--
Ronan Dunklau
Re: Removing unneeded self joins
Le jeudi 19 mai 2022, 12:48:18 CEST Andrey Lepikhov a écrit : > On 5/17/22 19:14, Ronan Dunklau wrote: > > Le vendredi 13 mai 2022, 07:07:47 CEST Andrey Lepikhov a écrit : > >> New version of the feature. > >> Here a minor bug with RowMarks is fixed. A degenerated case is fixed, > >> when uniqueness of an inner deduced not from join quals, but from a > >> baserestrictinfo clauses 'x=const', where x - unique field. > >> Code, dedicated to solve second issue is controversial, so i attached > >> delta.txt for quick observation. > >> Maybe we should return to previous version of code, when we didn't split > >> restriction list into join quals and base quals? > > > > Hello, > > > > I tried to find problematic cases, which would make the planning time grow > > unacceptably, and couldn't devise it. > > > > The worst case scenario I could think of was as follows: > > - a query with many different self joins > > - an abundance of unique indexes on combinations of this table columns > > to > > > > consider > > > > - additional predicates on the where clause on columns. > > Looking into the patch I can imagine, that the most difficult case is > when a set of relations with the same OID is huge, but only small part > of them (or nothing) can be removed. > Also, removing a clause from restrictinfo list or from equivalence class > adds non-linear complexity. So, you can dig this way ). > > > The base table I used for this was a table with 40 integers. 39 unique > > indexes were defined on every combination of (c1, cX) with cX being > > columns c2 to c40. > > > > I turned geqo off, set from_collapse_limit and join_collapse_limit to > > unreasonably high values (30), and tried to run queries of the form: > > > > SELECT * FROM test_table t1 > > JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX > > ... > > JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX. > > > > So no self join can be eliminated in that case. 
> > I think, you should compare t1.cX with tX.cX to eliminate self-join. > Cross-unique-index proof isn't supported now. Yes, that's the point. I wanted to try to introduce as much complexity as I could, without actually performing any self join elimination. The idea was to try to come up with the worst case scenario. > > > The performance was very similar with or without the GUC enabled. I tested > > the same thing without the patch, since the test for uniqueness has been > > slightly altered and incurs a new allocation, but it doesn't seem to > > change. > > > > One interesting side effect of this patch, is that removing any unneeded > > self join cuts down the planification time very significantly, as we > > lower the number of combinations to consider. > > Even more - removing a join we improve cardinality estimation. > > > As for the code: > > - Comments on relation_has_unique_index_ext and > > relation_has_unique_index_for> > > should be rewritten, as relation_has_unique_index_for is now just a > > special > > case of relation_has_unique_index_ext. By the way, the comment would > > probably be better read as: "but if extra_clauses isn't NULL". > > > > - The whole thing about "extra_clauses", ie, baserestrictinfos which > > were > > > > used to determine uniqueness, is not very clear. Most functions where the > > new argument has been added have not seen an update in their comments, > > and the name itself doesn't really convey the intented meaning: perhaps > > required_non_join_clauses ? > > > > The way this works should be explained a bit more thoroughly, for example > > in remove_self_joins_one_group the purpose of uclauses should be > > explained. The fact that degenerate_case returns true when we don't have > > any additional base restrict info is also confusing, as well as the > > degenerate_case name. > Agree, > but after this case thoughts wander in my head: should we make one step > back to pre-[1] approach? 
It looks like we have quite similar changes, > but without special function for a 'degenerate case' detection and > restrictlist splitting. I'll take a look at that one. > > > I'll update if I think of more interesting things to add. > > Thank you for your efforts! > > See in attachment next version which fixes mistakes detected by > z...@yugabyte.com. > > [1] > https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW > 5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com -- Ronan Dunklau
Re: Limiting memory allocation
Le mercredi 18 mai 2022, 16:23:34 CEST Jan Wieck a écrit :
> On 5/17/22 18:30, Stephen Frost wrote:
> > Greetings,
> >
> > On Tue, May 17, 2022 at 18:12 Tom Lane wrote:
> >     Jan Wieck writes:
> >     > On 5/17/22 15:42, Stephen Frost wrote:
> >     >> Thoughts?
> >     >
> >     > Using cgroups one can actually force a certain process (or user, or
> >     > service) to use swap if and when that service is using more memory than
> >     > it was "expected" to use.
> >
> >     I wonder if we shouldn't just provide documentation pointing to OS-level
> >     facilities like that one. The kernel has a pretty trivial way to check
> >     the total memory used by a process. We don't: it'd require tracking total
> >     space used in all our memory contexts, and then extracting some number out
> >     of our rear ends for allocations made directly from malloc. In short,
> >     anything we do here will be slow and unreliable, unless you want to depend
> >     on platform-specific things like looking at /proc/self/maps.
> >
> > This isn't actually a solution though and that's the problem- you end up
> > using swap but if you use more than "expected" the OOM killer comes in
> > and happily blows you up anyway. Cgroups are containers and exactly what
> > kube is doing.
>
> Maybe I'm missing something, but what is it that you would actually
> consider a solution? Knowing your current memory consumption doesn't
> make the need for allocating some right now go away. What do you
> envision the response of PostgreSQL to be if we had that information
> about resource pressure? I don't see us using mallopt(3) or
> malloc_trim(3) anywhere in the code, so I don't think any of our
> processes give back unused heap at this point (please correct me if I'm
> wrong). This means that even if we knew about the memory pressure of the
> system, adjusting things like work_mem on the fly may not do much at
> all, unless there is a constant turnover of backends.

I'm not sure I understand your point: when we free() a pointer, malloc is allowed to release the corresponding memory to the kernel. In the case of glibc, it doesn't necessarily do so, but it trims the top of the heap if it is in excess of M_TRIM_THRESHOLD. In the default glibc configuration, this parameter is dynamically adjusted by malloc itself, to a maximum value of 64MB IIRC. So any memory freed at the top of the heap totalling more than that threshold ends up actually freed.

In another thread, I proposed to take control over this tuning instead of letting malloc do it itself, as we may have better knowledge of the memory allocation patterns than what malloc empirically discovers: in particular, we could lower work_mem, adjust the threshold, and maybe even call malloc_trim ourselves when work_mem is lowered, to reduce the padding we may keep.

> So what do you propose PostgreSQL's response to high memory pressure to be?
>
> Regards, Jan

--
Ronan Dunklau
Re: Removing unneeded self joins
Le vendredi 13 mai 2022, 07:07:47 CEST Andrey Lepikhov a écrit :
> New version of the feature.
> Here a minor bug with RowMarks is fixed. A degenerated case is fixed,
> when uniqueness of an inner deduced not from join quals, but from a
> baserestrictinfo clauses 'x=const', where x - unique field.
> Code, dedicated to solve second issue is controversial, so i attached
> delta.txt for quick observation.
> Maybe we should return to previous version of code, when we didn't split
> restriction list into join quals and base quals?

Hello,

I tried to find problematic cases which would make the planning time grow unacceptably, and couldn't devise any.

The worst-case scenario I could think of was as follows:
- a query with many different self joins
- an abundance of unique indexes on combinations of this table's columns to consider
- additional predicates in the WHERE clause on columns.

The base table I used for this was a table with 40 integers. 39 unique indexes were defined on every combination of (c1, cX), with cX being columns c2 to c40.

I turned geqo off, set from_collapse_limit and join_collapse_limit to unreasonably high values (30), and tried to run queries of the form:

SELECT * FROM test_table t1
JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX
...
JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX

So no self join can be eliminated in that case.

The performance was very similar with or without the GUC enabled. I tested the same thing without the patch, since the test for uniqueness has been slightly altered and incurs a new allocation, but it doesn't seem to change anything.

One interesting side effect of this patch is that removing any unneeded self join cuts down the planning time very significantly, as we lower the number of combinations to consider.
As for the code:
- Comments on relation_has_unique_index_ext and relation_has_unique_index_for should be rewritten, as relation_has_unique_index_for is now just a special case of relation_has_unique_index_ext. By the way, the comment would probably read better as: "but if extra_clauses isn't NULL".
- The whole thing about "extra_clauses", i.e., baserestrictinfos which were used to determine uniqueness, is not very clear. Most functions where the new argument has been added have not seen an update in their comments, and the name itself doesn't really convey the intended meaning: perhaps required_non_join_clauses? The way this works should be explained a bit more thoroughly, for example in remove_self_joins_one_group the purpose of uclauses should be explained. The fact that degenerate_case returns true when we don't have any additional base restrict info is also confusing, as is the degenerate_case name.
I'll update if I think of more interesting things to add. -- Ronan Dunklau
Re: ORDER BY pushdowns seem broken in postgres_fdw
Le jeudi 31 mars 2022, 01:41:37 CEST Tom Lane a écrit : > I looked through this patch. It's going in the right direction, > but I have a couple of nitpicks: Thank you Tom for taking a look at this. > I think that instead of doubling down on a wrong API, we should just > take that out and move the logic into postgres_fdw.c. This also has > the advantage of producing a patch that's much safer to backpatch, > because it doesn't rely on the core backend getting updated before > postgres_fdw.so is. It makes total sense. > If no objections, I think this is committable. No objections on my end. -- Ronan Dunklau
Re: Proposal: More structured logging
Le jeudi 27 janvier 2022, 08:15:01 CET Michael Paquier a écrit : > On Tue, Jan 18, 2022 at 06:46:03AM +0100, Ronan Dunklau wrote: > > Hum, there was a missing import in csvlog.c from the fix above. Sorry > > about > > that. > > + > > > You are also forgetting that the table listing all the jsonlog fields > needs a refresh with this new key called "tags", and that it has a > JSON object underneath. Done. > > If you want to test all the formats supported by logging_destination, > wouldn't this stuff be better if tested through TAP with > logging_destination set with csvlog, jsonlog and stderr? This is > lacking coverage for csvlog.c and jsonlog.c, paths touched by your > patch. Done, I also added coverage for log_verbosity = verbose while I was at it, and fixed a bug I introduced in csvlog while rebasing... -- Ronan Dunklau>From 8071760e2a826b4ec88d2580c7bfc76e9bd2ff6e Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v7 1/4] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 7402696986..4fb4c67c3f 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -460,6 +460,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -511,6 +512,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -616,7 +618,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); 
+ /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1187,6 +1200,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) +{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 3eb8de3966..615fae47ef 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include <setjmp.h> +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...)
pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error context callback functions, not * within an ereport() invocation. The callback function can be in a different @@ -395,11 +398,18 @@ typedef struct ErrorData int internalpos; /* cursor index into internalquery */ char *internalquery; /* text of internally-generated query */ int saved_errno; /* errno at entry */ + List *tags; /* List of error tags */ /* context containing associated non-constant strings */ struct MemoryContextData *assoc_context; } ErrorData; +typedef struct ErrorTag +{ + const char *tagname; + char *tagvalue; +} ErrorTag; + extern void EmitErrorReport(void); extern ErrorData *CopyErrorData(void); extern void FreeErrorData(ErrorData *edata); -- 2.35.0 >From 872770026fd90
Re: Use generation context to speed up tuplesorts
Le jeudi 20 janvier 2022, 08:36:46 CET Ronan Dunklau a écrit : > Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud wrote: > > > I'm also a bit confused about which patch(es) should actually be > > > reviewed > > > in that thread. My understanding is that the original patch might not > > > be > > > relevant anymore but I might be wrong. The CF entry should probably be > > > updated to clarify that, with an annotation/ title change / author > > > update > > > or something. > > > > > > In the meantime I will switch the entry to Waiting on Author. > > > > I think what's going on here is a bit confusing. There's quite a few > > patches on here that aim to do very different things. > > > > The patch I originally proposed was just to make tuplesort use > > generation memory contexts. This helps speed up tuplesort because > > generation contexts allow palloc() to be faster than the standard > > allocator. Additionally, there's no power-of-2 wastage with generation > > contexts like there is with the standard allocator. This helps fit > > more tuples on fewer CPU cache lines. > > > > I believe Andres then mentioned that the fixed block size for > > generation contexts might become a problem. With the standard > > allocator the actual size of the underlying malloc() grows up until a > > maximum size. With generation contexts this is always the same, so we > > could end up with a very large number of blocks which means lots of > > small mallocs which could end up slow. Tomas then posted a series of > > patches to address this. > > > > I then posted another patch that has the planner make a choice on the > > size of the blocks to use for the generation context based on the > > tuple width and number of tuple estimates. My hope there was to > > improve performance further by making a better choice for how large to > > make the blocks in the first place. 
This reduces the need for Tomas' > > patch, but does not eliminate the need for it. > > > > As of now, I still believe we'll need Tomas' patches to allow the > > block size to grow up to a maximum size. I think those patches are > > likely needed before we think about making tuplesort use generation > > contexts. The reason I believe this is that we don't always have good > > estimates for the number of tuples we're going to sort. nodeSort.c is > > fairly easy as it just fetches tuples once and then sorts them. The > > use of tuplesort for nodeIncrementalSort.c is much more complex as > > we'll likely be performing a series of smaller sorts which are much > > harder to get size estimates for. This means there's likely no magic > > block size that we can use for the generation context that's used to > > store the tuples in tuplesort. This is also the case for order by > > aggregates and probably most other usages of tuplesort. > > You left out the discussion I started about glibc's malloc-specific > behaviour. > > I tried to demonstrate that a big part of the performance gain you were > seeing was not in fact due to our allocator by itself, but to the way > allocations of different block sizes interact with this malloc implementation > and how it handles releasing memory back to the system. I also tried to > show that we can give DBAs more control over how much memory to "waste" as > the price for faster allocations. > > I agree that the generation context memory manager is useful in the > tuplesort context, if only for the fact that we fall back to disk less > quickly due to the reduced wastage of memory, but I would be wary of > drawing conclusions too quickly based on a specific malloc implementation > which shows threshold effects, which are compounded by our growing block > sizes code in general.
> > I have on my TODO list to run the same kind of benchmarks using jemalloc, to > better isolate the performance gains expected from the generation allocator > itself from the side effect of avoiding glibc's pattern of releasing memory > back to the kernel too quickly. > I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and jemalloc, with the standard aset memory manager or with David's generation v2 patch. To use jemalloc, I just use an LD_PRELOAD env variable before starting postgres. I have not tried to measure jemalloc's memory footprint like I did for glibc's malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a replacement for glibc's malloc. Please find the results attached. As I suspected, most of the gain observed with Davi
Re: Use generation context to speed up tuplesorts
Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud wrote: > > I'm also a bit confused about which patch(es) should actually be reviewed > > in that thread. My understanding is that the original patch might not be > > relevant anymore but I might be wrong. The CF entry should probably be > > updated to clarify that, with an annotation/ title change / author update > > or something. > > > > In the meantime I will switch the entry to Waiting on Author. > > I think what's going on here is a bit confusing. There's quite a few > patches on here that aim to do very different things. > > The patch I originally proposed was just to make tuplesort use > generation memory contexts. This helps speed up tuplesort because > generation contexts allow palloc() to be faster than the standard > allocator. Additionally, there's no power-of-2 wastage with generation > contexts like there is with the standard allocator. This helps fit > more tuples on fewer CPU cache lines. > > I believe Andres then mentioned that the fixed block size for > generation contexts might become a problem. With the standard > allocator the actual size of the underlying malloc() grows up until a > maximum size. With generation contexts this is always the same, so we > could end up with a very large number of blocks which means lots of > small mallocs which could end up slow. Tomas then posted a series of > patches to address this. > > I then posted another patch that has the planner make a choice on the > size of the blocks to use for the generation context based on the > tuple width and number of tuple estimates. My hope there was to > improve performance further by making a better choice for how large to > make the blocks in the first place. This reduces the need for Tomas' > patch, but does not eliminate the need for it. > > As of now, I still believe we'll need Tomas' patches to allow the > block size to grow up to a maximum size. 
I think those patches are > likely needed before we think about making tuplesort use generation > contexts. The reason I believe this is that we don't always have good > estimates for the number of tuples we're going to sort. nodeSort.c is > fairly easy as it just fetches tuples once and then sorts them. The > use of tuplesort for nodeIncrementalSort.c is much more complex as > we'll likely be performing a series of smaller sorts which are much > harder to get size estimates for. This means there's likely no magic > block size that we can use for the generation context that's used to > store the tuples in tuplesort. This is also the case for order by > aggregates and probably most other usages of tuplesort. > You left out the discussion I started about glibc's malloc-specific behaviour. I tried to demonstrate that a big part of the performance gain you were seeing was not in fact due to our allocator by itself, but to the way allocations of different block sizes interact with this malloc implementation and how it handles releasing memory back to the system. I also tried to show that we can give DBAs more control over how much memory to "waste" as the price for faster allocations. I agree that the generation context memory manager is useful in the tuplesort context, if only for the fact that we fall back to disk less quickly due to the reduced wastage of memory, but I would be wary of drawing conclusions too quickly based on a specific malloc implementation which shows threshold effects, which are compounded by our growing block sizes code in general. I have on my TODO list to run the same kind of benchmarks using jemalloc, to better isolate the performance gains expected from the generation allocator itself from the side effect of avoiding glibc's pattern of releasing memory back to the kernel too quickly. Julien, sorry for the confusion of adding that discussion and those patches to the same thread, but I don't really see a way of dissociating those two aspects.
-- Ronan Dunklau
Re: Proposal: More structured logging
Le lundi 17 janvier 2022, 09:18:04 CET Ronan Dunklau a écrit : > Le samedi 15 janvier 2022, 07:09:59 CET Julien Rouhaud a écrit : > > Hi, > > > > On Tue, Jan 11, 2022 at 11:05:26AM +0100, Ronan Dunklau wrote: > > > Done, and I added another commit per your suggestion to add this comment. > > > > The cfbot reports that the patchset doesn't apply anymore: > > > > http://cfbot.cputube.org/patch_36_3293.log > > === Applying patches on top of PostgreSQL commit ID > > 43c2175121c829c8591fc5117b725f1f22bfb670 === [...] > > === applying patch ./v4-0003-Output-error-tags-in-CSV-logs.patch > > [...] > > patching file src/backend/utils/error/elog.c > > Hunk #2 FAILED at 3014. > > 1 out of 2 hunks FAILED -- saving rejects to file > > src/backend/utils/error/elog.c.rej > > > > Could you send a rebased version? In the meantime I will switch the cf > > entry to Waiting on Author. > > Thank you for this notification! > > The csvlog has been refactored since I wrote the patch, and the new jsonlog > destination has been introduced. I rebased to fix the first patch, and added > a new patch to add support for tags in jsonlog output. > > Best regards, Hum, there was a missing import in csvlog.c from the fix above. Sorry about that.
-- Ronan Dunklau>From 8071760e2a826b4ec88d2580c7bfc76e9bd2ff6e Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v6 1/5] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 7402696986..4fb4c67c3f 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -460,6 +460,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -511,6 +512,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -616,7 +618,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1187,6 +1200,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 3eb8de3966..615fae47ef 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include <setjmp.h> +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...) pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error context callback functions, not * within an ereport() invocation. The callback function can be in a different @@ -395,11 +398,18 @@ typedef struct ErrorData int internalpos; /* cursor index into internalquery */ char *internalquery; /* text o
Re: Proposal: More structured logging
Le samedi 15 janvier 2022, 07:09:59 CET Julien Rouhaud a écrit : > Hi, > > On Tue, Jan 11, 2022 at 11:05:26AM +0100, Ronan Dunklau wrote: > > Done, and I added another commit per your suggestion to add this comment. > > The cfbot reports that the patchset doesn't apply anymore: > > http://cfbot.cputube.org/patch_36_3293.log > === Applying patches on top of PostgreSQL commit ID > 43c2175121c829c8591fc5117b725f1f22bfb670 === [...] > === applying patch ./v4-0003-Output-error-tags-in-CSV-logs.patch > [...] > patching file src/backend/utils/error/elog.c > Hunk #2 FAILED at 3014. > 1 out of 2 hunks FAILED -- saving rejects to file > src/backend/utils/error/elog.c.rej > > Could you send a rebased version? In the meantime I will switch the cf > entry to Waiting on Author. Thank you for this notification! The csvlog has been refactored since I wrote the patch, and the new jsonlog destination has been introduced. I rebased to fix the first patch, and added a new patch to add support for tags in jsonlog output.
Best regards, -- Ronan Dunklau>From 8071760e2a826b4ec88d2580c7bfc76e9bd2ff6e Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v5 1/5] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 7402696986..4fb4c67c3f 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -460,6 +460,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -511,6 +512,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -616,7 +618,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1187,6 +1200,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 3eb8de3966..615fae47ef 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include <setjmp.h> +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...) pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error context callback functions, not * within an ereport() invocation. The callback function can be in a different @@ -395,11 +398,18 @@ typedef struct ErrorData int internalpos; /* cursor index into internalquery */ char *internalquery; /* text of internally-generated query */ int saved_errno; /* errno at entry */ + List *tags; /* List of error tags */ /* context containing associated non-constant strings */ struct MemoryContextData *assoc_context; } ErrorData; +typedef struct ErrorTag +{ + const char *tagname
Re: Proposal: More structured logging
Le mercredi 29 décembre 2021, 14:59:16 CET Justin Pryzby a écrit : > > Subject: [PATCH v3 2/3] Add test module for the new tag functionality. > > ... > > > +test_logging(PG_FUNCTION_ARGS) > > +{ > > ... > > > +(errmsg("%s", message), > > + ({ > > + forboth(lk, keys, lv, values) > > + { > > + (errtag(lfirst(lk), "%s", (char *) lfirst(lv))); > > + }}) > > + )); > > The windows build fails with that syntax. > http://cfbot.cputube.org/ronan-dunklau.html > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.157923 Thank you. I switched to an explicit sequence of errstart / errfinish to avoid putting too many things in nested macro calls. As I don't have any Windows knowledge, I am very grateful for the new cirrus-ci integration which allowed me to build on Windows without hassle. > > Subject: [PATCH v3 3/3] Output error tags in CSV logs > > +++ b/doc/src/sgml/config.sgml > > @@ -7370,6 +7371,7 @@ CREATE TABLE postgres_log > > > >backend_type text, > >leader_pid integer, > >query_id bigint, > > > > + tags jsonb > > > >PRIMARY KEY (session_id, session_line_num) > > > > ); > > > > That's invalid sql due to missing a trailing ",". Thanks, fixed. > > You should also update file-fdw.sgml - which I only think of since we forgot > to update it before e568ed0eb and 0830d21f5. config.sgml should have a > comment as a reminder to do that. Done, and I added another commit per your suggestion to add this comment. Thank you for this review.
Regards, -- Ronan Dunklau>From 487c34465b34ef6ef8cb466247cbaaa8cf4bc534 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 11 Jan 2022 10:16:53 +0100 Subject: [PATCH v4 4/4] Add comment in config.sgml as a reminder to also update file_fdw.sgml --- doc/src/sgml/config.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4df79fcbcc..d4c20f40ac 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7375,7 +7375,7 @@ log_line_prefix = '%m [%p] %q%u@%d/%a ' query id, and optional tags added by the logger as JSON. Here is a sample table definition for storing CSV-format log output: - + CREATE TABLE postgres_log ( -- 2.34.1 >From a1c4ae874bf156a05da436838d6b2b73f6621905 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:32 +0200 Subject: [PATCH v4 3/4] Output error tags in CSV logs --- doc/src/sgml/config.sgml | 4 +++- doc/src/sgml/file-fdw.sgml | 1 + src/backend/utils/error/elog.c | 25 - 3 files changed, 28 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index afbb6c35e3..4df79fcbcc 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7372,7 +7372,8 @@ log_line_prefix = '%m [%p] %q%u@%d/%a ' location of the error in the PostgreSQL source code (if log_error_verbosity is set to verbose), application name, backend type, process ID of parallel group leader, -and query id. +query id, +and optional tags added by the logger as JSON. 
Here is a sample table definition for storing CSV-format log output: @@ -7404,6 +7405,7 @@ CREATE TABLE postgres_log backend_type text, leader_pid integer, query_id bigint, + tags jsonb, PRIMARY KEY (session_id, session_line_num) ); diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml index 5b98782064..ccb4e9d8dd 100644 --- a/doc/src/sgml/file-fdw.sgml +++ b/doc/src/sgml/file-fdw.sgml @@ -268,6 +268,7 @@ CREATE FOREIGN TABLE pglog ( backend_type text, leader_pid integer, query_id bigint + tags jsonb, ) SERVER pglog OPTIONS ( filename 'log/pglog.csv', format 'csv' ); diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index d43e1c2c31..1e6c7222c3 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -80,6 +80,7 @@ #include "storage/proc.h" #include "tcop/tcopprot.h" #include "utils/guc.h" +#include "utils/json.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -3013,10 +3014,32 @@ write_csvlog(ErrorData *edata) appendStringInfo(&buf, "%d", leader->pid); } appendStringInfoChar(&buf, ','); - /* query id */ appendStringInfo(&buf, "%lld", (long long) pgstat_get_my_query_id()); + appendStringInfoChar(&buf, ','); + if (edata->tags != NIL) + { + StringInfoData tagbuf; + ListCell *lc; + bool first = true; + initStringInfo(&tagbuf); + appendStringInfoChar(&tagbuf, '{'); + foreach(lc, edata
Matching domains-over-enums to anyenum types
Hello,

A colleague of mine was surprised to discover the following statements raised an error:

postgres=# CREATE TYPE abc_enum AS ENUM ('a', 'b', 'c');
CREATE TYPE
postgres=# CREATE DOMAIN abc_domain AS abc_enum;
CREATE DOMAIN
postgres=# SELECT 'a'::abc_domain = 'a'::abc_domain;
ERROR:  operator does not exist: abc_domain = abc_domain
LINE 1: SELECT 'a'::abc_domain = 'a'::abc_domain;
                               ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

This has already been discussed a long time ago, and the idea was rejected at the time since there was no demand for it: https://www.postgresql.org/message-id/flat/BANLkTi%3DaGxDbGPSF043V2K-C2vF2YzGz9w%40mail.gmail.com#da4826d2cbbaca20e3440aadb3093158

Given that we implemented that behaviour for domains over ranges and multiranges, I don't see the harm in doing the same for domains over enums. What do you think?

-- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le vendredi 7 janvier 2022, 13:03:28 CET Tomas Vondra a écrit : > On 1/7/22 12:03, Ronan Dunklau wrote: > > Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : > >> I've attached some benchmark results that I took recently. The > >> spreadsheet contains results from 3 versions. master, master + 0001 - > >> 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit > >> more conservative about the chunk sizes it allocates and also tries to > >> allocate the tuple array according to the number of tuples we expect > >> to be able to sort in a single batch for when the sort is not > >> estimated to fit inside work_mem. > > > > (Sorry for trying to merge back the discussion on the two sides of the > > thread) > > > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, > > I expressed the idea of being able to tune glibc's malloc behaviour. > > > > I implemented that (patch 0001) to provide a new hook which is called on > > backend startup, and anytime we set work_mem. This hook is # defined > > depending on the malloc implementation: currently a default, no-op > > implementation is provided as well as a glibc's malloc implementation. > > Not sure I'd call this a hook - that usually means a way to plug-in > custom code through a callback, and this is simply ifdefing a block of > code to pick the right implementation. Which may be a good way to do > that, just let's not call that a hook. > > There's a commented-out MallocTuneHook() call, probably not needed. Ok, I'll clean that up if we decide to proceed with this. > > I wonder if #ifdefing is sufficient solution, because it happens at > compile time, so what if someone overrides the allocator in LD_PRELOAD? > That was a fairly common way to use a custom allocator in an existing > application. But I don't know how many people do that with Postgres (I'm > not aware of anyone doing that) or if we support that (it'd probably > apply to other stuff too, not just malloc). 
So maybe it's OK, and I > can't think of a better way anyway. I couldn't think of a better way either; maybe there is something to be done with trying to dlsym something specific to glibc's malloc implementation? > > > The glibc's malloc implementation relies on a new GUC, > > glibc_malloc_max_trim_threshold. When set to its default value of -1, we > > don't tune malloc at all, exactly as in HEAD. If a different value is > > provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD > > to this value, capped by work_mem / 2 and work_mem respectively. > > > > The net result is that we can then allow to keep more unused memory at the > > top of the heap, and to use mmap less frequently, if the DBA chooses to. > > A possible other use case would be, on the contrary, to limit the > > allocated memory in idle backends to a minimum. > > > > The reasoning behind this is that glibc's malloc default way of handling > > those two thresholds is to adapt to the size of the last freed mmapped > > block. > > > > I've run the same "up to 32 columns" benchmark as you did, with this new > > patch applied on top of both HEAD and your v2 patchset incorporating > > planner estimates for the block sizes. Those are called "aset" and > > "generation" in the attached spreadsheet. For each, I've run it with > > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case > > > > I've measured two things: > > - query latency, as reported by pgbench > > - total memory allocated by malloc at backend exit after running each > > query > > > > three times. This represents the "idle" memory consumption, and thus what > > we waste in malloc instead of releasing it back to the system. This > > measurement has been performed using the very small module presented in > > patch 0002. Please note that I in no way propose that we include this > > module, it was just a convenient way for me to measure memory footprint.
> > > > My conclusion is that the impressive gains you see from using the > > generation context with bigger blocks mostly comes from the fact that we > > allocate bigger blocks, and that this moves the mmap thresholds > > accordingly. I wonder how much of a difference it would make on other > > malloc implementation: I'm afraid the optimisation presented here would > > in fact be specific to glibc's malloc, since we have almost the same > > gains with both allocators when tuning malloc to keep more memory. I > > still think both approaches are useful, and would be nec
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 14:39:10 CET Tomas Vondra a écrit : > I wasn't really suggesting to investigate those other allocators in this > patch - it seems like a task requiring a pretty significant amount of > work/time. My point was that we should make it reasonably easy to add > tweaks for those other environments, if someone is interested enough to > do the legwork. > > >> 2) In fact, I wonder if different glibc versions behave differently? > >> Hopefully it's not changing that much, though. Ditto kernel versions, > >> but the mmap/sbrk interface is likely more stable. We can test this. > > > > That could be tested, yes. As a matter of fact, a commit removing the > > upper > > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to > > glibc, > > which means we can service much bigger allocation without mmap. > > Yeah, I noticed that commit too. Most systems stick to one glibc > version, so it'll take time to reach most systems. Let's continue with > just one glibc version and then maybe test other versions. Yes, I also need to figure out how to detect we're using glibc as I'm not very familiar with configure. > > >> 3) If we bump the thresholds, won't that work against reusing the > >> memory? I mean, if we free a whole block (from any allocator we have), > >> glibc might return it to kernel, depending on mmap threshold value. It's > >> not guaranteed, but increasing the malloc thresholds will make that even > >> less likely. So we might just as well increase the minimum block size, > >> with about the same effect, no? > > > > It is my understanding that malloc will try to compact memory by moving it > > around. So the memory should be actually be released to the kernel at some > > point. In the meantime, malloc can reuse it for our next invocation (which > > can be in a different memory context on our side). 
> > If we increase the minimum block size, this is memory we will actually > > reserve, and it will not protect us against the ramping-up behaviour: > > - the first allocation of a big block may be over mmap_threshold, and serviced > > by an expensive mmap > > - when it's freed, the threshold is doubled > > - the next invocation is serviced by an sbrk call > > - freeing it will be above the trim threshold, and it will be returned. > > > > After several "big" allocations, the thresholds will rise to their > > maximum > > values (well, it used to, I need to check what happens with that latest > > patch of glibc...) > > > > This will typically happen several times as malloc doubles the threshold > > each time. This is probably the reason quadrupling the block sizes was > > more effective. > > Hmmm, OK. Can we benchmark the case with large initial block size, at > least for comparison? The benchmark I called "fixed" was with a fixed block size of ALLOCSET_DEFAULT_MAXSIZE (first proposed patch) and showed roughly the same performance profile as the growing blocks + malloc tuning. But if I understand correctly, you implemented the growing blocks logic after concerns about wasting memory with a constant large block size. If we tune malloc, that memory would not be wasted if we don't allocate it, just not released as eagerly when it is allocated. Or do you want a benchmark with an even bigger initial block size ? With the growing blocks patch with a large initial size ? I can run either, I just want to understand what is interesting to you. One thing that would be interesting would be to trace the total amount of memory allocated in the different cases. This is something I will need to do anyway when I propose that patch. Best regards, -- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 09:08:06 CET Ronan Dunklau a écrit : > It is my understanding that malloc will try to compact memory by moving it > around. So the memory should be actually be released to the kernel at some > point. In the meantime, malloc can reuse it for our next invocation (which > can be in a different memory context on our side). I've been told off-list this comment wasn't clear: I meant that it compacts *free* memory, consolidating into larger blocks that will eventually reach the threshold, and be released. -- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit : > On 12/16/21 17:03, Ronan Dunklau wrote: > > Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : > >> I will follow up with a benchmark of the test sorting a table with a > >> width > >> varying from 1 to 32 columns. > > > > So please find attached another benchmark for that case. > > > > The 3 different patchsets tested are: > > - master > > - fixed (David's original patch) > > - adjust (Thomas growing blocks patch) > > Presumably Thomas is me, right? I'm really sorry for this typo... Please accept my apologies. > > > So it looks like tuning malloc for this would be very beneficial for any > > kind of allocation, and by doing so we reduce the problems seen with the > > growing blocks patch to next to nothing, while keeping the ability to not > > allocate too much memory from the get go. > > Thanks for running those tests and investigating the glibc behavior! I > find those results very interesting. My conclusion from this is that > the interaction between "our" allocator and the allocator in > malloc (e.g. glibc) can be problematic. Which makes benchmarking and > optimization somewhat tricky because code changes may trigger behavior > change in glibc (or whatever allocator backs malloc). > > I think it's worth exploring if we can tune this in a reasonable way, > but I have a couple concerns related to that: > > 1) I wonder how glibc-specific this is - I'd bet it applies to other > allocators (either on another OS or just different allocator on Linux) > too. Tweaking glibc parameters won't affect those systems, of course, > but maybe we should allow tweaking those systems too ... I agree, finding their specific problems and seeing if we can work around them would be interesting. I suppose glibc's malloc is the most commonly used allocator in production, as it is the default for most Linux distributions. > > 2) In fact, I wonder if different glibc versions behave differently? 
> Hopefully it's not changing that much, though. Ditto kernel versions, > but the mmap/sbrk interface is likely more stable. We can test this. That could be tested, yes. As a matter of fact, a commit removing the upper limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, which means we can service much bigger allocations without mmap. > > 3) If we bump the thresholds, won't that work against reusing the > memory? I mean, if we free a whole block (from any allocator we have), > glibc might return it to kernel, depending on mmap threshold value. It's > not guaranteed, but increasing the malloc thresholds will make that even > less likely. So we might just as well increase the minimum block size, > with about the same effect, no? It is my understanding that malloc will try to compact memory by moving it around. So the memory should actually be released to the kernel at some point. In the meantime, malloc can reuse it for our next invocation (which can be in a different memory context on our side). If we increase the minimum block size, this is memory we will actually reserve, and it will not protect us against the ramping-up behaviour:

- the first allocation of a big block may be over mmap_threshold, and serviced by an expensive mmap
- when it's freed, the threshold is doubled
- the next invocation is serviced by an sbrk call
- freeing it will be above the trim threshold, and it will be returned.

After several "big" allocations, the thresholds will rise to their maximum values (well, it used to, I need to check what happens with that latest patch of glibc...) This will typically happen several times as malloc doubles the threshold each time. This is probably the reason quadrupling the block sizes was more effective. > > > I would like to try to implement some dynamic glibc malloc tuning, if that > > is something we don't reject on principle from the get go. > > +1 to that Ok, I'll work on a patch for this and submit a new thread. -- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : > I will follow up with a benchmark of the test sorting a table with a width > varying from 1 to 32 columns. > So please find attached another benchmark for that case.

The 3 different patchsets tested are:
- master
- fixed (David's original patch)
- adjust (Thomas growing blocks patch)

So it looks like tuning malloc for this would be very beneficial for any kind of allocation, and by doing so we reduce the problems seen with the growing blocks patch to next to nothing, while keeping the ability to not allocate too much memory from the get go. I would like to try to implement some dynamic glibc malloc tuning, if that is something we don't reject on principle from the get go. -- Ronan Dunklau bench_results.ods Description: application/vnd.oasis.opendocument.spreadsheet
Re: Use generation context to speed up tuplesorts
Le mercredi 8 décembre 2021, 22:07:12 CET Tomas Vondra a écrit : > On 12/8/21 16:51, Ronan Dunklau wrote: > > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit : > >> And now comes the funny part - if I run it in the same backend as the > >> > >> "full" benchmark, I get roughly the same results: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ++---+--+- > >> > >>32768 |512 | 806256640 |37159 | 76669 > >> > >> but if I reconnect and run it in the new backend, I get this: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ++---+--+- > >> > >>32768 |512 | 806158336 | 233909 | 100785 > >> > >> (1 row) > >> > >> It does not matter if I wait a bit before running the query, if I run it > >> repeatedly, etc. The machine is not doing anything else, the CPU is set > >> to use "performance" governor, etc. > > > > I've reproduced the behaviour you mention. > > I also noticed asm_exc_page_fault showing up in the perf report in that > > case. > > > > Running an strace on it shows that in one case, we have a lot of brk > > calls, > > while when we run in the same process as the previous tests, we don't. > > > > My suspicion is that the previous workload makes glibc malloc change it's > > trim_threshold and possibly other dynamic options, which leads to > > constantly moving the brk pointer in one case and not the other. > > > > Running your fifo test with absurd malloc options shows that indeed that > > might be the case (I needed to change several, because changing one > > disable the dynamic adjustment for every single one of them, and malloc > > would fall back to using mmap and freeing it on each iteration): > > > > mallopt(M_TOP_PAD, 1024 * 1024 * 1024); > > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024); > > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long)); > > > > I get the following results for your self contained test. 
> > I ran the query twice, in each case, seeing the importance of the first allocation and the subsequent ones:
> >
> > With default malloc options:
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   300156 |  207557
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   211942 |   77207
> >
> > With the oversized values above:
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   219000 |   36223
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |    75761 |   78082
> > (1 row)
> >
> > I can't tell how representative your benchmark extension would be of real > > life allocation / free patterns, but there is probably something we can > > improve here. > Thanks for looking at this. I think those allocation / free patterns are > fairly extreme, and there probably are no workloads doing exactly this. > The idea is the actual workloads are likely some combination of these > extreme cases. > > > I'll try to see if I can understand more precisely what is happening. > > Thanks, that'd be helpful. Maybe we can learn something about tuning > malloc parameters to get significantly better performance. Apologies for the long email, maybe what I will state here is obvious for others but I learnt a lot, so... I think I understood what the problem was in your generation tests: depending on the sequence of allocations we will raise a different maximum for mmap_threshold and trim_threshold. When an mmap block is freed, malloc will raise its threshold as it considers memory freed to be better served by regular moving of the sbrk pointer, instead of using mmap to map memory. This threshold is upped by multiplying it by two anytime we free a mmap'ed block
Re: Use generation context to speed up tuplesorts
Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit :
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     806256640 |    37159 |   76669
>
> but if I reconnect and run it in the new backend, I get this:
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     806158336 |   233909 |  100785
> (1 row)
>
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention. I also noticed asm_exc_page_fault showing up in the perf report in that case. Running an strace on it shows that in one case, we have a lot of brk calls, while when we run in the same process as the previous tests, we don't. My suspicion is that the previous workload makes glibc malloc change its trim_threshold and possibly other dynamic options, which leads to constantly moving the brk pointer in one case and not the other. Running your fifo test with absurd malloc options shows that indeed that might be the case (I needed to change several, because changing one disables the dynamic adjustment for every single one of them, and malloc would fall back to using mmap and freeing it on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));

I get the following results for your self-contained test. 
I ran the query twice, in each case, seeing the importance of the first allocation and the subsequent ones:

With default malloc options:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   300156 |  207557

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   211942 |   77207

With the oversized values above:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   219000 |   36223

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |    75761 |   78082
(1 row)

I can't tell how representative your benchmark extension would be of real life allocation / free patterns, but there is probably something we can improve here. I'll try to see if I can understand more precisely what is happening. -- Ronan Dunklau
Re: Non-superuser subscription owners
Le lundi 6 décembre 2021, 16:56:56 CET Mark Dilger a écrit : > > On Dec 6, 2021, at 2:19 AM, Amit Kapila wrote: > >>> If we want to maintain the property that subscriptions can only be > >>> owned by superuser > > We don't want to maintain such a property, or at least, that's not what I want. I don't think that's what Jeff wants, either. That's not what I want either: the ability to run and refresh subscriptions as a non-superuser is a desirable feature. The REFRESH part was possible before PG 14, when it was allowed to run REFRESH in a function, which could be made to run as security definer.

> As I perceive the roadmap:
> 1) Fix the current bug wherein subscription changes are applied with superuser force after the subscription owner has superuser privileges revoked.
> 2) Allow the transfer of subscriptions to non-superuser owners.
> 3) Allow the creation of subscriptions by non-superusers who are members of some as yet to be created predefined role, say "pg_create_subscriptions"

This roadmap seems sensible. -- Ronan Dunklau
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Le jeudi 22 juillet 2021, 10:42:49 CET Ronan Dunklau a écrit : > Le jeudi 22 juillet 2021, 09:38:50 CEST David Rowley a écrit : > > On Thu, 22 Jul 2021 at 02:01, Ronan Dunklau wrote: > > > I tested the 0001 patch against both HEAD and my proposed bugfix for > > > postgres_fdw. > > > > > > There is a problem that the ordered aggregate is not pushed down > > > anymore. > > > The underlying Sort node is correctly pushed down though. > > > > > > This comes from the fact that postgres_fdw grouping path never contains > > > any > > > pathkey. Since the cost is fuzzily the same between the pushed-down > > > aggregate and the locally performed one, the tie is broken against the > > > pathkeys. > > > > I think this might be more down to a lack of any penalty cost for > > fetching foreign tuples. Looking at create_foreignscan_path(), I don't > > see anything that adds anything to account for fetching the tuples > > from the foreign server. If there was something like that then there'd > > be more of a preference to perform the remote aggregation so that > > fewer rows must arrive from the remote server. > > > > I tested by adding: total_cost += cpu_tuple_cost * rows * 100; in > > create_foreignscan_path() and I got the plan with the remote > > aggregation. That's a fairly large penalty of 1.0 per row. Much bigger > > than parallel_tuple_cost's default value. > > > > I'm a bit undecided on how much this patch needs to get involved in > > adjusting foreign scan costs. The problem is that we've given the > > executor a new path to consider and nobody has done any proper > > costings for the foreign scan so that it properly prefers paths that > > have to pull fewer foreign tuples. This is a pretty similar problem > > to what parallel_tuple_cost aims to fix. Also similar to how we had to > > add APPEND_CPU_COST_MULTIPLIER to have partition-wise aggregates > > prefer grouping at the partition level rather than at the partitioned > > table level. 
> > > > > Ideally we would add the group pathkeys to the grouping path, but this > > > would add an additional ORDER BY expression matching the GROUP BY. > > > Moreover, some triaging of the pathkeys would be necessary since we now > > > mix the sort-in- aggref pathkeys with the group_pathkeys. > > > > I think you're talking about passing pathkeys into > > create_foreign_upper_path in add_foreign_grouping_paths. If so, I > > don't really see how it would be safe to add pathkeys to the foreign > > grouping path. What if the foreign server did a Hash Aggregate? The > > rows might come back in any random order. > > Yes, I was suggesting to add a new path with the pathkeys factored in, which > if chosen over the non-ordered path would result in an additional ORDER BY > clause to prevent a HashAggregate. But that doesn't seem a good idea after > all. > > > I kinda think that to fix this properly would need a new foreign > > server option such as foreign_tuple_cost. I'd feel better about > > something like that with some of the people with a vested interest in > > the FDW code were watching more closely. So far we've not managed to > > entice any of them with the bug report yet, but it's maybe early days > > yet. > > We already have that in the form of fdw_tuple_cost as a server option if I'm > not mistaken ? It works as expected when the number of tuples is notably > reduced by the foreign group by. > > The problem arise when the cardinality of the groups is equal to the input's > cardinality. I think even in that case we should try to use a remote > aggregate since it's a computation that will not happen on the local > server. I also think we're more likely to have up to date statistics > remotely than the ones collected locally on the foreign tables, and the > estimated number of groups would be more accurate on the remote side than > the local one. I took some time to toy with this again. 
At first I thought that charging a discount in foreign grouping paths for Aggref targets (since they are computed remotely) would be a good idea, similar to what is done for the grouping keys. But in the end, it's probably not something we would like to do. Yes, the group planning will generally be more accurate on the remote side (better statistics than locally for estimating the number of groups), but executing the grouping locally when the number of groups is close to the input's cardinality (ex: group by unique_key) gives us a form of parallelism which can actually perform better. For the other cases where there are fewer output than input tuples, that is, when an actual grouping takes place, adjusting fdw_tuple_cost might be enough to tune the behavior to what is desirable. -- Ronan Dunklau
Re: pg_receivewal starting position
Le vendredi 29 octobre 2021, 04:27:51 CEST Michael Paquier a écrit : > On Thu, Oct 28, 2021 at 03:55:12PM +0200, Ronan Dunklau wrote: > > Interesting ideas, thanks. For the record, the time drops from ~4.5s to 3s > > on average on my machine. > > I think if you reduce the size of the generate_series batches, this should > > probably be reduced everywhere. With what we do though, inserting a single > > line should work just as well, I wonder why we insist on inserting a > > hundred lines ? I updated your patch with that small modification, it > > also makes the code less verbose. > > Thanks for the extra numbers. I have added your suggestions, > switching the dummy table to use a primary key with different values, > while on it, as there is an argument that it makes debugging easier, > and applied the speedup patch. Thanks ! > > >> +$standby->psql('', > >> + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", > >> + replication => 1); > >> Here as well we could use a restart point to reduce the number of > >> segments archived. > > > > The restart point should be very close, as we don't generate any activity > > on the primary between the backup and the slot's creation. I'm not sure > > adding the complexity of triggering a checkpoint on the primary and > > waiting for the standby to catch up on it would be that useful. > > Yes, you are right here. The base backup taken from the primary > at this point ensures a fresh point. > > +# This test is split in two, using the same standby: one test check the > +# resume-from-folder case, the other the resume-from-slot one. > This comment needs a refresh, as the resume-from-folder case is no > more. > Done. > +$standby->psql( > + 'postgres', > + "SELECT pg_promote(wait_seconds => 300)"); > This could be $standby->promote. > Oh, didn't know about that. > +# Switch wal to make sure it is not a partial file but a complete > segment. 
> +$primary->psql('postgres', 'INSERT INTO test_table VALUES (1);'); > +$primary->psql('postgres', 'SELECT pg_switch_wal();'); > +$primary->wait_for_catchup($standby, 'replay', $primary->lsn('write')); > This INSERT needs a slight change to adapt to the primary key of the > table. This one is on me :p Done. > > Anyway, is this first segment switch really necessary? From the data > archived by pg_receivewal in the command testing the TLI jump, we > finish with the following contents (contents generated after fixing > the three INSERTs): > 0001000B > 0001000C > 0002000D > 0002000E.partial > 0002.history > > So, even if we don't do the first switch, we'd still have one > completed segment on the previous timeline, before switching to the > new timeline and the next segment (pg_receivewal is a bit inconsistent > with the backend here, by the way, as the first segment on the new > timeline would map with the last segment of the old timeline, but here > we have a clean switch as of stop_streaming in pg_receivewal.c). The first completed segment on the previous timeline comes from the fact we stream from the restart point. I removed the switch to use the walfilename of the replication slot's restart point instead. This means querying both the standby (to get the replication slot's restart_lsn) and the primary (to have access to pg_walfile_name). We could use a single query on the primary (using the primary's checkpoint LSN instead) but it feels a bit convoluted just to avoid a query on the standby. -- Ronan Dunklau>From eb65a8b5883d903dda78bb66c2e1795b48cafc24 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 26 Oct 2021 10:54:12 +0200 Subject: [PATCH v17] Add a test for pg_receivewal following timeline switch. pg_receivewal is able to follow a timeline switch, but this was not tested anywher. 
This test case verifies that it works as expected both when resuming from a replication slot and from the archive directory, which have different methods of retrieving the timeline it left off at. --- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 57 +++- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 2da200396e..1fe32758a0 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 31; +use Test::More tests => 35; program_help_ok('pg_receivewal'); program_version_ok('pg_receivewal'); @@ -208,3 +208,58 @@ $primary->command_
Re: pg_receivewal starting position
Le jeudi 28 octobre 2021, 14:31:36 CEST Michael Paquier a écrit : > On Wed, Oct 27, 2021 at 10:11:00AM +0200, Ronan Dunklau wrote: > > Sorry I sent an intermediary version of the patch, here is the correct > > one. > > While looking at this patch, I have figured out a simple way to make > the tests faster without impacting the coverage. The size of the WAL > segments archived is a bit of a bottleneck as they need to be zeroed > by pg_receivewal at creation. This finishes by being a waste of time, > even if we don't flush the data. So I'd like to switch the test so as > we use a segment size of 1MB, first. > > A second thing is that we copy too many segments than really needed > when using the slot's restart_lsn as starting point as RESERVE_WAL > would use the current redo location, so it seems to me that a > checkpoint is in order before the slot creation. A third thing is > that generating some extra data after the end LSN we want to use makes > the test much faster at the end. > > With those three methods combined, the test goes down from 11s to 9s > here. Attached is a patch I'd like to apply to make the test > cheaper. Interesting ideas, thanks. For the record, the time drops from ~4.5s to 3s on average on my machine. I think if you reduce the size of the generate_series batches, this should probably be reduced everywhere. With what we do though, inserting a single line should work just as well, I wonder why we insist on inserting a hundred lines ? I updated your patch with that small modification, it also makes the code less verbose. > > I also had a look at your patch. Here are some comments. > > +# Cleanup the previous stream directories to reuse them > +unlink glob "'${stream_dir}/*'"; > +unlink glob "'${slot_dir}/*'"; > I think that we'd better choose a different location for the > archives. Keeping the segments of the previous tests is useful for > debugging if a previous step of the test failed. Ok. 
> > +$standby->psql('', > + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", > + replication => 1); > Here as well we could use a restart point to reduce the number of > segments archived. The restart point should be very close, as we don't generate any activity on the primary between the backup and the slot's creation. I'm not sure adding the complexity of triggering a checkpoint on the primary and waiting for the standby to catch up on it would be that useful. > > +# Now, try to resume after the promotion, from the folder. > +$standby->command_ok( > + [ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, > +'--slot', $folder_slot, '--no-sync'], > + "Stream some wal after promoting, resuming from the folder's position"); > What is called the "resume-from-folder" test in the patch is IMO > too costly and brings little extra value, requiring two commands of > pg_receivewal (one to populate the folder and one to test the TLI jump > from the previously-populated point) to test basically the same thing > as when the starting point is taken from the slot. Except that > restarting from the slot is twice cheaper. The point of the > resume-from-folder case is to make sure that we restart from the point > of the archive folder rather than the slot's restart_lsn, but your > test fails this part, in fact, because the first command populating > the archive folder also uses "--slot $folder_slot", updating the > slot's restart_lsn before the second pg_receivewal uses this slot > again. > > I think, as a whole, that testing for the case where an archive folder > is populated (without a slot!), followed by a second command where we > use a slot that has a restart_lsn older than the archive's location, > to not be that interesting. If we include such a test, there is no > need to include that within the TLI jump part, in my opinion. 
So I > think that we had better drop this part of the patch, and keep only > the case where we resume from a slot for the TLI jump. You're right about the test not being that interesting in that case. I thought it would be worthwhile to test both cases, but resume_from_folder doesn't actually exercise any code that was not already called before (timeline computation from segment name). > The commands of pg_receivewal included in the test had better use -n > so as there is no infinite loop on failure. Ok. -- Ronan Dunklau>From ad09605b6f8b2d1cc91e67a255bf469363cad2b3 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Thu, 28 Oct 2021 15:12:04 +0200 Subject: [PATCH v16 2/2] Speed up pg_receivewal tests. Author: Michael Paquier --- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 21 ++-- 1 file changed, 11 insertions(+)
Re: pg_receivewal starting position
Le mercredi 27 octobre 2021, 10:00:40 CEST Ronan Dunklau a écrit : > Le mercredi 27 octobre 2021, 04:17:28 CEST Kyotaro Horiguchi a écrit : > > +my @walfiles = glob "$slot_dir/*"; > > > > This is not used. > > Sorry, fixed in attached version. > > > Each pg_receivewal run stalls for about 10 or more seconds before > > finishing, which is not great from the standpoint of recently > > increasing test run time. > > > > Maybe we want to advance LSN a bit, after taking $nextlsn then pass > > "-s 1" to pg_receivewal. > > I incorrectly assumed it was due to the promotion time without looking into > it. In fact, you're right the LSN was not incremented after we fetched the > end lsn, and thus we would wait for quite a while. I fixed that too. > > Thank you for the review ! Sorry I sent an intermediary version of the patch, here is the correct one. -- Ronan Dunklau>From 8fa22687e50f02539119568f98b6a11c077f1774 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Tue, 26 Oct 2021 10:54:12 +0200 Subject: [PATCH v15] Add a test for pg_receivewal following timeline switch. pg_receivewal is able to follow a timeline switch, but this was not tested anywher. This test case verify that it works as expected both when resuming from a replication slot and from the archive directory, which have different methods of retrieving the timeline it left off. 
--- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 88 +++- 1 file changed, 87 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 092c9b6f25..f561aca8f8 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 31; +use Test::More tests => 39; program_help_ok('pg_receivewal'); program_version_ok('pg_receivewal'); @@ -206,3 +206,89 @@ $primary->command_ok( "WAL streamed from the slot's restart_lsn"); ok(-e "$slot_dir/$walfile_streamed", "WAL from the slot's restart_lsn has been archived"); + +# Test a timeline switch. +# This test is split in two, using the same standby: one test check the +# resume-from-folder case, the other the resume-from-slot one. + +# Setup a standby for our tests +my $backup_name = "basebackup"; +$primary->backup($backup_name); +my $standby = PostgreSQL::Test::Cluster->new("standby"); +$standby->init_from_backup($primary, $backup_name, has_streaming => 1); +$standby->start; + +# Cleanup the previous stream directories to reuse them +unlink glob "'${stream_dir}/*'"; +unlink glob "'${slot_dir}/*'"; + +# Create two replication slots. +# First one is to make sure we keep the wal for the resume-from-folder case. 
+my $folder_slot = "folder_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Second one is for testing the resume-from-slot case +my $archive_slot = "archive_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $archive_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Get a walfilename from before the promotion to make sure it is archived +# after promotion +my $walfile_before_promotion = $primary->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +# Populate the stream_dir for the resume-from-folder case. +$nextlsn = $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); +# Switch wal to make sure it is not a partial file but a complete segment. +$primary->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$primary->psql('postgres', 'SELECT pg_switch_wal();'); + +$standby->run_log( +[ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, + '--slot', $folder_slot, '--no-sync'], +"Stream some wal before promoting"); + +# Everything is setup, promote the standby to trigger a timeline switch +$standby->psql( + 'postgres', + "SELECT pg_promote(wait_seconds => 300)"); + +# Force a wal switch to make sure at least one full WAL is archived on the new +# timeline, and fetch this walfilename. +my $walfile_after_promotion = $standby->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +$standby->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$standby->psql('postgres', 'SELECT pg_switch_wal();'); +$nextlsn = + $standby->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); +$standby->psql('postgres', + 'INSERT INTO test_table VALUES
Re: pg_receivewal starting position
Le mercredi 27 octobre 2021, 04:17:28 CEST Kyotaro Horiguchi a écrit :
> +my @walfiles = glob "$slot_dir/*";
>
> This is not used.

Sorry, fixed in attached version.

> Each pg_receivewal run stalls for about 10 or more seconds before
> finishing, which is not great from the standpoint of recently
> increasing test run time.
>
> Maybe we want to advance LSN a bit, after taking $nextlsn then pass
> "-s 1" to pg_receivewal.

I incorrectly assumed it was due to the promotion time without looking into it. In fact, you're right the LSN was not incremented after we fetched the end lsn, and thus we would wait for quite a while. I fixed that too.

Thank you for the review !

-- 
Ronan Dunklau

>From 04747199b7be896a62021e4f064b2342234427d5 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Tue, 26 Oct 2021 10:54:12 +0200
Subject: [PATCH v14] Add a test for pg_receivewal following timeline switch.

pg_receivewal is able to follow a timeline switch, but this was not tested
anywhere. This test case verifies that it works as expected both when
resuming from a replication slot and from the archive directory, which have
different methods of retrieving the timeline where it left off.
--- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 86 +++- 1 file changed, 85 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 092c9b6f25..28e4e9b3a4 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 31; +use Test::More tests => 39; program_help_ok('pg_receivewal'); program_version_ok('pg_receivewal'); @@ -206,3 +206,87 @@ $primary->command_ok( "WAL streamed from the slot's restart_lsn"); ok(-e "$slot_dir/$walfile_streamed", "WAL from the slot's restart_lsn has been archived"); + +# Test a timeline switch.
+# This test is split in two, using the same standby: one test check the +# resume-from-folder case, the other the resume-from-slot one. + +# Setup a standby for our tests +my $backup_name = "basebackup"; +$primary->backup($backup_name); +my $standby = PostgreSQL::Test::Cluster->new("standby"); +$standby->init_from_backup($primary, $backup_name, has_streaming => 1); +$standby->start; + +# Cleanup the previous stream directories to reuse them +unlink glob "'${stream_dir}/*'"; +unlink glob "'${slot_dir}/*'"; + +# Create two replication slots. +# First one is to make sure we keep the wal for the resume-from-folder case. +my $folder_slot = "folder_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Second one is for testing the resume-from-slot case +my $archive_slot = "archive_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $archive_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Get a walfilename from before the promotion to make sure it is archived +# after promotion +my $walfile_before_promotion = $primary->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +# Switch wal to make sure it is not a partial file but a complete segment. +$primary->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$primary->psql('postgres', 'SELECT pg_switch_wal();'); + +# Populate the stream_dir for the resume-from-folder case. 
+$nextlsn = $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); +$standby->run_log( +[ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, + '--slot', $folder_slot, '--no-sync'], +"Stream some wal before promoting"); + +# Everything is setup, promote the standby to trigger a timeline switch +$standby->psql( + 'postgres', + "SELECT pg_promote(wait_seconds => 300)"); + +# Force a wal switch to make sure at least one full WAL is archived on the new +# timeline, and fetch this walfilename. +my $walfile_after_promotion = $standby->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +$standby->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$standby->psql('postgres', 'SELECT pg_switch_wal();'); +$nextlsn = + $standby->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); + +# Now, try to resume after the promotion, from the folder. +$standby->command_ok( + [ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, +'--slot', $folder_slot, '--no-sync'], + "Stream some wal after promoting, resuming from the folder's position"); + +ok(-e "$stream_dir/$wal
Re: pg_receivewal starting position
Le mardi 26 octobre 2021, 08:27:47 CEST Ronan Dunklau a écrit :
> Yes, I will try to simplify the logic of the patch I sent last week. I'll
> keep you posted here soon.

I was able to simplify it quite a bit, by using only one standby for both test scenarios.

This test case verifies that after a timeline switch, if we resume from a previous state, we will archive:
- segments from the old timeline
- segments from the new timeline
- the timeline history file itself.

I chose to check against a full segment from the previous timeline, but it would have been possible to check that the latest timeline segment was partial. I chose not to, in the unlikely event we promote at an exact segment boundary. I don't think it matters much, since partial WAL files are already covered by other tests.

-- 
Ronan Dunklau

>From 0daf03e7e69aa11accb18df86d686004002115ff Mon Sep 17 00:00:00 2001
From: Ronan Dunklau
Date: Tue, 26 Oct 2021 10:54:12 +0200
Subject: [PATCH v13] Add a test for pg_receivewal following timeline switch.

pg_receivewal is able to follow a timeline switch, but this was not tested
anywhere. This test case verifies that it works as expected both when
resuming from a replication slot and from the archive directory, which have
different methods of retrieving the timeline where it left off.
--- src/bin/pg_basebackup/t/020_pg_receivewal.pl | 87 +++- 1 file changed, 86 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 092c9b6f25..f285e9b3f3 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -5,7 +5,7 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 31; +use Test::More tests => 39; program_help_ok('pg_receivewal'); program_version_ok('pg_receivewal'); @@ -206,3 +206,88 @@ $primary->command_ok( "WAL streamed from the slot's restart_lsn"); ok(-e "$slot_dir/$walfile_streamed", "WAL from the slot's restart_lsn has been archived"); + +# Test a timeline switch. +# This test is split in two, using the same standby: one test check the +# resume-from-folder case, the other the resume-from-slot one. + +# Setup a standby for our tests +my $backup_name = "basebackup"; +$primary->backup($backup_name); +my $standby = PostgreSQL::Test::Cluster->new("standby"); +$standby->init_from_backup($primary, $backup_name, has_streaming => 1); +$standby->start; + +# Cleanup the previous stream directories to reuse them +unlink glob "'${stream_dir}/*'"; +unlink glob "'${slot_dir}/*'"; + +# Create two replication slots. +# First one is to make sure we keep the wal for the resume-from-folder case. 
+my $folder_slot = "folder_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Second one is for testing the resume-from-slot case +my $archive_slot = "archive_slot"; +$standby->psql('', + "CREATE_REPLICATION_SLOT $archive_slot PHYSICAL (RESERVE_WAL)", + replication => 1); +# Get a walfilename from before the promotion to make sure it is archived +# after promotion +my $walfile_before_promotion = $primary->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +# Switch wal to make sure it is not a partial file but a complete segment. +$primary->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$primary->psql('postgres', 'SELECT pg_switch_wal();'); + +# Populate the stream_dir for the resume-from-folder case. +$nextlsn = $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); +$standby->run_log( +[ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, + '--slot', $folder_slot, '--no-sync'], +"Stream some wal before promoting"); + +# Everything is setup, promote the standby to trigger a timeline switch +$standby->psql( + 'postgres', + "SELECT pg_promote(wait_seconds => 300)"); + +# Force a wal switch to make sure at least one full WAL is archived on the new +# timeline, and fetch this walfilename. +my $walfile_after_promotion = $standby->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_insert_lsn())"); +$standby->psql('postgres', + 'INSERT INTO test_table VALUES (generate_series(1,100));'); +$standby->psql('postgres', 'SELECT pg_switch_wal();'); +$nextlsn = + $standby->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); +chomp($nextlsn); + +# Now, try to resume after the promotion, from the folder. +$standby->command_ok( + [ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn, +'--slot', $folder_slot, '--no-sync'], + "Stream some wal after
Re: pg_receivewal starting position
Le mardi 26 octobre 2021, 03:15:40 CEST Michael Paquier a écrit :
> I have changed the patch per Ronan's suggestion to have the version
> check out of GetSlotInformation(), addressed what you have reported,
> and the result looked good. So I have applied this part.

Thanks !

> What remains on this thread is the addition of new tests to make sure
> that pg_receivewal is able to follow a timeline switch. Now that we
> can restart from a slot, that should be a bit easier to implement as
> a test by creating a slot on a standby. Ronan, are you planning to
> send a new patch for this part?

Yes, I will try to simplify the logic of the patch I sent last week. I'll keep you posted here soon.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le lundi 25 octobre 2021, 10:10:13 CEST Michael Paquier a écrit :
> On Mon, Oct 25, 2021 at 09:50:01AM +0200, Ronan Dunklau wrote:
> > Le lundi 25 octobre 2021, 09:40:10 CEST Michael Paquier a écrit :
> > > On Mon, Oct 25, 2021 at 09:15:32AM +0200, Ronan Dunklau wrote:
> > > > Does it make sense though ? The NULL slot_name case handling is pretty
> > > > straightforward, as it will be handled by string formatting, but in the
> > > > case of a null restart_lsn, we have no way of knowing if the command was
> > > > issued at all.
> > >
> > > If I am following your point, I don't think that it matters much here,
> > > and it seems useful to me to be able to pass NULL for both of them, so
> > > as one can check if the slot exists or not with an API designed this
> > > way.
> >
> > You're right, but I'm afraid we would have to check the server version
> > twice in any case different from the basic pg_receivewal one (once in the
> > function itself, and once before calling it if we want a meaningful
> > result). Maybe we should move the version check outside the
> > GetSlotInformation function to avoid this, and let it fail with a syntax
> > error when the server doesn't support it ?
>
> With the approach taken by the patch, we fall down silently to the
> previous behavior if we connect to a server <= 14, and rely on the new
> behavior with a server >= 15, ensuring compatibility. Why would you
> want to make sure that the command is executed when we should just
> enforce that the old behavior is what happens when there is a slot
> defined and a backend <= 14?

Sorry, I haven't been clear. For the use case of this patch, the current approach is perfect (and we supply the restart_lsn). However, if we want to support the case of "just check if the slot exists", we need to make sure the command is actually executed, and check the version before calling the function, which would make the check executed twice.
What I'm proposing is just that we leave the responsibility of checking PQserverVersion() to the caller, and remove it from GetSlotInformation, i.e.:

-	if (replication_slot != NULL)
+	if (replication_slot != NULL && PQserverVersion(conn) >= 15)
	{
		if (!GetSlotInformation(conn, replication_slot, , ))

That way, if we introduce a caller wanting to use this function as an API to check that a slot exists, the usage of checking the server version beforehand will be consistent.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le lundi 25 octobre 2021, 09:40:10 CEST Michael Paquier a écrit :
> On Mon, Oct 25, 2021 at 09:15:32AM +0200, Ronan Dunklau wrote:
> > Does it make sense though ? The NULL slot_name case handling is pretty
> > straightforward, as it will be handled by string formatting, but in the
> > case of a null restart_lsn, we have no way of knowing if the command was
> > issued at all.
>
> If I am following your point, I don't think that it matters much here,
> and it seems useful to me to be able to pass NULL for both of them, so
> as one can check if the slot exists or not with an API designed this
> way.

You're right, but I'm afraid we would have to check the server version twice in any case different from the basic pg_receivewal one (once in the function itself, and once before calling it if we want a meaningful result). Maybe we should move the version check outside the GetSlotInformation function to avoid this, and let it fail with a syntax error when the server doesn't support it ?

-- 
Ronan Dunklau
Re: Non-superuser subscription owners
Le mercredi 20 octobre 2021, 20:40:39 CEST Mark Dilger a écrit :
> These patches have been split off the now deprecated monolithic "Delegating
> superuser tasks to new security roles" thread at [1].
>
> The purpose of these patches is to allow non-superuser subscription owners
> without risk of them overwriting tables they lack privilege to write
> directly. This both allows subscriptions to be managed by non-superusers,
> and protects servers with subscriptions from malicious activity on the
> publisher side.

Thank you Mark for splitting this. This patch looks good to me, and provides both better security (by closing the "dropping superuser role" loophole) and useful features.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le lundi 25 octobre 2021, 00:43:11 CEST Michael Paquier a écrit :
> On Sun, Oct 24, 2021 at 07:20:57PM +0530, Bharath Rupireddy wrote:
> > Thanks. I've no further comments on the v10 patch.
>
> Okay, thanks. I have applied this part, then.
> --
> Michael

Thank you all for your work on this patch.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le lundi 25 octobre 2021, 08:51:23 CEST Michael Paquier a écrit :
> On Thu, Oct 21, 2021 at 10:36:42AM +0200, Ronan Dunklau wrote:
> > Done. I haven't touched the timeline switch test patch for now, but I
> > still include it here for completeness.
>
> As 0001 and 0002 have been applied, I have put my hands on 0003, and
> found a couple of issues upon review.
>
> +	Assert(slot_name != NULL);
> +	Assert(restart_lsn != NULL);
> There is no need for those asserts, as we should support the case
> where the caller gives NULL for those variables.

Does it make sense though ? The NULL slot_name case handling is pretty straightforward, as it will be handled by string formatting, but in the case of a null restart_lsn, we have no way of knowing if the command was issued at all.

> +	if (PQserverVersion(conn) < 15)
> +		return false;
> Returning false is incorrect for older server versions as we won't
> fall back to the old method when streaming from an older server. What
> this needs to do is return true and set restart_lsn to
> InvalidXLogRecPtr, so as pg_receivewal would just stream from the
> current flush location. "false" should just be returned on error,
> with pg_log_error().

Thank you, this was an oversight when moving from the more complicated error handling code.

> +$primary->psql('postgres',
> +	'INSERT INTO test_table VALUES (generate_series(1,100));');
> +$primary->psql('postgres', 'SELECT pg_switch_wal();');
> +$nextlsn =
> +	$primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
> +chomp($nextlsn);
> There is no need to switch twice to a new WAL segment as we just need
> to be sure that the WAL segment of the restart_lsn is the one
> archived. Note that RESERVE_WAL uses the last redo point, so it is
> better to use a checkpoint and reduce the number of logs we stream
> into the new location.
>
> Better to add some --no-sync to the new commands of pg_receivewal, to
> not stress the I/O more than necessary. I have added some extra -n
> while on it to avoid loops on failure.
>
> Attached is the updated patch I am finishing with, which is rather
> clean now. I have tweaked a couple of things while on it, and
> documented better the new GetSlotInformation() in streamutil.c.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le jeudi 21 octobre 2021, 09:21:44 CEST Michael Paquier a écrit :
> On Thu, Oct 21, 2021 at 08:29:54AM +0200, Ronan Dunklau wrote:
> > Ok, do you want me to propose a different patch for previous versions ?
>
> That's not necessary. Thanks for proposing.
>
> > Do you mean restart_lsn as the pointer argument to the function, or
> > restart_lsn as the field returned by the command ? If it's the first, I'll
> > change it, but if it's the latter it is expected that we sometimes run this
> > on a slot where WAL has never been reserved yet.
>
> restart_lsn as the pointer of the function.

Done. I haven't touched the timeline switch test patch for now, but I still include it here for completeness.

-- 
Ronan Dunklau

>From b3703868d230f773378742a0deb4360f6cc7343a Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Thu, 21 Oct 2021 13:21:43 +0900
Subject: [PATCH v10 1/4] doc: Describe calculation method of streaming start
 for pg_receivewal

The documentation was imprecise about the fact that the current WAL flush
location is used if nothing can be found in the local archive directory,
independently of the compression used by each segment (ZLIB or
uncompressed).
---
 doc/src/sgml/ref/pg_receivewal.sgml | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml index 45b544cf49..9adbddb657 100644 --- a/doc/src/sgml/ref/pg_receivewal.sgml +++ b/doc/src/sgml/ref/pg_receivewal.sgml @@ -75,6 +75,29 @@ PostgreSQL documentation one session available for the stream. + + The starting point of the write-ahead log streaming is calculated when + pg_receivewal starts: + + + + First, scan the directory where the WAL segment files are written and + find the newest completed segment, using as starting point the beginning + of the next WAL segment file. This is calculated independently of the + compression method used to compress each segment.
+ + + + + + If a starting point cannot be calculated with the previous method, + the latest WAL flush location is used as reported by the server from + a IDENTIFY_SYSTEM command. + + + + + If the connection is lost, or if it cannot be initially established, with a non-fatal error, pg_receivewal will -- 2.33.0 >From 067353a88803e96b295f425167c1195ccf984352 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 21 Oct 2021 14:23:41 +0900 Subject: [PATCH v10 2/4] Add READ_REPLICATION_SLOT --- doc/src/sgml/protocol.sgml | 48 +++ src/backend/replication/repl_gram.y | 16 +++- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c | 93 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 11 +++ src/test/recovery/t/001_stream_rep.pl | 32 ++- src/test/recovery/t/006_logical_decoding.pl | 11 ++- src/tools/pgindent/typedefs.list| 1 + 9 files changed, 211 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index b95cc88599..37c26ec8ae 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2067,6 +2067,54 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read some information associated to a replication slot. Returns a tuple + with NULL values if the replication slot does not + exist. This command is currently only supported for physical replication + slots. + + + In response to this command, the server will return a one-row result set, + containing the following fields: + + +slot_type (text) + + + The replication slot's type, either physical or + NULL. + + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +restart_tli (int4) + + + The timeline ID associated to restart_lsn, + following the current timeline history. 
+ + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ] START_REPLICATION diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index 126380e2df..dcb1108579 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void); /* Keyword tokens. */ %token K_BASE_BACKUP %token K_IDENTIFY_SYSTEM +%token K_READ_REPLICATION_SLOT %token K_SHOW %token K_START_REPLICATION %token K_CREATE_REPLICATION_SLOT @@ -94,7 +95,7 @@ st
Re: pg_receivewal starting position
Le jeudi 21 octobre 2021, 07:35:08 CEST Michael Paquier a écrit :
> On Wed, Oct 20, 2021 at 02:58:26PM +0200, Ronan Dunklau wrote:
> > After sending the previous patch suite, I figured it would be worthwhile
> > to also have tests covering timeline switches, which was not covered
> > before.
>
> That seems independent to me. I'll take a look.
>
> > So please find attached a new version with an additional patch for those
> > tests, covering both "resume from last known archive" and "resume from
> > the replication slot's position" cases.
>
> So, taking things in order, I have looked at 0003 and 0001, and
> attached are refined versions for both of them.
>
> 0003 is an existing hole in the docs, which I think we had better
> address first and backpatch, taking into account that the starting
> point calculation considers compressed segments when looking for
> completed segments.

Ok, do you want me to propose a different patch for previous versions ?

> Regarding 0001, I have found the last test to check for NULL values
> returned by READ_REPLICATION_SLOT after dropping the slot overlaps
> with the first test, so I have removed that. I have expanded a bit
> the use of like(), and there was some confusion with
> PostgresNode::psql and some extra arguments (see DROP_REPLICATION_SLOT
> and CREATE_REPLICATION_SLOT, and no need for return values in the
> CREATE case either). Some comments, docs and code have been slightly
> tweaked.

Thank you for this.

> Here are some comments about 0002.
>
> +	/* The commpand should always return precisely one tuple */
> s/commpand/command/
>
> +	pg_log_error("could not fetch replication slot: got %d rows and %d fields, expected %d rows and %d or more fields",
> +				 PQntuples(res), PQnfields(res), 1, 3);
> Should this be "could not read" instead?
>
> +	if (sscanf(PQgetvalue(res, 0, 1), "%X/%X", , ) != 2)
> +	{
> +		pg_log_error("could not parse slot's restart_lsn \"%s\"",
> +					 PQgetvalue(res, 0, 1));
> +		PQclear(res);
> +		return false;
> +	}
> Wouldn't it be saner to initialize *restart_lsn and *restart_tli to
> some default values at the top of GetSlotInformation() instead, if
> they are specified by the caller?

Ok.

> And I think that we should still
> complain even if restart_lsn is NULL.

Do you mean restart_lsn as the pointer argument to the function, or restart_lsn as the field returned by the command ? If it's the first, I'll change it, but if it's the latter it is expected that we sometimes run this on a slot where WAL has never been reserved yet.

> On a quick read of 0004, I find the split of the logic with
> change_timeline() a bit hard to understand. It looks like we should
> be able to make a cleaner split, but I am not sure how that would
> look, though.

Thanks; at least if the proposal to test this seems sensible I can move forward. I wanted to avoid having a lot of code duplication since the test setup is a bit more complicated. My first approach was to split it into two functions, setup_standby and change_timeline, but then realized that what would happen between the two invocations would basically be the same for the two test cases, so I ended up with that patch. I'll try to see if I can find a better way of organizing that code.

-- 
Ronan Dunklau
Re: pg_receivewal starting position
Le mercredi 20 octobre 2021 11:40:18 CEST, vous avez écrit :
> > +# Setup the slot, and connect to it a first time
> > +$primary->run_log(
> > +	[ 'pg_receivewal', '--slot', $slot_name, '--create-slot' ],
> > +	'creating a replication slot');
> > +$primary->psql('postgres',
> > +	'INSERT INTO test_table VALUES (generate_series(1,100));');
> > +$primary->psql('postgres', 'SELECT pg_switch_wal();');
> > +$nextlsn =
> > +	$primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
> > +chomp($nextlsn);
> > Wouldn't it be simpler to use CREATE_REPLICATION_SLOT with RESERVE_WAL
> > here, rather than going through pg_receivewal? It seems to me that
> > this would be cheaper without really impacting the coverage.
>
> You're right, we can skip two invocations of pg_receivewal like this (for
> the slot creation + for starting the slot a first time).

After sending the previous patch suite, I figured it would be worthwhile to also have tests covering timeline switches, which was not covered before. So please find attached a new version with an additional patch for those tests, covering both "resume from last known archive" and "resume from the replication slot's position" cases.
-- Ronan Dunklau>From 5a47f17a17594cc171f744ce383ba820d44b6446 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 2 Sep 2021 16:25:25 +0900 Subject: [PATCH v8 1/4] Add READ_REPLICATION_SLOT command --- doc/src/sgml/protocol.sgml | 47 +++ src/backend/replication/repl_gram.y | 16 +++- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c | 89 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 10 +++ src/test/recovery/t/001_stream_rep.pl | 47 ++- src/test/recovery/t/006_logical_decoding.pl | 9 ++- src/tools/pgindent/typedefs.list| 1 + 9 files changed, 218 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index b95cc88599..51a15cc3da 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2067,6 +2067,53 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read the information of a replication slot. Returns a tuple with + NULL values if the replication slot does not + exist. This command is currently only supported for physical slots. + + + In response to this command, the server will return a one-row result set, + containing the following fields: + + +slot_type (text) + + + The replication slot's type, either physical or + NULL. + + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +restart_tli (int4) + + + The timeline ID for the restart_lsn position, + when following the current timeline history. + + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ] START_REPLICATION diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index 126380e2df..913a99da5a 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void); /* Keyword tokens. 
*/ %token K_BASE_BACKUP %token K_IDENTIFY_SYSTEM +%token K_READ_REPLICATION_SLOT %token K_SHOW %token K_START_REPLICATION %token K_CREATE_REPLICATION_SLOT @@ -94,7 +95,7 @@ static SQLCmd *make_sqlcmd(void); %type command %type base_backup start_replication start_logical_replication create_replication_slot drop_replication_slot identify_system -timeline_history show sql_cmd +read_replication_slot timeline_history show sql_cmd %type base_backup_legacy_opt_list generic_option_list %type base_backup_legacy_opt generic_option %type opt_timeline @@ -120,6 +121,7 @@ opt_semicolon: ';' command: identify_system + | read_replication_slot | base_backup | start_replication | start_logical_replication @@ -140,6 +142,18 @@ identify_system: } ; +/* + * READ_REPLICATION_SLOT %s + */ +read_replication_slot: + K_READ_REPLICATION_SLOT var_name +{ + ReadReplicationSlotCmd *n = makeNode(ReadReplicationSlotCmd); + n->slotname = $2; + $$ = (Node *) n; +} + ; + /* * SHOW setting */ diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l index c038a636c3..1b599c255e 100644 --- a/src/backend/replication/repl_scanner.l +++ b/src/backend/replication/repl_scanne
Re: pg_receivewal starting position
Le mercredi 20 octobre 2021, 07:13:15 CEST Michael Paquier a écrit : > On Tue, Oct 19, 2021 at 05:32:55PM +0200, Ronan Dunklau wrote: > > Following recommendations, I stripped most of the features from the patch. > > For now we support only physical replication slots, and only provide the > > two fields of interest (restart_lsn, restart_tli) in addition to the slot > > type (fixed at physical) to not paint ourselves in a corner. > > > > I also removed the part about pg_basebackup since other fixes have been > > proposed for that case. > > Patch 0001 looks rather clean. I have a couple of comments. Thank you for the quick review ! > > + if (OidIsValid(slot_contents.data.database)) > + elog(ERROR, "READ_REPLICATION_SLOT is only supported for > physical slots"); > > elog() can only be used for internal errors. Errors that can be > triggered by a user should use ereport() instead. Ok. > > +ok($stdout eq '||', > + "READ_REPLICATION_SLOT returns NULL values if slot does not exist"); > [...] > +ok($stdout =~ 'physical\|[^|]*\|1', > + "READ_REPLICATION_SLOT returns tuple corresponding to the slot"); > Isn't result pattern matching something we usually test with like()? Ok. > > +($ret, $stdout, $stderr) = $node_primary->psql( > + 'postgres', > + "READ_REPLICATION_SLOT $slotname;", > + extra_params => [ '-d', $connstr_rep ]); > No need for extra_params in this test. You can just pass down > "replication => 1" instead, no? In that test file, every replication connection is obtained by using connstr_rep so I thought it would be best to use the same thing. > > --- a/src/test/recovery/t/006_logical_decoding.pl > +++ b/src/test/recovery/t/006_logical_decoding.pl > [...] > +ok($stderr=~ 'READ_REPLICATION_SLOT is only supported for physical slots', > + 'Logical replication slot is not supported'); > This one should use like(). Ok. > > + > +The slot's restart_lsn can also beused as a > starting +point if the target directory is empty. 
> + > I am not sure that there is a need for this addition as the same thing > is said when describing the lookup ordering. Ok, removed. > > + If nothing is found and a slot is specified, use the > + READ_REPLICATION_SLOT > + command. > It may be clearer to say that the position is retrieved from the > command. Ok, done. The doc also uses the active voice here now. > > +bool > +GetSlotInformation(PGconn *conn, const char *slot_name, XLogRecPtr > *restart_lsn, TimeLineID* restart_tli) > +{ > Could you extend that so as we still run the command but don't crash > if the caller specifies NULL for any of the result fields? This would > be handy. Done. > > + if (PQgetisnull(res, 0, 0)) > + { > + PQclear(res); > + pg_log_error("replication slot \"%s\" does not exist", > slot_name); > + return false; > + } > + if (PQntuples(res) != 1 || PQnfields(res) < 3) > + { > + pg_log_error("could not fetch replication slot: got %d rows > and %d fields, expected %d rows and %d or more fields", > +PQntuples(res), PQnfields(res), 1, 3); > + PQclear(res); > + return false; > + } > Wouldn't it be better to reverse the order of these two checks? Yes it is, and the PQntuples condition should be removed from the first error test. > > I don't mind the addition of the slot type being part of the result of > READ_REPLICATION_SLOT even if it is not mandatory (?), but at least > GetSlotInformation() should check after it. Ok. > > +# Setup the slot, and connect to it a first time > +$primary->run_log( > + [ 'pg_receivewal', '--slot', $slot_name, '--create-slot' ], > + 'creating a replication slot'); > +$primary->psql('postgres', > + 'INSERT INTO test_table VALUES (generate_series(1,100));'); > +$primary->psql('postgres', 'SELECT pg_switch_wal();'); > +$nextlsn = > + $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); > +chomp($nextlsn); > Wouldn't it be simpler to use CREATE_REPLICATION_SLOT with RESERVE_WAL > here, rather than going through pg_receivewal? 
It seems to me that > this would be cheaper without really impacting the coverage. You're right, we can skip two invocations of pg_receivewal like this (for the slot creation + for starting the slot a first time). -- Ronan Dunklau>From 845b2fdd33318f0d7528ef0ecbb69e4aecf43c46 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 6 Sep 2021 11:08:54 +0200 Subject: [PATCH v7 3/3] Add documentation for pg_receivewal --- doc/src/sgml/ref/pg_receivewal.sgml |
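The earlier review point about reversing the two result checks in GetSlotInformation — validate the shape of the result set before inspecting the row for NULLs — can be sketched outside of libpq as follows (a hedged illustration; the names and messages only mirror the quoted C snippet, this is not the patch itself):

```python
def check_slot_result(ntuples, nfields, first_field_is_null, slot_name):
    """Shape check first: a malformed result must not be read at all;
    only then inspect the row contents for a NULL slot_type."""
    if ntuples != 1 or nfields < 3:
        return ("could not fetch replication slot: got %d rows and %d fields,"
                " expected %d rows and %d or more fields"
                % (ntuples, nfields, 1, 3))
    if first_field_is_null:
        return 'replication slot "%s" does not exist' % slot_name
    return None  # result is usable

# A zero-row result now reports a malformed response rather than
# being misread as a nonexistent slot.
print(check_slot_result(0, 0, False, "phys_slot"))
```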
Re: pg_receivewal starting position
Hello, Following recommendations, I stripped most of the features from the patch. For now we support only physical replication slots, and only provide the two fields of interest (restart_lsn, restart_tli) in addition to the slot type (fixed at physical) to not paint ourselves into a corner. I also removed the part about pg_basebackup since other fixes have been proposed for that case. Le vendredi 1 octobre 2021, 09:05:18 CEST Michael Paquier a écrit : > On Mon, Sep 06, 2021 at 04:17:28PM +0900, Michael Paquier wrote: > > Using READ_REPLICATION_SLOT as the command name is fine, and it could > > be extended with more fields if necessary, implemented now with only > > what we think is useful. Returning errors on cases that are still not > > supported yet is fine, say for logical slots if we decide to reject > > the case for now, and restrictions can always be lifted in the > > future. > > And marked as RwF as this was three weeks ago. Please feel free to > register a new entry if this is being worked on. 
> -- > Michael -- Ronan Dunklau>From 4147665664164eb597fdcc86637ec9c497c36660 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 2 Sep 2021 16:25:25 +0900 Subject: [PATCH v6 1/3] Add READ_REPLICATION_SLOT command --- doc/src/sgml/protocol.sgml | 47 +++ src/backend/replication/repl_gram.y | 16 +++- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c | 87 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 10 +++ src/test/recovery/t/001_stream_rep.pl | 47 ++- src/test/recovery/t/006_logical_decoding.pl | 9 ++- src/tools/pgindent/typedefs.list| 1 + 9 files changed, 216 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index b95cc88599..51a15cc3da 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2067,6 +2067,53 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read the information of a replication slot. Returns a tuple with + NULL values if the replication slot does not + exist. This command is currently only supported for physical slots. + + + In response to this command, the server will return a one-row result set, + containing the following fields: + + +slot_type (text) + + + The replication slot's type, either physical or + NULL. + + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +restart_tli (int4) + + + The timeline ID for the restart_lsn position, + when following the current timeline history. + + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ] START_REPLICATION diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index 126380e2df..913a99da5a 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void); /* Keyword tokens. 
*/ %token K_BASE_BACKUP %token K_IDENTIFY_SYSTEM +%token K_READ_REPLICATION_SLOT %token K_SHOW %token K_START_REPLICATION %token K_CREATE_REPLICATION_SLOT @@ -94,7 +95,7 @@ static SQLCmd *make_sqlcmd(void); %type command %type base_backup start_replication start_logical_replication create_replication_slot drop_replication_slot identify_system -timeline_history show sql_cmd +read_replication_slot timeline_history show sql_cmd %type base_backup_legacy_opt_list generic_option_list %type base_backup_legacy_opt generic_option %type opt_timeline @@ -120,6 +121,7 @@ opt_semicolon: ';' command: identify_system + | read_replication_slot | base_backup | start_replication | start_logical_replication @@ -140,6 +142,18 @@ identify_system: } ; +/* + * READ_REPLICATION_SLOT %s + */ +read_replication_slot: + K_READ_REPLICATION_SLOT var_name +{ + ReadReplicationSlotCmd *n = makeNode(ReadReplicationSlotCmd); + n->slotname = $2; + $$ = (Node *) n; +} + ; + /* * SHOW setting */ diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l index c038a636c3..1b599c255e 100644 --- a/src/backend/replication/repl_scanner.l +++ b/src/backend/replication/repl_scanner.l @@ -85,6 +85,7 @@ identifier {ident_start}{ident_cont}* BASE_BACKUP { return K_BASE_BACKUP; } FAST { return K_FAST; } IDENTIFY_SYSTEM { return K_IDENTIFY_SYSTEM; } +READ_REPLICATION_SLOT { return K_READ_REPLICATION_SLOT; } S
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Le lundi 27 septembre 2021, 20:15:05 CEST Mark Dilger a écrit : > > On Sep 21, 2021, at 12:58 PM, Andrew Dunstan wrote: > > > > This patch set is failing to apply for me - it fails on patch 2. > > Thanks for looking! I have pulled together a new patch set which applies > cleanly against current master. > > I haven't dug terribly deeply into it yet, but I notice that there is a > > very large increase in test volume, which appears to account for much of > > the 44635 lines of the patch set. I think we're probably going to want > > to reduce that. We've had complaints in the past from prominent hackers > > about adding too much volume to the regression tests. > > The v8 patch set is much smaller, with the reduction being in the size of > regression tests covering which roles can perform SET, RESET, ALTER SYSTEM > SET, and ALTER SYSTEM RESET and on which GUCs. The v7 patch set did > exhaustive testing on this, which is why it was so big. The v8 set does > just a sampling of GUCs per role. The total number of lines for the patch > set drops from 44635 to 13026, with only 1960 lines total between the .sql > and .out tests of GUC privileges. > > I do like the basic thrust of reducing the power of CREATEROLE. There's > > an old legal maxim I learned in my distant youth that says "nemo dat > > quod non habet" - Nobody can give something they don't own. This seems > > to be in that spirit, and I approve :-) > > Great! I'm glad to hear the approach has some support. > > > Other changes in v8: > > Add a new test for subscriptions owned by non-superusers to verify that the > tablesync and apply workers replicating their subscription won't replicate > into schemas and tables that the subscription owner lacks privilege to > touch. The logic to prevent that existed in the v7 patch, but I overlooked > adding tests for it. Fixed. > > Allow non-superusers to create event triggers. 
The logic already existed > and is unchanged to handle event triggers owned by non-superusers and > conditioning those triggers firing on the (trigger-owner, > role-performing-event) pair. The v7 patch set didn't go quite so far as > allowing non-superusers to create event triggers, but that undercuts much > of the benefit of the changes for no obvious purpose. > > Not changed in v8, but worth discussing: > > Non-superusers are still prohibited from creating subscriptions, despite > improvements to the security around that circumstance. Improvements to the > security model around event triggers does not have to also include > permission for non-superusers to create event triggers, but v8 does both. > These could be viewed as inconsistent choices, but I struck the balance > this way because roles creating event triggers does not entail them doing > anything that they can't already do, whereas allowing arbitrary users to > create subscriptions would entail an ordinary user causing external network > connections being initiated. We likely need to create another privileged > role and require a non-superuser to be part of that role before they can > create subscriptions. That seems, however, like something to do as a > follow-on patch, since tightening up the security on subscriptions as done > in this patch doesn't depend on that. The changes proposed around subscription management make a lot of sense, especially considering that, now that running ALTER SUBSCRIPTION REFRESH in a function is no longer allowed, there is no way to delegate this to a non-superuser (using a security definer function). Since they don't involve the rest of the patchset, patches 16, 17 and 18 could be separated into another thread / patchset. -- Ronan Dunklau
Re: Patch to avoid orphaned dependencies
Hello Bertrand, Le mardi 4 mai 2021, 11:55:43 CEST Drouvot, Bertrand a écrit : > > Implementation overview: > > * A new catalog snapshot is added: DirtyCatalogSnapshot. > * This catalog snapshot is a dirty one to be able to look for > in-flight dependencies. > * Its usage is controlled by a new UseDirtyCatalogSnapshot variable. > * Any time this variable is being set to true, then the next call to > GetNonHistoricCatalogSnapshot() is returning the dirty snapshot. > * This snapshot is being used to check for in-flight dependencies and > also to get the objects description to generate the error messages. > I quickly tested the patch, it behaves as advertised, and passes tests. Isolation tests should be added to demonstrate the issues it is solving. However, I am a bit wary of this behaviour of setting the DirtyCatalogSnapshot global variable which is then reset after each snapshot acquisition: I'm having trouble understanding all the implications of that, and whether an unforeseen snapshot could be introduced before the one we actually want to be dirty. I don't want to derail this thread, but couldn't predicate locks on the pg_depend index pages corresponding to the dropped object / referenced objects work as a different approach ? I'm not familiar enough with them so maybe there is some fundamental misunderstanding on my end. -- Ronan Dunklau
Re: Proposal: More structured logging
Le mercredi 8 septembre 2021, 11:51:31 CEST Peter Eisentraut a écrit : > On 01.09.21 10:00, Ronan Dunklau wrote: > > In-core it would open up the possibility to split log messages into > > different fields, for example the different statistics reported in the > > logs by VACUUM / ANALYZE VERBOSE and make it easier to consume the output > > without having to parse the message. Parsing the message also means that > > any tool parsing it needs to either be aware of the locale, or to force > > the user to use a specific one. > > I think those messages would themselves have substructure. For example, > the current message > > "I/O timings: read: %.3f ms, write: %.3f ms\n" > > might be > > {"I/O timings": {"read": ..., "write": ...}} > > and that in turn is already part of a larger message. > > So just having a single level of tags wouldn't really work for this. I agree having a nested structure may feel more natural, but I don't think it would matter too much if we standardize on ini-style property names (i.e., iotimings.read, iotimings.write and so on...) Regards, -- Ronan Dunklau
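Going from the nested form Peter describes to the dotted, ini-style names suggested here is mechanical; a quick illustrative sketch (the key names are hypothetical, not a proposed logging API):

```python
def flatten(tags, prefix=""):
    """Flatten nested log tags into dotted key/value pairs,
    e.g. {"iotimings": {"read": 1.2}} -> {"iotimings.read": 1.2}."""
    out = {}
    for key, value in tags.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out

print(flatten({"iotimings": {"read": 0.132, "write": 0.045}}))
# {'iotimings.read': 0.132, 'iotimings.write': 0.045}
```

So a consumer standardizing on flat dotted names loses nothing that the nested representation carries, as long as the dot is reserved as a separator.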
Re: ORDER BY pushdowns seem broken in postgres_fdw
Le lundi 6 septembre 2021, 11:25:39 CEST Zhihong Yu a écrit : > On Mon, Sep 6, 2021 at 1:17 AM Ronan Dunklau wrote: > > Le vendredi 3 septembre 2021, 22:54:25 CEST David Zhang a écrit : > > > The following review has been posted through the commitfest application: > > > make installcheck-world: tested, failed > > > Implements feature: tested, passed > > > Spec compliant: not tested > > > Documentation:not tested > > > > > > Applied the v6 patch to master branch and ran regression test for > > > > contrib, > > > > > the result was "All tests successful." > > > > What kind of error did you get running make installcheck-world ? If it > > passed > > the make check for contrib, I can't see why it would fail running make > > installcheck-world. > > > > In any case, I just checked and running make installcheck-world doesn't > > produce any error. > > > > Since HEAD had moved a bit since the last version, I rebased the patch, > > resulting in the attached v7. > > > > Best regards, > > > > -- > > Ronan Dunklau > > Hi, > bq. a pushed-down order by could return wrong results. > > Can you briefly summarize the approach for fixing the bug in the > description ? Done, let me know what you think about it. > > + * Returns true if it's safe to push down a sort as described by 'pathkey' > to > + * the foreign server > + */ > +bool > +is_foreign_pathkey(PlannerInfo *root, > > It would be better to name the method which reflects whether pushdown is > safe. e.g. is_pathkey_safe_for_pushdown. The convention used here is the same one as in is_foreign_expr and is_foreign_param, which are also related to pushdown-safety. -- Ronan Dunklau>From 8d0ebd4df2e5e72a8d490a65001c9971516a4337 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 6 Sep 2021 09:54:43 +0200 Subject: [PATCH v8] Fix orderby handling in postgres_fdw The logic for pushing down order bys in postgres fdw didn't take into account the specific operator used, and as such a pushed-down order by could return wrong results. 
This patch looks up the original operator associated to the pathkey opfamily, and checks that it actually exists on the foreign side. If it does, the operator is then used to rebuild an equivalent USING clause. --- contrib/postgres_fdw/deparse.c| 156 +- .../postgres_fdw/expected/postgres_fdw.out| 23 +++ contrib/postgres_fdw/postgres_fdw.c | 28 ++-- contrib/postgres_fdw/postgres_fdw.h | 11 +- contrib/postgres_fdw/sql/postgres_fdw.sql | 8 + src/backend/optimizer/path/equivclass.c | 26 ++- src/include/optimizer/paths.h | 2 + 7 files changed, 187 insertions(+), 67 deletions(-) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index d98bd66681..fb1b5f9d9b 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -37,9 +37,11 @@ #include "access/sysattr.h" #include "access/table.h" #include "catalog/pg_aggregate.h" +#include "catalog/pg_amop.h" #include "catalog/pg_collation.h" #include "catalog/pg_namespace.h" #include "catalog/pg_operator.h" +#include "catalog/pg_opfamily.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" #include "commands/defrem.h" @@ -47,6 +49,7 @@ #include "nodes/nodeFuncs.h" #include "nodes/plannodes.h" #include "optimizer/optimizer.h" +#include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" @@ -182,6 +185,8 @@ static void deparseRangeTblRef(StringInfo buf, PlannerInfo *root, Index ignore_rel, List **ignore_conds, List **params_list); static void deparseAggref(Aggref *node, deparse_expr_cxt *context); static void appendGroupByClause(List *tlist, deparse_expr_cxt *context); +static void appendOrderBySuffix(Oid sortop, Oid sortcoltype, bool nulls_first, +deparse_expr_cxt *context); static void appendAggOrderBy(List *orderList, List *targetList, deparse_expr_cxt *context); static void appendFunctionName(Oid funcid, deparse_expr_cxt *context); @@ -1037,6 +1042,41 @@ is_foreign_param(PlannerInfo *root, return false; } +/* + * 
Returns true if it's safe to push down a sort as described by 'pathkey' to + * the foreign server + */ +bool +is_foreign_pathkey(PlannerInfo *root, + RelOptInfo *baserel, + PathKey *pathkey) +{ + EquivalenceClass *pathkey_ec = pathkey->pk_eclass; + PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) baserel->fdw_private; + Expr *em_expr; + + /*
Re: ORDER BY pushdowns seem broken in postgres_fdw
Le vendredi 3 septembre 2021, 22:54:25 CEST David Zhang a écrit : > The following review has been posted through the commitfest application: > make installcheck-world: tested, failed > Implements feature: tested, passed > Spec compliant: not tested > Documentation:not tested > > Applied the v6 patch to master branch and ran regression test for contrib, > the result was "All tests successful." What kind of error did you get running make installcheck-world ? If it passed the make check for contrib, I can't see why it would fail running make installcheck-world. In any case, I just checked and running make installcheck-world doesn't produce any error. Since HEAD had moved a bit since the last version, I rebased the patch, resulting in the attached v7. Best regards, -- Ronan Dunklau >From ccfb0a3d7dae42f56c699aac6f699c3c2f08c812 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Mon, 6 Sep 2021 09:54:43 +0200 Subject: [PATCH v7] Fix orderby handling in postgres_fdw The logic for pushing down order bys in postgres fdw didn't take into account the specific operator used, and as such a pushed-down order by could return wrong results. 
--- contrib/postgres_fdw/deparse.c| 156 +- .../postgres_fdw/expected/postgres_fdw.out| 23 +++ contrib/postgres_fdw/postgres_fdw.c | 28 ++-- contrib/postgres_fdw/postgres_fdw.h | 11 +- contrib/postgres_fdw/sql/postgres_fdw.sql | 8 + src/backend/optimizer/path/equivclass.c | 26 ++- src/include/optimizer/paths.h | 2 + 7 files changed, 187 insertions(+), 67 deletions(-) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index d98bd66681..fb1b5f9d9b 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -37,9 +37,11 @@ #include "access/sysattr.h" #include "access/table.h" #include "catalog/pg_aggregate.h" +#include "catalog/pg_amop.h" #include "catalog/pg_collation.h" #include "catalog/pg_namespace.h" #include "catalog/pg_operator.h" +#include "catalog/pg_opfamily.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" #include "commands/defrem.h" @@ -47,6 +49,7 @@ #include "nodes/nodeFuncs.h" #include "nodes/plannodes.h" #include "optimizer/optimizer.h" +#include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" @@ -182,6 +185,8 @@ static void deparseRangeTblRef(StringInfo buf, PlannerInfo *root, Index ignore_rel, List **ignore_conds, List **params_list); static void deparseAggref(Aggref *node, deparse_expr_cxt *context); static void appendGroupByClause(List *tlist, deparse_expr_cxt *context); +static void appendOrderBySuffix(Oid sortop, Oid sortcoltype, bool nulls_first, +deparse_expr_cxt *context); static void appendAggOrderBy(List *orderList, List *targetList, deparse_expr_cxt *context); static void appendFunctionName(Oid funcid, deparse_expr_cxt *context); @@ -1037,6 +1042,41 @@ is_foreign_param(PlannerInfo *root, return false; } +/* + * Returns true if it's safe to push down a sort as described by 'pathkey' to + * the foreign server + */ +bool +is_foreign_pathkey(PlannerInfo *root, + RelOptInfo *baserel, + PathKey *pathkey) +{ + EquivalenceClass 
*pathkey_ec = pathkey->pk_eclass; + PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) baserel->fdw_private; + Expr *em_expr; + + /* + * is_foreign_expr would detect volatile expressions as well, but checking + * ec_has_volatile here saves some cycles. + */ + if (pathkey_ec->ec_has_volatile) + return false; + + /* can't push down the sort if the pathkey's opfamily is not shippable */ + if (!is_shippable(pathkey->pk_opfamily, OperatorFamilyRelationId, fpinfo)) + return false; + + em_expr = find_em_expr_for_rel(pathkey_ec, baserel); + if (em_expr == NULL) + return false; + + /* + * Finally, determine if it's safe to evaluate the found expr on the + * foreign server. + */ + return is_foreign_expr(root, baserel, em_expr); +} + /* * Convert type OID + typmod info into a type name we can ship to the remote * server. Someplace else had better have verified that this type name is @@ -3331,6 +3371,45 @@ deparseAggref(Aggref *node, deparse_expr_cxt *context) appendStringInfoChar(buf, ')'); } +/* + * Append the ASC, DESC, USING and NULLS FIRST / NULLS LAST part + * of the ORDER BY clause + */ +static void +appendOrderBySuffix(Oid sortop, Oid sortcoltype, bool nulls_first, + deparse_expr_cxt *context) +{ + StringInfo buf = context->buf; + TypeCacheEntry *typentry; + + typentry = lookup_type_cache(sortcoltype, TYPECACHE_LT_OPR | TYPECACHE_GT_OPR); + + if (sortop == typentry-
Re: pg_receivewal starting position
Le vendredi 3 septembre 2021 17:49:34 CEST, vous avez écrit : > On Fri, Sep 3, 2021 at 3:28 PM Ronan Dunklau wrote: > > There is still the concern raised by Bharath about being able to select > > the > > way to fetch the replication slot information for the user, and what > > should or should not be included in the documentation. I am in favor of > > documenting the process of selecting the wal start, and maybe include a > > --start-lsn option for the user to override it, but that's maybe for > > another patch. > > Let's hear from others. > > Thanks for the patches. I have some quick comments on the v5 patch-set: > > 0001: > 1) Do you also want to MemSet values too in ReadReplicationSlot? > > 2) When if clause has single statement we don't generally use "{" and "}" > + if (slot == NULL || !slot->in_use) > + { > + LWLockRelease(ReplicationSlotControlLock); > + } > you can just have: > + if (slot == NULL || !slot->in_use) > + LWLockRelease(ReplicationSlotControlLock); > > 3) This is still not copying the slot contents, as I said earlier you > can just take the required info into some local variables instead of > taking the slot pointer to slot_contents pointer. > + /* Copy slot contents while holding spinlock */ > + SpinLockAcquire(>mutex); > + slot_contents = *slot; > + SpinLockRelease(>mutex); > + LWLockRelease(ReplicationSlotControlLock); > > 4) As I said earlier, why can't we have separate tli variables for > more readability instead of just one slots_position_timeline and > timeline_history variable? And you are not even resetting those 2 > variables after if > (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)), you might end > up sending the restart_lsn timelineid for confirmed_flush. So, better > use two separate variables. 
In fact you can use block local variables: > + if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)) > + { > + List*tl_history= NIL; > + TimeLineID tli; > + tl_history= readTimeLineHistory(ThisTimeLineID); > + tli = tliOfPointInHistory(slot_contents.data.restart_lsn, > + tl_history); > + values[i] = Int32GetDatum(tli); > + nulls[i] = false; > + } > similarly for confirmed_flush. > > 5) I still don't see the need for below test case: > + "READ_REPLICATION_SLOT does not produce an error with non existent slot"); > when we have > + "READ_REPLICATION_SLOT does not produce an error with dropped slot"); > Because for a user, dropped or non existent slot is same, it's just > that for dropped slot we internally don't delete its entry from the > shared memory. > Thank you for reiterating those concerns. As I said, I haven't touched Michael's version of the patch. I was incorporating those changes in my branch before he sent this, so I'll probably merge both before sending an updated patch. > 0002: > 1) Imagine GetSlotInformation always returns > READ_REPLICATION_SLOT_ERROR, don't you think StreamLog enters an > infinite loop there? Instead, why don't you just exit(1); instead of > return; and retry? Similarly for READ_REPLICATION_SLOT_NONEXISTENT? I > think, you can just do exit(1), no need to retry. > + case READ_REPLICATION_SLOT_ERROR: > + > + /* > + * Error has been logged by GetSlotInformation, return and > + * maybe retry > + */ > + return; This is the same behaviour we had before: if there is an error with pg_receivewal we retry the command. There was no special case for the replication slot not existing before, I don't see why we should change it now ? 
Eg: 2021-09-06 09:11:07.774 CEST [95853] ERROR: replication slot "nonexistent" does not exist 2021-09-06 09:11:07.774 CEST [95853] STATEMENT: START_REPLICATION SLOT "nonexistent" 0/100 TIMELINE 1 pg_receivewal: error: could not send replication command "START_REPLICATION": ERROR: replication slot "nonexistent" does not exist pg_receivewal: disconnected; waiting 5 seconds to try again Users may rely on it to keep retrying in the background until the slot has been created for example. > > 2) Why is it returning READ_REPLICATION_SLOT_OK when slot_info isn't > passed? And why are you having this check after you connect to the > server and fetch the data? > + /* If no slotinformation has been passed, we can return immediately */ > + if (slot_info == NULL) > + { > + PQclear(res); > + return READ_REPLICATION_SLOT_OK; > + } > Instead you can just have a single assert: > > + Assert(slot_name && slot_info ); At first it was so that we didn't have to fill in all required information if we don't need too, but it turns out pg_basebackup also has to check for the slot's type. I agree we should not support the null slot_in
Re: pg_receivewal starting position
Le lundi 6 septembre 2021, 06:22:30 CEST Michael Paquier a écrit : > On Fri, Sep 03, 2021 at 11:58:27AM +0200, Ronan Dunklau wrote: > > 0002 for pg_receivewal tried to simplify the logic of information to > > return, by using a dedicated struct for this. This accounts for Bharath's > > demands to return every possible field. > > In particular, an is_logical field simplifies the detection of the type of > > slot. In my opinion it makes sense to simplify it like this on the client > > side while being more open-minded on the server side if we ever need to > > provide a new type of slot. Also, GetSlotInformation now returns an enum > > to be able to handle the different modes of failure, which differ > > between pg_receivewal and pg_basebackup. > > + if (PQserverVersion(conn) < 15) > + return READ_REPLICATION_SLOT_UNSUPPORTED; > [...] > +typedef enum { > + READ_REPLICATION_SLOT_OK, > + READ_REPLICATION_SLOT_UNSUPPORTED, > + READ_REPLICATION_SLOT_ERROR, > + READ_REPLICATION_SLOT_NONEXISTENT > +} ReadReplicationSlotStatus; > > Do you really need this much complication? We could treat the > unsupported case and the non-existing case the same way: we don't know > so just assign InvalidXlogRecPtr or NULL to the fields of the > structure, and make GetSlotInformation() return false just on error, > with some pg_log_error() where adapted in its internals. I actually started with the implementation you propose, but changed my mind while writing it because I realised it's easier to reason about like this, instead of failing silently during READ_REPLICATION_SLOT only to fail a bit later when actually trying to start replication from the slot because it doesn't exist. Either the user expects the replication slot to exist, and in this case we should retry the whole loop in the hope of getting an interesting LSN, or the user doesn't and shouldn't even pass a replication_slot name to begin with. 
> > > There is still the concern raised by Bharath about being able to select > > the > > way to fetch the replication slot information for the user, and what > > should or should not be included in the documentation. I am in favor of > > documenting the process of selecting the wal start, and maybe include a > > --start-lsn option for the user to override it, but that's maybe for > > another patch. > > The behavior of pg_receivewal that you are implementing should be > documented. We don't say either how the start location is selected > when working on an existing directory, so I would recommend to add a > paragraph in the description section to detail all that, as of: > - First a scan of the existing archives is done. > - If nothing is found, and if there is a slot, request the slot > information. > - If still nothing (aka slot undefined, or command not supported), use > the last flush location. Sounds good, I will add another patch for the documentation of this. > > As a whole, I am not really convinced that we need a new option for > that as long as we rely on a slot with pg_receivewal as these are used > to make sure that we avoid holes in the WAL archives. > > Regarding pg_basebackup, Daniel has proposed a couple of days ago a > different solution to trap errors earlier, which would cover the case > dealt with here: > https://www.postgresql.org/message-id/0F69E282-97F9-4DB7-8D6D-F927AA6340C8@y > esql.se I will take a look. > We should not mimic in the frontend errors that are safely trapped > in the backend with the proper locks, in any case. I don't understand what you mean by this ? My original proposal was for the command to actually attach to the replication slot while reading it, thus keeping a lock on it to prevent dropping or concurrent access on the server. > > While on it, READ_REPLICATION_SLOT returns a confirmed LSN when > grabbing the data of a logical slot. 
We are not going to use that > with pg_recvlogical as by default START_STREAMING 0/0 would just use > the confirmed LSN. Do we have use cases where this information would > help? There is the argument of consistency with physical slots and > that this can be helpful to do sanity checks for clients, but that's > rather thin. If we don't, we should rename the command to READ_PHYSICAL_REPLICATION_SLOT and error out on the server if the slot is not actually a physical one, to spare the client from performing those checks. I still think it's better to support both cases as opposed to having two completely different APIs (READ_(PHYSICAL)_REPLICATION_SLOT for physical ones on a replication connection, pg_replication_slots view for logical ones) as it would enable more use cases for third-party clients at a relatively low maintenance cost for us. -- Ronan Dunklau
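The three-step start-position selection Michael describes earlier in this message (existing archives first, then the slot, then the server's flush location) can be sketched as a simple precedence function; the names are hypothetical and for illustration only, not the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr	((XLogRecPtr) 0)

/*
 * Sketch of the documented precedence: existing archives first, then
 * the slot's restart_lsn (when a slot is used and the server supports
 * READ_REPLICATION_SLOT), and the server's flush location last.  A
 * source that yielded nothing passes InvalidXLogRecPtr.
 */
static XLogRecPtr
choose_start_position(XLogRecPtr from_archives,
					  XLogRecPtr from_slot,
					  XLogRecPtr server_flush)
{
	if (from_archives != InvalidXLogRecPtr)
		return from_archives;	/* resume after existing segments */
	if (from_slot != InvalidXLogRecPtr)
		return from_slot;		/* slot's restart_lsn, if available */
	return server_flush;		/* last resort: current server position */
}
```

The fallbacks only kick in when the preceding source yields nothing, which is exactly the behaviour the documentation paragraph should spell out.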
Re: pg_receivewal starting position
Le jeudi 2 septembre 2021, 10:37:20 CEST Michael Paquier a écrit : > On Thu, Sep 02, 2021 at 10:08:26AM +0200, Ronan Dunklau wrote: > > I could see the use for sending active_pid for use within pg_basebackup: > > at > > least we could fail early if the slot is already in use. But at the same > > time, maybe it won't be in use anymore once we need it. > > There is no real concurrent protection with this design. You could > read that the slot is not active during READ_REPLICATION_SLOT just to > find out after in the process of pg_basebackup streaming WAL that it > became in use in-between. And the backend-side protections would kick > in at this stage. > > Hmm. The logic doing the decision-making with pg_receivewal may > become more tricky when it comes to pg_replication_slots.wal_status, > max_slot_wal_keep_size and pg_replication_slots.safe_wal_size. The > number of cases we'd like to consider impacts directly the amount of > data sent through READ_REPLICATION_SLOT. That's not really different > than deciding of a failure, a success or a retry with active_pid at an > earlier or a later stage of a base backup. pg_receivewal, on the > contrary, can just rely on what the backend tells when it begins > streaming. So I'd prefer keeping things simple and limit the number > of fields to a minimum for this command. Ok. Please find attached new patches rebased on master. 0001 is yours without any modification. 0002 for pg_receivewal tried to simplify the logic of information to return, by using a dedicated struct for this. This accounts for Bharath's demands to return every possible field. In particular, an is_logical field simplifies the detection of the type of slot. In my opinion it makes sense to simplify it like this on the client side while being more open-minded on the server side if we ever need to provide a new type of slot.
Also, GetSlotInformation now returns an enum to be able to handle the different modes of failures, which differ between pg_receivewal and pg_basebackup. 0003 is the pg_basebackup one, not much changed except for the concerns you had about the log message and handling of different failure modes. There is still the concern raised by Bharath about being able to select the way to fetch the replication slot information for the user, and what should or should not be included in the documentation. I am in favor of documenting the process of selecting the wal start, and maybe include a --start-lsn option for the user to override it, but that's maybe for another patch. -- Ronan Dunklau>From 60b8cedb196f5acdd75b489c1d2155c2698084a4 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 2 Sep 2021 16:25:25 +0900 Subject: [PATCH v5 1/3] Add READ_REPLICATION_SLOT command --- doc/src/sgml/protocol.sgml | 66 src/backend/replication/repl_gram.y | 16 ++- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c | 112 src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 10 ++ src/test/recovery/t/001_stream_rep.pl | 47 +++- src/test/recovery/t/006_logical_decoding.pl | 15 ++- src/tools/pgindent/typedefs.list| 1 + 9 files changed, 266 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index a232546b1d..8191f17137 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2052,6 +2052,72 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read the information of a replication slot. Returns a tuple with + NULL values if the replication slot does not + exist. + + + In response to this command, the server will return a one-row result set, + containing the following fields: + + +slot_type (text) + + + The replication slot's type, either physical or + logical. 
+ + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +confirmed_flush_lsn (text) + + + The replication slot's confirmed_flush_lsn. + + + + + +restart_tli (int4) + + + The timeline ID for the restart_lsn position, + when following the current timeline history. + + + + + +confirmed_flush_tli (int4) + + + The timeline ID for the confirmed_flush_lsn + position, when following the current timeline history. + + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHY
Re: improve pg_receivewal code
Le lundi 30 août 2021, 09:32:05 CEST Bharath Rupireddy a écrit : > Hi, > > I see there's scope to do the following improvements to pg_receivewal.c: Thank you Bharath for this patch. > > 1) Fetch the server system identifier in the first RunIdentifySystem > call and use it to identify (via pg_receivewal's ReceiveXlogStream) any > unexpected changes that may happen in the server while pg_receivewal > is connected to it. This can be helpful in scenarios when > pg_receivewal tries to reconnect to the server (see the code around > pg_usleep with RECONNECT_SLEEP_TIME) but something unexpected has > happened in the server that changed its system identifier. Once > the pg_receivewal establishes the connection to the server again, then the > ReceiveXlogStream has a code chunk to compare the system identifier > that we received in the initial connection. I'm not sure what kind of problem could be happening here: if you were somehow routed to another server ? Or if we "switched" the cluster listening on that port ? > 2) Move the RunIdentifySystem to identify timeline id and start LSN > from the server only if the pg_receivewal failed to get them from > FindStreamingStart. This way, an extra IDENTIFY_SYSTEM command is > avoided. That makes sense, even if we add another IDENTIFY_SYSTEM to check against the one set in the first place. > 3) Place the "replication connection shouldn't have any database name > associated" error check right after RunIdentifySystem so that we can > avoid fetching wal segment size with RetrieveWalSegSize if at all we > were to fail with that error. This change is similar to what > pg_recvlogical.c does. Makes sense. > 4) Move the RetrieveWalSegSize to just before pg_receivewal.c enters > main loop to get the wal from the server. This avoids an unnecessary > query for pg_receivewal with "--create-slot" or "--drop-slot".
> 5) Having an assertion after the pg_receivewal has done a good amount of > work to find start timeline and LSN might be helpful: > Assert(stream.timeline != 0 && stream.startpos != InvalidXLogRecPtr); > > Attaching a patch that does take care of the above improvements. Thoughts? Overall I think it is good. However, you have some formatting issues; here it mixes tabs and spaces: + /* +* No valid data can be found in the existing output directory. +* Get start LSN position and current timeline ID from the server. + */ And here it is not formatted properly: +static char *server_sysid = NULL; > > Regards, > Bharath Rupireddy. -- Ronan Dunklau
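The reconnection hazard raised in point 1 above comes down to caching the system identifier reported by the first IDENTIFY_SYSTEM and comparing it on every reconnect. A self-contained sketch of that check, with a hypothetical helper name rather than pg_receivewal's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Sketch: remember the system identifier returned by the first
 * IDENTIFY_SYSTEM, then refuse to resume streaming after a reconnect
 * if the server now reports a different one (e.g. the cluster behind
 * the address was replaced or re-initialized in the meantime).
 */
static bool
sysid_matches(const char *saved_sysid, const char *current_sysid)
{
	if (saved_sysid == NULL)
		return true;			/* first connection: nothing to compare yet */
	return strcmp(saved_sysid, current_sysid) == 0;
}
```

Streaming WAL from a cluster with a different system identifier into the same archive directory would silently mix unrelated timelines, which is why failing hard on a mismatch is the safe choice.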
Re: pg_receivewal starting position
on the same thing concurrently in the future to avoid duplicate efforts. I'll rebase and send the updated version for patches 0002 and 0003 of my original proposal once we reach an agreement over the behaviour / options of pg_receivewal. Also considering the number of different fields to be filled by the GetSlotInformation function, my local branch groups them into a dedicated struct which is more convenient than having X possibly null arguments. -- Ronan Dunklau
Re: pg_receivewal starting position
Le jeudi 2 septembre 2021, 08:42:22 CEST Bharath Rupireddy a écrit : > On Thu, Sep 2, 2021 at 11:15 AM Kyotaro Horiguchi > > wrote: > > At Wed, 1 Sep 2021 10:30:05 +0530, Bharath Rupireddy > > wrote in> > > > On Mon, Aug 30, 2021 at 3:26 PM Ronan Dunklau wrote: > > > > > 7) How about we let pg_receivewal use READ_REPLICATION_SLOT as an > > > > > option? > > > > > > > > From my point of view, I already expected it to use something like > > > > that when using a replication slot. Maybe an option to turn it off > > > > could be useful ?> > > > > IMO, pg_receivewal should have a way to turn off/on using > > > READ_REPLICATION_SLOT. Imagine if the postgres server doesn't support > > > READ_REPLICATION_SLOT (a lower version) but for some reasons the > > > pg_receivewal(running separately) is upgraded to the version that uses > > > READ_REPLICATION_SLOT, knowing that the server doesn't support > > > READ_REPLICATION_SLOT why should user let pg_receivewal run an > > > unnecessary code? > > > > If I read the patch correctly the situation above is warned by the > > following message then continue to the next step giving up to identify > > start position from slot data. > > > > > "server does not suport fetching the slot's position, resuming from the > > > current server position instead"> > > (The message looks a bit too long, though..) > > > > However, if the operator doesn't know the server is old, pg_receivewal > > starts streaming from unexpected position, which is a kind of > > disaster. So I'm inclined to agree to Bharath, but rather I imagine of > > an option to explicitly specify how to determine the start position. > > > > --start-source=[server,wal,slot] specify starting-LSN source, default is > > > > trying all of them in the order of wal, slot, server. > > > > I don't think the option doesn't need to accept multiple values at once. 
> > If --start-source = 'wal' fails, then pg_receivewal should show up an > error saying "cannot find start position from <> > directory, try with "server" or "slot" for --start-source". We might > end up having similar errors for other options as well. Isn't this going > to create unnecessary complexity? > > The existing way the pg_receivewal fetches start pos i.e. first from > wal directory and then from server start position, isn't known to the > user at all, no verbose message or anything specified in the docs. Why > do we need to expose this with the --start-source option? IMO, we can > keep it that way and we can just have a way to turn off the new > behaviour that we are proposing here, i.e. fetching the start position > from the slot's restart_lsn. Then it should probably be documented. We write in the docs that it is strongly recommended to use a replication slot, but do not mention how we resume from what has already been processed. If someone really cares about having control over how the start position is defined instead of relying on the auto detection, it would be wiser to add a --startpos parameter similar to the --endpos one, which would override everything else, instead of different flags for different behaviours. Regards, -- Ronan Dunklau
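If such a --startpos option existed (it is only being proposed here, so this is purely hypothetical), its interaction with the autodetected position would be a trivial override, mirroring how --endpos caps streaming:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr	((XLogRecPtr) 0)

/*
 * Hypothetical sketch: an explicit user-provided start position
 * overrides every autodetection source (archives, slot, server).
 */
static XLogRecPtr
effective_startpos(XLogRecPtr user_startpos, XLogRecPtr autodetected)
{
	return (user_startpos != InvalidXLogRecPtr) ? user_startpos : autodetected;
}
```

A single override point keeps the option surface small, which is the argument being made against per-source --start-source flags.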
Re: Proposal: More structured logging
Le mercredi 1 septembre 2021, 09:36:50 CEST Peter Eisentraut a écrit : > On 13.08.21 15:23, Ronan Dunklau wrote: > > The logging system already captures a lot of information in the ErrorData. > > But at present there is no way for a log message authors to include more > > metadata about the log outside of the log message itself. For example, > > including the extension name which can be useful for filtering / > > dispatching log messages. This can be done in the log message itself, but > > that requires specialized parsing in the log output. > > > > Even though among the available loggers in core, only the csv logger would > > be able to make sense of it, I feel it would be beneficial to add a > > tagging system to logs, by adding different (tag, value) combinations to > > a log entry, with an API similar to what we do for other ErrorData > > fields: > > > > ereport(NOTICE, > > > > (errmsg("My log message")), > > (errtag("EMITTER", "MYEXTENSION")), > > (errtag("MSG-ID", "%d", error_message_id)) > > > > ); > > What are some more examples of where this could be used? The extension > name could be added more or less automatically to ereport() calls. I > don't know what the MSG-ID is supposed to be. Are there other use cases? Adding it automatically would be nice, but how would you go about that ? In-core it would open up the possibility to split log messages into different fields, for example the different statistics reported in the logs by VACUUM / ANALYZE VERBOSE and make it easier to consume the output without having to parse the message. Parsing the message also means that any tool parsing it needs to either be aware of the locale, or to force the user to use a specific one. Out-of-core, the same things could be done for extensions like pg_audit which logs structured information into the message itself, leaving the message format to be parsed at a later time. -- Ronan Dunklau
Re: pg_receivewal starting position
Le mardi 31 août 2021, 13:17:22 CEST Bharath Rupireddy a écrit : > On Mon, Aug 30, 2021 at 3:26 PM Ronan Dunklau wrote: > > Thank you for this review ! Please see the other side of the thread where > > I > > answered Michael Paquier with a new patchset, which includes some of your > > proposed modifications. > > Thanks for the updated patches. Here are some comments on v3-0001 > patch. I will continue to review 0002 and 0003. Thank you ! I will send a new version shortly, once I address your remarks concerning patch 0002 (and hopefully 0003 :-) ) > > 1) Missing full stops "." at the end. > + logical > + when following the current timeline history > + position, when following the current timeline history > Good catch, I will take care of it for the next version. > 2) Can we have the "type" column as "slot_type" just to keep in sync > with the pg_replication_slots view? You're right, it makes more sense like this. > > 3) Can we mention the link to pg_replication_slots view in the columns > - "type", "restart_lsn", "confirmed_flush_lsn"? > Something like: the "slot_type"/"restart_lsn"/"confirmed_flush_lsn" is > the same as in the <link linkend="view-pg-replication-slots"><structname>pg_replication_slots</structname></link> view. Same as above, thanks. > > 4) Can we use "read_replication_slot" instead of > "identify_replication_slot", just to be in sync with the actual > command? That must have been a leftover from an earlier version of the patch, I will fix it also. > > 5) Are you actually copying the slot contents into the slot_contents > variable here? Isn't it just taking a pointer to the shared memory?
> + /* Copy slot contents while holding spinlock */ > + SpinLockAcquire(&slot->mutex); > + slot_contents = *slot; > + SpinLockRelease(&slot->mutex); > + LWLockRelease(ReplicationSlotControlLock); > > You could just do: > + Oid dbid; > + XLogRecPtr restart_lsn; > + XLogRecPtr confirmed_flush; > > + /* Copy the required slot contents */ > + SpinLockAcquire(&slot->mutex); > + dbid = slot.data.database; > + restart_lsn = slot.data.restart_lsn; > + confirmed_flush = slot.data.confirmed_flush; > + SpinLockRelease(&slot->mutex); > + LWLockRelease(ReplicationSlotControlLock); It's probably simpler that way. > > 6) It looks like you are not sending anything as a response to the > READ_REPLICATION_SLOT command, if the slot specified doesn't exist. > You are just calling end_tup_output which just calls rShutdown (points > to donothingCleanup of printsimpleDR) > if (has_value) > do_tup_output(tstate, values, nulls); > end_tup_output(tstate); > > Can you test the use case where the pg_receivewal asks > READ_REPLICATION_SLOT with a non-existent replication slot and see > with your v3 patch how it behaves? > > Why don't you remove has_value flag and do this in ReadReplicationSlot: > Datum values[5]; > bool nulls[5]; > MemSet(values, 0, sizeof(values)); > MemSet(nulls, 0, sizeof(nulls)); > > + dest = CreateDestReceiver(DestRemoteSimple); > + tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual); > + do_tup_output(tstate, values, nulls); > + end_tup_output(tstate); As for the API, I think it's cleaner to just send an empty result set instead of a null tuple in that case but I won't fight over it if there is consensus on having an all-nulls tuple value instead. There is indeed a bug, but not here, in the second patch: I still test the slot type even when we didn't fetch anything. So I will add a test for that too.
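The "copy only the fields you need while the lock is held" pattern from point 5 above can be shown with a self-contained stand-in; a pthread mutex replaces PostgreSQL's spinlock and the struct is a toy, not ReplicationSlot, so this is illustrative only:

```c
#include <assert.h>
#include <pthread.h>

/* Toy stand-in for the shared-memory slot structure. */
typedef struct
{
	pthread_mutex_t mutex;
	unsigned long	restart_lsn;
	unsigned long	confirmed_flush;
} ToySlot;

/*
 * Copy the required fields under the lock, then release it.  The
 * caller ends up with plain values and keeps no pointer into the
 * shared structure, so nothing can change under its feet afterwards.
 */
static void
read_slot_fields(ToySlot *slot,
				 unsigned long *restart_lsn,
				 unsigned long *confirmed_flush)
{
	pthread_mutex_lock(&slot->mutex);
	*restart_lsn = slot->restart_lsn;	/* copy values, not pointers */
	*confirmed_flush = slot->confirmed_flush;
	pthread_mutex_unlock(&slot->mutex);	/* lock held only for the copies */
}
```

Copying individual fields rather than the whole struct also keeps the critical section as short as possible, which matters for a spinlock.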
> > 8) I still have the question like, how can a client (pg_receivewal for > instance) know that it is the current owner/user of the slot it is > requesting the info? As I said upthread, why don't we send "active" > and "active_pid" fields of the pg_replication_slots view? > Also, it would be good to send the "wal_status" field so that the > client can check if the "wal_status" is not "lost"? As for pg_receivewal, it can only check that it's not active at that time, since we only acquire the replication slot once we know the start_lsn. For the basebackup case it's the same thing as we only want to check if it exists. As such, I didn't add them as I didn't see the need, but if it can be useful why not ? I will do that in the next version. > > 9) There are 2 new lines at the end of ReadReplicationSlot. We give > only one new line after each function de
Re: pg_receivewal starting position
Le samedi 28 août 2021, 14:10:25 CEST Bharath Rupireddy a écrit : > On Thu, Aug 26, 2021 at 5:45 PM Ronan Dunklau wrote: > > order to fail early if the replication slot doesn't exist, so please find > > attached v2 for that. > > Thanks for the patches. Here are some comments: > Thank you for this review ! Please see the other side of the thread where I answered Michael Paquier with a new patchset, which includes some of your proposed modifications. > 1) While the intent of these patches looks good, I have following > concern with new replication command READ_REPLICATION_SLOT: what if > the pg_receivewal exits (because user issued a SIGINT or for some > reason) after flushing the received WAL to disk, before it sends > sendFeedback to postgres server's walsender so that it doesn't get a > chance to update the restart_lsn in the replication slot via > PhysicalConfirmReceivedLocation. If the pg_receivewal is started > again, isn't it going to get the previous restart_lsn and receive the > last chunk of flushed WAL again? I've kept the existing directory as the source of truth if we have any WAL there already. If we don't, we fallback to the restart_lsn on the replication slot. So in the event that we start it again from somewhere else where we don't have access to those WALs anymore, we could be receiving it again, which in my opinion is better than losing everything in between in that case. > > 2) What is the significance of READ_REPLICATION_SLOT for logical > replication slots? I read above that somebody suggested to restrict > the walsender to handle READ_REPLICATION_SLOT for physical replication > slots so that the callers will see a command failure. But I tend to > think that it is clean to have this command common for both physical > and logical replication slots and the callers can have an Assert(type > == 'physical'). I've updated the patch to make it easy for the caller to check the slot's type and added a verification for those cases. 
In general, I tried to implement the meaning of the different fields exactly as it's done in the pg_replication_slots view. > > 3) Isn't it useful to send active, active_pid info of the replication > slot via READ_REPLICATION_SLOT? pg_receivewal can use Assert(active == > true && active_pid == getpid()) as an assertion to ensure that it is > the sole owner of the replication slot? Also, is it good to send > wal_status info Typically we read the slot before attaching to it, since what we want to do is check if it exists. It may be worthwhile to check that it's not already used by another backend though. > > 4) I think below messages should start with lower case letter and also > there are some typos: > + pg_log_warning("Could not fetch the replication_slot \"%s\" information " > + pg_log_warning("Server does not suport fetching the slot's position, " > something like: > + pg_log_warning("could not fetch replication slot \"%s\" information, " +"resuming from current server position instead", replication_slot); > + pg_log_warning("server does not support fetching replication slot > information, " +"resuming from current server position instead"); > I've rephrased it a bit in v3, let me know if that's what you had in mind. > 5) How about emitting the above messages in case of "verbose"? I think it is useful to warn the user even if not in the verbose case, but if the consensus is to move it to verbose only output I can change it. > > 6) How about an assertion like below? > + if (stream.startpos == InvalidXLogRecPtr) > + { > + stream.startpos = serverpos; > + stream.timeline = servertli; > + } > + > + Assert(stream.startpos != InvalidXLogRecPtr); Good idea. > > 7) How about we let pg_receivewal use READ_REPLICATION_SLOT as an option? From my point of view, I already expected it to use something like that when using a replication slot. Maybe an option to turn it off could be useful ?
> > 8) Just an idea, how about we store pg_receivewal's lastFlushPosition > in a file before pg_receivewal exits and compare it with the > restart_lsn that it received from the replication slot, if > lastFlushPosition == received_restart_lsn well and good, if not, then > something would have happened and we always start at the > lastFlushPosition ? The patch idea originally came from the fact that some utility use pg_receivewal to fetch WALs, and move them elsewhere. In that sense I don't really see what value this brings compared to the existing (and unmodified) way of computing the restart position from the already stored files ? Best regards, -- Ronan Dunklau
Re: pg_receivewal starting position
Le vendredi 27 août 2021, 05:44:32 CEST Michael Paquier a écrit : > On Thu, Aug 26, 2021 at 02:14:27PM +0200, Ronan Dunklau wrote: > > Following the discussion at [1], I refactored the implementation into > > streamutil and added a third patch making use of it in pg_basebackup > > itself in order to fail early if the replication slot doesn't exist, so > > please find attached v2 for that. > > Thanks for the split. That helps a lot. > Thank you very much for the review, please find attached an updated patchset. I've also taken into account some remarks made by Bharath Rupireddy. > + > + > /* > * Run IDENTIFY_SYSTEM through a given connection and give back to caller > > The patch series has some noise diffs here and there, you may want to > clean up that for clarity. Ok, sorry about that. > > + if (slot == NULL || !slot->in_use) > + { > + LWLockRelease(ReplicationSlotControlLock); > + > + ereport(ERROR, > + (errcode(ERRCODE_UNDEFINED_OBJECT), > LWLocks are released on ERROR, so no need for LWLockRelease() here. > Following your suggestion of not erroring out on a nonexistent slot, this point is no longer relevant, but thanks for pointing this out anyway. > + > + > + Read information about the named replication slot. This is > useful to determine which WAL location we should be asking the server > to start streaming at. > > A nit. You may want to be more careful with the indentation of the > documentation. Things are usually limited in width for readability. > More markups would be nice for the field names used in the > descriptions. Ok.
> + if (PQntuples(res) == 0) > + { > + pg_log_error("replication slot %s does not exist", > slot_name); + PQclear(0); > + return false; > So, the backend and ReadReplicationSlot() report an ERROR if a slot > does not exist but pg_basebackup's GetSlotInformation() does the same > if there are no tuples returned. That's inconsistent. Wouldn't it be > more instinctive to return a NULL tuple instead if the slot does not > exist to be able to check after real ERRORs in frontends using this > interface? The attached patch returns no tuple at all when the replication slot doesn't exist. I'm not sure if that's what you meant by returning a NULL tuple ? > A slot in use exists, so the error is a bit confusing here > anyway, no? From my understanding, a slot *not* in use doesn't exist anymore, as such I don't really understand this point. Could you clarify ? > > +* XXX: should we allow the caller to specify which target timeline it > wants +* ? > +*/ > What are you thinking about here? I was thinking that maybe instead of walking back the timeline history from where we currently are on the server, we could allow an additional argument for the client to specify which timeline it wants. But I guess a replication slot can not be present for a past, divergent timeline ? I have removed that suggestion. > > -# restarts of pg_receivewal will see this segment as full.. > +# restarts of pg_receivewal will see this segment as full../ > Typo. Ok. > > + TupleDescInitBuiltinEntry(tupdesc, (AttrNumber) 4, > "restart_lsn_timeline", + INT4OID, -1, 0); > + TupleDescInitBuiltinEntry(tupdesc, (AttrNumber) 5, > "confirmed_flush_lsn_timeline", + INT4OID, -1, > 0); > I would call these restart_tli and confirmed_flush_tli., without the > "lsn" part. Ok. > > The patch for READ_REPLICATION_SLOT could have some tests using a > connection that has replication=1 in some TAP tests. We do that in > 001_stream_rep.pl with SHOW, as one example. Ok. 
I added the physical part to 001_stream_rep.pl, using the protocol interface directly for creating / dropping the slot, and some tests for logical replication slots to 006_logical_decoding.pl. > > - 'slot0' > + 'slot0', '-p', > + "$port" > Something we are missing here? The thing we're missing here is a wrapper for command_fails_like. I've added this to PostgresNode.pm. Best regards, -- Ronan Dunklau>From 9fa01789f663975b
Re: pg_receivewal starting position
Le jeudi 29 juillet 2021, 11:09:40 CEST Ronan Dunklau a écrit : > Patch 0001 adds the new READ_REPLICATION_SLOT command. > It returns for a given slot the type, restart_lsn, flush_lsn, > restart_lsn_timeline and flush_lsn_timeline. > The timelines are determined by reading the current timeline history, and > finding the timeline where we may find the record. I didn't find an explicit > test for e.g. IDENTIFY_SYSTEM, so I didn't write one either for this new > command, but it is tested indirectly in patch 0002. > > Patch 0002 makes pg_receivewal use that command if we use a replication slot > and the command is available, and use the restart_lsn and > restart_lsn_timeline as a starting point. It also adds a small test to > check that we start back from the previous restart_lsn instead of the > current flush position when our destination directory does not contain any > WAL file. > > I also noticed we don't test following a timeline switch. It would probably > be good to add that, both for the case where we determine the previous > timeline from the archived segments and when it comes from the new command. > What do you think ? Following the discussion at [1], I refactored the implementation into streamutil and added a third patch making use of it in pg_basebackup itself in order to fail early if the replication slot doesn't exist, so please find attached v2 for that. Best regards, [1]: https://www.postgresql.org/message-id/flat/CAD21AoDYmv0yJMQnWtCx_kZGwVZnkQSTQ1re2JNSgM0k37afYQ%40mail.gmail.com -- Ronan Dunklau>From fff8786049326864d3ef8fe4539e1829f933f32f Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Wed, 28 Jul 2021 16:34:54 +0200 Subject: [PATCH v2 1/3] Add READ_REPLICATION_SLOT command. This commit introduces a new READ_REPLICATION_SLOT command. This command is used to read information about a replication slot when using a physical replication connection.
In this first version it returns the slot type, restart_lsn, flush_lsn and the timeline of the restart_lsn and flush_lsn, which are obtained by following the current timeline history. --- doc/src/sgml/protocol.sgml | 62 +++ src/backend/replication/repl_gram.y| 18 - src/backend/replication/repl_scanner.l | 1 + src/backend/replication/walsender.c| 106 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 10 +++ 6 files changed, 197 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index a232546b1d..6207171426 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2052,6 +2052,68 @@ The commands accepted in replication mode are: + +READ_REPLICATION_SLOT slot_name + READ_REPLICATION_SLOT + + + + Read information about the named replication slot. This is useful to determine which WAL location we should be asking the server to start streaming at. + + + In response to this command, the server will return a one-row result set, containing the following fields: + + +type (text) + + + The replication slot's type, either physical or logical + + + + + +restart_lsn (text) + + + The replication slot's restart_lsn. + + + + + +confirmed_flush_lsn (text) + + + The replication slot's confirmed_flush LSN. + + + + + +restart_lsn_timeline (int4) + + + The timeline ID for the restart_lsn position, when following the current timeline + history + + + + + +confirmed_flush_lsn_timeline (int4) + + + The timeline ID for the confirmed_flush_lsn position, when following the current timeline + history + + + + + + + + START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ] START_REPLICATION diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index e1e8ec29cc..7298f44008 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -64,6 +64,7 @@ static SQLCmd *make_sqlcmd(void); /* Keyword tokens. 
*/ %token K_BASE_BACKUP %token K_IDENTIFY_SYSTEM +%token K_READ_REPLICATION_SLOT %token K_SHOW %token K_START_REPLICATION %token K_CREATE_REPLICATION_SLOT @@ -94,7 +95,7 @@ static SQLCmd *make_sqlcmd(void); %type command %type base_backup start_replication start_logical_replication create_replication_slot drop_replica
Re: Proposal: More structured logging
Le vendredi 20 août 2021, 11:31:21 CEST Ronan Dunklau a écrit : > Le jeudi 19 août 2021, 15:04:30 CEST Alvaro Herrera a écrit : > > On 2021-Aug-13, Ronan Dunklau wrote: > > > ereport(NOTICE, > > > > > > (errmsg("My log message")), > > > (errtag("EMITTER", "MYEXTENSION")), > > > (errtag("MSG-ID", "%d", error_message_id)) > > > > > > ); > > > > Interesting idea. I agree this would be useful. > > > > > Please find attached a very small POC patch to better demonstrate what I > > > propose. > > > > Seems like a good start. I think a further step towards a committable > > patch would include a test module that uses the new tagging > > functionality. > > Please find attached the original patch + a new one adding a test module. > The test module exposes a function for generating logs with tags, and a log > hook which format the tags in the DETAIL field for easy regression testing. > > Next step would be to emit those tags in the CSV logs. I'm not sure what > representation they should have though: a nested csv in it's own column ? A > key => value pairs list similar to hstore ? A json object ? I opted for a JSON representation in the CSV logs, please find attached v3 which is the same thing as v2 with another patch for CSV log output. > > Also we should probably make this available to the client too, but once > again the format of the tag field needs to be defined. Any opinion ? 
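For the CSV output, the (tag, value) pairs can be rendered as a small JSON object. A simplified sketch reusing the patch's ErrorTag field names, with string escaping deliberately omitted, so this is illustrative only and not the v3 patch's actual serializer:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Field names match the ErrorTag struct introduced by the patch. */
typedef struct
{
	const char *tagname;
	const char *tagvalue;
} ErrorTag;

/*
 * Render tags as {"NAME": "value", ...}.  Real code would escape
 * quotes and backslashes inside values; omitted here for brevity.
 */
static void
tags_to_json(char *buf, size_t buflen, const ErrorTag *tags, int ntags)
{
	size_t		off = 0;

	off += snprintf(buf + off, buflen - off, "{");
	for (int i = 0; i < ntags; i++)
		off += snprintf(buf + off, buflen - off, "%s\"%s\": \"%s\"",
						i > 0 ? ", " : "", tags[i].tagname, tags[i].tagvalue);
	snprintf(buf + off, buflen - off, "}");
}
```

A JSON object has the advantage over a nested CSV or hstore-like syntax that virtually every log-consuming tool can already parse it.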
-- Ronan Dunklau>From e5af5d1a07e65926eee90e0d18443a673d1fcba8 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v3 1/3] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index a3e1c59a82..5b9b1b8a72 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -465,6 +465,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -516,6 +517,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -621,7 +623,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1192,6 +1205,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f53607e12e..1c490d1b11 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...) pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error co
Re: Proposal: More structured logging
Le jeudi 19 août 2021, 16:50:10 CEST Alvaro Herrera a écrit : > On 2021-Aug-19, Magnus Hagander wrote: > > Another thing I've noticed in more and more other products is to be > > able to log as json, which is then later thrown into a central logging > > system somewhere. Basically like csv, but with the schema defined in > > each row. Yes, a lot more overhead, but some systems really do like to > > consume json So when we're on the topic of more structured > > logging... > > Yeah, I was thinking in json logging too -- specifically thinking about > Fluentbit and similar tools. Michael, your jsonlog module already fulfills this need. Is it something that should be merged into our tree ? -- Ronan Dunklau
Re: Proposal: More structured logging
Le jeudi 19 août 2021, 15:04:30 CEST Alvaro Herrera a écrit : > On 2021-Aug-13, Ronan Dunklau wrote: > > ereport(NOTICE, > > > > (errmsg("My log message")), > > (errtag("EMITTER", "MYEXTENSION")), > > (errtag("MSG-ID", "%d", error_message_id)) > > > > ); > > Interesting idea. I agree this would be useful. > > > Please find attached a very small POC patch to better demonstrate what I > > propose. > > Seems like a good start. I think a further step towards a committable > patch would include a test module that uses the new tagging > functionality. Please find attached the original patch + a new one adding a test module. The test module exposes a function for generating logs with tags, and a log hook which formats the tags in the DETAIL field for easy regression testing. The next step would be to emit those tags in the CSV logs. I'm not sure what representation they should have though: a nested csv in its own column ? A key => value pairs list similar to hstore ? A json object ? Also we should probably make this available to the client too, but once again the format of the tag field needs to be defined. Any opinion ? 
-- Ronan Dunklau>From e5af5d1a07e65926eee90e0d18443a673d1fcba8 Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v2 1/3] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index a3e1c59a82..5b9b1b8a72 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -465,6 +465,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -516,6 +517,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -621,7 +623,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1192,6 +1205,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f53607e12e..1c490d1b11 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_printf(1, 2); extern int errhint_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n,...) pg_attribute_printf(1, 4) pg_attribute_printf(2, 4); +extern int errtag(const char *tag, const char *fmt_value,...) pg_attribute_printf(2, 3); + /* * errcontext() is typically called in error context callback functions, not * within an ereport() invocation. The callback function can be in a different @@ -395,11 +398,18 @@ typedef struct ErrorData int internalpos; /* cursor index into internalquery */ char *internalquery; /* text of internally-generated query */ int saved_errno; /* errno at entry */ + List *tags; /* List of
Proposal: More structured logging
Hello, The logging system already captures a lot of information in the ErrorData. But at present there is no way for log message authors to include more metadata about the log outside of the log message itself. For example, including the extension name, which can be useful for filtering / dispatching log messages. This can be done in the log message itself, but that requires specialized parsing in the log output. Even though among the available loggers in core, only the csv logger would be able to make sense of it, I feel it would be beneficial to add a tagging system to logs, by adding different (tag, value) combinations to a log entry, with an API similar to what we do for other ErrorData fields: ereport(NOTICE, (errmsg("My log message")), (errtag("EMITTER", "MYEXTENSION")), (errtag("MSG-ID", "%d", error_message_id)) ); Please find attached a very small POC patch to better demonstrate what I propose. Third party logging hooks could then exploit those values to output them correctly. For example the json logger written by Michael Paquier here: https://github.com/michaelpq/pg_plugins/tree/master/jsonlog, or the seemingly-abandoned journald hook here: https://github.com/intgr/pg_journal could add this information in a structured way. I think the pgaudit extension (or something similar) could also make good use of such a feature instead of dumping a CSV entry in the log file. As for Postgres' own log messages, I'm sure we could find practical use cases for tagging messages like this. On a related note, the only structured logger we have in-core is the CSV logger. Many users nowadays end up feeding the logs to journald either by capturing the stderr output with systemd, or by having syslog implemented by journald itself. Would there be any interest in having native journald support as a logging_destination ? 
Best regards, -- Ronan Dunklau>From 270ffc5ed2fbe5f5076bddee14c5fb3555b87e4f Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Fri, 13 Aug 2021 15:03:18 +0200 Subject: [PATCH v1 1/2] Add ErrorTag support --- src/backend/utils/error/elog.c | 48 ++ src/include/utils/elog.h | 10 +++ 2 files changed, 58 insertions(+) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index a3e1c59a82..5b9b1b8a72 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -465,6 +465,7 @@ errstart(int elevel, const char *domain) edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION; /* errno is saved here so that error parameter eval can't change it */ edata->saved_errno = errno; + edata->tags = NIL; /* * Any allocations for this error state level should go into ErrorContext @@ -516,6 +517,7 @@ errfinish(const char *filename, int lineno, const char *funcname) int elevel; MemoryContext oldcontext; ErrorContextCallback *econtext; + ListCell *lc; recursion_depth++; CHECK_STACK_DEPTH(); @@ -621,7 +623,18 @@ errfinish(const char *filename, int lineno, const char *funcname) pfree(edata->constraint_name); if (edata->internalquery) pfree(edata->internalquery); + /* Every tag should have been palloc'ed */ + if (edata->tags != NIL) + { + foreach(lc, edata->tags) + { + ErrorTag *tag = (ErrorTag *) lfirst(lc); + pfree(tag->tagvalue); + pfree(tag); + } + pfree(edata->tags); + } errordata_stack_depth--; /* Exit error-handling context */ @@ -1192,6 +1205,41 @@ errhint_plural(const char *fmt_singular, const char *fmt_plural, return 0; /* return value does not matter */ } +int +errtag(const char *tag, const char *fmt_value,...) 
+{ + ErrorData *edata = &errordata[errordata_stack_depth]; + ErrorTag *etag; + MemoryContext oldcontext; + StringInfoData buf; + + recursion_depth++; + CHECK_STACK_DEPTH(); + oldcontext = MemoryContextSwitchTo(edata->assoc_context); + etag = palloc(sizeof(ErrorTag)); + etag->tagname = tag; + initStringInfo(&buf); + for (;;) + { + va_list args; + int needed; + + errno = edata->saved_errno; + va_start(args, fmt_value); + needed = appendStringInfoVA(&buf, fmt_value, args); + va_end(args); + if (needed == 0) + break; + enlargeStringInfo(&buf, needed); + } + etag->tagvalue = pstrdup(buf.data); + edata->tags = lappend(edata->tags, etag); + pfree(buf.data); + MemoryContextSwitchTo(oldcontext); + recursion_depth--; + return 0; +} + /* * errcontext_msg --- add a context error message text to the current error diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f53607e12e..1c490d1b11 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -15,6 +15,7 @@ #define ELOG_H #include +#include "nodes/pg_list.h" /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of @@ -193,6 +194,8 @@ extern int errhint(const char *fmt,...) pg_attribute_print
Re: Minimal logical decoding on standbys
Le lundi 2 août 2021, 17:31:46 CEST Drouvot, Bertrand a écrit : > > I think the behaviour when dropping a database on the primary should be > > documented, and proper procedures for handling it correctly should be > > suggested. > > > > Something along the lines of: > > > > "If a database is dropped on the primary server, the logical replication > > slot on the standby will be dropped as well. This means that you should > > ensure that the client usually connected to this slot has had the > > opportunity to stream the latest changes before the database is dropped." > > I am not sure we should highlight this as part of this patch. > > I mean, the same would currently occur on a non standby if you drop a > database that has a replication slot linked to it. The way I see it, the main difference is that an additional object is dropped on the standby, one that doesn't exist on the primary and that you don't necessarily have any knowledge of. As such, I thought it would be better to be explicit about it, to warn users of that possible case. Regards, -- Ronan Dunklau
Re: Minimal logical decoding on standbys
Le mardi 27 juillet 2021, 09:23:48 CEST Drouvot, Bertrand a écrit : > Thanks for the warning, rebase done and new v21 version attached. > > Bertrand Hello, I've taken a look at this patch, and it looks like you addressed every prior remark, including the race condition Andres was worried about. As for the basics: make check-world and make installcheck-world pass. I think the behaviour when dropping a database on the primary should be documented, and proper procedures for handling it correctly should be suggested. Something along the lines of: "If a database is dropped on the primary server, the logical replication slot on the standby will be dropped as well. This means that you should ensure that the client usually connected to this slot has had the opportunity to stream the latest changes before the database is dropped." As for the patches themselves, I only have two small comments to make. In patch 0002, in InvalidateConflictingLogicalReplicationSlots, I don't see the need to check for an InvalidOid since we already check SlotIsLogical: + /* We are only dealing with *logical* slot conflicts. */ + if (!SlotIsLogical(s)) + continue; + + /* not our database and we don't want all the database, skip */ + if ((s->data.database != InvalidOid && s->data.database != dboid) && TransactionIdIsValid(xid)) + continue; In patch 0004, small typo in the test file: +## + # Test standby promotion and logical decoding bheavior +# after the standby gets promoted. +## Thank you for working on this ! Regards, -- Ronan Dunklau
Re: pg_receivewal starting position
Le jeudi 29 juillet 2021, 07:32:37 CEST Kyotaro Horiguchi a écrit : > I didn't thought in details. But I forgot that ordinary SQL commands > have been prohibited in physical replication connection. So we need a > new replication command but it's not that a big deal. Thank you for your feedback ! > > > I don't see any reason not to make it work for logical replication > > connections / slots, but it wouldn't be that useful since we can query > > the database in that case. > > Ordinary SQL queries are usable on a logical replication slot so > I'm not sure how logical replication connection uses the command. > However, like you, I wouldn't bother restricting the command to > physical replication, but perhaps the new command should return the > slot type. > Ok done in the attached patch. > > I'm not sure it's worth adding complexity for such strictness. > START_REPLICATION safely fails if someone steals the slot meanwhile. > In the first place there's no means to protect a slot from others > while idle. One possible problem is the case where START_REPLICATION > successfully acquire the slot after the new command failed. But that > case doesn't seem worse than the case someone advances the slot while > absence. So I think READ_REPLICATION_SLOT is sufficient. > Ok, I implemented it like this. I tried to follow the pg_get_replication_slots approach with regards to how to prevent concurrent modification while reading the slot. > > From pg_receivewal point of view, this would amount to: > > - check if we currently have wal in the target directory. > > > >- if we do, proceed as currently done, by computing the start lsn and > > > > timeline from the last archived wal > > > > - if we don't, and we have a slot, run ACQUIRE_REPLICATION_SLOT. Use the > > > > restart_lsn as the start lsn if there is one, and don't provide a timeline > > > > - if we still don't have a start_lsn, fallback to using the current > > server > > > > wal position as is done. > > That's pretty much it. 
Great. > > What do you think ? Which information should we provide about the slot ? > We need the timeline id to start with when using restart_lsn. The > current timeline can be used in most cases but there's a case where > the LSN is historical. Ok, see below. > > pg_receivewal doesn't send a replication status report when a segment > is finished. So after pg_receivewal stops just after a segment is > finished, the slot stays at the beginning of the last segment. Thus > next time it will start from there, creating a duplicate segment. I'm not sure I see where the problem is here. If we don't keep the segments in pg_receivewal's target directory, then it would be the responsibility of whoever moved them to make sure we don't have duplicates, or to handle them gracefully. Even if we were forcing feedback after a segment is finished, there could still be a problem if the feedback never made it to the server but the segment was there. It might be interesting to send feedback anyway. Please find attached two patches implementing what we've been discussing. Patch 0001 adds the new READ_REPLICATION_SLOT command. It returns for a given slot the type, restart_lsn, flush_lsn, restart_lsn_timeline and flush_lsn_timeline. The timelines are determined by reading the current timeline history, and finding the timeline where we may find the record. I didn't find an explicit test for e.g. IDENTIFY_SYSTEM, so I didn't write one for this new command either; it is tested indirectly in patch 0002. Patch 0002 makes pg_receivewal use that command if we use a replication slot and the command is available, and use the restart_lsn and restart_lsn_timeline as a starting point. It also adds a small test to check that we start back from the previous restart_lsn instead of the current flush position when our destination directory does not contain any WAL file. I also noticed we don't test following a timeline switch. 
It would probably be good to add that, both for the case where we determine the previous timeline from the archived segments and when it comes from the new command. What do you think ? Regards, -- Ronan Dunklau>From 37d9545c05b9e36aafac751f9dc549e75798413c Mon Sep 17 00:00:00 2001 From: Ronan Dunklau Date: Wed, 28 Jul 2021 16:34:54 +0200 Subject: [PATCH v1 1/2] Add READ_REPLICATION_SLOT command. This commit introduces a new READ_REPLICATION_SLOT command. This command is used to read information about a replication slot when using a physical replication connection. In this first version it returns the slot type, restart_lsn, flush_lsn and the timeline of the restart_lsn and flush_lsn, which are obtained by following the current timeline history. --- doc/src/sgml/protocol.sg