Re: AIX support - alignment issues
Andres Freund writes:
> I just thought an easier way - why don't we introduce a 'catalog_double'
> that's defined to be pg_attribute_aligned(whatever-we-need) on AIX? Then we
> can get rid of the manually enforced alignedness and we don't need to contort
> catalog order.

Hm, do all the AIX compilers we care about have support for that?
If so, it seems like a great idea.

			regards, tom lane
Re: [PATCH] Add result_types column to pg_prepared_statements view
On 01.07.22 14:27, Dagfinn Ilmari Mannsåker wrote:
> Peter Eisentraut writes:
> > On 19.05.22 17:34, Dagfinn Ilmari Mannsåker wrote:
> > > Prompted by a question on IRC, here's a patch to add a result_types
> > > column to the pg_prepared_statements view, so that one can see the
> > > types of the columns returned by a prepared statement, not just the
> > > parameter types. I'm not quite sure about the column name,
> > > suggestions welcome.
> >
> > I think this patch is sensible. The arguments about client-side
> > type-specific value handling for RowDescription don't apply here IMO,
> > since this view is more user-facing.
>
> I agree. It's also easy to change if needed.

Committed as is.
Re: AIX support - alignment issues
Hi,

On 2022-07-02 11:33:54 -0700, Andres Freund wrote:
> If we decide we want to continue supporting AIX we should bite the bullet and
> add a 64bit-int TYPALIGN_*. It might be worth to translate that to bytes when
> building tupledescs, so we don't need more branches (reducing them compared to
> today).

I just thought an easier way - why don't we introduce a 'catalog_double'
that's defined to be pg_attribute_aligned(whatever-we-need) on AIX? Then we
can get rid of the manually enforced alignedness and we don't need to contort
catalog order.

Greetings,

Andres Freund
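The proposal above can be sketched standalone. This is hypothetical illustration code, not from any patch: the "catalog_double" name comes from the mail, pg_attribute_aligned() is shown expanded to the GCC-style attribute it wraps, and the struct is a toy example of the effect the typedef would have where plain double is only 4-byte aligned inside structs (as on AIX).

```c
#include <stdalign.h>
#include <stddef.h>

/*
 * Hypothetical "catalog_double": a double forced to 8-byte alignment.
 * PostgreSQL's pg_attribute_aligned(8) expands to the attribute below on
 * GCC-style compilers; the open question in the thread is whether the AIX
 * compilers of interest support it too.
 */
typedef double catalog_double __attribute__((aligned(8)));

/*
 * Toy catalog row: under AIX power alignment rules a plain double member
 * could land at offset 4 after the int; the aligned typedef makes the
 * 8-byte boundary explicit instead of relying on manual column ordering.
 */
typedef struct
{
	int			id;
	catalog_double val;
} toy_catalog_row;
```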
Re: AIX support - alignment issues
Thomas Munro writes:
> On Mon, Jul 4, 2022 at 12:08 PM Tom Lane wrote:
> > I would not stand in the way of dropping HP-UX and IA64 support as of
> > v16. (I do still feel that HPPA is of interest, to keep us honest
> > about spinlock support --- but that dual-stack arrangement that IA64
> > uses is surely not part of anyone's future.)
>
> I tried to find everything relating to HP-UX, aCC, ia64 and hppa. Or
> do you still want to keep the hppa bits for NetBSD (I wasn't sure if
> your threat to set up a NetBSD/hppa system was affected by the
> hardware failure you mentioned)?

No, the hardware failure is that the machine's SCSI controller seems
to be fried, thus internal drives no longer accessible. I have a working
NetBSD-current installation on an external USB drive, and plan to
commission it as a buildfarm animal once NetBSD 10 is officially branched.
It'll be a frankencritter of the first order, because USB didn't exist
when the machine was built, but hey...

			regards, tom lane
Re: AIX support - alignment issues
On Mon, Jul 4, 2022 at 12:08 PM Tom Lane wrote:
> I would not stand in the way of dropping HP-UX and IA64 support as of
> v16. (I do still feel that HPPA is of interest, to keep us honest
> about spinlock support --- but that dual-stack arrangement that IA64
> uses is surely not part of anyone's future.)

I tried to find everything relating to HP-UX, aCC, ia64 and hppa. Or
do you still want to keep the hppa bits for NetBSD (I wasn't sure if
your threat to set up a NetBSD/hppa system was affected by the
hardware failure you mentioned)? Or just leave it in there in orphaned
hall-of-fame state, like m68k, m88k, Vax?

From 78d5a122fef28e2dbfd307a584dbda5e830734cb Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 4 Jul 2022 16:24:16 +1200
Subject: [PATCH] Remove HP-UX, aCC, ia64 and hppa support.

HP-UX hardware is no longer produced. This removes support for:

HP-UX, the operating system.
HP aCC, the compiler.
HP/Intel Itanium (ia64), the CPU architecture discontinued in 2021.
HP PA-RISC (hppa), the CPU architecture discontinued in 2008.
--- configure | 99 +-- configure.ac | 8 -- contrib/pgcrypto/crypt-blowfish.c | 2 +- doc/src/sgml/dfunc.sgml| 29 - doc/src/sgml/installation.sgml | 3 +- doc/src/sgml/regress.sgml | 6 +- doc/src/sgml/runtime.sgml | 19 --- src/Makefile.shlib | 33 - src/backend/libpq/ifaddr.c | 12 +- src/backend/port/hpux/tas.c.template | 40 --- src/backend/port/tas/hpux_hppa.s | 28 - src/backend/tcop/postgres.c| 74 src/backend/utils/misc/ps_status.c | 17 --- src/common/sprompt.c | 2 +- src/include/miscadmin.h| 8 -- src/include/pg_config.h.in | 6 - src/include/port/atomics.h | 6 - src/include/port/atomics/arch-hppa.h | 17 --- src/include/port/atomics/arch-ia64.h | 29 - src/include/port/atomics/fallback.h| 12 -- src/include/port/atomics/generic-acc.h | 106 src/include/port/hpux.h| 3 - src/include/storage/s_lock.h | 160 - src/makefiles/Makefile.hpux| 47 src/pl/plperl/ppport.h | 2 +- src/port/dlopen.c | 50 +--- src/port/getrusage.c | 1 - src/template/hpux | 34 -- src/test/regress/resultmap | 1 - src/tools/msvc/Solution.pm | 2 - src/tools/pginclude/cpluspluscheck | 3 - src/tools/pginclude/headerscheck | 3 - 32 files changed, 13 insertions(+), 849 deletions(-) delete mode 100644 src/backend/port/hpux/tas.c.template delete mode 100644 src/backend/port/tas/hpux_hppa.s delete mode 100644 src/include/port/atomics/arch-hppa.h delete mode 100644 src/include/port/atomics/arch-ia64.h delete mode 100644 src/include/port/atomics/generic-acc.h delete mode 100644 src/include/port/hpux.h delete mode 100644 src/makefiles/Makefile.hpux delete mode 100644 src/template/hpux diff --git a/configure b/configure index fb07cd27d9..91b7b185f9 100755 --- a/configure +++ b/configure @@ -2994,7 +2994,6 @@ case $host_os in darwin*) template=darwin ;; dragonfly*) template=netbsd ;; freebsd*) template=freebsd ;; -hpux*) template=hpux ;; linux*|gnu*|k*bsd*-gnu) template=linux ;; mingw*) template=win32 ;; @@ -6856,100 +6855,6 @@ if test x"$pgac_cv_prog_CXX_cxxflags__qlonglong" = x"yes"; then fi -elif test "$PORTNAME" = 
"hpux"; then - # On some versions of HP-UX, libm functions do not set errno by default. - # Fix that by using +Olibmerrno if the compiler recognizes it. - -{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports +Olibmerrno, for CFLAGS" >&5 -$as_echo_n "checking whether ${CC} supports +Olibmerrno, for CFLAGS... " >&6; } -if ${pgac_cv_prog_CC_cflags_pOlibmerrno+:} false; then : - $as_echo_n "(cached) " >&6 -else - pgac_save_CFLAGS=$CFLAGS -pgac_save_CC=$CC -CC=${CC} -CFLAGS="${CFLAGS} +Olibmerrno" -ac_save_c_werror_flag=$ac_c_werror_flag -ac_c_werror_flag=yes -cat confdefs.h - <<_ACEOF >conftest.$ac_ext -/* end confdefs.h. */ - -int -main () -{ - - ; - return 0; -} -_ACEOF -if ac_fn_c_try_compile "$LINENO"; then : - pgac_cv_prog_CC_cflags_pOlibmerrno=yes -else - pgac_cv_prog_CC_cflags_pOlibmerrno=no -fi -rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext -ac_c_werror_flag=$ac_save_c_werror_flag -CFLAGS="$pgac_save_CFLAGS" -CC="$pgac_save_CC" -fi -{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags_pOlibmerrno" >&5 -$as_echo "$pgac_cv_prog_CC_cflags_pOlibmerrno" >&6; } -if test x"$pgac_cv_prog_CC_cflags_pOlibmerrno" = x"yes"; then - CFLAGS="${CFLAGS} +Olibmerrno" -fi - - - { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports +Olibmerrno,
Re: Re-order "disable_on_error" in tab-complete COMPLETE_WITH
On Tue, Jul 5, 2022 at 4:03 AM Peter Smith wrote:
>
> On Mon, Jul 4, 2022 at 10:29 PM Euler Taveira wrote:
> >
> > On Mon, Jul 4, 2022, at 5:37 AM, Amit Kapila wrote:
> >
> > Yeah, it seems we have overlooked this point. I think we can do this
> > just for HEAD but as the feature is introduced in PG-15 so there is no
> > harm in pushing it to PG-15 as well especially because it is a
> > straightforward change. What do you or others think?
> >
> > No objection. It is a good thing for future backpatches.
> >
> Since there is no function change or bugfix here I thought it was only
> applicable for HEAD. This change is almost in the same category as a
> code comment typo patch - do those normally get backpatched? - maybe
> follow the same convention here. OTOH, if you think it may be helpful
> for future backpatches then I am also fine if you wanted to push it to
> PG15.

It can help if there is any bug-fix in the same code path or if some
other code adjustment in the same area is required in the back branch.
I feel the chances of both are less but I just wanted to keep the code
consistent for such a possibility. Anyway, I'll wait for a day or so
and see if anyone has objections to it.

--
With Regards,
Amit Kapila.
Re: CFM Manager
On Tue, Jul 05, 2022 at 08:17:26AM +0500, Ibrar Ahmed wrote:
> If nobody has already volunteered for the next upcoming commitfest.
> I'd like to volunteer. I think early to say is better as always.

Jacob and Daniel have already volunteered. Based on the number of
patches at hand (305 in total), getting more help is always welcome,
I guess.

--
Michael
Linking issue with libldap on morepork (OpenBSD 6.9)
Hi all,

While looking at the buildfarm logs, I have noticed the following thing:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=morepork=2022-07-05%2002%3A45%3A32=recovery-check

postgres:/usr/local/lib/libldap_r.so.13.2: /usr/local/lib/libldap.so.13.2 : WARNING: symbol(ldap_int_global_options) size mismatch, relink your program

That seems to be pretty ancient, not related to aff45c8 as logs from
two months ago also show this warning.

Thanks,
--
Michael
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada wrote:
>
> On Tue, Jun 28, 2022 at 10:10 PM John Naylor wrote:
> >
> > On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada wrote:
> > >
> > > > I suspect other optimizations would be worth a lot more than using AVX2:
> > > > - collapsing inner nodes
> > > > - taking care when constructing the key (more on this when we
> > > > integrate with VACUUM)
> > > > ...and a couple Andres mentioned:
> > > > - memory management: in
> > > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > > - node dispatch:
> > > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > > >
> > > > Therefore, I would suggest that we use SSE2 only, because:
> > > > - portability is very easy
> > > > - to avoid a performance hit from indirecting through a function pointer
> > >
> > > Okay, I'll try these optimizations and see if the performance becomes
> > > better.
> >
> > FWIW, I think it's fine if we delay these until after committing a
> > good-enough version. The exception is key construction and I think
> > that deserves some attention now (more on this below).
>
> Agreed.
>
> > > I've done benchmark tests while changing the node types. The code base
> > > is v3 patch that doesn't have the optimization you mentioned below
> > > (memory management and node dispatch) but I added the code to use SSE2
> > > for node-16 and node-32.
> >
> > Great, this is helpful to visualize what's going on!
> >
> > > * sse2_4_16_48_256
> > >   * nkeys = 9091, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> > >   * nkeys = 2, height = 3, n4 = 2, n16 = 0, n48 = 207, n256 = 50
> > >
> > > * sse2_4_32_128_256
> > >   * nkeys = 9091, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> > >   * nkeys = 2, height = 3, n4 = 2, n32 = 48, n128 = 208, n256 = 1
> > >
> > > Observations are:
> > >
> > > In both test cases, There is not much difference between using AVX2
> > > and SSE2. The more mode types, the more time it takes for loading the
> > > data (see sse2_4_16_32_128_256).
> >
> > Good to know. And as Andres mentioned in his PoC, more node types
> > would be a barrier for pointer tagging, since 32-bit platforms only
> > have two spare bits in the pointer.
> >
> > > In dense case, since most nodes have around 100 children, the radix
> > > tree that has node-128 had a good figure in terms of memory usage. On
> >
> > Looking at the node stats, and then your benchmark code, I think key
> > construction is a major influence, maybe more than node type. The
> > key/value scheme tested now makes sense:
> >
> > blockhi || blocklo || 9 bits of item offset
> >
> > (with the leaf nodes containing a bit map of the lowest few bits of
> > this whole thing)
> >
> > We want the lower fanout nodes at the top of the tree and higher
> > fanout ones at the bottom.
>
> So more inner nodes can fit in CPU cache, right?
>
> > Note some consequences: If the table has enough columns such that much
> > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > dense case the nodes above the leaves will have lower fanout (maybe
> > they will fit in a node32). Also, the bitmap values in the leaves will
> > be more empty. In other words, many tables in the wild *resemble* the
> > sparse case a bit, even if truly all tuples on the page are dead.
> >
> > Note also that the dense case in the benchmark above has ~4500 times
> > more keys than the sparse case, and uses about ~1000 times more
> > memory. But the runtime is only 2-3 times longer. That's interesting
> > to me.
> >
> > To optimize for the sparse case, it seems to me that the key/value would be
> >
> > blockhi || 9 bits of item offset || blocklo
> >
> > I believe that would make the leaf nodes more dense, with fewer inner
> > nodes, and could drastically speed up the sparse case, and maybe many
> > realistic dense cases.
>
> Does it have an effect on the number of inner nodes?
>
> > I'm curious to hear your thoughts.
>
> Thank you for your analysis. It's worth trying. We use 9 bits for item
> offset but most pages don't use all bits in practice. So probably it
> might be better to move the most significant bit of item offset to the
> left of blockhi. Or more simply:
>
> 9 bits of item offset || blockhi || blocklo
>
> > > the other hand, the radix tree that doesn't have node-128 has a better
> > > number in terms of insertion performance. This is probably because we
> > > need to iterate over 'isset' flags from the beginning of the array in
> > > order to find an empty slot when inserting new data. We do the same
> > > thing also for node-48 but it was better than node-128 as it's up to
> > > 48.
> >
> > I mentioned in my diff, but for those following along, I think we can
> > improve that by iterating over the bytes and if it's 0xFF all 8 bits
> > are set already so keep looking...
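For readers following along, the competing key layouts in this exchange can be sketched as bit-packing helpers. This is a standalone illustration with made-up function names, not code from the patch; it assumes a 32-bit block number and item offsets below 512 (i.e. 9 bits), as discussed above.

```c
#include <stdint.h>

/*
 * Current scheme: blockhi || blocklo || 9 bits of item offset.
 * Keys for different offsets on the same block are adjacent, which packs
 * leaf bitmaps tightly when most tuples on a page are dead (dense case).
 */
static inline uint64_t
key_block_major(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << 9) | offset;
}

/*
 * Sawada's simpler variant: 9 bits of item offset || blockhi || blocklo.
 * Keys for the same offset on consecutive blocks are adjacent, which should
 * densify leaves when there are few dead tuples per page across many pages
 * (sparse case).
 */
static inline uint64_t
key_offset_major(uint32_t block, uint16_t offset)
{
	return ((uint64_t) offset << 32) | block;
}
```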
Re: pgsql: dshash: Add sequential scan support.
On Tue, Jul 5, 2022 at 11:25 AM Andres Freund wrote:
> On 2022-07-05 11:20:54 +1200, Thomas Munro wrote:
> > Since there were 6 places with I-hold-no-lock assertions, I shoved the
> > loop into a function so I could do:
> >
> > - Assert(!status->hash_table->find_locked);
> > + assert_no_lock_held_by_me(hash_table);
>
> I am a *bit* wary about the costs of that, even in assert builds - each of the
> partition checks in the loop will in turn need to iterate through
> held_lwlocks. But I guess we can also just later weaken them if it turns out
> to be a problem.

Maybe we should add assertion support for arrays of locks, so we don't
need two levels of loop? Something like the attached?

From d98874ab15fc568b3da8a6395e0a0ef02855a880 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 4 Jul 2022 11:56:02 +1200
Subject: [PATCH v3] Fix lock assertions in dshash.c.

dshash.c maintained flags to track locking, in order to make assertions
that the dshash API was being used correctly. Unfortunately the interaction
with ereport's non-local exits was not thought through carefully enough and
the flag could get out of sync with reality.

Since these flags were primarily used for assertions that no lock was held,
we can get most of the intended effect by making an assertion about the
partition locks directly. To avoid N*M complexity while asserting that we
hold no partition locks, add a new debugging-only interface
LWLockAnyHeldByMe(), which can check if any lock in an array is held
without the need for a loop within a loop.

This problem was noted by Tom and Andres while reviewing changes to support
the new shared memory stats system in release 15, the second user of
dshash.c. On closer inspection, even the earlier typcache.c code was not
guaranteed to meet the asserted conditions, now that it's been highlighted
that even dsa_get_address() can throw (albeit in unlikely circumstances).
Back-patch to 11, where dshash.c arrived.
Reported-by: Tom Lane Reported-by: Andres Freund Reviewed-by: Kyotaro HORIGUCHI Reviewed-by: Zhihong Yu Reviewed-by: Andres Freund Discussion: https://postgr.es/m/20220311012712.botrpsikaufzt...@alap3.anarazel.de --- src/backend/lib/dshash.c | 44 +++ src/backend/storage/lmgr/lwlock.c | 26 ++ src/include/storage/lwlock.h | 1 + 3 files changed, 37 insertions(+), 34 deletions(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index ec454b4d65..c5c032a593 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -110,8 +110,6 @@ struct dshash_table dshash_table_control *control; /* Control object in DSM. */ dsa_pointer *buckets; /* Current bucket pointers in DSM. */ size_t size_log2; /* log2(number of buckets) */ - bool find_locked; /* Is any partition lock held by 'find'? */ - bool find_exclusively_locked; /* ... exclusively? */ }; /* Given a pointer to an item, find the entry (user data) it holds. */ @@ -194,6 +192,10 @@ static inline bool equal_keys(dshash_table *hash_table, #define PARTITION_LOCK(hash_table, i) \ (&(hash_table)->control->partitions[(i)].lock) +#define ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table) \ + Assert(!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock, \ + DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))) + /* * Create a new hash table backed by the given dynamic shared area, with the * given parameters. The returned object is allocated in backend-local memory @@ -234,9 +236,6 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg) } } - hash_table->find_locked = false; - hash_table->find_exclusively_locked = false; - /* * Set up the initial array of buckets. Our initial size is the same as * the number of partitions. 
@@ -285,8 +284,6 @@ dshash_attach(dsa_area *area, const dshash_parameters *params, hash_table->params = *params; hash_table->arg = arg; hash_table->control = dsa_get_address(area, control); - hash_table->find_locked = false; - hash_table->find_exclusively_locked = false; Assert(hash_table->control->magic == DSHASH_MAGIC); /* @@ -309,7 +306,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params, void dshash_detach(dshash_table *hash_table) { - Assert(!hash_table->find_locked); + ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table); /* The hash table may have been destroyed. Just free local memory. */ pfree(hash_table); @@ -400,7 +397,7 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive) partition = PARTITION_FOR_HASH(hash); Assert(hash_table->control->magic == DSHASH_MAGIC); - Assert(!hash_table->find_locked); + ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table); LWLockAcquire(PARTITION_LOCK(hash_table, partition), exclusive ? LW_EXCLUSIVE : LW_SHARED); @@ -418,8 +415,6 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive) else { /* The caller will free
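The single-pass check that the new LWLockAnyHeldByMe() performs can be illustrated with mock types. This is a sketch of the technique only, not PostgreSQL's implementation: given a pointer to the first lock embedded in an array of partition structs, one loop over the held-locks list suffices, using a range test plus a stride check in place of a loop over partitions nested inside a loop over held locks.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int state; } mock_lwlock;

typedef struct
{
	mock_lwlock lock;			/* lock embedded in a larger struct */
	long		counters[4];	/* other per-partition state */
} mock_partition;

#define MAX_HELD 16
static mock_lwlock *held_locks[MAX_HELD];
static int	num_held;

static void
mock_acquire(mock_lwlock *lock)
{
	held_locks[num_held++] = lock;
}

/*
 * Is any held lock one of the 'nlocks' locks embedded at stride 'stride'
 * starting at 'first'?  Single pass over the held-locks list; the modulo
 * test confirms the pointer really is one of the embedded locks and not
 * some other field inside a partition struct.
 */
static bool
any_held_in_array(mock_lwlock *first, int nlocks, size_t stride)
{
	char	   *begin = (char *) first;
	char	   *end = begin + nlocks * stride;

	for (int i = 0; i < num_held; i++)
	{
		char	   *p = (char *) held_locks[i];

		if (p >= begin && p < end &&
			(size_t) (p - begin) % stride == 0)
			return true;
	}
	return false;
}
```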
CFM Manager
If nobody has already volunteered for the next upcoming commitfest,
I'd like to volunteer. As always, I think it is better to say so early.

--
Ibrar Ahmed
Re: Bugs in copyfuncs/equalfuncs support for JSON node types
Justin Pryzby writes:
> Do the missing fields indicate a deficiency in test coverage ?
> _copyJsonTablePlan.pathname and _equalJsonTable.plan.

Yeah, I'd say so, but I think constructing a test case to prove it's
broken might be more trouble than it's worth --- particularly seeing
that we're about to automate this stuff. Because of that, I wouldn't
even be really concerned about these bugs in HEAD; but this needs to
be back-patched into v15.

The existing COPY_PARSE_PLAN_TREES logic purports to test this area,
but it fails to notice these bugs for a few reasons:

* JsonTable.lateral: COPY_PARSE_PLAN_TREES itself failed to detect this
problem because of matching omissions in _copyJsonTable and
_equalJsonTable. But the lack of any follow-on failure implies that we
don't have any test cases where the lateral flag is significant. Maybe
that means the field is useless? This one would be worth a closer look,
perhaps.

* JsonTableColumn.format: this scalar-instead-of-deep-copy bug would
only be detectable if you were able to clobber the original parse tree
after copying. I have no ideas about an easy way to do that. It'd surely
bite somebody in the field someday, but making a reproducible test is
way harder.

* JsonTable.plan: to detect the missed comparison, you'd have to build
a test case where comparing two such trees actually made a visible
difference; which would require a fair amount of thought I fear. IIUC
this node type will only appear down inside jointrees, which we don't
usually do comparisons on.

			regards, tom lane
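The shallow-copy hazard behind the JsonTableColumn.format bug can be shown with mock types. This is an illustration of the failure mode only, not the actual copyfuncs.c macros: a scalar copy of a pointer field aliases the original node, so clobbering the original afterwards silently changes the "copy" too, while a deep copy is unaffected.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Format { int encoding; } Format;
typedef struct Column { Format *format; } Column;

/* COPY_SCALAR_FIELD-style on a pointer member: the copy aliases the
 * original node instead of duplicating it. */
static Column
copy_shallow(const Column *from)
{
	Column		c;

	c.format = from->format;
	return c;
}

/* COPY_NODE_FIELD-style: duplicate the pointed-to node, so later changes
 * to the original cannot leak into the copy. */
static Column
copy_deep(const Column *from)
{
	Column		c;

	c.format = malloc(sizeof(Format));
	memcpy(c.format, from->format, sizeof(Format));
	return c;
}
```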
Re: TAP output format in pg_regress
Hi,

On 2022-07-04 21:56:24 -0400, Tom Lane wrote:
> > For non-parallel tests I think we currently print the test name before running
> > the test, which obviously doesn't work well when needing to print the 'ok'
> > 'not ok' first.
>
> Is this still a consideration? We got rid of serial_schedule some
> time ago.

Not really for the main tests, there's a few serial steps, but not enough
that a bit additional output would be an issue. I think all tests in
contrib are serial though, and some have enough tests that it might be
annoying?

> > I wonder if for parallel tests we should print the test number based on the
> > start of the test rather than the finish time?
>
> I think we need the test number to be stable, so it had better be the
> ordering appearing in the schedule file. But we already print the
> results in that order.

I remembered some asynchronicity, but apparently that's just the
"parallel group" line.

Greetings,

Andres Freund
Re: Bugs in copyfuncs/equalfuncs support for JSON node types
On Mon, Jul 04, 2022 at 09:23:08PM -0400, Tom Lane wrote:
> In reviewing Peter's patch to auto-generate the backend/nodes
> support files, I compared what the patch's script produces to
> what is in the code now. I found several discrepancies in the
> recently-added parse node types for JSON functions, and as far
> as I can see every one of those discrepancies is an error in
> the existing code. Some of them are relatively harmless
> (e.g. COPY_LOCATION_FIELD isn't really different from
> COPY_SCALAR_FIELD), but some of them definitely are live bugs.
> I propose the attached patch.

Do the missing fields indicate a deficiency in test coverage ?
_copyJsonTablePlan.pathname and _equalJsonTable.plan.

--
Justin
Re: TAP output format in pg_regress
Andres Freund writes:
> I think with a bit of care the tap output could be nearly the same
> "quality". It might not be the absolute "purest" tap output, but who cares.

+1

> For non-parallel tests I think we currently print the test name before running
> the test, which obviously doesn't work well when needing to print the 'ok'
> 'not ok' first.

Is this still a consideration? We got rid of serial_schedule some
time ago.

> I wonder if for parallel tests we should print the test number based on the
> start of the test rather than the finish time?

I think we need the test number to be stable, so it had better be the
ordering appearing in the schedule file. But we already print the
results in that order.

			regards, tom lane
Re: Bugs in copyfuncs/equalfuncs support for JSON node types
On Mon, Jul 4, 2022 at 6:23 PM Tom Lane wrote:
> In reviewing Peter's patch to auto-generate the backend/nodes
> support files, I compared what the patch's script produces to
> what is in the code now. I found several discrepancies in the
> recently-added parse node types for JSON functions, and as far
> as I can see every one of those discrepancies is an error in
> the existing code. Some of them are relatively harmless
> (e.g. COPY_LOCATION_FIELD isn't really different from
> COPY_SCALAR_FIELD), but some of them definitely are live bugs.
> I propose the attached patch.
>
> regards, tom lane

Hi,
Patch looks good to me.
Thanks
Bugs in copyfuncs/equalfuncs support for JSON node types
In reviewing Peter's patch to auto-generate the backend/nodes support files, I compared what the patch's script produces to what is in the code now. I found several discrepancies in the recently-added parse node types for JSON functions, and as far as I can see every one of those discrepancies is an error in the existing code. Some of them are relatively harmless (e.g. COPY_LOCATION_FIELD isn't really different from COPY_SCALAR_FIELD), but some of them definitely are live bugs. I propose the attached patch. regards, tom lane diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 51d630fa89..706d283a92 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -2703,7 +2703,8 @@ _copyJsonTable(const JsonTable *from) COPY_NODE_FIELD(plan); COPY_NODE_FIELD(on_error); COPY_NODE_FIELD(alias); - COPY_SCALAR_FIELD(location); + COPY_SCALAR_FIELD(lateral); + COPY_LOCATION_FIELD(location); return newnode; } @@ -2721,13 +2722,13 @@ _copyJsonTableColumn(const JsonTableColumn *from) COPY_NODE_FIELD(typeName); COPY_STRING_FIELD(pathspec); COPY_STRING_FIELD(pathname); - COPY_SCALAR_FIELD(format); + COPY_NODE_FIELD(format); COPY_SCALAR_FIELD(wrapper); COPY_SCALAR_FIELD(omit_quotes); COPY_NODE_FIELD(columns); COPY_NODE_FIELD(on_empty); COPY_NODE_FIELD(on_error); - COPY_SCALAR_FIELD(location); + COPY_LOCATION_FIELD(location); return newnode; } @@ -2742,10 +2743,10 @@ _copyJsonTablePlan(const JsonTablePlan *from) COPY_SCALAR_FIELD(plan_type); COPY_SCALAR_FIELD(join_type); - COPY_STRING_FIELD(pathname); COPY_NODE_FIELD(plan1); COPY_NODE_FIELD(plan2); - COPY_SCALAR_FIELD(location); + COPY_STRING_FIELD(pathname); + COPY_LOCATION_FIELD(location); return newnode; } diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index e747e1667d..fccc0b4a18 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -147,14 +147,29 @@ _equalTableFunc(const TableFunc *a, const TableFunc *b) return true; } 
+static bool +_equalJsonTablePlan(const JsonTablePlan *a, const JsonTablePlan *b) +{ + COMPARE_SCALAR_FIELD(plan_type); + COMPARE_SCALAR_FIELD(join_type); + COMPARE_NODE_FIELD(plan1); + COMPARE_NODE_FIELD(plan2); + COMPARE_STRING_FIELD(pathname); + COMPARE_LOCATION_FIELD(location); + + return true; +} + static bool _equalJsonTable(const JsonTable *a, const JsonTable *b) { COMPARE_NODE_FIELD(common); COMPARE_NODE_FIELD(columns); + COMPARE_NODE_FIELD(plan); COMPARE_NODE_FIELD(on_error); COMPARE_NODE_FIELD(alias); - COMPARE_SCALAR_FIELD(location); + COMPARE_SCALAR_FIELD(lateral); + COMPARE_LOCATION_FIELD(location); return true; } @@ -166,13 +181,14 @@ _equalJsonTableColumn(const JsonTableColumn *a, const JsonTableColumn *b) COMPARE_STRING_FIELD(name); COMPARE_NODE_FIELD(typeName); COMPARE_STRING_FIELD(pathspec); - COMPARE_SCALAR_FIELD(format); + COMPARE_STRING_FIELD(pathname); + COMPARE_NODE_FIELD(format); COMPARE_SCALAR_FIELD(wrapper); COMPARE_SCALAR_FIELD(omit_quotes); COMPARE_NODE_FIELD(columns); COMPARE_NODE_FIELD(on_empty); COMPARE_NODE_FIELD(on_error); - COMPARE_SCALAR_FIELD(location); + COMPARE_LOCATION_FIELD(location); return true; } @@ -4405,6 +4421,9 @@ equal(const void *a, const void *b) case T_JsonArgument: retval = _equalJsonArgument(a, b); break; + case T_JsonTablePlan: + retval = _equalJsonTablePlan(a, b); + break; case T_JsonTable: retval = _equalJsonTable(a, b); break;
Re: avoid multiple hard links to same WAL file after a crash
On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:
> I'd agree with removing all the callers at the end. pgrename() is
> quite robust on Windows, but I'd keep the two checks in
> writeTimeLineHistory(), as the logic around findNewestTimeLine() would
> consider a past TLI history file as in-use even if we have a crash
> just after the file got created in the same path by the same standby,
> and the WAL segment init part. Your patch does that.

As v16 is now open for business, I have revisited this change and
applied 0001 to change all the callers (aka removal of the assertion
for the WAL receiver when it overwrites a TLI history file). The commit
log includes details about the reasoning of all the areas changed, for
clarity, as of the WAL recycling part, the TLI history file part and
basic_archive.

--
Michael
Re: Handle infinite recursion in logical replication setup
I checked again the v26* patch set.

I had no more comments for v26-0001, v26-0002, or v26-0004, but below are
some comments for v26-0003.

v26-0003

3.1 src/backend/catalog/pg_subscription.c

3.1.a

+/*
+ * Get all relations for subscription that are in a ready state.
+ *
+ * Returned list is palloc'ed in current memory context.
+ */
+List *
+GetSubscriptionReadyRelations(Oid subid)

"subscription" -> "the subscription"

Also, it might be better to say in the function header what kind of
structures are in the returned List. E.g. firstly, I'd assumed it was the
same return type as the other function GetSubscriptionRelations, but it
isn't.

3.1.b

I think there might be a case to be made for *combining* those
SubscriptionRelState and SubscriptionRel structs into a single common
struct. Then all those GetSubscriptionReadyRelations and
GetSubscriptionNotReadyRelations (also GetSubscriptionRelations?) can be
merged to be just one common function that takes a parameter to say do you
want to return the relation List of ALL / READY / NOT_READY? What do you
think?

==

3.2 src/backend/commands/subscriptioncmds.c - check_pub_table_subscribed

+/*
+ * Check and throw an error if the publisher has subscribed to the same table
+ * from some other publisher. This check is required only if copydata is ON and
+ * the origin is local.
+ */
+static void
+check_pub_table_subscribed(WalReceiverConn *wrconn, List *publications,
+						   CopyData copydata, char *origin, Oid subid)

The function comment probably should say something about why the new READY
logic was added in v26.

~~~

3.3

3.3.a

+	/*
+	 * The subid will be valid only for ALTER SUBSCRIPTION ... REFRESH
+	 * PUBLICATION. Get the ready relations for the subscription only in case
+	 * of ALTER SUBSCRIPTION case as there will be no relations in ready state
+	 * while the subscription is created.
+	 */
+	if (subid != InvalidOid)
+		subreadyrels = GetSubscriptionReadyRelations(subid);

The word "case" is 2x in the same sentence.

I also paraphrased my understanding of this comment below. Maybe it is
simpler?

SUGGESTION
Get the ready relations for the subscription. The subid will be valid only
for ALTER SUBSCRIPTION ... REFRESH because there will be no relations in
ready state while the subscription is created.

3.3.b

+	if (subid != InvalidOid)
+		subreadyrels = GetSubscriptionReadyRelations(subid);

SUGGESTION
if (OidIsValid(subid))

==

3.4 src/test/subscription/t/032_localonly.pl

Now all the test cases are re-using the same data (e.g. 11,12,13) and you
are deleting the data between the tests. I guess it is OK, but IMO the
tests are easier to read when each test part was using unique data (e.g.
11,12,13, then 12,22,32, then 13,23,33 etc)

--
Kind Regards,
Peter Smith.
Fujitsu Australia
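The merged-function suggestion in 3.1.b could look something like the following standalone sketch. Everything here is a mock (array in place of a List, invented names); the real helpers scan pg_subscription_rel and return a palloc'd List, but the filtering shape would be the same.

```c
#include <stdbool.h>

typedef enum
{
	SUB_REL_ALL,
	SUB_REL_READY,
	SUB_REL_NOT_READY,
} SubRelFilter;

typedef struct
{
	unsigned	relid;
	char		srsubstate;		/* 'r' = ready, as in pg_subscription_rel */
} MockSubRel;

/*
 * One common function replacing the ALL / READY / NOT_READY variants:
 * copy the matching relations into 'out' and return how many matched.
 */
static int
get_subscription_relations(const MockSubRel *in, int n,
						   SubRelFilter filter, MockSubRel *out)
{
	int			nout = 0;

	for (int i = 0; i < n; i++)
	{
		bool		ready = (in[i].srsubstate == 'r');

		if (filter == SUB_REL_ALL ||
			(filter == SUB_REL_READY && ready) ||
			(filter == SUB_REL_NOT_READY && !ready))
			out[nout++] = in[i];
	}
	return nout;
}
```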
Re: pgsql: dshash: Add sequential scan support.
Hi,

On 2022-07-05 11:20:54 +1200, Thomas Munro wrote:
> On Tue, Jul 5, 2022 at 8:54 AM Andres Freund wrote:
> > > Yeah, it's all for assertions... let's just remove it. Those
> > > assertions were useful to me at some stage in development but won't
> > > hold as well as I thought, at least without widespread PG_FINALLY(),
> > > which wouldn't be nice.
> >
> > Hm. I'd be inclined to at least add a few more
> > Assert(!LWLockHeldByMe[InMode]()) style assertions. E.g. to
> > dshash_find_or_insert().
>
> Yeah, I was wondering about that, but it needs to check the whole 128
> element lock array.

I think it'd be ok to just check the current partition - yes, it'd not
catch cases where we're still holding a lock on another partition, but
that's imo not too bad?

> Hmm, yeah that seems OK for assertion builds.
> Since there were 6 places with I-hold-no-lock assertions, I shoved the
> loop into a function so I could do:
>
> - Assert(!status->hash_table->find_locked);
> + assert_no_lock_held_by_me(hash_table);

I am a *bit* wary about the costs of that, even in assert builds - each
of the partition checks in the loop will in turn need to iterate through
held_lwlocks. But I guess we can also just later weaken them if it turns
out to be a problem.

Greetings,

Andres Freund
Re: pgsql: dshash: Add sequential scan support.
On Tue, Jul 5, 2022 at 8:54 AM Andres Freund wrote: > On 2022-07-04 14:55:43 +1200, Thomas Munro wrote: > > > It's per-backend state at least and just used for assertions. We could > > > remove > > > it. Or stop checking it in places where it could be set wrongly: > > > dshash_find() > > > and dshash_detach() couldn't check anymore, but the rest of the assertions > > > would still be valid afaics? > > > > Yeah, it's all for assertions... let's just remove it. Those > > assertions were useful to me at some stage in development but won't > > hold as well as I thought, at least without widespread PG_FINALLY(), > > which wouldn't be nice. > > Hm. I'd be inclined to at least add a few more > Assert(!LWLockHeldByMe[InMode]()) style assertions. E.g. to > dshash_find_or_insert(). Yeah, I was wondering about that, but it needs to check the whole 128 element lock array. Hmm, yeah that seems OK for assertion builds. Since there were 6 places with I-hold-no-lock assertions, I shoved the loop into a function so I could do: - Assert(!status->hash_table->find_locked); + assert_no_lock_held_by_me(hash_table); > > + Assert(LWLockHeldByMe(PARTITION_LOCK(hash_table, partition_index))); > > > > - hash_table->find_locked = false; > > - hash_table->find_exclusively_locked = false; > > LWLockRelease(PARTITION_LOCK(hash_table, partition_index)); > This LWLockHeldByMe() doesn't add much - the LWLockRelease() will error out if > we don't hold the lock. Duh. Removed. From da650f5dc0dda2155d088f56e2c062f7d10dec44 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 4 Jul 2022 11:56:02 +1200 Subject: [PATCH v2] Fix lock assertions in dshash.c. dshash.c maintained flags to track locking, in order to make assertions that the dshash API was being used correctly. Unfortunately the interaction with ereport's non-local exits was not thought through carefully enough and the flag could get out of sync with reality. 
Since these flags were primarily used for assertions that no lock was held, we can get most of the intended effect with a loop over all LWLocks instead (in assertion builds only). This problem was noted by Tom and Andres while reviewing changes to support the new shared memory stats system in release 15, the second user of dshash.c. On closer inspection, even the earlier typcache.c code was not guaranteed to meet the asserted conditions, now that it's been highlighted that even dsa_get_address() can throw (albeit in unlikely circumstances). Back-patch to 11, where dshash.c arrived. Reported-by: Tom Lane Reported-by: Andres Freund Reviewed-by: Kyotaro HORIGUCHI Reviewed-by: Zhihong Yu Reviewed-by: Andres Freund Discussion: https://postgr.es/m/20220311012712.botrpsikaufzt...@alap3.anarazel.de --- src/backend/lib/dshash.c | 54 +++- 1 file changed, 20 insertions(+), 34 deletions(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index ec454b4d65..90255a5eb9 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -110,8 +110,6 @@ struct dshash_table dshash_table_control *control; /* Control object in DSM. */ dsa_pointer *buckets; /* Current bucket pointers in DSM. */ size_t size_log2; /* log2(number of buckets) */ - bool find_locked; /* Is any partition lock held by 'find'? */ - bool find_exclusively_locked; /* ... exclusively? */ }; /* Given a pointer to an item, find the entry (user data) it holds. */ @@ -167,6 +165,7 @@ struct dshash_table BUCKET_INDEX_FOR_HASH_AND_SIZE(hash, \ hash_table->size_log2)]) +static inline void assert_no_lock_held_by_me(dshash_table *hash_table); static void delete_item(dshash_table *hash_table, dshash_table_item *item); static void resize(dshash_table *hash_table, size_t new_size); @@ -234,9 +233,6 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg) } } - hash_table->find_locked = false; - hash_table->find_exclusively_locked = false; - /* * Set up the initial array of buckets. 
Our initial size is the same as * the number of partitions. @@ -285,8 +281,6 @@ dshash_attach(dsa_area *area, const dshash_parameters *params, hash_table->params = *params; hash_table->arg = arg; hash_table->control = dsa_get_address(area, control); - hash_table->find_locked = false; - hash_table->find_exclusively_locked = false; Assert(hash_table->control->magic == DSHASH_MAGIC); /* @@ -309,7 +303,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params, void dshash_detach(dshash_table *hash_table) { - Assert(!hash_table->find_locked); + assert_no_lock_held_by_me(hash_table); /* The hash table may have been destroyed. Just free local memory. */ pfree(hash_table); @@ -400,7 +394,7 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive) partition = PARTITION_FOR_HASH(hash); Assert(hash_table->control->magic == DSHASH_MAGIC); -
Re: TAP output format in pg_regress
Hi, On 2022-07-05 00:06:04 +0200, Daniel Gustafsson wrote: > > On 4 Jul 2022, at 16:27, Peter Eisentraut > > wrote: > > > > On 29.06.22 21:50, Daniel Gustafsson wrote: > >> Attached is a new version of this patch, which completes the TAP output > >> format > >> option such that all codepaths emitting output are TAP compliant. The > >> verbose > >> option is fixed to not output extraneous newlines which the previous PoC > >> did. The output is made to conform to the original TAP spec since v13/14 > >> TAP > >> parsers seem less common than those that can handle the original spec. > >> Support > >> for the new format additions should be quite simple to add should we want > >> that. > >> Running pg_regress --verbose should give the current format output. > >> I did end up combining TAP and --verbose into a single patch, as the TAP > >> format > >> sort of depends on the verbose flag as TAP has no verbose mode. I can > >> split it > >> into two separate patches should a reviewer prefer that. > > > > I'm not sure what to make of all these options. I think providing a TAP > > output for pg_regress is a good idea. But then do we still need the old > > output? Is it worth maintaining two output formats that display exactly > > the same thing in slightly different ways? > > If we believe that TAP is good enough for human consumption and not just as > input to test runners then we don't. Personally I think the traditional > format > is more pleasant to read than raw TAP output when running tests. I think with a bit of care the tap output could be nearly the same "quality". It might not be the absolute "purest" tap output, but who cares.

test tablespace                   ... ok          418 ms
parallel group (20 tests):  oid char int2 varchar name int4 text pg_lsn regproc txid money boolean uuid float4 int8 float8 bit enum rangetypes numeric
     boolean                      ... ok           34 ms
     char                         ... ok           20 ms
     name                         ... ok           26 ms

isn't that different from

ok 1 - tablespace                         418ms
# parallel group (20 tests):  oid char int2 varchar name int4 text pg_lsn regproc txid money boolean uuid float4 int8 float8 bit enum rangetypes numeric
ok 2 - boolean                             34ms
ok 3 - char                                20ms
ok 4 - name                                26ms

or whatever. For non-parallel tests I think we currently print the test name before running the test, which obviously doesn't work well when needing to print the 'ok' / 'not ok' first. We could just print

# non-parallel group
tablespace

or such? That doesn't

I wonder if for parallel tests we should print the test number based on the start of the test rather than the finish time? Hm, it also looks like it's legal to just leave the test number out? > The discussion on this was in > 20220221164736.rq3ornzjdkmwk...@alap3.anarazel.de > where it was proposed that we could cut the boilerplate. I think that was more about things like CREATE DATABASE etc. And I'm not sure we need an option to keep showing those details. > > I remember the last time I wanted to tweak the output of the parallel > > tests, people were very attached to the particular timing and spacing of > > the current output. So I'm not sure people will like this. > > > > The timing output is very popular. Where is that in the TAP output? > > As I mentioned in the mail upthread, TAP runners generally hide that info and > only show a running total. That being said, I do agree with adding it back, so > I'll do that in a new version of the patch. FWIW, meson's testrunner shows the individual tap output when using -v. One test where using -v is a problem is pg_dump's tests - the absurd number of 8169 tests makes showing that decidedly not fun. > Having test output format parity with supported back branches seemed like a > good idea to me at the time of writing at least. That's of course nice, but it also kind of corners us into not evolving the default format, which I don't think is warranted... Greetings, Andres Freund
Re: Re-order "disable_on_error" in tab-complete COMPLETE_WITH
On Mon, Jul 4, 2022 at 10:29 PM Euler Taveira wrote: > > On Mon, Jul 4, 2022, at 5:37 AM, Amit Kapila wrote: > > Yeah, it seems we have overlooked this point. I think we can do this > just for HEAD but as the feature is introduced in PG-15 so there is no > harm in pushing it to PG-15 as well especially because it is a > straightforward change. What do you or others think? > > No objection. It is a good thing for future backpatches. > Since there is no functional change or bugfix here, I thought it was only applicable to HEAD. This change is almost in the same category as a code comment typo patch - do those normally get backpatched? Maybe we should follow the same convention here. OTOH, if you think it may be helpful for future backpatches then I am also fine with pushing it to PG15. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: TAP output format in pg_regress
> On 4 Jul 2022, at 16:39, Tom Lane wrote: > > Peter Eisentraut writes: >> I'm not sure what to make of all these options. I think providing a TAP >> output for pg_regress is a good idea. But then do we still need the old >> output? Is it worth maintaining two output formats that display exactly >> the same thing in slightly different ways? > > Probably is, because this is bad: > >> ... The proposed default format now hides the >> fact that some tests are started in parallel. I remember the last time >> I wanted to tweak the output of the parallel tests, people were very >> attached to the particular timing and spacing of the current output. So >> I'm not sure people will like this. > > and so is this: > >> The timing output is very popular. Where is that in the TAP output? > > Both of those things are fairly critical for test development. You > need to know what else might be running in parallel with a test case, > and you need to know whether you just bloated the runtime unreasonably. > > More generally, I'm unhappy about the proposal that TAP should become > the default output. That's not my proposal, though: my proposal is that the traditional format should be the default output (with the parallel test info added back, that was my bad), and that TAP is used in automated test runners like in meson. The timing was hidden in TAP output (as mentioned upthread) because TAP test runners generally never show it anyway, but I'll add it back since it clearly doesn't hurt to have it there. > There is nothing particularly human-friendly > about it, whereas the existing format is something we have tuned to > our liking over literally decades. I don't mind if there's a way to > get TAP when you're actually intending to feed it into a TAP-parsing > tool, but I am not a TAP-parsing tool and I don't see why I should > have to put up with it. I totally agree, and that's why the patch has the traditional format - without all the boilerplate - as the default. 
Unless opting in, there is no change over today, apart from the boilerplate. -- Daniel Gustafsson https://vmware.com/
Re: Request for assistance to backport CVE-2022-1552 fixes to 9.6 and 9.4
On Wed, Jun 08, 2022 at 05:31:22PM -0400, Roberto C. Sánchez wrote: > On Wed, Jun 08, 2022 at 04:15:47PM -0400, Tom Lane wrote: > > Roberto =?iso-8859-1?Q?C=2E_S=E1nchez?= writes: > > > I am investigating backporting the fixes for CVE-2022-1552 to 9.6 and > > > 9.4 as part of Debian LTS and Extended LTS. I am aware that these > > > releases are no longer supported upstream, but I have made an attempt at > > > adapting commits ef792f7856dea2576dcd9cab92b2b05fe955696b and > > > f26d5702857a9c027f84850af48b0eea0f3aa15c from the REL_10_STABLE branch. > > > I would appreciate a review of the attached patches and any comments on > > > things that may have been missed and/or adapted improperly. > > > > FWIW, I would not recommend being in a huge hurry to back-port those > > changes, pending the outcome of this discussion: > > > > https://www.postgresql.org/message-id/flat/f8a4105f076544c180a87ef0c4822352%40stmuk.bayern.de > > > Thanks for the pointer. > > > We're going to have to tweak that code somehow, and it's not yet > > entirely clear how. > > > I will monitor the discussion to see what comes of it. > Based on the discussion in the other thread, I have made an attempt to backport commit 88b39e61486a8925a3861d50c306a51eaa1af8d6 to 9.6 and 9.4. The only significant change that I had to make was to add the full function signatures for the REVOKE/GRANT in the citext test. One question that I had about the change as committed is whether a REVOKE is needed on s.citext_ne, like so: REVOKE ALL ON FUNCTION s.citext_ne FROM PUBLIC; Or (for pre-10): REVOKE ALL ON FUNCTION s.citext_ne(s.citext, s.citext) FROM PUBLIC; I ask because the comment immediately preceding the sequence of REVOKEs includes the comment "Revoke all conceivably-relevant ACLs within the extension." I am not especially knowledgable about deep internals, but that function seems like it would belong in the same group with the others. 
In any event, would someone be willing to review the attached patches for correctness? I would like to shortly publish updates to 9.6 and 9.4 in Debian and a review would be most appreciated. Regards, -Roberto -- Roberto C. Sánchez >From ef792f7856dea2576dcd9cab92b2b05fe955696b Mon Sep 17 00:00:00 2001 From: Noah Misch Date: Mon, 9 May 2022 08:35:08 -0700 Subject: [PATCH] Make relation-enumerating operations be security-restricted operations. When a feature enumerates relations and runs functions associated with all found relations, the feature's user shall not need to trust every user having permission to create objects. BRIN-specific functionality in autovacuum neglected to account for this, as did pg_amcheck and CLUSTER. An attacker having permission to create non-temp objects in at least one schema could execute arbitrary SQL functions under the identity of the bootstrap superuser. CREATE INDEX (not a relation-enumerating operation) and REINDEX protected themselves too late. This change extends to the non-enumerating amcheck interface. Back-patch to v10 (all supported versions). Sergey Shinderuk, reviewed (in earlier versions) by Alexander Lakhin. Reported by Alexander Lakhin. 
Security: CVE-2022-1552
---
 src/backend/access/brin/brin.c           | 30 -
 src/backend/catalog/index.c              | 41 +--
 src/backend/commands/cluster.c           | 35
 src/backend/commands/indexcmds.c         | 53 +--
 src/backend/utils/init/miscinit.c        | 24 --
 src/test/regress/expected/privileges.out | 42
 src/test/regress/sql/privileges.sql      | 36 +
 7 files changed, 231 insertions(+), 30 deletions(-)

--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
+#include "utils/guc.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -786,6 +787,9 @@
 	Oid			heapoid;
 	Relation	indexRel;
 	Relation	heapRel;
+	Oid			save_userid;
+	int			save_sec_context;
+	int			save_nestlevel;
 	double		numSummarized = 0;

 	if (RecoveryInProgress())
@@ -799,10 +803,28 @@
 	 * passed indexoid isn't an index then IndexGetRelation() will fail.
 	 * Rather than emitting a not-very-helpful error message, postpone
 	 * complaining, expecting that the is-it-an-index test below will fail.
+	 *
+	 * unlike brin_summarize_range(), autovacuum never calls this. hence, we
+	 * don't switch userid.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
 	if (OidIsValid(heapoid))
+	{
 		heapRel = heap_open(heapoid, ShareUpdateExclusiveLock);
+
+		/*
+		 * Autovacuum calls us.  For its benefit, switch to the table owner's
+		 * userid, so that any index functions are run as that user.  Also
+		 * lock down security-restricted operations and arrange to make GUC
+		 * variable changes local to this command.  This is harmless,
Re: TAP output format in pg_regress
Andres Freund writes: >> Both of those things are fairly critical for test development. You >> need to know what else might be running in parallel with a test case, >> and you need to know whether you just bloated the runtime unreasonably. > That should be doable with tap as well - afaics the output of that could > nearly be the same as now, preceded by a #. I don't mind minor changes like prefixing # --- I just don't want to lose information. regards, tom lane
Re: TAP output format in pg_regress
> On 4 Jul 2022, at 16:27, Peter Eisentraut > wrote: > > On 29.06.22 21:50, Daniel Gustafsson wrote: >> Attached is a new version of this patch, which completes the TAP output >> format >> option such that all codepaths emitting output are TAP compliant. The >> verbose >> option is fixed to not output extraneous newlines which the previous PoC >> did. The output is made to conform to the original TAP spec since v13/14 TAP >> parsers seem less common than those that can handle the original spec. >> Support >> for the new format additions should be quite simple to add should we want >> that. >> Running pg_regress --verbose should give the current format output. >> I did end up combining TAP and --verbose into a single patch, as the TAP >> format >> sort of depends on the verbose flag as TAP has no verbose mode. I can split >> it >> into two separate patches should a reviewer prefer that. > > I'm not sure what to make of all these options. I think providing a TAP > output for pg_regress is a good idea. But then do we still need the old > output? Is it worth maintaining two output formats that display exactly the > same thing in slightly different ways? If we believe that TAP is good enough for human consumption and not just as input to test runners then we don't. Personally I think the traditional format is more pleasant to read than raw TAP output when running tests. > What is the purpose of the --verbose option? When and how is one supposed to > activate that? The proposed default format now hides the fact that some > tests are started in parallel. The discussion on this was in 20220221164736.rq3ornzjdkmwk...@alap3.anarazel.de where it was proposed that we could cut the boilerplate. Thinking on it, I agree that the parallel run info should be included even without --verbose, so I'll add that back. In general, running with --verbose should ideally only be required for troubleshooting setup/teardown issues with testing. 
> I remember the last time I wanted to tweak the output of the parallel tests, > people were very attached to the particular timing and spacing of the current > output. So I'm not sure people will like this. > > The timing output is very popular. Where is that in the TAP output? As I mentioned in the mail upthread, TAP runners generally hide that info and only show a running total. That being said, I do agree with adding it back, so I'll do that in a new version of the patch. > More generally, what do you envision we do with this feature? Who is it for, > what are the tradeoffs, etc.? In general, my thinking with this was that normal testruns started with make check (or similar) by developers would use the traditional format (albeit less verbose) and that the TAP output was for automated test runners in general and the meson test runner in particular. The TAP format is an opt-in with the traditional format being the default. The tradeoff is of course that maintaining two output formats is more work than maintaining one, but it's not really something we change all that often, so that might not be too heavy a burden. I personally didn't see us replacing the traditional format for "human readable" runs; if that's where the discussion is heading then the patch can look quite different. Having test output format parity with supported back branches seemed like a good idea to me at the time of writing at least. -- Daniel Gustafsson https://vmware.com/
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote: > In both test cases, There is not much difference between using AVX2 > and SSE2. The more mode types, the more time it takes for loading the > data (see sse2_4_16_32_128_256). Yea, at some point the compiler starts using a jump table instead of branches, and that turns out to be a good bit more expensive. And even with branches, it obviously adds hard to predict branches. IIRC I fought a bit with the compiler to avoid some of that cost, it's possible that got "lost" in Sawada-san's patch. Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node one is not unimportant until we have path compression. Right now the node struct sizes are: 4 - 48 bytes 32 - 296 bytes 128 - 1304 bytes 256 - 2088 bytes I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other bytes, but needing that separate isset array somehow is sad :/. I wonder if a smaller "free index" would do the trick? Point to the element + 1 where we searched last and start a plain loop there. Particularly in an insert-only workload that'll always work, and in other cases it'll still often work I think. One thing I was wondering about is trying to choose node types in roughly-power-of-two struct sizes. It's pretty easy to end up with significant fragmentation in the slabs right now when inserting as you go, because some of the smaller node types will be freed but not enough to actually free blocks of memory. If we instead have ~power-of-two sizes we could just use a single slab of the max size, and carve out the smaller node types out of that largest allocation. Btw, that fragmentation is another reason why I think it's better to track memory usage via memory contexts, rather than doing so based on GetMemoryChunkSpace(). > > Ideally, node16 and node32 would have the same code with a different > > loop count (1 or 2). 
More generally, there is too much duplication of > > code (noted by Andres in his PoC), and there are many variable names > > with the node size embedded. This is a bit tricky to make more > > general, so we don't need to try it yet, but ideally we would have > > something similar to: > > > > switch (node->kind) // todo: inspect tagged pointer > > { > > case RADIX_TREE_NODE_KIND_4: > >idx = node_search_eq(node, chunk, 4); > >do_action(node, idx, 4, ...); > >break; > > case RADIX_TREE_NODE_KIND_32: > >idx = node_search_eq(node, chunk, 32); > >do_action(node, idx, 32, ...); > > ... > > } FWIW, that should be doable with an inline function, if you pass it the memory to the "array" rather than the node directly. Not so sure it's a good idea to do dispatch between node types / search methods inside the helper, as you suggest below: > > static pg_alwaysinline void > > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout) > > { > > if (node_fanout <= SIMPLE_LOOP_THRESHOLD) > > // do simple loop with (node_simple *) node; > > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD) > > // do vectorized loop where available with (node_vec *) node; > > ... > > } Greetings, Andres Freund
Re: TAP output format in pg_regress
> On 4 Jul 2022, at 22:13, Andres Freund wrote: > > Hi, > > On 2022-06-29 21:50:45 +0200, Daniel Gustafsson wrote: >> @@ -279,8 +648,7 @@ stop_postmaster(void) >> r = system(buf); >> if (r != 0) >> { >> -fprintf(stderr, _("\n%s: could not stop postmaster: >> exit code was %d\n"), >> -progname, r); >> +pg_log_error("could not stop postmaster: exit code was >> %d", r); >> _exit(2); /* not exit(), that >> could be recursive */ >> } > > There's a lot of stuff like this. Perhaps worth doing separately? I'm not sure > I understand where you used bail and where not. I assume it's mostly around use > of _exit() vs exit()? Since bail will cause the entire testrun to be considered a failure, the idea was to avoid using bail() for any errors in tearing down the test harness after an otherwise successful test run. Moving to pg_log_error() can for sure be broken out into a separate patch from the rest of the set (if we at all want to do that, but it seemed logical to address when dealing with other output routines). >> +test_status_ok(tests[i]); >> >> if (statuses[i] != 0) >> log_child_failure(statuses[i]); >> >> INSTR_TIME_SUBTRACT(stoptimes[i], starttimes[i]); >> -status(_(" %8.0f ms"), >> INSTR_TIME_GET_MILLISEC(stoptimes[i])); >> +runtime(tests[i], >> INSTR_TIME_GET_MILLISEC(stoptimes[i])); > > Based on the discussion downthread, let's just always compute this and display > it even in the tap format? Sure, it's easy enough to do and include in the test description. The reason I left it out is that the test runners I played around with all hide those details and only show a running total. That of course doesn't mean that all runners will do that (and anyone running TAP output for human consumption will want it), so I agree with putting it in; I'll fix that up in a v6 shortly. -- Daniel Gustafsson https://vmware.com/
Re: doc: BRIN indexes and autosummarize
On Mon, Jul 04, 2022 at 09:38:42PM +0200, Alvaro Herrera wrote: > + There are several triggers for initial summarization of a page range > + to occur. If the table is vacuumed, either because > +has been manually invoked or because > + autovacuum causes it, > + all existing unsummarized page ranges are summarized. I'd say "If the table is vacuumed manually or by autovacuum, ..." (Or "either manually or by autovacuum, ...") > + Also, if the index has the > +parameter set to on, Maybe say "If the autovacuum parameter is enabled" (this may avoid needing to revise it later if we change the default). > + then any run of autovacuum in the database will summarize all I'd avoid saying "run" and instead say "then anytime autovacuum runs in that database, all ..." > + unsummarized page ranges that have been completely filled recently, > + regardless of whether the table is processed by autovacuum for other > + reasons; see below. say "whether the table itself" and remove "for other reasons" ? > > When autosummarization is enabled, each time a page range is filled a Maybe: "filled," (add a comma after "filled") > - request is sent to autovacuum for it to execute a targeted summarization > - for that range, to be fulfilled at the end of the next worker run on the > - same database. If the request queue is full, the request is not recorded > - and a message is sent to the server log: > + request is sent to autovacuum for it to execute a > targeted > + summarization for that range, to be fulfilled at the end of the next > + autovacuum worker run on the same database. If the request queue is full, > the "to be fulfilled the next time an autovacuum worker finishes running in that database." or "to be fulfilled by an autovacuum worker the next time it finishes running in that database." 
> +++ b/doc/src/sgml/ref/create_index.sgml > @@ -580,6 +580,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT > EXISTS ] > Defines whether a summarization run is invoked for the previous page > range whenever an insertion is detected on the next one. > + See for more details. > + The default is off. Maybe "invoked" should say "queued" ? Also, a reminder that this was never addressed (I wish the project had a way to keep track of known issues). https://www.postgresql.org/message-id/20201113160007.gq30...@telsasoft.com |error_severity of brin work item |left | could not open relation with OID 292103095 |left | processing work entry for relation "ts.child.alarms_202010_alarm_clear_time_idx" |Those happen following a REINDEX job on that index. This inline patch includes my changes as well as yours. And the attached patch is my changes only. diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index caf1ea4cef1..90897a4af07 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -73,31 +73,55 @@ summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. - This process can be invoked manually using the - brin_summarize_range(regclass, bigint) or - brin_summarize_new_values(regclass) functions; - automatically when VACUUM processes the table; - or by automatic summarization executed by autovacuum, as insertions - occur. (This last trigger is disabled by default and can be enabled - with the autosummarize parameter.) - Conversely, a range can be de-summarized using the - brin_desummarize_range(regclass, bigint) function, - which is useful when the index tuple is no longer a very good - representation because the existing values have changed. 
- When autosummarization is enabled, each time a page range is filled a - request is sent to autovacuum for it to execute a targeted summarization - for that range, to be fulfilled at the end of the next worker run on the - same database. If the request queue is full, the request is not recorded - and a message is sent to the server log: + There are several ways to trigger the initial summarization of a page range. + If the table is vacuumed, either manually or by + autovacuum, + all existing unsummarized page ranges are summarized. + Also, if the index's +parameter is enabled, + whenever autovacuum runs in that database, summarization will + occur for all + unsummarized page ranges that have been filled, + regardless of whether the table itself is processed by autovacuum; see below. + + Lastly, the following functions can be used: + + + + brin_summarize_range(regclass, bigint) + summarizes all unsummarized ranges + + + brin_summarize_new_values(regclass) + summarizes one specific range, if it is unsummarized + + + + + + When autosummarization is enabled, each time a page range is filled, a + request is sent to autovacuum to execute a targeted + summarization
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c > new file mode 100644 > index 00..bf87f932fd > --- /dev/null > +++ b/src/backend/lib/radixtree.c > @@ -0,0 +1,1763 @@ > +/*- > + * > + * radixtree.c > + * Implementation for adaptive radix tree. > + * > + * This module employs the idea from the paper "The Adaptive Radix Tree: > ARTful > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and > Thomas > + * Neumann, 2013. > + * > + * There are some differences from the proposed implementation. For > instance, > + * this radix tree module utilizes AVX2 instruction, enabling us to use > 256-bit > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper. > + * Also, there is no support for path compression and lazy path expansion. > The > + * radix tree supports fixed length of the key so we don't expect the tree > level > + * wouldn't be high. I think we're going to need path compression at some point, fwiw. I'd bet on it being beneficial even for the tid case. > + * The key is a 64-bit unsigned integer and the value is a Datum. I don't think it's a good idea to define the value type to be a datum. > +/* > + * As we descend a radix tree, we push the node to the stack. The stack is > used > + * at deletion. > + */ > +typedef struct radix_tree_stack_data > +{ > + radix_tree_node *node; > + struct radix_tree_stack_data *parent; > +} radix_tree_stack_data; > +typedef radix_tree_stack_data *radix_tree_stack; I think it's a very bad idea for traversal to need allocations. I really want to eventually use this for shared structures (eventually with lock-free searches at least), and needing to do allocations while traversing the tree is a no-go for that. Particularly given that the tree currently has a fixed depth, can't you just allocate this on the stack once? > +/* > + * Allocate a new node with the given node kind. 
> + */ > +static radix_tree_node * > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind) > +{ > + radix_tree_node *newnode; > + > + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind], > + > radix_tree_node_info[kind].size); > + newnode->kind = kind; > + > + /* update the statistics */ > + tree->mem_used += GetMemoryChunkSpace(newnode); > + tree->cnt[kind]++; > + > + return newnode; > +} Why are you tracking the memory usage at this level of detail? It's *much* cheaper to track memory usage via the memory contexts? Since they're dedicated for the radix tree, that ought to be sufficient? > + else if (idx != n4->n.count) > + { > + /* > + * the key needs to be inserted > in the middle of the > + * array, make space for the > new key. > + */ > + memmove(&(n4->chunks[idx + 1]), > &(n4->chunks[idx]), > + sizeof(uint8) * > (n4->n.count - idx)); > + memmove(&(n4->slots[idx + 1]), > &(n4->slots[idx]), > + > sizeof(radix_tree_node *) * (n4->n.count - idx)); > + } Maybe we could add a static inline helper for these memmoves? Both because it's repetitive (for different node types) and because the last time I looked gcc was generating quite bad code for this. And having to put workarounds into multiple places is obviously worse than having to do it in one place. > +/* > + * Insert the key with the val. > + * > + * found_p is set to true if the key already present, otherwise false, if > + * it's not NULL. > + * > + * XXX: do we need to support update_if_exists behavior? > + */ Yes, I think that's needed - hence using bfm_set() instead of insert() in the prototype. 
> +void > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) > +{ > + int shift; > + boolreplaced; > + radix_tree_node *node; > + radix_tree_node *parent = tree->root; > + > + /* Empty tree, create the root */ > + if (!tree->root) > + radix_tree_new_root(tree, key, val); > + > + /* Extend the tree if necessary */ > + if (key > tree->max_val) > + radix_tree_extend(tree, key); FWIW, the reason I used separate functions for these in the prototype is that it turns out to generate a lot better code, because it allows non-inlined function calls to be
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, I just noticed that I had a reply forgotten in drafts... On 2022-05-10 10:51:46 +0900, Masahiko Sawada wrote: > To move this project forward, I've implemented radix tree > implementation from scratch while studying Andres's implementation. It > supports insertion, search, and iteration but not deletion yet. In my > implementation, I use Datum as the value so internal and leaf nodes > have the same data structure, simplifying the implementation. The > iteration on the radix tree returns keys with the value in ascending > order of the key. The patch has regression tests for radix tree but is > still in PoC state: left many debugging codes, not supported SSE2 SIMD > instructions, added -mavx2 flag is hard-coded. Very cool - thanks for picking this up. Greetings, Andres Freund
Re: pgsql: dshash: Add sequential scan support.
Hi, On 2022-07-04 14:55:43 +1200, Thomas Munro wrote: > Right, as seen in the build farm at [1]. Also reproducible with something > like: > > @@ -269,6 +269,14 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size > request_size, > return false; > } > > + /* XXX random fault injection */ > + if (op == DSM_OP_ATTACH && random() < RAND_MAX / 8) > + { > + close(fd); > + elog(ERROR, "chaos"); > + return false; > + } > + > > I must have thought that it was easy and practical to write no-throw > straight-line code and be sure to reach dshash_release_lock(), but I > concede that it was a bad idea: even dsa_get_address() can throw*, and > you're often likely to need to call that while accessing dshash > elements. For example, in lookup_rowtype_tupdesc_internal(), there is > a sequence dshash_find(), ..., dsa_get_address(), ..., > dshash_release_lock(), and I must have considered the range of code > between find and release to be no-throw, but now I know that it is > not. Yea - I'd go as far as saying that it's almost never feasible. > > It's per-backend state at least and just used for assertions. We could > > remove > > it. Or stop checking it in places where it could be set wrongly: > > dshash_find() > > and dshash_detach() couldn't check anymore, but the rest of the assertions > > would still be valid afaics? > > Yeah, it's all for assertions... let's just remove it. Those > assertions were useful to me at some stage in development but won't > hold as well as I thought, at least without widespread PG_FINALLY(), > which wouldn't be nice. Hm. I'd be inclined to at least add a few more Assert(!LWLockHeldByMe[InMode]()) style assertions. E.g. to dshash_find_or_insert(). 
> @@ -572,13 +552,8 @@ dshash_release_lock(dshash_table *hash_table, void > *entry) > size_t partition_index = PARTITION_FOR_HASH(item->hash); > > Assert(hash_table->control->magic == DSHASH_MAGIC); > - Assert(hash_table->find_locked); > - Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index), > - > hash_table->find_exclusively_locked > - ? LW_EXCLUSIVE > : LW_SHARED)); > + Assert(LWLockHeldByMe(PARTITION_LOCK(hash_table, partition_index))); > > - hash_table->find_locked = false; > - hash_table->find_exclusively_locked = false; > LWLockRelease(PARTITION_LOCK(hash_table, partition_index)); > } This LWLockHeldByMe() doesn't add much - the LWLockRelease() will error out if we don't hold the lock. Greetings, Andres Freund
Re: doc: BRIN indexes and autosummarize
On Mon, Jul 04, 2022 at 09:38:42PM +0200, Alvaro Herrera wrote: > What about this? > > -- > Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ > "Java is clearly an example of money oriented programming" (A. Stepanov) > diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml > index caf1ea4cef..0a715d41c7 100644 > --- a/doc/src/sgml/brin.sgml > +++ b/doc/src/sgml/brin.sgml > @@ -73,31 +73,55 @@ > summarized range, that range does not automatically acquire a summary > tuple; those tuples remain unsummarized until a summarization run is > invoked later, creating initial summaries. > - This process can be invoked manually using the > - brin_summarize_range(regclass, bigint) or > - brin_summarize_new_values(regclass) functions; > - automatically when VACUUM processes the table; > - or by automatic summarization executed by autovacuum, as insertions > - occur. (This last trigger is disabled by default and can be enabled > - with the autosummarize parameter.) > - Conversely, a range can be de-summarized using the > - brin_desummarize_range(regclass, bigint) function, > - which is useful when the index tuple is no longer a very good > - representation because the existing values have changed. > + > + I feel that somewhere in this paragraph it should be mentioned that autosummarize is off by default. Otherwise, +1 -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi, On 2022-06-27 16:55:55 +0200, David Geier wrote: > Indeed, the total JIT time increases the more modules are used. The reason > for this to happen is that the inlining pass loads and deserializes all to > be inlined modules (.bc files) from disk prior to inlining them via > llvm::IRMover. There's already a cache for such modules in the code, but it > is currently unused. This is because llvm::IRMover takes the module to be > inlined as std::unique_ptr. The by-value argument requires > the source module to be moved, which means it cannot be reused afterwards. > The code is accounting for that by erasing the module from the cache after > inlining it, which in turns requires reloading the module next time a > reference to it is encountered. > > Instead of each time loading and deserializing all to be inlined modules > from disk, they can reside in the cache and instead be cloned via > llvm::CloneModule() before they get inlined. Key to calling > llvm::CloneModule() is fully deserializing the module upfront, instead of > loading the module lazily. That is why I changed the call from > LLVMGetBitcodeModuleInContext2() (which lazily loads the module via > llvm::getOwningLazyBitcodeModule()) to LLVMParseBitCodeInContext2() (which > fully loads the module via llvm::parseBitcodeFile()). Beyond that it seems > like that prior to LLVM 13, cloning modules could fail with an assertion > (not sure though if that would cause problems in a release build without > assertions). Andres reported this problem back in the days here [1]. In the > meanwhile the issue got discussed in [2] and finally fixed for LLVM 13, see > [3]. Unfortunately that doesn't work right now - that's where I had started. The problem is that IRMover renames types. Which, in the case of cloned modules unfortunately means that types used cloned modules are also renamed in the "origin" module. Which then causes problems down the line, because parts of the LLVM code match types by type names. 
That can then have the effect of drastically decreasing code generation quality over time, because e.g. inlining never manages to find compatible signatures. > However, curiously the time spent on optimizing is also reduced (95ms > instead of 164ms). Could this be because some of the applied optimizations > are ending up in the cached module? I suspect it's more that optimization stops being able to do a lot, due to the type renaming issue. > @Andres: could you provide me with the queries that caused the assertion > failure in LLVM? I don't think I have the concrete query. What I tend to do is to run the whole regression tests with forced JITing. I'm fairly certain this triggered the bug at the time. > Have you ever observed a segfault with a non-assert-enabled build? I think I observed bogus code generation that then could lead to segfaults or such. > I just want to make sure this is truly fixed in LLVM 13. Running 'make > check-world' all tests passed. With jit-ing forced for everything? One more thing to try is to jit-compile twice and ensure the code is the same. It certainly wasn't in the past due to the above issue. Greetings, Andres Freund
Re: Export log_line_prefix(); useful for emit_log_hook.
On Mon, 2022-07-04 at 15:54 +0900, Michael Paquier wrote: > On Wed, Jun 29, 2022 at 03:09:42PM +0200, Alvaro Herrera wrote: > > Hmm, maybe your hypothetical book would prefer to use a different > > setting for log line prefix than Log_line_prefix, so it would make > > sense > > to pass the format string as a parameter to the function instead of > > relying on the GUC global. That is nicer, attached. I also renamed the function log_status_format(), and made log_line_prefix() a thin wrapper over that. I think that's less confusing. Regards, Jeff Davis From 4c64049e500c80ac495732d34ae56e07393115fe Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Tue, 28 Jun 2022 11:39:33 -0700 Subject: [PATCH] Provide log_status_format(), useful for an emit_log_hook. Refactor so that log_line_prefix() is a thin wrapper over a new function log_status_format(), and move the implementation to the latter. Export log_status_format() so that it can be used by an emit_log_hook. --- src/backend/utils/error/elog.c | 15 --- src/include/utils/elog.h | 3 +++ 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 55ee5423afb..7b2bb9c8b75 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -2439,10 +2439,19 @@ process_log_prefix_padding(const char *p, int *ppadding) } /* - * Format tag info for log lines; append to the provided buffer. + * Format log status information using Log_line_prefix. */ static void log_line_prefix(StringInfo buf, ErrorData *edata) +{ + log_status_format(buf, Log_line_prefix, edata); +} + +/* + * Format log status info; append to the provided buffer. 
+ */ +void +log_status_format(StringInfo buf, char *format, ErrorData *edata) { /* static counter for line numbers */ static long log_line_number = 0; @@ -2466,10 +2475,10 @@ log_line_prefix(StringInfo buf, ErrorData *edata) } log_line_number++; - if (Log_line_prefix == NULL) + if (format == NULL) return; /* in case guc hasn't run yet */ - for (p = Log_line_prefix; *p != '\0'; p++) + for (p = format; *p != '\0'; p++) { if (*p != '%') { diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f5c6cd904de..e88e3f8a801 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -16,6 +16,8 @@ #include +#include "lib/stringinfo.h" + /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of * decreasing detail. */ @@ -439,6 +441,7 @@ extern PGDLLIMPORT bool syslog_split_messages; #define LOG_DESTINATION_JSONLOG 16 /* Other exported functions */ +extern void log_status_format(StringInfo buf, char *format, ErrorData *edata); extern void DebugFileOpen(void); extern char *unpack_sql_state(int sql_state); extern bool in_error_recursion_trouble(void); -- 2.17.1
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi, On 2022-07-04 06:43:00 +, Luc Vlaming Hummel wrote: > Thanks for reviewing this and the interesting examples! > > Wanted to give a bit of extra insight as to why I'd love to have a system > that can lazily emit JIT code and hence creates roughly a module per function: > In the end I'm hoping that we can migrate to a system where we only JIT after > a configurable cost has been exceeded for this node, as well as a > configurable amount of rows has actually been processed. > Reason is that this would safeguard against some problematic planning issues > wrt JIT (node not being executed, row count being massively off). I still don't see how it's viable to move to always doing function-by-function emission, overhead-wise. I also want to go to do JIT in the background and triggered by actual usage. But to me it seems a dead end to require moving to one-function-per-module model for that. > If this means we have to invest more in making it cheap(er) to emit modules, > I'm all for that. I think that's just inherently more expensive and thus a no-go. > @Andres if there's any other things we ought to fix to make this cheap > (enough) compared to the previous code I'd love to know your thoughts. I'm not seeing it. Greetings, Andres Freund
Re: Proposal: allow database-specific role memberships
Hi Antonin, First of all, thank you so much for taking the time to review my patch. I'll answer your questions in reverse order: The "unsafe_tests" directory is where the pre-existing role tests were located. According to the readme of the "unsafe_tests" directory, the tests contained within are not run during "make installcheck" because they could have side-effects that seem undesirable for a production installation. This seemed like a reasonable location as the new tests that this patch introduces also modifies the "state" of the database cluster by adding, modifying, and removing roles & databases (including template1). Regarding roles_is_member_of(), the nuance is that role "A" in your example would only be considered a member of role "B" (and by extension role "C") when connected to the database in which "A" was granted database-specific membership to "B". Conversely, when connected to any other database, "A" would not be considered to be a member of "B". This patch is designed to solve the scenarios in which one may want to grant constrained access to a broader set of privileges. For example, membership in "pg_read_all_data" effectively grants SELECT and USAGE rights on everything (implicitly cluster-wide in today's implementation). By granting a role membership to "pg_read_all_data" within the context of a specific database, the grantee's read-everything privilege is effectively constrained to just that specific database (as membership within "pg_read_all_data" would not otherwise be held). A rebased version is attached. Thanks again! - Kenaniah On Wed, Jun 29, 2022 at 6:45 AM Antonin Houska wrote: > Kenaniah Cerny wrote: > > > Attached is a newly-rebased patch -- would love to get a review from > someone whenever possible. > > I've picked this patch for a review. The patch currently does not apply to > the > master branch, so I could only read the diff. 
Following are my comments: > > * I think that roles_is_member_of() deserves a comment explaining why the > code > that you moved into append_role_memberships() needs to be called twice, > i.e. once for global memberships and once for the database-specific ones. > > I think the reason is that if, for example, role "A" is a > database-specific > member of role "B" and "B" is a "global" member of role "C", then "A" > should > not be considered a member of "C", unless "A" is granted "C" explicitly. > Is > this behavior intended? > > Note that in this example, the "C" members are a superset of "B" members, > and thus "C" should have weaker permissions on database objects than > "B". What's then the reason to not consider "A" a member of "C"? If "C" > gives its members some permissions of "B" (e.g. "pg_write_all_data"), > then I > think the roles hierarchy is poorly designed. > > A counter-example might help me to understand. > > * Why do you think that "unsafe_tests" is the appropriate name for the > directory that contains regression tests? > > I can spend more time on the review if the patch gets rebased. > > -- > Antonin Houska > Web: https://www.cybertec-postgresql.com > database-role-memberships-v9.patch Description: Binary data
Re: TAP output format in pg_regress
Hi, On 2022-06-29 21:50:45 +0200, Daniel Gustafsson wrote: > @@ -279,8 +648,7 @@ stop_postmaster(void) > r = system(buf); > if (r != 0) > { > - fprintf(stderr, _("\n%s: could not stop postmaster: > exit code was %d\n"), > - progname, r); > + pg_log_error("could not stop postmaster: exit code was > %d", r); > _exit(2); /* not exit(), that > could be recursive */ > } There's a lot of stuff like this. Perhaps worth doing separately? I'm not sure I understand where you used bail and where not. I assume it's mostly around use of _exit() vs exit()? > + test_status_ok(tests[i]); > > if (statuses[i] != 0) > log_child_failure(statuses[i]); > > INSTR_TIME_SUBTRACT(stoptimes[i], starttimes[i]); > - status(_(" %8.0f ms"), > INSTR_TIME_GET_MILLISEC(stoptimes[i])); > + runtime(tests[i], > INSTR_TIME_GET_MILLISEC(stoptimes[i])); Based on the discussion downthread, let's just always compute this and display it even in the tap format? Greetings, Andres Freund
Re: TAP output format in pg_regress
Hi, > Peter Eisentraut writes: > > I'm not sure what to make of all these options. I think providing a TAP > > output for pg_regress is a good idea. But then do we still need the old > > output? Is it worth maintaining two output formats that display exactly > > the same thing in slightly different ways? > > Probably is, because this is bad: > > ... The proposed default format now hides the > > fact that some tests are started in parallel. I remember the last time > > I wanted to tweak the output of the parallel tests, people were very > > attached to the particular timing and spacing of the current output. So > > I'm not sure people will like this. > > and so is this: > > > The timing output is very popular. Where is that in the TAP output? > > Both of those things are fairly critical for test development. You > need to know what else might be running in parallel with a test case, > and you need to know whether you just bloated the runtime unreasonably. That should be doable with tap as well - afaics the output of that could nearly be the same as now, preceded by a #. The test timing output could (and I think should) also be output - but if I read the tap specification correctly, we'd either need to make it part of the test "description" or on a separate line. On 2022-07-04 10:39:37 -0400, Tom Lane wrote: > More generally, I'm unhappy about the proposal that TAP should become > the default output. There is nothing particularly human-friendly > about it, whereas the existing format is something we have tuned to > our liking over literally decades. I don't mind if there's a way to > get TAP when you're actually intending to feed it into a TAP-parsing > tool, but I am not a TAP-parsing tool and I don't see why I should > have to put up with it. I'm mostly interested in the tap format because meson's testrunner can parse it - unsurprisingly it doesn't understand the current regress output. 
It's a lot nicer to immediately be pointed to the failed test(s) than having to scan through the output "manually". Greetings, Andres Freund
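TAP leaves room for both pieces of information discussed above: the parallel grouping can go in a `#` diagnostic line, and the per-test runtime can ride along in (or after) the test description. A purely illustrative sketch of what such output could look like, not the output of any posted patch:

```
# parallel group (3 tests): boolean char name
ok 1 - boolean                          55 ms
ok 2 - char                             41 ms
ok 3 - name                             38 ms
not ok 4 - varchar                      70 ms
1..4
```

A TAP consumer such as meson's test runner would ignore the `#` lines and the trailing timings while still pinpointing the failed test.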
Re: pgsql: Change timeline field of IDENTIFY_SYSTEM to int8
On 04.07.22 19:32, Tom Lane wrote: If the result of IDENTIFY_SYSTEM is always sent in text format, then I agree that this isn't very problematic. If there are any clients that fetch it in binary mode, though, this is absolutely a wire protocol break for them The result rows of the replication commands are always sent in text format.
Re: doc: BRIN indexes and autosummarize
What about this? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Java is clearly an example of money oriented programming" (A. Stepanov) diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index caf1ea4cef..0a715d41c7 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -73,31 +73,55 @@ summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. - This process can be invoked manually using the - brin_summarize_range(regclass, bigint) or - brin_summarize_new_values(regclass) functions; - automatically when VACUUM processes the table; - or by automatic summarization executed by autovacuum, as insertions - occur. (This last trigger is disabled by default and can be enabled - with the autosummarize parameter.) - Conversely, a range can be de-summarized using the - brin_desummarize_range(regclass, bigint) function, - which is useful when the index tuple is no longer a very good - representation because the existing values have changed. + + + + There are several triggers for initial summarization of a page range + to occur. If the table is vacuumed, either because + VACUUM has been manually invoked or because + autovacuum causes it, + all existing unsummarized page ranges are summarized. + Also, if the index has the + autosummarize parameter set to on, + then any run of autovacuum in the database will summarize all + unsummarized page ranges that have been completely filled recently, + regardless of whether the table is processed by autovacuum for other + reasons; see below. 
+ Lastly, the following functions can be used: + + + + brin_summarize_new_values(regclass) + summarizes all unsummarized ranges + + + brin_summarize_range(regclass, bigint) + summarizes one specific range, if it is unsummarized + + When autosummarization is enabled, each time a page range is filled a - request is sent to autovacuum for it to execute a targeted summarization - for that range, to be fulfilled at the end of the next worker run on the - same database. If the request queue is full, the request is not recorded - and a message is sent to the server log: + request is sent to autovacuum for it to execute a targeted + summarization for that range, to be fulfilled at the end of the next + autovacuum worker run on the same database. If the request queue is full, the + request is not recorded and a message is sent to the server log: LOG: request for BRIN range summarization for index "brin_wi_idx" page 128 was not recorded When this happens, the range will be summarized normally during the next regular vacuum of the table. + + + Conversely, a range can be de-summarized using the + brin_desummarize_range(regclass, bigint) function, + which is useful when the index tuple is no longer a very good + representation because the existing values have changed. + See for details. + + diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml index 9ffcdc629e..d3db03278d 100644 --- a/doc/src/sgml/ref/create_index.sgml +++ b/doc/src/sgml/ref/create_index.sgml @@ -580,6 +580,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] Defines whether a summarization run is invoked for the previous page range whenever an insertion is detected on the next one. + See for more details. + The default is off.
Re: TAP output format in pg_regress
On 04.07.22 16:39, Tom Lane wrote: Probably is, because this is bad: ... The proposed default format now hides the fact that some tests are started in parallel. I remember the last time I wanted to tweak the output of the parallel tests, people were very attached to the particular timing and spacing of the current output. So I'm not sure people will like this. and so is this: The timing output is very popular. Where is that in the TAP output? Both of those things are fairly critical for test development. You need to know what else might be running in parallel with a test case, and you need to know whether you just bloated the runtime unreasonably. I don't think there is a reason these couldn't be shown in TAP output as well. Even if we keep the two output formats in parallel, it would be good if they showed the same set of information.
Re: Temporary file access API
On 04.07.22 18:35, Antonin Houska wrote: Attached is a new version of the patch, to evaluate what the API use in the backend could look like. I haven't touched places where the file is accessed in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() is called. Rebased patch set is attached here, which applies to the current master. (A few more opportunities for the new API implemented here.) I don't understand what this patch set is supposed to do. AFAICT, the thread originally forked off from a TDE discussion and, considering the thread subject, was possibly once an attempt to refactor temporary file access to make integrating encryption easier? The discussion then talked about things like saving on system calls and excising all OS-level file access API use, which I found confusing, and the thread then just became a general TDE-related mini-discussion. The patches at hand extend some internal file access APIs and then sprinkle them around various places in the backend code, but there is no explanation why or how this is better. I don't see any real structural savings one might expect from a refactoring patch. No information has been presented how this might help encryption. I also suspect that changing around the use of various file access APIs needs to be carefully evaluated in each individual case. Various code has subtle expectations about atomic write behavior, syncing, flushing, error recovery, etc. I don't know if this has been considered here.
Re: Handle infinite recursion in logical replication setup
On Mon, Jul 4, 2022 at 7:44 PM Alvaro Herrera wrote: > > > From d8f8844f877806527b6f3f45320b6ba55a8e3154 Mon Sep 17 00:00:00 2001 > > From: Vigneshwaran C > > Date: Thu, 26 May 2022 19:29:33 +0530 > > Subject: [PATCH v27 1/4] Add a missing test to verify only-local parameter > > in > > test_decoding plugin. > > > > Add a missing test to verify only-local parameter in test_decoding plugin. > > I don't get it. replorigin.sql already has some lines to test > local-only. What is your patch adding that is new? Maybe instead of > adding some more lines at the end of the script, you should add lines > where this stuff is already being tested. But that assumes that there > is something new that is being tested; if so what is it? The test is to check that remote origin data (i.e. replication origin being set) will be filtered when only-local parameter is set. I felt that this scenario is not covered, unless I'm missing something. I have moved the test as suggested. I have changed this in the v28 patch attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm3MNK2hMYroTiHGS9HkSxiA-az1QC1mpa0YwDZ8nGmmZg%40mail.gmail.com Regards, Vignesh
Re: Handle infinite recursion in logical replication setup
On Mon, Jul 4, 2022 at 9:23 AM Amit Kapila wrote: > > On Mon, Jul 4, 2022 at 3:59 AM Peter Smith wrote: > > > > On Mon, Jul 4, 2022 at 12:59 AM vignesh C wrote: > > ... > > > > 2. > > > > /* ALTER SUBSCRIPTION SET ( */ > > > > else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && > > > > TailMatches("SET", "(")) > > > > - COMPLETE_WITH("binary", "slot_name", "streaming", > > > > "synchronous_commit", "disable_on_error"); > > > > + COMPLETE_WITH("binary", "origin", "slot_name", "streaming", > > > > "synchronous_commit", "disable_on_error"); > > > > /* ALTER SUBSCRIPTION SKIP ( */ > > > > else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && > > > > TailMatches("SKIP", "(")) > > > > COMPLETE_WITH("lsn"); > > > > @@ -3152,7 +3152,7 @@ psql_completion(const char *text, int start, int > > > > end) > > > > /* Complete "CREATE SUBSCRIPTION ... WITH ( " */ > > > > else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", > > > > "(")) > > > > COMPLETE_WITH("binary", "connect", "copy_data", "create_slot", > > > > - "enabled", "slot_name", "streaming", > > > > + "enabled", "origin", "slot_name", "streaming", > > > > "synchronous_commit", "two_phase", "disable_on_error"); > > > > > > > > Why do you choose to add a new option in-between other parameters > > > > instead of at the end which we normally do? The one possible reason I > > > > can think of is that all the parameters at the end are boolean so you > > > > want to add this before those but then why before slot_name, and again > > > > I don't see such a rule being followed for other parameters. > > > > > > I was not sure if it should be maintained in alphabetical order, > > > anyway since the last option "disable_on_error" is at the end, I have > > > changed it to the end. > > > > > > > Although it seems it is not a hard rule, mostly the COMPLETE_WITH are > > coded using alphabetical order. 
Anyway, I think that was a clear > > intention here too since 13 of 14 parameters were already in > > alphabetical order; it is actually only that "disable_on_error" > > parameter that was misplaced; not the new "origin" parameter. > > > > Agreed, but let's not change disable_on_error as part of this patch. I have changed this in the v28 patch attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm3MNK2hMYroTiHGS9HkSxiA-az1QC1mpa0YwDZ8nGmmZg%40mail.gmail.com Regards, Vignesh
Re: Handle infinite recursion in logical replication setup
On Mon, Jul 4, 2022 at 1:46 PM shiy.f...@fujitsu.com wrote: > > On Sun, Jul 3, 2022 11:00 PM vignesh C wrote: > > > > Thanks for the comments, the attached v27 patch has the changes for the > > same. > > > > Thanks for updating the patch. > > A comment on 0003 patch: > + /* > +* No need to throw an error for the tables that are in ready > state, > +* as the walsender will send the changes from WAL in case of > tables > +* in ready state. > +*/ > + if (isreadytable) > + continue; > + > ... > + ereport(ERROR, > + > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("table: \"%s.%s\" might have > replicated data in the publisher", > + nspname, relname), > + errdetail("CREATE/ALTER SUBSCRIPTION with > origin = local and copy_data = on is not allowed when the publisher might > have replicated data."), > + errhint("Use CREATE/ALTER SUBSCRIPTION with > copy_data = off/force.")); > + > + ExecClearTuple(slot); > > I think we should call ExecClearTuple() before getting next tuple, so it > should > be called if the table is in ready state. How about modifying it to: > if (!isreadytable) > ereport(ERROR, > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("table: \"%s.%s\" might have > replicated data in the publisher", >nspname, relname), > errdetail("CREATE/ALTER SUBSCRIPTION > with origin = local and copy_data = on is not allowed when the publisher > might have replicated data."), > errhint("Use CREATE/ALTER > SUBSCRIPTION with copy_data = off/force.")); > > ExecClearTuple(slot); Thanks for the comment, I have modified this in the v28 patch attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm3MNK2hMYroTiHGS9HkSxiA-az1QC1mpa0YwDZ8nGmmZg%40mail.gmail.com Regards, Vignesh
Re: make install-world fails sometimes in Mac M1
Gaddam Sai Ram writes:
> We are using a script to install Postgres from source; the script works
> fine on Ubuntu and Mac (Intel) but mostly fails (works sometimes) on Mac M1.

We have developers (including me) and buildfarm machines using M1 Macs,
and nobody else is reporting any big problem with them, so I don't believe
that this is specifically due to that.

> make[2]: *** [install] Killed: 9

kill -9 is not something that would happen internally to the install
process. My guess is that that is interference from some external agent.
Perhaps you have some resource-consumption-limiting daemon installed on
that machine, and it's deciding that the command ran too long?

			regards, tom lane
Re: logical replication restrictions
On Wed, Mar 23, 2022, at 6:19 PM, Euler Taveira wrote:
> On Mon, Mar 21, 2022, at 10:09 PM, Euler Taveira wrote:
>> On Mon, Mar 21, 2022, at 10:04 PM, Andres Freund wrote:
>>> On 2022-03-20 21:40:40 -0300, Euler Taveira wrote:
>>> > On Mon, Feb 28, 2022, at 9:18 PM, Euler Taveira wrote:
>>> > > Long time, no patch. Here it is. I will provide documentation in the
>>> > > next version. I would appreciate some feedback.
>>> > This patch is broken since commit
>>> > 705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33. I rebased it.
>>>
>>> This fails tests, specifically it seems psql crashes:
>>> https://cirrus-ci.com/task/6592281292570624?logs=cores#L46
>> Yeah. I forgot to test this patch with cassert before sending it. :( I didn't
>> send a new patch because there is another issue (with int128) that I'm
>> currently reworking. I'll send another patch soon.
> Here is another version after rebasing it. In this version I fixed the psql
> issue and rewrote the interval_to_ms function.

From the previous version, I added support for streamed transactions. For
streamed transactions, the delay is applied during the STREAM COMMIT message.
That's ok if we add the delay before applying the spooled messages. Hence, we
guarantee that the delay is applied *before* each transaction. The same logic
is applied to prepared transactions: the delay is introduced before applying
the spooled messages in the STREAM PREPARE message. Tests were refactored a
bit, and a test for streamed transactions was included too. Version 4 is
attached.

--
Euler Taveira
EDB   https://www.enterprisedb.com/

From 7dd7a3523ed8e7a3494e7ec25ddc0af8ed4cf4d3 Mon Sep 17 00:00:00 2001
From: Euler Taveira
Date: Sat, 6 Nov 2021 11:31:10 -0300
Subject: [PATCH v4] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (especially to fix
errors that might cause data loss).
If the subscriber sets the min_apply_delay parameter, the logical
replication worker will delay the transaction commit for min_apply_delay
milliseconds. The delay is calculated between the WAL timestamp and the
current time on the subscriber. The delay occurs only on WAL records for
transaction begins. Regular and prepared transactions are covered.
Streamed transactions are also covered.

Author: Euler Taveira
Discussion: https://postgr.es/m/CAB-JLwYOYwL=xtyaxkih5ctm_vm8kjkh7aaitckvmch4rzr...@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                 |   9 ++
 doc/src/sgml/ref/alter_subscription.sgml   |   5 +-
 doc/src/sgml/ref/create_subscription.sgml  |  31 -
 src/backend/catalog/pg_subscription.c      |   1 +
 src/backend/catalog/system_views.sql       |   2 +-
 src/backend/commands/subscriptioncmds.c    |  46 ++-
 src/backend/replication/logical/worker.c   |  82
 src/backend/utils/adt/timestamp.c          |  32 +
 src/bin/pg_dump/pg_dump.c                  |  16 ++-
 src/bin/pg_dump/pg_dump.h                  |   1 +
 src/bin/psql/describe.c                    |   8 +-
 src/bin/psql/tab-complete.c                |   7 +-
 src/include/catalog/pg_subscription.h      |   3 +
 src/include/datatype/timestamp.h           |   2 +
 src/include/utils/timestamp.h              |   2 +
 src/test/regress/expected/subscription.out | 149 -
 src/test/regress/sql/subscription.sql      |  20 +++
 src/test/subscription/t/032_apply_delay.pl | 110 +++
 18 files changed, 455 insertions(+), 71 deletions(-)
 create mode 100644 src/test/subscription/t/032_apply_delay.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 25b02c4e37..9b94b7aef2 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7833,6 +7833,15 @@ SCRAM-SHA-256$iteration count:
+
+
+       subapplydelay int8
+
+
+
+       Delay the application of changes by a specified amount of time.
+
+
+
       subname name

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 353ea5def2..ae9d625f9d 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -207,8 +207,9 @@ ALTER SUBSCRIPTION name RENAME TO <
       information.  The parameters that can be altered are
       slot_name, synchronous_commit,
-      binary, streaming, and
-      disable_on_error.
+      binary, streaming,
+      disable_on_error, and
+      min_apply_delay.

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 34b3264b26..ae80db8a3d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -302,7 +302,36 @@ CREATE SUBSCRIPTION subscription_name
-
+
+
+       min_apply_delay (integer)
+
+
+       By default, subscriber applies changes as soon
Re: pgsql: Change timeline field of IDENTIFY_SYSTEM to int8
Peter Eisentraut writes:
> On 04.07.22 07:55, Tom Lane wrote:
>> But what about whatever code is reading the output? And what if
>> that code isn't v16? I can't believe that we can make a wire
>> protocol change as summarily as this.

> I think a client will either just read the string value and convert it
> to some numeric type without checking what type was actually sent, or if
> the client API is type-aware and automatically converts to a native type
> of some sort, then it will probably already support 64-bit ints. Do you
> see some problem scenario?

If the result of IDENTIFY_SYSTEM is always sent in text format, then I
agree that this isn't very problematic. If there are any clients that
fetch it in binary mode, though, this is absolutely a wire protocol break
for them ... and no, I don't believe an unsupported claim that they'd
adapt automatically.

> I'm seeing a bigger problem now, which is that our client code doesn't
> parse bigger-than-int32 timeline IDs correctly.

Yup.

			regards, tom lane
make install-world fails sometimes in Mac M1
Hi team,

We are using a script to install Postgres from source; the script works fine
on Ubuntu and Mac (Intel) but mostly fails (works sometimes) on Mac M1.
Configure and "make world" work fine, but it fails during "make install-world":

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C ./src/backend generated-headers
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C catalog distprep generated-header-symlinks
make[2]: Nothing to be done for `distprep'.
make[2]: Nothing to be done for `generated-header-symlinks'.
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C utils distprep generated-header-symlinks
make[2]: Nothing to be done for `distprep'.
make[2]: Nothing to be done for `generated-header-symlinks'.
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C doc install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C src install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C sgml install
/bin/sh ../../../config/install-sh -c -d '/Users/sairam/work/postgresql-11.14/share/doc/'/html '/Users/sairam/work/postgresql-11.14/share/man'/man1 '/Users/sairam/work/postgresql-11.14/share/man'/man3 '/Users/sairam/work/postgresql-11.14/share/man'/man7
cp -R `for f in ./html; do test -r $f && echo $f && break; done` '/Users/sairam/work/postgresql-11.14/share/doc/'
cp -R `for f in ./man1; do test -r $f && echo $f && break; done` `for f in ./man3; do test -r $f && echo $f && break; done` `for f in ./man7; do test -r $f && echo $f && break; done` '/Users/sairam/work/postgresql-11.14/share/man'
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C src install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C common install
/bin/sh ../../config/install-sh -c -d '/Users/sairam/work/postgresql-11.14/lib'
/usr/bin/install -c -m 644 libpgcommon.a '/Users/sairam/work/postgresql-11.14/lib/libpgcommon.a'
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C port install
/bin/sh ../../config/install-sh -c -d '/Users/sairam/work/postgresql-11.14/lib'
/usr/bin/install -c -m 644 libpgport.a '/Users/sairam/work/postgresql-11.14/lib/libpgport.a'
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C timezone install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C ../../src/port all
make[3]: Nothing to be done for `all'.
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C ../../src/common all
make[3]: Nothing to be done for `all'.
/bin/sh ../../config/install-sh -c -d '/Users/sairam/work/postgresql-11.14/share'
./zic -d '/Users/sairam/work/postgresql-11.14/share/timezone' -p 'US/Eastern' -b fat ./data/tzdata.zi
make[2]: *** [install] Killed: 9
make[1]: *** [install-timezone-recurse] Error 2
make: *** [install-world-src-recurse] Error 2

It's not like it fails every time; sometimes the same script works just fine,
and sometimes it works after a few retries. I have also noticed that even when
it works without issue, the generated binaries don't work immediately: when I
try to query pg_config, the command waits for some time and gets killed. After
a few retries, it works from there onwards. I'm wondering what the issue could
be. I'm attaching the script; kindly go through it and help me understand the
issue.

Postgres Version: 11.14
Machine details: MacBook Pro (13-inch, M1, 2020), Version 12.4

Regards
G. Sai Ram

Attachment: install_pg.sh
Re: automatically generating node support functions
Peter Eisentraut writes:
> [ v6-0001-Automatically-generate-node-support-functions.patch ]

I've now spent some time looking at this fairly carefully, and I think
this is a direction we can pursue, but I'm not yet happy about the amount
of magic knowledge that's embedded in the gen_node_support.pl script
rather than being encoded in pg_node_attr markers.  Once this is in
place, people will stop thinking about the nodes/*funcs.c infrastructure
altogether when they write patches, at least until they get badly burned
by it; so I don't want there to be big gotchas.  As an example, heaven
help the future hacker who decides to change the contents of A_Const and
doesn't realize that that still has a manually-implemented copyfuncs.c
routine.

So rather than embedding knowledge in gen_node_support.pl like this:

	my @custom_copy = qw(A_Const Const ExtensibleNode);

I think we ought to put it into the *nodes.h headers as much as possible,
perhaps like this:

	typedef struct A_Const pg_node_attr(custom_copy)
	{
	...

I will grant that there are some things that are okay to embed in
gen_node_support.pl, such as the list of @scalar_types, because if you
need to add an entry there you will find it out when the script complains
it doesn't know how to process a field.  So there is some judgment
involved here, but on the whole I want to err on the side of exposing
decisions in the headers.

So I propose that we handle these things via struct-level pg_node_attr
markers, rather than node-type lists embedded in the script:

	abstract_types
	no_copy
	no_read_write
	no_read
	custom_copy
	custom_readwrite

(The markings that "we are not publishing right now to stay level with
the manual system" are fine to apply in the script, since that's probably
a temporary thing anyway.  Also, I don't have a problem with applying
no_copy etc to the contents of whole files in the script, rather than
tediously labeling each struct in such files.)

The hacks for scalar-copying EquivalenceClass*, EquivalenceMember*,
struct CustomPathMethods*, and CustomScan.methods should be replaced with
"pg_node_attr(copy_as_scalar)" labels on affected fields.

I wonder whether this:

	# We do not support copying Path trees, mainly
	# because the circular linkages between RelOptInfo
	# and Path nodes can't be handled easily in a
	# simple depth-first traversal.

couldn't be done better by inventing an inheritable no_copy attr to
attach to the Path supertype.  Or maybe it'd be okay to just
automatically inherit the no_xxx properties from the supertype?

I don't terribly like the ad-hoc mechanism for not comparing
CoercionForm fields.  OTOH, I am not sure whether replacing it with
per-field equal_ignore attrs would be better; there's at least an
argument that that invites bugs of omission.  But implementing this with
an uncommented test deep inside a script that most hackers should not
need to read is not good.  On the whole I'd lean towards the
equal_ignore route.

I'm confused by the "various field types to ignore" at the end of the
outfuncs/readfuncs code.  Do we really ignore those now?  How could that
be safe?  If it is safe, wouldn't it be better to handle that with
per-field pg_node_attrs?  Silently doing what might be the wrong thing
doesn't seem good.

In the department of nitpicks:

* copyfuncs.switch.c and equalfuncs.switch.c are missing trailing
newlines.

* pgindent is not very happy with a lot of your comments in *nodes.h.

* I think we should add explicit dependencies in backend/nodes/Makefile,
along the lines of

	copyfuncs.o: copyfuncs.c copyfuncs.funcs.c copyfuncs.switch.c

Otherwise the whole thing is a big gotcha for anyone not using
--enable-depend.

I don't know if you have time right now to push forward with these
points, but if you don't I can take a stab at it.  I would like to see
this done and committed PDQ, because 835d476fd already broke many patches
that touch *nodes.h and I'd like to get the rest of the fallout in place
before rebasing affected patches.

			regards, tom lane
Re: Temporary file access API
Antonin Houska wrote:
> Robert Haas wrote:
> > On Tue, Apr 12, 2022 at 5:30 AM Antonin Houska wrote:
> > > Robert Haas wrote:
> > > > On Mon, Apr 11, 2022 at 4:05 AM Antonin Houska wrote:
> > > > > There aren't really that many kinds of files to encrypt:
> > > > >
> > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data
> > > > >
> > > > > (And pg_stat/* files should be removed from the list.)
> > > >
> > > > This kind of gets into some theoretical questions. Like, do we think
> > > > that it's an information leak if people can look at how many
> > > > transactions are committing and aborting in pg_xact_status? In theory
> > > > it could be, but I know it's been argued that that's too much of a
> > > > side channel. I'm not sure I believe that, but it's arguable.
> > >
> > > I was referring to the fact that the statistics are no longer stored in
> > > files:
> > >
> > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd
> >
> > Oh, yeah, I agree with that.
>
> I see now that the statistics are still saved to a file on server shutdown.
> I've updated the wiki page.
>
> Attached is a new version of the patch, to evaluate what the API use in the
> backend could look like. I haven't touched places where the file is accessed
> in a non-trivial way, e.g. where lseek() / fseek() or pg_pwrite() /
> pg_pread() is called.

Rebased patch set is attached here, which applies to the current master.
(A few more opportunities for the new API are implemented here.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment: temp_file_api_v3.tgz
Re: doc: BRIN indexes and autosummarize
On 2022-Jun-28, Roberto Mello wrote:

> Here's a patch to clarify the BRIN indexes documentation, particularly with
> regards to autosummarize, vacuum and autovacuum. It basically breaks
> down a big blob of a paragraph into multiple paragraphs for clarity,
> plus explicitly tells how summarization happens manually or
> automatically.

[Some of] these additions are wrong, actually. It says that autovacuum will
not summarize new entries; but it does. If you just let the table sit idle,
any autovacuum run that cleans the table will also summarize any ranges that
need summarization. What 'autosummarize=off' means is that the behavior to
trigger an immediate summarization of a range once it becomes full is not
the default. This is very different.

As for the new <para>s that you added, I'd say they're stylistically wrong.
Each paragraph is supposed to be one fully contained idea; what these tags
do is split each idea across several smaller paragraphs. This is likely
subjective though.

> On this topic... I'm not familiar with the internals of BRIN
> indexes, and in backend/access/common/reloptions.c I see:
>
> 	{
> 		"autosummarize",
> 		"Enables automatic summarization on this BRIN index",
> 		RELOPT_KIND_BRIN,
> 		AccessExclusiveLock
> 	},
>
> Is the exclusive lock on the index why autosummarize is off by default?

No. The lock level mentioned here is what needs to be taken in order to
change the value of this option.

> What would be the downside (if any) of having autosummarize=on by default?

I'm not aware of any. Maybe we should turn it on by default.

--
Álvaro Herrera		PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: TAP output format in pg_regress
Peter Eisentraut writes:
> I'm not sure what to make of all these options. I think providing a TAP
> output for pg_regress is a good idea. But then do we still need the old
> output? Is it worth maintaining two output formats that display exactly
> the same thing in slightly different ways?

Probably is, because this is bad:

> ... The proposed default format now hides the
> fact that some tests are started in parallel. I remember the last time
> I wanted to tweak the output of the parallel tests, people were very
> attached to the particular timing and spacing of the current output. So
> I'm not sure people will like this.

and so is this:

> The timing output is very popular. Where is that in the TAP output?

Both of those things are fairly critical for test development. You need to
know what else might be running in parallel with a test case, and you need
to know whether you just bloated the runtime unreasonably.

More generally, I'm unhappy about the proposal that TAP should become the
default output. There is nothing particularly human-friendly about it,
whereas the existing format is something we have tuned to our liking over
literally decades. I don't mind if there's a way to get TAP when you're
actually intending to feed it into a TAP-parsing tool, but I am not a
TAP-parsing tool and I don't see why I should have to put up with it.

			regards, tom lane
Re: TAP output format in pg_regress
On 29.06.22 21:50, Daniel Gustafsson wrote:
> Attached is a new version of this patch, which completes the TAP output
> format option such that all codepaths emitting output are TAP compliant.
> The verbose option is fixed to not output the extraneous newlines which the
> previous PoC did. The output is made to conform to the original TAP spec,
> since v13/v14 TAP parsers seem less common than those that can handle the
> original spec. Support for the new format additions should be quite simple
> to add should we want that. Running pg_regress --verbose should give the
> current format output.
>
> I did end up combining TAP and --verbose into a single patch, as the TAP
> format sort of depends on the verbose flag since TAP has no verbose mode.
> I can split it into two separate patches should a reviewer prefer that.

I'm not sure what to make of all these options. I think providing a TAP
output for pg_regress is a good idea. But then do we still need the old
output? Is it worth maintaining two output formats that display exactly
the same thing in slightly different ways?

What is the purpose of the --verbose option? When and how is one supposed
to activate that?

The proposed default format now hides the fact that some tests are started
in parallel. I remember the last time I wanted to tweak the output of the
parallel tests, people were very attached to the particular timing and
spacing of the current output. So I'm not sure people will like this.

The timing output is very popular. Where is that in the TAP output?

More generally, what do you envision we do with this feature? Who is it
for, what are the tradeoffs, etc.?
Re: Handle infinite recursion in logical replication setup
> From d8f8844f877806527b6f3f45320b6ba55a8e3154 Mon Sep 17 00:00:00 2001
> From: Vigneshwaran C
> Date: Thu, 26 May 2022 19:29:33 +0530
> Subject: [PATCH v27 1/4] Add a missing test to verify only-local parameter
>  in test_decoding plugin.
>
> Add a missing test to verify only-local parameter in test_decoding plugin.

I don't get it. replorigin.sql already has some lines to test local-only.
What is your patch adding that is new? Maybe instead of adding some more
lines at the end of the script, you should add lines where this stuff is
already being tested. But that assumes that there is something new that is
being tested; if so, what is it?

--
Álvaro Herrera		PostgreSQL Developer — https://www.EnterpriseDB.com/
"Research is what I'm doing when I don't know what I'm doing"
(Wernher von Braun)
Re: Removing unneeded self joins
On Mon, Jul 4, 2022 at 6:52 AM Ronan Dunklau wrote:
>
> Le jeudi 30 juin 2022, 16:11:51 CEST Andrey Lepikhov a écrit :
> > On 19/5/2022 16:47, Ronan Dunklau wrote:
> > > I'll take a look at that one.
> >
> > New version of the patch, rebased on current master:
> > 1. pgindent over the patch has passed.
> > 2. The number of changed files is reduced.
> > 3. Some documentation and comments are added.
>
> Hello Andrey,
>
> Thanks for the updates.
>
> The general approach seems sensible to me, so I'm going to focus on some
> details.
>
> In a very recent thread [1], Tom Lane is proposing to add infrastructure
> to make Vars aware of their nullability by outer joins. I wonder if that
> would help with avoiding the need for adding IS NOT NULL clauses when the
> column is known not null?
> If we have a precedent for adding a Bitmapset to the Var itself, maybe the
> whole discussion regarding keeping track of nullability can be extended to
> the original column nullability?
>
> Also, I saw it was mentioned earlier in the thread, but how difficult
> would it be to process the transformed quals through the EquivalenceClass
> machinery and the qual simplification?
> For example, if the target audience of this patch is ORMs, or inlined
> views, it wouldn't surprise me to see queries of this kind in the wild,
> which could be avoided altogether:
>
> postgres=# explain (costs off) select * from sj s1 join sj s2 on s1.a =
> s2.a where s1.b = 2 and s2.b = 3;
>               QUERY PLAN
> ----------------------------------------------------
>  Seq Scan on sj s2
>    Filter: ((a IS NOT NULL) AND (b = 3) AND (b = 2))
> (2 lignes)
>
> +	for (counter = 0; counter < list_length(*sources);)
> +	{
> +		ListCell   *cell = list_nth_cell(*sources, counter);
> +		RestrictInfo *rinfo = castNode(RestrictInfo, lfirst(cell));
> +		int			counter1;
> +
> ...
> +		ec->ec_members = list_delete_cell(ec->ec_members, cell);
>
> Why don't you use the foreach() and foreach_delete_current() macros for
> iterating and removing items in the lists, both in update_ec_members and
> update_ec_sources?
>
> +	if ((bms_is_member(k, info->syn_lefthand) ^
> +		 bms_is_member(r, info->syn_lefthand)) ||
> +		(bms_is_member(k, info->syn_righthand) ^
> +		 bms_is_member(r, info->syn_righthand)))
>
> I think this is more compact and easier to follow than the previous
> version, but I'm not sure how common it is in postgres source code to use
> that kind of construction?
>
> Some review about the comments:
>
> I see you keep using the terms "the keeping relation" and "the removing
> relation" in reference to the relation that is kept and the one that is
> removed.
> Aside from the grammar (the kept relation or the removed relation), maybe
> it would make it clearer to call them something else. In other parts of
> the code, you used "the remaining relation / the removed relation" which
> makes sense.
>
>  /*
>   * Remove the target relid from the planner's data structures, having
> - * determined that there is no need to include it in the query.
> + * determined that there is no need to include it in the query. Or replace
> + * with another relid.
> + * To reusability, this routine can work in two modes: delete relid from a plan
> + * or replace it. It is used in replace mode in a self-join removing process.
>
> This could be rephrased: ", optionally replacing it with another relid.
> The latter is used by the self-join removing process."
>
> [1] https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us
>
> --
> Ronan Dunklau

Hi,

bq. this is more compact and easier to follow than the previous version

A code comment can be added above the expression (involving XOR) to explain
the purpose of the expression.

Cheers
Re: [PATCH] Compression dictionaries for JSONB
Hi Matthias,

> > Although there is also a high-level idea (according to the
> > presentations) to share common data between different TOASTed values,
> > similarly to what compression dictionaries do, by looking at the
> > current feedback and considering the overall complexity and the amount
> > of open questions (e.g. interaction with different TableAMs, etc), I
> > seriously doubt that this particular part of "pluggable TOASTer" will
> > end up in the core.
>
> Yes, and that's why I think that this is where this dictionary
> infrastructure could provide value, as an alternative or extension to
> the proposed jsonb toaster in the 'pluggable toaster' thread.

OK, I see your point now, and I think this is a very good point. Basing
"compression dictionaries" on the API provided by "pluggable TOASTer" can
also be less hacky than what I'm currently doing with the `typmod` argument.
I'm going to switch the implementation at some point, unless anyone objects
to the idea.

--
Best regards,
Aleksander Alekseev
Re: pgsql: Change timeline field of IDENTIFY_SYSTEM to int8
On 04.07.22 07:55, Tom Lane wrote:
> Peter Eisentraut writes:
>> Change timeline field of IDENTIFY_SYSTEM to int8
>
> Surely this patch is far from complete? To start with, just a few lines
> down in IdentifySystem() the column is filled using Int32GetDatum not
> Int64GetDatum. I will get some popcorn and await the opinions of the
> 32-bit buildfarm animals.
>
> But what about whatever code is reading the output? And what if that
> code isn't v16? I can't believe that we can make a wire protocol change
> as summarily as this.

I think a client will either just read the string value and convert it to
some numeric type without checking what type was actually sent, or if the
client API is type-aware and automatically converts to a native type of
some sort, then it will probably already support 64-bit ints. Do you see
some problem scenario?

I'm seeing a bigger problem now, which is that our client code doesn't
parse bigger-than-int32 timeline IDs correctly. libpqwalreceiver uses
pg_strtoint32(), which will error on overflow. pg_basebackup uses atoi(),
so it will just truncate the value, except for READ_REPLICATION_SLOT,
where it uses atol(), so it will do the wrong thing on Windows only.

There is clearly very little use for such near-overflow timeline IDs in
practice. But it still seems pretty inconsistent.
Re: Removing unneeded self joins
Le jeudi 30 juin 2022, 16:11:51 CEST Andrey Lepikhov a écrit :
> On 19/5/2022 16:47, Ronan Dunklau wrote:
> > I'll take a look at that one.
>
> New version of the patch, rebased on current master:
> 1. pgindent over the patch has passed.
> 2. The number of changed files is reduced.
> 3. Some documentation and comments are added.

Hello Andrey,

Thanks for the updates.

The general approach seems sensible to me, so I'm going to focus on some
details.

In a very recent thread [1], Tom Lane is proposing to add infrastructure to
make Vars aware of their nullability by outer joins. I wonder if that would
help with avoiding the need for adding IS NOT NULL clauses when the column
is known not null?

If we have a precedent for adding a Bitmapset to the Var itself, maybe the
whole discussion regarding keeping track of nullability can be extended to
the original column nullability?

Also, I saw it was mentioned earlier in the thread, but how difficult would
it be to process the transformed quals through the EquivalenceClass
machinery and the qual simplification?

For example, if the target audience of this patch is ORMs, or inlined views,
it wouldn't surprise me to see queries of this kind in the wild, which could
be avoided altogether:

postgres=# explain (costs off) select * from sj s1 join sj s2 on s1.a = s2.a
where s1.b = 2 and s2.b = 3;
              QUERY PLAN
----------------------------------------------------
 Seq Scan on sj s2
   Filter: ((a IS NOT NULL) AND (b = 3) AND (b = 2))
(2 lignes)

+	for (counter = 0; counter < list_length(*sources);)
+	{
+		ListCell   *cell = list_nth_cell(*sources, counter);
+		RestrictInfo *rinfo = castNode(RestrictInfo, lfirst(cell));
+		int			counter1;
+
...
+		ec->ec_members = list_delete_cell(ec->ec_members, cell);

Why don't you use the foreach() and foreach_delete_current() macros for
iterating and removing items in the lists, both in update_ec_members and
update_ec_sources?

+	if ((bms_is_member(k, info->syn_lefthand) ^
+		 bms_is_member(r, info->syn_lefthand)) ||
+		(bms_is_member(k, info->syn_righthand) ^
+		 bms_is_member(r, info->syn_righthand)))

I think this is more compact and easier to follow than the previous version,
but I'm not sure how common it is in postgres source code to use that kind
of construction?

Some review about the comments:

I see you keep using the terms "the keeping relation" and "the removing
relation" in reference to the relation that is kept and the one that is
removed. Aside from the grammar (the kept relation or the removed relation),
maybe it would make it clearer to call them something else. In other parts
of the code, you used "the remaining relation / the removed relation" which
makes sense.

 /*
  * Remove the target relid from the planner's data structures, having
- * determined that there is no need to include it in the query.
+ * determined that there is no need to include it in the query. Or replace
+ * with another relid.
+ * To reusability, this routine can work in two modes: delete relid from a plan
+ * or replace it. It is used in replace mode in a self-join removing process.

This could be rephrased: ", optionally replacing it with another relid. The
latter is used by the self-join removing process."

[1] https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us

--
Ronan Dunklau
Re: list of TransactionIds
On 2022-May-16, Amit Kapila wrote:
> On Sun, May 15, 2022 at 5:05 PM Alvaro Herrera wrote:
> > I hesitate to add this the day just before beta. This is already in
> > pg14, so maybe it's not a big deal if pg15 remains the same for the
> > time being. Or we can change it for beta2. Or we could just punt until
> > pg16. Any preferences?
>
> I prefer to do this for pg16 unless we see some bug due to this.

Pushed now, to master only.

--
Álvaro Herrera		Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Minimal logical decoding on standbys
On Mon, Jul 4, 2022 at 6:12 PM Drouvot, Bertrand wrote:
> Hi,
>
> On 7/1/22 10:03 PM, Ibrar Ahmed wrote:
> > On Thu, Jun 30, 2022 at 1:49 PM Drouvot, Bertrand wrote:
> > > I'm going to re-create a CF entry for it, as:
> > >
> > > - It seems there is a clear interest for the feature (given the time
> > >   already spent on it and the number of people that worked on it)
> > >
> > > - I've in mind to resume working on it
> >
> > I have already done some research on that, I can definitely look at it.
>
> Thanks!
>
> This feature proposal is currently made of 5 sub-patches:
>
> 0001: Add info in WAL records in preparation for logical slot conflict handling
> 0002: Handle logical slot conflicts on standby
> 0003: Allow logical decoding on standby
> 0004: New TAP test for logical decoding on standby
> 0005: Doc changes describing details about logical decoding
>
> I suggest that we focus on one sub-patch at a time.
>
> I'll start with 0001 and come back with a rebase addressing Andres and
> Robert's previous comments.
>
> Sounds good to you?
>
> Thanks
>
> --
> Bertrand Drouvot
> Amazon Web Services: https://aws.amazon.com

That's great. I am looking at "0002: Handle logical slot conflicts on
standby".

--
Ibrar Ahmed
Re: Patch proposal: New hooks in the connection path
On Mon, Jul 4, 2022 at 5:54 AM Drouvot, Bertrand wrote: > Hi, > > On 7/2/22 1:00 AM, Nathan Bossart wrote: > > Could we model this after fmgr_hook? The first argument in that hook > > indicates where it is being called from. This doesn't alleviate the need > > for several calls to the hook in the authentication logic, but extension > > authors would only need to define one hook. > > I like the idea and indeed fmgr.h looks like a good place to model it. > > Attached a new patch version doing so. > > Thanks > > -- > > Bertrand Drouvot > Amazon Web Services: https://aws.amazon.com Hi, + FCET_SPT, /* startup packet timeout */ + FCET_BSP, /* bad startup packet */ Looking at existing enum types, such as FmgrHookEventType, the part after the underscore is a word. I think it would be good to follow existing practice and make the enums more readable. Cheers
Re: [PATCH] Compression dictionaries for JSONB
Hi Alexander, On Fri, 17 Jun 2022 at 17:04, Aleksander Alekseev wrote: >> These are just my initial thoughts I would like to share though. I may >> change my mind after diving deeper into a "pluggable TOASTer" patch. > > I familiarized myself with the "pluggable TOASTer" thread and joined > the discussion [1]. > > I'm afraid so far I failed to understand your suggestion to base > "compression dictionaries" patch on "pluggable TOASTer", considering > the fair amount of push-back it got from the community, not to mention > a somewhat raw state of the patchset. It's true that Teodor and I are > trying to address similar problems. This however doesn't mean that > there should be a dependency between these patches. The reason I think this is better implemented as a pluggable toaster is because casts are necessarily opaque and require O(sizeofdata) copies or processing. The toaster infrastructure that is proposed in [0] seems to improve on the O(sizeofdata) requirement for toast, but that will not work with casts. > Also, I completely agree with Tomas [2]: > >> My main point is that we should not be making too many radical >> changes at once - it makes it much harder to actually get anything done. > > IMO the patches don't depend on each other but rather complement each > other. The user can switch between different TOAST methods, and the > compression dictionaries can work on top of different TOAST methods. I don't think that is possible (or at least, not as performant). To treat type X' as type X and use it as a stored medium instead, you must have either the whole binary representation of X, or have access to the internals of type X. 
I find it difficult to believe that casts can be done without a full detoast (or otherwise without deep knowledge about internal structure of the data type such as 'type A is binary compatible with type X'), and as such I think this feature 'compression dictionaries' is competing with the 'pluggable toaster' feature, if the one is used on top of the other. That is, the dictionary is still created like in the proposed patches (though preferably without the 64-byte NAMELEN limit), but the usage will be through "TOASTER my_dict_enabled_toaster". Additionally, I don't think we've ever accepted two different implementations of the same concept, at least not without first having good arguments why both competing implementations have obvious benefits over the other, and both implementations being incompatible. > Although there is also a high-level idea (according to the > presentations) to share common data between different TOASTed values, > similarly to what compression dictionaries do, by looking at the > current feedback and considering the overall complexity and the amount > of open questions (e.g. interaction with different TableAMs, etc), I > seriously doubt that this particular part of "pluggable TOASTer" will > end-up in the core. Yes, and that's why I think that this where this dictionary infrastructure could provide value, as an alternative or extension to the proposed jsonb toaster in the 'pluggable toaster' thread. Kind regards, Matthias van de Meent
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada wrote: > > I've attached three POC patches: > I think it will be a good idea if you can add a short commit message at least to say which patch is proposed for HEAD and which one is for back branches. Also, it would be good if you can add some description of the fix in the commit message. Let's remove poc* from the patch name.

Review of poc_add_running_catchanges_xacts_to_serialized_snapshot:

1.
+ /*
+ * Array of transactions that were running when the snapshot serialization
+ * and changed system catalogs,

The part of the sentence after serialization is not very clear.

2.
- if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+ if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid) ||
+ bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
+ sizeof(TransactionId), xidComparator) != NULL)

Why are you using xid instead of subxid in the bsearch call? Can we add a comment to say why it is okay to use xid if there is a valid reason? But note, we are using subxid to add to the committed xact array so not sure if this is a good idea but I might be missing something.

Suggestions for improvement in comments:

- /*
-* Update the transactions that are running and changes catalogs that are
-* not committed.
-*/
+ /* Update the catalog modifying transactions that are yet not committed. */
 if (builder->catchanges.xip)
 pfree(builder->catchanges.xip);
 builder->catchanges.xip = ReorderBufferGetCatalogChangesXacts(builder->reorder,
@@ -1647,7 +1644,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
 ondisk_c += sz;
- /* copy catalog-changes xacts */
+ /* copy catalog modifying xacts */
 sz = sizeof(TransactionId) * builder->catchanges.xcnt;
 memcpy(ondisk_c, builder->catchanges.xip, sz);
 COMP_CRC32C(ondisk->checksum, ondisk_c, sz);

-- With Regards, Amit Kapila.
Re: Re-order "disable_on_error" in tab-complete COMPLETE_WITH
On Mon, Jul 4, 2022, at 5:37 AM, Amit Kapila wrote: > Yeah, it seems we have overlooked this point. I think we can do this > just for HEAD but as the feature is introduced in PG-15 so there is no > harm in pushing it to PG-15 as well especially because it is a > straightforward change. What do you or others think? No objection. It is a good thing for future backpatches. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: automatically generating node support functions
On 03.07.22 21:14, Tom Lane wrote: Peter Eisentraut writes: Here is a patch that reformats the relevant (and a few more) comments that way. This has been run through pgindent, so the formatting should be stable. Now that that's been pushed, the main patch is of course quite broken. Are you working on a rebase? attached * AFAICT, the infrastructure for removing the generated files at "make *clean" is incomplete. I have fixed all the makefiles per your suggestions. and something similar to this for the "clean" rule: # fmgroids.h, fmgrprotos.h, fmgrtab.c, fmgr-stamp, and errcodes.h are in the # distribution tarball, so they are not cleaned here. Except this one, since there is no clean rule. I think seeing that files are listed under a maintainer-clean target conveys that same message. Also, I share David's upthread allergy to the option names "path_hackN" and to documenting those only inside the conversion script. I'll look into that again. BTW, I think this: "Unknown attributes are ignored" is a seriously bad idea; it will allow typos to escape detection. good pointFrom 0ad86183662b6637c9ec9a29374fecb4d11202e2 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 4 Jul 2022 14:17:02 +0200 Subject: [PATCH v6] Automatically generate node support functions Add a script to automatically generate the node support functions (copy, equal, out, and read, as well as the node tags enum) from the struct definitions. For each of the four node support files, it creates two include files, e.g., copyfuncs.funcs.c and copyfuncs.switch.c, to include in the main file. All the scaffolding of the main file stays in place. TODO: In this patch, I have only ifdef'ed out the code to could be removed, mainly so that it won't constantly have merge conflicts. Eventually, that should all be changed to delete the code. All the code comments that are worth keeping from those sections have already been moved to the header files where the structs are defined. 
I have tried to mostly make the coverage of the output match what is currently there. For example, one could now do out/read coverage of utility statement nodes, but I have manually excluded those for now. The reason is mainly that it's easier to diff the before and after, and adding a bunch of stuff like this might require a separate analysis and review. Subtyping (TidScan -> Scan) is supported. For the hard cases, you can just write a manual function and exclude generating one. For the not so hard cases, there is a way of annotating struct fields to get special behaviors. For example, pg_node_attr(equal_ignore) has the field ignored in equal functions. Discussion: https://www.postgresql.org/message-id/flat/c1097590-a6a4-486a-64b1-e1f9cc0533ce%40enterprisedb.com --- src/backend/Makefile | 10 +- src/backend/nodes/.gitignore | 4 + src/backend/nodes/Makefile| 54 ++ src/backend/nodes/copyfuncs.c | 19 +- src/backend/nodes/equalfuncs.c| 22 +- src/backend/nodes/gen_node_support.pl | 729 ++ src/backend/nodes/outfuncs.c | 34 +- src/backend/nodes/readfuncs.c | 23 +- src/include/Makefile | 1 + src/include/nodes/.gitignore | 2 + src/include/nodes/nodes.h | 27 + src/include/nodes/parsenodes.h| 4 +- src/include/nodes/pathnodes.h | 175 --- src/include/nodes/plannodes.h | 90 ++-- src/include/nodes/primnodes.h | 34 +- src/include/utils/rel.h | 6 +- src/tools/msvc/Solution.pm| 46 ++ 17 files changed, 1124 insertions(+), 156 deletions(-) create mode 100644 src/backend/nodes/.gitignore create mode 100644 src/backend/nodes/gen_node_support.pl create mode 100644 src/include/nodes/.gitignore diff --git a/src/backend/Makefile b/src/backend/Makefile index 4a02006788..953c80db5a 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -143,11 +143,15 @@ storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lw submake-catalog-headers: $(MAKE) -C catalog distprep generated-header-symlinks +# run this unconditionally to avoid needing to know its dependencies 
here: +submake-nodes-headers: + $(MAKE) -C nodes distprep generated-header-symlinks + # run this unconditionally to avoid needing to know its dependencies here: submake-utils-headers: $(MAKE) -C utils distprep generated-header-symlinks -.PHONY: submake-catalog-headers submake-utils-headers +.PHONY: submake-catalog-headers submake-nodes-headers submake-utils-headers # Make symlinks for these headers in the include directory. That way # we can cut down on the -I options. Also, a symlink is automatically @@ -162,7 +166,7 @@ submake-utils-headers: .PHONY: generated-headers -generated-headers: $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/storage/lwlocknames.h submake-catalog-headers
Re: generate_series for timestamptz and time zone problem
Przemysław Sztoch wrote on 01.07.2022 15:43: Gurjeet Singh wrote on 01.07.2022 06:35: On Tue, Jun 21, 2022 at 7:56 AM Przemysław Sztoch wrote: Please give me feedback on how to properly store the timezone name in the function context structure. I can't finish my work without it. The way I see it, I don't think you need to store the tz-name in the function context structure, like you're currently doing. I think you can remove the additional member from the generate_series_timestamptz_fctx struct, and refactor your code in generate_series_timestamptz() to work without it; you seem to be using the tzname member almost as a boolean flag, because the actual value you pass to DFCall3() can be calculated without first storing anything in the struct. Do I understand correctly that functions that return SET are executed multiple times? Is access to arguments available all the time? I thought PG_GETARG_ could only be used when SRF_IS_FIRSTCALL() is true - was I right or wrong? Dear Gurjeet, I thought a bit after riding the bikes and the code repaired itself. :-) Thanks for the clarification. Please check if patch v5 is satisfactory for you. Can you please explain why you chose to remove the provolatile attribute from the existing entry of date_trunc in pg_proc.dat? I believe it was a mistake in PG code. All timestamptz functions must be STABLE as they depend on the current SHOW timezone. If new functions are created that pass the zone as a parameter, they become IMMUTABLE. The first date_trunc function implementation was without a time zone parameter, and someone who added the second variant (with timezone as parameter) copied the definition without removing the STABLE flag. Have I convinced everyone that this change is right? I assume I'm right and the mistake will be fatal. -- Przemysław Sztoch | Mobile +48 509 99 00 66
Re: EINTR in ftruncate()
On 2022-Jul-01, Andres Freund wrote: > On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote: > > On 2022-Jul-01, Andres Freund wrote: > > > What is the reason for the || ProcDiePending || QueryCancelPending bit? > > > What > > > if there's dsm operations intentionally done while QueryCancelPending? > > > > That mirrors the test for the other block in that function, which was > > added by 63efab4ca139, whose commit message explains: > That whole approach seems quite wrong to me. At the absolute very least the > code needs to check if interrupts are being processed in the current context > before just giving up due to ProcDiePending || QueryCancelPending. For the time being, I can just push the addition of the EINTR retry without testing ProcDiePending || QueryCancelPending. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Sweat is the best cure for a sick thought" (Bardia) From b7591beb919c3cfaa8090bda3977b7127de8de28 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 1 Jul 2022 17:16:33 +0200 Subject: [PATCH v3] retry ftruncate --- src/backend/storage/ipc/dsm_impl.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c index 82b7978aeb..145a02204b 100644 --- a/src/backend/storage/ipc/dsm_impl.c +++ b/src/backend/storage/ipc/dsm_impl.c @@ -417,8 +417,19 @@ dsm_impl_posix_resize(int fd, off_t size) { int rc; - /* Truncate (or extend) the file to the requested size. */ - rc = ftruncate(fd, size); + /* + * Truncate (or extend) the file to the requested size. If we're + * interrupted by a signal, retry; have caller handle any other error. + */ + for (;;) + { + errno = 0; + rc = ftruncate(fd, size); + if (rc == 0) + break; + if (errno != EINTR) + return rc; + } /* * On Linux, a shm_open fd is backed by a tmpfs file. After resizing with -- 2.30.2
Re: generate_series for timestamptz and time zone problem
Tom Lane wrote on 04.07.2022 00:31: Przemysław Sztoch writes: I have a problem with this: -- Considering only built-in procs (prolang = 12), look for multiple uses -- of the same internal function (ie, matching prosrc fields). It's OK to -- have several entries with different pronames for the same internal function, -- but conflicts in the number of arguments and other critical items should -- be complained of. (We don't check data types here; see next query.) It's telling you you're violating project style. Don't make multiple pg_proc entries point at the same C function and then use PG_NARGS to disambiguate; instead point at two separate functions. The functions can share code at the next level down, if they want. (Just looking at the patch, though, I wonder if sharing code is really beneficial in this case. It seems quite messy, and I wouldn't be surprised if it hurts performance in the existing case.) You also need to expend some more effort on refactoring code, to eliminate silliness like looking up the timezone name each time through the SRF. That's got to be pretty awful performance-wise. regards, tom lane Thx. Code is refactored. It is better now. -- Przemysław Sztoch | Mobile +48 509 99 00 66 diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index f70f829d83..ee1dee7c84 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -69,6 +69,7 @@ typedef struct TimestampTz finish; Interval step; int step_sign; + pg_tz *attimezone; } generate_series_timestamptz_fctx; @@ -547,6 +548,48 @@ parse_sane_timezone(struct pg_tm *tm, text *zone) return tz; } +static pg_tz * +lookup_timezone(text *zone) +{ + char tzname[TZ_STRLEN_MAX + 1]; + char *lowzone; + int type, + val; + pg_tz *tzp; + /* +* Look up the requested timezone (see notes in timestamptz_zone()).
+*/ + text_to_cstring_buffer(zone, tzname, sizeof(tzname)); + + /* DecodeTimezoneAbbrev requires lowercase input */ + lowzone = downcase_truncate_identifier(tzname, + strlen(tzname), + false); + + type = DecodeTimezoneAbbrev(0, lowzone, &val, &tzp); + + if (type == TZ || type == DTZ) + { + /* fixed-offset abbreviation, get a pg_tz descriptor for that */ + tzp = pg_tzset_offset(-val); + } + else if (type == DYNTZ) + { + /* dynamic-offset abbreviation, use its referenced timezone */ + } + else + { + /* try it as a full zone name */ + tzp = pg_tzset(tzname); + if (!tzp) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("time zone \"%s\" not recognized", tzname))); + } + + return tzp; +} + /* * make_timestamp_internal * workhorse for make_timestamp and make_timestamptz @@ -2989,97 +3032,107 @@ timestamp_mi_interval(PG_FUNCTION_ARGS) } -/* timestamptz_pl_interval() - * Add an interval to a timestamp with time zone data type. - * Note that interval has provisions for qualitative year/month +/* + * Note that interval has provisions for qualitative year/month and day * units, so try to do the right thing with them. * To add a month, increment the month, and use the same day of month. * Then, if the next month has fewer days, set the day of month * to the last day of month. + * To add a day, increment the mday, and use the same time of day. * Lastly, add in the "quantitative time".
*/ +static TimestampTz +timestamptz_pl_interval_internal(TimestampTz timestamp, Interval *span, pg_tz *attimezone) +{ + int tz; + + /* Use session timezone if caller asks for default */ + if (attimezone == NULL) + attimezone = session_timezone; + + if (span->month != 0) + { + struct pg_tm tt, + *tm = &tt; + fsec_t fsec; + + if (timestamp2tm(timestamp, &tz, tm, &fsec, NULL, attimezone) != 0) + ereport(ERROR, + (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), +errmsg("timestamp out of range"))); + + tm->tm_mon += span->month; + if (tm->tm_mon > MONTHS_PER_YEAR) + { + tm->tm_year += (tm->tm_mon - 1) / MONTHS_PER_YEAR; + tm->tm_mon = ((tm->tm_mon - 1) % MONTHS_PER_YEAR) +
Re: Error from the foreign RDBMS on a foreign table I have no privilege on
On Fri, Jun 10, 2022 at 6:17 PM Laurenz Albe wrote: > On Fri, 2022-06-10 at 17:17 +0900, Etsuro Fujita wrote: > > > I am not sure if it worth adding to the documentation. I would never > > > have thought > > > of the problem if Phil hadn't brought it up. On the other hand, I was > > > surprised > > > to learn that permissions aren't checked until the executor kicks in. > > > It makes sense, but some documentation might help others in that > > > situation. > > > > +1 for adding such a document. > > > > > I'll gladly leave the decision to your judgement as a committer. > > > > IIRC, there are no reports about this from the postgres_fdw users, so > > my inclination would be to leave the documentation alone, for now. > > I understand that you are for documenting the timing of permission checks, > but not in the postgres_fdw documentation. Yes, I think so. > However, this is the only occasion > where the user might notice unexpected behavior on account of the timing of > permission checks. Other than that, I consider this below the threshold for > user-facing documentation. I think PREPARE/EXECUTE have a similar issue: postgres=# create table t1 (a int, b int); CREATE TABLE postgres=# create user foouser; CREATE ROLE postgres=# set role foouser; SET postgres=> prepare fooplan (int, int) as insert into t1 values ($1, $2); PREPARE postgres=> execute fooplan (, ); ERROR: permission denied for table t1 The user foouser is allowed to PREPARE the insert statement, without the insert privilege on the table t1, as the permission check is delayed until EXECUTE. So I thought it would be good to add a note about the timing to the documentation about the Postgres core, such as arch-dev.sgml (the "Overview of PostgreSQL Internals" chapter). But as far as I know, there aren’t any reports on the PREPARE/EXECUTE behavior, either, so there might be less need to do so, I think. Thanks for the discussion! Sorry for the delay. Best regards, Etsuro Fujita
Re: pgsql: dshash: Add sequential scan support.
On Sun, Jul 3, 2022 at 7:56 PM Thomas Munro wrote: > [Re-directing to -hackers] > > On Fri, Mar 11, 2022 at 2:27 PM Andres Freund wrote: > > On 2022-03-10 20:09:56 -0500, Tom Lane wrote: > > > Andres Freund writes: > > > > dshash: Add sequential scan support. > > > > Add ability to scan all entries sequentially to dshash. The > interface is > > > > similar but a bit different both from that of dynahash and simple > dshash > > > > search functions. The most significant difference is that dshash's > interface > > > > always needs a call to dshash_seq_term when scan ends. > > > > > > Umm ... what about error recovery? Or have you just cemented the > > > proposition that long-lived dshashes are unsafe? > > > > I don't think this commit made it worse. dshash_seq_term() releases an > lwlock > > (which will be released in case of an error) and unsets > > hash_table->find_[exclusively_]locked. The latter weren't introduced by > this > > patch, and are also set by dshash_find(). > > > > I agree that ->find_[exclusively_]locked are problematic from an error > > recovery perspective. > > Right, as seen in the build farm at [1]. Also reproducible with something > like: > > @@ -269,6 +269,14 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size > request_size, > return false; > } > > + /* XXX random fault injection */ + if (op == DSM_OP_ATTACH && random() < RAND_MAX / 8) + { + close(fd); > + elog(ERROR, "chaos"); > + return false; > + } > + > > I must have thought that it was easy and practical to write no-throw > straight-line code and be sure to reach dshash_release_lock(), but I > concede that it was a bad idea: even dsa_get_address() can throw*, and > you're often likely to need to call that while accessing dshash > elements.
For example, in lookup_rowtype_tupdesc_internal(), there is > a sequence dshash_find(), ..., dsa_get_address(), ..., > dshash_release_lock(), and I must have considered the range of code > between find and release to be no-throw, but now I know that it is > not. > > > It's per-backend state at least and just used for assertions. We could > remove > > it. Or stop checking it in places where it could be set wrongly: > dshash_find() > > and dshash_detach() couldn't check anymore, but the rest of the > assertions > > would still be valid afaics? > > Yeah, it's all for assertions... let's just remove it. Those > assertions were useful to me at some stage in development but won't > hold as well as I thought, at least without widespread PG_FINALLY(), > which wouldn't be nice. > > *dsa_get_address() might need to adjust the memory map with system > calls, which might fail. If you think of DSA as not only an allocator > but also a poor man's user level virtual memory scheme to tide us over > until we get threads, then this is a pretty low level kind of > should-not-happen failure that is analogous on some level to SIGBUS or > SIGSEGV or something like that, and we should PANIC. Then we could > claim that dsa_get_address() is no-throw. At least, that was one > argument I had with myself while investigating that strange Solaris > shm_open() failure, but ... I lost the argument. It's quite an > extreme position to take just to support these assertions, which are > of pretty limited value. > > [1] > https://www.postgresql.org/message-id/20220701232009.jcwxpl45bptaxv5n%40alap3.anarazel.de Hi, In the description, `new shared memory stats system in 15` It would be clearer to add `release` before `15`. Cheers
Re: Handle infinite recursion in logical replication setup
On Sun, Jul 3, 2022 at 8:29 PM vignesh C wrote: > Review comments ===

1.
@@ -530,7 +557,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
- SUBOPT_DISABLE_ON_ERR);
+ SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
 parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 /*
@@ -606,6 +633,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 LOGICALREP_TWOPHASE_STATE_PENDING :
 LOGICALREP_TWOPHASE_STATE_DISABLED);
 values[Anum_pg_subscription_subdisableonerr - 1] = BoolGetDatum(opts.disableonerr);
+ values[Anum_pg_subscription_suborigin - 1] =
+ CStringGetTextDatum(opts.origin);
 values[Anum_pg_subscription_subconninfo - 1] = CStringGetTextDatum(conninfo);
 if (opts.slot_name)
...
...
 /* List of publications subscribed to */
 text subpublications[1] BKI_FORCE_NOT_NULL;
+
+ /* Only publish data originating from the specified origin */
+ text suborigin BKI_DEFAULT(LOGICALREP_ORIGIN_ANY);
 #endif
 } FormData_pg_subscription;

The order of declaration and assignment for 'suborigin' should match in the above usage.

2. Similarly the changes in GetSubscription() should also match the declaration of the origin column.

3.
 GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
- subbinary, substream, subtwophasestate, subdisableonerr, subslotname,
- subsynccommit, subpublications)
+ subbinary, substream, subtwophasestate, subdisableonerr,
+ suborigin, subslotname, subsynccommit, subpublications)
 ON pg_subscription TO public;

This should also match the order of columns as in pg_subscription.h unless there is a reason for not doing so.

4.
+ /*
+ * Even though "origin" parameter allows only "local" and "any"
+ * values, it is implemented as a string type so that the parameter
+ * can be extended in future versions to support filtering using
+ * origin names specified by the user.
/Even though "origin" .../Even though the "origin" parameter ... 5. + +# Create tables on node_A +$node_A->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)"); + +# Create the same tables on node_B +$node_B->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)"); In both the above comments, you should use table instead of tables as the test creates only one table. 6. +$node_A->safe_psql('postgres', "DELETE FROM tab_full;"); After this, the test didn't ensure that this operation is replicated. Can't that lead to unpredictable results for the other tests after this test? 7. +# Setup logical replication +# node_C (pub) -> node_B (sub) +my $node_C_connstr = $node_C->connstr . ' dbname=postgres'; +$node_C->safe_psql('postgres', + "CREATE PUBLICATION tap_pub_C FOR TABLE tab_full"); + +my $appname_B2 = 'tap_sub_B2'; +$node_B->safe_psql( + 'postgres', " + CREATE SUBSCRIPTION tap_sub_B2 + CONNECTION '$node_C_connstr application_name=$appname_B2' + PUBLICATION tap_pub_C + WITH (origin = local)"); + +$node_C->wait_for_catchup($appname_B2); + +$node_C->poll_query_until('postgres', $synced_query) + or die "Timed out while waiting for subscriber to synchronize data"; This test allows the publisher (node_C) to poll for sync but it should be the subscriber (node_B) that needs to poll to allow the initial sync to finish. 8. Do you think it makes sense to see if this new option can also be supported by pg_recvlogical? I see that previously we have not extended pg_recvlogical for all the newly added options but I feel we should keep pg_recvlogical up to date w.r.t new options. We can do this as a separate patch if we agree? -- With Regards, Amit Kapila.
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
On 2022-Jul-04, Alvaro Herrera wrote: > BTW I have a patch for the problem with uniqviol also (not fixed by v7). I'll > send an updated patch in a little while. Here it is. I ran "libpq_pipeline uniqviol" in a tight loop a few thousand times and didn't get any error. Before these fixes, it would fail in half a dozen iterations. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ From d4d446531d52a34115fe7446732d61ae8f61d8bb Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 1 Jul 2022 18:11:10 +0200 Subject: [PATCH v8] libpq: Improve idle state handling in pipeline mode We were going into IDLE state too soon when executing queries via PQsendQuery in pipeline mode, causing several scenarios to misbehave in different ways -- most notably, as reported by Daniele Varrazzo, that a warning message is produced by libpq: message type 0x33 arrived from server while idle But it is also possible, if queries are sent and results consumed not in lockstep, for the expected mediating NULL result values from PQgetResult to be lost (a problem which has not been reported, but which is more serious). Fix this by introducing two new concepts: one is a command queue element PGQUERY_CLOSE to tell libpq to wait for the CloseComplete server response to the Close message that is sent by PQsendQuery. Because the application is not expecting any PGresult from this, the mechanism to consume it is a bit hackish. The other concept, authored by Horiguchi-san, is a PGASYNC_PIPELINE_IDLE state for libpq's state machine to differentiate "really idle" from merely "the idle state that occurs in between reading results from the server for elements in the pipeline". This makes libpq not go fully IDLE when the libpq command queue contains entries; in normal cases, we only go IDLE once at the end of the pipeline, when the server response to the final SYNC message is received.
(However, there are corner cases it doesn't fix, such as terminating the query sequence by PQsendFlushRequest instead of PQpipelineSync; this sort of scenario is what requires PGQUERY_CLOSE bit above.) This last bit helps make the libpq state machine clearer; in particular we can get rid of an ugly hack in pqParseInput3 to avoid considering IDLE as such when the command queue contains entries. A new test mode is added to libpq_pipeline.c to tickle some related problematic cases. Reported-by: Daniele Varrazzo Co-authored-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/ca+mi_8bvd0_cw3sumgwpvwdnzxy32itog_16tdyru_1s2gv...@mail.gmail.com --- src/interfaces/libpq/fe-exec.c| 116 -- src/interfaces/libpq/fe-protocol3.c | 30 +-- src/interfaces/libpq/libpq-int.h | 6 +- .../modules/libpq_pipeline/libpq_pipeline.c | 215 +- .../libpq_pipeline/t/001_libpq_pipeline.pl| 3 +- .../libpq_pipeline/traces/pipeline_idle.trace | 93 6 files changed, 425 insertions(+), 38 deletions(-) create mode 100644 src/test/modules/libpq_pipeline/traces/pipeline_idle.trace diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index 4180683194..e22d0814f0 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -1279,7 +1279,8 @@ pqAppendCmdQueueEntry(PGconn *conn, PGcmdQueueEntry *entry) * itself consume commands from the queue; if we're in any other * state, we don't have to do anything. 
*/ - if (conn->asyncStatus == PGASYNC_IDLE) + if (conn->asyncStatus == PGASYNC_IDLE || +conn->asyncStatus == PGASYNC_PIPELINE_IDLE) { resetPQExpBuffer(&conn->errorMessage); pqPipelineProcessQueue(conn); @@ -1338,6 +1339,7 @@ static int PQsendQueryInternal(PGconn *conn, const char *query, bool newQuery) { PGcmdQueueEntry *entry = NULL; + PGcmdQueueEntry *entry2 = NULL; if (!PQsendQueryStart(conn, newQuery)) return 0; @@ -1353,6 +1355,12 @@ PQsendQueryInternal(PGconn *conn, const char *query, bool newQuery) entry = pqAllocCmdQueueEntry(conn); if (entry == NULL) return 0;/* error msg already set */ + if (conn->pipelineStatus != PQ_PIPELINE_OFF) + { + entry2 = pqAllocCmdQueueEntry(conn); + if (entry2 == NULL) + goto sendFailed; + } /* Send the query message(s) */ if (conn->pipelineStatus == PQ_PIPELINE_OFF) @@ -1422,6 +1430,20 @@ PQsendQueryInternal(PGconn *conn, const char *query, bool newQuery) /* OK, it's launched! */ pqAppendCmdQueueEntry(conn, entry); + + /* + * When pipeline mode is in use, we need a second entry in the command + * queue to represent Close Portal message. This allows us later to wait + * for the CloseComplete message to be received before getting in IDLE + * state. + */ + if (conn->pipelineStatus != PQ_PIPELINE_OFF) + { + entry2->queryclass = PGQUERY_CLOSE; + entry2->query = NULL; + pqAppendCmdQueueEntry(conn, entry2); + } + + return 1; sendFailed: @@ -1667,11 +1689,13 @@ PQsendQueryStart(PGconn *conn, bool newQuery) switch
Add last_vacuum_index_scans in pg_stat_all_tables
Hi hackers, I think having the number of index scans of the last vacuum in pg_stat_all_tables can be helpful. This value shows how efficiently vacuums have performed and can be an indicator to increase maintenance_work_mem. It was proposed previously[1], but it was not accepted due to the limitations of the stats collector. Statistics are now stored in shared memory, so we have more room to store statistics. I think this statistic is still valuable for some people, so I am proposing this again. Best wishes, -- Ken Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION [1] https://www.postgresql.org/message-id/20171010.192616.108347483.horiguchi.kyotaro%40lab.ntt.co.jp
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 4549c2560e..bd7bfa7d9d 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4557,6 +4557,16 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i daemon + + + + last_vacuum_index_scans bigint + + + Number of index scans performed during the last vacuum + on this table + +
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index b802ed247e..bf47d6837d 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -625,7 +625,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params, rel->rd_rel->relisshared, Max(vacrel->new_live_tuples, 0), vacrel->recently_dead_tuples + - vacrel->missed_dead_tuples); + vacrel->missed_dead_tuples, + vacrel->num_index_scans); pgstat_progress_end_command(); if (instrument)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index fedaed533b..475551d9b3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -676,7 +676,9 @@ CREATE VIEW pg_stat_all_tables AS pg_stat_get_vacuum_count(C.oid) AS vacuum_count, pg_stat_get_autovacuum_count(C.oid) AS
autovacuum_count, pg_stat_get_analyze_count(C.oid) AS analyze_count, -pg_stat_get_autoanalyze_count(C.oid) AS autoanalyze_count +pg_stat_get_autoanalyze_count(C.oid) AS autoanalyze_count, +pg_stat_get_last_vacuum_index_scans(C.oid) AS last_vacuum_index_scans + FROM pg_class C LEFT JOIN pg_index I ON C.oid = I.indrelid LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c index a846d9ffb6..ffc9daf944 100644 --- a/src/backend/utils/activity/pgstat_relation.c +++ b/src/backend/utils/activity/pgstat_relation.c @@ -208,8 +208,8 @@ pgstat_drop_relation(Relation rel) * Report that the table was just vacuumed. */ void -pgstat_report_vacuum(Oid tableoid, bool shared, - PgStat_Counter livetuples, PgStat_Counter deadtuples) +pgstat_report_vacuum(Oid tableoid, bool shared, PgStat_Counter livetuples, + PgStat_Counter deadtuples, PgStat_Counter num_index_scans) { PgStat_EntryRef *entry_ref; PgStatShared_Relation *shtabentry; @@ -232,6 +232,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared, tabentry->n_live_tuples = livetuples; tabentry->n_dead_tuples = deadtuples; + tabentry->n_index_scans = num_index_scans; /* * It is quite possible that a non-aggressive VACUUM ended up skipping diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 893690dad5..57d1de52e6 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -179,6 +179,20 @@ pg_stat_get_dead_tuples(PG_FUNCTION_ARGS) PG_RETURN_INT64(result); } +Datum +pg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + int64 result; + PgStat_StatTabEntry *tabentry; + + if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL) + result = 0; + else + result = tabentry->n_index_scans; + + PG_RETURN_INT64(result); +} Datum pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS) diff --git a/src/include/catalog/pg_proc.dat 
b/src/include/catalog/pg_proc.dat index a77b293723..d061cd2d8d 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5331,6 +5331,10 @@ proname => 'pg_stat_get_autoanalyze_count', provolatile => 's', proparallel => 'r', prorettype => 'int8', proargtypes => 'oid', prosrc => 'pg_stat_get_autoanalyze_count' }, +{ oid => '3813', descr => 'statistics: last vacuum index scans for a table', + proname => 'pg_stat_get_last_vacuum_index_scans', provolatile => 's', + proparallel => 'r', prorettype => 'int8', proargtypes => 'oid', + prosrc => 'pg_stat_get_last_vacuum_index_scans' }, { oid => '1936', descr => 'statistics: currently active backend IDs', proname =>
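The new SQL-callable function follows the usual pgstat accessor pattern: if no statistics entry exists for the relation, it returns 0 rather than raising an error. A minimal sketch of that pattern follows; the types and OIDs here are mock stand-ins, not the real PgStat_StatTabEntry API.

```c
#include <stddef.h>

/* Simplified stand-ins for the patch's per-table stats lookup. */
typedef struct {
    unsigned int relid;
    long n_index_scans;
} TabEntry;

static TabEntry table_stats[] = {
    { 16384, 3 },   /* a table whose last vacuum needed 3 index scans */
    { 16385, 0 },
};

static const TabEntry *fetch_stat_tabentry(unsigned int relid) {
    for (size_t i = 0; i < sizeof(table_stats) / sizeof(table_stats[0]); i++)
        if (table_stats[i].relid == relid)
            return &table_stats[i];
    return NULL;    /* no stats collected for this relation yet */
}

/* Same shape as the patch's pg_stat_get_last_vacuum_index_scans():
 * a missing stats entry reads as zero instead of an error. */
static long last_vacuum_index_scans(unsigned int relid) {
    const TabEntry *e = fetch_stat_tabentry(relid);
    return e ? e->n_index_scans : 0;
}
```

This mirrors the other pg_stat_get_* functions in pgstatfuncs.c, which keeps the view usable for relations that have never been vacuumed.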
Re: Emit extra debug message when executing extension script.
On 2022-Jun-29, Robert Haas wrote: > On Wed, Jun 29, 2022 at 9:26 AM Peter Eisentraut > wrote: > > On 28.06.22 21:10, Jeff Davis wrote: > > > + ereport(DEBUG1, errmsg("executing extension script: %s", filename)); > > > > This should either be elog or use errmsg_internal. > > Why? The reason is that errmsg() marks the message for translation, and we don't want to burden translators with messages that are of little interest to most users. Using either elog() or errmsg_internal() avoids that. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "El sudor es la mejor cura para un pensamiento enfermo" (Bardia)
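The distinction Álvaro describes can be illustrated with a toy model: only strings passed through the translatable channel end up in the message catalog, which is why low-interest debug messages should use elog() or errmsg_internal(). This mock is purely illustrative; PostgreSQL's real mechanism works via gettext string extraction at build time, and the names below are made up.

```c
#include <string.h>

/* Toy model of translation marking: the "catalog" collects only
 * messages emitted through the translatable channel, mirroring how
 * xgettext extracts strings wrapped by errmsg() but not those passed
 * to errmsg_internal() or elog(). */
#define MAX_CATALOG 16
static const char *catalog[MAX_CATALOG];
static int catalog_len;

static const char *errmsg_mock(const char *fmt) {
    if (catalog_len < MAX_CATALOG)
        catalog[catalog_len++] = fmt;   /* marked for translators */
    return fmt;
}

static const char *errmsg_internal_mock(const char *fmt) {
    return fmt;                         /* emitted, but never translated */
}

static int in_catalog(const char *fmt) {
    for (int i = 0; i < catalog_len; i++)
        if (strcmp(catalog[i], fmt) == 0)
            return 1;
    return 0;
}
```

The point of the thread is simply that the DEBUG1 message should travel the second path, sparing translators the work.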
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
On 2022-Jul-04, Kyotaro Horiguchi wrote: > At Wed, 29 Jun 2022 14:09:17 +0200, Alvaro Herrera > wrote in > > However, another problem case, not fixed by PIPELINE_IDLE, occurs if you > > exit pipeline mode after PQsendQuery() and then immediately use > > PQexec(). The CloseComplete will be received at the wrong time, and a > > notice is emitted nevertheless. > > Mmm. My patch moves the point of failure of the scenario a bit but > still a little short. However, to my understanding, it seems to be the > task of the PQpipelineSync()-PQgetResult() pair to consume the > CloseComplete. If I inserted PQpipelineSync() just after PQsendQuery() > and called PQgetResult() for PGRES_PIPELINE_SYNC before > PQexitPipelineMode(), the out-of-sync CloseComplete is not seen in the > scenario. But if it is right, I'd like to complain about the > obscure-but-stiff protocol of pipeline mode.. Yeah, if you introduce PQpipelineSync then I think it'll work okay, but my point here was to make it work without requiring that; that's why I wrote the test to use PQsendFlushRequest instead. BTW, I have a patch for the problem with uniqviol also (not fixed by v7). I'll send an updated patch in a little while. > > I produced pipeline_idle.trace file by running the test in the fully > > But the perl script doesn't produce the trace file since the list in > the $cmptrace line doesn't contain pipeline_idle.. Ouch, of course, thanks for noticing. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
RE: Handle infinite recursion in logical replication setup
On Mon, Jul 4, 2022 4:17 PM shiy.f...@fujitsu.com wrote: > > On Sun, Jul 3, 2022 11:00 PM vignesh C wrote: > > > > Thanks for the comments, the attached v27 patch has the changes for the > > same. > > > > Thanks for updating the patch. > > A comment on 0003 patch: > > I think we should call ExecClearTuple() before getting the next tuple, so it > should > be called if the table is in ready state. How about modifying it to: > By the way, I have tested the pg_dump/psql changes in this patch with older versions (server: pg10 ~ pg15, pg_dump/psql: pg16), and it worked ok. Regards, Shi yu
Re: Re-order "disable_on_error" in tab-complete COMPLETE_WITH
On Mon, Jul 4, 2022 at 1:07 PM Peter Smith wrote: > > By convention, the tab-complete logical replication subscription > parameters are listed in the COMPLETE_WITH lists in alphabetical > order, but when the "disable_on_error" parameter was added this was > not done. > Yeah, it seems we have overlooked this point. I think we can do this just for HEAD but as the feature is introduced in PG-15 so there is no harm in pushing it to PG-15 as well especially because it is a straightforward change. What do you or others think? -- With Regards, Amit Kapila.
Re: pgsql: dshash: Add sequential scan support.
At Mon, 4 Jul 2022 14:55:43 +1200, Thomas Munro wrote in > [Re-directing to -hackers] > > On Fri, Mar 11, 2022 at 2:27 PM Andres Freund wrote: > > It's per-backend state at least and just used for assertions. We could > > remove > > it. Or stop checking it in places where it could be set wrongly: > > dshash_find() > > and dshash_detach() couldn't check anymore, but the rest of the assertions > > would still be valid afaics? > > Yeah, it's all for assertions... let's just remove it. Those > assertions were useful to me at some stage in development but won't > hold as well as I thought, at least without widespread PG_FINALLY(), > which wouldn't be nice. > > *dsa_get_address() might need to adjust the memory map with system > calls, which might fail. If you think of DSA as not only an allocator > but also a poor man's user level virtual memory scheme to tide us over > until we get threads, then this is a pretty low level kind of > should-not-happen failure that is analogous on some level to SIGBUS or > SIGSEGV or something like that, and we should PANIC. Then we could > claim that dsa_get_address() is no-throw. At least, that was one > argument I had with myself while investigating that strange Solaris > shm_open() failure, but ... I lost the argument. It's quite an > extreme position to take just to support these assertions, which are > of pretty limited value. > > [1] > https://www.postgresql.org/message-id/20220701232009.jcwxpl45bptaxv5n%40alap3.anarazel.de FWIW, the discussion above is convincing to me and the patch looks good. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
Thanks for the further testing scenario. At Wed, 29 Jun 2022 14:09:17 +0200, Alvaro Herrera wrote in > So I wrote some more test scenarios for this, and as I wrote in some > other thread, I realized that there are more problems than just some > NOTICE trouble. For instance, if you send a query, then read the result > but not the terminating NULL then send another query, everything gets > confused; the next thing you'll read is the result for the second query, > without having read the NULL that terminates the results of the first > query. Any application that expects the usual flow of results will be > confused. Kyotaro's patch to add PIPELINE_IDLE fixes this bit too, as > far as I can tell. > > However, another problem case, not fixed by PIPELINE_IDLE, occurs if you > exit pipeline mode after PQsendQuery() and then immediately use > PQexec(). The CloseComplete will be received at the wrong time, and a > notice is emitted nevertheless. Mmm. My patch moves the point of failure of the scenario a bit but still a little short. However, to my understanding, it seems to be the task of the PQpipelineSync()-PQgetResult() pair to consume the CloseComplete. If I inserted PQpipelineSync() just after PQsendQuery() and called PQgetResult() for PGRES_PIPELINE_SYNC before PQexitPipelineMode(), the out-of-sync CloseComplete is not seen in the scenario. But if it is right, I'd like to complain about the obscure-but-stiff protocol of pipeline mode..
the client is never expecting that PQgetResult() would return > anything for the CloseComplete, so something needs to consume the > CloseComplete silently (including the queue entry for it) when it is > received; I chose to do this directly in pqParseInput3. I tried to > make PQgetResult itself do it, but it became a pile of hacks until I > was no longer sure what was going on. Putting it in fe-protocol3.c > ends up a lot cleaner. However, we still need PQgetResult to invoke > parseInput() at the point where Close is expected. > > 2. if an error occurs while executing the query, the CloseComplete will > of course never arrive, so we need to erase it from the queue > silently if we're returning an error. > > I toyed with the idea of having parseInput() produce a PGresult that > encodes the Close message, and have PQgetResult consume and discard > that, then read some further message to have something to return. But > it seemed inefficient and equally ugly and I'm not sure that flow > control is any simpler. > > I think another possibility would be to make PQexitPipelineMode > responsible for /something/, but I'm not sure what that would be. > Checking the queue and seeing if the next message is CloseComplete, then > eating that message before exiting pipeline mode? That seems way too > magical. I didn't attempt this. > > I attach a patch series that implements the proposed fix (again for > REL_14_STABLE) in steps, to make it easy to read. I intend to squash > the whole lot into a single commit before pushing. But while writing > this email it occurred to me that I need to add at least one more test, > to receive a WARNING while waiting for CloseComplete. AFAICT it should > work, but better make sure. > > I produced pipeline_idle.trace file by running the test in the fully But the perl script doesn't produce the trace file since the list in the $cmptrace line doesn't contain pipeline_idle..
> fixed tree, then used it to verify that all tests fail in different ways > when run without the fixes. The first fix with PIPELINE_IDLE fixes some > of these failures, and the PGQUERY_CLOSE one fixes the remaining one. > Reading the trace file, it looks correct to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
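The PIPELINE_IDLE idea discussed in this thread boils down to a three-state distinction: a connection that has drained all results while still inside a pipeline is not truly idle, so a late CloseComplete arriving there can be swallowed silently instead of producing a notice. A hedged sketch of that state machine follows; the names are hypothetical, not libpq's actual PGASYNC_* handling.

```c
/* Illustrative mock of the IDLE vs. PIPELINE_IDLE distinction. */
typedef enum { ST_IDLE, ST_BUSY, ST_PIPELINE_IDLE } AsyncStatus;

typedef struct {
    AsyncStatus status;
    int pipeline_mode;
    int notices;    /* "unexpected message on idle connection" notices */
} Conn;

/* All results of the current query have been consumed. */
static void results_done(Conn *c) {
    c->status = c->pipeline_mode ? ST_PIPELINE_IDLE : ST_IDLE;
}

/* A CloseComplete arrives from the server. */
static void recv_close_complete(Conn *c) {
    if (c->status == ST_PIPELINE_IDLE)
        ;                /* expected leftover of PQsendQuery: eat silently */
    else if (c->status == ST_IDLE)
        c->notices++;    /* out-of-band message on a truly idle connection */
}
```

Without the third state, the pipeline case would collapse into ST_IDLE and every trailing CloseComplete would look like protocol garbage, which is the symptom the thread starts from.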
Re: Allowing REINDEX to have an optional name
On 2022-Jun-29, Simon Riggs wrote: > > -if (strcmp(objectName, get_database_name(objectOid)) != 0) > > +if (objectName && strcmp(objectName, get_database_name(objectOid)) > > != 0) > > ereport(ERROR, > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > errmsg("can only reindex the currently open > > database"))); > > if (!pg_database_ownercheck(objectOid, GetUserId())) > > aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_DATABASE, > > - objectName); > > + get_database_name(objectOid)); > > > > This could call get_database_name() just once. > > It could, but I couldn't see any benefit in changing that for the code > under discussion. > > If calling get_database_name() multiple times is an issue, I've added > a cache for that - another patch attached, if you think its worth it. TBH I doubt that this is an issue: since we're throwing an error anyway, the memory would be released, and error cases are not considered worth of performance optimization anyway. Putting that thought aside, if we were to think that this is an issue, I don't think the cache as implemented here is a good idea, because then caller is responsible for tracking whether to free or not the return value. I think that Michaël's idea could be implemented more easily by having a local variable that receives the return value from get_database_name. But I think the coding as Simon had it was all right. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Las navajas y los monos deben estar siempre distantes" (Germán Poo)
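Michaël's suggestion of calling get_database_name() just once can be shown with a small mock that counts lookups: fetch the name into a local variable and reuse it for both the comparison and the error path, with no caller-visible cache needed. The function names below are stand-ins, not PostgreSQL's real catalog API.

```c
#include <string.h>

static int lookups;    /* counts simulated syscache lookups */

static const char *get_database_name_mock(unsigned int oid) {
    lookups++;          /* stands in for a catalog/syscache lookup */
    return oid == 5 ? "postgres" : "other";
}

/* The reviewed coding pattern: fetch the name once into a local and
 * use it both for the comparison and for the (would-be) error report,
 * instead of calling the lookup function twice. */
static int check_db(unsigned int oid, const char *objectName) {
    const char *dbname = get_database_name_mock(oid);

    if (objectName && strcmp(objectName, dbname) != 0)
        return -1;      /* would ereport(ERROR, ...) in PostgreSQL */
    return 0;           /* dbname still at hand for aclcheck_error() */
}
```

Since the error path throws anyway, the performance argument is weak, as Álvaro notes; the local-variable form is mostly about avoiding a cache whose ownership the caller would have to track.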
RE: Handle infinite recursion in logical replication setup
On Sun, Jul 3, 2022 11:00 PM vignesh C wrote:
>
> Thanks for the comments, the attached v27 patch has the changes for the
> same.
>

Thanks for updating the patch. A comment on 0003 patch:

+	/*
+	 * No need to throw an error for the tables that are in ready state,
+	 * as the walsender will send the changes from WAL in case of tables
+	 * in ready state.
+	 */
+	if (isreadytable)
+		continue;
+	...
+	ereport(ERROR,
+			errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+			errmsg("table: \"%s.%s\" might have replicated data in the publisher",
+				   nspname, relname),
+			errdetail("CREATE/ALTER SUBSCRIPTION with origin = local and copy_data = on is not allowed when the publisher might have replicated data."),
+			errhint("Use CREATE/ALTER SUBSCRIPTION with copy_data = off/force."));
+
+	ExecClearTuple(slot);

I think we should call ExecClearTuple() before getting the next tuple, so it should be called if the table is in ready state. How about modifying it to:

if (!isreadytable)
	ereport(ERROR,
			errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
			errmsg("table: \"%s.%s\" might have replicated data in the publisher",
				   nspname, relname),
			errdetail("CREATE/ALTER SUBSCRIPTION with origin = local and copy_data = on is not allowed when the publisher might have replicated data."),
			errhint("Use CREATE/ALTER SUBSCRIPTION with copy_data = off/force."));

ExecClearTuple(slot);

Regards, Shi yu
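The review point generalizes to a common loop-hygiene pattern: per-iteration cleanup (here ExecClearTuple()) must be reached on every path through the loop, including the early-continue one, which the suggested rewrite achieves by inverting the condition. A minimal sketch of the pattern follows; in this mock the error is merely counted so the loop can proceed, whereas in PostgreSQL ereport(ERROR) would abort the transaction.

```c
/* Toy version of the review's fix: invert the condition and run the
 * per-iteration cleanup unconditionally at the bottom of the loop,
 * so the "ready" (formerly 'continue') path is cleaned up too. */
typedef struct {
    int ready;      /* table in ready state? */
    int cleared;    /* did ExecClearTuple()-style cleanup run? */
} Slot;

static int process(Slot *slots, int n) {
    int errors = 0;

    for (int i = 0; i < n; i++) {
        if (!slots[i].ready)
            errors++;           /* stands in for ereport(ERROR, ...) */
        slots[i].cleared = 1;   /* cleanup reached on every path */
    }
    return errors;
}
```

With the original `if (ready) continue;` shape, the cleanup line would be skipped for every ready table, which is exactly the leak the reviewer is pointing at.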
Re-order "disable_on_error" in tab-complete COMPLETE_WITH
Hi. By convention, the tab-complete logical replication subscription parameters are listed in the COMPLETE_WITH lists in alphabetical order, but when the "disable_on_error" parameter was added this was not done. This patch just tidies that up; there is no functional change. PSA -- Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Re-order-disable_on_error-in-tab-complete-COMPLET.patch
Re: Prevent writes on large objects in read-only transactions
On Mon, 4 Jul 2022 15:51:32 +0900 Michael Paquier wrote: > On Wed, Jun 29, 2022 at 05:29:50PM +0900, Yugo NAGATA wrote: > > Thank you for reviewing the patch. I attached the updated patch. > > Thanks for the new version. I have looked at that again, and the set > of changes seem fine (including the change for the alternate output). > So, applied. Thanks! -- Yugo NAGATA
Re: Allowing REINDEX to have an optional name
On Sun, Jul 03, 2022 at 05:41:31PM -0400, Tom Lane wrote: > This is marked as Ready for Committer, but that seems unduly > optimistic. Please note that patch authors should not mark a patch as RfC by themselves. This is something that a reviewer should do. > The cfbot shows that it's failing on all platforms --- > and not in the same way on each, suggesting there are multiple > problems. A wild guess is that this comes from the patch that manipulates get_database_name(), something that there is no need for as long as the routine is called once in ReindexMultipleTables(). -- Michael
Re: Export log_line_prefix(); useful for emit_log_hook.
On Wed, Jun 29, 2022 at 03:09:42PM +0200, Alvaro Herrera wrote: > Hmm, maybe your hypothetical hook would prefer to use a different > setting for log line prefix than Log_line_prefix, so it would make sense > to pass the format string as a parameter to the function instead of > relying on the GUC global. +1. -- Michael
Re: Prevent writes on large objects in read-only transactions
On Wed, Jun 29, 2022 at 05:29:50PM +0900, Yugo NAGATA wrote: > Thank you for reviewing the patch. I attached the updated patch. Thanks for the new version. I have looked at that again, and the set of changes seems fine (including the change for the alternate output). So, applied. -- Michael
Re: Perform streaming logical transactions by background workers and parallel apply
Below are some review comments for patch v14-0004: v14-0004 4.0 General. This comment is an after-thought but as I write this mail I am wondering if most of this 0004 patch is even necessary at all? Instead of introducing a new column and all the baggage that goes with it, can't the same functionality be achieved just by toggling the streaming mode 'substream' value from 'p' (parallel) to 't' (on) whenever an error occurs causing a retry? Anyway, if you do change it this way then most of the following comments can be disregarded. == 4.1 Commit message Patch needs an explanatory commit message. Currently, there is nothing. == 4.2 doc/src/sgml/catalogs.sgml + + + subretry bool + + + If true, the subscription will not try to apply streaming transaction + in parallel mode. See +for more information. + + I think it is overkill to mention anything about the streaming=parallel here because IIUC it is nothing to do with this field at all. I thought you really only need to say something brief like: SUGGESTION: True if the previous apply change failed and a retry was required. == 4.3 doc/src/sgml/ref/create_subscription.sgml @@ -244,6 +244,10 @@ CREATE SUBSCRIPTION subscription_nameon + mode. I did not think it is good to say "we" in the docs. SUGGESTION When applying a streaming transaction, if either requirement is not met, the background worker will exit with an error. Parallel mode is disregarded when retrying; instead the transaction will be applied using streaming = on. == 4.4 .../replication/logical/applybgworker.c + /* + * We don't start new background worker if retry was set as it's possible + * that the last time we tried to apply a transaction in background worker + * and the check failed (see function apply_bgworker_relation_check). So + * we will try to apply this transaction in apply worker. 
+ */ SUGGESTION (simplified, and remove "we") Don't use apply background workers for retries, because it is possible that the last time we tried to apply a transaction using an apply background worker the checks failed (see function apply_bgworker_relation_check). ~~~ 4.5 + elog(DEBUG1, "retry to apply an streaming transaction in apply " + "background worker"); IMO the log message is too confusing SUGGESTION "apply background workers are not used for retries" == 4.6 src/backend/replication/logical/worker.c 4.6.a - apply_handle_commit + /* Set the flag that we will not retry later. */ + set_subscription_retry(false); But the comment is wrong, isn't it? Shouldn’t it just say that we are not *currently* retrying? And can’t this just anyway be redundant if only the catalog column has a DEFAULT value of false? 4.6.b - apply_handle_prepare Ditto 4.6.c - apply_handle_commit_prepared Ditto 4.6.d - apply_handle_rollback_prepared Ditto 4.6.e - apply_handle_stream_prepare Ditto 4.6.f - apply_handle_stream_abort Ditto 4.6.g - apply_handle_stream_commit Ditto ~~~ 4.7 src/backend/replication/logical/worker.c 4.7.a - start_table_sync @@ -3894,6 +3917,9 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname) } PG_CATCH(); { + /* Set the flag that we will retry later. */ + set_subscription_retry(true); Maybe this should say more like "Flag that the next apply will be the result of a retry" 4.7.b - start_apply Ditto ~~~ 4.8 src/backend/replication/logical/worker.c - set_subscription_retry + +/* + * Set subretry of pg_subscription catalog. + * + * If retry is true, subscriber is about to exit with an error. Otherwise, it + * means that the changes was applied successfully. + */ +static void +set_subscription_retry(bool retry) "changes" -> "change" ? ~~~ 4.8 src/backend/replication/logical/worker.c - set_subscription_retry Isn't this flag only ever used when streaming=parallel?
But it does not seem to be checking that anywhere before potentially executing all this code, which maybe will never be used. == 4.9 src/include/catalog/pg_subscription.h @@ -76,6 +76,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW bool subdisableonerr; /* True if a worker error should cause the * subscription to be disabled */ + bool subretry; /* True if the previous apply change failed. */ I was wondering if you can give this column a DEFAULT value of false, because then perhaps most of the patch code from worker.c may be able to be eliminated. == 4.10 .../subscription/t/032_streaming_apply.pl I felt that the test cases all seem to blend together. IMO it will be more readable if the main text parts are visually separated e.g using a comment like: # = -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi Alvaro, hi David, Thanks for reviewing this and the interesting examples! Wanted to give a bit of extra insight as to why I'd love to have a system that can lazily emit JIT code and hence creates roughly a module per function: In the end I'm hoping that we can migrate to a system where we only JIT after a configurable cost has been exceeded for this node, as well as a configurable amount of rows has actually been processed. Reason is that this would safeguard against some problematic planning issues wrt JIT (node not being executed, row count being massively off). It would also allow for more finegrained control, with a cost system similar to most other planning costs, as they are also per node and not globally, and would potentially allow us to only JIT things where we expect to truly gain any runtime compared to the costs of doing it. If this means we have to invest more in making it cheap(er) to emit modules, I'm all for that. Kudos to David for fixing the caching in that sense :) @Andres if there's any other things we ought to fix to make this cheap (enough) compared to the previous code I'd love to know your thoughts. Best, Luc Vlaming (ServiceNow) From: David Geier Sent: Wednesday, June 29, 2022 11:03 AM To: Alvaro Herrera Cc: Luc Vlaming ; Andres Freund ; PostgreSQL-development Subject: Re: Lazy JIT IR code generation to increase JIT speed with partitions [External Email] Hi Alvaro, That's a very interesting case and might indeed be fixed or at least improved by this patch. I tried to reproduce this, but at least when running a simple, serial query with increasing numbers of functions, the time spent per function is linear or even slightly sub-linear (same as Tom observed in [1]). I also couldn't reproduce the JIT runtimes you shared, when running the attached catalog query. 
The catalog query ran serially for me with the following JIT stats: JIT: Functions: 169 Options: Inlining true, Optimization true, Expressions true, Deforming true Timing: Generation 12.223 ms, Inlining 17.323 ms, Optimization 388.491 ms, Emission 283.464 ms, Total 701.501 ms Is it possible that the query ran in parallel for you? For parallel queries, every worker JITs all of the functions it uses. Even though the workers might JIT the functions in parallel, the time reported in the EXPLAIN ANALYZE output is the sum of the time spent by all workers. With this patch applied, the JIT time drops significantly, as many of the generated functions remain unused. JIT: Modules: 15 Functions: 26 Options: Inlining true, Optimization true, Expressions true, Deforming true Timing: Generation 1.931 ms, Inlining 0.722 ms, Optimization 67.195 ms, Emission 70.347 ms, Total 140.195 ms Of course, this does not prove that the nonlinearity that you observed went away. Could you share with me how you ran the query so that I can reproduce your numbers on master to then compare them with the patched version? Also, which LLVM version did you run with? I'm currently running with LLVM 13. Thanks! -- David Geier (ServiceNow) On Mon, Jun 27, 2022 at 5:37 PM Alvaro Herrera wrote: On 2021-Jan-18, Luc Vlaming wrote: > I would like this topic to somehow progress and was wondering what other > benchmarks / tests would be needed to have some progress? I've so far > provided benchmarks for small(ish) queries and some tpch numbers, assuming > those would be enough. 
Hi, some time ago I reported a case[1] where our JIT implementation does a very poor job and perhaps the changes that you're making could explain what is going on, and maybe even fix it: [1] https://postgr.es/m/20241706.wqq7xoyigwa2@alvherre.pgsql The query for which I investigated the problem involved some pg_logical metadata tables, so I didn't post it anywhere public; but the blog post I found later contains a link to a query that shows the same symptoms, and which is luckily still available online: https://gist.github.com/saicitus/251ba20b211e9e73285af35e61b19580 I attach it here in case it goes missing sometime. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
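The lazy-JIT policy Luc describes (compile a node's expressions only after both a per-node cost threshold and a processed-row threshold have been exceeded, then reuse the compiled code) can be sketched as a small gate. The names and threshold values below are hypothetical, not proposed GUCs or actual executor structures.

```c
/* Hypothetical per-node lazy-JIT gate: interpret the expression until
 * the node both carries enough planner cost and has actually processed
 * enough rows, then "compile" once and reuse the compiled form. */
typedef struct {
    double cost;        /* planner cost attributed to this node */
    long   rows_seen;   /* rows actually processed so far */
    int    compiled;    /* 0 = interpreted, 1 = JIT-compiled */
} NodeJitState;

#define JIT_COST_THRESHOLD  100.0
#define JIT_ROWS_THRESHOLD  1000

/* Called once per row; returns whether the compiled path is in use. */
static int maybe_compile(NodeJitState *s) {
    s->rows_seen++;
    if (!s->compiled &&
        s->cost > JIT_COST_THRESHOLD &&
        s->rows_seen >= JIT_ROWS_THRESHOLD)
        s->compiled = 1;    /* emit this node's module here, lazily */
    return s->compiled;
}
```

The key property, and the reason lazy per-node module emission matters, is that a node the planner never executes, or whose row estimate was wildly high, never pays the compilation cost at all.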