Re: AIX support - alignment issues
Andres Freund writes:
> I just thought an easier way - why don't we introduce a 'catalog_double'
> that's defined to be pg_attribute_aligned(whatever-we-need) on AIX? Then we
> can get rid of the manually enforced alignedness and we don't need to contort
> catalog order.

Hm, do all the AIX compilers we care about have support for that?
If so, it seems like a great idea.

			regards, tom lane
Re: [PATCH] Add result_types column to pg_prepared_statements view
On 01.07.22 14:27, Dagfinn Ilmari Mannsåker wrote:
> Peter Eisentraut writes:
> > On 19.05.22 17:34, Dagfinn Ilmari Mannsåker wrote:
> > > Prompted by a question on IRC, here's a patch to add a result_types
> > > column to the pg_prepared_statements view, so that one can see the
> > > types of the columns returned by a prepared statement, not just the
> > > parameter types. I'm not quite sure about the column name,
> > > suggestions welcome.
> >
> > I think this patch is sensible. The arguments about client-side
> > type-specific value handling for RowDescription don't apply here IMO,
> > since this view is more user-facing.
>
> I agree. It's also easy to change if needed.

Committed as is.
Re: AIX support - alignment issues
Hi,

On 2022-07-02 11:33:54 -0700, Andres Freund wrote:
> If we decide we want to continue supporting AIX we should bite the bullet and
> add a 64bit-int TYPALIGN_*. It might be worth to translate that to bytes when
> building tupledescs, so we don't need more branches (reducing them compared to
> today).

I just thought an easier way - why don't we introduce a 'catalog_double'
that's defined to be pg_attribute_aligned(whatever-we-need) on AIX? Then we
can get rid of the manually enforced alignedness and we don't need to contort
catalog order.

Greetings,

Andres Freund
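The proposal above can be sketched standalone. This is hypothetical illustration code, not from any patch: the "catalog_double" name comes from the mail, pg_attribute_aligned() is shown expanded to the GCC-style attribute it wraps, and the struct is a toy example of the effect the typedef would have where plain double is only 4-byte aligned inside structs (as on AIX).

```c
#include <stdalign.h>
#include <stddef.h>

/*
 * Hypothetical "catalog_double": a double forced to 8-byte alignment.
 * PostgreSQL's pg_attribute_aligned(8) expands to the attribute below on
 * GCC-style compilers; the open question in the thread is whether the AIX
 * compilers of interest support it too.
 */
typedef double catalog_double __attribute__((aligned(8)));

/*
 * Toy catalog row: under AIX power alignment rules a plain double member
 * could land at offset 4 after the int; the aligned typedef makes the
 * 8-byte boundary explicit instead of relying on manual column ordering.
 */
typedef struct
{
	int			id;
	catalog_double val;
} toy_catalog_row;
```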
Re: AIX support - alignment issues
Thomas Munro writes:
> On Mon, Jul 4, 2022 at 12:08 PM Tom Lane wrote:
> > I would not stand in the way of dropping HP-UX and IA64 support as of
> > v16. (I do still feel that HPPA is of interest, to keep us honest
> > about spinlock support --- but that dual-stack arrangement that IA64
> > uses is surely not part of anyone's future.)
>
> I tried to find everything relating to HP-UX, aCC, ia64 and hppa. Or
> do you still want to keep the hppa bits for NetBSD (I wasn't sure if
> your threat to set up a NetBSD/hppa system was affected by the
> hardware failure you mentioned)?

No, the hardware failure is that the machine's SCSI controller seems
to be fried, thus internal drives no longer accessible. I have a working
NetBSD-current installation on an external USB drive, and plan to
commission it as a buildfarm animal once NetBSD 10 is officially branched.
It'll be a frankencritter of the first order, because USB didn't exist
when the machine was built, but hey...

			regards, tom lane
Re: AIX support - alignment issues
On Mon, Jul 4, 2022 at 12:08 PM Tom Lane wrote:
> I would not stand in the way of dropping HP-UX and IA64 support as of
> v16. (I do still feel that HPPA is of interest, to keep us honest
> about spinlock support --- but that dual-stack arrangement that IA64
> uses is surely not part of anyone's future.)

I tried to find everything relating to HP-UX, aCC, ia64 and hppa. Or
do you still want to keep the hppa bits for NetBSD (I wasn't sure if
your threat to set up a NetBSD/hppa system was affected by the
hardware failure you mentioned)? Or just leave it in there in orphaned
hall-of-fame state, like m68k, m88k, Vax?

From 78d5a122fef28e2dbfd307a584dbda5e830734cb Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 4 Jul 2022 16:24:16 +1200
Subject: [PATCH] Remove HP-UX, aCC, ia64 and hppa support.

HP-UX hardware is no longer produced. This removes support for:

HP-UX, the operating system.
HP aCC, the compiler.
HP/Intel Itanium (ia64), the CPU architecture discontinued in 2021.
HP PA-RISC (hppa), the CPU architecture discontinued in 2008.
--- configure | 99 +-- configure.ac | 8 -- contrib/pgcrypto/crypt-blowfish.c | 2 +- doc/src/sgml/dfunc.sgml| 29 - doc/src/sgml/installation.sgml | 3 +- doc/src/sgml/regress.sgml | 6 +- doc/src/sgml/runtime.sgml | 19 --- src/Makefile.shlib | 33 - src/backend/libpq/ifaddr.c | 12 +- src/backend/port/hpux/tas.c.template | 40 --- src/backend/port/tas/hpux_hppa.s | 28 - src/backend/tcop/postgres.c| 74 src/backend/utils/misc/ps_status.c | 17 --- src/common/sprompt.c | 2 +- src/include/miscadmin.h| 8 -- src/include/pg_config.h.in | 6 - src/include/port/atomics.h | 6 - src/include/port/atomics/arch-hppa.h | 17 --- src/include/port/atomics/arch-ia64.h | 29 - src/include/port/atomics/fallback.h| 12 -- src/include/port/atomics/generic-acc.h | 106 src/include/port/hpux.h| 3 - src/include/storage/s_lock.h | 160 - src/makefiles/Makefile.hpux| 47 src/pl/plperl/ppport.h | 2 +- src/port/dlopen.c | 50 +--- src/port/getrusage.c | 1 - src/template/hpux | 34 -- src/test/regress/resultmap | 1 - src/tools/msvc/Solution.pm | 2 - src/tools/pginclude/cpluspluscheck | 3 - src/tools/pginclude/headerscheck | 3 - 32 files changed, 13 insertions(+), 849 deletions(-) delete mode 100644 src/backend/port/hpux/tas.c.template delete mode 100644 src/backend/port/tas/hpux_hppa.s delete mode 100644 src/include/port/atomics/arch-hppa.h delete mode 100644 src/include/port/atomics/arch-ia64.h delete mode 100644 src/include/port/atomics/generic-acc.h delete mode 100644 src/include/port/hpux.h delete mode 100644 src/makefiles/Makefile.hpux delete mode 100644 src/template/hpux diff --git a/configure b/configure index fb07cd27d9..91b7b185f9 100755 --- a/configure +++ b/configure @@ -2994,7 +2994,6 @@ case $host_os in darwin*) template=darwin ;; dragonfly*) template=netbsd ;; freebsd*) template=freebsd ;; -hpux*) template=hpux ;; linux*|gnu*|k*bsd*-gnu) template=linux ;; mingw*) template=win32 ;; @@ -6856,100 +6855,6 @@ if test x"$pgac_cv_prog_CXX_cxxflags__qlonglong" = x"yes"; then fi -elif test "$PORTNAME" = 
"hpux"; then - # On some versions of HP-UX, libm functions do not set errno by default. - # Fix that by using +Olibmerrno if the compiler recognizes it. - -{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports +Olibmerrno, for CFLAGS" >&5 -$as_echo_n "checking whether ${CC} supports +Olibmerrno, for CFLAGS... " >&6; } -if ${pgac_cv_prog_CC_cflags_pOlibmerrno+:} false; then : - $as_echo_n "(cached) " >&6 -else - pgac_save_CFLAGS=$CFLAGS -pgac_save_CC=$CC -CC=${CC} -CFLAGS="${CFLAGS} +Olibmerrno" -ac_save_c_werror_flag=$ac_c_werror_flag -ac_c_werror_flag=yes -cat confdefs.h - <<_ACEOF >conftest.$ac_ext -/* end confdefs.h. */ - -int -main () -{ - - ; - return 0; -} -_ACEOF -if ac_fn_c_try_compile "$LINENO"; then : - pgac_cv_prog_CC_cflags_pOlibmerrno=yes -else - pgac_cv_prog_CC_cflags_pOlibmerrno=no -fi -rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext -ac_c_werror_flag=$ac_save_c_werror_flag -CFLAGS="$pgac_save_CFLAGS" -CC="$pgac_save_CC" -fi -{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags_pOlibmerrno" >&5 -$as_echo "$pgac_cv_prog_CC_cflags_pOlibmerrno" >&6; } -if test x"$pgac_cv_prog_CC_cflags_pOlibmerrno" = x"yes"; then - CFLAGS="${CFLAGS} +Olibmerrno" -fi - - - { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports +Olibmerrno,
Re: Re-order "disable_on_error" in tab-complete COMPLETE_WITH
On Tue, Jul 5, 2022 at 4:03 AM Peter Smith wrote:
>
> On Mon, Jul 4, 2022 at 10:29 PM Euler Taveira wrote:
> >
> > On Mon, Jul 4, 2022, at 5:37 AM, Amit Kapila wrote:
> >
> > Yeah, it seems we have overlooked this point. I think we can do this
> > just for HEAD but as the feature is introduced in PG-15 so there is no
> > harm in pushing it to PG-15 as well especially because it is a
> > straightforward change. What do you or others think?
> >
> > No objection. It is a good thing for future backpatches.
> >
> Since there is no function change or bugfix here I thought it was only
> applicable for HEAD. This change is almost in the same category as a
> code comment typo patch - do those normally get backpatched? - maybe
> follow the same convention here. OTOH, if you think it may be helpful
> for future backpatches then I am also fine if you wanted to push it to
> PG15.

It can help if there is any bug-fix in the same code path or if some
other code adjustment in the same area is required in the back branch.
I feel the chances of both are less but I just wanted to keep the code
consistent for such a possibility. Anyway, I'll wait for a day or so
and see if anyone has objections to it.

--
With Regards,
Amit Kapila.
Re: CFM Manager
On Tue, Jul 05, 2022 at 08:17:26AM +0500, Ibrar Ahmed wrote:
> If nobody has already volunteered for the next upcoming commitfest.
> I'd like to volunteer. I think early to say is better as always.

Jacob and Daniel have already volunteered. Based on the number of
patches at hand (305 in total), getting more help is always welcome,
I guess.

--
Michael
Linking issue with libldap on morepork (OpenBSD 6.9)
Hi all,

While looking at the buildfarm logs, I have noticed the following thing:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=morepork=2022-07-05%2002%3A45%3A32=recovery-check

postgres:/usr/local/lib/libldap_r.so.13.2: /usr/local/lib/libldap.so.13.2 : WARNING: symbol(ldap_int_global_options) size mismatch, relink your program

That seems to be pretty ancient, not related to aff45c8 as logs from
two months ago also show this warning.

Thanks,
--
Michael
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada wrote:
>
> On Tue, Jun 28, 2022 at 10:10 PM John Naylor wrote:
> >
> > On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada wrote:
> > >
> > > > I suspect other optimizations would be worth a lot more than using AVX2:
> > > > - collapsing inner nodes
> > > > - taking care when constructing the key (more on this when we
> > > > integrate with VACUUM)
> > > > ...and a couple Andres mentioned:
> > > > - memory management: in
> > > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > > - node dispatch:
> > > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > > >
> > > > Therefore, I would suggest that we use SSE2 only, because:
> > > > - portability is very easy
> > > > - to avoid a performance hit from indirecting through a function pointer
> > >
> > > Okay, I'll try these optimizations and see if the performance becomes
> > > better.
> >
> > FWIW, I think it's fine if we delay these until after committing a
> > good-enough version. The exception is key construction and I think
> > that deserves some attention now (more on this below).
>
> Agreed.
>
> > > I've done benchmark tests while changing the node types. The code base
> > > is v3 patch that doesn't have the optimization you mentioned below
> > > (memory management and node dispatch) but I added the code to use SSE2
> > > for node-16 and node-32.
> >
> > Great, this is helpful to visualize what's going on!
> >
> > > * sse2_4_16_48_256
> > >   * nkeys = 9091, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> > >   * nkeys = 2, height = 3, n4 = 2, n16 = 0, n48 = 207, n256 = 50
> > >
> > > * sse2_4_32_128_256
> > >   * nkeys = 9091, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> > >   * nkeys = 2, height = 3, n4 = 2, n32 = 48, n128 = 208, n256 = 1
> > >
> > > Observations are:
> > >
> > > In both test cases, There is not much difference between using AVX2
> > > and SSE2. The more mode types, the more time it takes for loading the
> > > data (see sse2_4_16_32_128_256).
> >
> > Good to know. And as Andres mentioned in his PoC, more node types
> > would be a barrier for pointer tagging, since 32-bit platforms only
> > have two spare bits in the pointer.
> >
> > > In dense case, since most nodes have around 100 children, the radix
> > > tree that has node-128 had a good figure in terms of memory usage. On
> >
> > Looking at the node stats, and then your benchmark code, I think key
> > construction is a major influence, maybe more than node type. The
> > key/value scheme tested now makes sense:
> >
> > blockhi || blocklo || 9 bits of item offset
> >
> > (with the leaf nodes containing a bit map of the lowest few bits of
> > this whole thing)
> >
> > We want the lower fanout nodes at the top of the tree and higher
> > fanout ones at the bottom.
>
> So more inner nodes can fit in CPU cache, right?
>
> > Note some consequences: If the table has enough columns such that much
> > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > dense case the nodes above the leaves will have lower fanout (maybe
> > they will fit in a node32). Also, the bitmap values in the leaves will
> > be more empty. In other words, many tables in the wild *resemble* the
> > sparse case a bit, even if truly all tuples on the page are dead.
> >
> > Note also that the dense case in the benchmark above has ~4500 times
> > more keys than the sparse case, and uses about ~1000 times more
> > memory. But the runtime is only 2-3 times longer. That's interesting
> > to me.
> >
> > To optimize for the sparse case, it seems to me that the key/value would be
> >
> > blockhi || 9 bits of item offset || blocklo
> >
> > I believe that would make the leaf nodes more dense, with fewer inner
> > nodes, and could drastically speed up the sparse case, and maybe many
> > realistic dense cases.
>
> Does it have an effect on the number of inner nodes?
>
> > I'm curious to hear your thoughts.
>
> Thank you for your analysis. It's worth trying. We use 9 bits for item
> offset but most pages don't use all bits in practice. So probably it
> might be better to move the most significant bit of item offset to the
> left of blockhi. Or more simply:
>
> 9 bits of item offset || blockhi || blocklo
>
> > > the other hand, the radix tree that doesn't have node-128 has a better
> > > number in terms of insertion performance. This is probably because we
> > > need to iterate over 'isset' flags from the beginning of the array in
> > > order to find an empty slot when inserting new data. We do the same
> > > thing also for node-48 but it was better than node-128 as it's up to
> > > 48.
> >
> > I mentioned in my diff, but for those following along, I think we can
> > improve that by iterating over the bytes and if it's 0xFF all 8 bits
> > are set already so keep looking...
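For readers following along, the competing key layouts in this exchange can be sketched as bit-packing helpers. This is a standalone illustration with made-up function names, not code from the patch; it assumes a 32-bit block number and item offsets below 512 (i.e. 9 bits), as discussed above.

```c
#include <stdint.h>

/*
 * Current scheme: blockhi || blocklo || 9 bits of item offset.
 * Keys for different offsets on the same block are adjacent, which packs
 * leaf bitmaps tightly when most tuples on a page are dead (dense case).
 */
static inline uint64_t
key_block_major(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << 9) | offset;
}

/*
 * Sawada's simpler variant: 9 bits of item offset || blockhi || blocklo.
 * Keys for the same offset on consecutive blocks are adjacent, which should
 * densify leaves when there are few dead tuples per page across many pages
 * (sparse case).
 */
static inline uint64_t
key_offset_major(uint32_t block, uint16_t offset)
{
	return ((uint64_t) offset << 32) | block;
}
```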
Re: pgsql: dshash: Add sequential scan support.
On Tue, Jul 5, 2022 at 11:25 AM Andres Freund wrote:
> On 2022-07-05 11:20:54 +1200, Thomas Munro wrote:
> > Since there were 6 places with I-hold-no-lock assertions, I shoved the
> > loop into a function so I could do:
> >
> > - Assert(!status->hash_table->find_locked);
> > + assert_no_lock_held_by_me(hash_table);
>
> I am a *bit* wary about the costs of that, even in assert builds - each of the
> partition checks in the loop will in turn need to iterate through
> held_lwlocks. But I guess we can also just later weaken them if it turns out
> to be a problem.

Maybe we should add assertion support for arrays of locks, so we don't
need two levels of loop? Something like the attached?

From d98874ab15fc568b3da8a6395e0a0ef02855a880 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 4 Jul 2022 11:56:02 +1200
Subject: [PATCH v3] Fix lock assertions in dshash.c.

dshash.c maintained flags to track locking, in order to make assertions
that the dshash API was being used correctly. Unfortunately the interaction
with ereport's non-local exits was not thought through carefully enough and
the flag could get out of sync with reality.

Since these flags were primarily used for assertions that no lock was held,
we can get most of the intended effect by making an assertion about the
partition locks directly. To avoid N*M complexity while asserting that we
hold no partition locks, add a new debugging-only interface
LWLockAnyHeldByMe(), which can check if any lock in an array is held
without the need for a loop within a loop.

This problem was noted by Tom and Andres while reviewing changes to support
the new shared memory stats system in release 15, the second user of
dshash.c. On closer inspection, even the earlier typcache.c code was not
guaranteed to meet the asserted conditions, now that it's been highlighted
that even dsa_get_address() can throw (albeit in unlikely circumstances).
Back-patch to 11, where dshash.c arrived.
Reported-by: Tom Lane Reported-by: Andres Freund Reviewed-by: Kyotaro HORIGUCHI Reviewed-by: Zhihong Yu Reviewed-by: Andres Freund Discussion: https://postgr.es/m/20220311012712.botrpsikaufzt...@alap3.anarazel.de --- src/backend/lib/dshash.c | 44 +++ src/backend/storage/lmgr/lwlock.c | 26 ++ src/include/storage/lwlock.h | 1 + 3 files changed, 37 insertions(+), 34 deletions(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index ec454b4d65..c5c032a593 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -110,8 +110,6 @@ struct dshash_table dshash_table_control *control; /* Control object in DSM. */ dsa_pointer *buckets; /* Current bucket pointers in DSM. */ size_t size_log2; /* log2(number of buckets) */ - bool find_locked; /* Is any partition lock held by 'find'? */ - bool find_exclusively_locked; /* ... exclusively? */ }; /* Given a pointer to an item, find the entry (user data) it holds. */ @@ -194,6 +192,10 @@ static inline bool equal_keys(dshash_table *hash_table, #define PARTITION_LOCK(hash_table, i) \ (&(hash_table)->control->partitions[(i)].lock) +#define ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table) \ + Assert(!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock, \ + DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))) + /* * Create a new hash table backed by the given dynamic shared area, with the * given parameters. The returned object is allocated in backend-local memory @@ -234,9 +236,6 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg) } } - hash_table->find_locked = false; - hash_table->find_exclusively_locked = false; - /* * Set up the initial array of buckets. Our initial size is the same as * the number of partitions. 
@@ -285,8 +284,6 @@ dshash_attach(dsa_area *area, const dshash_parameters *params, hash_table->params = *params; hash_table->arg = arg; hash_table->control = dsa_get_address(area, control); - hash_table->find_locked = false; - hash_table->find_exclusively_locked = false; Assert(hash_table->control->magic == DSHASH_MAGIC); /* @@ -309,7 +306,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params, void dshash_detach(dshash_table *hash_table) { - Assert(!hash_table->find_locked); + ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table); /* The hash table may have been destroyed. Just free local memory. */ pfree(hash_table); @@ -400,7 +397,7 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive) partition = PARTITION_FOR_HASH(hash); Assert(hash_table->control->magic == DSHASH_MAGIC); - Assert(!hash_table->find_locked); + ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table); LWLockAcquire(PARTITION_LOCK(hash_table, partition), exclusive ? LW_EXCLUSIVE : LW_SHARED); @@ -418,8 +415,6 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive) else { /* The caller will free
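The single-pass check that the new LWLockAnyHeldByMe() performs can be illustrated with mock types. This is a sketch of the technique only, not PostgreSQL's implementation: given a pointer to the first lock embedded in an array of partition structs, one loop over the held-locks list suffices, using a range test plus a stride check in place of a loop over partitions nested inside a loop over held locks.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int state; } mock_lwlock;

typedef struct
{
	mock_lwlock lock;			/* lock embedded in a larger struct */
	long		counters[4];	/* other per-partition state */
} mock_partition;

#define MAX_HELD 16
static mock_lwlock *held_locks[MAX_HELD];
static int	num_held;

static void
mock_acquire(mock_lwlock *lock)
{
	held_locks[num_held++] = lock;
}

/*
 * Is any held lock one of the 'nlocks' locks embedded at stride 'stride'
 * starting at 'first'?  Single pass over the held-locks list; the modulo
 * test confirms the pointer really is one of the embedded locks and not
 * some other field inside a partition struct.
 */
static bool
any_held_in_array(mock_lwlock *first, int nlocks, size_t stride)
{
	char	   *begin = (char *) first;
	char	   *end = begin + nlocks * stride;

	for (int i = 0; i < num_held; i++)
	{
		char	   *p = (char *) held_locks[i];

		if (p >= begin && p < end &&
			(size_t) (p - begin) % stride == 0)
			return true;
	}
	return false;
}
```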
CFM Manager
If nobody has already volunteered for the next upcoming commitfest,
I'd like to volunteer. As always, I think it is better to say so early.

--
Ibrar Ahmed
Re: Bugs in copyfuncs/equalfuncs support for JSON node types
Justin Pryzby writes:
> Do the missing fields indicate a deficiency in test coverage ?
> _copyJsonTablePlan.pathname and _equalJsonTable.plan.

Yeah, I'd say so, but I think constructing a test case to prove it's
broken might be more trouble than it's worth --- particularly seeing
that we're about to automate this stuff. Because of that, I wouldn't
even be really concerned about these bugs in HEAD; but this needs to
be back-patched into v15.

The existing COPY_PARSE_PLAN_TREES logic purports to test this area,
but it fails to notice these bugs for a few reasons:

* JsonTable.lateral: COPY_PARSE_PLAN_TREES itself failed to detect this
problem because of matching omissions in _copyJsonTable and
_equalJsonTable. But the lack of any follow-on failure implies that we
don't have any test cases where the lateral flag is significant. Maybe
that means the field is useless? This one would be worth a closer look,
perhaps.

* JsonTableColumn.format: this scalar-instead-of-deep-copy bug would
only be detectable if you were able to clobber the original parse tree
after copying. I have no ideas about an easy way to do that. It'd surely
bite somebody in the field someday, but making a reproducible test is
way harder.

* JsonTable.plan: to detect the missed comparison, you'd have to build
a test case where comparing two such trees actually made a visible
difference; which would require a fair amount of thought I fear. IIUC
this node type will only appear down inside jointrees, which we don't
usually do comparisons on.

			regards, tom lane
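The shallow-copy hazard behind the JsonTableColumn.format bug can be shown with mock types. This is an illustration of the failure mode only, not the actual copyfuncs.c macros: a scalar copy of a pointer field aliases the original node, so clobbering the original afterwards silently changes the "copy" too, while a deep copy is unaffected.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Format { int encoding; } Format;
typedef struct Column { Format *format; } Column;

/* COPY_SCALAR_FIELD-style on a pointer member: the copy aliases the
 * original node instead of duplicating it. */
static Column
copy_shallow(const Column *from)
{
	Column		c;

	c.format = from->format;
	return c;
}

/* COPY_NODE_FIELD-style: duplicate the pointed-to node, so later changes
 * to the original cannot leak into the copy. */
static Column
copy_deep(const Column *from)
{
	Column		c;

	c.format = malloc(sizeof(Format));
	memcpy(c.format, from->format, sizeof(Format));
	return c;
}
```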
Re: TAP output format in pg_regress
Hi,

On 2022-07-04 21:56:24 -0400, Tom Lane wrote:
> > For non-parallel tests I think we currently print the test name before running
> > the test, which obviously doesn't work well when needing to print the 'ok'
> > 'not ok' first.
>
> Is this still a consideration? We got rid of serial_schedule some
> time ago.

Not really for the main tests, there's a few serial steps, but not enough
that a bit additional output would be an issue. I think all tests in
contrib are serial though, and some have enough tests that it might be
annoying?

> > I wonder if for parallel tests we should print the test number based on the
> > start of the test rather than the finish time?
>
> I think we need the test number to be stable, so it had better be the
> ordering appearing in the schedule file. But we already print the
> results in that order.

I remembered some asynchronicity, but apparently that's just the
"parallel group" line.

Greetings,

Andres Freund
Re: Bugs in copyfuncs/equalfuncs support for JSON node types
On Mon, Jul 04, 2022 at 09:23:08PM -0400, Tom Lane wrote:
> In reviewing Peter's patch to auto-generate the backend/nodes
> support files, I compared what the patch's script produces to
> what is in the code now. I found several discrepancies in the
> recently-added parse node types for JSON functions, and as far
> as I can see every one of those discrepancies is an error in
> the existing code. Some of them are relatively harmless
> (e.g. COPY_LOCATION_FIELD isn't really different from
> COPY_SCALAR_FIELD), but some of them definitely are live bugs.
> I propose the attached patch.

Do the missing fields indicate a deficiency in test coverage ?
_copyJsonTablePlan.pathname and _equalJsonTable.plan.

--
Justin
Re: TAP output format in pg_regress
Andres Freund writes:
> I think with a bit of care the tap output could be nearly the same
> "quality". It might not be the absolute "purest" tap output, but who cares.

+1

> For non-parallel tests I think we currently print the test name before running
> the test, which obviously doesn't work well when needing to print the 'ok'
> 'not ok' first.

Is this still a consideration? We got rid of serial_schedule some
time ago.

> I wonder if for parallel tests we should print the test number based on the
> start of the test rather than the finish time?

I think we need the test number to be stable, so it had better be the
ordering appearing in the schedule file. But we already print the
results in that order.

			regards, tom lane
Re: Bugs in copyfuncs/equalfuncs support for JSON node types
On Mon, Jul 4, 2022 at 6:23 PM Tom Lane wrote:
> In reviewing Peter's patch to auto-generate the backend/nodes
> support files, I compared what the patch's script produces to
> what is in the code now. I found several discrepancies in the
> recently-added parse node types for JSON functions, and as far
> as I can see every one of those discrepancies is an error in
> the existing code. Some of them are relatively harmless
> (e.g. COPY_LOCATION_FIELD isn't really different from
> COPY_SCALAR_FIELD), but some of them definitely are live bugs.
> I propose the attached patch.
>
> regards, tom lane

Hi,
Patch looks good to me.
Thanks
Bugs in copyfuncs/equalfuncs support for JSON node types
In reviewing Peter's patch to auto-generate the backend/nodes support files, I compared what the patch's script produces to what is in the code now. I found several discrepancies in the recently-added parse node types for JSON functions, and as far as I can see every one of those discrepancies is an error in the existing code. Some of them are relatively harmless (e.g. COPY_LOCATION_FIELD isn't really different from COPY_SCALAR_FIELD), but some of them definitely are live bugs. I propose the attached patch. regards, tom lane diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 51d630fa89..706d283a92 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -2703,7 +2703,8 @@ _copyJsonTable(const JsonTable *from) COPY_NODE_FIELD(plan); COPY_NODE_FIELD(on_error); COPY_NODE_FIELD(alias); - COPY_SCALAR_FIELD(location); + COPY_SCALAR_FIELD(lateral); + COPY_LOCATION_FIELD(location); return newnode; } @@ -2721,13 +2722,13 @@ _copyJsonTableColumn(const JsonTableColumn *from) COPY_NODE_FIELD(typeName); COPY_STRING_FIELD(pathspec); COPY_STRING_FIELD(pathname); - COPY_SCALAR_FIELD(format); + COPY_NODE_FIELD(format); COPY_SCALAR_FIELD(wrapper); COPY_SCALAR_FIELD(omit_quotes); COPY_NODE_FIELD(columns); COPY_NODE_FIELD(on_empty); COPY_NODE_FIELD(on_error); - COPY_SCALAR_FIELD(location); + COPY_LOCATION_FIELD(location); return newnode; } @@ -2742,10 +2743,10 @@ _copyJsonTablePlan(const JsonTablePlan *from) COPY_SCALAR_FIELD(plan_type); COPY_SCALAR_FIELD(join_type); - COPY_STRING_FIELD(pathname); COPY_NODE_FIELD(plan1); COPY_NODE_FIELD(plan2); - COPY_SCALAR_FIELD(location); + COPY_STRING_FIELD(pathname); + COPY_LOCATION_FIELD(location); return newnode; } diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index e747e1667d..fccc0b4a18 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -147,14 +147,29 @@ _equalTableFunc(const TableFunc *a, const TableFunc *b) return true; } 
+static bool +_equalJsonTablePlan(const JsonTablePlan *a, const JsonTablePlan *b) +{ + COMPARE_SCALAR_FIELD(plan_type); + COMPARE_SCALAR_FIELD(join_type); + COMPARE_NODE_FIELD(plan1); + COMPARE_NODE_FIELD(plan2); + COMPARE_STRING_FIELD(pathname); + COMPARE_LOCATION_FIELD(location); + + return true; +} + static bool _equalJsonTable(const JsonTable *a, const JsonTable *b) { COMPARE_NODE_FIELD(common); COMPARE_NODE_FIELD(columns); + COMPARE_NODE_FIELD(plan); COMPARE_NODE_FIELD(on_error); COMPARE_NODE_FIELD(alias); - COMPARE_SCALAR_FIELD(location); + COMPARE_SCALAR_FIELD(lateral); + COMPARE_LOCATION_FIELD(location); return true; } @@ -166,13 +181,14 @@ _equalJsonTableColumn(const JsonTableColumn *a, const JsonTableColumn *b) COMPARE_STRING_FIELD(name); COMPARE_NODE_FIELD(typeName); COMPARE_STRING_FIELD(pathspec); - COMPARE_SCALAR_FIELD(format); + COMPARE_STRING_FIELD(pathname); + COMPARE_NODE_FIELD(format); COMPARE_SCALAR_FIELD(wrapper); COMPARE_SCALAR_FIELD(omit_quotes); COMPARE_NODE_FIELD(columns); COMPARE_NODE_FIELD(on_empty); COMPARE_NODE_FIELD(on_error); - COMPARE_SCALAR_FIELD(location); + COMPARE_LOCATION_FIELD(location); return true; } @@ -4405,6 +4421,9 @@ equal(const void *a, const void *b) case T_JsonArgument: retval = _equalJsonArgument(a, b); break; + case T_JsonTablePlan: + retval = _equalJsonTablePlan(a, b); + break; case T_JsonTable: retval = _equalJsonTable(a, b); break;
Re: avoid multiple hard links to same WAL file after a crash
On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:
> I'd agree with removing all the callers at the end. pgrename() is
> quite robust on Windows, but I'd keep the two checks in
> writeTimeLineHistory(), as the logic around findNewestTimeLine() would
> consider a past TLI history file as in-use even if we have a crash
> just after the file got created in the same path by the same standby,
> and the WAL segment init part. Your patch does that.

As v16 is now open for business, I have revisited this change and
applied 0001 to change all the callers (aka removal of the assertion
for the WAL receiver when it overwrites a TLI history file). The commit
log includes details about the reasoning of all the areas changed, for
clarity, as of the WAL recycling part, the TLI history file part and
basic_archive.

--
Michael
Re: Handle infinite recursion in logical replication setup
I checked again the v26* patch set.

I had no more comments for v26-0001, v26-0002, or v26-0004, but below are
some comments for v26-0003.

v26-0003

3.1 src/backend/catalog/pg_subscription.c

3.1.a

+/*
+ * Get all relations for subscription that are in a ready state.
+ *
+ * Returned list is palloc'ed in current memory context.
+ */
+List *
+GetSubscriptionReadyRelations(Oid subid)

"subscription" -> "the subscription"

Also, it might be better to say in the function header what kind of
structures are in the returned List. E.g. firstly, I'd assumed it was the
same return type as the other function GetSubscriptionRelations, but it
isn't.

3.1.b

I think there might be a case to be made for *combining* those
SubscriptionRelState and SubscriptionRel structs into a single common
struct. Then all those GetSubscriptionReadyRelations and
GetSubscriptionNotReadyRelations (also GetSubscriptionRelations?) can be
merged to be just one common function that takes a parameter to say do you
want to return the relation List of ALL / READY / NOT_READY? What do you
think?

==

3.2 src/backend/commands/subscriptioncmds.c - check_pub_table_subscribed

+/*
+ * Check and throw an error if the publisher has subscribed to the same table
+ * from some other publisher. This check is required only if copydata is ON and
+ * the origin is local.
+ */
+static void
+check_pub_table_subscribed(WalReceiverConn *wrconn, List *publications,
+						   CopyData copydata, char *origin, Oid subid)

The function comment probably should say something about why the new READY
logic was added in v26.

~~~

3.3

3.3.a

+	/*
+	 * The subid will be valid only for ALTER SUBSCRIPTION ... REFRESH
+	 * PUBLICATION. Get the ready relations for the subscription only in case
+	 * of ALTER SUBSCRIPTION case as there will be no relations in ready state
+	 * while the subscription is created.
+	 */
+	if (subid != InvalidOid)
+		subreadyrels = GetSubscriptionReadyRelations(subid);

The word "case" is 2x in the same sentence.

I also paraphrased my understanding of this comment below. Maybe it is
simpler?

SUGGESTION
Get the ready relations for the subscription. The subid will be valid only
for ALTER SUBSCRIPTION ... REFRESH because there will be no relations in
ready state while the subscription is created.

3.3.b

+	if (subid != InvalidOid)
+		subreadyrels = GetSubscriptionReadyRelations(subid);

SUGGESTION
if (OidIsValid(subid))

==

3.4 src/test/subscription/t/032_localonly.pl

Now all the test cases are re-using the same data (e.g. 11,12,13) and you
are deleting the data between the tests. I guess it is OK, but IMO the
tests are easier to read when each test part was using unique data (e.g.
11,12,13, then 12,22,32, then 13,23,33 etc)

--
Kind Regards,
Peter Smith.
Fujitsu Australia
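The merged-function suggestion in 3.1.b could look something like the following standalone sketch. Everything here is a mock (array in place of a List, invented names); the real helpers scan pg_subscription_rel and return a palloc'd List, but the filtering shape would be the same.

```c
#include <stdbool.h>

typedef enum
{
	SUB_REL_ALL,
	SUB_REL_READY,
	SUB_REL_NOT_READY,
} SubRelFilter;

typedef struct
{
	unsigned	relid;
	char		srsubstate;		/* 'r' = ready, as in pg_subscription_rel */
} MockSubRel;

/*
 * One common function replacing the ALL / READY / NOT_READY variants:
 * copy the matching relations into 'out' and return how many matched.
 */
static int
get_subscription_relations(const MockSubRel *in, int n,
						   SubRelFilter filter, MockSubRel *out)
{
	int			nout = 0;

	for (int i = 0; i < n; i++)
	{
		bool		ready = (in[i].srsubstate == 'r');

		if (filter == SUB_REL_ALL ||
			(filter == SUB_REL_READY && ready) ||
			(filter == SUB_REL_NOT_READY && !ready))
			out[nout++] = in[i];
	}
	return nout;
}
```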
Re: pgsql: dshash: Add sequential scan support.
Hi,

On 2022-07-05 11:20:54 +1200, Thomas Munro wrote:
> On Tue, Jul 5, 2022 at 8:54 AM Andres Freund wrote:
> > > Yeah, it's all for assertions... let's just remove it. Those
> > > assertions were useful to me at some stage in development but won't
> > > hold as well as I thought, at least without widespread PG_FINALLY(),
> > > which wouldn't be nice.
> >
> > Hm. I'd be inclined to at least add a few more
> > Assert(!LWLockHeldByMe[InMode]()) style assertions. E.g. to
> > dshash_find_or_insert().
>
> Yeah, I was wondering about that, but it needs to check the whole 128
> element lock array.

I think it'd be ok to just check the current partition - yes, it'd not
catch cases where we're still holding a lock on another partition, but
that's imo not too bad?

> Hmm, yeah that seems OK for assertion builds.
> Since there were 6 places with I-hold-no-lock assertions, I shoved the
> loop into a function so I could do:
>
> - Assert(!status->hash_table->find_locked);
> + assert_no_lock_held_by_me(hash_table);

I am a *bit* wary about the costs of that, even in assert builds - each
of the partition checks in the loop will in turn need to iterate through
held_lwlocks. But I guess we can also just later weaken them if it turns
out to be a problem.

Greetings,

Andres Freund
Re: pgsql: dshash: Add sequential scan support.
On Tue, Jul 5, 2022 at 8:54 AM Andres Freund wrote: > On 2022-07-04 14:55:43 +1200, Thomas Munro wrote: > > > It's per-backend state at least and just used for assertions. We could > > > remove > > > it. Or stop checking it in places where it could be set wrongly: > > > dshash_find() > > > and dshash_detach() couldn't check anymore, but the rest of the assertions > > > would still be valid afaics? > > > > Yeah, it's all for assertions... let's just remove it. Those > > assertions were useful to me at some stage in development but won't > > hold as well as I thought, at least without widespread PG_FINALLY(), > > which wouldn't be nice. > > Hm. I'd be inclined to at least add a few more > Assert(!LWLockHeldByMe[InMode]()) style assertions. E.g. to > dshash_find_or_insert(). Yeah, I was wondering about that, but it needs to check the whole 128 element lock array. Hmm, yeah that seems OK for assertion builds. Since there were 6 places with I-hold-no-lock assertions, I shoved the loop into a function so I could do: - Assert(!status->hash_table->find_locked); + assert_no_lock_held_by_me(hash_table); > > + Assert(LWLockHeldByMe(PARTITION_LOCK(hash_table, partition_index))); > > > > - hash_table->find_locked = false; > > - hash_table->find_exclusively_locked = false; > > LWLockRelease(PARTITION_LOCK(hash_table, partition_index)); > This LWLockHeldByMe() doesn't add much - the LWLockRelease() will error out if > we don't hold the lock. Duh. Removed. From da650f5dc0dda2155d088f56e2c062f7d10dec44 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 4 Jul 2022 11:56:02 +1200 Subject: [PATCH v2] Fix lock assertions in dshash.c. dshash.c maintained flags to track locking, in order to make assertions that the dshash API was being used correctly. Unfortunately the interaction with ereport's non-local exits was not thought through carefully enough and the flag could get out of sync with reality. 
Since these flags were primarily used for assertions that no lock was held, we can get most of the intended effect with a loop over all LWLocks instead (in assertion builds only). This problem was noted by Tom and Andres while reviewing changes to support the new shared memory stats system in release 15, the second user of dshash.c. On closer inspection, even the earlier typcache.c code was not guaranteed to meet the asserted conditions, now that it's been highlighted that even dsa_get_address() can throw (albeit in unlikely circumstances). Back-patch to 11, where dshash.c arrived. Reported-by: Tom Lane Reported-by: Andres Freund Reviewed-by: Kyotaro HORIGUCHI Reviewed-by: Zhihong Yu Reviewed-by: Andres Freund Discussion: https://postgr.es/m/20220311012712.botrpsikaufzt...@alap3.anarazel.de --- src/backend/lib/dshash.c | 54 +++- 1 file changed, 20 insertions(+), 34 deletions(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index ec454b4d65..90255a5eb9 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -110,8 +110,6 @@ struct dshash_table dshash_table_control *control; /* Control object in DSM. */ dsa_pointer *buckets; /* Current bucket pointers in DSM. */ size_t size_log2; /* log2(number of buckets) */ - bool find_locked; /* Is any partition lock held by 'find'? */ - bool find_exclusively_locked; /* ... exclusively? */ }; /* Given a pointer to an item, find the entry (user data) it holds. */ @@ -167,6 +165,7 @@ struct dshash_table BUCKET_INDEX_FOR_HASH_AND_SIZE(hash, \ hash_table->size_log2)]) +static inline void assert_no_lock_held_by_me(dshash_table *hash_table); static void delete_item(dshash_table *hash_table, dshash_table_item *item); static void resize(dshash_table *hash_table, size_t new_size); @@ -234,9 +233,6 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg) } } - hash_table->find_locked = false; - hash_table->find_exclusively_locked = false; - /* * Set up the initial array of buckets. 
Our initial size is the same as * the number of partitions. @@ -285,8 +281,6 @@ dshash_attach(dsa_area *area, const dshash_parameters *params, hash_table->params = *params; hash_table->arg = arg; hash_table->control = dsa_get_address(area, control); - hash_table->find_locked = false; - hash_table->find_exclusively_locked = false; Assert(hash_table->control->magic == DSHASH_MAGIC); /* @@ -309,7 +303,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params, void dshash_detach(dshash_table *hash_table) { - Assert(!hash_table->find_locked); + assert_no_lock_held_by_me(hash_table); /* The hash table may have been destroyed. Just free local memory. */ pfree(hash_table); @@ -400,7 +394,7 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive) partition = PARTITION_FOR_HASH(hash); Assert(hash_table->control->magic == DSHASH_MAGIC); -
Re: TAP output format in pg_regress
Hi, On 2022-07-05 00:06:04 +0200, Daniel Gustafsson wrote: > > On 4 Jul 2022, at 16:27, Peter Eisentraut > > wrote: > > > > On 29.06.22 21:50, Daniel Gustafsson wrote: > >> Attached is a new version of this patch, which completes the TAP output > >> format > >> option such that all codepaths emitting output are TAP compliant. The > >> verbose > >> option is fixed to not output extraneous newlines which the previous PoC > >> did. The output is made to conform to the original TAP spec since v13/14 > >> TAP > >> parsers seem less common than those that can handle the original spec. > >> Support > >> for the new format additions should be quite simple to add should we want > >> that. > >> Running pg_regress --verbose should give the current format output. > >> I did end up combining TAP and --verbose into a single patch, as the TAP > >> format > >> sort of depends on the verbose flag as TAP has no verbose mode. I can > >> split it > >> into two separate patches should a reviewer prefer that. > > > > I'm not sure what to make of all these options. I think providing a TAP > > output for pg_regress is a good idea. But then do we still need the old > > output? Is it worth maintaining two output formats that display exactly > > the same thing in slightly different ways? > > If we believe that TAP is good enough for human consumption and not just as > input to test runners then we don't. Personally I think the traditional > format > is more pleasant to read than raw TAP output when running tests. I think with a bit of care the tap output could be nearly the same "quality". It might not be the absolute "purest" tap output, but who cares.

test tablespace                   ... ok          418 ms
parallel group (20 tests):  oid char int2 varchar name int4 text pg_lsn regproc txid money boolean uuid float4 int8 float8 bit enum rangetypes numeric
     boolean                      ... ok           34 ms
     char                         ... ok           20 ms
     name                         ... ok           26 ms

isn't that different from

ok 1 - tablespace                         418ms
# parallel group (20 tests):  oid char int2 varchar name int4 text pg_lsn regproc txid money boolean uuid float4 int8 float8 bit enum rangetypes numeric
ok 2 - boolean                             34ms
ok 3 - char                                20ms
ok 4 - name                                26ms

or whatever. For non-parallel tests I think we currently print the test name before running the test, which obviously doesn't work well when needing to print the 'ok' / 'not ok' first. We could just print

# non-parallel group
tablespace

or such? That doesn't

I wonder if for parallel tests we should print the test number based on the start of the test rather than the finish time? Hm, it also looks like it's legal to just leave the test number out? > The discussion on this was in > 20220221164736.rq3ornzjdkmwk...@alap3.anarazel.de > where it was proposed that we could cut the boilerplate. I think that was more about things like CREATE DATABASE etc. And I'm not sure we need an option to keep showing those details. > > I remember the last time I wanted to tweak the output of the parallel > > tests, people were very attached to the particular timing and spacing of > > the current output. So I'm not sure people will like this. > > > > The timing output is very popular. Where is that in the TAP output? > > As I mentioned in the mail upthread, TAP runners generally hide that info and > only show a running total. That being said, I do agree with adding it back, so > I'll do that in a new version of the patch. FWIW, meson's testrunner shows the individual tap output when using -v. One test where using -v is a problem is pg_dump's tests - the absurd number of 8169 tests makes showing that decidedly not fun. > Having test output format parity with supported back branches seemed like a > good idea to me at the time of writing at least. That's of course nice, but it also kind of corners us into not evolving the default format, which I don't think is warranted... Greetings, Andres Freund
Re: Re-order "disable_on_error" in tab-complete COMPLETE_WITH
On Mon, Jul 4, 2022 at 10:29 PM Euler Taveira wrote: > > On Mon, Jul 4, 2022, at 5:37 AM, Amit Kapila wrote: > > Yeah, it seems we have overlooked this point. I think we can do this > just for HEAD but as the feature is introduced in PG-15 so there is no > harm in pushing it to PG-15 as well especially because it is a > straightforward change. What do you or others think? > > No objection. It is a good thing for future backpatches. > Since there is no functional change or bugfix here, I thought it was only applicable to HEAD. This change is almost in the same category as a code comment typo patch - do those normally get backpatched? Maybe we should follow the same convention here. OTOH, if you think it may be helpful for future backpatches then I am also fine with pushing it to PG15. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: TAP output format in pg_regress
> On 4 Jul 2022, at 16:39, Tom Lane wrote: > > Peter Eisentraut writes: >> I'm not sure what to make of all these options. I think providing a TAP >> output for pg_regress is a good idea. But then do we still need the old >> output? Is it worth maintaining two output formats that display exactly >> the same thing in slightly different ways? > > Probably is, because this is bad: > >> ... The proposed default format now hides the >> fact that some tests are started in parallel. I remember the last time >> I wanted to tweak the output of the parallel tests, people were very >> attached to the particular timing and spacing of the current output. So >> I'm not sure people will like this. > > and so is this: > >> The timing output is very popular. Where is that in the TAP output? > > Both of those things are fairly critical for test development. You > need to know what else might be running in parallel with a test case, > and you need to know whether you just bloated the runtime unreasonably. > > More generally, I'm unhappy about the proposal that TAP should become > the default output. That's not my proposal, though: my proposal is that the traditional format should be the default output (with the parallel test info added back, that was my bad), and that TAP is used in automated test runners like in meson. The timing was hidden in TAP output (as mentioned upthread) because TAP test runners generally never show it anyway, but I'll add it back since it clearly doesn't hurt to have it there. > There is nothing particularly human-friendly > about it, whereas the existing format is something we have tuned to > our liking over literally decades. I don't mind if there's a way to > get TAP when you're actually intending to feed it into a TAP-parsing > tool, but I am not a TAP-parsing tool and I don't see why I should > have to put up with it. I totally agree, and that's why the patch has the traditional format - without all the boilerplate - as the default. 
Unless opting in, there is no change over today, apart from the boilerplate. -- Daniel Gustafsson https://vmware.com/
Re: Request for assistance to backport CVE-2022-1552 fixes to 9.6 and 9.4
On Wed, Jun 08, 2022 at 05:31:22PM -0400, Roberto C. Sánchez wrote: > On Wed, Jun 08, 2022 at 04:15:47PM -0400, Tom Lane wrote: > > Roberto =?iso-8859-1?Q?C=2E_S=E1nchez?= writes: > > > I am investigating backporting the fixes for CVE-2022-1552 to 9.6 and > > > 9.4 as part of Debian LTS and Extended LTS. I am aware that these > > > releases are no longer supported upstream, but I have made an attempt at > > > adapting commits ef792f7856dea2576dcd9cab92b2b05fe955696b and > > > f26d5702857a9c027f84850af48b0eea0f3aa15c from the REL_10_STABLE branch. > > > I would appreciate a review of the attached patches and any comments on > > > things that may have been missed and/or adapted improperly. > > > > FWIW, I would not recommend being in a huge hurry to back-port those > > changes, pending the outcome of this discussion: > > > > https://www.postgresql.org/message-id/flat/f8a4105f076544c180a87ef0c4822352%40stmuk.bayern.de > > > Thanks for the pointer. > > > We're going to have to tweak that code somehow, and it's not yet > > entirely clear how. > > > I will monitor the discussion to see what comes of it. > Based on the discussion in the other thread, I have made an attempt to backport commit 88b39e61486a8925a3861d50c306a51eaa1af8d6 to 9.6 and 9.4. The only significant change that I had to make was to add the full function signatures for the REVOKE/GRANT in the citext test. One question that I had about the change as committed is whether a REVOKE is needed on s.citext_ne, like so: REVOKE ALL ON FUNCTION s.citext_ne FROM PUBLIC; Or (for pre-10): REVOKE ALL ON FUNCTION s.citext_ne(s.citext, s.citext) FROM PUBLIC; I ask because the comment immediately preceding the sequence of REVOKEs includes the comment "Revoke all conceivably-relevant ACLs within the extension." I am not especially knowledgable about deep internals, but that function seems like it would belong in the same group with the others. 
In any event, would someone be willing to review the attached patches for correctness? I would like to shortly publish updates to 9.6 and 9.4 in Debian and a review would be most appreciated. Regards, -Roberto -- Roberto C. Sánchez >From ef792f7856dea2576dcd9cab92b2b05fe955696b Mon Sep 17 00:00:00 2001 From: Noah Misch Date: Mon, 9 May 2022 08:35:08 -0700 Subject: [PATCH] Make relation-enumerating operations be security-restricted operations. When a feature enumerates relations and runs functions associated with all found relations, the feature's user shall not need to trust every user having permission to create objects. BRIN-specific functionality in autovacuum neglected to account for this, as did pg_amcheck and CLUSTER. An attacker having permission to create non-temp objects in at least one schema could execute arbitrary SQL functions under the identity of the bootstrap superuser. CREATE INDEX (not a relation-enumerating operation) and REINDEX protected themselves too late. This change extends to the non-enumerating amcheck interface. Back-patch to v10 (all supported versions). Sergey Shinderuk, reviewed (in earlier versions) by Alexander Lakhin. Reported by Alexander Lakhin. 
Security: CVE-2022-1552
---
 src/backend/access/brin/brin.c           | 30 -
 src/backend/catalog/index.c              | 41 +--
 src/backend/commands/cluster.c           | 35
 src/backend/commands/indexcmds.c         | 53 +--
 src/backend/utils/init/miscinit.c        | 24 --
 src/test/regress/expected/privileges.out | 42
 src/test/regress/sql/privileges.sql      | 36 +
 7 files changed, 231 insertions(+), 30 deletions(-)

--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
+#include "utils/guc.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -786,6 +787,9 @@
 	Oid			heapoid;
 	Relation	indexRel;
 	Relation	heapRel;
+	Oid			save_userid;
+	int			save_sec_context;
+	int			save_nestlevel;
 	double		numSummarized = 0;

 	if (RecoveryInProgress())
@@ -799,10 +803,28 @@
 	 * passed indexoid isn't an index then IndexGetRelation() will fail.
 	 * Rather than emitting a not-very-helpful error message, postpone
 	 * complaining, expecting that the is-it-an-index test below will fail.
+	 *
+	 * unlike brin_summarize_range(), autovacuum never calls this. hence, we
+	 * don't switch userid.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
 	if (OidIsValid(heapoid))
+	{
 		heapRel = heap_open(heapoid, ShareUpdateExclusiveLock);
+
+		/*
+		 * Autovacuum calls us.  For its benefit, switch to the table owner's
+		 * userid, so that any index functions are run as that user.  Also
+		 * lock down security-restricted operations and arrange to make GUC
+		 * variable changes local to this command.  This is harmless,
Re: TAP output format in pg_regress
Andres Freund writes: >> Both of those things are fairly critical for test development. You >> need to know what else might be running in parallel with a test case, >> and you need to know whether you just bloated the runtime unreasonably. > That should be doable with tap as well - afaics the output of that could > nearly be the same as now, preceded by a #. I don't mind minor changes like prefixing # --- I just don't want to lose information. regards, tom lane
Re: TAP output format in pg_regress
> On 4 Jul 2022, at 16:27, Peter Eisentraut > wrote: > > On 29.06.22 21:50, Daniel Gustafsson wrote: >> Attached is a new version of this patch, which completes the TAP output >> format >> option such that all codepaths emitting output are TAP compliant. The >> verbose >> option is fixed to not output extraneous newlines which the previous PoC >> did. The output is made to conform to the original TAP spec since v13/14 TAP >> parsers seem less common than those that can handle the original spec. >> Support >> for the new format additions should be quite simple to add should we want >> that. >> Running pg_regress --verbose should give the current format output. >> I did end up combining TAP and --verbose into a single patch, as the TAP >> format >> sort of depends on the verbose flag as TAP has no verbose mode. I can split >> it >> into two separate patches should a reviewer prefer that. > > I'm not sure what to make of all these options. I think providing a TAP > output for pg_regress is a good idea. But then do we still need the old > output? Is it worth maintaining two output formats that display exactly the > same thing in slightly different ways? If we believe that TAP is good enough for human consumption and not just as input to test runners then we don't. Personally I think the traditional format is more pleasant to read than raw TAP output when running tests. > What is the purpose of the --verbose option? When and how is one supposed to > activate that? The proposed default format now hides the fact that some > tests are started in parallel. The discussion on this was in 20220221164736.rq3ornzjdkmwk...@alap3.anarazel.de where it was proposed that we could cut the boilerplate. Thinking on it, I agree that the parallel run info should be included even without --verbose, so I'll add that back. In general, running with --verbose should ideally only be required for troubleshooting setup/teardown issues with testing. 
> I remember the last time I wanted to tweak the output of the parallel tests, > people were very attached to the particular timing and spacing of the current > output. So I'm not sure people will like this. > > The timing output is very popular. Where is that in the TAP output? As I mentioned in the mail upthread, TAP runners generally hide that info and only show a running total. That being said, I do agree with adding it back, so I'll do that in a new version of the patch. > More generally, what do you envision we do with this feature? Who is it for, > what are the tradeoffs, etc.? In general, my thinking with this was that normal testruns started with make check (or similar) by developers would use the traditional format (albeit less verbose) and that the TAP output was for automated test runners in general and the meson test runner in particular. The TAP format is an opt-in with the traditional format being the default. The tradeoff is of course that maintaining two output formats is more work than maintaining one, but it's not really something we change all that often, so that might not be too heavy a burden. I personally didn't see us replacing the traditional format for "human readable" runs; if that's where the discussion is heading then the patch can look quite different. Having test output format parity with supported back branches seemed like a good idea to me at the time of writing at least. -- Daniel Gustafsson https://vmware.com/
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote: > In both test cases, There is not much difference between using AVX2 > and SSE2. The more mode types, the more time it takes for loading the > data (see sse2_4_16_32_128_256). Yea, at some point the compiler starts using a jump table instead of branches, and that turns out to be a good bit more expensive. And even with branches, it obviously adds hard to predict branches. IIRC I fought a bit with the compiler to avoid some of that cost, it's possible that got "lost" in Sawada-san's patch. Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node one is not unimportant until we have path compression. Right now the node struct sizes are: 4 - 48 bytes 32 - 296 bytes 128 - 1304 bytes 256 - 2088 bytes I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other bytes, but needing that separate isset array somehow is sad :/. I wonder if a smaller "free index" would do the trick? Point to the element + 1 where we searched last and start a plain loop there. Particularly in an insert-only workload that'll always work, and in other cases it'll still often work I think. One thing I was wondering about is trying to choose node types in roughly-power-of-two struct sizes. It's pretty easy to end up with significant fragmentation in the slabs right now when inserting as you go, because some of the smaller node types will be freed but not enough to actually free blocks of memory. If we instead have ~power-of-two sizes we could just use a single slab of the max size, and carve out the smaller node types out of that largest allocation. Btw, that fragmentation is another reason why I think it's better to track memory usage via memory contexts, rather than doing so based on GetMemoryChunkSpace(). > > Ideally, node16 and node32 would have the same code with a different > > loop count (1 or 2). 
More generally, there is too much duplication of > > code (noted by Andres in his PoC), and there are many variable names > > with the node size embedded. This is a bit tricky to make more > > general, so we don't need to try it yet, but ideally we would have > > something similar to: > > > > switch (node->kind) // todo: inspect tagged pointer > > { > > case RADIX_TREE_NODE_KIND_4: > >idx = node_search_eq(node, chunk, 4); > >do_action(node, idx, 4, ...); > >break; > > case RADIX_TREE_NODE_KIND_32: > >idx = node_search_eq(node, chunk, 32); > >do_action(node, idx, 32, ...); > > ... > > } FWIW, that should be doable with an inline function, if you pass it the memory to the "array" rather than the node directly. Not so sure it's a good idea to do dispatch between node types / search methods inside the helper, as you suggest below: > > static pg_alwaysinline void > > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout) > > { > > if (node_fanout <= SIMPLE_LOOP_THRESHOLD) > > // do simple loop with (node_simple *) node; > > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD) > > // do vectorized loop where available with (node_vec *) node; > > ... > > } Greetings, Andres Freund
Re: TAP output format in pg_regress
> On 4 Jul 2022, at 22:13, Andres Freund wrote: > > Hi, > > On 2022-06-29 21:50:45 +0200, Daniel Gustafsson wrote: >> @@ -279,8 +648,7 @@ stop_postmaster(void) >> r = system(buf); >> if (r != 0) >> { >> -fprintf(stderr, _("\n%s: could not stop postmaster: >> exit code was %d\n"), >> -progname, r); >> +pg_log_error("could not stop postmaster: exit code was >> %d", r); >> _exit(2); /* not exit(), that >> could be recursive */ >> } > > There's a lot of stuff like this. Perhaps worth doing separately? I'm not sure > I understand where you used bail and where not. I assume it's mostly around use > of _exit() vs exit()? Since bail will cause the entire testrun to be considered a failure, the idea was to avoid using bail() for any errors in tearing down the test harness after an otherwise successful test run. Moving to pg_log_error() can for sure be broken out into a separate patch from the rest of the set (if we at all want to do that, but it seemed logical to address when dealing with other output routines). >> +test_status_ok(tests[i]); >> >> if (statuses[i] != 0) >> log_child_failure(statuses[i]); >> >> INSTR_TIME_SUBTRACT(stoptimes[i], starttimes[i]); >> -status(_(" %8.0f ms"), >> INSTR_TIME_GET_MILLISEC(stoptimes[i])); >> +runtime(tests[i], >> INSTR_TIME_GET_MILLISEC(stoptimes[i])); > > Based on the discussion downthread, let's just always compute this and display > it even in the tap format? Sure, it's easy enough to do and include in the test description. The reason I left it out is that the test runners I played around with all hide those details and only show a running total. That of course doesn't mean that all runners will do that (and anyone running TAP output for human consumption will want it), so I agree with putting it in; I'll fix that up in a v6 shortly. -- Daniel Gustafsson https://vmware.com/
Re: doc: BRIN indexes and autosummarize
On Mon, Jul 04, 2022 at 09:38:42PM +0200, Alvaro Herrera wrote: > + There are several triggers for initial summarization of a page range > + to occur. If the table is vacuumed, either because > +has been manually invoked or because > + autovacuum causes it, > + all existing unsummarized page ranges are summarized. I'd say "If the table is vacuumed manually or by autovacuum, ..." (Or "either manually or by autovacuum, ...") > + Also, if the index has the > +parameter set to on, Maybe say "If the autovacuum parameter is enabled" (this may avoid needing to revise it later if we change the default). > + then any run of autovacuum in the database will summarize all I'd avoid saying "run" and instead say "then anytime autovacuum runs in that database, all ..." > + unsummarized page ranges that have been completely filled recently, > + regardless of whether the table is processed by autovacuum for other > + reasons; see below. say "whether the table itself" and remove "for other reasons" ? > > When autosummarization is enabled, each time a page range is filled a Maybe: "filled," (add a comma after "filled") > - request is sent to autovacuum for it to execute a targeted summarization > - for that range, to be fulfilled at the end of the next worker run on the > - same database. If the request queue is full, the request is not recorded > - and a message is sent to the server log: > + request is sent to autovacuum for it to execute a > targeted > + summarization for that range, to be fulfilled at the end of the next > + autovacuum worker run on the same database. If the request queue is full, > the "to be fulfilled the next time an autovacuum worker finishes running in that database." or "to be fulfilled by an autovacuum worker the next time it finishes running in that database." 
> +++ b/doc/src/sgml/ref/create_index.sgml > @@ -580,6 +580,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT > EXISTS ] > Defines whether a summarization run is invoked for the previous page > range whenever an insertion is detected on the next one. > + See for more details. > + The default is off. Maybe "invoked" should say "queued" ? Also, a reminder that this was never addressed (I wish the project had a way to keep track of known issues). https://www.postgresql.org/message-id/20201113160007.gq30...@telsasoft.com |error_severity of brin work item |left | could not open relation with OID 292103095 |left | processing work entry for relation "ts.child.alarms_202010_alarm_clear_time_idx" |Those happen following a REINDEX job on that index. This inline patch includes my changes as well as yours. And the attached patch is my changes only. diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index caf1ea4cef1..90897a4af07 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -73,31 +73,55 @@ summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. - This process can be invoked manually using the - brin_summarize_range(regclass, bigint) or - brin_summarize_new_values(regclass) functions; - automatically when VACUUM processes the table; - or by automatic summarization executed by autovacuum, as insertions - occur. (This last trigger is disabled by default and can be enabled - with the autosummarize parameter.) - Conversely, a range can be de-summarized using the - brin_desummarize_range(regclass, bigint) function, - which is useful when the index tuple is no longer a very good - representation because the existing values have changed. 
- When autosummarization is enabled, each time a page range is filled a - request is sent to autovacuum for it to execute a targeted summarization - for that range, to be fulfilled at the end of the next worker run on the - same database. If the request queue is full, the request is not recorded - and a message is sent to the server log: + There are several ways to trigger the initial summarization of a page range. + If the table is vacuumed, either manually or by + autovacuum, + all existing unsummarized page ranges are summarized. + Also, if the index's +parameter is enabled, + whenever autovacuum runs in that database, summarization will + occur for all + unsummarized page ranges that have been filled, + regardless of whether the table itself is processed by autovacuum; see below. + + Lastly, the following functions can be used: + + + + brin_summarize_range(regclass, bigint) + summarizes all unsummarized ranges + + + brin_summarize_new_values(regclass) + summarizes one specific range, if it is unsummarized + + + + + + When autosummarization is enabled, each time a page range is filled, a + request is sent to autovacuum to execute a targeted + summarization
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c > new file mode 100644 > index 00..bf87f932fd > --- /dev/null > +++ b/src/backend/lib/radixtree.c > @@ -0,0 +1,1763 @@ > +/*- > + * > + * radixtree.c > + * Implementation for adaptive radix tree. > + * > + * This module employs the idea from the paper "The Adaptive Radix Tree: > ARTful > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and > Thomas > + * Neumann, 2013. > + * > + * There are some differences from the proposed implementation. For > instance, > + * this radix tree module utilizes AVX2 instruction, enabling us to use > 256-bit > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper. > + * Also, there is no support for path compression and lazy path expansion. > The > + * radix tree supports fixed length of the key so we don't expect the tree > level > + * wouldn't be high. I think we're going to need path compression at some point, fwiw. I'd bet on it being beneficial even for the tid case. > + * The key is a 64-bit unsigned integer and the value is a Datum. I don't think it's a good idea to define the value type to be a datum. > +/* > + * As we descend a radix tree, we push the node to the stack. The stack is > used > + * at deletion. > + */ > +typedef struct radix_tree_stack_data > +{ > + radix_tree_node *node; > + struct radix_tree_stack_data *parent; > +} radix_tree_stack_data; > +typedef radix_tree_stack_data *radix_tree_stack; I think it's a very bad idea for traversal to need allocations. I really want to eventually use this for shared structures (eventually with lock-free searches at least), and needing to do allocations while traversing the tree is a no-go for that. Particularly given that the tree currently has a fixed depth, can't you just allocate this on the stack once? > +/* > + * Allocate a new node with the given node kind. 
> + */ > +static radix_tree_node * > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind) > +{ > + radix_tree_node *newnode; > + > + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind], > + > radix_tree_node_info[kind].size); > + newnode->kind = kind; > + > + /* update the statistics */ > + tree->mem_used += GetMemoryChunkSpace(newnode); > + tree->cnt[kind]++; > + > + return newnode; > +} Why are you tracking the memory usage at this level of detail? It's *much* cheaper to track memory usage via the memory contexts? Since they're dedicated for the radix tree, that ought to be sufficient? > + else if (idx != n4->n.count) > + { > + /* > + * the key needs to be inserted > in the middle of the > + * array, make space for the > new key. > + */ > + memmove(&(n4->chunks[idx + 1]), > &(n4->chunks[idx]), > + sizeof(uint8) * > (n4->n.count - idx)); > + memmove(&(n4->slots[idx + 1]), > &(n4->slots[idx]), > + > sizeof(radix_tree_node *) * (n4->n.count - idx)); > + } Maybe we could add a static inline helper for these memmoves? Both because it's repetitive (for different node types) and because the last time I looked gcc was generating quite bad code for this. And having to put workarounds into multiple places is obviously worse than having to do it in one place. > +/* > + * Insert the key with the val. > + * > + * found_p is set to true if the key already present, otherwise false, if > + * it's not NULL. > + * > + * XXX: do we need to support update_if_exists behavior? > + */ Yes, I think that's needed - hence using bfm_set() instead of insert() in the prototype. 
> +void > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) > +{ > + int shift; > + boolreplaced; > + radix_tree_node *node; > + radix_tree_node *parent = tree->root; > + > + /* Empty tree, create the root */ > + if (!tree->root) > + radix_tree_new_root(tree, key, val); > + > + /* Extend the tree if necessary */ > + if (key > tree->max_val) > + radix_tree_extend(tree, key); FWIW, the reason I used separate functions for these in the prototype is that it turns out to generate a lot better code, because it allows non-inlined function calls to be
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, I just noticed that I had a reply forgotten in drafts... On 2022-05-10 10:51:46 +0900, Masahiko Sawada wrote: > To move this project forward, I've implemented radix tree > implementation from scratch while studying Andres's implementation. It > supports insertion, search, and iteration but not deletion yet. In my > implementation, I use Datum as the value so internal and leaf nodes > have the same data structure, simplifying the implementation. The > iteration on the radix tree returns keys with the value in ascending > order of the key. The patch has regression tests for radix tree but is > still in PoC state: left many debugging codes, not supported SSE2 SIMD > instructions, added -mavx2 flag is hard-coded. Very cool - thanks for picking this up. Greetings, Andres Freund
Re: pgsql: dshash: Add sequential scan support.
Hi, On 2022-07-04 14:55:43 +1200, Thomas Munro wrote: > Right, as seen in the build farm at [1]. Also reproducible with something > like: > > @@ -269,6 +269,14 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size > request_size, > return false; > } > > + /* XXX random fault injection */ > + if (op == DSM_OP_ATTACH && random() < RAND_MAX / 8) > + { > + close(fd); > + elog(ERROR, "chaos"); > + return false; > + } > + > > I must have thought that it was easy and practical to write no-throw > straight-line code and be sure to reach dshash_release_lock(), but I > concede that it was a bad idea: even dsa_get_address() can throw*, and > you're often likely to need to call that while accessing dshash > elements. For example, in lookup_rowtype_tupdesc_internal(), there is > a sequence dshash_find(), ..., dsa_get_address(), ..., > dshash_release_lock(), and I must have considered the range of code > between find and release to be no-throw, but now I know that it is > not. Yea - I'd go as far as saying that it's almost never feasible. > > It's per-backend state at least and just used for assertions. We could > > remove > > it. Or stop checking it in places where it could be set wrongly: > > dshash_find() > > and dshash_detach() couldn't check anymore, but the rest of the assertions > > would still be valid afaics? > > Yeah, it's all for assertions... let's just remove it. Those > assertions were useful to me at some stage in development but won't > hold as well as I thought, at least without widespread PG_FINALLY(), > which wouldn't be nice. Hm. I'd be inclined to at least add a few more Assert(!LWLockHeldByMe[InMode]()) style assertions. E.g. to dshash_find_or_insert(). 
> @@ -572,13 +552,8 @@ dshash_release_lock(dshash_table *hash_table, void > *entry) > size_t partition_index = PARTITION_FOR_HASH(item->hash); > > Assert(hash_table->control->magic == DSHASH_MAGIC); > - Assert(hash_table->find_locked); > - Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index), > - > hash_table->find_exclusively_locked > - ? LW_EXCLUSIVE > : LW_SHARED)); > + Assert(LWLockHeldByMe(PARTITION_LOCK(hash_table, partition_index))); > > - hash_table->find_locked = false; > - hash_table->find_exclusively_locked = false; > LWLockRelease(PARTITION_LOCK(hash_table, partition_index)); > } This LWLockHeldByMe() doesn't add much - the LWLockRelease() will error out if we don't hold the lock. Greetings, Andres Freund
Re: doc: BRIN indexes and autosummarize
On Mon, Jul 04, 2022 at 09:38:42PM +0200, Alvaro Herrera wrote: > What about this? > > -- > Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ > "Java is clearly an example of money oriented programming" (A. Stepanov) > diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml > index caf1ea4cef..0a715d41c7 100644 > --- a/doc/src/sgml/brin.sgml > +++ b/doc/src/sgml/brin.sgml > @@ -73,31 +73,55 @@ > summarized range, that range does not automatically acquire a summary > tuple; those tuples remain unsummarized until a summarization run is > invoked later, creating initial summaries. > - This process can be invoked manually using the > - brin_summarize_range(regclass, bigint) or > - brin_summarize_new_values(regclass) functions; > - automatically when VACUUM processes the table; > - or by automatic summarization executed by autovacuum, as insertions > - occur. (This last trigger is disabled by default and can be enabled > - with the autosummarize parameter.) > - Conversely, a range can be de-summarized using the > - brin_desummarize_range(regclass, bigint) function, > - which is useful when the index tuple is no longer a very good > - representation because the existing values have changed. > + > + I feel that somewhere in this paragraph it should be mentioned that autosummarize is off by default. Otherwise, +1 -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi, On 2022-06-27 16:55:55 +0200, David Geier wrote: > Indeed, the total JIT time increases the more modules are used. The reason > for this to happen is that the inlining pass loads and deserializes all to > be inlined modules (.bc files) from disk prior to inlining them via > llvm::IRMover. There's already a cache for such modules in the code, but it > is currently unused. This is because llvm::IRMover takes the module to be > inlined as std::unique_ptr. The by-value argument requires > the source module to be moved, which means it cannot be reused afterwards. > The code is accounting for that by erasing the module from the cache after > inlining it, which in turns requires reloading the module next time a > reference to it is encountered. > > Instead of each time loading and deserializing all to be inlined modules > from disk, they can reside in the cache and instead be cloned via > llvm::CloneModule() before they get inlined. Key to calling > llvm::CloneModule() is fully deserializing the module upfront, instead of > loading the module lazily. That is why I changed the call from > LLVMGetBitcodeModuleInContext2() (which lazily loads the module via > llvm::getOwningLazyBitcodeModule()) to LLVMParseBitCodeInContext2() (which > fully loads the module via llvm::parseBitcodeFile()). Beyond that it seems > like that prior to LLVM 13, cloning modules could fail with an assertion > (not sure though if that would cause problems in a release build without > assertions). Andres reported this problem back in the days here [1]. In the > meanwhile the issue got discussed in [2] and finally fixed for LLVM 13, see > [3]. Unfortunately that doesn't work right now - that's where I had started. The problem is that IRMover renames types. Which, in the case of cloned modules unfortunately means that types used cloned modules are also renamed in the "origin" module. Which then causes problems down the line, because parts of the LLVM code match types by type names. 
That can then have the effect of drastically decreasing code generation quality over time, because e.g. inlining never manages to find compatible signatures. > However, curiously the time spent on optimizing is also reduced (95ms > instead of 164ms). Could this be because some of the applied optimizations > are ending up in the cached module? I suspect it's more that optimization stops being able to do a lot, due to the type renaming issue. > @Andres: could you provide me with the queries that caused the assertion > failure in LLVM? I don't think I have the concrete query. What I tend to do is to run the whole regression tests with forced JITing. I'm fairly certain this triggered the bug at the time. > Have you ever observed a segfault with a non-assert-enabled build? I think I observed bogus code generation that then could lead to segfaults or such. > I just want to make sure this is truly fixed in LLVM 13. Running 'make > check-world' all tests passed. With jit-ing forced for everything? One more thing to try is to jit-compile twice and ensure the code is the same. It certainly wasn't in the past due to the above issue. Greetings, Andres Freund
Re: Export log_line_prefix(); useful for emit_log_hook.
On Mon, 2022-07-04 at 15:54 +0900, Michael Paquier wrote: > On Wed, Jun 29, 2022 at 03:09:42PM +0200, Alvaro Herrera wrote: > > Hmm, maybe your hypothetical book would prefer to use a different > > setting for log line prefix than Log_line_prefix, so it would make > > sense > > to pass the format string as a parameter to the function instead of > > relying on the GUC global. That is nicer, attached. I also renamed the function log_status_format(), and made log_line_prefix() a thin wrapper over that. I think that's less confusing. Regards, Jeff Davis From 4c64049e500c80ac495732d34ae56e07393115fe Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Tue, 28 Jun 2022 11:39:33 -0700 Subject: [PATCH] Provide log_status_format(), useful for an emit_log_hook. Refactor so that log_line_prefix() is a thin wrapper over a new function log_status_format(), and move the implementation to the latter. Export log_status_format() so that it can be used by an emit_log_hook. --- src/backend/utils/error/elog.c | 15 --- src/include/utils/elog.h | 3 +++ 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 55ee5423afb..7b2bb9c8b75 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -2439,10 +2439,19 @@ process_log_prefix_padding(const char *p, int *ppadding) } /* - * Format tag info for log lines; append to the provided buffer. + * Format log status information using Log_line_prefix. */ static void log_line_prefix(StringInfo buf, ErrorData *edata) +{ + log_status_format(buf, Log_line_prefix, edata); +} + +/* + * Format log status info; append to the provided buffer. 
+ */ +void +log_status_format(StringInfo buf, char *format, ErrorData *edata) { /* static counter for line numbers */ static long log_line_number = 0; @@ -2466,10 +2475,10 @@ log_line_prefix(StringInfo buf, ErrorData *edata) } log_line_number++; - if (Log_line_prefix == NULL) + if (format == NULL) return; /* in case guc hasn't run yet */ - for (p = Log_line_prefix; *p != '\0'; p++) + for (p = format; *p != '\0'; p++) { if (*p != '%') { diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index f5c6cd904de..e88e3f8a801 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -16,6 +16,8 @@ #include +#include "lib/stringinfo.h" + /* Error level codes */ #define DEBUG5 10 /* Debugging messages, in categories of * decreasing detail. */ @@ -439,6 +441,7 @@ extern PGDLLIMPORT bool syslog_split_messages; #define LOG_DESTINATION_JSONLOG 16 /* Other exported functions */ +extern void log_status_format(StringInfo buf, char *format, ErrorData *edata); extern void DebugFileOpen(void); extern char *unpack_sql_state(int sql_state); extern bool in_error_recursion_trouble(void); -- 2.17.1
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi, On 2022-07-04 06:43:00 +, Luc Vlaming Hummel wrote: > Thanks for reviewing this and the interesting examples! > > Wanted to give a bit of extra insight as to why I'd love to have a system > that can lazily emit JIT code and hence creates roughly a module per function: > In the end I'm hoping that we can migrate to a system where we only JIT after > a configurable cost has been exceeded for this node, as well as a > configurable amount of rows has actually been processed. > Reason is that this would safeguard against some problematic planning issues > wrt JIT (node not being executed, row count being massively off). I still don't see how it's viable to move to always doing function-by-function emission, overhead-wise. I also want to go to do JIT in the background and triggered by actual usage. But to me it seems a dead end to require moving to one-function-per-module model for that. > If this means we have to invest more in making it cheap(er) to emit modules, > I'm all for that. I think that's just inherently more expensive and thus a no-go. > @Andres if there's any other things we ought to fix to make this cheap > (enough) compared to the previous code I'd love to know your thoughts. I'm not seeing it. Greetings, Andres Freund
Re: Proposal: allow database-specific role memberships
Hi Antonin, First of all, thank you so much for taking the time to review my patch. I'll answer your questions in reverse order: The "unsafe_tests" directory is where the pre-existing role tests were located. According to the readme of the "unsafe_tests" directory, the tests contained within are not run during "make installcheck" because they could have side-effects that seem undesirable for a production installation. This seemed like a reasonable location as the new tests that this patch introduces also modifies the "state" of the database cluster by adding, modifying, and removing roles & databases (including template1). Regarding roles_is_member_of(), the nuance is that role "A" in your example would only be considered a member of role "B" (and by extension role "C") when connected to the database in which "A" was granted database-specific membership to "B". Conversely, when connected to any other database, "A" would not be considered to be a member of "B". This patch is designed to solve the scenarios in which one may want to grant constrained access to a broader set of privileges. For example, membership in "pg_read_all_data" effectively grants SELECT and USAGE rights on everything (implicitly cluster-wide in today's implementation). By granting a role membership to "pg_read_all_data" within the context of a specific database, the grantee's read-everything privilege is effectively constrained to just that specific database (as membership within "pg_read_all_data" would not otherwise be held). A rebased version is attached. Thanks again! - Kenaniah On Wed, Jun 29, 2022 at 6:45 AM Antonin Houska wrote: > Kenaniah Cerny wrote: > > > Attached is a newly-rebased patch -- would love to get a review from > someone whenever possible. > > I've picked this patch for a review. The patch currently does not apply to > the > master branch, so I could only read the diff. 
Following are my comments: > > * I think that roles_is_member_of() deserves a comment explaining why the > code > that you moved into append_role_memberships() needs to be called twice, > i.e. once for global memberships and once for the database-specific ones. > > I think the reason is that if, for example, role "A" is a > database-specific > member of role "B" and "B" is a "global" member of role "C", then "A" > should > not be considered a member of "C", unless "A" is granted "C" explicitly. > Is > this behavior intended? > > Note that in this example, the "C" members are a superset of "B" members, > and thus "C" should have weaker permissions on database objects than > "B". What's then the reason to not consider "A" a member of "C"? If "C" > gives its members some permissions of "B" (e.g. "pg_write_all_data"), > then I > think the roles hierarchy is poorly designed. > > A counter-example might help me to understand. > > * Why do you think that "unsafe_tests" is the appropriate name for the > directory that contains regression tests? > > I can spend more time on the review if the patch gets rebased. > > -- > Antonin Houska > Web: https://www.cybertec-postgresql.com > database-role-memberships-v9.patch Description: Binary data
Re: TAP output format in pg_regress
Hi, On 2022-06-29 21:50:45 +0200, Daniel Gustafsson wrote: > @@ -279,8 +648,7 @@ stop_postmaster(void) > r = system(buf); > if (r != 0) > { > - fprintf(stderr, _("\n%s: could not stop postmaster: > exit code was %d\n"), > - progname, r); > + pg_log_error("could not stop postmaster: exit code was > %d", r); > _exit(2); /* not exit(), that > could be recursive */ > } There's a lot of stuff like this. Perhaps worth doing separately? I'm not sure I understand where you used bail and where not. I assume it's mostly around use of _exit() vs exit()? > + test_status_ok(tests[i]); > > if (statuses[i] != 0) > log_child_failure(statuses[i]); > > INSTR_TIME_SUBTRACT(stoptimes[i], starttimes[i]); > - status(_(" %8.0f ms"), > INSTR_TIME_GET_MILLISEC(stoptimes[i])); > + runtime(tests[i], > INSTR_TIME_GET_MILLISEC(stoptimes[i])); Based on the discussion downthread, let's just always compute this and display it even in the tap format? Greetings, Andres Freund
Re: TAP output format in pg_regress
Hi, > Peter Eisentraut writes: > > I'm not sure what to make of all these options. I think providing a TAP > > output for pg_regress is a good idea. But then do we still need the old > > output? Is it worth maintaining two output formats that display exactly > > the same thing in slightly different ways? > > Probably is, because this is bad: > > ... The proposed default format now hides the > > fact that some tests are started in parallel. I remember the last time > > I wanted to tweak the output of the parallel tests, people were very > > attached to the particular timing and spacing of the current output. So > > I'm not sure people will like this. > > and so is this: > > > The timing output is very popular. Where is that in the TAP output? > > Both of those things are fairly critical for test development. You > need to know what else might be running in parallel with a test case, > and you need to know whether you just bloated the runtime unreasonably. That should be doable with tap as well - afaics the output of that could nearly be the same as now, preceded by a #. The test timing output could (and I think should) also be output - but if I read the tap specification correctly, we'd either need to make it part of the test "description" or on a separate line. On 2022-07-04 10:39:37 -0400, Tom Lane wrote: > More generally, I'm unhappy about the proposal that TAP should become > the default output. There is nothing particularly human-friendly > about it, whereas the existing format is something we have tuned to > our liking over literally decades. I don't mind if there's a way to > get TAP when you're actually intending to feed it into a TAP-parsing > tool, but I am not a TAP-parsing tool and I don't see why I should > have to put up with it. I'm mostly interested in the tap format because meson's testrunner can parse it - unsurprisingly it doesn't understand the current regress output. 
It's a lot nicer to immediately be pointed to the failed test(s) than having to scan through the output "manually". Greetings, Andres Freund
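TAP leaves room for both pieces of information discussed above: the parallel grouping can go in a `#` diagnostic line, and the per-test runtime can ride along in (or after) the test description. A purely illustrative sketch of what such output could look like, not the output of any posted patch:

```
# parallel group (3 tests): boolean char name
ok 1 - boolean                          55 ms
ok 2 - char                             41 ms
ok 3 - name                             38 ms
not ok 4 - varchar                      70 ms
1..4
```

A TAP consumer such as meson's test runner would ignore the `#` lines and the trailing timings while still pinpointing the failed test.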
Re: pgsql: Change timeline field of IDENTIFY_SYSTEM to int8
On 04.07.22 19:32, Tom Lane wrote: If the result of IDENTIFY_SYSTEM is always sent in text format, then I agree that this isn't very problematic. If there are any clients that fetch it in binary mode, though, this is absolutely a wire protocol break for them The result rows of the replication commands are always sent in text format.
Re: doc: BRIN indexes and autosummarize
What about this? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Java is clearly an example of money oriented programming" (A. Stepanov) diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index caf1ea4cef..0a715d41c7 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -73,31 +73,55 @@ summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. - This process can be invoked manually using the - brin_summarize_range(regclass, bigint) or - brin_summarize_new_values(regclass) functions; - automatically when VACUUM processes the table; - or by automatic summarization executed by autovacuum, as insertions - occur. (This last trigger is disabled by default and can be enabled - with the autosummarize parameter.) - Conversely, a range can be de-summarized using the - brin_desummarize_range(regclass, bigint) function, - which is useful when the index tuple is no longer a very good - representation because the existing values have changed. + + + + There are several triggers for initial summarization of a page range + to occur. If the table is vacuumed, either because + VACUUM has been manually invoked or because + autovacuum causes it, + all existing unsummarized page ranges are summarized. + Also, if the index has the + autosummarize parameter set to on, + then any run of autovacuum in the database will summarize all + unsummarized page ranges that have been completely filled recently, + regardless of whether the table is processed by autovacuum for other + reasons; see below. 
+ Lastly, the following functions can be used: + + + + brin_summarize_new_values(regclass) + summarizes all unsummarized ranges + + + brin_summarize_range(regclass, bigint) + summarizes one specific range, if it is unsummarized + + When autosummarization is enabled, each time a page range is filled a - request is sent to autovacuum for it to execute a targeted summarization - for that range, to be fulfilled at the end of the next worker run on the - same database. If the request queue is full, the request is not recorded - and a message is sent to the server log: + request is sent to autovacuum for it to execute a targeted + summarization for that range, to be fulfilled at the end of the next + autovacuum worker run on the same database. If the request queue is full, the + request is not recorded and a message is sent to the server log: LOG: request for BRIN range summarization for index "brin_wi_idx" page 128 was not recorded When this happens, the range will be summarized normally during the next regular vacuum of the table. + + + Conversely, a range can be de-summarized using the + brin_desummarize_range(regclass, bigint) function, + which is useful when the index tuple is no longer a very good + representation because the existing values have changed. + See for details. + + diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml index 9ffcdc629e..d3db03278d 100644 --- a/doc/src/sgml/ref/create_index.sgml +++ b/doc/src/sgml/ref/create_index.sgml @@ -580,6 +580,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] Defines whether a summarization run is invoked for the previous page range whenever an insertion is detected on the next one. + See for more details. + The default is off.
Re: TAP output format in pg_regress
On 04.07.22 16:39, Tom Lane wrote: Probably is, because this is bad: ... The proposed default format now hides the fact that some tests are started in parallel. I remember the last time I wanted to tweak the output of the parallel tests, people were very attached to the particular timing and spacing of the current output. So I'm not sure people will like this. and so is this: The timing output is very popular. Where is that in the TAP output? Both of those things are fairly critical for test development. You need to know what else might be running in parallel with a test case, and you need to know whether you just bloated the runtime unreasonably. I don't think there is a reason these couldn't be shown in TAP output as well. Even if we keep the two output formats in parallel, it would be good if they showed the same set of information.
Re: Temporary file access API
On 04.07.22 18:35, Antonin Houska wrote: Attached is a new version of the patch, to evaluate what the API use in the backend could look like. I haven't touched places where the file is accessed in a non-trivial way, e.g. lseek() / fseek() or pg_pwrite() / pg_pread() is called. Rebased patch set is attached here, which applies to the current master. (A few more opportunities for the new API implemented here.) I don't understand what this patch set is supposed to do. AFAICT, the thread originally forked off from a TDE discussion and, considering the thread subject, was possibly once an attempt to refactor temporary file access to make integrating encryption easier? The discussion then talked about things like saving on system calls and excising all OS-level file access API use, which I found confusing, and the thread then just became a general TDE-related mini-discussion. The patches at hand extend some internal file access APIs and then sprinkle them around various places in the backend code, but there is no explanation why or how this is better. I don't see any real structural savings one might expect from a refactoring patch. No information has been presented how this might help encryption. I also suspect that changing around the use of various file access APIs needs to be carefully evaluated in each individual case. Various code has subtle expectations about atomic write behavior, syncing, flushing, error recovery, etc. I don't know if this has been considered here.
Re: Handle infinite recursion in logical replication setup
On Mon, Jul 4, 2022 at 7:44 PM Alvaro Herrera wrote: > > > From d8f8844f877806527b6f3f45320b6ba55a8e3154 Mon Sep 17 00:00:00 2001 > > From: Vigneshwaran C > > Date: Thu, 26 May 2022 19:29:33 +0530 > > Subject: [PATCH v27 1/4] Add a missing test to verify only-local parameter > > in > > test_decoding plugin. > > > > Add a missing test to verify only-local parameter in test_decoding plugin. > > I don't get it. replorigin.sql already has some lines to test > local-only. What is your patch adding that is new? Maybe instead of > adding some more lines at the end of the script, you should add lines > where this stuff is already being tested. But that assumes that there > is something new that is being tested; if so what is it? The test is to check that remote origin data (i.e. replication origin being set) will be filtered when only-local parameter is set. I felt that this scenario is not covered, unless I'm missing something. I have moved the test as suggested. I have changed this in the v28 patch attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm3MNK2hMYroTiHGS9HkSxiA-az1QC1mpa0YwDZ8nGmmZg%40mail.gmail.com Regards, Vignesh
Re: Handle infinite recursion in logical replication setup
On Mon, Jul 4, 2022 at 9:23 AM Amit Kapila wrote: > > On Mon, Jul 4, 2022 at 3:59 AM Peter Smith wrote: > > > > On Mon, Jul 4, 2022 at 12:59 AM vignesh C wrote: > > ... > > > > 2. > > > > /* ALTER SUBSCRIPTION SET ( */ > > > > else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && > > > > TailMatches("SET", "(")) > > > > - COMPLETE_WITH("binary", "slot_name", "streaming", > > > > "synchronous_commit", "disable_on_error"); > > > > + COMPLETE_WITH("binary", "origin", "slot_name", "streaming", > > > > "synchronous_commit", "disable_on_error"); > > > > /* ALTER SUBSCRIPTION SKIP ( */ > > > > else if (HeadMatches("ALTER", "SUBSCRIPTION", MatchAny) && > > > > TailMatches("SKIP", "(")) > > > > COMPLETE_WITH("lsn"); > > > > @@ -3152,7 +3152,7 @@ psql_completion(const char *text, int start, int > > > > end) > > > > /* Complete "CREATE SUBSCRIPTION ... WITH ( " */ > > > > else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", > > > > "(")) > > > > COMPLETE_WITH("binary", "connect", "copy_data", "create_slot", > > > > - "enabled", "slot_name", "streaming", > > > > + "enabled", "origin", "slot_name", "streaming", > > > > "synchronous_commit", "two_phase", "disable_on_error"); > > > > > > > > Why do you choose to add a new option in-between other parameters > > > > instead of at the end which we normally do? The one possible reason I > > > > can think of is that all the parameters at the end are boolean so you > > > > want to add this before those but then why before slot_name, and again > > > > I don't see such a rule being followed for other parameters. > > > > > > I was not sure if it should be maintained in alphabetical order, > > > anyway since the last option "disable_on_error" is at the end, I have > > > changed it to the end. > > > > > > > Although it seems it is not a hard rule, mostly the COMPLETE_WITH are > > coded using alphabetical order. 
Anyway, I think that was a clear > > intention here too since 13 of 14 parameters were already in > > alphabetical order; it is actually only that "disable_on_error" > > parameter that was misplaced; not the new "origin" parameter. > > > > Agreed, but let's not change disable_on_error as part of this patch. I have changed this in the v28 patch attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm3MNK2hMYroTiHGS9HkSxiA-az1QC1mpa0YwDZ8nGmmZg%40mail.gmail.com Regards, Vignesh
Re: Handle infinite recursion in logical replication setup
On Mon, Jul 4, 2022 at 1:46 PM shiy.f...@fujitsu.com wrote: > > On Sun, Jul 3, 2022 11:00 PM vignesh C wrote: > > > > Thanks for the comments, the attached v27 patch has the changes for the > > same. > > > > Thanks for updating the patch. > > A comment on 0003 patch: > + /* > +* No need to throw an error for the tables that are in ready > state, > +* as the walsender will send the changes from WAL in case of > tables > +* in ready state. > +*/ > + if (isreadytable) > + continue; > + > ... > + ereport(ERROR, > + > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("table: \"%s.%s\" might have > replicated data in the publisher", > + nspname, relname), > + errdetail("CREATE/ALTER SUBSCRIPTION with > origin = local and copy_data = on is not allowed when the publisher might > have replicated data."), > + errhint("Use CREATE/ALTER SUBSCRIPTION with > copy_data = off/force.")); > + > + ExecClearTuple(slot); > > I think we should call ExecClearTuple() before getting next tuple, so it > should > be called if the table is in ready state. How about modifying it to: > if (!isreadytable) > ereport(ERROR, > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("table: \"%s.%s\" might have > replicated data in the publisher", >nspname, relname), > errdetail("CREATE/ALTER SUBSCRIPTION > with origin = local and copy_data = on is not allowed when the publisher > might have replicated data."), > errhint("Use CREATE/ALTER > SUBSCRIPTION with copy_data = off/force.")); > > ExecClearTuple(slot); Thanks for the comment, I have modified this in the v28 patch attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm3MNK2hMYroTiHGS9HkSxiA-az1QC1mpa0YwDZ8nGmmZg%40mail.gmail.com Regards, Vignesh
Re: make install-world fails sometimes in Mac M1
Gaddam Sai Ram writes:
> We are using a script to install Postgres from source; the script works
> fine on Ubuntu and Mac (Intel) but mostly fails (works sometimes) on Mac M1.

We have developers (including me) and buildfarm machines using M1 Macs,
and nobody else is reporting any big problem with them, so I don't believe
that this is specifically due to that.

> make[2]: *** [install] Killed: 9

kill -9 is not something that would happen internally to the install
process. My guess is that that is interference from some external agent.
Perhaps you have some resource-consumption-limiting daemon installed on
that machine, and it's deciding that the command ran too long?

			regards, tom lane
Re: logical replication restrictions
On Wed, Mar 23, 2022, at 6:19 PM, Euler Taveira wrote:
> On Mon, Mar 21, 2022, at 10:09 PM, Euler Taveira wrote:
>> On Mon, Mar 21, 2022, at 10:04 PM, Andres Freund wrote:
>>> On 2022-03-20 21:40:40 -0300, Euler Taveira wrote:
>>> > On Mon, Feb 28, 2022, at 9:18 PM, Euler Taveira wrote:
>>> > > Long time, no patch. Here it is. I will provide documentation in the
>>> > > next version. I would appreciate some feedback.
>>> > This patch is broken since commit
>>> > 705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33. I rebased it.
>>>
>>> This fails tests, specifically it seems psql crashes:
>>> https://cirrus-ci.com/task/6592281292570624?logs=cores#L46
>> Yeah. I forgot to test this patch with cassert before sending it. :( I didn't
>> send a new patch because there is another issue (with int128) that I'm
>> currently reworking. I'll send another patch soon.
> Here is another version after rebasing it. In this version I fixed the psql
> issue and rewrote the interval_to_ms function.

From the previous version, I added support for streamed transactions. For
streamed transactions, the delay is applied during the STREAM COMMIT message.
That's ok if we add the delay before applying the spooled messages. Hence, we
guarantee that the delay is applied *before* each transaction. The same logic
is applied to prepared transactions: the delay is introduced before applying
the spooled messages in the STREAM PREPARE message. Tests were refactored a
bit, and a test for streamed transactions was included too. Version 4 is
attached.

--
Euler Taveira
EDB   https://www.enterprisedb.com/

From 7dd7a3523ed8e7a3494e7ec25ddc0af8ed4cf4d3 Mon Sep 17 00:00:00 2001
From: Euler Taveira
Date: Sat, 6 Nov 2021 11:31:10 -0300
Subject: [PATCH v4] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (especially to fix
errors that might cause data loss).
If the subscriber sets the min_apply_delay parameter, the logical
replication worker will delay the transaction commit for min_apply_delay
milliseconds. The delay is calculated between the WAL timestamp and the
current time on the subscriber. The delay occurs only on WAL records for
transaction begins. Regular and prepared transactions are covered.
Streamed transactions are also covered.

Author: Euler Taveira
Discussion: https://postgr.es/m/CAB-JLwYOYwL=xtyaxkih5ctm_vm8kjkh7aaitckvmch4rzr...@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                 |   9 ++
 doc/src/sgml/ref/alter_subscription.sgml   |   5 +-
 doc/src/sgml/ref/create_subscription.sgml  |  31 -
 src/backend/catalog/pg_subscription.c      |   1 +
 src/backend/catalog/system_views.sql       |   2 +-
 src/backend/commands/subscriptioncmds.c    |  46 ++-
 src/backend/replication/logical/worker.c   |  82
 src/backend/utils/adt/timestamp.c          |  32 +
 src/bin/pg_dump/pg_dump.c                  |  16 ++-
 src/bin/pg_dump/pg_dump.h                  |   1 +
 src/bin/psql/describe.c                    |   8 +-
 src/bin/psql/tab-complete.c                |   7 +-
 src/include/catalog/pg_subscription.h      |   3 +
 src/include/datatype/timestamp.h           |   2 +
 src/include/utils/timestamp.h              |   2 +
 src/test/regress/expected/subscription.out | 149 -
 src/test/regress/sql/subscription.sql      |  20 +++
 src/test/subscription/t/032_apply_delay.pl | 110 +++
 18 files changed, 455 insertions(+), 71 deletions(-)
 create mode 100644 src/test/subscription/t/032_apply_delay.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 25b02c4e37..9b94b7aef2 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7833,6 +7833,15 @@ SCRAM-SHA-256$iteration count:
+
+
+       subapplydelay int8
+
+
+
+       Delay the application of changes by a specified amount of time.
+
+
+
       subname name

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 353ea5def2..ae9d625f9d 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -207,8 +207,9 @@ ALTER SUBSCRIPTION name RENAME TO <
       information.  The parameters that can be altered are
       slot_name, synchronous_commit,
-      binary, streaming, and
-      disable_on_error.
+      binary, streaming,
+      disable_on_error, and
+      min_apply_delay.

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 34b3264b26..ae80db8a3d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -302,7 +302,36 @@ CREATE SUBSCRIPTION subscription_name
-
+
+
+       min_apply_delay (integer)
+
+
+       By default, subscriber applies changes as soon
Re: pgsql: Change timeline field of IDENTIFY_SYSTEM to int8
Peter Eisentraut writes:
> On 04.07.22 07:55, Tom Lane wrote:
>> But what about whatever code is reading the output? And what if
>> that code isn't v16? I can't believe that we can make a wire
>> protocol change as summarily as this.

> I think a client will either just read the string value and convert it
> to some numeric type without checking what type was actually sent, or if
> the client API is type-aware and automatically converts to a native type
> of some sort, then it will probably already support 64-bit ints. Do you
> see some problem scenario?

If the result of IDENTIFY_SYSTEM is always sent in text format, then I
agree that this isn't very problematic. If there are any clients that
fetch it in binary mode, though, this is absolutely a wire protocol break
for them ... and no, I don't believe an unsupported claim that they'd
adapt automatically.

> I'm seeing a bigger problem now, which is that our client code doesn't
> parse bigger-than-int32 timeline IDs correctly.

Yup.

			regards, tom lane
make install-world fails sometimes in Mac M1
Hi team,

We are using a script to install Postgres from source; the script works fine
on Ubuntu and Mac (Intel) but mostly fails (works sometimes) on Mac M1.
Configure and "make world" work fine, but it fails during "make install-world":

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C ./src/backend generated-headers
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C catalog distprep generated-header-symlinks
make[2]: Nothing to be done for `distprep'.
make[2]: Nothing to be done for `generated-header-symlinks'.
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C utils distprep generated-header-symlinks
make[2]: Nothing to be done for `distprep'.
make[2]: Nothing to be done for `generated-header-symlinks'.
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C doc install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C src install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C sgml install
/bin/sh ../../../config/install-sh -c -d '/Users/sairam/work/postgresql-11.14/share/doc/'/html '/Users/sairam/work/postgresql-11.14/share/man'/man1 '/Users/sairam/work/postgresql-11.14/share/man'/man3 '/Users/sairam/work/postgresql-11.14/share/man'/man7
cp -R `for f in ./html; do test -r $f && echo $f && break; done` '/Users/sairam/work/postgresql-11.14/share/doc/'
cp -R `for f in ./man1; do test -r $f && echo $f && break; done` `for f in ./man3; do test -r $f && echo $f && break; done` `for f in ./man7; do test -r $f && echo $f && break; done` '/Users/sairam/work/postgresql-11.14/share/man'
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C src install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C common install
/bin/sh ../../config/install-sh -c -d '/Users/sairam/work/postgresql-11.14/lib'
/usr/bin/install -c -m 644 libpgcommon.a '/Users/sairam/work/postgresql-11.14/lib/libpgcommon.a'
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C port install
/bin/sh ../../config/install-sh -c -d '/Users/sairam/work/postgresql-11.14/lib'
/usr/bin/install -c -m 644 libpgport.a '/Users/sairam/work/postgresql-11.14/lib/libpgport.a'
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C timezone install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C ../../src/port all
make[3]: Nothing to be done for `all'.
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C ../../src/common all
make[3]: Nothing to be done for `all'.
/bin/sh ../../config/install-sh -c -d '/Users/sairam/work/postgresql-11.14/share'
./zic -d '/Users/sairam/work/postgresql-11.14/share/timezone' -p 'US/Eastern' -b fat ./data/tzdata.zi
make[2]: *** [install] Killed: 9
make[1]: *** [install-timezone-recurse] Error 2
make: *** [install-world-src-recurse] Error 2

It's not like it fails every time; sometimes the same script works just fine,
and sometimes it works after a few retries. I have also noticed that even when
it works without issue, the generated binaries don't work immediately: when I
try to query pg_config, the command waits for some time and gets killed. After
a few retries, it works from there onwards. I'm wondering what the issue could
be. I'm attaching the script; kindly go through it and help me understand the
issue.

Postgres Version: 11.14
Machine details: MacBook Pro (13-inch, M1, 2020), Version 12.4

Regards
G. Sai Ram

Attachment: install_pg.sh
Re: automatically generating node support functions
Peter Eisentraut writes:
> [ v6-0001-Automatically-generate-node-support-functions.patch ]

I've now spent some time looking at this fairly carefully, and I think
this is a direction we can pursue, but I'm not yet happy about the amount
of magic knowledge that's embedded in the gen_node_support.pl script
rather than being encoded in pg_node_attr markers.  Once this is in
place, people will stop thinking about the nodes/*funcs.c infrastructure
altogether when they write patches, at least until they get badly burned
by it; so I don't want there to be big gotchas.  As an example, heaven
help the future hacker who decides to change the contents of A_Const and
doesn't realize that that still has a manually-implemented copyfuncs.c
routine.

So rather than embedding knowledge in gen_node_support.pl like this:

	my @custom_copy = qw(A_Const Const ExtensibleNode);

I think we ought to put it into the *nodes.h headers as much as possible,
perhaps like this:

	typedef struct A_Const pg_node_attr(custom_copy)
	{
	...

I will grant that there are some things that are okay to embed in
gen_node_support.pl, such as the list of @scalar_types, because if you
need to add an entry there you will find it out when the script complains
it doesn't know how to process a field.  So there is some judgment
involved here, but on the whole I want to err on the side of exposing
decisions in the headers.

So I propose that we handle these things via struct-level pg_node_attr
markers, rather than node-type lists embedded in the script:

	abstract_types
	no_copy
	no_read_write
	no_read
	custom_copy
	custom_readwrite

(The markings that "we are not publishing right now to stay level with
the manual system" are fine to apply in the script, since that's probably
a temporary thing anyway.  Also, I don't have a problem with applying
no_copy etc to the contents of whole files in the script, rather than
tediously labeling each struct in such files.)

The hacks for scalar-copying EquivalenceClass*, EquivalenceMember*,
struct CustomPathMethods*, and CustomScan.methods should be replaced with
"pg_node_attr(copy_as_scalar)" labels on affected fields.

I wonder whether this:

	# We do not support copying Path trees, mainly
	# because the circular linkages between RelOptInfo
	# and Path nodes can't be handled easily in a
	# simple depth-first traversal.

couldn't be done better by inventing an inheritable no_copy attr to
attach to the Path supertype.  Or maybe it'd be okay to just
automatically inherit the no_xxx properties from the supertype?

I don't terribly like the ad-hoc mechanism for not comparing
CoercionForm fields.  OTOH, I am not sure whether replacing it with
per-field equal_ignore attrs would be better; there's at least an
argument that that invites bugs of omission.  But implementing this with
an uncommented test deep inside a script that most hackers should not
need to read is not good.  On the whole I'd lean towards the
equal_ignore route.

I'm confused by the "various field types to ignore" at the end of the
outfuncs/readfuncs code.  Do we really ignore those now?  How could that
be safe?  If it is safe, wouldn't it be better to handle that with
per-field pg_node_attrs?  Silently doing what might be the wrong thing
doesn't seem good.

In the department of nitpicks:

* copyfuncs.switch.c and equalfuncs.switch.c are missing trailing
newlines.

* pgindent is not very happy with a lot of your comments in *nodes.h.

* I think we should add explicit dependencies in backend/nodes/Makefile,
along the lines of

	copyfuncs.o: copyfuncs.c copyfuncs.funcs.c copyfuncs.switch.c

Otherwise the whole thing is a big gotcha for anyone not using
--enable-depend.

I don't know if you have time right now to push forward with these
points, but if you don't I can take a stab at it.  I would like to see
this done and committed PDQ, because 835d476fd already broke many patches
that touch *nodes.h and I'd like to get the rest of the fallout in place
before rebasing affected patches.

			regards, tom lane
Re: Temporary file access API
Antonin Houska wrote:
> Robert Haas wrote:
> > On Tue, Apr 12, 2022 at 5:30 AM Antonin Houska wrote:
> > > Robert Haas wrote:
> > > > On Mon, Apr 11, 2022 at 4:05 AM Antonin Houska wrote:
> > > > > There aren't really that many kinds of files to encrypt:
> > > > >
> > > > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#List_of_the_files_that_contain_user_data
> > > > >
> > > > > (And pg_stat/* files should be removed from the list.)
> > > >
> > > > This kind of gets into some theoretical questions. Like, do we think
> > > > that it's an information leak if people can look at how many
> > > > transactions are committing and aborting in pg_xact_status? In theory
> > > > it could be, but I know it's been argued that that's too much of a
> > > > side channel. I'm not sure I believe that, but it's arguable.
> > >
> > > I was referring to the fact that the statistics are no longer stored in
> > > files:
> > >
> > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd
> >
> > Oh, yeah, I agree with that.
>
> I see now that the statistics are still saved to a file on server shutdown.
> I've updated the wiki page.
>
> Attached is a new version of the patch, to evaluate what the API use in the
> backend could look like. I haven't touched places where the file is accessed
> in a non-trivial way, e.g. where lseek() / fseek() or pg_pwrite() /
> pg_pread() is called.

Rebased patch set is attached here, which applies to the current master.
(A few more opportunities for the new API are implemented here.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment: temp_file_api_v3.tgz
Re: doc: BRIN indexes and autosummarize
On 2022-Jun-28, Roberto Mello wrote:

> Here's a patch to clarify the BRIN indexes documentation, particularly with
> regards to autosummarize, vacuum and autovacuum. It basically breaks
> down a big blob of a paragraph into multiple paragraphs for clarity,
> plus explicitly tells how summarization happens manually or
> automatically.

[Some of] these additions are wrong, actually. It says that autovacuum will
not summarize new entries; but it does. If you just let the table sit idle,
any autovacuum run that cleans the table will also summarize any ranges that
need summarization. What 'autosummarize=off' means is that the behavior to
trigger an immediate summarization of a range once it becomes full is not
the default. This is very different.

As for the new <para>s that you added, I'd say they're stylistically wrong.
Each paragraph is supposed to be one fully contained idea; what these tags
do is split each idea across several smaller paragraphs. This is likely
subjective though.

> On this topic... I'm not familiar with the internals of BRIN
> indexes, and in backend/access/common/reloptions.c I see:
>
> 	{
> 		"autosummarize",
> 		"Enables automatic summarization on this BRIN index",
> 		RELOPT_KIND_BRIN,
> 		AccessExclusiveLock
> 	},
>
> Is the exclusive lock on the index why autosummarize is off by default?

No. The lock level mentioned here is what needs to be taken in order to
change the value of this option.

> What would be the downside (if any) of having autosummarize=on by default?

I'm not aware of any. Maybe we should turn it on by default.

--
Álvaro Herrera		PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: TAP output format in pg_regress
Peter Eisentraut writes:
> I'm not sure what to make of all these options. I think providing a TAP
> output for pg_regress is a good idea. But then do we still need the old
> output? Is it worth maintaining two output formats that display exactly
> the same thing in slightly different ways?

Probably is, because this is bad:

> ... The proposed default format now hides the
> fact that some tests are started in parallel. I remember the last time
> I wanted to tweak the output of the parallel tests, people were very
> attached to the particular timing and spacing of the current output. So
> I'm not sure people will like this.

and so is this:

> The timing output is very popular. Where is that in the TAP output?

Both of those things are fairly critical for test development. You need to
know what else might be running in parallel with a test case, and you need
to know whether you just bloated the runtime unreasonably.

More generally, I'm unhappy about the proposal that TAP should become the
default output. There is nothing particularly human-friendly about it,
whereas the existing format is something we have tuned to our liking over
literally decades. I don't mind if there's a way to get TAP when you're
actually intending to feed it into a TAP-parsing tool, but I am not a
TAP-parsing tool and I don't see why I should have to put up with it.

			regards, tom lane
Re: TAP output format in pg_regress
On 29.06.22 21:50, Daniel Gustafsson wrote:
> Attached is a new version of this patch, which completes the TAP output
> format option such that all codepaths emitting output are TAP compliant.
> The verbose option is fixed to not output the extraneous newlines which the
> previous PoC did. The output is made to conform to the original TAP spec,
> since v13/v14 TAP parsers seem less common than those that can handle the
> original spec. Support for the new format additions should be quite simple
> to add should we want that. Running pg_regress --verbose should give the
> current format output.
>
> I did end up combining TAP and --verbose into a single patch, as the TAP
> format sort of depends on the verbose flag since TAP has no verbose mode.
> I can split it into two separate patches should a reviewer prefer that.

I'm not sure what to make of all these options. I think providing a TAP
output for pg_regress is a good idea. But then do we still need the old
output? Is it worth maintaining two output formats that display exactly
the same thing in slightly different ways?

What is the purpose of the --verbose option? When and how is one supposed
to activate that?

The proposed default format now hides the fact that some tests are started
in parallel. I remember the last time I wanted to tweak the output of the
parallel tests, people were very attached to the particular timing and
spacing of the current output. So I'm not sure people will like this.

The timing output is very popular. Where is that in the TAP output?

More generally, what do you envision we do with this feature? Who is it
for, what are the tradeoffs, etc.?
Re: Handle infinite recursion in logical replication setup
> From d8f8844f877806527b6f3f45320b6ba55a8e3154 Mon Sep 17 00:00:00 2001
> From: Vigneshwaran C
> Date: Thu, 26 May 2022 19:29:33 +0530
> Subject: [PATCH v27 1/4] Add a missing test to verify only-local parameter
>  in test_decoding plugin.
>
> Add a missing test to verify only-local parameter in test_decoding plugin.

I don't get it. replorigin.sql already has some lines to test local-only.
What is your patch adding that is new? Maybe instead of adding some more
lines at the end of the script, you should add lines where this stuff is
already being tested. But that assumes that there is something new that is
being tested; if so, what is it?

--
Álvaro Herrera		PostgreSQL Developer — https://www.EnterpriseDB.com/
"Research is what I'm doing when I don't know what I'm doing"
(Wernher von Braun)
Re: Removing unneeded self joins
On Mon, Jul 4, 2022 at 6:52 AM Ronan Dunklau wrote:
>
> Le jeudi 30 juin 2022, 16:11:51 CEST Andrey Lepikhov a écrit :
> > On 19/5/2022 16:47, Ronan Dunklau wrote:
> > > I'll take a look at that one.
> >
> > New version of the patch, rebased on current master:
> > 1. pgindent over the patch has passed.
> > 2. The number of changed files is reduced.
> > 3. Some documentation and comments are added.
>
> Hello Andrey,
>
> Thanks for the updates.
>
> The general approach seems sensible to me, so I'm going to focus on some
> details.
>
> In a very recent thread [1], Tom Lane is proposing to add infrastructure
> to make Vars aware of their nullability by outer joins. I wonder if that
> would help with avoiding the need for adding IS NOT NULL clauses when the
> column is known not null?
> If we have a precedent for adding a Bitmapset to the Var itself, maybe the
> whole discussion regarding keeping track of nullability can be extended to
> the original column nullability?
>
> Also, I saw it was mentioned earlier in the thread, but how difficult
> would it be to process the transformed quals through the EquivalenceClass
> machinery and the qual simplification?
> For example, if the target audience of this patch is ORMs, or inlined
> views, it wouldn't surprise me to see queries of this kind in the wild,
> which could be avoided altogether:
>
> postgres=# explain (costs off) select * from sj s1 join sj s2 on s1.a =
> s2.a where s1.b = 2 and s2.b = 3;
>               QUERY PLAN
> ----------------------------------------------------
>  Seq Scan on sj s2
>    Filter: ((a IS NOT NULL) AND (b = 3) AND (b = 2))
> (2 lignes)
>
> +	for (counter = 0; counter < list_length(*sources);)
> +	{
> +		ListCell   *cell = list_nth_cell(*sources, counter);
> +		RestrictInfo *rinfo = castNode(RestrictInfo, lfirst(cell));
> +		int			counter1;
> +
> ...
> +		ec->ec_members = list_delete_cell(ec->ec_members, cell);
>
> Why don't you use the foreach() and foreach_delete_current() macros for
> iterating and removing items in the lists, both in update_ec_members and
> update_ec_sources?
>
> +	if ((bms_is_member(k, info->syn_lefthand) ^
> +		 bms_is_member(r, info->syn_lefthand)) ||
> +		(bms_is_member(k, info->syn_righthand) ^
> +		 bms_is_member(r, info->syn_righthand)))
>
> I think this is more compact and easier to follow than the previous
> version, but I'm not sure how common it is in postgres source code to use
> that kind of construction?
>
> Some review about the comments:
>
> I see you keep using the terms "the keeping relation" and "the removing
> relation" in reference to the relation that is kept and the one that is
> removed.
> Aside from the grammar (the kept relation or the removed relation), maybe
> it would make it clearer to call them something else. In other parts of
> the code, you used "the remaining relation / the removed relation" which
> makes sense.
>
>  /*
>   * Remove the target relid from the planner's data structures, having
> - * determined that there is no need to include it in the query.
> + * determined that there is no need to include it in the query. Or replace
> + * with another relid.
> + * To reusability, this routine can work in two modes: delete relid from a plan
> + * or replace it. It is used in replace mode in a self-join removing process.
>
> This could be rephrased: ", optionally replacing it with another relid.
> The latter is used by the self-join removing process."
>
> [1] https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us
>
> --
> Ronan Dunklau

Hi,

bq. this is more compact and easier to follow than the previous version

A code comment can be added above the expression (involving XOR) to explain
the purpose of the expression.

Cheers
Re: [PATCH] Compression dictionaries for JSONB
Hi Matthias,

> > Although there is also a high-level idea (according to the
> > presentations) to share common data between different TOASTed values,
> > similarly to what compression dictionaries do, by looking at the
> > current feedback and considering the overall complexity and the amount
> > of open questions (e.g. interaction with different TableAMs, etc), I
> > seriously doubt that this particular part of "pluggable TOASTer" will
> > end up in the core.
>
> Yes, and that's why I think that this is where this dictionary
> infrastructure could provide value, as an alternative or extension to
> the proposed jsonb toaster in the 'pluggable toaster' thread.

OK, I see your point now, and I think this is a very good point. Basing
"compression dictionaries" on the API provided by "pluggable TOASTer" can
also be less hacky than what I'm currently doing with the `typmod` argument.
I'm going to switch the implementation at some point, unless anyone objects
to the idea.

--
Best regards,
Aleksander Alekseev
Re: pgsql: Change timeline field of IDENTIFY_SYSTEM to int8
On 04.07.22 07:55, Tom Lane wrote:
> Peter Eisentraut writes:
>> Change timeline field of IDENTIFY_SYSTEM to int8
>
> Surely this patch is far from complete? To start with, just a few lines
> down in IdentifySystem() the column is filled using Int32GetDatum not
> Int64GetDatum. I will get some popcorn and await the opinions of the
> 32-bit buildfarm animals.
>
> But what about whatever code is reading the output? And what if that
> code isn't v16? I can't believe that we can make a wire protocol change
> as summarily as this.

I think a client will either just read the string value and convert it to
some numeric type without checking what type was actually sent, or if the
client API is type-aware and automatically converts to a native type of
some sort, then it will probably already support 64-bit ints. Do you see
some problem scenario?

I'm seeing a bigger problem now, which is that our client code doesn't
parse bigger-than-int32 timeline IDs correctly. libpqwalreceiver uses
pg_strtoint32(), which will error on overflow. pg_basebackup uses atoi(),
so it will just truncate the value, except for READ_REPLICATION_SLOT,
where it uses atol(), so it will do the wrong thing on Windows only.

There is clearly very little use for such near-overflow timeline IDs in
practice. But it still seems pretty inconsistent.
Re: Removing unneeded self joins
Le jeudi 30 juin 2022, 16:11:51 CEST Andrey Lepikhov a écrit :
> On 19/5/2022 16:47, Ronan Dunklau wrote:
> > I'll take a look at that one.
>
> New version of the patch, rebased on current master:
> 1. pgindent over the patch has passed.
> 2. The number of changed files is reduced.
> 3. Some documentation and comments are added.

Hello Andrey,

Thanks for the updates.

The general approach seems sensible to me, so I'm going to focus on some
details.

In a very recent thread [1], Tom Lane is proposing to add infrastructure to
make Vars aware of their nullability by outer joins. I wonder if that would
help with avoiding the need for adding IS NOT NULL clauses when the column
is known not null?

If we have a precedent for adding a Bitmapset to the Var itself, maybe the
whole discussion regarding keeping track of nullability can be extended to
the original column nullability?

Also, I saw it was mentioned earlier in the thread, but how difficult would
it be to process the transformed quals through the EquivalenceClass
machinery and the qual simplification?

For example, if the target audience of this patch is ORMs, or inlined views,
it wouldn't surprise me to see queries of this kind in the wild, which could
be avoided altogether:

postgres=# explain (costs off) select * from sj s1 join sj s2 on s1.a = s2.a
where s1.b = 2 and s2.b = 3;
              QUERY PLAN
----------------------------------------------------
 Seq Scan on sj s2
   Filter: ((a IS NOT NULL) AND (b = 3) AND (b = 2))
(2 lignes)

+	for (counter = 0; counter < list_length(*sources);)
+	{
+		ListCell   *cell = list_nth_cell(*sources, counter);
+		RestrictInfo *rinfo = castNode(RestrictInfo, lfirst(cell));
+		int			counter1;
+
...
+		ec->ec_members = list_delete_cell(ec->ec_members, cell);

Why don't you use the foreach() and foreach_delete_current() macros for
iterating and removing items in the lists, both in update_ec_members and
update_ec_sources?

+	if ((bms_is_member(k, info->syn_lefthand) ^
+		 bms_is_member(r, info->syn_lefthand)) ||
+		(bms_is_member(k, info->syn_righthand) ^
+		 bms_is_member(r, info->syn_righthand)))

I think this is more compact and easier to follow than the previous version,
but I'm not sure how common it is in postgres source code to use that kind
of construction?

Some review about the comments:

I see you keep using the terms "the keeping relation" and "the removing
relation" in reference to the relation that is kept and the one that is
removed. Aside from the grammar (the kept relation or the removed relation),
maybe it would make it clearer to call them something else. In other parts
of the code, you used "the remaining relation / the removed relation" which
makes sense.

 /*
  * Remove the target relid from the planner's data structures, having
- * determined that there is no need to include it in the query.
+ * determined that there is no need to include it in the query. Or replace
+ * with another relid.
+ * To reusability, this routine can work in two modes: delete relid from a plan
+ * or replace it. It is used in replace mode in a self-join removing process.

This could be rephrased: ", optionally replacing it with another relid. The
latter is used by the self-join removing process."

[1] https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us

--
Ronan Dunklau
Re: list of TransactionIds
On 2022-May-16, Amit Kapila wrote:
> On Sun, May 15, 2022 at 5:05 PM Alvaro Herrera wrote:
> > I hesitate to add this the day just before beta. This is already in
> > pg14, so maybe it's not a big deal if pg15 remains the same for the
> > time being. Or we can change it for beta2. Or we could just punt until
> > pg16. Any preferences?
>
> I prefer to do this for pg16 unless we see some bug due to this.

Pushed now, to master only.

--
Álvaro Herrera		Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Minimal logical decoding on standbys
On Mon, Jul 4, 2022 at 6:12 PM Drouvot, Bertrand wrote:
> Hi,
>
> On 7/1/22 10:03 PM, Ibrar Ahmed wrote:
> > On Thu, Jun 30, 2022 at 1:49 PM Drouvot, Bertrand wrote:
> > > I'm going to re-create a CF entry for it, as:
> > >
> > > - It seems there is a clear interest for the feature (given the time
> > >   already spent on it and the number of people that worked on it)
> > >
> > > - I've in mind to resume working on it
> >
> > I have already done some research on that, I can definitely look at it.
>
> Thanks!
>
> This feature proposal is currently made of 5 sub-patches:
>
> 0001: Add info in WAL records in preparation for logical slot conflict handling
> 0002: Handle logical slot conflicts on standby
> 0003: Allow logical decoding on standby
> 0004: New TAP test for logical decoding on standby
> 0005: Doc changes describing details about logical decoding
>
> I suggest that we focus on one sub-patch at a time.
>
> I'll start with 0001 and come back with a rebase addressing Andres and
> Robert's previous comments.
>
> Sounds good to you?
>
> Thanks
>
> --
> Bertrand Drouvot
> Amazon Web Services: https://aws.amazon.com

That's great. I am looking at "0002: Handle logical slot conflicts on
standby".

--
Ibrar Ahmed
Re: Patch proposal: New hooks in the connection path
On Mon, Jul 4, 2022 at 5:54 AM Drouvot, Bertrand wrote: > Hi, > > On 7/2/22 1:00 AM, Nathan Bossart wrote: > > Could we model this after fmgr_hook? The first argument in that hook > > indicates where it is being called from. This doesn't alleviate the need > > for several calls to the hook in the authentication logic, but extension > > authors would only need to define one hook. > > I like the idea and indeed fmgr.h looks like a good place to model it. > > Attached a new patch version doing so. > > Thanks > > -- > > Bertrand Drouvot > Amazon Web Services: https://aws.amazon.com Hi, + FCET_SPT, /* startup packet timeout */ + FCET_BSP, /* bad startup packet */ Looking at existing enum types, such as FmgrHookEventType, the part after the underscore is a word. I think it would be good to follow existing practice and make the enums more readable. Cheers
Re: [PATCH] Compression dictionaries for JSONB
Hi Alexander, On Fri, 17 Jun 2022 at 17:04, Aleksander Alekseev wrote: >> These are just my initial thoughts I would like to share though. I may >> change my mind after diving deeper into a "pluggable TOASTer" patch. > > I familiarized myself with the "pluggable TOASTer" thread and joined > the discussion [1]. > > I'm afraid so far I failed to understand your suggestion to base > "compression dictionaries" patch on "pluggable TOASTer", considering > the fair amount of push-back it got from the community, not to mention > a somewhat raw state of the patchset. It's true that Teodor and I are > trying to address similar problems. This however doesn't mean that > there should be a dependency between these patches. The reason I think this is better implemented as a pluggable toaster is because casts are necessarily opaque and require O(sizeofdata) copies or processing. The toaster infrastructure that is proposed in [0] seems to improve on the O(sizeofdata) requirement for toast, but that will not work with casts. > Also, I completely agree with Tomas [2]: > >> My main point is that we should not be making too many radical >> changes at once - it makes it much harder to actually get anything done. > > IMO the patches don't depend on each other but rather complement each > other. The user can switch between different TOAST methods, and the > compression dictionaries can work on top of different TOAST methods. I don't think that is possible (or at least, not as performant). To treat type X' as type X and use it as a stored medium instead, you must have either the whole binary representation of X, or have access to the internals of type X. 
I find it difficult to believe that casts can be done without a full detoast (or otherwise without deep knowledge about internal structure of the data type such as 'type A is binary compatible with type X'), and as such I think this feature 'compression dictionaries' is competing with the 'pluggable toaster' feature, if the one is used on top of the other. That is, the dictionary is still created like in the proposed patches (though preferably without the 64-byte NAMELEN limit), but the usage will be through "TOASTER my_dict_enabled_toaster". Additionally, I don't think we've ever accepted two different implementations of the same concept, at least not without first having good arguments why both competing implementations have obvious benefits over the other, and both implementations being incompatible. > Although there is also a high-level idea (according to the > presentations) to share common data between different TOASTed values, > similarly to what compression dictionaries do, by looking at the > current feedback and considering the overall complexity and the amount > of open questions (e.g. interaction with different TableAMs, etc), I > seriously doubt that this particular part of "pluggable TOASTer" will > end-up in the core. Yes, and that's why I think that this where this dictionary infrastructure could provide value, as an alternative or extension to the proposed jsonb toaster in the 'pluggable toaster' thread. Kind regards, Matthias van de Meent
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada wrote: > > I've attached three POC patches: > I think it will be a good idea if you can add a short commit message at least to say which patch is proposed for HEAD and which one is for back branches. Also, it would be good if you can add some description of the fix in the commit message. Let's remove poc* from the patch name.

Review of poc_add_running_catchanges_xacts_to_serialized_snapshot:

1.
+ /*
+ * Array of transactions that were running when the snapshot serialization
+ * and changed system catalogs,

The part of the sentence after serialization is not very clear.

2.
- if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+ if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid) ||
+ bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
+ sizeof(TransactionId), xidComparator) != NULL)

Why are you using xid instead of subxid in the bsearch call? Can we add a comment to say why it is okay to use xid if there is a valid reason? But note, we are using subxid to add to the committed xact array so not sure if this is a good idea but I might be missing something.

Suggestions for improvement in comments:

- /*
-* Update the transactions that are running and changes catalogs that are
-* not committed.
-*/
+ /* Update the catalog modifying transactions that are yet not committed. */
 if (builder->catchanges.xip)
 pfree(builder->catchanges.xip);
 builder->catchanges.xip = ReorderBufferGetCatalogChangesXacts(builder->reorder,
@@ -1647,7 +1644,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
 ondisk_c += sz;
- /* copy catalog-changes xacts */
+ /* copy catalog modifying xacts */
 sz = sizeof(TransactionId) * builder->catchanges.xcnt;
 memcpy(ondisk_c, builder->catchanges.xip, sz);
 COMP_CRC32C(ondisk->checksum, ondisk_c, sz);

-- With Regards, Amit Kapila.
Re: Re-order "disable_on_error" in tab-complete COMPLETE_WITH
On Mon, Jul 4, 2022, at 5:37 AM, Amit Kapila wrote: > Yeah, it seems we have overlooked this point. I think we can do this > just for HEAD but as the feature is introduced in PG-15 so there is no > harm in pushing it to PG-15 as well especially because it is a > straightforward change. What do you or others think? No objection. It is a good thing for future backpatches. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: automatically generating node support functions
On 03.07.22 21:14, Tom Lane wrote: Peter Eisentraut writes: Here is a patch that reformats the relevant (and a few more) comments that way. This has been run through pgindent, so the formatting should be stable. Now that that's been pushed, the main patch is of course quite broken. Are you working on a rebase? attached * AFAICT, the infrastructure for removing the generated files at "make *clean" is incomplete. I have fixed all the makefiles per your suggestions. and something similar to this for the "clean" rule: # fmgroids.h, fmgrprotos.h, fmgrtab.c, fmgr-stamp, and errcodes.h are in the # distribution tarball, so they are not cleaned here. Except this one, since there is no clean rule. I think seeing that files are listed under a maintainer-clean target conveys that same message. Also, I share David's upthread allergy to the option names "path_hackN" and to documenting those only inside the conversion script. I'll look into that again. BTW, I think this: "Unknown attributes are ignored" is a seriously bad idea; it will allow typos to escape detection. good pointFrom 0ad86183662b6637c9ec9a29374fecb4d11202e2 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 4 Jul 2022 14:17:02 +0200 Subject: [PATCH v6] Automatically generate node support functions Add a script to automatically generate the node support functions (copy, equal, out, and read, as well as the node tags enum) from the struct definitions. For each of the four node support files, it creates two include files, e.g., copyfuncs.funcs.c and copyfuncs.switch.c, to include in the main file. All the scaffolding of the main file stays in place. TODO: In this patch, I have only ifdef'ed out the code to could be removed, mainly so that it won't constantly have merge conflicts. Eventually, that should all be changed to delete the code. All the code comments that are worth keeping from those sections have already been moved to the header files where the structs are defined. 
I have tried to mostly make the coverage of the output match what is currently there. For example, one could now do out/read coverage of utility statement nodes, but I have manually excluded those for now. The reason is mainly that it's easier to diff the before and after, and adding a bunch of stuff like this might require a separate analysis and review. Subtyping (TidScan -> Scan) is supported. For the hard cases, you can just write a manual function and exclude generating one. For the not so hard cases, there is a way of annotating struct fields to get special behaviors. For example, pg_node_attr(equal_ignore) has the field ignored in equal functions. Discussion: https://www.postgresql.org/message-id/flat/c1097590-a6a4-486a-64b1-e1f9cc0533ce%40enterprisedb.com --- src/backend/Makefile | 10 +- src/backend/nodes/.gitignore | 4 + src/backend/nodes/Makefile| 54 ++ src/backend/nodes/copyfuncs.c | 19 +- src/backend/nodes/equalfuncs.c| 22 +- src/backend/nodes/gen_node_support.pl | 729 ++ src/backend/nodes/outfuncs.c | 34 +- src/backend/nodes/readfuncs.c | 23 +- src/include/Makefile | 1 + src/include/nodes/.gitignore | 2 + src/include/nodes/nodes.h | 27 + src/include/nodes/parsenodes.h| 4 +- src/include/nodes/pathnodes.h | 175 --- src/include/nodes/plannodes.h | 90 ++-- src/include/nodes/primnodes.h | 34 +- src/include/utils/rel.h | 6 +- src/tools/msvc/Solution.pm| 46 ++ 17 files changed, 1124 insertions(+), 156 deletions(-) create mode 100644 src/backend/nodes/.gitignore create mode 100644 src/backend/nodes/gen_node_support.pl create mode 100644 src/include/nodes/.gitignore diff --git a/src/backend/Makefile b/src/backend/Makefile index 4a02006788..953c80db5a 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -143,11 +143,15 @@ storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lw submake-catalog-headers: $(MAKE) -C catalog distprep generated-header-symlinks +# run this unconditionally to avoid needing to know its dependencies 
here: +submake-nodes-headers: + $(MAKE) -C nodes distprep generated-header-symlinks + # run this unconditionally to avoid needing to know its dependencies here: submake-utils-headers: $(MAKE) -C utils distprep generated-header-symlinks -.PHONY: submake-catalog-headers submake-utils-headers +.PHONY: submake-catalog-headers submake-nodes-headers submake-utils-headers # Make symlinks for these headers in the include directory. That way # we can cut down on the -I options. Also, a symlink is automatically @@ -162,7 +166,7 @@ submake-utils-headers: .PHONY: generated-headers -generated-headers: $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/storage/lwlocknames.h submake-catalog-headers
Re: generate_series for timestamptz and time zone problem
Przemysław Sztoch wrote on 01.07.2022 15:43: Gurjeet Singh wrote on 01.07.2022 06:35: On Tue, Jun 21, 2022 at 7:56 AM Przemysław Sztoch wrote: Please give me feedback on how to properly store the timezone name in the function context structure. I can't finish my work without it. The way I see it, I don't think you need to store the tz-name in the function context structure, like you're currently doing. I think you can remove the additional member from the generate_series_timestamptz_fctx struct, and refactor your code in generate_series_timestamptz() to work without it; you seem to be using the tzname member almost as a boolean flag, because the actual value you pass to DFCall3() can be calculated without first storing anything in the struct. Do I understand correctly that functions that return SET are executed multiple times? Is access to arguments available all the time? I thought PG_GETARG_ could only be used when SRF_IS_FIRSTCALL() is true - was I right or wrong? Dear Gurjeet, I thought a bit after riding the bikes and the code repaired itself. :-) Thanks for the clarification. Please check if patch v5 is satisfactory for you. Can you please explain why you chose to remove the provolatile attribute from the existing entry of date_trunc in pg_proc.dat? I believe it was a mistake in PG code. All timestamptz functions must be STABLE as they depend on the current SHOW timezone. If new functions are created that pass the zone as a parameter, they become IMMUTABLE. The first date_trunc function implementation was without a time zone parameter, and someone who added the second variant (with timezone as parameter) copied the definition without removing the STABLE flag. Have I convinced everyone that this change is right? I assume I'm right and the mistake will be fatal. -- Przemysław Sztoch | Mobile +48 509 99 00 66
Re: EINTR in ftruncate()
On 2022-Jul-01, Andres Freund wrote: > On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote: > > On 2022-Jul-01, Andres Freund wrote: > > > What is the reason for the || ProcDiePending || QueryCancelPending bit? > > > What > > > if there's dsm operations intentionally done while QueryCancelPending? > > > > That mirrors the test for the other block in that function, which was > > added by 63efab4ca139, whose commit message explains: > That whole approach seems quite wrong to me. At the absolute very least the > code needs to check if interrupts are being processed in the current context > before just giving up due to ProcDiePending || QueryCancelPending. For the time being, I can just push the addition of the EINTR retry without testing ProcDiePending || QueryCancelPending. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Sweat is the best cure for a sick thought" (Bardia) From b7591beb919c3cfaa8090bda3977b7127de8de28 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 1 Jul 2022 17:16:33 +0200 Subject: [PATCH v3] retry ftruncate --- src/backend/storage/ipc/dsm_impl.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c index 82b7978aeb..145a02204b 100644 --- a/src/backend/storage/ipc/dsm_impl.c +++ b/src/backend/storage/ipc/dsm_impl.c @@ -417,8 +417,19 @@ dsm_impl_posix_resize(int fd, off_t size) { int rc; - /* Truncate (or extend) the file to the requested size. */ - rc = ftruncate(fd, size); + /* + * Truncate (or extend) the file to the requested size. If we're + * interrupted by a signal, retry; have caller handle any other error. + */ + for (;;) + { + errno = 0; + rc = ftruncate(fd, size); + if (rc == 0) + break; + if (errno != EINTR) + return rc; + } /* * On Linux, a shm_open fd is backed by a tmpfs file. After resizing with -- 2.30.2
Re: generate_series for timestamptz and time zone problem
Tom Lane wrote on 04.07.2022 00:31: Przemysław Sztoch writes: I have a problem with this: -- Considering only built-in procs (prolang = 12), look for multiple uses -- of the same internal function (ie, matching prosrc fields). It's OK to -- have several entries with different pronames for the same internal function, -- but conflicts in the number of arguments and other critical items should -- be complained of. (We don't check data types here; see next query.) It's telling you you're violating project style. Don't make multiple pg_proc entries point at the same C function and then use PG_NARGS to disambiguate; instead point at two separate functions. The functions can share code at the next level down, if they want. (Just looking at the patch, though, I wonder if sharing code is really beneficial in this case. It seems quite messy, and I wouldn't be surprised if it hurts performance in the existing case.) You also need to expend some more effort on refactoring code, to eliminate silliness like looking up the timezone name each time through the SRF. That's got to be pretty awful performance-wise. regards, tom lane Thx. Code is refactored. It is better now. -- Przemysław Sztoch | Mobile +48 509 99 00 66 diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index f70f829d83..ee1dee7c84 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -69,6 +69,7 @@ typedef struct TimestampTz finish; Interval step; int step_sign; + pg_tz *attimezone; } generate_series_timestamptz_fctx; @@ -547,6 +548,48 @@ parse_sane_timezone(struct pg_tm *tm, text *zone) return tz; } +static pg_tz * +lookup_timezone(text *zone) +{ + char tzname[TZ_STRLEN_MAX + 1]; + char *lowzone; + int type, + val; + pg_tz *tzp; + /* +* Look up the requested timezone (see notes in timestamptz_zone()).
+*/ + text_to_cstring_buffer(zone, tzname, sizeof(tzname)); + + /* DecodeTimezoneAbbrev requires lowercase input */ + lowzone = downcase_truncate_identifier(tzname, + strlen(tzname), + false); + + type = DecodeTimezoneAbbrev(0, lowzone, &val, &tzp); + + if (type == TZ || type == DTZ) + { + /* fixed-offset abbreviation, get a pg_tz descriptor for that */ + tzp = pg_tzset_offset(-val); + } + else if (type == DYNTZ) + { + /* dynamic-offset abbreviation, use its referenced timezone */ + } + else + { + /* try it as a full zone name */ + tzp = pg_tzset(tzname); + if (!tzp) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("time zone \"%s\" not recognized", tzname))); + } + + return tzp; +} + /* * make_timestamp_internal * workhorse for make_timestamp and make_timestamptz @@ -2989,97 +3032,107 @@ timestamp_mi_interval(PG_FUNCTION_ARGS) } -/* timestamptz_pl_interval() - * Add an interval to a timestamp with time zone data type. - * Note that interval has provisions for qualitative year/month +/* + * Note that interval has provisions for qualitative year/month and day * units, so try to do the right thing with them. * To add a month, increment the month, and use the same day of month. * Then, if the next month has fewer days, set the day of month * to the last day of month. + * To add a day, increment the mday, and use the same time of day. * Lastly, add in the "quantitative time".
*/ +static TimestampTz +timestamptz_pl_interval_internal(TimestampTz timestamp, Interval *span, pg_tz *attimezone) +{ + int tz; + + /* Use session timezone if caller asks for default */ + if (attimezone == NULL) + attimezone = session_timezone; + + if (span->month != 0) + { + struct pg_tm tt, + *tm = &tt; + fsec_t fsec; + + if (timestamp2tm(timestamp, &tz, tm, &fsec, NULL, attimezone) != 0) + ereport(ERROR, + (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), +errmsg("timestamp out of range"))); + + tm->tm_mon += span->month; + if (tm->tm_mon > MONTHS_PER_YEAR) + { + tm->tm_year += (tm->tm_mon - 1) / MONTHS_PER_YEAR; + tm->tm_mon = ((tm->tm_mon - 1) % MONTHS_PER_YEAR) +
Re: Error from the foreign RDBMS on a foreign table I have no privilege on
On Fri, Jun 10, 2022 at 6:17 PM Laurenz Albe wrote: > On Fri, 2022-06-10 at 17:17 +0900, Etsuro Fujita wrote: > > > I am not sure if it worth adding to the documentation. I would never > > > have thought > > > of the problem if Phil hadn't brought it up. On the other hand, I was > > > surprised > > > to learn that permissions aren't checked until the executor kicks in. > > > It makes sense, but some documentation might help others in that > > > situation. > > > > +1 for adding such a document. > > > > > I'll gladly leave the decision to your judgement as a committer. > > > > IIRC, there are no reports about this from the postgres_fdw users, so > > my inclination would be to leave the documentation alone, for now. > > I understand that you are for documenting the timing of permission checks, > but not in the postgres_fdw documentation. Yes, I think so. > However, this is the only occasion > where the user might notice unexpected behavior on account of the timing of > permission checks. Other than that, I consider this below the threshold for > user-facing documentation. I think PREPARE/EXECUTE have a similar issue: postgres=# create table t1 (a int, b int); CREATE TABLE postgres=# create user foouser; CREATE ROLE postgres=# set role foouser; SET postgres=> prepare fooplan (int, int) as insert into t1 values ($1, $2); PREPARE postgres=> execute fooplan (, ); ERROR: permission denied for table t1 The user foouser is allowed to PREPARE the insert statement, without the insert privilege on the table t1, as the permission check is delayed until EXECUTE. So I thought it would be good to add a note about the timing to the documentation about the Postgres core, such as arch-dev.sgml (the "Overview of PostgreSQL Internals" chapter). But as far as I know, there aren’t any reports on the PREPARE/EXECUTE behavior, either, so there might be less need to do so, I think. Thanks for the discussion! Sorry for the delay. Best regards, Etsuro Fujita
Re: pgsql: dshash: Add sequential scan support.
On Sun, Jul 3, 2022 at 7:56 PM Thomas Munro wrote: > [Re-directing to -hackers] > > On Fri, Mar 11, 2022 at 2:27 PM Andres Freund wrote: > > On 2022-03-10 20:09:56 -0500, Tom Lane wrote: > > > Andres Freund writes: > > > > dshash: Add sequential scan support. > > > > Add ability to scan all entries sequentially to dshash. The > interface is > > > > similar but a bit different both from that of dynahash and simple > dshash > > > > search functions. The most significant difference is that dshash's > interface > > > > always needs a call to dshash_seq_term when scan ends. > > > > > > Umm ... what about error recovery? Or have you just cemented the > > > proposition that long-lived dshashes are unsafe? > > > > I don't think this commit made it worse. dshash_seq_term() releases an > lwlock > > (which will be released in case of an error) and unsets > > hash_table->find_[exclusively_]locked. The latter weren't introduced by > this > > patch, and are also set by dshash_find(). > > > > I agree that ->find_[exclusively_]locked are problematic from an error > > recovery perspective. > > Right, as seen in the build farm at [1]. Also reproducible with something > like: > > @@ -269,6 +269,14 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size > request_size, > return false; > } > > + /* XXX random fault injection */ + if (op == DSM_OP_ATTACH && random() < RAND_MAX / 8) + { + close(fd); > + elog(ERROR, "chaos"); > + return false; > + } > + > > I must have thought that it was easy and practical to write no-throw > straight-line code and be sure to reach dshash_release_lock(), but I > concede that it was a bad idea: even dsa_get_address() can throw*, and > you're often likely to need to call that while accessing dshash > elements.
For example, in lookup_rowtype_tupdesc_internal(), there is > a sequence dshash_find(), ..., dsa_get_address(), ..., > dshash_release_lock(), and I must have considered the range of code > between find and release to be no-throw, but now I know that it is > not. > > > It's per-backend state at least and just used for assertions. We could > remove > > it. Or stop checking it in places where it could be set wrongly: > dshash_find() > > and dshash_detach() couldn't check anymore, but the rest of the > assertions > > would still be valid afaics? > > Yeah, it's all for assertions... let's just remove it. Those > assertions were useful to me at some stage in development but won't > hold as well as I thought, at least without widespread PG_FINALLY(), > which wouldn't be nice. > > *dsa_get_address() might need to adjust the memory map with system > calls, which might fail. If you think of DSA as not only an allocator > but also a poor man's user level virtual memory scheme to tide us over > until we get threads, then this is a pretty low level kind of > should-not-happen failure that is analogous on some level to SIGBUS or > SIGSEGV or something like that, and we should PANIC. Then we could > claim that dsa_get_address() is no-throw. At least, that was one > argument I had with myself while investigating that strange Solaris > shm_open() failure, but ... I lost the argument. It's quite an > extreme position to take just to support these assertions, which are > of pretty limited value. > > [1] > https://www.postgresql.org/message-id/20220701232009.jcwxpl45bptaxv5n%40alap3.anarazel.de Hi, In the description, `new shared memory stats system in 15` It would be clearer to add `release` before `15`. Cheers
Re: Handle infinite recursion in logical replication setup
On Sun, Jul 3, 2022 at 8:29 PM vignesh C wrote: > Review comments ===

1.
@@ -530,7 +557,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
 SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT |
- SUBOPT_DISABLE_ON_ERR);
+ SUBOPT_DISABLE_ON_ERR | SUBOPT_ORIGIN);
 parse_subscription_options(pstate, stmt->options, supported_opts, &opts);
 /*
@@ -606,6 +633,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 LOGICALREP_TWOPHASE_STATE_PENDING :
 LOGICALREP_TWOPHASE_STATE_DISABLED);
 values[Anum_pg_subscription_subdisableonerr - 1] = BoolGetDatum(opts.disableonerr);
+ values[Anum_pg_subscription_suborigin - 1] =
+ CStringGetTextDatum(opts.origin);
 values[Anum_pg_subscription_subconninfo - 1] = CStringGetTextDatum(conninfo);
 if (opts.slot_name)
...
...
 /* List of publications subscribed to */
 text subpublications[1] BKI_FORCE_NOT_NULL;
+
+ /* Only publish data originating from the specified origin */
+ text suborigin BKI_DEFAULT(LOGICALREP_ORIGIN_ANY);
 #endif
 } FormData_pg_subscription;

The order of declaration and assignment for 'suborigin' should match in the above usage.

2. Similarly the changes in GetSubscription() should also match the declaration of the origin column.

3.
 GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled,
- subbinary, substream, subtwophasestate, subdisableonerr, subslotname,
- subsynccommit, subpublications)
+ subbinary, substream, subtwophasestate, subdisableonerr,
+ suborigin, subslotname, subsynccommit, subpublications)
 ON pg_subscription TO public;

This should also match the order of columns as in pg_subscription.h unless there is a reason for not doing so.

4.
+ /*
+ * Even though "origin" parameter allows only "local" and "any"
+ * values, it is implemented as a string type so that the parameter
+ * can be extended in future versions to support filtering using
+ * origin names specified by the user.
/Even though "origin" .../Even though the "origin" parameter ... 5. + +# Create tables on node_A +$node_A->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)"); + +# Create the same tables on node_B +$node_B->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)"); In both the above comments, you should use table instead of tables as the test creates only one table. 6. +$node_A->safe_psql('postgres', "DELETE FROM tab_full;"); After this, the test didn't ensure that this operation is replicated. Can't that lead to unpredictable results for the other tests after this test? 7. +# Setup logical replication +# node_C (pub) -> node_B (sub) +my $node_C_connstr = $node_C->connstr . ' dbname=postgres'; +$node_C->safe_psql('postgres', + "CREATE PUBLICATION tap_pub_C FOR TABLE tab_full"); + +my $appname_B2 = 'tap_sub_B2'; +$node_B->safe_psql( + 'postgres', " + CREATE SUBSCRIPTION tap_sub_B2 + CONNECTION '$node_C_connstr application_name=$appname_B2' + PUBLICATION tap_pub_C + WITH (origin = local)"); + +$node_C->wait_for_catchup($appname_B2); + +$node_C->poll_query_until('postgres', $synced_query) + or die "Timed out while waiting for subscriber to synchronize data"; This test allows the publisher (node_C) to poll for sync but it should be the subscriber (node_B) that needs to poll to allow the initial sync to finish. 8. Do you think it makes sense to see if this new option can also be supported by pg_recvlogical? I see that previously we have not extended pg_recvlogical for all the newly added options but I feel we should keep pg_recvlogical up to date w.r.t new options. We can do this as a separate patch if we agree? -- With Regards, Amit Kapila.
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
On 2022-Jul-04, Alvaro Herrera wrote: > BTW I have a patch for the problem with uniqviol also (not fixed by v7). I'll > send an updated patch in a little while. Here it is. I ran "libpq_pipeline uniqviol" in a tight loop a few thousand times and didn't get any error. Before these fixes, it would fail in half a dozen iterations. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ From d4d446531d52a34115fe7446732d61ae8f61d8bb Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 1 Jul 2022 18:11:10 +0200 Subject: [PATCH v8] libpq: Improve idle state handling in pipeline mode We were going into IDLE state too soon when executing queries via PQsendQuery in pipeline mode, causing several scenarios to misbehave in different ways -- most notably, as reported by Daniele Varrazzo, that a warning message is produced by libpq: message type 0x33 arrived from server while idle But it is also possible, if queries are sent and results consumed not in lockstep, for the expected mediating NULL result values from PQgetResult to be lost (a problem which has not been reported, but which is more serious). Fix this by introducing two new concepts: one is a command queue element PGQUERY_CLOSE to tell libpq to wait for the CloseComplete server response to the Close message that is sent by PQsendQuery. Because the application is not expecting any PGresult from this, the mechanism to consume it is a bit hackish. The other concept, authored by Horiguchi-san, is a PGASYNC_PIPELINE_IDLE state for libpq's state machine to differentiate "really idle" from merely "the idle state that occurs in between reading results from the server for elements in the pipeline". This makes libpq not go fully IDLE when the libpq command queue contains entries; in normal cases, we only go IDLE once at the end of the pipeline, when the server response to the final SYNC message is received.
(However, there are corner cases it doesn't fix, such as terminating the query sequence by PQsendFlushRequest instead of PQpipelineSync; this sort of scenario is what requires PGQUERY_CLOSE bit above.) This last bit helps make the libpq state machine clearer; in particular we can get rid of an ugly hack in pqParseInput3 to avoid considering IDLE as such when the command queue contains entries. A new test mode is added to libpq_pipeline.c to tickle some related problematic cases. Reported-by: Daniele Varrazzo Co-authored-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/ca+mi_8bvd0_cw3sumgwpvwdnzxy32itog_16tdyru_1s2gv...@mail.gmail.com --- src/interfaces/libpq/fe-exec.c| 116 -- src/interfaces/libpq/fe-protocol3.c | 30 +-- src/interfaces/libpq/libpq-int.h | 6 +- .../modules/libpq_pipeline/libpq_pipeline.c | 215 +- .../libpq_pipeline/t/001_libpq_pipeline.pl| 3 +- .../libpq_pipeline/traces/pipeline_idle.trace | 93 6 files changed, 425 insertions(+), 38 deletions(-) create mode 100644 src/test/modules/libpq_pipeline/traces/pipeline_idle.trace diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index 4180683194..e22d0814f0 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -1279,7 +1279,8 @@ pqAppendCmdQueueEntry(PGconn *conn, PGcmdQueueEntry *entry) * itself consume commands from the queue; if we're in any other * state, we don't have to do anything. 
*/ - if (conn->asyncStatus == PGASYNC_IDLE) + if (conn->asyncStatus == PGASYNC_IDLE || +conn->asyncStatus == PGASYNC_PIPELINE_IDLE) { resetPQExpBuffer(&conn->errorMessage); pqPipelineProcessQueue(conn); @@ -1338,6 +1339,7 @@ static int PQsendQueryInternal(PGconn *conn, const char *query, bool newQuery) { PGcmdQueueEntry *entry = NULL; + PGcmdQueueEntry *entry2 = NULL; if (!PQsendQueryStart(conn, newQuery)) return 0; @@ -1353,6 +1355,12 @@ PQsendQueryInternal(PGconn *conn, const char *query, bool newQuery) entry = pqAllocCmdQueueEntry(conn); if (entry == NULL) return 0;/* error msg already set */ + if (conn->pipelineStatus != PQ_PIPELINE_OFF) + { + entry2 = pqAllocCmdQueueEntry(conn); + if (entry2 == NULL) + goto sendFailed; + } /* Send the query message(s) */ if (conn->pipelineStatus == PQ_PIPELINE_OFF) @@ -1422,6 +1430,20 @@ PQsendQueryInternal(PGconn *conn, const char *query, bool newQuery) /* OK, it's launched! */ pqAppendCmdQueueEntry(conn, entry); + + /* + * When pipeline mode is in use, we need a second entry in the command + * queue to represent Close Portal message. This allows us later to wait + * for the CloseComplete message to be received before getting in IDLE + * state. + */ + if (conn->pipelineStatus != PQ_PIPELINE_OFF) + { + entry2->queryclass = PGQUERY_CLOSE; + entry2->query = NULL; + pqAppendCmdQueueEntry(conn, entry2); + } + + return 1; sendFailed: @@ -1667,11 +1689,13 @@ PQsendQueryStart(PGconn *conn, bool newQuery) switch
Add last_vacuum_index_scans in pg_stat_all_tables
Hi hackers, I think having the number of index scans of the last vacuum in pg_stat_all_tables can be helpful. This value shows how efficiently vacuums have performed and can be an indicator to increase maintenance_work_mem. It was proposed previously[1], but it was not accepted due to the limitations of the stats collector. Statistics are now stored in shared memory, so we have more room to store statistics. I think this statistic is still valuable for some people, so I am proposing this again. Best wishes, -- Ken Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION [1] https://www.postgresql.org/message-id/20171010.192616.108347483.horiguchi.kyotaro%40lab.ntt.co.jp
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 4549c2560e..bd7bfa7d9d 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4557,6 +4557,16 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i daemon + + + + last_vacuum_index_scans bigint + + + Number of index scans performed during the last vacuum + on this table + +
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index b802ed247e..bf47d6837d 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -625,7 +625,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params, rel->rd_rel->relisshared, Max(vacrel->new_live_tuples, 0), vacrel->recently_dead_tuples + - vacrel->missed_dead_tuples); + vacrel->missed_dead_tuples, + vacrel->num_index_scans); pgstat_progress_end_command(); if (instrument)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index fedaed533b..475551d9b3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -676,7 +676,9 @@ CREATE VIEW pg_stat_all_tables AS pg_stat_get_vacuum_count(C.oid) AS vacuum_count, pg_stat_get_autovacuum_count(C.oid) AS
autovacuum_count, pg_stat_get_analyze_count(C.oid) AS analyze_count, -pg_stat_get_autoanalyze_count(C.oid) AS autoanalyze_count +pg_stat_get_autoanalyze_count(C.oid) AS autoanalyze_count, +pg_stat_get_last_vacuum_index_scans(C.oid) AS last_vacuum_index_scans + FROM pg_class C LEFT JOIN pg_index I ON C.oid = I.indrelid LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c index a846d9ffb6..ffc9daf944 100644 --- a/src/backend/utils/activity/pgstat_relation.c +++ b/src/backend/utils/activity/pgstat_relation.c @@ -208,8 +208,8 @@ pgstat_drop_relation(Relation rel) * Report that the table was just vacuumed. */ void -pgstat_report_vacuum(Oid tableoid, bool shared, - PgStat_Counter livetuples, PgStat_Counter deadtuples) +pgstat_report_vacuum(Oid tableoid, bool shared, PgStat_Counter livetuples, + PgStat_Counter deadtuples, PgStat_Counter num_index_scans) { PgStat_EntryRef *entry_ref; PgStatShared_Relation *shtabentry; @@ -232,6 +232,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared, tabentry->n_live_tuples = livetuples; tabentry->n_dead_tuples = deadtuples; + tabentry->n_index_scans = num_index_scans; /* * It is quite possible that a non-aggressive VACUUM ended up skipping diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 893690dad5..57d1de52e6 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -179,6 +179,20 @@ pg_stat_get_dead_tuples(PG_FUNCTION_ARGS) PG_RETURN_INT64(result); } +Datum +pg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + int64 result; + PgStat_StatTabEntry *tabentry; + + if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL) + result = 0; + else + result = tabentry->n_index_scans; + + PG_RETURN_INT64(result); +} Datum pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS) diff --git a/src/include/catalog/pg_proc.dat 
b/src/include/catalog/pg_proc.dat index a77b293723..d061cd2d8d 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5331,6 +5331,10 @@ proname => 'pg_stat_get_autoanalyze_count', provolatile => 's', proparallel => 'r', prorettype => 'int8', proargtypes => 'oid', prosrc => 'pg_stat_get_autoanalyze_count' }, +{ oid => '3813', descr => 'statistics: last vacuum index scans for a table', + proname => 'pg_stat_get_last_vacuum_index_scans', provolatile => 's', + proparallel => 'r', prorettype => 'int8', proargtypes => 'oid', + prosrc => 'pg_stat_get_last_vacuum_index_scans' }, { oid => '1936', descr => 'statistics: currently active backend IDs', proname =>
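The new SQL-callable function follows the usual pgstat accessor pattern: if no statistics entry exists for the relation, it returns 0 rather than raising an error. A minimal sketch of that pattern follows; the types and OIDs here are mock stand-ins, not the real PgStat_StatTabEntry API.

```c
#include <stddef.h>

/* Simplified stand-ins for the patch's per-table stats lookup. */
typedef struct {
    unsigned int relid;
    long n_index_scans;
} TabEntry;

static TabEntry table_stats[] = {
    { 16384, 3 },   /* a table whose last vacuum needed 3 index scans */
    { 16385, 0 },
};

static const TabEntry *fetch_stat_tabentry(unsigned int relid) {
    for (size_t i = 0; i < sizeof(table_stats) / sizeof(table_stats[0]); i++)
        if (table_stats[i].relid == relid)
            return &table_stats[i];
    return NULL;    /* no stats collected for this relation yet */
}

/* Same shape as the patch's pg_stat_get_last_vacuum_index_scans():
 * a missing stats entry reads as zero instead of an error. */
static long last_vacuum_index_scans(unsigned int relid) {
    const TabEntry *e = fetch_stat_tabentry(relid);
    return e ? e->n_index_scans : 0;
}
```

This mirrors the other pg_stat_get_* functions in pgstatfuncs.c, which keeps the view usable for relations that have never been vacuumed.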
Re: Emit extra debug message when executing extension script.
On 2022-Jun-29, Robert Haas wrote: > On Wed, Jun 29, 2022 at 9:26 AM Peter Eisentraut > wrote: > > On 28.06.22 21:10, Jeff Davis wrote: > > > + ereport(DEBUG1, errmsg("executing extension script: %s", filename)); > > > > This should either be elog or use errmsg_internal. > > Why? The reason is that errmsg() marks the message for translation, and we don't want to burden translators with messages that are of little interest to most users. Using either elog() or errmsg_internal() avoids that. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "El sudor es la mejor cura para un pensamiento enfermo" (Bardia)
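The distinction Álvaro describes can be illustrated with a toy model: only strings passed through the translatable channel end up in the message catalog, which is why low-interest debug messages should use elog() or errmsg_internal(). This mock is purely illustrative; PostgreSQL's real mechanism works via gettext string extraction at build time, and the names below are made up.

```c
#include <string.h>

/* Toy model of translation marking: the "catalog" collects only
 * messages emitted through the translatable channel, mirroring how
 * xgettext extracts strings wrapped by errmsg() but not those passed
 * to errmsg_internal() or elog(). */
#define MAX_CATALOG 16
static const char *catalog[MAX_CATALOG];
static int catalog_len;

static const char *errmsg_mock(const char *fmt) {
    if (catalog_len < MAX_CATALOG)
        catalog[catalog_len++] = fmt;   /* marked for translators */
    return fmt;
}

static const char *errmsg_internal_mock(const char *fmt) {
    return fmt;                         /* emitted, but never translated */
}

static int in_catalog(const char *fmt) {
    for (int i = 0; i < catalog_len; i++)
        if (strcmp(catalog[i], fmt) == 0)
            return 1;
    return 0;
}
```

The point of the thread is simply that the DEBUG1 message should travel the second path, sparing translators the work.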
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
On 2022-Jul-04, Kyotaro Horiguchi wrote: > At Wed, 29 Jun 2022 14:09:17 +0200, Alvaro Herrera > wrote in > > However, another problem case, not fixed by PIPELINE_IDLE, occurs if you > > exit pipeline mode after PQsendQuery() and then immediately use > > PQexec(). The CloseComplete will be received at the wrong time, and a > > notice is emitted nevertheless. > > Mmm. My patch moves the point of failure of the scenario a bit but > still a little short. However, to my understanding, it seems to be the > task of the PQpipelineSync()-PQgetResult() pair to consume the > CloseComplete. If I inserted PQpipelineSync() just after PQsendQuery() > and called PQgetResult() for PGRES_PIPELINE_SYNC before > PQexitPipelineMode(), the out-of-sync CloseComplete is not seen in the > scenario. But if it is right, I'd like to complain about the > obscure-but-stiff protocol of pipeline mode.. Yeah, if you introduce PQpipelineSync then I think it'll work okay, but my point here was to make it work without requiring that; that's why I wrote the test to use PQsendFlushRequest instead. BTW, I have a patch for the problem with uniqviol also (not fixed by v7). I'll send an updated patch in a little while. > > I produced pipeline_idle.trace file by running the test in the fully > > But the perl script doesn't produce the trace file since the list in > the $cmptrace line doesn't contain pipeline_idle.. Ouch, of course, thanks for noticing. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
RE: Handle infinite recursion in logical replication setup
On Mon, Jul 4, 2022 4:17 PM shiy.f...@fujitsu.com wrote: > > On Sun, Jul 3, 2022 11:00 PM vignesh C wrote: > > > > Thanks for the comments, the attached v27 patch has the changes for the > > same. > > > > Thanks for updating the patch. > > A comment on 0003 patch: > > I think we should call ExecClearTuple() before getting the next tuple, so it > should > be called if the table is in ready state. How about modifying it to: > By the way, I have tested the pg_dump/psql changes in this patch with older versions (server: pg10 ~ pg15, pg_dump/psql: pg16), and it worked ok. Regards, Shi yu
Re: Re-order "disable_on_error" in tab-complete COMPLETE_WITH
On Mon, Jul 4, 2022 at 1:07 PM Peter Smith wrote: > > By convention, the tab-complete logical replication subscription > parameters are listed in the COMPLETE_WITH lists in alphabetical > order, but when the "disable_on_error" parameter was added this was > not done. > Yeah, it seems we have overlooked this point. I think we can do this just for HEAD but as the feature is introduced in PG-15 so there is no harm in pushing it to PG-15 as well especially because it is a straightforward change. What do you or others think? -- With Regards, Amit Kapila.
Re: pgsql: dshash: Add sequential scan support.
At Mon, 4 Jul 2022 14:55:43 +1200, Thomas Munro wrote in > [Re-directing to -hackers] > > On Fri, Mar 11, 2022 at 2:27 PM Andres Freund wrote: > > It's per-backend state at least and just used for assertions. We could > > remove > > it. Or stop checking it in places where it could be set wrongly: > > dshash_find() > > and dshash_detach() couldn't check anymore, but the rest of the assertions > > would still be valid afaics? > > Yeah, it's all for assertions... let's just remove it. Those > assertions were useful to me at some stage in development but won't > hold as well as I thought, at least without widespread PG_FINALLY(), > which wouldn't be nice. > > *dsa_get_address() might need to adjust the memory map with system > calls, which might fail. If you think of DSA as not only an allocator > but also a poor man's user level virtual memory scheme to tide us over > until we get threads, then this is a pretty low level kind of > should-not-happen failure that is analogous on some level to SIGBUS or > SIGSEGV or something like that, and we should PANIC. Then we could > claim that dsa_get_address() is no-throw. At least, that was one > argument I had with myself while investigating that strange Solaris > shm_open() failure, but ... I lost the argument. It's quite an > extreme position to take just to support these assertions, which are > of pretty limited value. > > [1] > https://www.postgresql.org/message-id/20220701232009.jcwxpl45bptaxv5n%40alap3.anarazel.de FWIW, the discussion above is convincing to me and the patch looks good. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Using PQexecQuery in pipeline mode produces unexpected Close messages
Thanks for the further testing scenario. At Wed, 29 Jun 2022 14:09:17 +0200, Alvaro Herrera wrote in > So I wrote some more test scenarios for this, and as I wrote in some > other thread, I realized that there are more problems than just some > NOTICE trouble. For instance, if you send a query, then read the result > but not the terminating NULL then send another query, everything gets > confused; the next thing you'll read is the result for the second query, > without having read the NULL that terminates the results of the first > query. Any application that expects the usual flow of results will be > confused. Kyotaro's patch to add PIPELINE_IDLE fixes this bit too, as > far as I can tell. > > However, another problem case, not fixed by PIPELINE_IDLE, occurs if you > exit pipeline mode after PQsendQuery() and then immediately use > PQexec(). The CloseComplete will be received at the wrong time, and a > notice is emitted nevertheless. Mmm. My patch moves the point of failure of the scenario a bit but still a little short. However, to my understanding, it seems to be the task of the PQpipelineSync()-PQgetResult() pair to consume the CloseComplete. If I inserted PQpipelineSync() just after PQsendQuery() and called PQgetResult() for PGRES_PIPELINE_SYNC before PQexitPipelineMode(), the out-of-sync CloseComplete is not seen in the scenario. But if it is right, I'd like to complain about the obscure-but-stiff protocol of pipeline mode..
the client is never expecting that PQgetResult() would return > anything for the CloseComplete, so something needs to consume the > CloseComplete silently (including the queue entry for it) when it is > received; I chose to do this directly in pqParseInput3. I tried to > make PQgetResult itself do it, but it became a pile of hacks until I > was no longer sure what was going on. Putting it in fe-protocol3.c > ends up a lot cleaner. However, we still need PQgetResult to invoke > parseInput() at the point where Close is expected. > > 2. if an error occurs while executing the query, the CloseComplete will > of course never arrive, so we need to erase it from the queue > silently if we're returning an error. > > I toyed with the idea of having parseInput() produce a PGresult that > encodes the Close message, and have PQgetResult consume and discard > that, then read some further message to have something to return. But > it seemed inefficient and equally ugly and I'm not sure that flow > control is any simpler. > > I think another possibility would be to make PQexitPipelineMode > responsible for /something/, but I'm not sure what that would be. > Checking the queue and seeing if the next message is CloseComplete, then > eating that message before exiting pipeline mode? That seems way too > magical. I didn't attempt this. > > I attach a patch series that implements the proposed fix (again for > REL_14_STABLE) in steps, to make it easy to read. I intend to squash > the whole lot into a single commit before pushing. But while writing > this email it occurred to me that I need to add at least one more test, > to receive a WARNING while waiting for CloseComplete. AFAICT it should > work, but better make sure. > > I produced pipeline_idle.trace file by running the test in the fully But the perl script doesn't produce the trace file since the list in the $cmptrace line doesn't contain pipeline_idle..
> fixed tree, then used it to verify that all tests fail in different ways > when run without the fixes. The first fix with PIPELINE_IDLE fixes some > of these failures, and the PGQUERY_CLOSE one fixes the remaining one. > Reading the trace file, it looks correct to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
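The PIPELINE_IDLE idea discussed in this thread boils down to a three-state distinction: a connection that has drained all results while still inside a pipeline is not truly idle, so a late CloseComplete arriving there can be swallowed silently instead of producing a notice. A hedged sketch of that state machine follows; the names are hypothetical, not libpq's actual PGASYNC_* handling.

```c
/* Illustrative mock of the IDLE vs. PIPELINE_IDLE distinction. */
typedef enum { ST_IDLE, ST_BUSY, ST_PIPELINE_IDLE } AsyncStatus;

typedef struct {
    AsyncStatus status;
    int pipeline_mode;
    int notices;    /* "unexpected message on idle connection" notices */
} Conn;

/* All results of the current query have been consumed. */
static void results_done(Conn *c) {
    c->status = c->pipeline_mode ? ST_PIPELINE_IDLE : ST_IDLE;
}

/* A CloseComplete arrives from the server. */
static void recv_close_complete(Conn *c) {
    if (c->status == ST_PIPELINE_IDLE)
        ;                /* expected leftover of PQsendQuery: eat silently */
    else if (c->status == ST_IDLE)
        c->notices++;    /* out-of-band message on a truly idle connection */
}
```

Without the third state, the pipeline case would collapse into ST_IDLE and every trailing CloseComplete would look like protocol garbage, which is the symptom the thread starts from.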
Re: Allowing REINDEX to have an optional name
On 2022-Jun-29, Simon Riggs wrote: > > -if (strcmp(objectName, get_database_name(objectOid)) != 0) > > +if (objectName && strcmp(objectName, get_database_name(objectOid)) > > != 0) > > ereport(ERROR, > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > errmsg("can only reindex the currently open > > database"))); > > if (!pg_database_ownercheck(objectOid, GetUserId())) > > aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_DATABASE, > > - objectName); > > + get_database_name(objectOid)); > > > > This could call get_database_name() just once. > > It could, but I couldn't see any benefit in changing that for the code > under discussion. > > If calling get_database_name() multiple times is an issue, I've added > a cache for that - another patch attached, if you think its worth it. TBH I doubt that this is an issue: since we're throwing an error anyway, the memory would be released, and error cases are not considered worth of performance optimization anyway. Putting that thought aside, if we were to think that this is an issue, I don't think the cache as implemented here is a good idea, because then caller is responsible for tracking whether to free or not the return value. I think that Michaël's idea could be implemented more easily by having a local variable that receives the return value from get_database_name. But I think the coding as Simon had it was all right. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Las navajas y los monos deben estar siempre distantes" (Germán Poo)
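Michaël's suggestion of calling get_database_name() just once can be shown with a small mock that counts lookups: fetch the name into a local variable and reuse it for both the comparison and the error path, with no caller-visible cache needed. The function names below are stand-ins, not PostgreSQL's real catalog API.

```c
#include <string.h>

static int lookups;    /* counts simulated syscache lookups */

static const char *get_database_name_mock(unsigned int oid) {
    lookups++;          /* stands in for a catalog/syscache lookup */
    return oid == 5 ? "postgres" : "other";
}

/* The reviewed coding pattern: fetch the name once into a local and
 * use it both for the comparison and for the (would-be) error report,
 * instead of calling the lookup function twice. */
static int check_db(unsigned int oid, const char *objectName) {
    const char *dbname = get_database_name_mock(oid);

    if (objectName && strcmp(objectName, dbname) != 0)
        return -1;      /* would ereport(ERROR, ...) in PostgreSQL */
    return 0;           /* dbname still at hand for aclcheck_error() */
}
```

Since the error path throws anyway, the performance argument is weak, as Álvaro notes; the local-variable form is mostly about avoiding a cache whose ownership the caller would have to track.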
RE: Handle infinite recursion in logical replication setup
On Sun, Jul 3, 2022 11:00 PM vignesh C wrote:
>
> Thanks for the comments, the attached v27 patch has the changes for the
> same.
>

Thanks for updating the patch. A comment on 0003 patch:

+	/*
+	 * No need to throw an error for the tables that are in ready state,
+	 * as the walsender will send the changes from WAL in case of tables
+	 * in ready state.
+	 */
+	if (isreadytable)
+		continue;
+	...
+	ereport(ERROR,
+			errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+			errmsg("table: \"%s.%s\" might have replicated data in the publisher",
+				   nspname, relname),
+			errdetail("CREATE/ALTER SUBSCRIPTION with origin = local and copy_data = on is not allowed when the publisher might have replicated data."),
+			errhint("Use CREATE/ALTER SUBSCRIPTION with copy_data = off/force."));
+
+	ExecClearTuple(slot);

I think we should call ExecClearTuple() before getting the next tuple, so it should be called if the table is in ready state. How about modifying it to:

if (!isreadytable)
	ereport(ERROR,
			errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
			errmsg("table: \"%s.%s\" might have replicated data in the publisher",
				   nspname, relname),
			errdetail("CREATE/ALTER SUBSCRIPTION with origin = local and copy_data = on is not allowed when the publisher might have replicated data."),
			errhint("Use CREATE/ALTER SUBSCRIPTION with copy_data = off/force."));

ExecClearTuple(slot);

Regards, Shi yu
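The review point generalizes to a common loop-hygiene pattern: per-iteration cleanup (here ExecClearTuple()) must be reached on every path through the loop, including the early-continue one, which the suggested rewrite achieves by inverting the condition. A minimal sketch of the pattern follows; in this mock the error is merely counted so the loop can proceed, whereas in PostgreSQL ereport(ERROR) would abort the transaction.

```c
/* Toy version of the review's fix: invert the condition and run the
 * per-iteration cleanup unconditionally at the bottom of the loop,
 * so the "ready" (formerly 'continue') path is cleaned up too. */
typedef struct {
    int ready;      /* table in ready state? */
    int cleared;    /* did ExecClearTuple()-style cleanup run? */
} Slot;

static int process(Slot *slots, int n) {
    int errors = 0;

    for (int i = 0; i < n; i++) {
        if (!slots[i].ready)
            errors++;           /* stands in for ereport(ERROR, ...) */
        slots[i].cleared = 1;   /* cleanup reached on every path */
    }
    return errors;
}
```

With the original `if (ready) continue;` shape, the cleanup line would be skipped for every ready table, which is exactly the leak the reviewer is pointing at.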
Re-order "disable_on_error" in tab-complete COMPLETE_WITH
Hi. By convention, the tab-complete logical replication subscription parameters are listed in the COMPLETE_WITH lists in alphabetical order, but when the "disable_on_error" parameter was added this was not done. This patch just tidies that up; there is no functional change. PSA -- Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Re-order-disable_on_error-in-tab-complete-COMPLET.patch
Re: Prevent writes on large objects in read-only transactions
On Mon, 4 Jul 2022 15:51:32 +0900 Michael Paquier wrote: > On Wed, Jun 29, 2022 at 05:29:50PM +0900, Yugo NAGATA wrote: > > Thank you for reviewing the patch. I attached the updated patch. > > Thanks for the new version. I have looked at that again, and the set > of changes seem fine (including the change for the alternate output). > So, applied. Thanks! -- Yugo NAGATA
Re: Allowing REINDEX to have an optional name
On Sun, Jul 03, 2022 at 05:41:31PM -0400, Tom Lane wrote: > This is marked as Ready for Committer, but that seems unduly > optimistic. Please note that patch authors should not mark a patch as RfC by themselves. This is something that a reviewer should do. > The cfbot shows that it's failing on all platforms --- > and not in the same way on each, suggesting there are multiple > problems. A wild guess is that this comes from the patch that manipulates get_database_name(), something that there is no need for as long as the routine is called once in ReindexMultipleTables(). -- Michael
Re: Export log_line_prefix(); useful for emit_log_hook.
On Wed, Jun 29, 2022 at 03:09:42PM +0200, Alvaro Herrera wrote: > Hmm, maybe your hypothetical hook would prefer to use a different > setting for log line prefix than Log_line_prefix, so it would make sense > to pass the format string as a parameter to the function instead of > relying on the GUC global. +1. -- Michael
Re: Prevent writes on large objects in read-only transactions
On Wed, Jun 29, 2022 at 05:29:50PM +0900, Yugo NAGATA wrote: > Thank you for reviewing the patch. I attached the updated patch. Thanks for the new version. I have looked at that again, and the set of changes seems fine (including the change for the alternate output). So, applied. -- Michael
Re: Perform streaming logical transactions by background workers and parallel apply
Below are some review comments for patch v14-0004: v14-0004 4.0 General. This comment is an after-thought but as I write this mail I am wondering if most of this 0004 patch is even necessary at all? Instead of introducing a new column and all the baggage that goes with it, can't the same functionality be achieved just by toggling the streaming mode 'substream' value from 'p' (parallel) to 't' (on) whenever an error occurs causing a retry? Anyway, if you do change it this way then most of the following comments can be disregarded. == 4.1 Commit message Patch needs an explanatory commit message. Currently, there is nothing. == 4.2 doc/src/sgml/catalogs.sgml + + + subretry bool + + + If true, the subscription will not try to apply streaming transaction + in parallel mode. See +for more information. + + I think it is overkill to mention anything about the streaming=parallel here because IIUC it is nothing to do with this field at all. I thought you really only need to say something brief like: SUGGESTION: True if the previous apply change failed and a retry was required. == 4.3 doc/src/sgml/ref/create_subscription.sgml @@ -244,6 +244,10 @@ CREATE SUBSCRIPTION subscription_nameon + mode. I did not think it is good to say "we" in the docs. SUGGESTION When applying a streaming transaction, if either requirement is not met, the background worker will exit with an error. Parallel mode is disregarded when retrying; instead the transaction will be applied using streaming = on. == 4.4 .../replication/logical/applybgworker.c + /* + * We don't start new background worker if retry was set as it's possible + * that the last time we tried to apply a transaction in background worker + * and the check failed (see function apply_bgworker_relation_check). So + * we will try to apply this transaction in apply worker. 
+ */ SUGGESTION (simplified, and remove "we") Don't use apply background workers for retries, because it is possible that the last time we tried to apply a transaction using an apply background worker the checks failed (see function apply_bgworker_relation_check). ~~~ 4.5 + elog(DEBUG1, "retry to apply an streaming transaction in apply " + "background worker"); IMO the log message is too confusing SUGGESTION "apply background workers are not used for retries" == 4.6 src/backend/replication/logical/worker.c 4.6.a - apply_handle_commit + /* Set the flag that we will not retry later. */ + set_subscription_retry(false); But the comment is wrong, isn't it? Shouldn’t it just say that we are not *currently* retrying? And can’t this just anyway be redundant if only the catalog column has a DEFAULT value of false? 4.6.b - apply_handle_prepare Ditto 4.6.c - apply_handle_commit_prepared Ditto 4.6.d - apply_handle_rollback_prepared Ditto 4.6.e - apply_handle_stream_prepare Ditto 4.6.f - apply_handle_stream_abort Ditto 4.6.g - apply_handle_stream_commit Ditto ~~~ 4.7 src/backend/replication/logical/worker.c 4.7.a - start_table_sync @@ -3894,6 +3917,9 @@ start_table_sync(XLogRecPtr *origin_startpos, char **myslotname) } PG_CATCH(); { + /* Set the flag that we will retry later. */ + set_subscription_retry(true); Maybe this should say more like "Flag that the next apply will be the result of a retry" 4.7.b - start_apply Ditto ~~~ 4.8 src/backend/replication/logical/worker.c - set_subscription_retry + +/* + * Set subretry of pg_subscription catalog. + * + * If retry is true, subscriber is about to exit with an error. Otherwise, it + * means that the changes was applied successfully. + */ +static void +set_subscription_retry(bool retry) "changes" -> "change" ? ~~~ 4.8 src/backend/replication/logical/worker.c - set_subscription_retry Isn't this flag only ever used when streaming=parallel?
But it does not seem to be checking that anywhere before potentially executing all this code, which maybe will never be used. == 4.9 src/include/catalog/pg_subscription.h @@ -76,6 +76,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW bool subdisableonerr; /* True if a worker error should cause the * subscription to be disabled */ + bool subretry; /* True if the previous apply change failed. */ I was wondering if you can give this column a DEFAULT value of false, because then perhaps most of the patch code from worker.c may be able to be eliminated. == 4.10 .../subscription/t/032_streaming_apply.pl I felt that the test cases all seem to blend together. IMO it will be more readable if the main text parts are visually separated e.g using a comment like: # = -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi Alvaro, hi David, Thanks for reviewing this and the interesting examples! Wanted to give a bit of extra insight as to why I'd love to have a system that can lazily emit JIT code and hence creates roughly a module per function: In the end I'm hoping that we can migrate to a system where we only JIT after a configurable cost has been exceeded for this node, as well as a configurable amount of rows has actually been processed. Reason is that this would safeguard against some problematic planning issues wrt JIT (node not being executed, row count being massively off). It would also allow for more finegrained control, with a cost system similar to most other planning costs, as they are also per node and not globally, and would potentially allow us to only JIT things where we expect to truly gain any runtime compared to the costs of doing it. If this means we have to invest more in making it cheap(er) to emit modules, I'm all for that. Kudos to David for fixing the caching in that sense :) @Andres if there's any other things we ought to fix to make this cheap (enough) compared to the previous code I'd love to know your thoughts. Best, Luc Vlaming (ServiceNow) From: David Geier Sent: Wednesday, June 29, 2022 11:03 AM To: Alvaro Herrera Cc: Luc Vlaming ; Andres Freund ; PostgreSQL-development Subject: Re: Lazy JIT IR code generation to increase JIT speed with partitions [External Email] Hi Alvaro, That's a very interesting case and might indeed be fixed or at least improved by this patch. I tried to reproduce this, but at least when running a simple, serial query with increasing numbers of functions, the time spent per function is linear or even slightly sub-linear (same as Tom observed in [1]). I also couldn't reproduce the JIT runtimes you shared, when running the attached catalog query. 
The catalog query ran serially for me with the following JIT stats: JIT: Functions: 169 Options: Inlining true, Optimization true, Expressions true, Deforming true Timing: Generation 12.223 ms, Inlining 17.323 ms, Optimization 388.491 ms, Emission 283.464 ms, Total 701.501 ms Is it possible that the query ran in parallel for you? For parallel queries, every worker JITs all of the functions it uses. Even though the workers might JIT the functions in parallel, the time reported in the EXPLAIN ANALYZE output is the sum of the time spent by all workers. With this patch applied, the JIT time drops significantly, as many of the generated functions remain unused. JIT: Modules: 15 Functions: 26 Options: Inlining true, Optimization true, Expressions true, Deforming true Timing: Generation 1.931 ms, Inlining 0.722 ms, Optimization 67.195 ms, Emission 70.347 ms, Total 140.195 ms Of course, this does not prove that the nonlinearity that you observed went away. Could you share with me how you ran the query so that I can reproduce your numbers on master to then compare them with the patched version? Also, which LLVM version did you run with? I'm currently running with LLVM 13. Thanks! -- David Geier (ServiceNow) On Mon, Jun 27, 2022 at 5:37 PM Alvaro Herrera wrote: On 2021-Jan-18, Luc Vlaming wrote: > I would like this topic to somehow progress and was wondering what other > benchmarks / tests would be needed to have some progress? I've so far > provided benchmarks for small(ish) queries and some tpch numbers, assuming > those would be enough. 
Hi, some time ago I reported a case[1] where our JIT implementation does a very poor job and perhaps the changes that you're making could explain what is going on, and maybe even fix it: [1] https://postgr.es/m/20241706.wqq7xoyigwa2@alvherre.pgsql The query for which I investigated the problem involved some pg_logical metadata tables, so I didn't post it anywhere public; but the blog post I found later contains a link to a query that shows the same symptoms, and which is luckily still available online: https://gist.github.com/saicitus/251ba20b211e9e73285af35e61b19580 I attach it here in case it goes missing sometime. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
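The lazy-JIT policy Luc describes (compile a node's expressions only after both a per-node cost threshold and a processed-row threshold have been exceeded, then reuse the compiled code) can be sketched as a small gate. The names and threshold values below are hypothetical, not proposed GUCs or actual executor structures.

```c
/* Hypothetical per-node lazy-JIT gate: interpret the expression until
 * the node both carries enough planner cost and has actually processed
 * enough rows, then "compile" once and reuse the compiled form. */
typedef struct {
    double cost;        /* planner cost attributed to this node */
    long   rows_seen;   /* rows actually processed so far */
    int    compiled;    /* 0 = interpreted, 1 = JIT-compiled */
} NodeJitState;

#define JIT_COST_THRESHOLD  100.0
#define JIT_ROWS_THRESHOLD  1000

/* Called once per row; returns whether the compiled path is in use. */
static int maybe_compile(NodeJitState *s) {
    s->rows_seen++;
    if (!s->compiled &&
        s->cost > JIT_COST_THRESHOLD &&
        s->rows_seen >= JIT_ROWS_THRESHOLD)
        s->compiled = 1;    /* emit this node's module here, lazily */
    return s->compiled;
}
```

The key property, and the reason lazy per-node module emission matters, is that a node the planner never executes, or whose row estimate was wildly high, never pays the compilation cost at all.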