Re: Removing unneeded self joins

2024-02-23 Thread Noah Misch
On Sat, Feb 24, 2024 at 12:36:59AM +0200, Alexander Korotkov wrote:
> On Thu, Feb 22, 2024 at 10:51 AM Andrei Lepikhov  
> wrote:
> > On 21/2/2024 14:26, Richard Guo wrote:
> > > I think the right fix for these issues is to introduce a new element
> > > 'sublevels_up' in ReplaceVarnoContext, and enhance replace_varno_walker
> > > to 1) recurse into subselects with sublevels_up increased, and 2)
> > > perform the replacement only when varlevelsup is equal to sublevels_up.
> > This code looks good. No idea how we have lost it before.
> 
> Thanks to Richard for the patch and to Andrei for review.  I also find
> the code looking good.  Pushed with minor edits from me.

I feel this, commit 466979e, misses a few of our project standards:

- The patch makes many non-whitespace changes to existing test queries.  This
  makes it hard to review the consequences of the non-test part of the patch.
  Did you minimize such edits?  Of course, not every such edit is avoidable.

- The commit message doesn't convey whether this is refactoring or is a bug
  fix.  This makes it hard to write release notes, among other things.  From
  this mailing list thread, I gather it's a bug fix in 489072ab7a, hence
  v17-specific.  The commit message for 489072ab7a is also silent about that
  commit's status as refactoring or as a bug fix.

- Normally, I could answer the previous question by reading the test case
  diffs.  However, in addition to the first point about non-whitespace
  changes, the first three join.sql patch hunks just change whitespace.
  Worse, since they move line breaks, "git diff -w" doesn't filter them out.

To what extent are those community standards vs. points of individual
committer preference?  Please tell me where I'm wrong here.




Re: incremental backup mishandles XLOG_DBASE_CREATE_FILE_COPY

2024-02-23 Thread Noah Misch
On Fri, Feb 23, 2024 at 08:47:52PM +0530, Robert Haas wrote:
> If XLOG_DBASE_CREATE_FILE_COPY occurs between an incremental backup
> and its reference backup, every relation whose DB OID and tablespace
> OID match the corresponding values in that record should be backed up
> in full. Currently that's not happening, because the WAL summarizer
> doesn't see the XLOG_DBASE_CREATE_FILE_COPY as referencing any
> particular relfilenode and so basically ignores it. The same happens
> for XLOG_DBASE_CREATE_WAL_LOG, but that case is OK because that only
> covers creating the directory itself, not anything underneath it, and
> there will be separate WAL records telling us the relfilenodes created
> below the new directory and the pages modified therein.

XLOG_DBASE_CREATE_WAL_LOG creates PG_VERSION in addition to creating the
directory.  I see your patch covers it.

> I thought about whether there were any other WAL records that have
> similar problems to XLOG_DBASE_CREATE_FILE_COPY and didn't come up
> with anything. If anyone knows of any similar cases, please let me
> know.

Regarding records the summarizer potentially can't ignore that don't deal in
relfilenodes, these come to mind:

XLOG_DBASE_DROP - covered in this thread's patch
XLOG_RELMAP_UPDATE
XLOG_TBLSPC_CREATE
XLOG_TBLSPC_DROP
XLOG_XACT_PREPARE

Also, any record that writes XIDs needs to update nextXid; likewise for other
ID spaces.  See the comment at "XLOG stuff" in heap_lock_tuple().  Perhaps you
don't summarize past a checkpoint, making that irrelevant.

If walsummarizer.c handles any of the above, my brief look missed it.  I also
didn't find the string "clog" or "slru" anywhere in dc21234 "Add support for
incremental backup", 174c480 "Add a new WAL summarizer process.", or thread
https://postgr.es/m/flat/CA%2BTgmoYOYZfMCyOXFyC-P%2B-mdrZqm5pP2N7S-r0z3_402h9rsA%40mail.gmail.com
"trying again to get incremental backup".  I wouldn't be surprised if you
treat clog, pg_filenode.map, and/or 2PC state files as unconditionally
non-incremental, in which case some of the above doesn't need explicit
summarization code.  I stopped looking for that logic, though.

> --- a/src/backend/postmaster/walsummarizer.c
> +++ b/src/backend/postmaster/walsummarizer.c

> +  * Technically, this special handling is only needed in the case of
> +  * XLOG_DBASE_CREATE_FILE_COPY, because that can create a whole bunch
> +  * of relation files in a directory without logging anything
> +  * specific to each one. If we didn't mark the whole DB OID/TS OID
> +  * combination in some way, then a tablespace that was dropped after
s/tablespace/database/ I suspect.
> +  * the reference backup and recreated using the FILE_COPY method prior
> +  * to the incremental backup would look just like one that was never
> +  * touched at all, which would be catastrophic.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-23 Thread Amit Kapila
On Sat, Feb 24, 2024 at 2:31 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > Seeing no objections, I have pushed the required test changes to silence 
> > the BF.
>
> The farm is still unhappy in the v16 branch.
>

As this feature and tests were introduced in 15, I think we should
backpatch till 15. I'll do that early next week unless you or someone
else thinks otherwise.

-- 
With Regards,
Amit Kapila.




Re: Potential issue in ecpg-informix decimal converting functions

2024-02-23 Thread Michael Paquier
On Fri, Feb 23, 2024 at 06:03:41PM +0300, a.ima...@postgrespro.ru wrote:
> Thanks for the advice, the patch will be registered for the next commitfest.

The risk looks really minimal to me, but playing with error codes
while the logic of the function is unchanged does not strike me as
something to backpatch as it could slightly break applications.  On
HEAD, I'm OK with that.
--
Michael




Re: ExecAppendAsyncEventWait() in REL_14_STABLE can corrupt PG_exception_stack

2024-02-23 Thread Michael Paquier
On Fri, Feb 23, 2024 at 01:21:14PM +0300, Alexander Pyhalov wrote:
> Recent commit 555276f8594087ba15e0d58e38cd2186b9f39f6d introduced final
> cleanup of node->as_eventset in ExecAppendAsyncEventWait().
> Unfortunately, now this function can return in the middle of TRY/FINALLY
> block, without restoring PG_exception_stack.
> 
> We found this while working on our FDW. Unfortunately, I couldn't reproduce
> the issue with postgres_fdw, but it seems it is also affected.

Ugh, yes, you are obviously right that the early return is wrong.
I'll look into fixing that where appropriate.  Thanks for the report,
Alexander!
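
For the archives, a minimal sketch of the failure mode (the condition and
helper names are hypothetical): PG_END_TRY() is what restores
PG_exception_stack, so an early return skips it.

PG_TRY();
{
	if (nothing_to_do())	/* hypothetical */
		return;				/* BUG: skips PG_END_TRY(), leaving
								 * PG_exception_stack pointing into a
								 * dead stack frame */
	do_work();				/* hypothetical */
}
PG_FINALLY();
{
	cleanup();				/* hypothetical */
}
PG_END_TRY();				/* this is what restores PG_exception_stack */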
--
Michael




Re: Proposal for implementing OCSP Stapling in PostgreSQL

2024-02-23 Thread David Zhang

Hi Hackers,

This is the 2nd version patch with following updates:

1) Changed the frontend SSL parameter from `ssl_ocsp_stapling` to 
`sslocspstapling` to align with other SSL parameters.


2) Documented both the backend parameter `ssl_ocsp_file` and the 
frontend parameter `sslocspstapling`.


3) Implemented a check for the `nextUpdate` field in the OCSP response. 
If it is present but expired, the TLS connection will fail.
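
That check can lean on stock OpenSSL; a sketch, where thisupd/nextupd are
assumed to come from OCSP_resp_find_status():

	/* nsec = 0: no clock-skew slack; maxsec = -1: no maximum age check */
	if (!OCSP_check_validity(thisupd, nextupd, 0, -1))
		return -1;			/* reject the response, failing the handshake */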


To add the test cases for OCSP Stapling, I think I should 1) add one 
section to conf/cas.config to generate `ocsp_ca.crt`; 2) use this 
`ocsp_ca` to sign some OCSP responses for server side certificates with 
`good`, `revoked` and `unknown`, and then 3) add some test cases to 
t/001_ssltests.pl.


Any comments or feedback would be greatly appreciated!

Thank you,

David


On 2024-02-05 3:51 p.m., David Zhang wrote:

Hello PostgreSQL Hackers,

This proposal suggests implementing OCSP Stapling in PostgreSQL as an 
alternative and more efficient method for checking certificate 
revocation, aligning with the trend shift from Certificate Revocation 
Lists (CRL).



1. benefits
OCSP Stapling offers several advantages over traditional CRL checks, 
including:


*) enhances user trust and real-time certificate verification without 
relying on potentially outdated CRLs.
*) helps address privacy concerns associated with traditional OCSP 
checks, where the client contacts the OCSP responder directly.
*) reduces latency by eliminating the need for the client to perform 
an additional round-trip to the OCSP responder.
*) efficient resource utilization by allowing the server to cache and 
reuse OCSP responses.




2. a POC patch with below changes:
*) a new configuration option 'ssl_ocsp_file' to enable/disable OCSP 
Stapling and specify OCSP responses for PostgreSQL servers. For 
instance, ssl_ocsp_file = '_server.resp'


*) a server-side callback function responsible for generating OCSP 
stapling responses. This function comes into play only when a client 
requests the server's certificate status during the SSL/TLS handshake (a 
sketch of this callback follows the list).


*) a new connection parameter 'ssl_ocsp_stapling' on the client side. 
For example, when 'ssl_ocsp_stapling=1', the psql client will send a 
certificate status request to the PostgreSQL server.


*) a client-side callback function within the libpq interface to 
validate and check the stapled OCSP response received from the server. 
If the server's certificate status is valid, the TLS handshake 
continues; otherwise, the connection is rejected.
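
To illustrate the server-side callback mentioned above, here is what it
amounts to with stock OpenSSL APIs. This is a simplified sketch of the
mechanism, not the patch itself; loading the response on every handshake
rather than caching it is one of the simplifications.

#include <openssl/ssl.h>
#include <openssl/ocsp.h>

/* Invoked by OpenSSL only when the client sent a certificate status
 * request; staples the pre-fetched DER-encoded OCSP response. */
static int
ocsp_status_cb(SSL *ssl, void *arg)
{
	const char *path = (const char *) arg;	/* e.g. ssl_ocsp_file */
	BIO		   *bio = BIO_new_file(path, "rb");
	OCSP_RESPONSE *resp;
	unsigned char *der = NULL;
	int			len;

	if (bio == NULL)
		return SSL_TLSEXT_ERR_NOACK;		/* nothing to staple */
	resp = d2i_OCSP_RESPONSE_bio(bio, NULL);
	BIO_free(bio);
	if (resp == NULL)
		return SSL_TLSEXT_ERR_NOACK;
	len = i2d_OCSP_RESPONSE(resp, &der);	/* re-encode as DER */
	OCSP_RESPONSE_free(resp);
	if (len <= 0)
		return SSL_TLSEXT_ERR_NOACK;
	SSL_set_tlsext_status_ocsp_resp(ssl, der, len); /* takes ownership */
	return SSL_TLSEXT_ERR_OK;
}

/* Registered once on the server's SSL_CTX:
 *   SSL_CTX_set_tlsext_status_cb(ctx, ocsp_status_cb);
 *   SSL_CTX_set_tlsext_status_arg(ctx, ocsp_file_path);
 */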




3.  test cases for 'make check' are not yet ready as they could be 
complicated, but basic tests can be performed as demonstrated below:
To run the tests, OpenSSL tools are required to simulate the OCSP 
responder for generating OCSP responses. Additionally, initial 
certificate generation, including a self-signed root CA, OCSP response 
signing certificate, and PostgreSQL server certificate, is needed.


*) add ocsp attributes to openssl.cnf
$ openssl version
OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022)

$ diff openssl-ocsp.cnf /etc/ssl/openssl.cnf
204d203
< authorityInfoAccess = OCSP;URI:http://127.0.0.1:6655
232,235d230
< [ v3_ocsp ]
< basicConstraints = CA:FALSE
< keyUsage = nonRepudiation, digitalSignature, keyEncipherment
< extendedKeyUsage = OCSPSigning
255c250
< keyUsage = critical, cRLSign, digitalSignature, keyCertSign
---
>


*) prepare OCSP responder for generating OCSP response
$ mkdir -p demoCA/newcerts
$ touch demoCA/index.txt
$ echo '01' > demoCA/serial

# create a self-signed root CA
$ openssl req -new -nodes -out rootCA.csr -keyout rootCA.key -subj 
"/C=CA/ST=BC/L=VAN/O=IDO/OU=DEV/CN=rootCA"
$ openssl x509 -req -in rootCA.csr -days 3650 -extfile 
openssl-ocsp.cnf -extensions v3_ca -signkey rootCA.key -out rootCA.crt


# create a certificate for OCSP responder
$ openssl req -new -nodes -out ocspSigning.csr -keyout ocspSigning.key 
-subj "/C=CA/ST=BC/L=VAN/O=IDO/OU=DEV/CN=ocspSigner"
$ openssl ca -keyfile rootCA.key -cert rootCA.crt -in ocspSigning.csr 
-out ocspSigning.crt -config openssl-ocsp.cnf -extensions v3_ocsp

    Sign the certificate? [y/n]:y
    1 out of 1 certificate requests certified, commit? [y/n]y

# create a certificate for PostgreSQL server
$ openssl req -new -nodes -out server.csr -keyout server.key -subj 
"/C=CA/ST=BC/L=VAN/O=IDO/OU=DEV/CN=server"
$ openssl ca -batch -days 365 -keyfile rootCA.key -cert rootCA.crt 
-config openssl-ocsp.cnf -out server.crt -infiles server.csr



# start OCSP responder
$ openssl ocsp -index demoCA/index.txt -port 6655 -rsigner 
ocspSigning.crt -rkey ocspSigning.key -CA rootCA.crt -text


# make sure PostgreSQL server's certificate is 'good'
$ openssl ocsp -issuer rootCA.crt -url http://127.0.0.1:6655 
-resp_text -noverify -cert server.crt


# generate OCSP response when certificate status is 'good' and save as 
_server.resp-good:
$ openssl ocsp -issuer rootCA.crt -cert server.crt -url 
http://127.0.0.1:6655 -respout _server.resp-good



# 

Re: RangeTblEntry jumble omissions

2024-02-23 Thread Tom Lane
Julien Rouhaud  writes:
> On Fri, Feb 23, 2024 at 04:26:53PM +0100, Peter Eisentraut wrote:
>> - funcordinality
>> This was probably just forgotten.  It should be included because the WITH
>> ORDINALITY clause changes the query result.

> Agreed.

Seems OK.

>> - lateral
>> Also probably forgotten.  A query specifying LATERAL is clearly different
>> from one without it.

> Agreed.

Nah ... I think that LATERAL should be ignored on essentially the
same grounds on which you argue for ignoring aliases.  If it
affects the query's semantics, it's because there is a lateral
reference in the subject subquery or function, and that reference
already contributes to the query hash.  If there is no such
reference, then LATERAL is a noise word.  It doesn't help any that
LATERAL is actually optional for functions, making it certainly a
noise word there.

IIRC, the parser+planner cooperatively fix things so that the final
state of an RTE's lateral field reflects reality.  But if we are
hashing before that's happened, it's not worth all that much.

regards, tom lane




Re: Documentation: warn about two_phase when altering a subscription

2024-02-23 Thread Tristen Raab
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

Hello,

I've reviewed your patch, it applies correctly and the documentation builds 
without any errors. As for the content of the patch itself, I think so far it's 
good but would make two modifications. I like how the patch was worded 
originally when referring to the subscription, stating these parameters were 
'in' the subscription rather than 'by' it. So I'd propose changing

> parameters specified by the subscription. When creating the slot, ensure

to 

> parameters specified in the subscription. When creating the slot, ensure

and secondly the section "ensure the slot properties failover and two_phase 
match their counterpart parameters of the subscription" sounds a bit clunky. So 
I'd also propose changing:

> the slot properties failover and two_phase
> match their counterpart parameters of the subscription.

to

> the slot properties failover and two_phase
> match their counterpart parameters in the subscription.

I feel this makes the description flow a bit better when reading. But other 
than that I think it's quite clear.

kind regards,

---
Tristen Raab
Highgo Software Canada
www.highgo.ca

Re: Add lookup table for replication slot invalidation causes

2024-02-23 Thread Michael Paquier
On Fri, Feb 23, 2024 at 09:04:04AM +1100, Peter Smith wrote:
> I would've just removed every local variable instead of adding more of
> them. I also felt the iteration starting from RS_INVAL_NONE instead of
> 0 is asserting RS_INVAL_NONE must always be the first enum and can't
> be rearranged. Probably it will never happen, but why require it?

FWIW, I think that the code is OK as-is, so I'd just let it be for
now.
--
Michael




Re: Printing backtrace of postgres processes

2024-02-23 Thread Michael Paquier
On Fri, Feb 23, 2024 at 04:39:47PM +0100, Christoph Berg wrote:
> I'd be concerned about the cost of doing that as part of the startup
> of every single backend process. Shouldn't this rather be done within
> the postmaster so it's automatically inherited by forked backends?
> (EXEC_BACKEND systems probably don't have libgcc I guess.)

Something like this can be measured with a bunch of concurrent
connections attempting connections at a very high rate, like pgbench
with an empty script and -C, for local connections.
--
Michael




Re: RangeTblEntry jumble omissions

2024-02-23 Thread Julien Rouhaud
Hi,

On Fri, Feb 23, 2024 at 04:26:53PM +0100, Peter Eisentraut wrote:
>
> - alias
>
> Currently, two queries like
>
> SELECT * FROM t1 AS foo
> SELECT * FROM t1 AS bar
>
> are counted together by pg_stat_statements -- that might be ok, but they
> both get listed under whichever one is run first, so here if you are looking
> for the "AS bar" query, you won't find it.

I think this one is intentional.  This alias won't change the query behavior or
the field names so it's good to avoid extraneous entries.  It's true that you
then won't find something matching "AS bar", but it's not something you can
rely on anyway.

If you first execute "select * from t1 as foo" and then "SELECT * FROM t1 AS
foo" then you won't find anything matching "AS foo" either.  There isn't even
any guarantee that the stored query text will be jumbled.

> - join_using_alias
>
> Similar situation, currently
>
> SELECT * FROM t1 JOIN t2 USING (a, b)
> SELECT * FROM t1 JOIN t2 USING (a, b) AS x
>
> are counted together.

IMHO same as above.

> - funcordinality
>
> This was probably just forgotten.  It should be included because the WITH
> ORDINALITY clause changes the query result.

Agreed.

> - lateral
>
> Also probably forgotten.  A query specifying LATERAL is clearly different
> from one without it.

Agreed.




Re: Removing unneeded self joins

2024-02-23 Thread Alexander Korotkov
On Thu, Feb 22, 2024 at 10:51 AM Andrei Lepikhov
 wrote:
> On 21/2/2024 14:26, Richard Guo wrote:
> > I think the right fix for these issues is to introduce a new element
> > 'sublevels_up' in ReplaceVarnoContext, and enhance replace_varno_walker
> > to 1) recurse into subselects with sublevels_up increased, and 2)
> > perform the replacement only when varlevelsup is equal to sublevels_up.
> This code looks good. No idea how we have lost it before.

Thanks to Richard for the patch and to Andrei for review.  I also find
the code looking good.  Pushed with minor edits from me.

--
Regards,
Alexander Korotkov




Re: cleanup patches for dshash

2024-02-23 Thread Nathan Bossart
On Sun, Jan 21, 2024 at 11:07:15PM -0600, Nathan Bossart wrote:
> I attempted to fix this in v2 of the patch set.

If there are no objections, I plan to commit these patches early next week.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-02-23 Thread Nathan Bossart
(Apologies in advance for anything I'm bringing up that we've already
covered somewhere else.)

On Fri, Feb 16, 2024 at 04:03:55PM -0800, Jeff Davis wrote:
> Note the changes in amcheck. It's creating functions and calling those
> functions from the comparators, and so the comparators need to set the
> search_path. I don't think that's terribly common, but does represent a
> behavior change and could break something.

Why is this change needed?  Is the idea to make amcheck follow the same
rules as maintenance commands to encourage folks to set up index functions
correctly?  Or is amcheck similarly affected by search_path tricks?

>  void
>  InitializeSearchPath(void)
>  {
> +	/* Make the context we'll keep search path cache hashtable in */
> +	SearchPathCacheContext = AllocSetContextCreate(TopMemoryContext,
> +												   "search_path processing cache",
> +												   ALLOCSET_DEFAULT_SIZES);
> +
>   if (IsBootstrapProcessingMode())
>   {
>   /*
> @@ -4739,11 +4744,6 @@ InitializeSearchPath(void)
>   }
>   else
>   {
> -	/* Make the context we'll keep search path cache hashtable in */
> -	SearchPathCacheContext = AllocSetContextCreate(TopMemoryContext,
> -												   "search_path processing cache",
> -												   ALLOCSET_DEFAULT_SIZES);
> -

What is the purpose of this change?

> + SetConfigOption("search_path", GUC_SAFE_SEARCH_PATH, PGC_USERSET,
> + PGC_S_SESSION);

I wonder if it's worth using PGC_S_INTERACTIVE or introducing a new value
for these.

> +/*
> + * Safe search path when executing code as the table owner, such as during
> + * maintenance operations.
> + */
> +#define GUC_SAFE_SEARCH_PATH "pg_catalog, pg_temp"

Is including pg_temp actually safe?  I worry that a user could use their
temporary schema to inject objects that would take the place of
non-schema-qualified stuff in functions.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-23 Thread Tom Lane
Amit Kapila  writes:
> Seeing no objections, I have pushed the required test changes to silence the 
> BF.

The farm is still unhappy in the v16 branch.

regards, tom lane




Re: locked reads for atomics

2024-02-23 Thread Nathan Bossart
Here is a v3 of the patch set with the first draft of the commit messages.
There are no code differences between v2 and v3.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From f1441aca96f157c5557d9a961fe85902b510f293 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 23 Feb 2024 14:23:15 -0600
Subject: [PATCH v3 1/2] Introduce atomic read/write functions with full
 barrier semantics.

Writing correct code using atomic variables is often difficult due
to the implied memory barrier semantics (or lack thereof) of the
underlying operations.  This commit introduces atomic read/write
functions with full barrier semantics to ease this cognitive load
in areas that are not performance critical.  For example, some
spinlocks protect a single value, and these new functions make it
easy to convert the value to an atomic variable (thus eliminating
the need for the spinlock) without modifying the barrier semantics
previously provided by the spinlock.  However, since these
functions are less performant than the other atomic reads/writes,
they are not suitable for every use-case.

The base implementations for these new functions are atomic
exchanges (for writes) and atomic fetch/adds with 0 (for reads).
These implementations can be overwritten with better architecture-
specific versions as they are discovered.

This commit leaves converting existing code to use these new
functions as a future exercise.

Reviewed-by: Andres Freund, Yong Li, Jeff Davis
Discussion: https://postgr.es/m/20231110205128.GB1315705%40nathanxps13
---
 src/include/port/atomics.h | 59 ++
 src/include/port/atomics/generic.h | 36 ++
 2 files changed, 95 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index bf151037f7..36a0ac9d85 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -239,6 +239,26 @@ pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
 	return pg_atomic_read_u32_impl(ptr);
 }
 
+/*
+ * pg_atomic_read_membarrier_u32 - read with barrier semantics.
+ *
+ * This read is guaranteed to return the current value, provided that the value
+ * is only ever updated via operations with barrier semantics, such as
+ * pg_atomic_compare_exchange_u32() and pg_atomic_write_membarrier_u32().  Note
+ * that this is less performant than pg_atomic_read_u32(), but it may be easier
+ * to reason about correctness with this function in less performance-sensitive
+ * code.
+ *
+ * Full barrier semantics.
+ */
+static inline uint32
+pg_atomic_read_membarrier_u32(volatile pg_atomic_uint32 *ptr)
+{
+	AssertPointerAlignment(ptr, 4);
+
+	return pg_atomic_read_membarrier_u32_impl(ptr);
+}
+
 /*
  * pg_atomic_write_u32 - write to atomic variable.
  *
@@ -276,6 +296,27 @@ pg_atomic_unlocked_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
 	pg_atomic_unlocked_write_u32_impl(ptr, val);
 }
 
+/*
+ * pg_atomic_write_membarrier_u32 - write with barrier semantics.
+ *
+ * The write is guaranteed to succeed as a whole, i.e., it's not possible to
+ * observe a partial write for any reader.  Note that this correctly interacts
+ * with both pg_atomic_compare_exchange_u32() and
+ * pg_atomic_read_membarrier_u32().  While this may be less performant than
+ * pg_atomic_write_u32() and pg_atomic_unlocked_write_u32(), it may be easier
+ * to reason about correctness with this function in less performance-sensitive
+ * code.
+ *
+ * Full barrier semantics.
+ */
+static inline void
+pg_atomic_write_membarrier_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	AssertPointerAlignment(ptr, 4);
+
+	pg_atomic_write_membarrier_u32_impl(ptr, val);
+}
+
 /*
  * pg_atomic_exchange_u32 - exchange newval with current value
  *
@@ -429,6 +470,15 @@ pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr)
 	return pg_atomic_read_u64_impl(ptr);
 }
 
+static inline uint64
+pg_atomic_read_membarrier_u64(volatile pg_atomic_uint64 *ptr)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_read_membarrier_u64_impl(ptr);
+}
+
 static inline void
 pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
 {
@@ -438,6 +488,15 @@ pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
 	pg_atomic_write_u64_impl(ptr, val);
 }
 
+static inline void
+pg_atomic_write_membarrier_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	pg_atomic_write_membarrier_u64_impl(ptr, val);
+}
+
 static inline uint64
 pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval)
 {
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index c3b5f6d9a4..6113ab62a3 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -243,6 +243,24 @@ pg_atomic_sub_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 sub_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_READ_MEMBARRIER_U32) && 
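
(The hunk is cut off above; per the commit message, the generic fallbacks
amount to the following sketch - a fetch-add of zero for the read and an
exchange for the write, both of which imply full barriers:)

static inline uint32
pg_atomic_read_membarrier_u32_impl(volatile pg_atomic_uint32 *ptr)
{
	/* fetch-add of 0 returns the current value with a full barrier */
	return pg_atomic_fetch_add_u32_impl(ptr, 0);
}

static inline void
pg_atomic_write_membarrier_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
{
	/* exchange implies a full barrier; the old value is discarded */
	(void) pg_atomic_exchange_u32_impl(ptr, val);
}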

Re: locked reads for atomics

2024-02-23 Thread Nathan Bossart
On Fri, Feb 23, 2024 at 10:25:00AM -0800, Jeff Davis wrote:
> To be clear:
> 
>   x = pg_atomic_[read|write]_membarrier_u64();
> 
> is semantically equivalent to:
> 
>   pg_memory_barrier();
>   x = pg_atomic_[read|write]_u64();
>   pg_memory_barrier();
> 
> ?
> 
> If so, that does seem more convenient.

I think that's about right.  The upthread feedback from Andres [0] provides
some additional context.

[0] https://postgr.es/m/20231110231150.fjm77gup2i7xu6hc%40alap3.anarazel.de
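
As a concrete usage sketch (function names from the v3 patch upthread; the
struct is hypothetical) - a counter that previously needed a spinlock plus
explicit barriers:

typedef struct SharedCounter
{
	pg_atomic_uint64 value;		/* was: slock_t mutex + uint64 value */
} SharedCounter;

/* Store with full-barrier semantics, like a spinlocked write. */
static void
counter_set(SharedCounter *c, uint64 v)
{
	pg_atomic_write_membarrier_u64(&c->value, v);
}

/* Guaranteed to see the most recent counter_set() value. */
static uint64
counter_get(SharedCounter *c)
{
	return pg_atomic_read_membarrier_u64(&c->value);
}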

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: locked reads for atomics

2024-02-23 Thread Jeff Davis
On Fri, 2024-02-23 at 10:17 -0600, Nathan Bossart wrote:
> The idea is
> to provide an easy way to remove spinlocks, etc. and use atomics for
> less
> performance-sensitive stuff.  The implementations are intended to be
> relatively inexpensive and might continue to improve in the future,
> but the
> functions are primarily meant to help reason about correctness.

To be clear:

  x = pg_atomic_[read|write]_membarrier_u64();

is semantically equivalent to:

  pg_memory_barrier();
  x = pg_atomic_[read|write]_u64();
  pg_memory_barrier();

?

If so, that does seem more convenient.

Regards,
Jeff Davis




Re: SQL Property Graph Queries (SQL/PGQ)

2024-02-23 Thread Tomas Vondra
On 2/23/24 17:15, Peter Eisentraut wrote:
> On 16.02.24 20:23, Andres Freund wrote:
>> One aspect that I'm concerned with structurally is that the
>> transformation,
>> from property graph queries to something postgres understands, is done
>> via the
>> rewrite system. I doubt that that is a good idea. For one it bars the
>> planner
>> from making plans that benefit from the graph query formulation. But more
>> importantly, we IMO should reduce usage of the rewrite system, not
>> increase
>> it.
> 
> PGQ is meant to be implemented like that, like views expanding to joins
> and unions.  This is what I have gathered during the specification
> process, and from other implementations, and from academics.  There are
> certainly other ways to combine relational and graph database stuff,
> like with native graph storage and specialized execution support, but
> this is not that, and to some extent PGQ was created to supplant those
> other approaches.
> 

I understand PGQ was meant to be implemented as a bit of a "syntactic
sugar" on top of relations, instead of inventing some completely new
ways to store/query graph data.

But does that really mean it needs to be translated to relations this
early / in rewriter? I haven't thought about it very deeply, but won't
that discard useful information about semantics of the query, which
might be useful when planning/executing the query?

I've somehow imagined we'd be able to invent some new index types, or
utilize some other type of auxiliary structure, maybe some special
executor node, but it seems harder without this extra info ...

> Many people will agree that the rewriter is sort of weird and archaic at
> this point.  But I'm not aware of any plans or proposals to do anything
> about it.  As long as the view expansion takes place there, it makes
> sense to align with that.  For example, all the view security stuff
> (privileges, security barriers, etc.) will eventually need to be
> considered, and it would make sense to do that in a consistent way.  So
> for now, I'm working with what we have, but let's see where it goes.
> 
> (Note to self: Check that graph inside view inside graph inside view ...
> works.)
> 

AFAIK the "policy" regarding rewriter was that we don't want to use it
for user stuff (e.g. people using it for partitioning), but I'm not sure
about internal stuff.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Thoughts about NUM_BUFFER_PARTITIONS

2024-02-23 Thread Tomas Vondra
On 2/23/24 15:40, wenhui qiu wrote:
> Hi Tomas Vondra
> Thanks for the information!  But I found that Postgres Pro Enterprise
> has implemented this. However, it defaults to 16 and maxes out at 128,
> and the maximums are the same as in PostgreSQL.  I kindly hope the
> developers can explain what the purpose of this is.  Maybe 128 partitions
> is the optimal value.  It's a parameter to make it easier to adjust the
> number of partitions in the future when it's really not enough, and the
> code comments also said they hope to implement the parameter in the future.
> 
> 
> ( https://postgrespro.com/docs/enterprise/16/runtime-config-locks )
> 
> 
> log2_num_lock_partitions (integer)
> 
> 
> This controls how many partitions the shared lock tables are divided into.
> Number of partitions is calculated by raising 2 to the power of this
> parameter. The default value is 4, which corresponds to 16 partitions, and
> the maximum is 8. This parameter can only be set in the postgresql.conf file
> or on the server command line.
> 
> Best wish
> 

Hi,

Well, if Postgres Pro implements this, I don't know what their reasoning
was exactly, but I guess they wanted to make it easier to experiment
with different values (without rebuild), or maybe they actually have
systems where they know higher values help ...

Note: I'd point out that the maximum value 8 translates to 256 partitions,
so no - it does not max out at the same value as PostgreSQL.

Anyway, this value is inherently a trade off. If it wasn't, we'd set it
to something super high from the start. But having more partitions of
the lock table has a cost too, because some parts need to acquire all
the partition locks (and that's O(N) where N = number of partitions).
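
To make the O(N) part concrete, the sweep looks like this (cf. the loop in
GetLockStatusData()):

	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
		LWLockAcquire(LockHashPartitionLockByIndex(i), LW_SHARED);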

Of course, not having enough lock table partitions has a cost too,
because it increases the chance of conflict between backends (who happen
to need to operate on the same partition). This constant is not
constant, it changes over time - with 16 cores the collisions might have
been rare, with 128 not so much. Also, with partitioning we may need
many more locks per query.

This means it's entirely possible it'd be good to have more than 128
partitions of the lock table, but we only change this kind of stuff if
we have 2 things:

1) clear demonstration of the benefits (e.g. a workload showing an
improvement with higher number of partitions)

2) analysis of how this affects other workloads (e.g. cases that may
need to lock all the partitions etc)

Ultimately it's a trade off - we need to judge if the impact in (2) is
worth the improvement in (1).

None of this was done in this thread. There's no demonstration of the
benefits, no analysis of the impact etc.

As for turning the parameter into a GUC, that has a cost too. Either
direct - a compiler can do far more optimizations with compile-time
constants than with values that may change during execution, for
example. Or indirect - if we can't give users any guidance how/when to
tune the GUC, it can easily lead to misconfiguration (I can't even count
how many times I had to deal with systems where the values were "tuned"
following the logic that more is always better).

Which just leads back to (1) and (2) even for this case.


regards


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-23 Thread Давыдов Виталий

Hi Amit,

Amit Kapila  wrote:
> I don't see we do anything specific for 2PC transactions to make them
> behave differently than regular transactions with respect to
> synchronous_commit setting. What makes you think so? Can you pinpoint
> the code you are referring to?

Yes, sure. The function RecordTransactionCommitPrepared is called on
prepared transaction commit (twophase.c). It calls XLogFlush unconditionally.
The function RecordTransactionCommit (for regular transactions, xact.c) calls
XLogFlush if synchronous_commit > OFF; otherwise it calls XLogSetAsyncXactLSN.

There is a comment in RecordTransactionCommitPrepared (by Bruce Momjian)
that shows that async commit is not supported yet:

/*
 * We don't currently try to sleep before flush here ... nor is there any
 * support for async commit of a prepared xact (the very idea is probably
 * a contradiction)
 */
/* Flush XLOG to disk */
XLogFlush(recptr);

> Right, I think for this we need to implement parallel apply.

Yes, parallel apply is a good point. But, I believe, it will not work if
asynchronous commit is not supported. You have only one receiver process,
which should dispatch incoming messages to parallel workers. I guess you
will never reach such a rate of parallel execution on the replica as on the
master with multiple backends.

> Can you be a bit more specific about what exactly you have in mind to
> achieve the above solutions?

My proposal is to implement async commit for 2PC transactions as it is done
for regular transactions. It should significantly speed up the catchup
process. Then, think about how to apply in parallel, which is much more
difficult to do. The current problem is how to get the 2PC state from the
WAL on commit prepared. At that moment, the WAL is not flushed yet; the
commit function waits until the WAL with the 2PC state is flushed. I just
tried to do it in my sandbox and found such a problem. The inability to get
the 2PC state from unflushed WAL stops me right now. I think about possible
solutions.
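
A minimal sketch of that change (untested; it simply mirrors the
synchronous_commit check in RecordTransactionCommit):

	/* In RecordTransactionCommitPrepared(), instead of the unconditional
	 * XLogFlush(recptr): */
	if (synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
		XLogFlush(recptr);
	else
		XLogSetAsyncXactLSN(recptr);

The remaining problem is exactly the one above: COMMIT PREPARED must be able
to read the 2PC state even though that WAL may not be flushed yet.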

The idea with enableFsync is not a suitable solution in general, I think; I
just pointed at it as an alternate idea. You just do enableFsync = false
before prepare or commit prepared and enableFsync = true after these
functions. In this case, 2PC records will not be fsync-ed, but FlushPtr will
be increased. Thus, the 2PC state can be read from WAL on commit prepared
without waiting. To make it work correctly, I guess, we have to do some
additional work to keep more WAL on the master and filter some duplicate
transactions on the replica, if the replica restarts during catchup.
With best regards,
Vitaly Davydov

 


Re: Improve eviction algorithm in ReorderBuffer

2024-02-23 Thread Tomas Vondra
Hi,

I did a basic review and testing of this patch today. Overall I think
the patch is in very good shape - I agree with the tradeoffs it makes,
and I like the approach in general. I do have a couple minor comments
about the code, and then maybe a couple thoughts about the approach.


First, some comments - I'll put them here, but I also kept them in
"review" commits, because that makes it easier to show the exact place
in the code the comment is about.

1) binaryheap_allocate got a new "indexed" argument, but the comment is
not updated to document it

2) I think it's preferable to use descriptive argument names for
bh_set_node. I don't think there's a good reason to keep it short.

3) In a couple places we have code like this:

if (heap->bh_indexed)
bh_nodeidx_delete(heap->bh_nodeidx, result);

Maybe it'd be better to have the if condition in bh_nodeidx_delete, so
that it can be called without it.

4) Could we check the "found" flag in bh_set_node, somehow? I mean, we
either expect to find the node (update of already tracked transaction)
or not (when inserting it). The life cycle may be non-trivial (node
added, updated and removed, ...), would be useful assert I think.

5) Do we actually need the various mem_freed local variables in a couple
places, when we expect the value to be equal to txn->size (there's even
assert enforcing that)?

6) ReorderBufferCleanupTXN has a comment about maybe not using the same
threshold both to enable & disable usage of the binaryheap. I agree with
that, otherwise we could easily end up "trashing" if we add/remove
transactions right around the threshold. I think 90-95% for disabling
the heap would work fine.

7) The code disabling binaryheap (based on the threshold) is copied in a
couple places, perhaps it should be a separate function called from
those places.

8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the
memory size check internally, to make the calls simpler.

9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate
split maybe not very clear. It's not clear to me why it's divided like
this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.


performance
---

I did some benchmarks, to see the behavior in simple good/bad cases (see
the attached scripts.tgz). "large" is one large transaction inserting 1M
rows, small is 64k single-row inserts, and subxacts is the original case
with ~100k subxacts. Finally, subxacts-small is many transactions with
128 subxacts each (the main transactions are concurrent).

The results are pretty good, I think:

 test              master    patched
 ------------------------------------
 large               2587       2459     95%
 small                956        856     89%
 subxacts            1389       1529    112%
 subxacts-small     13632      13187     97%

This is timing (ms) with logical_work_mem=4MB. I also tried with 64MB,
where the subxact timing goes way down, but the overall conclusions do
not change.

I was a bit surprised I haven't seen any clear regression, but in the
end that's a good thing, right? There's a couple results in this thread
showing ~10% regression, but I've been unable to reproduce those.
Perhaps the newer patch versions fix that, I guess.

Anyway, I think that at some point we'd have to accept that some cases
may have slight regression. I think that's inherent for almost any
heuristics - there's always going to be some rare case that defeats it.
What's important is that the case needs to be rare and/or the impact
very limited. And I think that's true here.


overall design
--

As for the design, I agree with the approach of using a binaryheap to
track transactions by size. When going over the thread history,
describing the initial approach with only keeping "large" transactions
above some threshold (e.g. 10%), I was really concerned that'll either
lead to abrupt changes in behavior (when transactions move just around
the 10%), or won't help with many common cases (with most transactions
being below the limit).

I was going to suggest some sort of "binning" - keeping lists for
transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
evicting transactions from a list, i.e. based on approximate size. But
if the indexed binary heap seems to be cheap enough, I think it's a
better solution.

The one thing I'm a bit concerned about is the threshold used to start
using binary heap - these thresholds with binary decisions may easily
lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
with significant runtime change (e.g. you add/remove one transaction and
the code takes a much more expensive path). The value (1024) seems
rather arbitrary, I wonder if there's something to justify that choice.

In any case, I agree it'd be good to have some dampening factor, to
reduce the risk of trashing because of adding/removing a single
transaction to the 

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-23 Thread Давыдов Виталий

Hi Ajin,

Thank you for your feedback. Could you please try to increase the number of
clients (the -c pgbench option) up to 20 or more? It seems I forgot to
specify it.

With best regards,
Vitaly Davydov

On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий  wrote:
> Dear All,
>
> I'd like to present and talk about a problem when 2PC transactions are
> applied quite slowly on a replica during logical replication. There is a
> master and a replica with established logical replication from the master
> to the replica with twophase = true. With some load level on the master,
> the replica starts to lag behind the master, and the lag will be
> increasing. We have to significantly decrease the load on the master to
> allow the replica to complete the catchup. Such a problem may create
> significant difficulties in production. The problem appears at least on
> the REL_16_STABLE branch.
>
> To reproduce the problem:
>  * Setup logical replication from master to replica with subscription
>    parameter twophase = true.
>  * Create some intermediate load on the master (use pgbench with custom
>    sql with prepare+commit).
>  * Optionally switch off the replica for some time (keep load on master).
>  * Switch on the replica and wait until it reaches the master.
>
> The replica will never reach the master with even some low load on the
> master. If the load is removed, the replica will reach the master in a
> much greater time than expected. I tried the same for regular
> transactions, but such a problem doesn't appear even with a decent load.

I tried this setup and I do see that the logical subscriber does reach the
master in a short time. I'm not sure what I'm missing. I stopped the logical
subscriber in between while pgbench was running and then started it again
and ran the following:

postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
 sent_lsn  | pg_current_wal_lsn
-----------+--------------------
 0/6793FA0 | 0/6793FA0 <=== caught up
(1 row)

My pgbench command:
pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5

My custom sql file:
$ cat test.sql
SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;

regards,
Ajin Cherian
Fujitsu Australia

 


Re: locked reads for atomics

2024-02-23 Thread Nathan Bossart
On Thu, Feb 22, 2024 at 11:53:50AM -0800, Jeff Davis wrote:
> On Thu, 2024-02-22 at 12:58 +0530, Bharath Rupireddy wrote:
>> There's some immediate use for reads/writes with barrier semantics -
> 
> Is this mainly a convenience for safety/readability? Or is it faster in
> some cases than doing an atomic access with separate memory barriers?

The former.  Besides the 0002 patch tracked here, there's at least one
other patch [0] that could probably use these new functions.  The idea is
to provide an easy way to remove spinlocks, etc. and use atomics for less
performance-sensitive stuff.  The implementations are intended to be
relatively inexpensive and might continue to improve in the future, but the
functions are primarily meant to help reason about correctness.

I don't mind prioritizing these patches, especially since there now seems
to be multiple patches waiting on it.  IIRC I was worried about not having
enough support for this change, but I might now have it.

[0] https://commitfest.postgresql.org/47/4330/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SQL Property Graph Queries (SQL/PGQ)

2024-02-23 Thread Peter Eisentraut

On 16.02.24 20:23, Andres Freund wrote:

One aspect that I'm concerned with structurally is that the transformation,
from property graph queries to something postgres understands, is done via the
rewrite system. I doubt that that is a good idea. For one it bars the planner
from making plans that benefit from the graph query formulation. But more
importantly, we IMO should reduce usage of the rewrite system, not increase
it.


PGQ is meant to be implemented like that, like views expanding to joins 
and unions.  This is what I have gathered during the specification 
process, and from other implementations, and from academics.  There are 
certainly other ways to combine relational and graph database stuff, 
like with native graph storage and specialized execution support, but 
this is not that, and to some extent PGQ was created to supplant those 
other approaches.


Many people will agree that the rewriter is sort of weird and archaic at 
this point.  But I'm not aware of any plans or proposals to do anything 
about it.  As long as the view expansion takes place there, it makes 
sense to align with that.  For example, all the view security stuff 
(privileges, security barriers, etc.) will eventually need to be 
considered, and it would make sense to do that in a consistent way.  So 
for now, I'm working with what we have, but let's see where it goes.


(Note to self: Check that graph inside view inside graph inside view ... 
works.)






Re: Relation bulk write facility

2024-02-23 Thread Tom Lane
Heikki Linnakangas  writes:
> Thanks, the error message was clear enough:
>> bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 
>> feature [-Werror,-Wtypedef-redefinition]
>> } BulkWriteState;

> Fixed now, but I'm a bit surprised that neither other buildfarm members nor
> Cirrus CI caught that. I also tried to reproduce it locally by adding 
> -Wtypedef-redefinition, but my version of clang didn't produce any 
> warnings. Are there any extra compiler options on those animals or 
> something?

They use Apple's standard compiler (clang 15 or so), but

 'CC' => 'ccache clang -std=gnu99',

so maybe the -std has something to do with it.  I installed that
(or -std=gnu90 as appropriate to branch) on most of my build
setups back when we started the C99 push.
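
For reference, a minimal reproduction (assumed layout of the header and .c
file; repeating a typedef is legal C11, which is why only the -std=gnu99
animals complained):

/* bulk_write.h */
typedef struct BulkWriteState BulkWriteState;

/* bulk_write.c */
typedef struct BulkWriteState
{
	int			dummy;
} BulkWriteState;	/* -Wtypedef-redefinition error at -std=gnu99 */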

regards, tom lane




Re: Relation bulk write facility

2024-02-23 Thread Heikki Linnakangas

On 23/02/2024 17:15, Tom Lane wrote:

Heikki Linnakangas  writes:

Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..


Those are mine, let me know if you need local investigation.


Thanks, the error message was clear enough:


bulk_write.c:78:3: error: redefinition of typedef 'BulkWriteState' is a C11 
feature [-Werror,-Wtypedef-redefinition]
} BulkWriteState;
  ^
../../../../src/include/storage/bulk_write.h:20:31: note: previous definition 
is here
typedef struct BulkWriteState BulkWriteState;
  ^
1 error generated.


Fixed now, but I'm a bit surprised that neither other buildfarm members nor
Cirrus CI caught that. I also tried to reproduce it locally by adding 
-Wtypedef-redefinition, but my version of clang didn't produce any 
warnings. Are there any extra compiler options on those animals or 
something?


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Printing backtrace of postgres processes

2024-02-23 Thread Christoph Berg
Re: Michael Paquier
> >> •  backtrace() and backtrace_symbols_fd() don't call malloc()
> >>    explicitly, but they are part of libgcc, which gets loaded
> >>    dynamically when first used.  Dynamic loading usually triggers
> >>    a call to malloc(3).  If you need certain calls to these two
> >>    functions to not allocate memory (in signal handlers, for
> >>    example), you need to make sure libgcc is loaded beforehand.
> >>
> >> and the patch ensures that libgcc is loaded by calling a dummy
> >> backtrace() at the start of the process.
> 
> FWIW, anything I am reading about the matter freaks me out, including
> the dlopen() part in all the backends:
> https://www.gnu.org/software/libc/manual/html_node/Backtraces.html

I'd be concerned about the cost of doing that as part of the startup
of every single backend process. Shouldn't this rather be done within
the postmaster so it's automatically inherited by forked backends?
(EXEC_BACKEND systems probably don't have libgcc I guess.)
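
The preload itself is tiny; a sketch of the trick, wherever it ends up
living:

#include <execinfo.h>

/* Force libgcc to be loaded now, so that a later backtrace() call from
 * a signal handler doesn't trigger dlopen()/malloc(). */
static void
preload_backtrace_support(void)
{
	void	   *buf[2];

	(void) backtrace(buf, 2);	/* result intentionally discarded */
}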

Christoph




RangeTblEntry jumble omissions

2024-02-23 Thread Peter Eisentraut
I think there are some fields from the RangeTblEntry struct missing in 
the jumble (function _jumbleRangeTblEntry()).  Probably, some of these 
were really just forgotten, in other cases this might be an intentional 
decision, but then it might be good to document it.  This has come up in 
thread [0] and there is a patch [1], but I figured I'd start a new 
thread here to get the attention of those who know more about 
pg_stat_statements.


I think the following fields are missing.  (See also attached patch.)

- alias

Currently, two queries like

SELECT * FROM t1 AS foo
SELECT * FROM t1 AS bar

are counted together by pg_stat_statements -- that might be ok, but they 
both get listed under whichever one is run first, so here if you are 
looking for the "AS bar" query, you won't find it.


- join_using_alias

Similar situation, currently

SELECT * FROM t1 JOIN t2 USING (a, b)
SELECT * FROM t1 JOIN t2 USING (a, b) AS x

are counted together.

- funcordinality

This was probably just forgotten.  It should be included because the 
WITH ORDINALITY clause changes the query result.


- lateral

Also probably forgotten.  A query specifying LATERAL is clearly 
different from one without it.


Thoughts?  Anything else missing perhaps?


[0]: 
https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca...@eisentraut.org
[1]: 
https://www.postgresql.org/message-id/attachment/154249/v2-0002-Remove-custom-_jumbleRangeTblEntry.patch

From 8176e6f91abd8809e79e1c8e9335522031da2755 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 23 Feb 2024 16:10:57 +0100
Subject: [PATCH] Add more RangeTblEntry fields to jumble

---
 src/backend/nodes/queryjumblefuncs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/nodes/queryjumblefuncs.c 
b/src/backend/nodes/queryjumblefuncs.c
index 82f725baaa5..3b0012a720d 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -372,9 +372,11 @@ _jumbleRangeTblEntry(JumbleState *jstate, Node *node)
 			break;
 		case RTE_JOIN:
 			JUMBLE_FIELD(jointype);
+			JUMBLE_NODE(join_using_alias);
 			break;
 		case RTE_FUNCTION:
 			JUMBLE_NODE(functions);
+			JUMBLE_FIELD(funcordinality);
 			break;
 		case RTE_TABLEFUNC:
 			JUMBLE_NODE(tablefunc);
@@ -400,4 +402,6 @@ _jumbleRangeTblEntry(JumbleState *jstate, Node *node)
 			elog(ERROR, "unrecognized RTE kind: %d", (int) expr->rtekind);
 			break;
 	}
+	JUMBLE_NODE(alias);
+	JUMBLE_FIELD(lateral);
 }
-- 
2.43.2



Re: RFC: Logging plan of the running query

2024-02-23 Thread Robert Haas
On Fri, Feb 23, 2024 at 7:50 PM Julien Rouhaud  wrote:
> On Fri, Feb 23, 2024 at 10:22:32AM +0530, Robert Haas wrote:
> > On Thu, Feb 22, 2024 at 6:25 AM James Coleman  wrote:
> > > This is potentially a bit of a wild idea, but I wonder if having some
> > > kind of argument to CHECK_FOR_INTERRUPTS() signifying we're in
> > > "normal" as opposed to "critical" (using that word differently than
> > > the existing critical sections) would be worth it.
> >
> > It's worth considering, but the definition of "normal" vs. "critical"
> > might be hard to pin down. Or, we might end up with a definition that
> > is specific to this particular case and not generalizable to others.
>
> But it doesn't have to be all or nothing right?  I mean each call could say
> what the situation is like in their context, like
> CHECK_FOR_INTERRUPTS(GUARANTEE_NO_HEAVYWEIGHT_LOCK | GUARANTEE_WHATEVER), and
> slowly tag calls as needed, similarly to how we already add CFIs based on
> user reports.

Absolutely. My gut feeling is that it's going to be simpler to pick a
small number of places that are safe and sufficient for this
particular feature and add an extra call there, similar to how we do
vacuum_delay_point(). The reason I think that's likely to be better is
that it will likely require changing only a relatively small number of
places. If we instead start annotating CFIs, well, we've got hundreds
of those. That's a lot more to change, and it also inconveniences
third-party extension authors and people doing back-patching. I'm not
here to say it can't work; I just think it's likely not the easiest
path.
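
To sketch the shape I mean (hypothetical names; vacuum_delay_point() is the
analogy, not an actual proposal):

/* Called only from spots audited to be safe for emitting the plan,
 * instead of annotating every CHECK_FOR_INTERRUPTS(). */
void
log_query_plan_interrupt_point(void)
{
	CHECK_FOR_INTERRUPTS();

	if (LogQueryPlanPending)			/* hypothetical flag */
		ProcessLogQueryPlanInterrupt();	/* hypothetical handler */
}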

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: RangeTblEntry.inh vs. RTE_SUBQUERY

2024-02-23 Thread Tom Lane
Dean Rasheed  writes:
> On Fri, 23 Feb 2024 at 14:35, Peter Eisentraut  wrote:
>> Various code comments say that the RangeTblEntry field inh may only be
>> set for entries of kind RTE_RELATION.

> Yes, it's explained a bit more clearly/accurately in 
> expand_inherited_rtentry():

>  * "inh" is only allowed in two cases: RELATION and SUBQUERY RTEs.

Yes.  The latter has been accurate for a very long time, so I'm
surprised that there are any places that think otherwise.  We need
to fix them --- where did you see this exactly?

(Note that RELATION-only is accurate within the parser and rewriter,
so maybe clarifications about context are in order.)

regards, tom lane




incremental backup mishandles XLOG_DBASE_CREATE_FILE_COPY

2024-02-23 Thread Robert Haas
If XLOG_DBASE_CREATE_FILE_COPY occurs between an incremental backup
and its reference backup, every relation whose DB OID and tablespace
OID match the corresponding values in that record should be backed up
in full. Currently that's not happening, because the WAL summarizer
doesn't see the XLOG_DBASE_CREATE_FILE_COPY as referencing any
particular relfilenode and so basically ignores it. The same happens
for XLOG_DBASE_CREATE_WAL_LOG, but that case is OK because that only
covers creating the directory itself, not anything underneath it, and
there will be separate WAL records telling us the relfilenodes created
below the new directory and the pages modified therein.

AFAICS, fixing this requires some way of noting in the WAL summary
file that an entire directory got blown away. I chose to do that by
setting the limit block to 0 for a fake relation with the given DB OID
and TS OID and relfilenumber 0, which seems natural. Patch with test
case attached. The test case in brief is:

initdb -c summarize_wal=on
# start the server in $PGDATA
psql -c 'create database lakh oid = 10 strategy = file_copy' postgres
psql -c 'create table t1 (a int)' lakh
pg_basebackup -cfast -Dt1
dropdb lakh
psql -c 'create database lakh oid = 10 strategy = file_copy' postgres
pg_basebackup -cfast -Dt2 --incremental t1/backup_manifest
pg_combinebackup t1 t2 -o result
# stop the server, restart from the result directory
psql -c 'select * from t1' lakh

Without this patch, you get something like:

ERROR:  could not open file "base/10/16388": No such file or directory

...because the catalog entries from before the database is dropped and
recreated manage to end up in pg_combinebackup's output directory,
which they should not.

With the patch, you correctly get an error about t1 not existing.
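
The summarizer side of the fix is small; a sketch, not the exact hunk from
the attached patch ('reader' and 'brtab' stand in for the summarizer's
local state):

	/* On XLOG_DBASE_CREATE_FILE_COPY: record that everything under this
	 * DB OID / TS OID pair was blown away, via a fake relation with
	 * relfilenumber 0. */
	xl_dbase_create_file_copy_rec *xlrec =
		(xl_dbase_create_file_copy_rec *) XLogRecGetData(reader);
	RelFileLocator rlocator;

	rlocator.spcOid = xlrec->tablespace_id;
	rlocator.dbOid = xlrec->db_id;
	rlocator.relNumber = 0;		/* fake relation standing for the whole dir */

	/* limit block 0 = back up everything under this DB OID/TS OID in full */
	BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 0);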

I thought about whether there were any other WAL records that have
similar problems to XLOG_DBASE_CREATE_FILE_COPY and didn't come up
with anything. If anyone knows of any similar cases, please let me
know.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Fix-incremental-backup-interaction-with-XLOG_DBAS.patch
Description: Binary data


Re: Relation bulk write facility

2024-02-23 Thread Tom Lane
Heikki Linnakangas  writes:
> Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..

Those are mine, let me know if you need local investigation.

regards, tom lane




Re: Relation bulk write facility

2024-02-23 Thread Heikki Linnakangas

On 23/02/2024 16:27, Heikki Linnakangas wrote:

Committed this. Thanks everyone!


Buildfarm animals 'sifaka' and 'longfin' are not happy, I will investigate..

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Potential issue in ecpg-informix decimal converting functions

2024-02-23 Thread a . imamov

Daniel Gustafsson wrote on 2024-02-23 13:44:

On 22 Feb 2024, at 17:54, a.ima...@postgrespro.ru wrote:



PGTYPESnumeric_to_int() and PGTYPESnumeric_to_long()
functions return only 0 or -1. So ECPG_INFORMIX_NUM_OVERFLOW can never
be returned.


Indeed, this looks like an oversight.


I think dectoint(), dectolong() and PGTYPESnumeric_to_int() functions
should be a little bit different like in proposing patch.
What do you think?


-Convert a variable to type decimal to an integer.
+Convert a variable of type decimal to an integer.
While related, this should be committed and backpatched regardless.

+   int errnum = 0;
Stylistic nit, we typically don't initialize a variable which cannot be
accessed before being set.

Overall the patch looks sane, please register it for the next commitfest to
make sure it's not missed.

--
Daniel Gustafsson


Thank you for feedback,

-Convert a variable to type decimal to an integer.
+Convert a variable of type decimal to an integer.
I removed this from the patch and proposed to 
pgsql-d...@lists.postgresql.org


+   int errnum = 0;
fixed

Thanks for the advice, the patch will be registered for the next
commitfest.


--
Aidar Imamov

diff --git a/doc/src/sgml/ecpg.sgml b/doc/src/sgml/ecpg.sgml
index 73351a9136..aa86afa1d9 100644
--- a/doc/src/sgml/ecpg.sgml
+++ b/doc/src/sgml/ecpg.sgml
@@ -8892,8 +8892,9 @@ int dectoint(decimal *np, int *ip);


 On success, 0 is returned and a negative value if the conversion
-failed. If an overflow occurred, ECPG_INFORMIX_NUM_OVERFLOW
-is returned.
+failed. If overflow or underflow occurred, the function returns
+ECPG_INFORMIX_NUM_OVERFLOW or
+ECPG_INFORMIX_NUM_UNDERFLOW respectively.


 Note that the ECPG implementation differs from the Informix
@@ -8918,8 +8919,9 @@ int dectolong(decimal *np, long *lngp);


 On success, 0 is returned and a negative value if the conversion
-failed. If an overflow occurred, ECPG_INFORMIX_NUM_OVERFLOW
-is returned.
+failed. If overflow or underflow occurred, the function returns
+ECPG_INFORMIX_NUM_OVERFLOW or
+ECPG_INFORMIX_NUM_UNDERFLOW respectively.


 Note that the ECPG implementation differs from the Informix
diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c
index 80d40aa3e0..5f9d71d0f3 100644
--- a/src/interfaces/ecpg/compatlib/informix.c
+++ b/src/interfaces/ecpg/compatlib/informix.c
@@ -434,6 +434,7 @@ int
 dectoint(decimal *np, int *ip)
 {
 	int			ret;
+	int			errnum;
 	numeric    *nres = PGTYPESnumeric_new();
 
 	if (nres == NULL)
@@ -445,11 +446,18 @@ dectoint(decimal *np, int *ip)
 		return ECPG_INFORMIX_OUT_OF_MEMORY;
 	}
 
+	errno = 0;
 	ret = PGTYPESnumeric_to_int(nres, ip);
+	errnum = errno;
 	PGTYPESnumeric_free(nres);
 
-	if (ret == PGTYPES_NUM_OVERFLOW)
-		ret = ECPG_INFORMIX_NUM_OVERFLOW;
+	if (ret)
+	{
+		if (errnum == PGTYPES_NUM_UNDERFLOW)
+			ret = ECPG_INFORMIX_NUM_UNDERFLOW;
+		else if (errnum == PGTYPES_NUM_OVERFLOW)
+			ret = ECPG_INFORMIX_NUM_OVERFLOW;
+	}
 
 	return ret;
 }
@@ -458,6 +466,7 @@ int
 dectolong(decimal *np, long *lngp)
 {
 	int			ret;
+	int			errnum;
 	numeric    *nres = PGTYPESnumeric_new();
 
 	if (nres == NULL)
@@ -469,11 +478,18 @@ dectolong(decimal *np, long *lngp)
 		return ECPG_INFORMIX_OUT_OF_MEMORY;
 	}
 
+	errno = 0;
 	ret = PGTYPESnumeric_to_long(nres, lngp);
+	errnum = errno;
 	PGTYPESnumeric_free(nres);
 
-	if (ret == PGTYPES_NUM_OVERFLOW)
-		ret = ECPG_INFORMIX_NUM_OVERFLOW;
+	if (ret)
+	{
+		if (errnum == PGTYPES_NUM_UNDERFLOW)
+			ret = ECPG_INFORMIX_NUM_UNDERFLOW;
+		else if (errnum == PGTYPES_NUM_OVERFLOW)
+			ret = ECPG_INFORMIX_NUM_OVERFLOW;
+	}
 
 	return ret;
 }
diff --git a/src/interfaces/ecpg/pgtypeslib/numeric.c b/src/interfaces/ecpg/pgtypeslib/numeric.c
index 35e7b92da4..52b49e1ce9 100644
--- a/src/interfaces/ecpg/pgtypeslib/numeric.c
+++ b/src/interfaces/ecpg/pgtypeslib/numeric.c
@@ -1502,7 +1502,12 @@ PGTYPESnumeric_to_int(numeric *nv, int *ip)
 /* silence compilers that might complain about useless tests */
 #if SIZEOF_LONG > SIZEOF_INT
 
-	if (l < INT_MIN || l > INT_MAX)
+	if (l < INT_MIN)
+	{
+		errno = PGTYPES_NUM_UNDERFLOW;
+		return -1;
+	}
+	else if (l > INT_MAX)
 	{
 		errno = PGTYPES_NUM_OVERFLOW;
 		return -1;
diff --git a/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout b/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout
index 1f8675b3f3..71a5afa4a7 100644
--- a/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout
+++ b/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout
@@ -3,8 +3,8 @@
 (no errno set) - dec[0,3]: r: -1, *
 (no errno set) - dec[0,4]: r: -1, *
 dec[0,5]: r: 0, 0.00
-(errno == PGTYPES_NUM_OVERFLOW) - dec[0,6]: 0 (r: -1)
-(errno == PGTYPES_NUM_OVERFLOW) - dec[0,8]: 

Re: RangeTblEntry.inh vs. RTE_SUBQUERY

2024-02-23 Thread Dean Rasheed
On Fri, 23 Feb 2024 at 14:35, Peter Eisentraut  wrote:
>
> Various code comments say that the RangeTblEntry field inh may only be
> set for entries of kind RTE_RELATION.
>
> The function pull_up_simple_union_all() in prepjointree.c sets ->inh to
> true for RTE_SUBQUERY entries:
>
>  /*
>   * Mark the parent as an append relation.
>   */
>  rte->inh = true;
>
> Whatever this is doing appears to be some undocumented magic.

Yes, it's explained a bit more clearly/accurately in expand_inherited_rtentry():

/*
 * expand_inherited_rtentry
 *Expand a rangetable entry that has the "inh" bit set.
 *
 * "inh" is only allowed in two cases: RELATION and SUBQUERY RTEs.
 *
 * "inh" on a plain RELATION RTE means that it is a partitioned table or the
 * parent of a traditional-inheritance set.  In this case we must add entries
 * for all the interesting child tables to the query's rangetable, and build
 * additional planner data structures for them, including RelOptInfos,
 * AppendRelInfos, and possibly PlanRowMarks.
 *
 * Note that the original RTE is considered to represent the whole inheritance
 * set.  In the case of traditional inheritance, the first of the generated
 * RTEs is an RTE for the same table, but with inh = false, to represent the
 * parent table in its role as a simple member of the inheritance set.  For
 * partitioning, we don't need a second RTE because the partitioned table
 * itself has no data and need not be scanned.
 *
 * "inh" on a SUBQUERY RTE means that it's the parent of a UNION ALL group,
 * which is treated as an appendrel similarly to inheritance cases; however,
 * we already made RTEs and AppendRelInfos for the subqueries.  We only need
 * to build RelOptInfos for them, which is done by expand_appendrel_subquery.
 */

> Is this something we should explain in the RangeTblEntry comments?
>

+1

Regards,
Dean




Re: Thoughts about NUM_BUFFER_PARTITIONS

2024-02-23 Thread wenhui qiu
Hi Tomas Vondra,
Thanks for the information!  But I found that Postgres Pro Enterprise has
already implemented this.  However, it defaults to 16 partitions and maxes
out at 128, the same maximum as in PostgreSQL.  I kindly hope the developers
can explain the purpose of this.  Maybe 128 partitions is the optimal value,
and it's a parameter to make it easier to adjust the number of partitions in
the future when it's really not enough; the code comments also said they
hope to implement the parameter in the future.


( https://postgrespro.com/docs/enterprise/16/runtime-config-locks )


log2_num_lock_partitions (integer)


This controls how many partitions the shared lock tables are divided into.
Number of partitions is calculated by raising 2 to the power of this
parameter. The default value is 4, which corresponds to 16 partitions, and
the maximum is 8. This parameter can only be set in the postgresql.conf file
or on the server command line.
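
For reference, wiring such a setting into PostgreSQL would amount to a
guc_tables.c entry along these lines (a sketch only; it assumes a
`static int log2_num_lock_partitions` variable, the wording is ours, and
Postgres Pro's actual implementation is not public):

	/* hypothetical int GUC: boot value 4 (16 partitions), range 4..8 */
	{
		{"log2_num_lock_partitions", PGC_POSTMASTER, LOCK_MANAGEMENT,
			gettext_noop("Sets the number of lock table partitions, as a power of two."),
			NULL
		},
		&log2_num_lock_partitions,
		4, 4, 8,
		NULL, NULL, NULL
	},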

Best wishes,


On Tue, 20 Feb 2024 at 21:55, Tomas Vondra 
wrote:

> Hi,
>
> On 2/20/24 03:16, wenhui qiu wrote:
> > Hi Heikki Linnakangas
> >I saw git log found this commit:
> >
> https://github.com/postgres/postgres/commit/3acc10c997f916f6a741d0b4876126b7b08e3892
> > ,I don't seem to see an email discussing this commit. As the commit log
> > tells us, we don't know exactly how large a value is optimal, and I
> believe
> > it's more flexible to make it as a parameter.Thank you very much
> > tomas.vondra for explaining the relationship, i see that
> MAX_SIMUL_LWLOCKS
> > was just doubled in this commit, is there a more appropriate ratio
> between
> > them?
> >
>
> I think the discussion for that commit is in [1] (and especially [2]).
>
> That being said, I don't think MAX_SIMUL_LOCKS and NUM_BUFFER_PARTITIONS
> need to be in any particular ratio. The only requirement is that there
> needs to be enough slack, and 72 locks seemed to work quite fine until
> now - I don't think we need to change that.
>
> What might be necessary is improving held_lwlocks - we treat is as LIFO,
> but more as an expectation than a hard rule. I'm not sure how often we
> violate that rule (if at all), but if we do then it's going to get more
> expensive as we increase the number of locks. But I'm not sure this is
> actually a problem in practice, we usually hold very few LWLocks at the
> same time.
>
> As for making this a parameter, I'm rather opposed to the idea. If we
> don't have a very clear idea how to set this limit, what's the chance
> users with little knowledge of the internals will pick a good value?
> Adding yet another knob would just mean users start messing with it in
> random ways (typically increasing it to very high value, because "more
> is better"), causing more harm than good.
>
> Adding it as a GUC would also require making some parts dynamic (instead
> of just doing static allocation with compile-time constants). That's not
> great, but I'm not sure how significant the penalty might be.
>
>
> IMHO adding a GUC might be acceptable only if we fail to come up with a
> good value (which is going to be a trade off), and if someone
> demonstrates a clear benefit of increasing the value (which I don't
> think happen in this thread yet).
>
>
> regards
>
>
> [1]
>
> https://www.postgresql.org/message-id/flat/CAA4eK1LSTcMwXNO8ovGh7c0UgCHzGbN%3D%2BPjggfzQDukKr3q_DA%40mail.gmail.com
>
> [2]
>
> https://www.postgresql.org/message-id/CA%2BTgmoY58dQi8Z%3DFDAu4ggxHV-HYV03-R9on1LSP9OJU_fy_zA%40mail.gmail.com
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


RangeTblEntry.inh vs. RTE_SUBQUERY

2024-02-23 Thread Peter Eisentraut
Various code comments say that the RangeTblEntry field inh may only be 
set for entries of kind RTE_RELATION.


For example

 *inh is true for relation references that should be expanded to 
include

 *inheritance children, if the rel has any.  This *must* be false for
 *RTEs other than RTE_RELATION entries.

and various comments in other files.

(Confusingly, it is also listed under "Fields valid in all RTEs:", but 
that definitely seems wrong.)


I have been deploying some assertions to see if the claims in the 
RangeTblEntry comments are all correct, and I tripped over something.


The function pull_up_simple_union_all() in prepjointree.c sets ->inh to 
true for RTE_SUBQUERY entries:


/*
 * Mark the parent as an append relation.
 */
rte->inh = true;

Whatever this is doing appears to be some undocumented magic.  If I 
remove the line, then regression tests fail with plan differences, so it 
definitely seems to do something.


Is this something we should explain in the RangeTblEntry comments?




Re: Relation bulk write facility

2024-02-23 Thread Heikki Linnakangas

Committed this. Thanks everyone!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: RFC: Logging plan of the running query

2024-02-23 Thread Julien Rouhaud
Hi,

On Fri, Feb 23, 2024 at 10:22:32AM +0530, Robert Haas wrote:
> On Thu, Feb 22, 2024 at 6:25 AM James Coleman  wrote:
> > This is potentially a bit of a wild idea, but I wonder if having some
> > kind of argument to CHECK_FOR_INTERRUPTS() signifying we're in
> > "normal" as opposed to "critical" (using that word differently than
> > the existing critical sections) would be worth it.
>
> It's worth considering, but the definition of "normal" vs. "critical"
> might be hard to pin down. Or, we might end up with a definition that
> is specific to this particular case and not generalizable to others.

But it doesn't have to be all or nothing, right?  I mean, each call could say
what the situation is like in its context, like
CHECK_FOR_INTERRUPTS(GUARANTEE_NO_HEAVYWEIGHT_LOCK | GUARANTEE_WHATEVER), and
slowly tag calls as needed, similarly to how we already add CFIs based on
user reports.
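
For illustration, a minimal sketch of that idea (the flag names, the flagged
macro, and ProcessInterruptsExt() are all hypothetical, not existing
PostgreSQL symbols; only INTERRUPTS_PENDING_CONDITION() exists today):

	/* hypothetical per-call-site guarantees */
	#define GUARANTEE_NO_HEAVYWEIGHT_LOCK	0x01
	#define GUARANTEE_NO_CATALOG_ACCESS		0x02

	/* flagged variant of CHECK_FOR_INTERRUPTS() */
	#define CHECK_FOR_INTERRUPTS_FLAGS(flags) \
	do { \
		if (INTERRUPTS_PENDING_CONDITION()) \
			ProcessInterruptsExt(flags); \
	} while (0)

	/* e.g. at a call site known to hold no heavyweight lock: */
	CHECK_FOR_INTERRUPTS_FLAGS(GUARANTEE_NO_HEAVYWEIGHT_LOCK);

ProcessInterruptsExt() could then decide, per flag, whether it is safe to do
extra work such as logging the running query's plan.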




Re: Synchronizing slots from primary to standby

2024-02-23 Thread Bertrand Drouvot
Hi,

On Fri, Feb 23, 2024 at 09:30:58AM +, Zhijie Hou (Fujitsu) wrote:
> On Friday, February 23, 2024 5:07 PM Bertrand Drouvot 
>  wrote:
> > On Fri, Feb 23, 2024 at 02:15:11PM +0530, shveta malik wrote:
> > >
> > > Thanks for the details. I understand it now.  We do not use '=' in our
> > > main slots-fetch query but we do use '=' in remote-validation query.
> > > See validate_remote_info().
> > 
> > Oh, right, I missed it during the review.
> > 
> > > Do you think instead of doing the above, we can override search-path
> > > with empty string in the slot-sync case.
> > > SImilar to logical apply worker and autovacuum worker case (see
> > > InitializeLogRepWorker(), AutoVacWorkerMain()).
> > 
> > Yeah, we should definitively ensure that any operators being used in the 
> > query
> > is coming from the pg_catalog schema (could be by setting the search path or
> > using the up-thread proposal).
> > 
> > Setting the search path would prevent any risks in case the query is changed
> > later on, so I'd vote for changing the search path in 
> > validate_remote_info() and
> > in synchronize_slots() to be on the safe side.
> 
> I think to set secure search path for remote connection, the standard approach
> could be to extend the code in libpqrcv_connect[1], so that we don't need to 
> schema
> qualify all the operators in the queries.
> 
> And for local connection, I agree it's also needed to add a
> SetConfigOption("search_path", "" call in the slotsync worker.
> 
> [1]
> libpqrcv_connect
> ...
>   if (logical)
> ...
>   res = libpqrcv_PQexec(conn->streamConn,
> 
> ALWAYS_SECURE_SEARCH_PATH_SQL);
> 

Agree, something like in the attached? (it's .txt to not disturb the CF bot).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From abd3e8a7131621251b5d4628f4cb0979911159ac Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 23 Feb 2024 13:58:31 +
Subject: [PATCH v1] reset search_path for slot synchronization.

Ensure that search_path is reset for slot synchronization, within the BGW and
"local" connections.
---
 src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 2 +-
 src/backend/replication/logical/slotsync.c  | 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)
  20.0% src/backend/replication/libpqwalreceiver/
  79.9% src/backend/replication/logical/

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 04271ee703..a30528a5f6 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -271,7 +271,7 @@ libpqrcv_connect(const char *conninfo, bool replication, bool logical,
 				 errhint("Target server's authentication method must be changed, or set password_required=false in the subscription parameters.")));
 	}
 
-   if (logical)
+   if (logical || !replication)
{
PGresult   *res;
 
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 36773cfe73..0ee08c3976 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -1215,6 +1215,12 @@ ReplSlotSyncWorkerMain(int argc, char *argv[])
 */
sigprocmask(SIG_SETMASK, , NULL);
 
+   /*
+* Set always-secure search path, so malicious users can't redirect user
+* code (e.g. operators).
+*/
+   SetConfigOption("search_path", "", PGC_SUSET, PGC_S_OVERRIDE);
+
dbname = CheckAndGetDbnameFromConninfo();
 
/*
-- 
2.34.1



Re: table inheritance versus column compression and storage settings

2024-02-23 Thread Ashutosh Bapat
On Tue, Feb 20, 2024 at 3:51 PM Peter Eisentraut  wrote:
>
> I have reverted the patch for now (and re-opened the commitfest entry).
> We should continue to work on this and see if we can at least try to get
> the pg_dump test coverage suitable.
>

I have started a separate thread for dump/restore test. [1].

Using that test, I found an existing bug:
Consider
CREATE TABLE cminh6 (f1 TEXT);
ALTER TABLE cminh6 INHERIT cmparent1;
f1 remains without compression even after the INHERIT, per the current code.
But pg_dump dumps it out as
CREATE TABLE cminh6 (f1 TEXT) INHERITS (cmparent1)
Because of this, after restore cminh6::f1 inherits the compression of
cmparent1. So before the dump cminh6::f1 has no compression, and after the
restore it has compression.

I am not sure how to fix this. We want inheritance children to have
their own compression, so ALTER TABLE ... INHERIT ... not passing
compression onto the child is fine. CREATE TABLE ... INHERITS ... passing
compression onto the child being created also looks fine, since that's
what we do with other attributes. The only solution I see is to introduce
"none" as a special compression method to indicate "no compression"
and store that instead of NULL in the attcompression column. That looks
ugly.

Similar is the case with storage.

[1] 
https://www.postgresql.org/message-id/CAExHW5uF5V=Cjecx3_Z=7xfh4rg2wf61pt+hfquzjbqourz...@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat




Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-02-23 Thread Daniel Gustafsson
> On 23 Feb 2024, at 13:09, Nazir Bilal Yavuz  wrote:

> Does errmsg_internal() need to be used all the time when turning elogs
> into ereports? errmsg_internal()'s comment says that "This should be
> used for "can't happen" cases that are probably not worth spending
> translation effort on.". Is it enough to check if the error message
> has a translation, and then decide the use of errmsg_internal() or
> errmsg()?

If it's an elog then it won't have a translation, as none are included in the
translation set.  If the errmsg is generic enough to be translated anyway via
another (un)related ereport call then we can of course use that, but ideally
we should not create new ones.
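
For illustration, the conversion pattern under discussion, using an example
from the patch (the errcode choice is the one proposed up-thread):

	/* before: implicit ERRCODE_INTERNAL_ERROR, string not translated */
	elog(PANIC, "space reserved for WAL record does not match what was written");

	/* after: explicit errcode; errmsg_internal() keeps this "can't happen"
	 * string out of the translation set */
	ereport(PANIC,
			errcode(ERRCODE_DATA_CORRUPTED),
			errmsg_internal("space reserved for WAL record does not match what was written"));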

--
Daniel Gustafsson





Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-02-23 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

On Thu, 22 Feb 2024 at 16:55, Daniel Gustafsson  wrote:
>
> > On 6 Dec 2023, at 14:03, Nazir Bilal Yavuz  wrote:
>
> > There is an ongoing thread [1] for adding missing SQL error codes to
> > PANIC and FATAL error reports in xlogrecovery.c file. I did the same
> > but for xlog.c and relcache.c files.
>
> -   elog(PANIC, "space reserved for WAL record does not match what was written");
> +   ereport(PANIC,
> +   (errcode(ERRCODE_DATA_CORRUPTED),
> +errmsg("space reserved for WAL record does not match what was written")));
>
> elogs turned into ereports should use errmsg_internal() to keep the strings
> from being translated.

Does errmsg_internal() need to be used all the time when turning elogs
into ereports? errmsg_internal()'s comment says that "This should be
used for "can't happen" cases that are probably not worth spending
translation effort on.". Is it enough to check if the error message
has a translation, and then decide the use of errmsg_internal() or
errmsg()?

> -   elog(FATAL, "could not write init file");
> +   ereport(FATAL,
> +   (errcode_for_file_access(),
> +errmsg("could not write init file")));
>
> Is it worthwhile adding %m on these to get a little more help when debugging
> errors that shouldn't happen?

I believe it is worthwhile, so I will add.

> -   elog(FATAL, "could not write init file");
> +   ereport(FATAL,
> +   (errcode_for_file_access(),
>
> The extra parentheses are no longer needed, I don't know if we have a policy to
> remove them when changing an ereport call but we should at least not introduce
> new ones.
>
> -   elog(FATAL, "cannot read pg_class without having selected a 
> database");
> +   ereport(FATAL,
> +   (errcode(ERRCODE_INTERNAL_ERROR),
>
> ereport (and thus elog) already defaults to ERRCODE_INTERNAL_ERROR for ERROR or
> higher, so unless there is a better errcode an elog() call is preferable here.

I did not know these, thanks.

> > I couldn't find a suitable error code for the "cache lookup failed for
> > relation" error in relcache.c and this error comes up in many places.
> > Would it be reasonable to create a new error code specifically for
> > this?
>
> We use ERRCODE_UNDEFINED_OBJECT for similar errors elsewhere, perhaps we can
> use that for these as well?

It looks okay to me, ERRCODE_UNDEFINED_OBJECT is mostly used for the
'not exist' errors and it seems the main reason for the 'cache lookup
failed for relation' error is because heap tuple does not exist
anymore.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-23 Thread Alvaro Herrera
On 2024-Feb-23, Andrey M. Borodin wrote:

> I'm sure anyone with multiple CPUs should increase, not decrease
> previous default of 128 buffers (with 512MB shared buffers). Having
> more CPUs (the only way to benefit from more locks) implies bigger
> transaction buffers.

Sure.

> IMO making bank size variable adds unneeded computation overhead, bank
> search loops should be unrollable by compiler etc.

Makes sense.

> Originally there was a patch set step, that packed bank's page
> addresses together in one array. It was done to make bank search a
> SIMD instruction.

Ants Aasma had proposed a rework of the LRU code for better performance.
He told me it depended on bank size being 16, so you're right that it's
probably not a good idea to make it variable.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Synchronizing slots from primary to standby

2024-02-23 Thread Bertrand Drouvot
Hi,

On Fri, Feb 23, 2024 at 09:46:00AM +, Zhijie Hou (Fujitsu) wrote:
> On Friday, February 23, 2024 1:22 PM shveta malik  
> wrote:
> 
> > 
> > Thanks for the patches. Had a quick look at v95_2, here are some
> > trivial comments:
> 
> Thanks for the comments.
> 
> > 6) streaming replication standby server slot names that logical walsender
> > processes will wait for
> > 
> > Is it better to say it like this? (I leave this to your preference)
> > 
> > streaming replication standby server slot names for which logical
> > walsender processes will wait.
> 
> I feel the current one seems better, so I didn't change it. The other comments
> have been addressed. Here is the V97 patch set which addresses Shveta's comments.
> 
> 
> Besides, I'd like to clarify and discuss the behavior of standby_slot_names 
> once.
> 
> As it stands in the patch, If the slots specified in standby_slot_names are
> dropped or invalidated, the logical walsender will issue a WARNING and 
> continue
> to replicate the changes. Another option for this could be to have the
> walsender pause until the slot in standby_slot_names is re-created or becomes
> valid again. Does anyone else have an opinion on this matter ?

Good point, I'd vote for: the only reasons not to wait are:

- slots mentioned in standby_slot_names exist and valid and do catch up
or
- standby_slot_names is empty

The reason is that setting standby_slot_names to a non empty value means that
one wants the walsender to wait until the standby catchup. The way to remove 
this
intentional behavior should be by changing the standby_slot_names value (not the
existence or the state of the slot(s) it points too).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-23 Thread Dilip Kumar
On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar  wrote:
>
> On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera  
> wrote:
> >
> > On 2024-Feb-07, Dilip Kumar wrote:
> >
> > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera  
> > > wrote:
> >
> > > > Sure, but is that really what we want?
> > >
> > > So your question is do we want these buffers to be in multiple of
> > > SLRU_BANK_SIZE?  Maybe we can have the last bank to be partial, I
> > > don't think it should create any problem logically.  I mean we can
> > > look again in the patch to see if we have made any such assumptions
> > > but that should be fairly easy to fix, then maybe if we are going in
> > > this way we should get rid of the check_slru_buffers() function as
> > > well.
> >
> > Not really, I just don't think the macro should be in slru.h.
>
> Okay
>
> > Another thing I've been thinking is that perhaps it would be useful to
> > make the banks smaller, when the total number of buffers is small.  For
> > example, if you have 16 or 32 buffers, it's not really clear to me that
> > it makes sense to have just 1 bank or 2 banks.  It might be more
> > sensible to have 4 banks with 4 or 8 buffers instead.  That should make
> > the algorithm scale down as well as up ...
>
> It might be helpful to have small-size banks when SLRU buffers are set
> to a very low value and we are only accessing a couple of pages at a
> time (i.e. no buffer replacement) because in such cases most of the
> contention will be on SLRU Bank lock. Although I am not sure how
> practical such a use case would be, I mean if someone is using
> multi-xact very heavily or creating frequent subtransaction overflow
> then wouldn't they should set this buffer limit to some big enough
> value?  By doing this we would lose some simplicity of the patch I
> mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would
> need to compute this and store it in SlruShared. Maybe that's not that
> bad.
>
> >
> > I haven't done either of those things in the attached v19 version.  I
> > did go over the comments once again and rewrote the parts I was unhappy
> > with, including some existing ones.  I think it's OK now from that point
> > of view ... at some point I thought about creating a separate README,
> > but in the end I thought it not necessary.
>
> Thanks, I will review those changes.

Few other things I noticed while reading through the patch, I haven't
read it completely yet but this is what I got for now.

1.
+ * If no process is already in the list, we're the leader; our first step
+ * is to "close out the group" by resetting the list pointer from
+ * ProcGlobal->clogGroupFirst (this lets other processes set up other
+ * groups later); then we lock the SLRU bank corresponding to our group's
+ * page, do the SLRU updates, release the SLRU bank lock, and wake up the
+ * sleeping processes.

I think here we are saying that we "close out the group" before
acquiring the SLRU lock, but that's not true.  We keep the group open
until we get the lock, so that we can get the maximum number of members
in while we are anyway waiting for the lock.

2.
 static void
 TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
  RepOriginId nodeid, int slotno)
 {
- Assert(TransactionIdIsNormal(xid));
+ if (!TransactionIdIsNormal(xid))
+ return;
+
+ entryno = TransactionIdToCTsEntry(xid);

I do not understand why we need this change.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-23 Thread Andrey M. Borodin



> On 23 Feb 2024, at 12:36, Dilip Kumar  wrote:
> 
>> Another thing I've been thinking is that perhaps it would be useful to
>> make the banks smaller, when the total number of buffers is small.  For
>> example, if you have 16 or 32 buffers, it's not really clear to me that
>> it makes sense to have just 1 bank or 2 banks.  It might be more
>> sensible to have 4 banks with 4 or 8 buffers instead.  That should make
>> the algorithm scale down as well as up ...
> 
> It might be helpful to have small-size banks when SLRU buffers are set
> to a very low value and we are only accessing a couple of pages at a
> time (i.e. no buffer replacement) because in such cases most of the
> contention will be on SLRU Bank lock. Although I am not sure how
> practical such a use case would be, I mean if someone is using
> multi-xact very heavily or creating frequent subtransaction overflow
> then wouldn't they should set this buffer limit to some big enough
> value?  By doing this we would lose some simplicity of the patch I
> mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would
> need to compute this and store it in SlruShared. Maybe that's not that
> bad.

I'm sure anyone with multiple CPUs should increase, not decrease previous 
default of 128 buffers (with 512MB shared buffers). Having more CPUs (the only 
way to benefit from more locks) implies bigger transaction buffers.
IMO making bank size variable adds unneeded computation overhead, bank search 
loops should be unrollable by compiler etc.
Originally there was a patch set step, that packed bank's page addresses 
together in one array. It was done to make bank search a SIMD instruction.


Best regards, Andrey Borodin.



Re: Potential issue in ecpg-informix decimal converting functions

2024-02-23 Thread Daniel Gustafsson
> On 22 Feb 2024, at 17:54, a.ima...@postgrespro.ru wrote:

> PGTYPESnumeric_to_int() and PGTYPESnumeric_to_long()
> functions return only 0 or -1. So ECPG_INFORMIX_NUM_OVERFLOW can never
> be returned.

Indeed, this looks like an oversight.

> I think dectoint(), dectolong() and PGTYPESnumeric_to_int() functions
> should be a little bit different like in proposing patch.
> What do you think?

-Convert a variable to type decimal to an integer.
+Convert a variable of type decimal to an integer.
While related, this should be committed and backpatched regardless.

+   int errnum = 0;
Stylistic nit, we typically don't initialize a variable which cannot be
accessed before being set.

Overall the patch looks sane, please register it for the next commitfest to
make sure it's not missed.

--
Daniel Gustafsson





Refactor SASL exchange in preparation for OAuth Bearer

2024-02-23 Thread Daniel Gustafsson
The attached two patches are smaller refactorings to the SASL exchange and init
codepaths which are required for the OAuthbearer work [0].  Regardless of the
future of that patchset, these refactorings are nice cleanups and can be
considered in isolation.  Another goal is of course to reduce scope of the
OAuth patchset to make it easier to review.

The first patch changes the exchange call to return a tri-state status
instead of the current output parameters.  This makes it possible to
introduce async flows, but it also makes the code a lot more readable due to
using descriptive names, IMHO.
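
As a sketch, the tri-state could look something like this (illustrative
names only; the patch's actual identifiers may differ):

	typedef enum
	{
		SASL_EXCHANGE_SUCCESS,	/* exchange completed successfully */
		SASL_EXCHANGE_CONTINUE,	/* more server challenges expected */
		SASL_EXCHANGE_FAILURE	/* exchange failed, abort authentication */
	} SASLStatus;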

The second patch sets password_needed during SASL init on the SCRAM exchanges.
This was implicit in the code, but since not all future exchanges may require a
password, do it explicitly per mechanism instead.

--
Daniel Gustafsson

[0] d1b467a78e0e36ed85a09adf979d04cf124a9d4b.ca...@vmware.com



v1-0002-Explicitly-require-password-for-SCRAM-exchange.patch
Description: Binary data


v1-0001-Refactor-SASL-exchange-to-return-tri-state-status.patch
Description: Binary data


ExecAppendAsyncEventWait() in REL_14_STABLE can corrupt PG_exception_stack

2024-02-23 Thread Alexander Pyhalov

Hi.

Recent commit 555276f8594087ba15e0d58e38cd2186b9f39f6d introduced final
cleanup of node->as_eventset in ExecAppendAsyncEventWait().
Unfortunately, now this function can return in the middle of a
PG_TRY()/PG_FINALLY() block, without restoring PG_exception_stack.
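
For context, a minimal illustration of the hazard, independent of the
nodeAppend.c specifics (early_exit is a made-up condition):

	PG_TRY();
	{
		if (early_exit)
			return;				/* WRONG: PG_exception_stack still points
								 * at this function's sigsetjmp() state */
		/* ... */
	}
	PG_FINALLY();
	{
		/* cleanup; skipped entirely on the early return */
	}
	PG_END_TRY();				/* this is what restores PG_exception_stack */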


We found this while working on our FDW. Unfortunately, I couldn't 
reproduce the issue with postgres_fdw, but it seems it is also affected.


The following patch heals the issue.

--
Best regards,
Alexander Pyhalov,
Postgres Professional

From 025f40894d6d8f499144f0f7c45c0a124a46c408 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Fri, 23 Feb 2024 13:06:40 +0300
Subject: [PATCH] Avoid corrupting PG_exception_stack in
 ExecAppendAsyncEventWait()

After commit 555276f8594087ba15e0d58e38cd2186b9f39f6d
ExecAppendAsyncEventWait() could corrupt PG_exception_stack
after returning in the middle of a PG_TRY()/PG_END_TRY() block.
It should exit only after PG_END_TRY().
---
 src/backend/executor/nodeAppend.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index baba3ceea23..42962e80177 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -1052,21 +1052,22 @@ ExecAppendAsyncEventWait(AppendState *node)
 		 */
 		if (GetNumRegisteredWaitEvents(node->as_eventset) == 1)
 		{
-			FreeWaitEventSet(node->as_eventset);
-			node->as_eventset = NULL;
-			return;
+			/* Return after PG_TRY()/PG_END_TRY() block */
+			noccurred = 0;
 		}
+		else
+		{
+			/* Return at most EVENT_BUFFER_SIZE events in one call. */
+			if (nevents > EVENT_BUFFER_SIZE)
+nevents = EVENT_BUFFER_SIZE;
 
-		/* Return at most EVENT_BUFFER_SIZE events in one call. */
-		if (nevents > EVENT_BUFFER_SIZE)
-			nevents = EVENT_BUFFER_SIZE;
-
-		/*
-		 * If the timeout is -1, wait until at least one event occurs.  If the
-		 * timeout is 0, poll for events, but do not wait at all.
-		 */
-		noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
-	 nevents, WAIT_EVENT_APPEND_READY);
+			/*
+			 * If the timeout is -1, wait until at least one event occurs.  If
+			 * the timeout is 0, poll for events, but do not wait at all.
+			 */
+			noccurred = WaitEventSetWait(node->as_eventset, timeout, occurred_event,
+		 nevents, WAIT_EVENT_APPEND_READY);
+		}
 	}
 	PG_FINALLY();
 	{
-- 
2.34.1



Re: A failure in t/001_rep_changes.pl

2024-02-23 Thread vignesh C
On Wed, 14 Feb 2024 at 13:19, Bharath Rupireddy wrote:
>
> Hi,
>
> I recently observed an assertion failure twice in t/001_rep_changes.pl
> on HEAD with the backtrace [1] on my dev EC2 c5.4xlarge instance [2].
> Unfortunately I'm not observing it again. I haven't got a chance to
> dive deep into it. However, I'm posting it here just for the records,
> and in case something can be derived out of the backtrace.
>
> [1] t/001_rep_changes.pl
>
> 2024-01-31 12:24:38.474 UTC [840166]
> pg_16435_sync_16393_7330237333761601891 STATEMENT:
> DROP_REPLICATION_SLOT pg_16435_sync_16393_7330237333761601891 WAIT
> TRAP: failed Assert("list->head != INVALID_PGPROCNO"), File:
> "../../../../src/include/storage/proclist.h", Line: 101, PID: 840166
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(ExceptionalCondition+0xbb)[0x55c8edf6b8f9]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(+0x6637de)[0x55c8edd517de]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(ConditionVariablePrepareToSleep+0x85)[0x55c8edd51b91]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(ReplicationSlotAcquire+0x142)[0x55c8edcead6b]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(ReplicationSlotDrop+0x51)[0x55c8edceb47f]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(+0x60da71)[0x55c8edcfba71]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(exec_replication_command+0x47e)[0x55c8edcfc96a]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(PostgresMain+0x7df)[0x55c8edd7d644]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(+0x5ab50c)[0x55c8edc9950c]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(+0x5aab21)[0x55c8edc98b21]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(+0x5a70de)[0x55c8edc950de]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(PostmasterMain+0x1534)[0x55c8edc949db]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(+0x459c47)[0x55c8edb47c47]
> /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f19fe629d90]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f19fe629e40]
> postgres: publisher: walsender ubuntu postgres [local]
> DROP_REPLICATION_SLOT(_start+0x25)[0x55c8ed7c4565]
> 2024-01-31 12:24:38.476 UTC [840168]
> pg_16435_sync_16390_7330237333761601891 LOG:  statement: SELECT
> a.attnum,   a.attname,   a.atttypid,   a.attnum =
> ANY(i.indkey)  FROM pg_catalog.pg_attribute a  LEFT JOIN
> pg_catalog.pg_index i   ON (i.indexrelid =
> pg_get_replica_identity_index(16391)) WHERE a.attnum >
> 0::pg_catalog.int2   AND NOT a.attisdropped AND a.attgenerated = ''
> AND a.attrelid = 16391 ORDER BY a.attnum
>
> [2] Linux ip-000-00-0-000 6.2.0-1018-aws #18~22.04.1-Ubuntu SMP Wed
> Jan 10 22:54:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

By any chance do you have the log files from when this failure occurred?
If so, please share them.

Regards,
Vignesh




Re: Synchronizing slots from primary to standby

2024-02-23 Thread Bertrand Drouvot
Hi,

On Fri, Feb 23, 2024 at 04:36:44AM +, Zhijie Hou (Fujitsu) wrote:
> On Friday, February 23, 2024 10:18 AM Zhijie Hou (Fujitsu) 
>  wrote:
> > >
> > > Hi,
> > >
> > > Since the slotsync worker patch has been committed, I rebased the
> > > remaining patches.
> > > And here is the V95 patch set.
> > >
> > > Also, I fixed a bug in the current 0001 patch where the member of the
> > > standby slot names list pointed to the freed memory after calling
> > ProcessConfigFile().
> > > Now, we will obtain a new list when we call ProcessConfigFile(). The
> > > optimization to only get the new list when the names actually change
> > > has been removed. I think this change is acceptable because
> > > ProcessConfigFile is not a frequent occurrence.
> > >
> > > Additionally, I reordered the tests in
> > > 040_standby_failover_slots_sync.pl. Now the new test will be conducted
> > > after the sync slot test to prevent the risk of the logical slot
> > > occasionally not catching up to the latest catalog_xmin and, as a result, 
> > > not
> > being able to be synced immediately.
> > 
> > There is one unexpected change in the previous version, sorry for that.
> > Here is the correct version.
> 
> I noticed one CFbot failure[1] which is because the tap-test doesn't wait for 
> the
> standby to catch up before promoting, thus the data inserted after promotion
> could not be replicated to the subscriber. Add a wait_for_replay_catchup to 
> fix it.
> 
> Apart from this, I also adjusted some variable names in the tap-test to be
> consistent. And added back a mis-removed ProcessConfigFile call.

Thanks!

Here are some random comments:

1 ===

Commit message "Allow logical walsenders to wait for the physical" 

s/physical/physical standby/?

2 ==

+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -30,6 +30,7 @@
 #include "replication/decode.h"
 #include "replication/logical.h"
 #include "replication/message.h"
+#include "replication/walsender.h"

Is this include needed?

3 ===

+* Slot sync is currently not supported on the cascading standby. This 
is

s/on the/on a/?

4 ===

+   if (!ok)
+   GUC_check_errdetail("List syntax is invalid.");
+
+   /*
+* If there is a syntax error in the name or if the replication slots'
+* data is not initialized yet (i.e., we are in the startup process), 
skip
+* the slot verification.
+*/
+   if (!ok || !ReplicationSlotCtl)
+   {
+   pfree(rawname);
+   list_free(elemlist);
+   return ok;
+   }

we are testing the "ok" value twice, what about using if...else if... instead
and testing it once? If so, it might be worth putting the:

"
+   pfree(rawname);
+   list_free(elemlist);
+   return ok;
"

in a "goto".

5 ===

+* for which all standbys to wait for. Even if we have physical-slots

s/physical-slots/physical slots/?

6 ===

* Switch to the same memory context under which GUC variables are

s/to the same memory/to the memory/?

7 ===

+ * Return a copy of standby_slot_names_list if the copy flag is set to true,

Not sure, but would it be worth explaining why one would want to set the flag to
true or false? (i.e. why one would not want to receive the original list).

8 ===

+   if (RecoveryInProgress())
+   return NIL;

The need is well documented just above, but are we not violating the fact that
we return the original list or a copy of it? (that's what the comment above
the GetStandbySlotList() function definition is saying).

I think the comment above the GetStandbySlotList() function needs a bit of
rewording to cover that case.

9 ===

+* harmless, a WARNING should be enough, no need to 
error-out.

s/error-out/error out/?

10 ===

+   if (slot->data.invalidated != RS_INVAL_NONE)
+   {
+   /*
+* Specified physical slot have been 
invalidated, so no point
+* in waiting for it.

We discovered in [1], that if the wal_status is "unreserved" then the slot is 
still serving the standby. I think we should handle this case differently,
thoughts?

11 ===

+* Specified physical slot have been 
invalidated, so no point

s/have been/has been/?

12 ===

+++ b/src/backend/replication/slotfuncs.c
@@ -22,6 +22,7 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "replication/slotsync.h"
+#include "replication/walsender.h"

Is this include needed?

[1]: 
https://www.postgresql.org/message-id/CALj2ACWE9asmvN1B18LqfHE8uBuWGsCEP7OO5trRCxPtTPeHVA%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve readability by using designated initializers when possible

2024-02-23 Thread Jelte Fennema-Nio
On Fri, 23 Feb 2024 at 02:57, Jeff Davis  wrote:
> Sorry, I was unclear. I was asking a question about the reason the
> ObjectClass and the object_classes[] array exist in the current code,
> it wasn't a direct question about your patch.

I did a bit of git spelunking, and the reason seems to be that back in
2002, when this was introduced, not all relation OIDs were compile-time
constants, so the array was initialized once at bootup. I totally
agree with you that these days there's no reason for the array. So I
have now added a second patch that removes this array, instead of updating
it to use the designated initializer syntax.
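
For reference, the designated initializer syntax in question, with a couple
of illustrative entries (the real array has one entry per ObjectClass):

	static const Oid object_classes[] = {
		[OCLASS_CLASS] = RelationRelationId,
		[OCLASS_PROC] = ProcedureRelationId,
		[OCLASS_TYPE] = TypeRelationId,
		/* ... */
	};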


v2-0001-Remove-unnecessary-object_classes-array.patch
Description: Binary data


v2-0002-Use-designated-initializer-syntax-to-improve-read.patch
Description: Binary data


RE: Synchronizing slots from primary to standby

2024-02-23 Thread Zhijie Hou (Fujitsu)
On Friday, February 23, 2024 1:22 PM shveta malik  
wrote:

> 
> Thanks for the patches. Had a quick look at v95_2, here are some
> trivial comments:

Thanks for the comments.

> 6) streaming replication standby server slot names that logical walsender
> processes will wait for
> 
> Is it better to say it like this? (I leave this to your preference)
> 
> streaming replication standby server slot names for which logical
> walsender processes will wait.

I feel the current one seems better, so I didn't change it. The other comments
have been addressed. Here is the V97 patch set which addresses Shveta's comments.


Besides, I'd like to clarify and discuss the behavior of standby_slot_names 
once.

As it stands in the patch, If the slots specified in standby_slot_names are
dropped or invalidated, the logical walsender will issue a WARNING and continue
to replicate the changes. Another option for this could be to have the
walsender pause until the slot in standby_slot_names is re-created or becomes
valid again. Does anyone else have an opinion on this matter ?

Best Regards,
Hou zj





v97-0002-Document-the-steps-to-check-if-the-standby-is-re.patch
Description:  v97-0002-Document-the-steps-to-check-if-the-standby-is-re.patch


v97-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch
Description:  v97-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch


Re: how to read table options during smgropen()

2024-02-23 Thread Heikki Linnakangas

On 22/02/2024 20:22, Dima Rybakov (Tlt) wrote:

> Dear pgsql hackers,
>
> I am developing custom storage for pgsql tables. I am using md*
> functions and the smgrsw[] structure to switch between different magnetic
> disk access methods.
>
> I want to add some custom options when a table is created:
> psql# create table t(...) with (my_option='value');
>
> And thus I want to set "reln->smgr_which" conditionally during
> smgropen(). If my_option='value' I would use another smgr_which.
>
> I am really stuck at this point.
>
> smgr.c:
> SMgrRelation
> smgropen(RelFileNode rnode, BackendId backend){
> ...
>    if ( HasOption(rnode, "my_option","value")){ //<< how to implement this check ?
>      reln->smgr_which = 1; //new access method
>    }else{
>      reln->smgr_which = 0; //old access method
>    }
> ...
> }
>
> The question is --- can I read table options while the table is
> identified by "RelFileNode rnode"?


The short answer is that you cannot. smgropen() operates at a lower
level, and doesn't have access to the catalogs. smgropen() can be called
by different backends connected to different databases, and even during WAL
recovery, when the system is not in a consistent state yet.


Take a look at the table AM interface. It sounds like it might be a 
better fit for what you're doing.
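
For example, a per-table storage choice is exposed through a table AM
handler; a minimal skeleton looks like this (the my_* names are ours, and
the TableAmRoutine callbacks still have to be filled in):

	#include "postgres.h"
	#include "fmgr.h"
	#include "access/tableam.h"

	PG_MODULE_MAGIC;

	static TableAmRoutine my_tableam_methods;	/* callbacks filled in elsewhere */

	PG_FUNCTION_INFO_V1(my_tableam_handler);

	Datum
	my_tableam_handler(PG_FUNCTION_ARGS)
	{
		my_tableam_methods.type = T_TableAmRoutine;
		PG_RETURN_POINTER(&my_tableam_methods);
	}

	/*
	 * Then, in SQL:
	 *   CREATE ACCESS METHOD my_am TYPE TABLE HANDLER my_tableam_handler;
	 *   CREATE TABLE t (...) USING my_am;
	 */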


There have been a few threads here on pgsql-hackers on making the smgr 
interface extensible, see 
https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com 
one recent patch. That thread concluded that it's difficult to make it a 
per-tablespace option, let alone per-table.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: make BuiltinTrancheNames less ugly

2024-02-23 Thread Heikki Linnakangas

On 12/02/2024 19:01, Tristan Partin wrote:
> On Wed Jan 24, 2024 at 8:09 AM CST, Alvaro Herrera wrote:
>> IMO it would be less ugly to have the origin file lwlocknames.txt be
>> not a text file but a .h with a macro that can be defined by
>> interested parties so that they can extract what they want from the
>> file, like PG_CMDTAG or PG_KEYWORD.  Using Perl for this seems dirty...
>
> I really like this idea, and would definitely be more inclined to see
> a solution like this.

+1 to that idea from me too. Seems pretty straightforward.

--
Heikki Linnakangas
Neon (https://neon.tech)





RE: Synchronizing slots from primary to standby

2024-02-23 Thread Zhijie Hou (Fujitsu)
On Friday, February 23, 2024 5:07 PM Bertrand Drouvot 
 wrote:
> 
> On Fri, Feb 23, 2024 at 02:15:11PM +0530, shveta malik wrote:
> > On Fri, Feb 23, 2024 at 1:28 PM Bertrand Drouvot
> >  wrote:
> > >
> > > Hi,
> > >
> > > Because one could create say the "=" OPERATOR in their own schema,
> > > attach a function to it doing undesired stuff and change the
> > > search_path for the database the sync slot worker connects to.
> > >
> > > Then this new "=" operator would be used (instead of the
> > > pg_catalog.= one), triggering the "undesired" function as superuser.
> >
> > Thanks for the details. I understand it now.  We do not use '=' in our
> > main slots-fetch query but we do use '=' in remote-validation query.
> > See validate_remote_info().
> 
> Oh, right, I missed it during the review.
> 
> > Do you think instead of doing the above, we can override search-path
> > with empty string in the slot-sync case.
> > Similar to the logical apply worker and autovacuum worker cases (see
> > InitializeLogRepWorker(), AutoVacWorkerMain()).
> 
> Yeah, we should definitively ensure that any operators being used in the query
> is coming from the pg_catalog schema (could be by setting the search path or
> using the up-thread proposal).
> 
> Setting the search path would prevent any risks in case the query is changed
> later on, so I'd vote for changing the search path in validate_remote_info() 
> and
> in synchronize_slots() to be on the safe side.

I think to set secure search path for remote connection, the standard approach
could be to extend the code in libpqrcv_connect[1], so that we don't need to 
schema
qualify all the operators in the queries.

And for local connection, I agree it's also needed to add a
SetConfigOption("search_path", "" call in the slotsync worker.

[1]
libpqrcv_connect
...
if (logical)
...
res = libpqrcv_PQexec(conn->streamConn,
  
ALWAYS_SECURE_SEARCH_PATH_SQL);

Best Regards,
Hou zj


Re: Improve eviction algorithm in ReorderBuffer

2024-02-23 Thread vignesh C
On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada  wrote:
>
> On Thu, Feb 8, 2024 at 6:33 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Sawada-san,
> >
> > Thanks for making v3 patchset. I have also benchmarked the case [1].
> > Below results are the average of 5th, there are almost the same result
> > even when median is used for the comparison. On my env, the regression
> > cannot be seen.
> >
> > HEAD (1e285a5)  HEAD + v3 patches   difference
> > 10910.722 ms10714.540 msaround 1.8%
> >
>
> Thank you for doing the performance test!
>
> > Also, here are mino comments for v3 set.
> >
> > 01.
> > bh_nodeidx_entry and ReorderBufferMemTrackState is missing in typedefs.list.
>
> Will add them.
>
> >
> > 02. ReorderBufferTXNSizeCompare
> > Should we assert {ta, tb} are not NULL?
>
> Not sure we really need it as other binaryheap users don't have such checks.
>
> On Tue, Feb 6, 2024 at 2:45 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > > I've run a benchmark test that I shared before[1]. Here are results of
> > > decoding a transaction that has 1M subtransaction each of which has 1
> > > INSERT:
> > >
> > > HEAD:
> > > 1810.192 ms
> > >
> > > HEAD w/ patch:
> > > 2001.094 ms
> > >
> > > I set a large enough value to logical_decoding_work_mem not to evict
> > > any transactions. I can see about about 10% performance regression in
> > > this case.
> >
> > Thanks for running. I think this workload is the worst and an extreme case
> > which would not occur on a real system (such a system should be fixed), so we
> > can say that the regression is up to -10%. I felt it could be negligible,
> > but what do others think?
>
> I think this performance regression is not acceptable. In this
> workload, one transaction has 10k subtransactions and the logical
> decoding becomes quite slow if logical_decoding_work_mem is not big
> enough. Therefore, it's a legitimate and common approach to increase
> logical_decoding_work_mem to speedup the decoding. However, with thie
> patch, the decoding becomes slower than today. It's a bad idea in
> general to optimize an extreme case while sacrificing the normal (or
> more common) cases.
>

Since this same function is also used by pg_dump's TopoSort function, we
should just verify once that there is no performance impact when sorting a
large number of objects during a dump:
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-   int sz;
binaryheap *heap;

-   sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-   heap = (binaryheap *) palloc(sz);
+   heap = (binaryheap *) palloc(sizeof(binaryheap));
heap->bh_space = capacity;
heap->bh_compare = compare;
heap->bh_arg = arg;

heap->bh_size = 0;
heap->bh_has_heap_property = true;
+   heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);

Regards,
Vignesh




Re: Synchronizing slots from primary to standby

2024-02-23 Thread Bertrand Drouvot
Hi,

On Fri, Feb 23, 2024 at 02:15:11PM +0530, shveta malik wrote:
> On Fri, Feb 23, 2024 at 1:28 PM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > Because one could create say the "=" OPERATOR in their own schema, attach a
> > function to it doing undesired stuff and change the search_path for the 
> > database
> > the sync slot worker connects to.
> >
> > Then this new "=" operator would be used (instead of the pg_catalog.= one),
> > triggering the "undesired" function as superuser.
> 
> Thanks for the details. I understand it now.  We do not use '=' in our
> main slots-fetch query but we do use '=' in remote-validation query.
> See validate_remote_info().

Oh, right, I missed it during the review.

> Do you think instead of doing the above,
> we can override search-path with empty string in the slot-sync case.
> Similar to the logical apply worker and autovacuum worker cases (see
> InitializeLogRepWorker(), AutoVacWorkerMain()).

Yeah, we should definitively ensure that any operators being used in the query
is coming from the pg_catalog schema (could be by setting the search path or
using the up-thread proposal).

Setting the search path would prevent any risks in case the query is changed
later on, so I'd vote for changing the search path in validate_remote_info()
and in synchronize_slots() to be on the safe side.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add publisher and subscriber to glossary documentation.

2024-02-23 Thread Shlok Kyal
> Here are some comments for patch v2.
>
> ==
>
> 1. There are whitespace problems
>
> [postgres@CentOS7-x64 oss_postgres_misc]$ git apply
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:43:
> trailing whitespace.
>   A node where publication is defined for
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:45:
> trailing whitespace.
>   It replicates a set of changes from a table or a group of tables in
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:66:
> trailing whitespace.
>  A node where subscription is defined for
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:67:
> trailing whitespace.
>  logical 
> replication.
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:68:
> trailing whitespace.
>  It subscribes to one or more publications on a publisher node and pulls
> warning: squelched 2 whitespace errors
> warning: 7 lines add whitespace errors.
>
> ~~~
> 2. Publisher node
>
> +  
> +   Publisher node
> +   
> +
> +  A node where publication is defined for
> +  logical
> replication.
> +  It replicates a set of changes from a table or a group of tables in
> +  publication to the subscriber node.
> +
> +
> + For more information, see
> + .
> +
> +   
> +  
> +
>
>
> 2a.
> I felt the term here should be "Publication node".
>
> Indeed, there should also be additional glossary terms like
> "Publisher" (i.e. it is the same as "Publication node") and
> "Subscriber" (i.e. it is the same as "Subscription node"). These
> definitions will then be consistent with node descriptions already in
> sections 30.1 and 30.2
>
>
> ~
>
> 2b.
> "node" should include a link to the glossary term. Same for any other
> terms mentioned
>
> ~
>
> 2c.
> /A node where publication is defined for/A node where a publication is
> defined for/
>
> ~
>
> 2d.
> "It replicates" is misleading because it is the PUBLICATION doing the
> replicating, not the node.
>
> IMO it will be better to include 2 more glossary terms "Publication"
> and "Subscription" where you can say this kind of information. Then
> the link "logical-replication-publication" also belongs under the
> "Publication" term.
>
>
> ~~~
>
> 3.
> +  
> +   Subscriber node
> +   
> +
> + A node where subscription is defined for
> + logical 
> replication.
> + It subscribes to one or more publications on a publisher node and pulls
> + a set of changes from a table or a group of tables in publications it
> + subscribes to.
> +
> +
> + For more information, see
> + .
> +
> +   
> +  
>
> All comments are similar to those above.
>
> ==
>
> In summary, IMO it should be a bit more like below:
>
> SUGGESTION (include all the necessary links etc)
>
> Publisher
> See "Publication node"
>
> Publication
> A publication replicates the changes of one or more tables to a
> subscription. For more information, see Section 30.1
>
> Publication node
> A node where a publication is defined for logical replication.
>
> Subscriber
> See "Subscription node"
>
> Subscription
> A subscription receives the changes of one or more tables from the
> publications it subscribes to. For more information, see Section 30.2
>
> Subscription node
> A node where a subscription is defined for logical replication.

I have addressed the comments and added an updated patch.

Thanks and Regards,
Shlok Kyal


v3-0001-Add-publisher-and-subscriber-to-glossary-document.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-02-23 Thread shveta malik
On Fri, Feb 23, 2024 at 1:28 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> Because one could create say the "=" OPERATOR in their own schema, attach a
> function to it doing undesired stuff and change the search_path for the 
> database
> the sync slot worker connects to.
>
> Then this new "=" operator would be used (instead of the pg_catalog.= one),
> triggering the "undesired" function as superuser.

Thanks for the details. I understand it now.  We do not use '=' in our
main slots-fetch query but we do use '=' in remote-validation query.
See validate_remote_info(). Do you think instead of doing the above,
we can override search-path with empty string in the slot-sync case.
SImilar to logical apply worker and autovacuum worker case (see
InitializeLogRepWorker(), AutoVacWorkerMain()).

thanks
Shveta