Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread Andres Freund
Hi,

On 2024-03-05 16:41:30 +0700, John Naylor wrote:
> I'd like to push 0001 and 0002 shortly, and then do another sweep over
> 0003, with remaining feedback, and get that in so we get some
> buildfarm testing before the remaining polishing work on
> tidstore/vacuum.

A few ARM buildfarm animals are complaining:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18

Greetings,

Andres Freund




Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-05 Thread Mats Kindahl
On Wed, Mar 6, 2024 at 3:27 AM Andrei Lepikhov 
wrote:

> On 6/3/2024 06:25, Michael Paquier wrote:
> >> Just to elaborate: the intention was to allow a section to be added to
> >> every node in the plan containing information from further down and also
> >> allow this information to propagate upwards. We happen to have buffer
> >> information right now, but allowing something similar to be added
> >> dynamically by extending ExplainNode and passing down a callback to
> >> standard_ExplainOneQuery.
> >
> > Or an extra hook at the end of ExplainNode() to be able to append more
> > information at node level?  Not sure if others would agree with that,
> > though.
>

That is what I had in mind, yes.
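
For illustration, the kind of hook being discussed could look roughly like
the following.  None of these names exist in PostgreSQL today; this is only
a sketch of the idea, mirroring ExplainNode()'s current parameters:

    /* Hypothetical API sketch -- not an existing hook. */
    typedef void (*explain_per_node_hook_type) (PlanState *planstate,
                                                List *ancestors,
                                                const char *relationship,
                                                const char *plan_name,
                                                ExplainState *es);
    extern PGDLLIMPORT explain_per_node_hook_type explain_per_node_hook;

    /* ... and at the end of ExplainNode(): */
    if (explain_per_node_hook)
        (*explain_per_node_hook) (planstate, ancestors, relationship,
                                  plan_name, es);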


> We already discussed EXPLAIN hooks, at least in [1]. IMO, extensions
> should have a chance to add something to the node explain and the
> summary, if only because they can significantly influence the planner
> and executor's behaviour.
>
> [1]
>
> https://www.postgresql.org/message-id/flat/6cd5caa7-06e1-4460-bf35-00a59da3f677%40garret.ru


This is an excellent example of where such a hook would be useful.
-- 
Best wishes,
Mats Kindahl, Timescale


Re: UUID v7

2024-03-05 Thread Peter Eisentraut

On 30.01.24 14:35, Andrey M. Borodin wrote:

On 30 Jan 2024, at 15:33, Junwang Zhao  wrote:

It's always good to add a newline at the end of a  source file, though
this might be nitpicky.


Thanks, also fixed warning found by CFBot.


I have various comments on this patch:


- doc/src/sgml/func.sgml

The documentation of the new functions should be broken up a bit.
It's all one paragraph now.  At least make it several paragraphs, or
possibly tables or something else.

Avoid listing the functions twice: Once before the description and
then again in the description.  That's just going to get out of date.
The first listing is not necessary, I think.

The return values in the documentation should use the public-facing
type names, like "timestamp with time zone" and "smallint".

The descriptions of the UUID generation functions use handwavy
language, like "It provides much better data
locality" or "unacceptable security or business intelligence
implications", which isn't useful.  Either we cut that all out and
just say, it creates a UUIDv7, done, look elsewhere for more
information, or we provide some more concretely useful details.

We shouldn't label a link as "IETF standard" when it's actually a
draft.


- src/include/catalog/pg_proc.dat

The description of uuidv4 should be "generate UUID version 4", so that
it parallels uuidv7.

The description of uuid_extract_time says 'extract timestamp from UUID
version 7', but the implementation is not limited to version 7.

I think uuid_extract_time should be named uuid_extract_timestamp,
because it extracts a timestamp, not a time.

The functions uuid_extract_ver and uuid_extract_var could be named
uuid_extract_version and uuid_extract_variant.  Otherwise, it's hard
to tell them apart, with only one letter different.


- src/test/regress/sql/uuid.sql

Why are the tests using the input format '{...}', which is not the
standard one?


- src/backend/utils/adt/uuid.c

All this new code should have more comments.  There is a lot of bit
twiddling going on, and I suppose one is expected to follow along in
the RFC?  At least each function should have a header comment, so one
doesn't have to check in pg_proc.dat what it's supposed to do.

I'm suspicious that these functions all appear to return null for
erroneous input, rather than raising errors.  I think at least some
explanation for this should be recorded somewhere.

I think the behavior of uuid_extract_var(iant) is wrong.  The code
takes just two bits to return, but the draft document is quite clear
that the variant is 4 bits (see Table 1).
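
For illustration, returning the full nibble would look something like this
(a sketch only, assuming the in-core pg_uuid_t layout; it is not the
patch's actual code):

    /*
     * The variant field lives in the most significant bits of octet 8; per
     * Table 1 of the draft it is 4 bits wide, so return the whole nibble.
     */
    static inline uint8
    uuid_extract_variant_nibble(const pg_uuid_t *uuid)
    {
        return (uuid->data[8] >> 4) & 0x0F;
    }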

The uuidv7 function could really use a header comment that explains
the choices that were made.  The RFC draft provides various options
that implementations could use; we should describe which ones we
chose.

I would have expected that, since gettimeofday() provides microsecond
precision, we'd put the extra precision into "rand_a" as per Section 6.2 
method 3.
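
To illustrate that method: take the microsecond timestamp, use the
millisecond part for unix_ts_ms, and scale the leftover microseconds onto
the 12-bit "rand_a" field.  A sketch with made-up names, not the patch's
code:

    static void
    uuidv7_time_fields(uint64 epoch_us, uint64 *unix_ts_ms, uint16 *rand_a)
    {
        /* 48-bit millisecond timestamp */
        *unix_ts_ms = epoch_us / 1000;
        /* sub-millisecond fraction scaled onto 12 bits (0..4095) */
        *rand_a = (uint16) (((epoch_us % 1000) * 4096) / 1000);
    }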


You use some kind of counter, but could you explain which method that
counter implements?

I don't see any acknowledgment of issues relating to concurrency or
restarts.  Like, how do we prevent duplicates being generated by
concurrent sessions or between restarts?  Maybe the counter or random
stuff does that, but it's not explained.





broken JIT support on Fedora 40

2024-03-05 Thread Pavel Stehule
Hi

after today's update, a build with --with-llvm produces broken code, and
make check fails with a crash

Upgrade    hwdata-0.380-1.fc40.noarch               @updates-testing
Upgraded   hwdata-0.379-1.fc40.noarch               @@System
Upgrade    llvm-18.1.0~rc4-1.fc40.x86_64            @updates-testing
Upgraded   llvm-17.0.6-6.fc40.x86_64                @@System
Upgrade    llvm-devel-18.1.0~rc4-1.fc40.x86_64      @updates-testing
Upgraded   llvm-devel-17.0.6-6.fc40.x86_64          @@System
Upgrade    llvm-googletest-18.1.0~rc4-1.fc40.x86_64 @updates-testing
Upgraded   llvm-googletest-17.0.6-6.fc40.x86_64     @@System
Upgrade    llvm-libs-18.1.0~rc4-1.fc40.i686         @updates-testing
Upgraded   llvm-libs-17.0.6-6.fc40.i686             @@System
Upgrade    llvm-libs-18.1.0~rc4-1.fc40.x86_64       @updates-testing
Upgraded   llvm-libs-17.0.6-6.fc40.x86_64           @@System
Upgrade    llvm-static-18.1.0~rc4-1.fc40.x86_64     @updates-testing
Upgraded   llvm-static-17.0.6-6.fc40.x86_64         @@System
Upgrade    llvm-test-18.1.0~rc4-1.fc40.x86_64       @updates-testing
Install    llvm17-libs-17.0.6-7.fc40.i686           @updates-testing
Install    llvm17-libs-17.0.6-7.fc40.x86_64         @updates-testing

Regards

Pavel


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread Masahiko Sawada
On Wed, Mar 6, 2024 at 12:59 PM John Naylor  wrote:
>
> On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada  wrote:
> >
> > On Tue, Mar 5, 2024 at 6:41 PM John Naylor  wrote:
>
> > > Done in v66-0009. I'd be curious to hear any feedback. I like the
> > > aspect that the random numbers come from a different seed every time
> > > the test runs.
> >
> > The new tests look good. Here are some comments:
> >
> > ---
> > +   expected = keys[i];
> > +   iterval = rt_iterate_next(iter, &iterkey);
> >
> > -   ndeleted++;
> > +   EXPECT_TRUE(iterval != NULL);
> > +   EXPECT_EQ_U64(iterkey, expected);
> > +   EXPECT_EQ_U64(*iterval, expected);
> >
> > Can we verify that the iteration returns keys in ascending order?
>
> We get the "expected" value from the keys we saved in the now-sorted
> array, so we do already. Unless I misunderstand you.

Ah, you're right. Please ignore this comment.

>
> > ---
> > radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
> >   "test_radix_tree",
> >   ALLOCSET_DEFAULT_SIZES);
> >
> > We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
> > think it's better to use either one for consistency.
>
> Will change to "small", since 32-bit platforms will use slab for leaves.

Agreed.

>
> I'll look at the memory usage and estimate what 32-bit platforms will
> use, and maybe adjust the number of keys. A few megabytes is fine, but
> not many megabytes.

Thanks, sounds good.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-03-05 Thread Masahiko Sawada
On Fri, Mar 1, 2024 at 3:22 PM Peter Smith  wrote:
>
> On Fri, Mar 1, 2024 at 5:11 PM Masahiko Sawada  wrote:
> >
> ...
> > +/*
> > + * "*" is not accepted as in that case primary will not be able to 
> > know
> > + * for which all standbys to wait for. Even if we have physical 
> > slots
> > + * info, there is no way to confirm whether there is any standby
> > + * configured for the known physical slots.
> > + */
> > +if (strcmp(*newval, "*") == 0)
> > +{
> > +GUC_check_errdetail("\"*\" is not accepted for standby_slot_names");
> > +return false;
> > +}
> >
> > Why only '*' is checked aside from validate_standby_slots()? I think
> > that the doc doesn't mention anything about '*' and '*' cannot be used
> > as a replication slot name. So even if we don't have this check, it
> > might be no problem.
> >
>
> Hi, a while ago I asked this same question. See [1 #28] for the response..

Thanks. Quoting the response from the email:

SplitIdentifierString() does not give error for '*' and '*' can be considered
as valid value which if accepted can mislead user that all the standbys's slots
are now considered, which is not the case here. So we want to explicitly call
out this case i.e. '*' is not accepted as valid value for standby_slot_names.

IIUC we're concerned with a case like where the user confused
standby_slot_names values with synchronous_standby_names values. Which
means we would need to keep thath check consistent with available
values of synchronous_standby_names. For example, if we support a
regexp for synchronous_standby_names, we will have to update the check
so we disallow other special characters. Also, if we add a new
replication-related parameter that accepts other special characters as
the value in the future, will we want to raise an error also for such
values in check_standby_slot_names()?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-03-05 Thread Michael Paquier
On Tue, Mar 05, 2024 at 05:18:08PM +0900, Sutou Kouhei wrote:
> I'll send the following patches after this patch is
> merged.

I am not sure that my schedule is on track to allow that for this
release, unfortunately, especially with all the other items to review
and discuss to make this thread feature-complete.  There should be
a bit more than four weeks until the feature freeze (date not set in
stone, should be around the 8th of April AoE), but I have less than
half of that available due to personal issues.  Perhaps if somebody
jumps on this thread, that will be possible.

> They are based on the v6 patch[1]:
> 
> 1. Add copy_handler
>* This also adds a pg_proc lookup for custom FORMAT
>* This also adds a test for copy_handler
> 2. Export CopyToStateData
>* We need it to implement custom copy TO handler
> 3. Add needed APIs to implement custom copy TO handler
>* Add CopyToStateData::opaque
>* Export CopySendEndOfRow()
> 4. Export CopyFromStateData
>* We need it to implement custom copy FROM handler
> 5. Add needed APIs to implement custom copy FROM handler
>* Add CopyFromStateData::opaque
>* Export CopyReadBinaryData()

Hmm.  Sounds like a good plan for a split.
--
Michael




Re: Injection points: some tools to wait and wake

2024-03-05 Thread Michael Paquier
On Tue, Mar 05, 2024 at 09:43:03AM +0100, Jelte Fennema-Nio wrote:
> I do think there is quite a bit of a difference from a user
> perspective to providing a few custom configure flags and having to go
> to a separate directory and run "make install" there too. Simply
> because it's yet another step. For dev environments most/all of my
> team uses pgenv: https://github.com/theory/pgenv There I agree we
> could add functionality to it to also install test modules when given
> a certain flag/environment variable, but it would be nice if that
> wasn't needed.

Didn't know this one, TBH.  That's interesting.  My own scripts
emulate the same things with versioning, patching, etc.

> One big downside to not having it in contrib is that we also run tests
> of Citus against official PGDG postgres packages and those would
> likely not include these test modules, so we wouldn't be able to run
> all our tests then.

No idea about this part, unfortunately.  One thing that I'd do is
first check if the in-core module is enough to satisfy the
requirements Citus would want.  I got pinged about Timescale and
greenplum recently, and they can reuse the backends APIs, but they've
also wanted a few more callbacks than the in-core module so they will
need to have their own code for tests with a custom callback library.
Perhaps we could move that to contrib/ and document that this is a
module for testing, that can be updated without notice even in minor
upgrades and that there are no version upgrades done, or something
like that.  I'm open to that if there's enough demand for it, but I
don't know how much we should accommodate within the existing
requirements of contrib/ for something that's only developer-oriented.

> Also I think the injection points extension is quite different from
> the other modules in src/modules/test. All of these other modules are
> basically test binaries that need to be separate modules, but they
> don't provide any useful functionality when installing them. The
> injection_poinst one actually provides useful functionality itself,
> that can be used by other people testing things against postgres.

I'm not sure about that yet, TBH.  Anything that gets added to this
module should be used in some way by the in-core tests, or just not be
there.  That's a line I don't want to cross, which is why it's a test
module.  FWIW, it would be really annoying to have documentation
requirements, because that increases maintenance and I'm not sure it's
a good idea to add module maintenance work on top of what could require
more facilities in the module to implement a test for a bug fix.

> What would I need to do for meson builds? I tried the following
> command but it doesn't seem to actually install the injection points
> command:
> ninja -C build install-test-files

Weird, that works here.  I was digging into the meson tree and noticed
that this was the only run_target() related to the test module install
data, and injection_points gets installed as well as all the other
modules.  This should just need -Dinjection_points=true.
--
Michael




Re: Shared detoast Datum proposal

2024-03-05 Thread Nikita Malakhov
Hi,

In addition to the previous message: for toast_fetch_datum_slice, the
first (seemingly obvious) way is to detoast the whole value, save it to
the cache, and serve slices from it on demand. I have another approach
in mind, but I have to play with it first.

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-03-05 Thread Michael Paquier
On Tue, Mar 05, 2024 at 10:17:03AM +, Bertrand Drouvot wrote:
> On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
>> First, in a build where 818fefd8fd is included, this makes the test
>> script a lot slower.  Most of the logic is quick, but we're spending
>> 10s or so checking that catalog_xmin has advanced.  Could it be
>> possible to make that faster?
> 
> Yeah, v2 attached changes this. It moves the injection point after the process
> has been killed so that another process can decode at wish (without the need
> to wait for a walsender timeout) to reach LogicalConfirmReceivedLocation().

Ah, OK.  Indeed that's much faster this way.

> I tried to simulate a revert of 818fefd8fd (replacing "!terminated" by "1 == 1"
> before the initial_* assignments).

Yeah.  I can see how this messes up the calculation of the conditions,
which is enough from my perspective, even if we didn't have any sanity
checks in 818fefd8fd originally.

> The issue is that then the new ASSERT is
> triggered leading to the standby shutdown. So, I'm not sure how to improve 
> this
> case.

It's been mentioned recently that we are not good at detecting crashes
in the TAP tests.  I am wondering if we should check the status of the
node in the most popular routines of Cluster.pm and die hard, as one
way to make the tests more responsive.  A topic for a different
thread, for sure.

> Done in v2.

Reworded a few things and applied this version.
--
Michael




Potential stack overflow in incremental base backup

2024-03-05 Thread Thomas Munro
Hi Robert,

I was rebasing my patch to convert RELSEG_SIZE into an initdb-time
setting, and thus runtime variable, and I noticed this stack variable
in src/backend/backup/basebackup_incremental.c:

GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
BlockNumber *relative_block_numbers,
unsigned *truncation_block_length)
{
BlockNumber absolute_block_numbers[RELSEG_SIZE];

I'll have to move that sucker onto the heap (we banned C99 variable
length arrays and we don't use nonstandard alloca()), but I think the
coding in master is already a bit dodgy: that's 131072 *
sizeof(BlockNumber) = 512kB with default configure options, but
--with-segsize X could create a stack variable up to 16GB,
corresponding to segment size 32TB (meaning no segmentation at all).
That wouldn't work.  Shouldn't we move it to the heap?  See attached
draft patch.

Even on the heap, 16GB is too much to assume we can allocate during a
base backup.  I don't claim that's a real-world problem for
incremental backup right now in master, because I don't have any
evidence that anyone ever really uses --with-segsize (do they?), but
if we make it an initdb option it will be more popular and this will
become a problem.  Hmm.
From 1d183245e9676ef45ca6a93e7d442ee903a2a14c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 6 Mar 2024 17:44:19 +1300
Subject: [PATCH] Fix potential stack overflow in incremental basebackup.

Since the user can set RELSEG_SIZE to a high number at compile time,
it seems unwise to use it to size an array on the stack.
---
 src/backend/backup/basebackup_incremental.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index ebc41f28be5..3d0e7894fc3 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -28,6 +28,7 @@
 #include "common/int.h"
 #include "datatype/timestamp.h"
 #include "postmaster/walsummarizer.h"
+#include "storage/smgr.h"
 #include "utils/timestamp.h"
 
 #define	BLOCKS_PER_READ			512
@@ -712,7 +713,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	BlockNumber *relative_block_numbers,
 	unsigned *truncation_block_length)
 {
-	BlockNumber absolute_block_numbers[RELSEG_SIZE];
+	BlockNumber *absolute_block_numbers;
 	BlockNumber limit_block;
 	BlockNumber start_blkno;
 	BlockNumber stop_blkno;
@@ -839,8 +840,10 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 errcode(ERRCODE_INTERNAL_ERROR),
 errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
 segno, size));
+	absolute_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
 	nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
-		  absolute_block_numbers, RELSEG_SIZE);
+		  absolute_block_numbers,
+		  RELSEG_SIZE);
 	Assert(nblocks <= RELSEG_SIZE);
 
 	/*
@@ -856,7 +859,10 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	 * nothing good about sending an incremental file in that case.
 	 */
 	if (nblocks * BLCKSZ > size * 0.9)
+	{
+		pfree(absolute_block_numbers);
 		return BACK_UP_FILE_FULLY;
+	}
 
 	/*
 	 * Looks like we can send an incremental file, so sort the absolute the
@@ -872,6 +878,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 		  compare_block_numbers);
 	for (i = 0; i < nblocks; ++i)
 		relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+	pfree(absolute_block_numbers);
 	*num_blocks_required = nblocks;
 
 	/*
-- 
2.43.0



Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-05 Thread shveta malik
On Tue, Mar 5, 2024 at 6:52 PM Bertrand Drouvot
 wrote:
>
> > /*
> > * Nothing to do for physical slots as we collect stats only for logical
> > * slots.
> > */
> > if (SlotIsPhysical(slot))
> > return;
>
> D'oh! Thanks! Fixed in v2 shared up-thread.

Thanks.  Can we try to get rid of the multiple LWLockRelease() calls in
pgstat_reset_replslot()?  Is this any better?

/*
-* Nothing to do for physical slots as we collect stats only for logical
-* slots.
+* Reset stats if it is a logical slot. Nothing to do for physical slots
+* as we collect stats only for logical slots.
 */
-   if (SlotIsPhysical(slot))
-   {
-   LWLockRelease(ReplicationSlotControlLock);
-   return;
-   }
-
-   /* reset this one entry */
-   pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
-ReplicationSlotIndex(slot));
+   if (SlotIsLogical(slot))
+   pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
+ReplicationSlotIndex(slot));

LWLockRelease(ReplicationSlotControlLock);


Something similar in pgstat_fetch_replslot() perhaps?

thanks
Shveta




Re: Synchronizing slots from primary to standby

2024-03-05 Thread Masahiko Sawada
On Wed, Mar 6, 2024 at 12:47 PM Amit Kapila  wrote:
>
> On Wed, Mar 6, 2024 at 7:36 AM Masahiko Sawada  wrote:
> >
> > On Tue, Mar 5, 2024 at 4:21 PM Zhijie Hou (Fujitsu)
> >  wrote:
> >
> > I have one question about PhysicalWakeupLogicalWalSnd():
> >
> > +/*
> > + * Wake up the logical walsender processes with logical failover slots if 
> > the
> > + * currently acquired physical slot is specified in standby_slot_names GUC.
> > + */
> > +void
> > +PhysicalWakeupLogicalWalSnd(void)
> > +{
> > +List  *standby_slots;
> > +
> > +Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot));
> > +
> > +standby_slots = GetStandbySlotList();
> > +
> > +foreach_ptr(char, name, standby_slots)
> > +{
> > +if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
> > +{
> > +ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
> > +return;
> > +}
> > +}
> > +}
> >
> > IIUC walsender calls this function every time after updating the
> > slot's restart_lsn, which could be very frequently. I'm concerned that
> > it could be expensive to do a linear search on the standby_slot_names
> > list every time. Is it possible to cache the information in walsender
> > local somehow?
> >
>
> We can cache this information for WalSender but not for the case where
> users use pg_physical_replication_slot_advance(). We don't expect this
> list to be long enough to matter, so we can leave this optimization
> for the future especially if we encounter any such case unless you
> think otherwise.

Okay, agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Switching XLog source from archive to streaming when primary available

2024-03-05 Thread Bharath Rupireddy
On Wed, Mar 6, 2024 at 1:22 AM Nathan Bossart  wrote:
>
> I was thinking of something more like
>
> typedef enum
> {
> NO_FORCE_SWITCH_TO_STREAMING,   /* no switch necessary */
> FORCE_SWITCH_TO_STREAMING_PENDING,  /* exhausting pg_wal */
> FORCE_SWITCH_TO_STREAMING,  /* switch to streaming now */
> } WALSourceSwitchState;
>
> At least, that illustrates my mental model of the process here.  IMHO
> that's easier to follow than two similarly-named bool variables.

I played with that idea and it came out very nice. Please see the
attached v22 patch. Note that personally I didn't like "FORCE" being
there in the names, so I've simplified them a bit.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From f2c9d1a170ce4cc536ce9139d7a2e0fdc268be69 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 6 Mar 2024 04:12:26 +
Subject: [PATCH v22] Allow standby to switch WAL source from archive to
 streaming

A standby typically switches to streaming replication (getting WAL
from the primary) only when receiving WAL from the archive finishes
(no more WAL left there) or fails for any reason. Reading WAL from
the archive may not always be as efficient and fast as reading from
the primary. This can be due to differences in disk types, IO
costs, network latencies etc. All of these can impact recovery
performance on the standby and increase replication lag on the
primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from the archive,
because the standby's replication slot stays inactive. To avoid
these problems, one can use this parameter to make the standby
switch to streaming mode sooner.

This feature adds a new GUC that specifies the amount of time after
which the standby attempts to switch the WAL source from the WAL
archive to streaming replication (getting WAL from the primary).
However, the standby exhausts all the WAL present in pg_wal before
switching. If the standby fails to switch to streaming mode, it
falls back to archive mode.

Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  49 
 doc/src/sgml/high-availability.sgml   |  15 ++-
 src/backend/access/transam/xlogrecovery.c | 117 --
 src/backend/utils/misc/guc_tables.c   |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   3 +
 src/include/access/xlogrecovery.h |   1 +
 src/test/recovery/meson.build |   1 +
 src/test/recovery/t/042_wal_source_switch.pl  | 113 +
 8 files changed, 296 insertions(+), 15 deletions(-)
 create mode 100644 src/test/recovery/t/042_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b38cbd714a..02e79f32fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5011,6 +5011,55 @@ ANY num_sync ( 
+  streaming_replication_retry_interval (integer)
+  
+   streaming_replication_retry_interval configuration parameter
+  
+  
+  
+   
+Specifies amount of time after which standby attempts to switch WAL
+source from archive to streaming replication (i.e., getting WAL from
+primary). However, the standby exhausts all the WAL present in
+pg_wal directory before switching. If the standby
+fails to switch to stream mode, it falls back to archive mode. If this
+parameter value is specified without units, it is taken as seconds.
+With a lower value for this parameter, the standby makes frequent WAL
+source switch attempts. To avoid this, it is recommended to set a
+value depending on the rate of WAL generation on the primary. If the
+WAL files grow faster, the disk runs out of space sooner, so setting a
+value to make frequent WAL source switch attempts can help. The default
+is zero, disabling this feature. When disabled, the standby typically
+switches to stream mode only after receiving WAL from archive finishes
+(i.e., no more WAL left there) or fails for any reason. This parameter
+can only be set in the postgresql.conf file or on
+the server command line.
+   
+   
+
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ streaming_replication_retry_interval intervals. For
+ example, if the parameter is set to 1min and
+ fetching WAL file from archive takes about 2min,
+ then the source switch attempt happens for the next WAL file after
+ the current WAL file fetched from archive is fully applied.
+  

Re: remaining sql/json patches

2024-03-05 Thread Amit Langote
Hi Tomas,

On Wed, Mar 6, 2024 at 6:30 AM Tomas Vondra
 wrote:
>
> Hi,
>
> I know very little about sql/json and all the json internals, but I
> decided to do some black box testing. I built a large JSONB table
> (single column, ~7GB of data after loading). And then I did a query
> transforming the data into tabular form using JSON_TABLE.
>
> The JSON_TABLE query looks like this:
>
> SELECT jt.* FROM
>   title_jsonb t,
>   json_table(t.info, '$'
> COLUMNS (
>   "id" text path '$."id"',
>   "type" text path '$."type"',
>   "title" text path '$."title"',
>   "original_title" text path '$."original_title"',
>   "is_adult" text path '$."is_adult"',
>   "start_year" text path '$."start_year"',
>   "end_year" text path '$."end_year"',
>   "minutes" text path '$."minutes"',
>   "genres" text path '$."genres"',
>   "aliases" text path '$."aliases"',
>   "directors" text path '$."directors"',
>   "writers" text path '$."writers"',
>   "ratings" text path '$."ratings"',
>   NESTED PATH '$."aliases"[*]'
> COLUMNS (
>   "alias_title" text path '$."title"',
>   "alias_region" text path '$."region"'
> ),
>   NESTED PATH '$."directors"[*]'
> COLUMNS (
>   "director_name" text path '$."name"',
>   "director_birth_year" text path '$."birth_year"',
>   "director_death_year" text path '$."death_year"'
> ),
>   NESTED PATH '$."writers"[*]'
> COLUMNS (
>   "writer_name" text path '$."name"',
>   "writer_birth_year" text path '$."birth_year"',
>   "writer_death_year" text path '$."death_year"'
> ),
>   NESTED PATH '$."ratings"[*]'
> COLUMNS (
>   "rating_average" text path '$."average"',
>   "rating_votes" text path '$."votes"'
> )
> )
>   ) as jt;
>
> again, not particularly complex. But if I run this, it consumes multiple
> gigabytes of memory, before it gets killed by OOM killer. This happens
> even when ran using
>
>   COPY (...) TO '/dev/null'
>
> so there's nothing sent to the client. I did catch memory context info,
> where it looks like this (complete stats attached):
>
> --
> TopMemoryContext: 97696 total in 5 blocks; 13056 free (11 chunks);
>   84640 used
>   ...
>   TopPortalContext: 8192 total in 1 blocks; 7680 free (0 chunks); ...
> PortalContext: 1024 total in 1 blocks; 560 free (0 chunks); ...
>   ExecutorState: 2541764672 total in 314 blocks; 6528176 free
>  (1208 chunks); 2535236496 used
> printtup: 8192 total in 1 blocks; 7952 free (0 chunks); ...
> ...
> ...
> Grand total: 2544132336 bytes in 528 blocks; 7484504 free
>  (1340 chunks); 2536647832 used
> --
>
> I'd say 2.5GB in ExecutorState seems a bit excessive ... Seems there's
> some memory management issue? My guess is we're not releasing memory
> allocated while parsing the JSON or building JSON output.
>
> I'm not attaching the data, but I can provide that if needed - it's
> about 600MB compressed. The structure is not particularly complex, it's
> movie info from [1] combined into a JSON document (one per movie).

Thanks for the report.

Yeah, I'd like to see the data to try to drill down into what's piling
up in ExecutorState.  I want to be sure that the 1st patch (query
functions) is not implicated in this, because I'd like to get that one
out of the way sooner rather than later.

-- 
Thanks, Amit Langote




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread John Naylor
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada  wrote:
>
> On Tue, Mar 5, 2024 at 6:41 PM John Naylor  wrote:

> > Done in v66-0009. I'd be curious to hear any feedback. I like the
> > aspect that the random numbers come from a different seed every time
> > the test runs.
>
> The new tests look good. Here are some comments:
>
> ---
> +   expected = keys[i];
> +   iterval = rt_iterate_next(iter, &iterkey);
>
> -   ndeleted++;
> +   EXPECT_TRUE(iterval != NULL);
> +   EXPECT_EQ_U64(iterkey, expected);
> +   EXPECT_EQ_U64(*iterval, expected);
>
> Can we verify that the iteration returns keys in ascending order?

We get the "expected" value from the keys we saved in the now-sorted
array, so we do already. Unless I misunderstand you.

> ---
> + /* reset random number generator for deletion */
> + pg_prng_seed(, seed);
>
> Why is resetting the seed required here?

Good catch - My intention was to delete in the same random order we
inserted with. We still have the keys in the array, but they're sorted
by now. I forgot to go the extra step and use the prng when generating
the keys for deletion -- will fix.

> ---
> The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end.

Will fix.

> ---
> radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
>   "test_radix_tree",
>   ALLOCSET_DEFAULT_SIZES);
>
> We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
> think it's better to use either one for consistency.

Will change to "small", since 32-bit platforms will use slab for leaves.

I'll look at the memory usage and estimate what 32-bit platforms will
use, and maybe adjust the number of keys. A few megabytes is fine, but
not many megabytes.

> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> Sounds a reasonable plan. 0001 and 0002 look good to me. I'm going to
> polish tidstore and vacuum patches and update commit messages.

Sounds good.




Re: Synchronizing slots from primary to standby

2024-03-05 Thread Amit Kapila
On Wed, Mar 6, 2024 at 7:36 AM Masahiko Sawada  wrote:
>
> On Tue, Mar 5, 2024 at 4:21 PM Zhijie Hou (Fujitsu)
>  wrote:
>
> I have one question about PhysicalWakeupLogicalWalSnd():
>
> +/*
> + * Wake up the logical walsender processes with logical failover slots if the
> + * currently acquired physical slot is specified in standby_slot_names GUC.
> + */
> +void
> +PhysicalWakeupLogicalWalSnd(void)
> +{
> +List  *standby_slots;
> +
> +Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot));
> +
> +standby_slots = GetStandbySlotList();
> +
> +foreach_ptr(char, name, standby_slots)
> +{
> +if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
> +{
> +ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
> +return;
> +}
> +}
> +}
>
> IIUC walsender calls this function every time after updating the
> slot's restart_lsn, which could be very frequently. I'm concerned that
> it could be expensive to do a linear search on the standby_slot_names
> list every time. Is it possible to cache the information in walsender
> local somehow?
>

We can cache this information for WalSender but not for the case where
users use pg_physical_replication_slot_advance(). We don't expect this
list to be long enough to matter, so we can leave this optimization
for the future especially if we encounter any such case unless you
think otherwise.

-- 
With Regards,
Amit Kapila.




Re: "type with xxxx does not exist" when doing ExecMemoize()

2024-03-05 Thread Andrei Lepikhov

On 6/3/2024 10:10, Tender Wang wrote:



Andrei Lepikhov wrote on Tue, Mar 5, 2024 at 17:36:

On 1/3/2024 14:18, Tender Wang wrote:
 > During debug, I learned that numeric_add doesn't have type check like
 > rangetype, so aboved query will not report "type with xxx does not exist".
 >
 > And I realize that the test case added by Andrei Lepikhov in v3 is
 > right. So in v6 patch I add Andrei Lepikhov's test case.  Thanks a lot.
 >
 > Now I think the v6 version patch seems to be complete now.
I've passed through the patch, and it looks okay. Although I am afraid
of the same problems that future changes can cause and how to detect
them, it works correctly.


Thanks for reviewing it, and I add it to commitfest 2024-07.

I think it is a bug. Should it be fixed (and back-patched) earlier?

--
regards,
Andrei Lepikhov
Postgres Professional





Re: "type with xxxx does not exist" when doing ExecMemoize()

2024-03-05 Thread Tender Wang
Andrei Lepikhov wrote on Tue, Mar 5, 2024 at 17:36:

> On 1/3/2024 14:18, Tender Wang wrote:
> > During debug, I learned that numeric_add doesn't have type check like
> > rangetype, so aboved query will not report "type with xxx does not
> exist".
> >
> > And I realize that  the test case added by Andrei Lepikhov  in v3 is
> > right. So in v6 patch I add Andrei Lepikhov's test case.  Thanks a lot.
> >
> > Now I think the v6 version patch seems to be complete now.
> I've passed through the patch, and it looks okay. Although I am afraid
> of the same problems that future changes can cause and how to detect
> them, it works correctly.
>

Thanks for reviewing it, and I add it to commitfest 2024-07.


-- 
Tender Wang
OpenPie:  https://en.openpie.com/


RE: Synchronizing slots from primary to standby

2024-03-05 Thread Zhijie Hou (Fujitsu)
On Wednesday, March 6, 2024 9:30 AM Masahiko Sawada  
wrote:

Hi,

> On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Friday, March 1, 2024 2:11 PM Masahiko Sawada
>  wrote:
> > >
> > >
> > > ---
> > > +void
> > > +assign_standby_slot_names(const char *newval, void *extra) {
> > > +List  *standby_slots;
> > > +MemoryContext oldcxt;
> > > +char  *standby_slot_names_cpy = extra;
> > > +
> > >
> > > Given that the newval and extra have the same data
> > > (standby_slot_names value), why do we not use newval instead? I
> > > think that if we use newval, we don't need to guc_strdup() in
> > > check_standby_slot_names(), we might need to do list_copy_deep()
> > > instead, though. It's not clear to me as there is no comment.
> >
> > I think SplitIdentifierString will modify the passed in string, so
> > we'd better not pass the newval to it, otherwise the stored guc
> > string(standby_slot_names) will be changed. I can see we are doing
> > similar thing in other GUC check/assign function as well.
> > (check_wal_consistency_checking/ assign_wal_consistency_checking,
> > check_createrole_self_grant/ assign_createrole_self_grant ...).
> 
> Why does it have to be a List in the first place? 

I thought the List type was convenient to use here, as we have an existing
list-building function (SplitIdentifierString) and a convenient list macro for
looping over the list (foreach_ptr), which saves some code.

> In earlier version patches, we
> used to copy the list and delete the element until it became empty, while
> waiting for physical wal senders. But we now just refer to each slot name in 
> the
> list. The current code assumes that standby_slot_names_cpy is allocated in
> GUCMemoryContext but once it changes, it will silently get broken. I think we
> can check and assign standby_slot_names in a similar way to
> check/assign_temp_tablespaces and
> check/assign_synchronous_standby_names.

Yes, we could follow that by allocating an array and copying each slot name
into it, but it also requires some code to build and scan the array. So, is it
possible to expose GUCMemoryContext or have an API like guc_copy_list instead?
If we don't want to touch the GUC API, I am OK with using an array as well.
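
For reference, the "array in extra" pattern mentioned above (the one
check_temp_tablespaces uses) amounts to roughly the following; the struct
and names here are illustrative only, not a concrete proposal:

    typedef struct StandbySlotNamesExtra
    {
        int     nslots;
        char    names[FLEXIBLE_ARRAY_MEMBER];   /* packed NUL-terminated names */
    } StandbySlotNamesExtra;

    bool
    check_standby_slot_names(char **newval, void **extra, GucSource source)
    {
        char       *rawname = pstrdup(*newval);
        List       *elemlist;
        ListCell   *lc;
        Size        size = offsetof(StandbySlotNamesExtra, names);
        StandbySlotNamesExtra *myextra;
        char       *ptr;

        if (!SplitIdentifierString(rawname, ',', &elemlist))
        {
            GUC_check_errdetail("List syntax is invalid.");
            pfree(rawname);
            list_free(elemlist);
            return false;
        }

        /* (real code would also validate each slot name here) */
        foreach(lc, elemlist)
            size += strlen((char *) lfirst(lc)) + 1;

        myextra = (StandbySlotNamesExtra *) guc_malloc(LOG, size);
        if (myextra == NULL)
        {
            pfree(rawname);
            list_free(elemlist);
            return false;
        }

        myextra->nslots = list_length(elemlist);
        ptr = myextra->names;
        foreach(lc, elemlist)
        {
            strcpy(ptr, (char *) lfirst(lc));
            ptr += strlen(ptr) + 1;
        }

        pfree(rawname);
        list_free(elemlist);
        *extra = myextra;
        return true;
    }

The assign hook then only has to remember the pointer handed over via
extra, so nothing outside guc.c depends on GUCMemoryContext.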

Best Regards,
Hou zj


Re: CREATE DATABASE with filesystem cloning

2024-03-05 Thread Thomas Munro
On Wed, Mar 6, 2024 at 3:16 PM Thomas Munro  wrote:
> Here's a rebase.

Now with a wait event and a paragraph of documentation.
From 9d5a60e9a9cc4a4312de3081be99c254a8876e42 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 2 Sep 2023 22:21:49 +1200
Subject: [PATCH v4] CREATE DATABASE ... STRATEGY=FILE_CLONE.

Similar to STRATEGY=FILE_COPY, but attempting to use efficient file
copying system calls.  The kernel has the opportunity to share block
ranges in copy-on-write file systems, or maybe push down the copy to
network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

XXX need redo -- what to do if unsupported during redo, fall back to plain copy?

Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 doc/src/sgml/ref/create_database.sgml |  6 ++
 src/backend/commands/dbcommands.c | 19 +++--
 src/backend/storage/file/copydir.c| 82 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/include/storage/copydir.h |  3 +-
 5 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 72927960ebb..6ed82ee98dd 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -138,6 +138,12 @@ CREATE DATABASE name
 it also forces the system to perform a checkpoint both before and
 after the creation of the new database. In some situations, this may
 have a noticeable negative impact on overall system performance.
+On some platforms and file systems, the FILE_CLONE
+strategy is available.  This works the same way as
+FILE_COPY, except that it uses fast file cloning
+or copying system calls that might push down the work of copying to the
+storage, or use copy-on-write techniques.  The effect on disk space
+usage and execution time is file system-dependent.

   
  
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index b256d6d0f7d..8ccfd18b4c9 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -78,11 +78,14 @@
  * CREATEDB_FILE_COPY will simply perform a file system level copy of the
  * database and log a single record for each tablespace copied. To make this
  * safe, it also triggers checkpoints before and after the operation.
+ *
+ * CREATEDB_FILE_CLONE is the same, but uses faster file cloning system calls.
  */
 typedef enum CreateDBStrategy
 {
 	CREATEDB_WAL_LOG,
 	CREATEDB_FILE_COPY,
+	CREATEDB_FILE_CLONE,
 } CreateDBStrategy;
 
 typedef struct
@@ -136,7 +139,8 @@ static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
 static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 	bool isRedo);
 static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid,
-		Oid src_tsid, Oid dst_tsid);
+		Oid src_tsid, Oid dst_tsid,
+		bool clone_files);
 static void recovery_create_dbdir(char *path, bool only_tblspc);
 
 /*
@@ -548,7 +552,7 @@ CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
  */
 static void
 CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
-			Oid dst_tsid)
+			Oid dst_tsid, bool clone_files)
 {
 	TableScanDesc scan;
 	Relation	rel;
@@ -608,7 +612,7 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
 		 *
 		 * We don't need to copy subdirectories
 		 */
-		copydir(srcpath, dstpath, false);
+		copydir(srcpath, dstpath, false, clone_files);
 
 		/* Record the filesystem change in XLOG */
 		{
@@ -1010,6 +1014,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 			dbstrategy = CREATEDB_WAL_LOG;
 		else if (strcmp(strategy, "file_copy") == 0)
 			dbstrategy = CREATEDB_FILE_COPY;
+		else if (strcmp(strategy, "file_clone") == 0)
+			dbstrategy = CREATEDB_FILE_CLONE;
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1460,7 +1466,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	  dst_deftablespace);
 		else
 			CreateDatabaseUsingFileCopy(src_dboid, dboid, src_deftablespace,
-		dst_deftablespace);
+		dst_deftablespace,
+		dbstrategy == CREATEDB_FILE_CLONE);
 
 		/*
 		 * Close pg_database, but keep lock till commit.
@@ -2096,7 +2103,7 @@ movedb(const char *dbname, const char *tblspcname)
 		/*
 		 * Copy files from the old tablespace to the new one
 		 */
-		copydir(src_dbpath, dst_dbpath, false);
+		copydir(src_dbpath, dst_dbpath, false, false);
 
 		/*
 		 * Record the filesystem change in XLOG
@@ -3255,7 +3262,7 @@ dbase_redo(XLogReaderState *record)
 		 *
 		 * We don't need to copy subdirectories
 		 */
-		copydir(src_path, dst_path, false);
+		copydir(src_path, dst_path, false, false);
 
 		

Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-05 Thread Andrei Lepikhov

On 6/3/2024 06:25, Michael Paquier wrote:

Just to elaborate: the intention was to allow a section to be added to
every node in the plan containing information from further down and also
allow this information to propagate upwards. We happen to have buffer
information right now, but allowing something similar to be added
dynamically by extending ExplainNode and passing down a callback to
standard_ExplainOneQuery.


Or an extra hook at the end of ExplainNode() to be able to append more
information at node level?  Not sure if others would agree with that,
though.


We already discussed EXPLAIN hooks, at least in [1]. IMO, extensions 
should have a chance to add something to the node explain and the 
summary, if only because they can significantly influence the planner 
and executor's behaviour.


[1] 
https://www.postgresql.org/message-id/flat/6cd5caa7-06e1-4460-bf35-00a59da3f677%40garret.ru


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Support "Right Semi Join" plan shapes

2024-03-05 Thread wenhui qiu
Hi Alena Rybakina
For merge join
+ /*
+ * For now we do not support RIGHT_SEMI join in mergejoin.
+ */
+ if (jointype == JOIN_RIGHT_SEMI)
+ {
+ *mergejoin_allowed = false;
+ return NIL;
+ }
+
Thanks

On Wed, 6 Mar 2024 at 04:10, Alena Rybakina 
wrote:

> To be honest, I didn't see it in the code, could you tell me where they
> are, please?
> On 05.03.2024 05:44, Richard Guo wrote:
>
>
> On Tue, Jan 30, 2024 at 2:51 PM Alena Rybakina 
> wrote:
>
>> I have reviewed your patch and I think it is better to add an Assert for
>> JOIN_RIGHT_SEMI to the ExecMergeJoin and ExecNestLoop functions to
>> prevent the use of RIGHT_SEMI for these types of connections (NestedLoop
>> and MergeJoin).
>
>
> Hmm, I don't see why this is necessary.  The planner should already
> guarantee that we won't have nestloops/mergejoins with right-semi joins.
>
> Thanks
> Richard
>
> --
> Regards,
> Alena Rybakina
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: CREATE DATABASE with filesystem cloning

2024-03-05 Thread Thomas Munro
On Wed, Oct 11, 2023 at 7:40 PM Peter Eisentraut  wrote:
> On 07.10.23 07:51, Thomas Munro wrote:
> > Here is an experimental POC of fast/cheap database cloning.
>
> Here are some previous discussions of this:
>
> https://www.postgresql.org/message-id/flat/20131001223108.GG23410%40saarenmaa.fi
>
> https://www.postgresql.org/message-id/flat/511B5D11.4040507%40socialserve.com
>
> https://www.postgresql.org/message-id/flat/bc9ca382-b98d-0446-f699-8c5de2307ca7%402ndquadrant.com
>
> (I don't see any clear conclusions in any of these threads, but it might
> be good to check them in any case.)

Thanks.  Wow, quite a lot of people have written an experimental patch
like this.  I would say the things that changed since those ones are:

* copy_file_range() became the preferred way to do this on Linux AFAIK
(instead of various raw ioctls)
* FreeBSD adopted Linux's copy_file_range()
* Open ZFS 2.2 implemented range-based cloning
* XFS enabled reflink support by default
* Apple invented ApFS with cloning
* Several OSes adopted XFS, BTRFS, ZFS, ApFS by default
* copy_file_range() went in the direction of not revealing how the
copying is done (no flags to force behaviour)
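
For anyone unfamiliar with it, the copy_file_range() path boils down to
something like the following on the OS side (a minimal sketch, not the
patch's copydir.c code; real callers need a fallback to a read()/write()
loop, e.g. for EXDEV or EOPNOTSUPP on older kernels/file systems):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/types.h>

    /* Copy "size" bytes, letting the kernel clone/reflink blocks if it can. */
    static int
    clone_file_contents(int src_fd, int dst_fd, off_t size)
    {
        off_t   remaining = size;

        while (remaining > 0)
        {
            ssize_t n = copy_file_range(src_fd, NULL, dst_fd, NULL,
                                        remaining, 0);

            if (n <= 0)
                return -1;      /* caller falls back to a plain copy */
            remaining -= n;
        }
        return 0;
    }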

Here's a rebase.

The main thing that is missing is support for redo.  It's mostly
trivial I think, probably just a record type for "try cloning first"
and then teaching that clone function to fall back to the regular copy
path if it fails in recovery, do you agree with that idea?  Another
approach would be to let it fail if it doesn't work on the replica, so
you don't finish up using dramatically different amounts of disk
space, but that seems terrible because now your replica is broken.  So
probably fallback with logged warning (?), though I'm not sure exactly
which errnos to give that treatment to.

One thing to highlight about COW file system semantics: PostgreSQL
behaves differently when space runs out.  When writing relation data,
eg ZFS sometimes fails like bullet point 2 in this ENOSPC article[1],
while XFS usually fails like bullet point 1.  A database on XFS that
has been cloned in this way might presumably start to fail like bullet
point 2, eg when checkpointing dirty pages, instead of its usual
extension-time-only ENOSPC-rolls-back-your-transaction behaviour.

[1] https://wiki.postgresql.org/wiki/ENOSPC
From 49b7077e24da9d8140de3f303d3f90cb3c0dd0f8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 2 Sep 2023 22:21:49 +1200
Subject: [PATCH v3] CREATE DATABASE ... STRATEGY=FILE_CLONE.

Similar to STRATEGY=FILE_COPY, but attempting to use efficient file
copying system calls.  The kernel has the opportunity to share block
ranges in copy-on-write file systems, or maybe push down the copy to
network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

XXX need redo -- what to do if unsupported during redo, fall back to plain copy?

Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/backend/commands/dbcommands.c  | 19 ---
 src/backend/storage/file/copydir.c | 80 --
 src/include/storage/copydir.h  |  3 +-
 3 files changed, 92 insertions(+), 10 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index b256d6d0f7d..8ccfd18b4c9 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -78,11 +78,14 @@
  * CREATEDB_FILE_COPY will simply perform a file system level copy of the
  * database and log a single record for each tablespace copied. To make this
  * safe, it also triggers checkpoints before and after the operation.
+ *
+ * CREATEDB_FILE_CLONE is the same, but uses faster file cloning system calls.
  */
 typedef enum CreateDBStrategy
 {
 	CREATEDB_WAL_LOG,
 	CREATEDB_FILE_COPY,
+	CREATEDB_FILE_CLONE,
 } CreateDBStrategy;
 
 typedef struct
@@ -136,7 +139,8 @@ static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
 static void CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid,
 	bool isRedo);
 static void CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid,
-		Oid src_tsid, Oid dst_tsid);
+		Oid src_tsid, Oid dst_tsid,
+		bool clone_files);
 static void recovery_create_dbdir(char *path, bool only_tblspc);
 
 /*
@@ -548,7 +552,7 @@ CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
  */
 static void
 CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
-			Oid dst_tsid)
+			Oid dst_tsid, bool clone_files)
 {
 	TableScanDesc scan;
 	Relation	rel;
@@ -608,7 +612,7 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
 		 *
 		 * We don't need to copy subdirectories
 		 */
-		copydir(srcpath, dstpath, false);
+		copydir(srcpath, dstpath, false, clone_files);
 
 		/* Record the filesystem change in XLOG */
 		{
@@ -1010,6 +1014,8 @@ createdb(ParseState *pstate, const CreatedbStmt 

Re: Synchronizing slots from primary to standby

2024-03-05 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 4:21 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, March 5, 2024 2:35 PM shveta malik  wrote:
> >
> > On Tue, Mar 5, 2024 at 9:15 AM Amit Kapila  wrote:
> > >
> > > On Tue, Mar 5, 2024 at 6:10 AM Peter Smith 
> > wrote:
> > > >
> > > > ==
> > > > src/backend/replication/walsender.c
> > > >
> > > > 5. NeedToWaitForWal
> > > >
> > > > + /*
> > > > + * Check if the standby slots have caught up to the flushed position. It
> > > > + * is good to wait up to the flushed position and then let the WalSender
> > > > + * send the changes to logical subscribers one by one which are already
> > > > + * covered by the flushed position without needing to wait on every change
> > > > + * for standby confirmation.
> > > > + */
> > > > + if (NeedToWaitForStandbys(flushed_lsn, wait_event)) return true;
> > > > +
> > > > + *wait_event = 0;
> > > > + return false;
> > > > +}
> > > > +
> > > >
> > > > 5a.
> > > > The comment (or part of it?) seems misplaced because it is talking
> > > > WalSender sending changes, but that is not happening in this function.
> > > >
> > >
> > > I don't think so. This is invoked only by walsender and a static
> > > function. I don't see any other better place to mention this.
> > >
> > > > Also, isn't what this is saying already described by the other
> > > > comment in the caller? e.g.:
> > > >
> > >
> > > Oh no, here we are explaining the wait order.
> >
> > I think there is a scope of improvement here. The comment inside
> > NeedToWaitForWal() which states that we need to wait here for standbys on
> > flush-position(and not on each change) should be outside of this function. 
> > It is
> > too embedded. And the comment which states the order of wait (first flush 
> > and
> > then standbys confirmation) should be outside the for-loop in
> > WalSndWaitForWal(), but yes we do need both the comments. Attached a
> > patch (.txt) for comments improvement, please merge if appropriate.
>
> Thanks, I have slightly modified the top-up patch and merged it.
>
> Attach the V106 patch which addressed above and Peter's comments[1].
>

I have one question about PhysicalWakeupLogicalWalSnd():

+/*
+ * Wake up the logical walsender processes with logical failover slots if the
+ * currently acquired physical slot is specified in standby_slot_names GUC.
+ */
+void
+PhysicalWakeupLogicalWalSnd(void)
+{
+List  *standby_slots;
+
+Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot));
+
+standby_slots = GetStandbySlotList();
+
+foreach_ptr(char, name, standby_slots)
+{
+if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
+{
+ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
+return;
+}
+}
+}

IIUC walsender calls this function every time after updating the
slot's restart_lsn, which could be very frequently. I'm concerned that
it could be expensive to do a linear search on the standby_slot_names
list every time. Is it possible to cache the information in walsender
local somehow? For example, the walsender sets a flag in WalSnd after
processing the config file if its slot name is present in
standby_slot_names. That way, they can wake up logical walsenders if
eligible after updating the slot's restart_lsn, without checking the
standby_slot_names value.
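
A sketch of that caching idea (names are illustrative only; the flag would
have to be recomputed whenever standby_slot_names changes, and
GetStandbySlotList() here is the function from the patch):

    /* cached per-walsender: is my physical slot listed in standby_slot_names? */
    static bool slot_in_standby_slot_names = false;

    static void
    RefreshStandbySlotListed(void)
    {
        slot_in_standby_slot_names = false;

        if (MyReplicationSlot == NULL || !SlotIsPhysical(MyReplicationSlot))
            return;

        foreach_ptr(char, name, GetStandbySlotList())
        {
            if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
            {
                slot_in_standby_slot_names = true;
                break;
            }
        }
    }

PhysicalWakeupLogicalWalSnd() could then test the flag instead of walking
the list on every restart_lsn update.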

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: speed up a logical replica setup

2024-03-05 Thread Euler Taveira
On Tue, Mar 5, 2024, at 12:48 AM, Shubham Khanna wrote:
> I applied the v25 patch and did RUN the 'pg_createsubscriber' command.
> I was unable to execute it and experienced the following error:
> 
> $ ./pg_createsubscriber -D node2/ -P "host=localhost port=5432
> dbname=postgres"  -d postgres -d db1 -p 9000 -r
> ./pg_createsubscriber: invalid option -- 'p'
> pg_createsubscriber: hint: Try "pg_createsubscriber --help" for more
> information.

Oops. Good catch! I will post an updated patch soon.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Synchronizing slots from primary to standby

2024-03-05 Thread Masahiko Sawada
On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, March 1, 2024 2:11 PM Masahiko Sawada  
> wrote:
> >
> >
> > ---
> > +void
> > +assign_standby_slot_names(const char *newval, void *extra) {
> > +List  *standby_slots;
> > +MemoryContext oldcxt;
> > +char  *standby_slot_names_cpy = extra;
> > +
> >
> > Given that the newval and extra have the same data (standby_slot_names
> > value), why do we not use newval instead? I think that if we use
> > newval, we don't need to guc_strdup() in check_standby_slot_names(),
> > we might need to do list_copy_deep() instead, though. It's not clear
> > to me as there is no comment.
>
> I think SplitIdentifierString will modify the passed in string, so we'd better
> not pass the newval to it, otherwise the stored guc string(standby_slot_names)
> will be changed. I can see we are doing similar thing in other GUC 
> check/assign
> function as well. (check_wal_consistency_checking/
> assign_wal_consistency_checking, check_createrole_self_grant/
> assign_createrole_self_grant ...).

Why does it have to be a List in the first place? In earlier version
patches, we used to copy the list and delete the element until it
became empty, while waiting for physical wal senders. But we now just
refer to each slot name in the list. The current code assumes that
standby_slot_names_cpy is allocated in GUCMemoryContext but once it
changes, it will silently get broken. I think we can check and assign
standby_slot_names in a similar way to check/assign_temp_tablespaces
and check/assign_synchronous_standby_names.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Exponential backoff for auth_delay

2024-03-05 Thread Jacob Champion
On Tue, Mar 5, 2024 at 1:51 PM Nathan Bossart  wrote:
> I don't have a strong opinion about making this configurable, but I do
> think we should consider making this a hash table so that we can set
> MAX_CONN_RECORDS much higher.

I'm curious why? It seems like the higher you make MAX_CONN_RECORDS,
the easier it is to put off the brute-force protection. (My assumption
is that anyone mounting a serious attack is not going to be doing this
from their own computer; they'll be doing it from many devices they
don't own -- a botnet, or a series of proxies, or something.)

--

Drive-by microreview -- auth_delay_cleanup_conn_record() has

> +   port->remote_host[0] = '\0';

which doesn't seem right. I assume acr->remote_host was meant?

--Jacob




Re: Make query cancellation keys longer

2024-03-05 Thread Bruce Momjian
On Fri, Mar  1, 2024 at 03:19:23PM +0100, Peter Eisentraut wrote:
> On 29.02.24 22:25, Heikki Linnakangas wrote:
> > Currently, cancel request key is a 32-bit token, which isn't very much
> > entropy. If you want to cancel another session's query, you can
> > brute-force it. In most environments, an unauthorized cancellation of a
> > query isn't very serious, but it nevertheless would be nice to have more
> > protection from it. The attached patch makes it longer. It is an
> > optional protocol feature, so it's fully backwards-compatible with
> > clients that don't support longer keys.
> 
> My intuition would be to make this a protocol version bump, not an optional
> feature.  I think this is something that everyone should eventually be
> using, not a niche feature that you explicitly want to opt-in for.

Agreed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Improve readability by using designated initializers when possible

2024-03-05 Thread Japin Li

On Wed, 06 Mar 2024 at 01:53, Jelte Fennema-Nio  wrote:
> On Tue, 5 Mar 2024 at 15:30, Japin Li  wrote:
>> There is a warning if remove it, so I keep it.
>>
>> /home/japin/Codes/postgres/build/../src/backend/executor/execExprInterp.c:118:33:
>>  warning: label ‘CASE_EEOP_LAST’ defined but not used [-Wunused-label]
>>   118 | #define EEO_CASE(name)  CASE_##name:
>>   | ^
>> /home/japin/Codes/postgres/build/../src/backend/executor/execExprInterp.c:1845:17:
>>  note: in expansion of macro ‘EEO_CASE’
>>  1845 | EEO_CASE(EEOP_LAST)
>>   | ^~~~
>
> I think if you remove the EEO_CASE(EEOP_LAST) block the warning should
> go away. That block is clearly marked as unreachable, so it doesn't
> really serve a purpose.

Thanks! Fixed as you suggested.  Please see v3 patch.

>From ff4d9ce75cfd35a1865bbfb9cd5664a7806d92be Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Tue, 5 Mar 2024 21:32:46 +0800
Subject: [PATCH v3 1/1] Use C99-designated initializer syntax for
 dispatch_table array

---
 src/backend/executor/execExprInterp.c | 203 --
 src/include/executor/execExpr.h   |   2 +-
 2 files changed, 97 insertions(+), 108 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 7c1f51e2e0..df9fe79058 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -400,110 +400,106 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	TupleTableSlot *outerslot;
 	TupleTableSlot *scanslot;
 
-	/*
-	 * This array has to be in the same order as enum ExprEvalOp.
-	 */
 #if defined(EEO_USE_COMPUTED_GOTO)
 	static const void *const dispatch_table[] = {
-		&_EEOP_DONE,
-		&_EEOP_INNER_FETCHSOME,
-		&_EEOP_OUTER_FETCHSOME,
-		&_EEOP_SCAN_FETCHSOME,
-		&_EEOP_INNER_VAR,
-		&_EEOP_OUTER_VAR,
-		&_EEOP_SCAN_VAR,
-		&_EEOP_INNER_SYSVAR,
-		&_EEOP_OUTER_SYSVAR,
-		&_EEOP_SCAN_SYSVAR,
-		&_EEOP_WHOLEROW,
-		&_EEOP_ASSIGN_INNER_VAR,
-		&_EEOP_ASSIGN_OUTER_VAR,
-		&_EEOP_ASSIGN_SCAN_VAR,
-		&_EEOP_ASSIGN_TMP,
-		&_EEOP_ASSIGN_TMP_MAKE_RO,
-		&_EEOP_CONST,
-		&_EEOP_FUNCEXPR,
-		&_EEOP_FUNCEXPR_STRICT,
-		&_EEOP_FUNCEXPR_FUSAGE,
-		&_EEOP_FUNCEXPR_STRICT_FUSAGE,
-		&_EEOP_BOOL_AND_STEP_FIRST,
-		&_EEOP_BOOL_AND_STEP,
-		&_EEOP_BOOL_AND_STEP_LAST,
-		&_EEOP_BOOL_OR_STEP_FIRST,
-		&_EEOP_BOOL_OR_STEP,
-		&_EEOP_BOOL_OR_STEP_LAST,
-		&_EEOP_BOOL_NOT_STEP,
-		&_EEOP_QUAL,
-		&_EEOP_JUMP,
-		&_EEOP_JUMP_IF_NULL,
-		&_EEOP_JUMP_IF_NOT_NULL,
-		&_EEOP_JUMP_IF_NOT_TRUE,
-		&_EEOP_NULLTEST_ISNULL,
-		&_EEOP_NULLTEST_ISNOTNULL,
-		&_EEOP_NULLTEST_ROWISNULL,
-		&_EEOP_NULLTEST_ROWISNOTNULL,
-		&_EEOP_BOOLTEST_IS_TRUE,
-		&_EEOP_BOOLTEST_IS_NOT_TRUE,
-		&_EEOP_BOOLTEST_IS_FALSE,
-		&_EEOP_BOOLTEST_IS_NOT_FALSE,
-		&_EEOP_PARAM_EXEC,
-		&_EEOP_PARAM_EXTERN,
-		&_EEOP_PARAM_CALLBACK,
-		&_EEOP_CASE_TESTVAL,
-		&_EEOP_MAKE_READONLY,
-		&_EEOP_IOCOERCE,
-		&_EEOP_IOCOERCE_SAFE,
-		&_EEOP_DISTINCT,
-		&_EEOP_NOT_DISTINCT,
-		&_EEOP_NULLIF,
-		&_EEOP_SQLVALUEFUNCTION,
-		&_EEOP_CURRENTOFEXPR,
-		&_EEOP_NEXTVALUEEXPR,
-		&_EEOP_ARRAYEXPR,
-		&_EEOP_ARRAYCOERCE,
-		&_EEOP_ROW,
-		&_EEOP_ROWCOMPARE_STEP,
-		&_EEOP_ROWCOMPARE_FINAL,
-		&_EEOP_MINMAX,
-		&_EEOP_FIELDSELECT,
-		&_EEOP_FIELDSTORE_DEFORM,
-		&_EEOP_FIELDSTORE_FORM,
-		&_EEOP_SBSREF_SUBSCRIPTS,
-		&_EEOP_SBSREF_OLD,
-		&_EEOP_SBSREF_ASSIGN,
-		&_EEOP_SBSREF_FETCH,
-		&_EEOP_DOMAIN_TESTVAL,
-		&_EEOP_DOMAIN_NOTNULL,
-		&_EEOP_DOMAIN_CHECK,
-		&_EEOP_CONVERT_ROWTYPE,
-		&_EEOP_SCALARARRAYOP,
-		&_EEOP_HASHED_SCALARARRAYOP,
-		&_EEOP_XMLEXPR,
-		&_EEOP_JSON_CONSTRUCTOR,
-		&_EEOP_IS_JSON,
-		&_EEOP_AGGREF,
-		&_EEOP_GROUPING_FUNC,
-		&_EEOP_WINDOW_FUNC,
-		&_EEOP_SUBPLAN,
-		&_EEOP_AGG_STRICT_DESERIALIZE,
-		&_EEOP_AGG_DESERIALIZE,
-		&_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
-		&_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
-		&_EEOP_AGG_PLAIN_PERGROUP_NULLCHECK,
-		&_EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYREF,
-		&_EEOP_AGG_PLAIN_TRANS_STRICT_BYREF,
-		&_EEOP_AGG_PLAIN_TRANS_BYREF,
-		&_EEOP_AGG_PRESORTED_DISTINCT_SINGLE,
-		&_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
-		&_EEOP_AGG_ORDERED_TRANS_DATUM,
-		&_EEOP_AGG_ORDERED_TRANS_TUPLE,
-		&_EEOP_LAST
+		[EEOP_DONE] = &_EEOP_DONE,
+		[EEOP_INNER_FETCHSOME] = &_EEOP_INNER_FETCHSOME,
+		[EEOP_OUTER_FETCHSOME] = &_EEOP_OUTER_FETCHSOME,
+		[EEOP_SCAN_FETCHSOME] = &_EEOP_SCAN_FETCHSOME,
+		[EEOP_INNER_VAR] = &_EEOP_INNER_VAR,
+		[EEOP_OUTER_VAR] = &_EEOP_OUTER_VAR,
+		[EEOP_SCAN_VAR] = &_EEOP_SCAN_VAR,
+		[EEOP_INNER_SYSVAR] = &_EEOP_INNER_SYSVAR,
+		[EEOP_OUTER_SYSVAR] = &_EEOP_OUTER_SYSVAR,
+		[EEOP_SCAN_SYSVAR] = &_EEOP_SCAN_SYSVAR,
+		[EEOP_WHOLEROW] = &_EEOP_WHOLEROW,
+		[EEOP_ASSIGN_INNER_VAR] = &_EEOP_ASSIGN_INNER_VAR,
+		[EEOP_ASSIGN_OUTER_VAR] = &_EEOP_ASSIGN_OUTER_VAR,
+		[EEOP_ASSIGN_SCAN_VAR] = &_EEOP_ASSIGN_SCAN_VAR,
+		

Re: Proposal for implementing OCSP Stapling in PostgreSQL

2024-03-05 Thread David Zhang

Hi Hackers,

This is the third version patch for "Certificate status check using OCSP 
Stapling" with ssl regression test cases added.


Here is how I run the ssl regression test:
    ./configure --enable-tap-tests --with-openssl
    make -j
    cd src/test/ssl
    make sslfiles
    make check PG_TEST_EXTRA=ssl

expected results:
    # +++ tap check in src/test/ssl +++
    t/001_ssltests.pl .. ok
    t/002_scram.pl . ok
    t/003_sslinfo.pl ... ok
    All tests successful.
    Files=3, Tests=279, 17 wallclock secs ( 0.05 usr  0.01 sys + 2.32 
cusr  2.16 csys =  4.54 CPU)


    Result: PASS

Notes, before executing the SSL regression tests with the command `make 
check PG_TEST_EXTRA=ssl`, it is necessary to wait for 1 minute after 
running `make sslfiles`. This delay is required because the newly 
generated OCSP responses for the 'expired' test cases need 1 minute to 
pass the nextUpdate period. Once the stapled OCSP response files for the 
tests are committed as test input, there is no need to wait, similar to 
certificate files.


Any comments or feedback would be greatly appreciated!

Thank you,

David

From 99fc46ed0bf05eedbe7539890d946db472617150 Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Tue, 5 Mar 2024 15:31:22 -0800
Subject: [PATCH 1/3] support certificate status check using OCSP stapling

---
 src/backend/libpq/be-secure-openssl.c |  87 
 src/backend/libpq/be-secure.c |   1 +
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/libpq/libpq.h |   1 +
 src/interfaces/libpq/fe-connect.c |  37 
 src/interfaces/libpq/fe-secure-openssl.c  | 198 ++
 src/interfaces/libpq/libpq-int.h  |   1 +
 8 files changed, 336 insertions(+)

diff --git a/src/backend/libpq/be-secure-openssl.c 
b/src/backend/libpq/be-secure-openssl.c
index e12b1cc9e3..c727634dfa 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -50,6 +50,7 @@
 #include 
 #endif
 #include 
+#include 
 
 
 /* default init hook can be overridden by a shared library */
@@ -81,6 +82,8 @@ static bool ssl_is_server_start;
 static int ssl_protocol_version_to_openssl(int v);
 static const char *ssl_protocol_version_to_string(int v);
 
+static int ocsp_stapling_cb(SSL *ssl);
+
 /* for passing data back from verify_cb() */
 static const char *cert_errdetail;
 
@@ -429,6 +432,9 @@ be_tls_open_server(Port *port)
return -1;
}
 
+   /* set up OCSP stapling callback */
+   SSL_CTX_set_tlsext_status_cb(SSL_context, ocsp_stapling_cb);
+
/* set up debugging/info callback */
SSL_CTX_set_info_callback(SSL_context, info_cb);
 
@@ -1653,3 +1659,84 @@ default_openssl_tls_init(SSL_CTX *context, bool 
isServerStart)
SSL_CTX_set_default_passwd_cb(context, 
dummy_ssl_passwd_cb);
}
 }
+
+/*
+ * OCSP stapling callback function for the server side.
+ *
+ * This function is responsible for providing the OCSP stapling response to
+ * the client during the SSL/TLS handshake, based on the client's request.
+ *
+ * Parameters:
+ *   - ssl: SSL/TLS connection object.
+ *
+ * Returns:
+ *   - SSL_TLSEXT_ERR_OK: OCSP stapling response successfully provided.
+ *   - SSL_TLSEXT_ERR_NOACK: OCSP stapling response not provided due to errors.
+ *
+ * Steps:
+ *   1. Check if the server-side OCSP stapling feature is enabled.
+ *   2. Read OCSP response from file if client requested OCSP stapling.
+ *   3. Set the OCSP stapling response in the SSL/TLS connection.
+ */
+static int ocsp_stapling_cb(SSL *ssl)
+{
+   int resp_len = -1;
+   BIO *bio = NULL;
+   OCSP_RESPONSE   *resp = NULL;
+   unsigned char   *rspder = NULL;
+
+   /* return, if ssl_ocsp_file not enabled on server */
+   if (ssl_ocsp_file == NULL)
+   {
+   ereport(WARNING,
+   (errcode(ERRCODE_CONFIG_FILE_ERROR),
+errmsg("could not find ssl_ocsp_file")));
+   return SSL_TLSEXT_ERR_NOACK;
+   }
+
+   /* whether the client requested OCSP stapling */
+   if (SSL_get_tlsext_status_type(ssl) == TLSEXT_STATUSTYPE_ocsp)
+   {
+   bio = BIO_new_file(ssl_ocsp_file, "r");
+   if (bio == NULL)
+   {
+   ereport(WARNING,
+   (errcode(ERRCODE_CONFIG_FILE_ERROR),
+errmsg("could not read 
ssl_ocsp_file")));
+   return SSL_TLSEXT_ERR_NOACK;
+   }
+
+   resp = d2i_OCSP_RESPONSE_bio(bio, NULL);
+   BIO_free(bio);
+   if (resp == NULL)
+   {
+   ereport(WARNING,
+   (errcode(ERRCODE_PROTOCOL_VIOLATION),
+   

Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-05 Thread Michael Paquier
On Tue, Mar 05, 2024 at 08:21:34AM +0100, Mats Kindahl wrote:
> I realize that a more advanced system is possible to create where you can
> customize the output even more, but in this case I just wanted to add a
> section with some additional information related to plan execution. Also,
> the code in explain.c seems to not be written with extensibility in mind,
> so I did not want to make too big a change here before thinking through how
> this would work.

Sure.

> Just to elaborate: the intention was to allow a section to be added to
> every node in the plan containing information from further down and also
> allow this information to propagate upwards. We happen to have buffer
> information right now, but allowing something similar to be added
> dynamically by extending ExplainNode and passing down a callback to
> standard_ExplainOneQuery.

Or an extra hook at the end of ExplainNode() to be able to append more
information at node level?  Not sure if others would agree with that,
though.
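
For illustration, such a hook could simply mirror ExplainNode()'s own
argument list (the hook name here is hypothetical; nothing like it exists in
explain.c today):

typedef void (*explain_per_node_hook_type) (PlanState *planstate,
                                            List *ancestors,
                                            const char *relationship,
                                            const char *plan_name,
                                            ExplainState *es);

extern PGDLLIMPORT explain_per_node_hook_type explain_per_node_hook;

/* ... and at the end of ExplainNode(), before closing the node's group: */
if (explain_per_node_hook)
    (*explain_per_node_hook) (planstate, ancestors,
                              relationship, plan_name, es);

That would let an extension append one extra section per plan node without
copying the whole of standard_ExplainOneQuery().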
--
Michael


signature.asc
Description: PGP signature


Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases

2024-03-05 Thread Dean Rasheed
On Tue, 5 Mar 2024 at 02:22, Nathan Bossart  wrote:
>
> I saw that this thread was referenced elsewhere [0], so I figured I'd take
> a fresh look.  From a quick glance, I'd say 0001 and 0002 are pretty
> reasonable and could probably be committed for v17.
>

I'm not sure how useful these changes are, but I don't really object.
You need to update the synopsis section of the docs though.

Regards,
Dean




Re: pg_upgrade --copy-file-range

2024-03-05 Thread Thomas Munro
On Wed, Mar 6, 2024 at 2:43 AM Peter Eisentraut  wrote:
> As far as I can tell, the original pg_upgrade patch has been ready to
> commit since October.  Unless Thomas has any qualms that have not been
> made explicit in this thread, I suggest we move ahead with that.

pg_upgrade --copy-file-range pushed.  The only change I made was to
remove the EINTR retry condition which was subtly wrong and actually
not needed here AFAICS.  (Erm, maybe I did have an unexpressed qualm
about some bug reports unfolding around that time about corruption
linked to copy_file_range that might have spooked me but those seem to
have been addressed.)

> And then Jakub could rebase his patch set on top of that.  It looks like
> if the formatting issues are fixed, the remaining pg_combinebackup
> support isn't that big.

+1

I'll also go and rebase CREATE DATABASE ... STRATEGY=file_clone[1].

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com




Re: Remove unnecessary code from psql's watch command

2024-03-05 Thread Michael Paquier
On Tue, Mar 05, 2024 at 10:05:52PM +0900, Yugo NAGATA wrote:
> In the current code of do_watch(), sigsetjmp is called if WIN32
> is defined, but siglongjmp is not called in the signal handler
> in this condition. On Windows, currently, cancellation is checked
> only by cancel_pressed, and  calling sigsetjmp in do_watch() is
> unnecessary. Therefore, we can remove code around sigsetjmp in
> do_watch(). I've attached the patch for this fix.

Re-reading the top comment of sigint_interrupt_enabled, it looks like
you're right here.  As long as we check for cancel_pressed there
should be no need for any special cancellation handling here.
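
For what it's worth, a sketch of the simplified wait loop with only
cancel_pressed polling (run_watch_query() and sleep_us are stand-ins for the
real query-execution helper and the configured interval):

for (;;)
{
    if (!run_watch_query())     /* hypothetical helper */
        break;

    /* sleep in short slices so Ctrl-C (cancel_pressed) is noticed promptly */
    for (long remaining_us = sleep_us; remaining_us > 0; remaining_us -= 100000)
    {
        if (cancel_pressed)
            return;
        pg_usleep(Min(remaining_us, 100000));
    }
}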
--
Michael


signature.asc
Description: PGP signature


Re: Refactoring backend fork+exec code

2024-03-05 Thread Tristan Partin

On Mon Mar 4, 2024 at 3:05 AM CST, Heikki Linnakangas wrote:
I've now completed many of the side-quests, here are the patches that 
remain.


The first three patches form a logical unit. They move the 
initialization of the Port struct from postmaster to the backend 
process. Currently, that work is split between the postmaster and the 
backend process so that postmaster fills in the socket and some other 
fields, and the backend process fills the rest after reading the startup 
packet. With these patches, there is a new much smaller ClientSocket 
struct that is passed from the postmaster to the child process, which 
contains just the fields that postmaster initializes. The Port struct is 
allocated in the child process. That makes the backend startup easier to 
understand. I plan to commit those three patches next if there are no 
objections.


That leaves the rest of the patches. I think they're in pretty good 
shape too, and I've gotten some review on those earlier and have 
addressed the comments I got so far, but would still appreciate another 
round of review.



- * *MyProcPort, because ConnCreate() allocated that space with malloc()
- * ... else we'd need to copy the Port data first.  Also, subsidiary 
data
- * such as the username isn't lost either; see ProcessStartupPacket().
+ * *MyProcPort, because that space is allocated in stack ... else we'd
+ * need to copy the Port data first.  Also, subsidiary data such as the
+ * username isn't lost either; see ProcessStartupPacket().


s/allocated in/allocated on the

The first 3 patches seem good to go, in my opinion.


@@ -225,14 +331,13 @@ internal_forkexec(int argc, char *argv[], ClientSocket 
*client_sock, BackgroundW
 return -1;
 }

-/* Make sure caller set up argv properly */
-Assert(argc >= 3);
-Assert(argv[argc] == NULL);
-Assert(strncmp(argv[1], "--fork", 6) == 0);
-Assert(argv[2] == NULL);
-
-/* Insert temp file name after --fork argument */
+/* set up argv properly */
+argv[0] = "postgres";
+snprintf(forkav, MAXPGPATH, "--forkchild=%s", child_kind);
+argv[1] = forkav;
+/* Insert temp file name after --forkchild argument */
 argv[2] = tmpfilename;
+argv[3] = NULL;


Should we use postgres_exec_path instead of the naked "postgres" here?


+/* in postmaster, fork failed ... */
+ereport(LOG,
+(errmsg("could not fork worker process: %m")));
+/* undo what assign_backendlist_entry did */
+ReleasePostmasterChildSlot(rw->rw_child_slot);
+rw->rw_child_slot = 0;
+pfree(rw->rw_backend);
+rw->rw_backend = NULL;
+/* mark entry as crashed, so we'll try again later */
+rw->rw_crashed_at = GetCurrentTimestamp();
+return false;


I think the error message should include the word "background." It would 
be more consistent with the log message above it.



+typedef struct
+{
+intsyslogFile;
+intcsvlogFile;
+intjsonlogFile;
+} syslogger_startup_data;


It would be nice if all of these startup data structs were named 
similarly. For instance, a previous one was BackendStartupInfo. It would 
help with greppability.


I noticed there were a few XXX comments left that you created. I'll 
highlight them here for more visibility.



+/* XXX: where does this belong? */
+extern bool LoadedSSL;


Perhaps near the My* variables or maybe in the Port struct?


+#ifdef EXEC_BACKEND
+
+/*
+ * Need to reinitialize the SSL library in the backend, since the 
context
+ * structures contain function pointers and cannot be passed through 
the
+ * parameter file.
+ *
+ * If for some reason reload fails (maybe the user installed broken key
+ * files), soldier on without SSL; that's better than all connections
+ * becoming impossible.
+ *
+ * XXX should we do this in all child processes?  For the moment it's
+ * enough to do it in backend children. XXX good question indeed
+ */
+#ifdef USE_SSL
+if (EnableSSL)
+{
+if (secure_initialize(false) == 0)
+LoadedSSL = true;
+else
+ereport(LOG,
+(errmsg("SSL configuration could not be 
loaded in child process")));
+}
+#endif
+#endif


Here you added the "good question indeed." I am not sure what the best 
answer is either! :)



+/* XXX: translation? */
+ereport(LOG,
+(errmsg("could not fork %s process: %m", 
PostmasterChildName(type;


I assume you are referring to the child name 

Re: Improving contrib/tablefunc's error reporting

2024-03-05 Thread Tom Lane
Joe Conway  writes:
> I will have a look at your patches, and the other issue you mention, but 
> it might be a day or three before I can give it some quality time.

No hurry certainly.  Thanks for looking.

regards, tom lane




Re: Experiments with Postgres and SSL

2024-03-05 Thread Jacob Champion
I keep forgetting -- attached is the diff I'm carrying to plug
libpq_encryption into Meson. (The current patchset has a meson.build
for it, but it's not connected.)

--Jacob
commit 64215f1e6b68208378b34cc0736d2f0eb1d45919
Author: Jacob Champion 
Date:   Wed Feb 28 11:28:17 2024 -0800

WIP: mesonify

diff --git a/src/test/libpq_encryption/meson.build 
b/src/test/libpq_encryption/meson.build
index 04f479e9fe..ac1db10d74 100644
--- a/src/test/libpq_encryption/meson.build
+++ b/src/test/libpq_encryption/meson.build
@@ -1,13 +1,12 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
 tests += {
-  'name': 'ldap',
+  'name': 'libpq_encryption',
   'sd': meson.current_source_dir(),
   'bd': meson.current_build_dir(),
   'tap': {
 'tests': [
-  't/001_auth.pl',
-  't/002_bindpasswd.pl',
+  't/001_negotiate_encryption.pl',
 ],
 'env': {
   'with_ssl': ssl_library,
diff --git a/src/test/meson.build b/src/test/meson.build
index c3d0dfedf1..702213bc6f 100644
--- a/src/test/meson.build
+++ b/src/test/meson.build
@@ -4,6 +4,7 @@ subdir('regress')
 subdir('isolation')
 
 subdir('authentication')
+subdir('libpq_encryption')
 subdir('recovery')
 subdir('subscription')
 subdir('modules')


Re: Popcount optimization using AVX512

2024-03-05 Thread Bruce Momjian
On Tue, Mar  5, 2024 at 04:52:23PM +, Amonson, Paul D wrote:
> -Original Message-
> >From: Nathan Bossart  
> >Sent: Tuesday, March 5, 2024 8:38 AM
> >To: Amonson, Paul D 
> >Cc: Andres Freund ; Alvaro Herrera 
> >; Shankaran, Akash ; 
> >Noah Misch ; Tom Lane ; Matthias van 
> >de Meent ; >pgsql-hackers@lists.postgresql.org
> >Subject: Re: Popcount optimization using AVX512
> >
> >On Tue, Mar 05, 2024 at 04:31:15PM +, Amonson, Paul D wrote:
> >> I am not sure what "top-post" means but I am not doing anything 
> >> different but using "reply to all" in Outlook. Please enlighten me. 
> >
> >The following link provides some more information:
> >
> > https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics
> >
> >--
> >Nathan Bossart
> >Amazon Web Services: https://aws.amazon.com
> 
> A.Ok... guess it's time to thank Microsoft then.  ;) Noted I will try 
> to do the "reduced" bottom-posting. I might slip up occasionally because it's 
> an Intel habit. Is there a way to make Outlook do the leading ">" in a reply 
> for the previous message?

Here is a blog post about how complex email posting can be:

https://momjian.us/main/blogs/pgblog/2023.html#September_8_2023

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Improving contrib/tablefunc's error reporting

2024-03-05 Thread Joe Conway

On 3/5/24 17:04, Tom Lane wrote:

After reading the thread at [1], I could not escape the feeling
that contrib/tablefunc's error reporting is very confusing.
Looking into the source code, I soon found that it is also
very inconsistent, with similar error reports being phrased
quite differently.  The terminology for column names doesn't
match the SGML docs either.  And there are some places that are
so confused about whether they are complaining about the calling
query or the called query that the output is flat-out backwards.
So at that point my nascent OCD wouldn't let me rest without
fixing it.  Here's a quick patch series to do that.

For review purposes, I split this into two patches.  0001 simply
adds some more test cases to reach currently-unexercised error
reports.  Then 0002 makes my proposed code changes and shows
how the existing error messages change.

I'm not necessarily wedded to the phrasings I used here,
in case anyone has better ideas.

BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.


Been a long time since I gave contrib/tablefunc any love I guess ;-)

I will have a look at your patches, and the other issue you mention, but 
it might be a day or three before I can give it some quality time.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Improving contrib/tablefunc's error reporting

2024-03-05 Thread Tom Lane
After reading the thread at [1], I could not escape the feeling
that contrib/tablefunc's error reporting is very confusing.
Looking into the source code, I soon found that it is also
very inconsistent, with similar error reports being phrased
quite differently.  The terminology for column names doesn't
match the SGML docs either.  And there are some places that are
so confused about whether they are complaining about the calling
query or the called query that the output is flat-out backwards.
So at that point my nascent OCD wouldn't let me rest without
fixing it.  Here's a quick patch series to do that.

For review purposes, I split this into two patches.  0001 simply
adds some more test cases to reach currently-unexercised error
reports.  Then 0002 makes my proposed code changes and shows
how the existing error messages change.

I'm not necessarily wedded to the phrasings I used here,
in case anyone has better ideas.

BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/DM4PR19MB597886696589C5CE33F5D58AD3222%40DM4PR19MB5978.namprd19.prod.outlook.com

From 7fba23c28e9e7558f54c52dd9407ce50f75b0e1d Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Tue, 5 Mar 2024 16:37:35 -0500
Subject: [PATCH v1 1/2] Add more test coverage for contrib/tablefunc.

Add test cases that exercise all the error reports in the module,
excluding those that are for failures elsewhere such as SPI errors.
---
 contrib/tablefunc/expected/tablefunc.out | 63 
 contrib/tablefunc/sql/tablefunc.sql  | 42 
 2 files changed, 105 insertions(+)

diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index 464c210f42..0c4e114aba 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -145,6 +145,21 @@ SELECT * FROM crosstab_out('SELECT rowid, attribute, val FROM ct where rowclass
| val9 | val10 | val11
 (3 rows)
 
+-- check error reporting
+SELECT * FROM crosstab('SELECT rowid, val FROM ct where rowclass = ''group1'' and (attribute = ''att2'' or attribute = ''att3'') ORDER BY 1,2;')
+  AS ct(row_name text, category_1 text, category_2 text);
+ERROR:  invalid source data SQL statement
+DETAIL:  The provided SQL must return 3 columns: rowid, category, and values.
+SELECT * FROM crosstab('SELECT rowid, attribute, val FROM ct where rowclass = ''group1'' and (attribute = ''att2'' or attribute = ''att3'') ORDER BY 1,2;')
+  AS ct(row_name text);
+ERROR:  return and sql tuple descriptions are incompatible
+SELECT * FROM crosstab('SELECT rowid, attribute, val FROM ct where rowclass = ''group1'' and (attribute = ''att2'' or attribute = ''att3'') ORDER BY 1,2;')
+  AS ct(row_name int, category_1 text, category_2 text);
+ERROR:  invalid return type
+DETAIL:  SQL rowid datatype does not match return rowid datatype.
+SELECT * FROM crosstab('SELECT rowid, attribute, val FROM ct where rowclass = ''group1'' and (attribute = ''att2'' or attribute = ''att3'') ORDER BY 1,2;')
+  AS ct(row_name text, category_1 text, category_2 int);
+ERROR:  return and sql tuple descriptions are incompatible
 --
 -- hash based crosstab
 --
@@ -223,6 +238,12 @@ SELECT * FROM crosstab(
   'SELECT DISTINCT rowdt, attribute FROM cth ORDER BY 2')
 AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp, volts float8);
 ERROR:  provided "categories" SQL must return 1 column of at least one row
+-- if category query generates a NULL value, get expected error
+SELECT * FROM crosstab(
+  'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
+  'SELECT NULL::text')
+AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp, volts float8);
+ERROR:  provided "categories" SQL must not return NULL values
 -- if source query returns zero rows, get zero rows returned
 SELECT * FROM crosstab(
   'SELECT rowid, rowdt, attribute, val FROM cth WHERE false ORDER BY 1',
@@ -241,6 +262,25 @@ AS c(rowid text, rowdt timestamp, temperature text, test_result text, test_start
 ---+---+-+-++---
 (0 rows)
 
+-- check errors with inappropriate input rowtype
+SELECT * FROM crosstab(
+  'SELECT rowid, attribute FROM cth ORDER BY 1',
+  'SELECT DISTINCT attribute FROM cth ORDER BY 1')
+AS c(rowid text, temperature text, test_result text, test_startdate text, volts text);
+ERROR:  invalid source data SQL statement
+DETAIL:  The provided SQL must return 3  columns; rowid, category, and values.
+SELECT * FROM crosstab(
+  'SELECT rowid, rowdt, rowdt, attribute, val FROM cth ORDER BY 1',
+  'SELECT DISTINCT attribute FROM cth 

Re: Get rid of the excess semicolon in planner.c

2024-03-05 Thread David Rowley
On Wed, 6 Mar 2024 at 00:43, Richard Guo  wrote:
>
> I think this is a typo introduced in 0452b461b.
>
>  +root->processed_groupClause = list_copy(parse->groupClause);;

"git grep -E ";;$" -- *.c *.h" tell me it's the only one.

Pushed.

David




Re: sslinfo extension - add notbefore and notafter timestamps

2024-03-05 Thread Jacob Champion
On Mon, Mar 4, 2024 at 6:23 AM Daniel Gustafsson  wrote:
> > On 12 Sep 2023, at 21:40, Jacob Champion  wrote:

Sorry for the long delay!

> >> + ssl_client_get_notbefore() returns text
> >> ...> + ssl_client_get_notafter() returns text
> >
> > I think this should say timestamptz rather than text? Ditto for the
> > pg_stat_ssl documentation.
> >
> > Speaking of which: is the use of `timestamp` rather than `timestamptz`
> > in pg_proc.dat intentional? Will that cause problems with comparisons?
>
> It should be timestamptz, it was a tyop on my part. Fixed.

Looks like sslinfo--1.2--1.3.sql is also declaring the functions as
timestamp rather than timestamptz, which is breaking comparisons with
the not_before/after columns. It might also be nice to rename
ASN1_TIME_to_timestamp().

Squinting further at the server backend implementation, should that
also be using TimestampTz throughout, instead of Timestamp? It all
goes through float8_timestamptz at the end, so I guess it shouldn't
have a material impact, but it's a bit confusing.

> Thanks for reviewing, the attached v8 contains the fixes from this review 
> along
> with a fresh rebase and some attempts at making tests more stable in the face
> of timezones by casting to date.

In my -08 timezone, the date doesn't match what's recorded either
(it's my "tomorrow"). I think those probably just need to be converted
to UTC explicitly? I've attached a sample diff on top of v8 that
passes tests on my machine.

--Jacob
diff --git a/contrib/sslinfo/sslinfo--1.2--1.3.sql 
b/contrib/sslinfo/sslinfo--1.2--1.3.sql
index 9d64d2bfa4..424a11afe4 100644
--- a/contrib/sslinfo/sslinfo--1.2--1.3.sql
+++ b/contrib/sslinfo/sslinfo--1.2--1.3.sql
@@ -3,10 +3,10 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION sslinfo" to load this file. \quit
 
-CREATE FUNCTION ssl_client_get_notbefore() RETURNS timestamp
+CREATE FUNCTION ssl_client_get_notbefore() RETURNS timestamptz
 AS 'MODULE_PATHNAME', 'ssl_client_get_notbefore'
 LANGUAGE C STRICT PARALLEL RESTRICTED;
 
-CREATE FUNCTION ssl_client_get_notafter() RETURNS timestamp
+CREATE FUNCTION ssl_client_get_notafter() RETURNS timestamptz
 AS 'MODULE_PATHNAME', 'ssl_client_get_notafter'
 LANGUAGE C STRICT PARALLEL RESTRICTED;
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index 490c48a7bb..90a4230413 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -740,10 +740,10 @@ command_like(
"$common_connstr user=ssltestuser sslcert=ssl/client.crt "
  . sslkey('client.key'),
'-c',
-   "SELECT 
ssl,version,cipher,bits,client_dn,client_serial,issuer_dn,not_before::date,not_after::date
 FROM pg_stat_ssl WHERE pid = pg_backend_pid()"
+   "SELECT 
ssl,version,cipher,bits,client_dn,client_serial,issuer_dn,not_before AT TIME 
ZONE 'UTC' AS not_before,not_after AT TIME ZONE 'UTC' AS not_after FROM 
pg_stat_ssl WHERE pid = pg_backend_pid()"
],

qr{^ssl,version,cipher,bits,client_dn,client_serial,issuer_dn,not_before,not_after\r?\n
-   
^t,TLSv[\d.]+,[\w-]+,\d+,/?CN=ssltestuser,$serialno,/?\QCN=Test CA for 
PostgreSQL SSL regression test client certs,2023-06-29,2050-01-01\E\r?$}mx,
+   
^t,TLSv[\d.]+,[\w-]+,\d+,/?CN=ssltestuser,$serialno,/?\QCN=Test CA for 
PostgreSQL SSL regression test client certs,2023-06-29 01:01:01,2050-01-01 
01:01:01\E\r?$}mx,
'pg_stat_ssl with client certificate');
 
 # client key with wrong permissions
diff --git a/src/test/ssl/t/003_sslinfo.pl b/src/test/ssl/t/003_sslinfo.pl
index 587c0e2dce..4df3a941b5 100644
--- a/src/test/ssl/t/003_sslinfo.pl
+++ b/src/test/ssl/t/003_sslinfo.pl
@@ -167,15 +167,15 @@ is($result, 't', "ssl_issuer_field() for commonName");
 
 $result = $node->safe_psql(
"certdb",
-   "SELECT ssl_client_get_notbefore()::date = not_before::date, "
- . "not_before::date = '2023-06-29' FROM pg_stat_ssl WHERE pid = 
pg_backend_pid();",
+   "SELECT ssl_client_get_notbefore() = not_before, "
+ . "not_before AT TIME ZONE 'UTC' = '2023-06-29 01:01:01' FROM 
pg_stat_ssl WHERE pid = pg_backend_pid();",
connstr => $common_connstr);
 is($result, 't|t', "ssl_client_get_notbefore() for not_before timestamp");
 
 $result = $node->safe_psql(
"certdb",
-   "SELECT ssl_client_get_notafter()::date = not_after::date, "
- . "not_after::date = '2050-01-01' FROM pg_stat_ssl WHERE pid = 
pg_backend_pid();",
+   "SELECT ssl_client_get_notafter() = not_after, "
+ . "not_after AT TIME ZONE 'UTC' = '2050-01-01 01:01:01' FROM 
pg_stat_ssl WHERE pid = pg_backend_pid();",
connstr => $common_connstr);
 is($result, 't|t', "ssl_client_get_notafter() for not_after timestamp");
 


Re: [PATCH] Exponential backoff for auth_delay

2024-03-05 Thread Nathan Bossart
On Mon, Mar 04, 2024 at 08:42:55PM +0100, Michael Banck wrote:
> On Wed, Feb 21, 2024 at 10:26:11PM +0100, Tomas Vondra wrote:
>> I think it'd be good to consider:
>> 
>> - Making the acr_array a hash table, and larger than 50 entries (I
>> wonder if that should be dynamic / customizable by GUC?).
> 
> I would say a GUC should be overkill for this as this would mostly be an
> implementation detail.
> 
> More generally, I think now that entries are expired (see below), this
> should be less of a concern, so I have not changed this to a hash table
> for now but doubled MAX_CONN_RECORDS to 100 entries.

I don't have a strong opinion about making this configurable, but I do
think we should consider making this a hash table so that we can set
MAX_CONN_RECORDS much higher.

Also, since these records are stored in shared memory, don't we need to
lock them when searching/updating?
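
For illustration, the usual shape of that locking, assuming a named LWLock
tranche were added (none of this is in the current patch):

/* requires RequestNamedLWLockTranche("auth_delay", 1) in a shmem_request_hook */
static LWLock *acr_lock;

static void
auth_delay_attach_lock(void)
{
    acr_lock = &(GetNamedLWLockTranche("auth_delay"))->lock;
}

static double
auth_delay_find_or_update(const char *remote_host)
{
    double      delay = 0.0;

    LWLockAcquire(acr_lock, LW_EXCLUSIVE);
    /* linear search / update of acr_array, as in the existing code */
    LWLockRelease(acr_lock);

    return delay;
}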

> +static void
> +auth_delay_init_state(void *ptr)
> +{
> + Sizeshm_size;
> + AuthConnRecord *array = (AuthConnRecord *) ptr;
> +
> + shm_size = sizeof(AuthConnRecord) * MAX_CONN_RECORDS;
> +
> + memset(array, 0, shm_size);
> +}
> +
> +static void
> +auth_delay_shmem_startup(void)
> +{
> + boolfound;
> + Sizeshm_size;
> +
> + if (shmem_startup_next)
> + shmem_startup_next();
> +
> + shm_size = sizeof(AuthConnRecord) * MAX_CONN_RECORDS;
> + acr_array = GetNamedDSMSegment("auth_delay", shm_size, 
> auth_delay_init_state, );
> +}

Great to see the DSM registry getting some use.  This example makes me
wonder whether the size of the segment should be passed to the
init_callback.
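
Something along these lines, for illustration (a hypothetical alternative
signature; the current callback only receives the pointer):

typedef void (*dsm_registry_init_cb) (void *ptr, Size size);

static void
auth_delay_init_state(void *ptr, Size size)
{
    /* no need to recompute shm_size locally */
    memset(ptr, 0, size);
}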

>  /*
>   * Module Load Callback
>   */
>  void
>  _PG_init(void)
>  {
> + if (!process_shared_preload_libraries_in_progress)
> + ereport(ERROR,
> + 
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +  errmsg("auth_delay must be loaded via 
> shared_preload_libraries")));
> +

This change seems like a good idea independent of this feature.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: remaining sql/json patches

2024-03-05 Thread Tomas Vondra
Hi,

I know very little about sql/json and all the json internals, but I
decided to do some black box testing. I built a large JSONB table
(single column, ~7GB of data after loading). And then I did a query
transforming the data into tabular form using JSON_TABLE.

The JSON_TABLE query looks like this:

SELECT jt.* FROM
  title_jsonb t,
  json_table(t.info, '$'
COLUMNS (
  "id" text path '$."id"',
  "type" text path '$."type"',
  "title" text path '$."title"',
  "original_title" text path '$."original_title"',
  "is_adult" text path '$."is_adult"',
  "start_year" text path '$."start_year"',
  "end_year" text path '$."end_year"',
  "minutes" text path '$."minutes"',
  "genres" text path '$."genres"',
  "aliases" text path '$."aliases"',
  "directors" text path '$."directors"',
  "writers" text path '$."writers"',
  "ratings" text path '$."ratings"',
  NESTED PATH '$."aliases"[*]'
COLUMNS (
  "alias_title" text path '$."title"',
  "alias_region" text path '$."region"'
),
  NESTED PATH '$."directors"[*]'
COLUMNS (
  "director_name" text path '$."name"',
  "director_birth_year" text path '$."birth_year"',
  "director_death_year" text path '$."death_year"'
),
  NESTED PATH '$."writers"[*]'
COLUMNS (
  "writer_name" text path '$."name"',
  "writer_birth_year" text path '$."birth_year"',
  "writer_death_year" text path '$."death_year"'
),
  NESTED PATH '$."ratings"[*]'
COLUMNS (
  "rating_average" text path '$."average"',
  "rating_votes" text path '$."votes"'
)
)
  ) as jt;

again, not particularly complex. But if I run this, it consumes multiple
gigabytes of memory, before it gets killed by OOM killer. This happens
even when ran using

  COPY (...) TO '/dev/null'

so there's nothing sent to the client. I did catch memory context info,
where it looks like this (complete stats attached):

--
TopMemoryContext: 97696 total in 5 blocks; 13056 free (11 chunks);
  84640 used
  ...
  TopPortalContext: 8192 total in 1 blocks; 7680 free (0 chunks); ...
PortalContext: 1024 total in 1 blocks; 560 free (0 chunks); ...
  ExecutorState: 2541764672 total in 314 blocks; 6528176 free
 (1208 chunks); 2535236496 used
printtup: 8192 total in 1 blocks; 7952 free (0 chunks); ...
...
...
Grand total: 2544132336 bytes in 528 blocks; 7484504 free
 (1340 chunks); 2536647832 used
--

I'd say 2.5GB in ExecutorState seems a bit excessive ... Seems there's
some memory management issue? My guess is we're not releasing memory
allocated while parsing the JSON or building JSON output.
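
If that guess is right, the usual fix is to do the per-row work in a
short-lived memory context that gets reset between rows; a generic sketch of
that pattern (fetch_next_row()/process_row() are placeholders, not the
json_table code):

MemoryContext perRowCxt = AllocSetContextCreate(CurrentMemoryContext,
                                                "per-row work",
                                                ALLOCSET_DEFAULT_SIZES);

for (;;)
{
    MemoryContext oldcxt;

    if (!fetch_next_row())      /* placeholder */
        break;

    MemoryContextReset(perRowCxt);
    oldcxt = MemoryContextSwitchTo(perRowCxt);

    process_row();              /* allocations here die at the next reset */

    MemoryContextSwitchTo(oldcxt);
}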


I'm not attaching the data, but I can provide that if needed - it's
about 600MB compressed. The structure is not particularly complex, it's
movie info from [1] combined into a JSON document (one per movie).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

TopMemoryContext: 97696 total in 5 blocks; 13056 free (11 chunks); 84640 used
  Type information cache: 24384 total in 2 blocks; 2640 free (0 chunks); 21744 
used
  TableSpace cache: 8192 total in 1 blocks; 2112 free (0 chunks); 6080 used
  TopTransactionContext: 8192 total in 1 blocks; 7760 free (2 chunks); 432 used
  RowDescriptionContext: 8192 total in 1 blocks; 6912 free (0 chunks); 1280 used
  MessageContext: 524288 total in 7 blocks; 240736 free (4 chunks); 283552 used
  search_path processing cache: 8192 total in 1 blocks; 5616 free (8 chunks); 
2576 used
  Operator class cache: 8192 total in 1 blocks; 576 free (0 chunks); 7616 used
  PgStat Shared Ref Hash: 7232 total in 2 blocks; 704 free (0 chunks); 6528 used
  PgStat Shared Ref: 4096 total in 3 blocks; 496 free (2 chunks); 3600 used
  PgStat Pending: 16384 total in 5 blocks; 6464 free (9 chunks); 9920 used
  smgr relation table: 32768 total in 3 blocks; 16848 free (8 chunks); 15920 
used
  TransactionAbortContext: 32768 total in 1 blocks; 32528 free (0 chunks); 240 
used
  Portal hash: 8192 total in 1 blocks; 576 free (0 chunks); 7616 used
  TopPortalContext: 8192 total in 1 blocks; 7680 free (0 chunks); 512 used
PortalContext: 1024 total in 1 blocks; 560 free (0 chunks); 464 used: 

  ExecutorState: 2541764672 total in 314 blocks; 6528176 free (1208 
chunks); 2535236496 used
printtup: 8192 total in 1 blocks; 7952 free (0 chunks); 240 used
TableFunc per value context: 8192 total in 1 blocks; 6672 free (0 
chunks); 1520 used
  JsonTableExecContext: 8192 total in 1 blocks; 6864 free (0 chunks); 
1328 used
JsonTableExecContext: 8192 total in 1 blocks; 7952 free (0 chunks); 
240 used
JsonTableExecContext: 8192 total in 1 blocks; 7952 free (0 chunks); 
240 used
JsonTableExecContext: 8192 total in 1 blocks; 7952 free (0 chunks); 
240 used

Re: processes stuck in shutdown following OOM/recovery

2024-03-05 Thread Thomas Munro
On Sat, Dec 2, 2023 at 3:30 PM Thomas Munro  wrote:
> On Sat, Dec 2, 2023 at 2:18 PM Thomas Munro  wrote:
> > On Fri, Dec 1, 2023 at 6:13 PM Justin Pryzby  wrote:
> > > $ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 
> > > 2524495 is a child's pid
> >
> > > This affects v15, and fails at ) but not its parent.
> >
> > Repro'd here.  I had to make the sleep shorter on my system.  Looking...
>
> The PostmasterStateMachine() case for PM_WAIT_BACKENDS doesn't tell
> the checkpointer to shut down in this race case.  We have
> CheckpointerPID != 0 (because 7ff23c6d27 starts it earlier than
> before), and FatalError is true because a child recently crashed and
> we haven't yet received the PMSIGNAL_RECOVERY_STARTED handler that
> would clear it.  Hmm.

Here is a first attempt at fixing this.  I am not yet 100% sure if it
is right, and there may be a nicer/simpler way to express the
conditions.  It passes the test suite, and it fixes the repro that
Justin posted.  FYI on my machine I had to use sleep 0.005 where he
had 0.05, as an FYI if someone else is trying to reproduce the issue.


0001-Fix-rare-recovery-shutdown-hang-due-to-checkpointer.patch
Description: Binary data


Re: Support "Right Semi Join" plan shapes

2024-03-05 Thread Alena Rybakina
To be honest, I didn't see it in the code, could you tell me where they 
are, please?


On 05.03.2024 05:44, Richard Guo wrote:


On Tue, Jan 30, 2024 at 2:51 PM Alena Rybakina 
 wrote:


I have reviewed your patch and I think it is better to add an
Assert for
JOIN_RIGHT_SEMI to the ExecMergeJoin and ExecNestLoop functions to
prevent the use of RIGHT_SEMI for these types of connections
(NestedLoop
and MergeJoin).


Hmm, I don't see why this is necessary.  The planner should already
guarantee that we won't have nestloops/mergejoins with right-semi joins.

Thanks
Richard


--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-03-05 Thread Matthias van de Meent
On Wed, 21 Feb 2024 at 12:37, Michail Nikolaev
 wrote:
>
> Hi!
>
> > How do you suppose this would work differently from a long-lived
> > normal snapshot, which is how it works right now?
>
> Difference in the ability to take new visibility snapshot periodically
> during the second phase with rechecking visibility of tuple according
> to the "reference" snapshot (which is taken only once like now).
> It is the approach from (1) but with a workaround for the issues
> caused by heap_page_prune_opt.
>
> > Would it be exclusively for that relation?
> Yes, only for that affected relation. Other relations are unaffected.

I suppose this could work. We'd also need to be very sure that the
toast relation isn't cleaned up either: Even though that's currently
DELETE+INSERT only and can't apply HOT, it would be an issue if we
couldn't find the TOAST data of a deleted for everyone (but visible to
us) tuple.

Note that disabling cleanup for a relation will also disable cleanup
of tuple versions in that table that are not used for the R/CIC
snapshots, and that'd be an issue, too.

> > How would this be integrated with e.g. heap_page_prune_opt?
> Probably by some flag in RelationData, but not sure here yet.
>
> If the idea looks sane, I could try to extend my POC - it should be
> not too hard, likely (I already have tests to make sure it is
> correct).

I'm not a fan of this approach. Changing visibility and cleanup
semantics to only benefit R/CIC sounds like a pain to work with in
essentially all visibility-related code. I'd much rather have to deal
with another index AM, even if it takes more time: the changes in
semantics will be limited to a new plug in the index AM system and a
behaviour change in R/CIC, rather than behaviour that changes in all
visibility-checking code.

But regardless of second scan snapshots, I think we can worry about
that part at a later moment: The first scan phase is usually the most
expensive and takes the most time of all phases that hold snapshots,
and in the above discussion we agreed that we can already reduce the
time that a snapshot is held during that phase significantly. Sure, it
isn't great that we have to scan the table again with only a single
snapshot, but generally phase 2 doesn't have that much to do (except
when BRIN indexes are involved) so this is likely less of an issue.
And even if it is, we would still have reduced the number of
long-lived snapshots by half.

-Matthias




Re: Switching XLog source from archive to streaming when primary available

2024-03-05 Thread Nathan Bossart
On Tue, Mar 05, 2024 at 11:38:37PM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 5, 2024 at 7:34 AM Nathan Bossart  
> wrote:
>> Is there any way to simplify this?  For
>> example, would it be possible to make an enum that tracks the
>> streaming_replication_retry_interval state?
> 
> I guess the way it is right now looks simple IMHO. If the suggestion
> is to have an enum like below; it looks overkill for just two states.
> 
> typedef enum
> {
> CAN_SWITCH_SOURCE,
> SWITCH_SOURCE
> } XLogSourceSwitchState;

I was thinking of something more like

typedef enum
{
    NO_FORCE_SWITCH_TO_STREAMING,       /* no switch necessary */
    FORCE_SWITCH_TO_STREAMING_PENDING,  /* exhausting pg_wal */
    FORCE_SWITCH_TO_STREAMING,          /* switch to streaming now */
} WALSourceSwitchState;

At least, that illustrates my mental model of the process here.  IMHO
that's easier to follow than two similarly-named bool variables.
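
For illustration, the decision point would then read roughly like this
(wal_source_switch_state and pg_wal_exhausted are placeholder names):

switch (wal_source_switch_state)
{
    case NO_FORCE_SWITCH_TO_STREAMING:
        break;                  /* keep using the current source */
    case FORCE_SWITCH_TO_STREAMING_PENDING:
        if (pg_wal_exhausted)
            wal_source_switch_state = FORCE_SWITCH_TO_STREAMING;
        break;
    case FORCE_SWITCH_TO_STREAMING:
        currentSource = XLOG_FROM_STREAM;
        wal_source_switch_state = NO_FORCE_SWITCH_TO_STREAMING;
        break;
}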

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-05 Thread Nathan Bossart
On Wed, Mar 06, 2024 at 12:50:38AM +0530, Bharath Rupireddy wrote:
> On Mon, Mar 4, 2024 at 2:11 PM Bertrand Drouvot
>  wrote:
>> On Sun, Mar 03, 2024 at 03:44:34PM -0600, Nathan Bossart wrote:
>> > Unless I am misinterpreting some details, ISTM we could rename this column
>> > to invalidation_reason and use it for both logical and physical slots.  I'm
>> > not seeing a strong need for another column.
>>
>> Yeah having two columns was more for convenience purpose. Without the 
>> "conflict"
>> one, a slot conflicting with recovery would be "a logical slot having a non 
>> NULL
>> invalidation_reason".
>>
>> I'm also fine with one column if most of you prefer that way.
> 
> While we debate on the above, please find the attached v7 patch set
> after rebasing.

It looks like Bertrand is okay with reusing the same column for both
logical and physical slots, which IIUC is what you initially proposed in v1
of the patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-05 Thread Bharath Rupireddy
On Mon, Mar 4, 2024 at 2:11 PM Bertrand Drouvot
 wrote:
>
> On Sun, Mar 03, 2024 at 03:44:34PM -0600, Nathan Bossart wrote:
> > On Sun, Mar 03, 2024 at 11:40:00PM +0530, Bharath Rupireddy wrote:
> > > On Sat, Mar 2, 2024 at 3:41 AM Nathan Bossart  
> > > wrote:
> > >> Would you ever see "conflict" as false and "invalidation_reason" as
> > >> non-null for a logical slot?
> > >
> > > No. Because both conflict and invalidation_reason are decided based on
> > > the invalidation reason i.e. value of slot_contents.data.invalidated.
> > > IOW, a logical slot that reports conflict as true must have been
> > > invalidated.
> > >
> > > Do you have any thoughts on reverting 007693f and introducing
> > > invalidation_reason?
> >
> > Unless I am misinterpreting some details, ISTM we could rename this column
> > to invalidation_reason and use it for both logical and physical slots.  I'm
> > not seeing a strong need for another column.
>
> Yeah having two columns was more for convenience purpose. Without the 
> "conflict"
> one, a slot conflicting with recovery would be "a logical slot having a non 
> NULL
> invalidation_reason".
>
> I'm also fine with one column if most of you prefer that way.

While we debate on the above, please find the attached v7 patch set
after rebasing.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 906f8829f7b6bf1da4b37edf2e4d5a46a7227400 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 5 Mar 2024 18:56:00 +
Subject: [PATCH v7 1/4] Track invalidation_reason in pg_replication_slots

Currently the reason for replication slot invalidation is not
tracked in pg_replication_slots. A recent commit 007693f2a added
conflict_reason to show the reasons for slot invalidation, but
only for logical slots. This commit adds invalidation_reason to
pg_replication_slots to show invalidation reasons for both
physical and logical slots.
---
 doc/src/sgml/system-views.sgml   | 32 
 src/backend/catalog/system_views.sql |  3 ++-
 src/backend/replication/slotfuncs.c  | 12 ---
 src/include/catalog/pg_proc.dat  |  6 +++---
 src/test/regress/expected/rules.out  |  5 +++--
 5 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index be90edd0e2..cce88c14bb 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2581,6 +2581,38 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
  
 
+ 
+  
+   invalidation_reason text
+  
+  
+   The reason for the slot's invalidation. NULL if the
+   slot is currently actively being used. The non-NULL values indicate that
+   the slot is marked as invalidated. Possible values are:
+   
+
+ 
+  wal_removed means that the required WAL has been
+  removed.
+ 
+
+
+ 
+  rows_removed means that the required rows have
+  been removed.
+ 
+
+
+ 
+  wal_level_insufficient means that the
+  primary doesn't have a  sufficient to
+  perform logical decoding.
+ 
+
+   
+  
+ 
+
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 04227a72d1..c39f0d73d3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1025,7 +1025,8 @@ CREATE VIEW pg_replication_slots AS
 L.two_phase,
 L.conflict_reason,
 L.failover,
-L.synced
+L.synced,
+L.invalidation_reason
 FROM pg_get_replication_slots() AS L
 LEFT JOIN pg_database D ON (L.datoid = D.oid);
 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 768a304723..a7a250b7c5 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -239,7 +239,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)
 Datum
 pg_get_replication_slots(PG_FUNCTION_ARGS)
 {
-#define PG_GET_REPLICATION_SLOTS_COLS 17
+#define PG_GET_REPLICATION_SLOTS_COLS 18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	XLogRecPtr	currlsn;
 	int			slotno;
@@ -263,6 +263,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		bool		nulls[PG_GET_REPLICATION_SLOTS_COLS];
 		WALAvailability walstate;
 		int			i;
+		ReplicationSlotInvalidationCause cause;
 
 		if (!slot->in_use)
 			continue;
@@ -409,12 +410,12 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 
 		values[i++] = BoolGetDatum(slot_contents.data.two_phase);
 
+		cause = slot_contents.data.invalidated;
+
 		if (slot_contents.data.database == InvalidOid)
 			nulls[i++] = true;
 		else
 		{
-			ReplicationSlotInvalidationCause cause = slot_contents.data.invalidated;
-
 			if (cause == RS_INVAL_NONE)
 nulls[i++] = true;
 			else
@@ 

Re: Support a wildcard in backtrace_functions

2024-03-05 Thread Alvaro Herrera
On 2024-Mar-06, Bharath Rupireddy wrote:

> +1 for disallowing *foo or foo* or foo*bar etc. combinations.

Cool.

> I think we need to go a bit further and convert backtrace_functions of
> type GUC_LIST_INPUT so that check_backtrace_functions can just use
> SplitIdentifierString to parse the list of identifiers. Then, the
> strspn can just be something like below for each token:
> 
> validlen = strspn(*tok,
> "0123456789_"
> "abcdefghijklmnopqrstuvwxyz"
> "ABCDEFGHIJKLMNOPQRSTUVWXYZ");
> 
> Does anyone see a problem with it?

IIRC the reason it's coded as it is, is so that we have a single palloc
chunk of memory to free when the value changes; we purposefully stayed
away from SplitIdentifierString and the like.  What problem do you see
with the idea I proposed?  That was:

> On Thu, Feb 29, 2024 at 4:05 PM Alvaro Herrera  
> wrote:

> > I think we should tighten this up: an asterisk should be allowed
> > only if it appears alone in the string (short-circuiting
> > check_backtrace_functions before strspn); and let's leave the
> > strspn() call alone.

That means, just add something like this at the top of
check_backtrace_functions and don't do anything to this function
otherwise (untested code):

if (newval[0] == '*' && newval[1] == '\0')
{
someval = guc_malloc(ERROR, 2);
if (someval == NULL)
return false;
someval[0] = '*';
someval[1] = '\0';
*extra = someval;
return true;
}

(Not sure if a second trailing \0 is necessary.)
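
If the stored value keeps the existing double-NUL-terminated list format,
the special case would need three bytes rather than two, something like
(still untested):

if (newval[0] == '*' && newval[1] == '\0')
{
    char       *someval = guc_malloc(ERROR, 3);

    if (someval == NULL)
        return false;
    someval[0] = '*';
    someval[1] = '\0';
    someval[2] = '\0';          /* empty string terminates the list */
    *extra = someval;
    return true;
}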

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)




Re: Experiments with Postgres and SSL

2024-03-05 Thread Jacob Champion
On Tue, Mar 5, 2024 at 6:09 AM Heikki Linnakangas  wrote:
>
> I hope I didn't joggle your elbow reviewing this

Nope, not at all!

> The tests are still not distinguishing whether a connection was
> established in direct or negotiated mode. So if we e.g. had a bug that
> accidentally disabled direct SSL connection completely and always used
> negotiated mode, the tests would still pass. I'd like to see some tests
> that would catch that.

+1

On Mon, Mar 4, 2024 at 7:29 AM Heikki Linnakangas  wrote:
> On 01/03/2024 23:49, Jacob Champion wrote:
> > I'll squint more closely at the MITM-protection changes in 0008 later.
> > First impressions, though: it looks like that code has gotten much
> > less straightforward, which I think is dangerous given the attack it's
> > preventing. (Off-topic: I'm skeptical of future 0-RTT support. Our
> > protocol doesn't seem particularly replay-safe to me.)
>
> Let's drop that patch. AFAICS it's not needed by the rest of the patches.

Okay, sounds good.

> > If we're interested in ALPN negotiation in the future, we may also
> > want to look at GREASE [1] to keep those options open in the presence
> > of third-party implementations. Unfortunately OpenSSL doesn't do this
> > automatically yet.
>
> Can you elaborate?

Sure: now that we're letting middleboxes and proxies inspect and react
to connections based on ALPN, it's possible that some intermediary
might incorrectly fixate on the "postgres" ID (or whatever we choose
in the end), and shut down connections that carry additional protocols
rather than ignoring them. That would prevent future graceful upgrades
where the client sends both "postgres/X" and "postgres/X+1". While
that wouldn't be our fault, it'd be cold comfort to whoever has that
middlebox.

GREASE is a set of reserved protocol IDs that you can add randomly to
your ALPN list, so any middleboxes that fail to follow the rules will
just break outright rather than silently proliferating. (Hence the
pun: GREASE keeps the joints in the pipe from rusting into place.) The
RFC goes into more detail about how to do it. And I don't know if it's
necessary for a v1, but it'd be something to keep in mind.
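
For illustration, on the client side it would look roughly like this with
OpenSSL ("postgres" standing in for whatever ALPN id is finally chosen; a
real implementation would pick the GREASE value at random per RFC 8701):

static const unsigned char alpn_with_grease[] = {
    2, 0x0A, 0x0A,                              /* GREASE ALPN id */
    8, 'p', 'o', 's', 't', 'g', 'r', 'e', 's'   /* real protocol */
};

/* note: 0 means success for this particular OpenSSL call */
if (SSL_set_alpn_protos(ssl, alpn_with_grease, sizeof(alpn_with_grease)) != 0)
    /* report error */ ;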

> Do we need to do something extra in the server to be
> compatible with GREASE?

No, I think that as long as we use OpenSSL's APIs correctly on the
server side, we'll be compatible by default. This would be a
client-side implementation, to push random GREASE strings into the
ALPN list. (There is a risk that if/when OpenSSL finally starts
supporting this transparently, we'd need to remove it from our code.)

> > If we don't have a reason not to, it'd be good to follow the strictest
> > recommendations from [2] to avoid cross-protocol attacks. (For anyone
> > currently running web servers and Postgres on the same host, they
> > really don't want browsers "talking" to their Postgres servers.) That
> > would mean checking the negotiated ALPN on both the server and client
> > side, and failing if it's not what we expect.
>
> Hmm, I thought that's what the patches does. But looking closer, libpq
> is not checking that ALPN was used. We should add that. Am I right?

Right. Also, it looks like the server isn't failing the TLS handshake
itself, but instead just dropping the connection after the handshake.
In a cross-protocol attack, there's a danger that the client (which is
not speaking our protocol) could still treat the server as
authoritative in that situation.
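
As a sketch, the client-side check after the handshake could be as small as
this (error handling elided; EXPECTED_ALPN and the location of the SSL handle
are my placeholders, not names from the patch):

    const unsigned char *proto = NULL;
    unsigned int proto_len = 0;

    SSL_get0_alpn_selected(conn->ssl, &proto, &proto_len);
    if (proto == NULL ||
        proto_len != strlen(EXPECTED_ALPN) ||
        memcmp(proto, EXPECTED_ALPN, proto_len) != 0)
    {
        /* no ALPN, or an unexpected protocol: fail the connection */
    }

and the server would do the analogous check, ideally failing inside the
handshake itself (e.g. by having the ALPN selection callback return
SSL_TLSEXT_ERR_ALERT_FATAL) so that a non-libpq client never sees a completed,
authenticated-looking handshake.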

> > I'm not excited about the proliferation of connection options. I don't
> > have a lot of ideas on how to fix it, though, other than to note that
> > the current sslnegotiation option names are very unintuitive to me:
> > - "postgres": only legacy handshakes
> > - "direct": might be direct... or maybe legacy
> > - "requiredirect": only direct handshakes... unless other options are
> > enabled and then we fall back again to legacy? How many people willing
> > to break TLS compatibility with old servers via "requiredirect" are
> > going to be okay with lazy fallback to GSS or otherwise?
>
> Yeah, this is my biggest complaint about all this. Not so much the names
> of the options, but the number of combinations of different options, and
> how we're going to test them all. I don't have any great solutions,
> except adding a lot of tests to cover them, like Matthias did.

The default gssencmode=prefer is especially problematic if I'm trying
to use sslnegotiation=requiredirect for security. It'll appear to work
at first, but if somehow I get a credential cache into my environment,
libpq will suddenly fall back to plaintext negotiation :(

> > (Plus, we need to have a good error message when connecting to older
> > servers anyway. I think we should be able to key off of the EOF coming
> > back from OpenSSL; it'd be a good excuse to give that part of the code
> > some love.)
>
> Hmm, if OpenSSL sends ClientHello and the server responds with a
> Postgres error packet, OpenSSL will presumably consume the error packet
> or at 

Re: Support a wildcard in backtrace_functions

2024-03-05 Thread Bharath Rupireddy
On Thu, Feb 29, 2024 at 4:05 PM Alvaro Herrera  wrote:
>
> Hmm, so if I write "foo,*" this will work but check all function names
> first and on the second entry.  But if I write "foo*" the GUC value will
> be accepted but match nothing (as will "*foo" or "foo*bar").  I don't
> like either of these behaviors.  I think we should tighten this up: an
> asterisk should be allowed only if it appears alone in the string
> (short-circuiting check_backtrace_functions before strspn); and let's
> leave the strspn() call alone.

+1 for disallowing *foo or foo* or foo*bar etc. combinations. I think
we need to go a bit further and convert backtrace_functions to a
GUC_LIST_INPUT GUC so that check_backtrace_functions can just use
SplitIdentifierString to parse the list of identifiers. Then, the
strspn can just be something like below for each token:

validlen = strspn(*tok,
"0123456789_"
"abcdefghijklmnopqrstuvwxyz"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ");

Does anyone see a problem with it?
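
For illustration, here is roughly what the whole check hook could then look
like (a sketch only, not the actual patch; the GUC_check_errdetail() texts and
the omission of any "extra" handling are mine):

static bool
check_backtrace_functions(char **newval, void **extra, GucSource source)
{
    char       *rawstring = pstrdup(*newval);
    List       *namelist;
    ListCell   *lc;
    bool        ok = true;

    if (!SplitIdentifierString(rawstring, ',', &namelist))
    {
        GUC_check_errdetail("List syntax is invalid.");
        ok = false;
    }
    else
    {
        foreach(lc, namelist)
        {
            char       *tok = (char *) lfirst(lc);

            if (strspn(tok,
                       "0123456789_"
                       "abcdefghijklmnopqrstuvwxyz"
                       "ABCDEFGHIJKLMNOPQRSTUVWXYZ") != strlen(tok))
            {
                GUC_check_errdetail("Invalid function name: \"%s\".", tok);
                ok = false;
                break;
            }
        }
    }

    pfree(rawstring);
    list_free(namelist);
    return ok;
}

One thing to keep in mind is that SplitIdentifierString() downcases unquoted
tokens, which may or may not be what we want for C function names.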

FWIW, I've recently noticed for my work on
https://commitfest.postgresql.org/47/2863/ that there isn't any test
covering all the backtrace related code - backtrace_functions GUC,
backtrace_on_internal_error GUC, set_backtrace(), backtrace(),
backtrace_symbols(). I came up with a test module covering these areas
https://commitfest.postgresql.org/47/4823/. I could make the TAP tests
pass on all the CF bot animals. Interestingly, the new code that gets
added for this thread can also be covered with it. Any thoughts are
welcome.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] updates to docs about HOT updates for BRIN

2024-03-05 Thread Jeff Davis
On Tue, 2024-03-05 at 09:53 -0600, Elizabeth Christensen wrote:
> Looks good to me. Thanks!

Thank you, committed.

Regards,
Jeff Davis





Re: Switching XLog source from archive to streaming when primary available

2024-03-05 Thread Bharath Rupireddy
On Tue, Mar 5, 2024 at 7:34 AM Nathan Bossart  wrote:
>
> cfbot claims that this one needs another rebase.

Yeah, the conflict was with the new TAP test file name in
src/test/recovery/meson.build.

> I've spent some time thinking about this one.  I'll admit I'm a bit worried
> about adding more complexity to this state machine, but I also haven't
> thought of any other viable approaches,

Right. I understand that WaitForWALToBecomeAvailable()'s state
machine is a complex piece of code.

> and this still seems like a useful
> feature.  So, for now, I think we should continue with the current
> approach.

Yes, the feature is useful like mentioned in the docs as below:

+Reading WAL from archive may not always be as efficient and fast as
+reading from primary. This can be due to the differences in disk types,
+IO costs, network latencies etc. All of these can impact the recovery
+performance on standby, and can increase the replication lag on
+primary. In addition, the primary keeps accumulating WAL needed for the
+standby while the standby reads WAL from archive, because the standby
+replication slot stays inactive. To avoid these problems, one can use
+this parameter to make standby switch to stream mode sooner.

> +fails to switch to stream mode, it falls back to archive mode. If 
> this
> +parameter value is specified without units, it is taken as
> +milliseconds. Default is 5min. With a lower value
>
> Does this really need to be milliseconds?  I would think that any
> reasonable setting would at least on the order of seconds.

Agreed. Done that way.

> +attempts. To avoid this, it is recommended to set a reasonable value.
>
> I think we might want to suggest what a "reasonable value" is.

It really depends on the WAL generation rate on the primary. If the
WAL files grow faster, the disk runs out of space sooner, so setting a
lower value to make WAL source switch attempts more frequent can help.
It's hard to suggest a one-size-fits-all value. Therefore, I've tweaked
the docs a bit to reflect the fact that it depends on the WAL generation rate.

> +   static bool canSwitchSource = false;
> +   boolswitchSource = false;
>
> IIUC "canSwitchSource" indicates that we are trying to force a switch to
> streaming, but we are currently exhausting anything that's present in the
> pg_wal directory,

Right.

> while "switchSource" indicates that we should force a
> switch to streaming right now.

It's not indicating a forced switch; it says "previously I was asked to
switch source via canSwitchSource, now that I've exhausted all the WAL
from the pg_wal directory, I'll make a source switch attempt now".

> Furthermore, "canSwitchSource" is static
> while "switchSource" is not.

This is because WaitForWALToBecomeAvailable() has to remember the
decision (that streaming_replication_retry_interval has elapsed)
across calls, whereas switchSource is decided within
WaitForWALToBecomeAvailable() on every call.

> Is there any way to simplify this?  For
> example, would it be possible to make an enum that tracks the
> streaming_replication_retry_interval state?

I guess the way it is right now looks simple IMHO. If the suggestion
is to have an enum like the one below, it looks like overkill for just
two states.

typedef enum
{
CAN_SWITCH_SOURCE,
SWITCH_SOURCE
} XLogSourceSwitchState;

> /*
>  * Don't allow any retry loops to occur during 
> nonblocking
> -* readahead.  Let the caller process everything that 
> has been
> -* decoded already first.
> +* readahead if we failed to read from the current 
> source. Let the
> +* caller process everything that has been decoded 
> already first.
>  */
> -   if (nonblocking)
> +   if (nonblocking && lastSourceFailed)
> return XLREAD_WOULDBLOCK;
>
> Why do we skip this when "switchSource" is set?

It was leftover from the initial version of the patch - I was then
encountering some issue and had that piece there. Removed it now.

> +   /* Reset the WAL source switch state */
> +   if (switchSource)
> +   {
> +   Assert(canSwitchSource);
> +   Assert(currentSource == XLOG_FROM_STREAM);
> +   Assert(oldSource == XLOG_FROM_ARCHIVE);
> +   switchSource = false;
> +   canSwitchSource = false;
> +   }
>
> How do we know that oldSource is guaranteed to be XLOG_FROM_ARCHIVE?  Is
> there no way it could be XLOG_FROM_PG_WAL?

No. switchSource is set to true only when canSwitchSource is set to
true, which happens only when currentSource 

Re: pgsql: Fix search_path to a safe value during maintenance operations.

2024-03-05 Thread Jeff Davis
On Tue, 2024-03-05 at 17:19 +0100, Alvaro Herrera wrote:
> This appears to have upset the sepgsql tests.  In buildfarm member
> rhinoceros there's now a bunch of errors like this

Thank you, pushed, and it appears to have fixed the problem.

Regards,
Jeff Davis





Re: Improve readability by using designated initializers when possible

2024-03-05 Thread Jelte Fennema-Nio
On Tue, 5 Mar 2024 at 15:30, Japin Li  wrote:
> There is a warning if remove it, so I keep it.
>
> /home/japin/Codes/postgres/build/../src/backend/executor/execExprInterp.c:118:33:
>  warning: label ‘CASE_EEOP_LAST’ defined but not used [-Wunused-label]
>   118 | #define EEO_CASE(name)  CASE_##name:
>   | ^
> /home/japin/Codes/postgres/build/../src/backend/executor/execExprInterp.c:1845:17:
>  note: in expansion of macro ‘EEO_CASE’
>  1845 | EEO_CASE(EEOP_LAST)
>   | ^~~~

I think if you remove the EEO_CASE(EEOP_LAST) block the warning should
go away. That block is clearly marked as unreachable, so it doesn't
really serve a purpose.




Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2024-03-05 Thread Alvaro Herrera
On 2024-Mar-05, Dean Rasheed wrote:

> Looking at the other places that call RelationGetIndexAttrBitmap()
> with INDEX_ATTR_BITMAP_PRIMARY_KEY, they all appear to want to include
> deferrable PKs, since they are relying on the result to see which
> columns are not nullable.

Hmm, I find this pretty surprising, but you are right.  Somehow I had
the idea that INDEX_ATTR_BITMAP_PRIMARY_KEY was used for planning
activities so I didn't want to modify its behavior ... but clearly
that's not at all the case.  It's only used for DDL, and one check in
logical replication.

> So there are other bugs here. For example:
> 
> CREATE TABLE foo (id int PRIMARY KEY DEFERRABLE, val text);
> CREATE TABLE bar (LIKE foo);
> 
> now fails to mark bar.id as not nullable, whereas prior to
> b0e96f311985 it would have been.

Fun.  (Thankfully, easy to fix. But I'll add this as a test too.)

> So I think RelationGetIndexAttrBitmap() should include deferrable PKs,

Yeah, I'll go make it so.  I think I'll add a test for the case that
changes behavior in logical replication first (which is that the target
relation of logical replication is currently not marked as updatable,
when its PK is deferrable).

> but not all the changes made to RelationGetIndexList() by b0e96f311985
> need reverting.

I'll give this a look too.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php




RE: Popcount optimization using AVX512

2024-03-05 Thread Amonson, Paul D
-Original Message-
>From: Nathan Bossart  
>Sent: Tuesday, March 5, 2024 8:38 AM
>To: Amonson, Paul D 
>Cc: Andres Freund ; Alvaro Herrera 
>; Shankaran, Akash ; Noah 
>Misch ; Tom Lane ; Matthias van de 
>Meent ; >pgsql-hackers@lists.postgresql.org
>Subject: Re: Popcount optimization using AVX512
>
>On Tue, Mar 05, 2024 at 04:31:15PM +, Amonson, Paul D wrote:
>> I am not sure what "top-post" means but I am not doing anything 
>> different but using "reply to all" in Outlook. Please enlighten me. 
>
>The following link provides some more information:
>
>   https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics
>
>--
>Nathan Bossart
>Amazon Web Services: https://aws.amazon.com

Ah, OK... guess it's time to thank Microsoft then. ;) Noted, I will try 
to do the "reduced" bottom-posting. I might slip up occasionally because it's 
an Intel habit. Is there a way to make Outlook add the leading ">" in a reply 
for the previous message?

BTW: Created the commit-fest submission.

Paul



Re: Popcount optimization using AVX512

2024-03-05 Thread Nathan Bossart
On Tue, Mar 05, 2024 at 04:31:15PM +, Amonson, Paul D wrote:
> I am not sure what "top-post" means but I am not doing anything different
> but using "reply to all" in Outlook. Please enlighten me. 

The following link provides some more information:

https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Adding deprecation notices to pgcrypto documentation

2024-03-05 Thread Nathan Bossart
On Tue, Mar 05, 2024 at 11:50:36AM +0100, Daniel Gustafsson wrote:
>> On 4 Mar 2024, at 23:49, Nathan Bossart  wrote:
>> * Should this be a "Warning" and/or moved to the top of the page?  This
>>  seems like a relatively important notice that folks should see when
>>  beginning to use pgcrypto.
> 
> Good question.  If we do we'd probably need to move other equally important
> bits of information from "Security Limitations" as well so perhaps it's best 
> to
> keep it as is for now, or putting it under Notes.

Fair point.

>> * Should we actually document the exact list of algorithms along with
>>  detailed reasons?  This list seems prone to becoming outdated.
> 
> If we don't detail the list then I think that it's not worth doing, doing the
> research isn't entirely trivial as one might not even know where to look or
> what to look for.
> 
> I don't think this list will move faster than we can keep up with it,
> especially since it's more or less listing everything that pgcrypto supports 
> at
> this point.

Also fair.  Would updates to this list be back-patched?

> Looking at this some more I propose that we also remove the table of hash
> benchmarks, as it's widely misleading.  Modern hardware can generate far more
> than what we list here, and it gives the impression that these algorithms can
> only be broken with brute force which is untrue.  The table was first 
> published
> in 2008 and hasn't been updated since.

It looks like it was updated in 2013 [0] (commit d6464fd).  If there are
still objections to removing it, I think it should at least be given its
decennial update.

[0] 
https://postgr.es/m/CAPVvHdPj5rmf294FbWi2TuEy%3DhSxZMNjTURESaM5zY8P_wCJMg%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] updates to docs about HOT updates for BRIN

2024-03-05 Thread Stephen Frost
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Tue, 2024-02-27 at 09:48 -0500, Stephen Frost wrote:
> > Attached is an updated patch which drops the 'such as' and adds a
> > sentence mentioning that BRIN is the only in-core summarizing index.
> 
> The original patch reads more clearly to me. In v4, summarizing (the
> exception) feels like it's dominating the description.
> 
> Also, is it standard practice to backport this kind of doc update? I
> ordinarily wouldn't be inclined to do so, but v4 seems targeted at 16
> as well.

I do think this change should be back-ported to when the change
happened, otherwise the documentation won't reflect what's in the
product for that version...

> Attached my own suggested wording that hopefully addresses Stephen and
> Alvaro's concerns. I agree that it's tricky to write so I took a more
> minimalist approach:

>  * I got rid of the "In summary" sentence because (a) it's confusing
> now that we're talking about summarizing indexes; and (b) it's not
> summarizing anything, it's just redundant.

>  * I removed the mention partial or expression indexes. It's a bit
> redundant and doesn't seem especially helpful in this context.

Just to point it out- the "In summary" did provide a bit of a summary,
before the 'partial or expression indexes' bit was removed.  That said,
I tend to still agree with these changes as I feel that users will
generally be able to infer that this applies to partial and expression
indexes without it having to be spelled out to them.

> If this is agreeable I can commit it.

Great, thanks!

Stephen


signature.asc
Description: PGP signature


Re: Refactoring backend fork+exec code

2024-03-05 Thread Heikki Linnakangas

On 05/03/2024 11:44, Richard Guo wrote:

I noticed that there are still three places in backend_status.c where
pgstat_get_beentry_by_backend_id() is referenced.  I think we should
replace them with pgstat_get_beentry_by_proc_number().


Fixed, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)





RE: Popcount optimization using AVX512

2024-03-05 Thread Amonson, Paul D
Hi,

I am not sure what "top-post" means but I am not doing anything different but 
using "reply to all" in Outlook. Please enlighten me. 

This is the new patch with the hand edit to remove the offending lines from the 
patch file. I did a basic test to make sure the patch would apply and build. It 
succeeded.

Thanks,
Paul

-Original Message-
From: Nathan Bossart  
Sent: Monday, March 4, 2024 2:21 PM
To: Amonson, Paul D 
Cc: Andres Freund ; Alvaro Herrera 
; Shankaran, Akash ; Noah 
Misch ; Tom Lane ; Matthias van de Meent 
; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

(Please don't top-post on the Postgres lists.)

On Mon, Mar 04, 2024 at 09:39:36PM +, Amonson, Paul D wrote:
> First, apologies on the patch. Find re-attached updated version.

Thanks for the new version of the patch.

>> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
>> +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 
>> +31) << 31))
>>
>> IME this means that the autoconf you are using has been patched.  A 
>> quick search on the mailing lists seems to indicate that it might be 
>> specific to Debian [1].
>  
> I am not sure what the ask is here?  I made changes to the 
> configure.ac and ran autoconf2.69 to get builds to succeed. Do you 
> have a separate feedback here?

These LARGE_OFF_T changes are unrelated to the patch at hand and should be 
removed.  This likely means that you are using a patched autoconf that is 
making these extra changes.
 
> As for the refactoring, this was done to satisfy previous review 
> feedback about applying the AVX512 CFLAGS to the entire pg_bitutils.c 
> file. Mainly to avoid segfault due to the AVX512 flags. If its ok, I 
> would prefer to make a single commit as the change is pretty small and 
> straight forward.

Okay.  The only reason I suggest this is to ease review.  For example, if there 
is some required refactoring that doesn't involve any functionality changes, it 
can be advantageous to get that part reviewed and committed first so that 
reviewers can better focus on the code for the new feature.
But, of course, that isn't necessary and/or isn't possible in all cases.

> I am not sure I understand the comment about the SIZE_VOID_P checks.
> Aren't they necessary to choose which functions to call based on 32 or 
> 64 bit architectures?

Yes.  My comment was that the patch appeared to make unnecessary changes to 
this code.  Perhaps I am misunderstanding something here.

> Would this change qualify for Workflow A as described in [0] and can 
> be picked up by a committer, given it has been reviewed by multiple 
> committers so far? The scope of the change is pretty contained as well.

I think so.  I would still encourage you to create an entry for this so that it 
is automatically tested via cfbot [0].

[0] http://commitfest.cputube.org/

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


v6-0001-Add-support-for-AVX512-implemented-POPCNT.patch
Description: v6-0001-Add-support-for-AVX512-implemented-POPCNT.patch


Re: pgsql: Fix search_path to a safe value during maintenance operations.

2024-03-05 Thread Alvaro Herrera
On 2024-Mar-05, Jeff Davis wrote:

> Fix search_path to a safe value during maintenance operations.
> 
> While executing maintenance operations (ANALYZE, CLUSTER, REFRESH
> MATERIALIZED VIEW, REINDEX, or VACUUM), set search_path to
> 'pg_catalog, pg_temp' to prevent inconsistent behavior.
> 
> Functions that are used for functional indexes, in index expressions,
> or in materialized views and depend on a different search path must be
> declared with CREATE FUNCTION ... SET search_path='...'.

This appears to have upset the sepgsql tests.  In buildfarm member
rhinoceros there's now a bunch of errors like this

 ALTER TABLE regtest_table_4
   ADD CONSTRAINT regtest_tbl4_con EXCLUDE USING btree (z WITH =);
+LOG:  SELinux: allowed { search } 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0 
tcontext=unconfined_u:object_r:sepgsql_schema_t:s0 tclass=db_schema 
name="regtest_schema" permissive=0
+LOG:  SELinux: allowed { search } 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0 
tcontext=system_u:object_r:sepgsql_schema_t:s0 tclass=db_schema name="public" 
permissive=0

in its ddl.sql test.  I suppose this is just the result of the internal
change of search_path.  Maybe the thing to do is just accept the new
output as expected.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)




un-revert the MAINTAIN privilege and the pg_maintain predefined role

2024-03-05 Thread Nathan Bossart
Thanks to Jeff's recent work with commits 2af07e2 and 59825d1, the issue
that led to the revert of the MAINTAIN privilege and the pg_maintain
predefined role (commit 151c22d) should now be resolved.  Specifically,
there was a concern that roles with the MAINTAIN privilege could use
search_path tricks to run arbitrary code as the table owner.  Jeff's work
prevents this by restricting search_path to a known safe value when running
maintenance commands.  (This approach and others were discussed on the
lists quite extensively, and it was also brought up at the developer
meeting at FOSDEM [0] earlier this year.)

Given this, I'd like to finally propose un-reverting MAINTAIN and
pg_maintain.  I created a commitfest entry for this [1] a few weeks ago and
attached it to Jeff's search_path thread, but I figured it would be good to
create a dedicated thread for this, too.  The attached patch is a straight
revert of commit 151c22d except for the following small changes:

* The catversion bump has been removed for now.  The catversion will need
  to be bumped appropriately if/when this is committed.

* The OID for the pg_maintain predefined role needed to be changed.  The
  original OID has been reused for something else since this feature was
  reverted.

* The change in AdjustUpgrade.pm needed to be updated to check for
  "$old_version < 17" instead of "$old_version < 16".

Thoughts?

[0] 
https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2024_Developer_Meeting#The_Path_to_un-reverting_the_MAINTAIN_privilege
[1] https://commitfest.postgresql.org/47/4836/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d2598b78f0796be20ad322fcb3b9ef7cfaa76491 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 5 Mar 2024 09:49:49 -0600
Subject: [PATCH v1 1/1] Revert "Revert MAINTAIN privilege and pg_maintain
 predefined role."

XXX: NEEDS CATVERSION BUMP

This reverts commit 151c22deee66a3390ca9a1c3675e29de54ae73fc.
---
 doc/src/sgml/ddl.sgml |  35 --
 doc/src/sgml/func.sgml|   2 +-
 .../sgml/ref/alter_default_privileges.sgml|   4 +-
 doc/src/sgml/ref/analyze.sgml |   6 +-
 doc/src/sgml/ref/cluster.sgml |  10 +-
 doc/src/sgml/ref/grant.sgml   |   3 +-
 doc/src/sgml/ref/lock.sgml|   4 +-
 .../sgml/ref/refresh_materialized_view.sgml   |   5 +-
 doc/src/sgml/ref/reindex.sgml |  23 ++--
 doc/src/sgml/ref/revoke.sgml  |   2 +-
 doc/src/sgml/ref/vacuum.sgml  |   6 +-
 doc/src/sgml/user-manag.sgml  |  12 ++
 src/backend/catalog/aclchk.c  |  15 +++
 src/backend/commands/analyze.c|  13 +-
 src/backend/commands/cluster.c|  43 +--
 src/backend/commands/indexcmds.c  |  34 ++---
 src/backend/commands/lockcmds.c   |   2 +-
 src/backend/commands/matview.c|   3 +-
 src/backend/commands/tablecmds.c  |  16 +--
 src/backend/commands/vacuum.c |  65 +-
 src/backend/utils/adt/acl.c   |   8 ++
 src/bin/pg_dump/dumputils.c   |   1 +
 src/bin/pg_dump/t/002_pg_dump.pl  |   2 +-
 src/bin/psql/tab-complete.c   |   6 +-
 src/include/catalog/pg_authid.dat |   5 +
 src/include/commands/tablecmds.h  |   5 +-
 src/include/commands/vacuum.h |   5 +-
 src/include/nodes/parsenodes.h|   3 +-
 src/include/utils/acl.h   |   5 +-
 .../expected/cluster-conflict-partition.out   |   8 +-
 .../specs/cluster-conflict-partition.spec |   2 +-
 .../perl/PostgreSQL/Test/AdjustUpgrade.pm |  11 ++
 src/test/regress/expected/cluster.out |   7 ++
 src/test/regress/expected/create_index.out|   4 +-
 src/test/regress/expected/dependency.out  |  22 ++--
 src/test/regress/expected/privileges.out  | 116 ++
 src/test/regress/expected/rowsecurity.out |  34 ++---
 src/test/regress/sql/cluster.sql  |   5 +
 src/test/regress/sql/dependency.sql   |   2 +-
 src/test/regress/sql/privileges.sql   |  68 ++
 40 files changed, 444 insertions(+), 178 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 9d7e2c756b..8616a8e9cc 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1868,8 +1868,8 @@ ALTER TABLE products RENAME TO items;
INSERT, UPDATE, DELETE,
TRUNCATE, REFERENCES, TRIGGER,
CREATE, CONNECT, TEMPORARY,
-   EXECUTE, USAGE, SET
-   and ALTER SYSTEM.
+   EXECUTE, USAGE, SET,
+   ALTER SYSTEM, and MAINTAIN.
The privileges applicable to a particular
object vary depending on the object's type (table, function, etc.).
More detail about the meanings of these privileges appears below.
@@ -2160,7 +2160,19 @@ REVOKE ALL ON accounts FROM PUBLIC;
   
  
 
-   
+
+   
+MAINTAIN

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 6:41 PM John Naylor  wrote:
>
> On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada  wrote:
> >
> > On Fri, Feb 2, 2024 at 8:47 PM John Naylor  wrote:
>
> > > It's pretty hard to see what test_pattern() is doing, or why it's
> > > useful. I wonder if instead the test could use something like the
> > > benchmark where random integers are masked off. That seems simpler. I
> > > can work on that, but I'd like to hear your side about test_pattern().
> >
> > Yeah, test_pattern() is originally created for the integerset so it
> > doesn't necessarily fit the radixtree. I agree to use some tests from
> > benchmarks.
>
> Done in v66-0009. I'd be curious to hear any feedback. I like the
> aspect that the random numbers come from a different seed every time
> the test runs.

The new tests look good. Here are some comments:

---
+   expected = keys[i];
+   iterval = rt_iterate_next(iter, &iterkey);

-   ndeleted++;
+   EXPECT_TRUE(iterval != NULL);
+   EXPECT_EQ_U64(iterkey, expected);
+   EXPECT_EQ_U64(*iterval, expected);

Can we verify that the iteration returns keys in ascending order?
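
Something along these lines inside the loop would cover that (just a sketch
reusing the names from the snippet above; num_keys and the exact loop shape
are my assumptions about the surrounding test code):

uint64      lastkey = 0;

for (i = 0; i < num_keys; i++)
{
    uint64      iterkey;
    uint64     *iterval;

    expected = keys[i];
    iterval = rt_iterate_next(iter, &iterkey);

    EXPECT_TRUE(iterval != NULL);
    EXPECT_EQ_U64(iterkey, expected);
    EXPECT_EQ_U64(*iterval, expected);

    /* keys must come back in strictly ascending order */
    EXPECT_TRUE(i == 0 || iterkey > lastkey);
    lastkey = iterkey;
}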

---
+ /* reset random number generator for deletion */
+ pg_prng_seed(, seed);

Why is resetting the seed required here?

---
The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end.

---
radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
  "test_radix_tree",
  ALLOCSET_DEFAULT_SIZES);

We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
think it's better to use either one for consistency.

> I'd like to push 0001 and 0002 shortly, and then do another sweep over
> 0003, with remaining feedback, and get that in so we get some
> buildfarm testing before the remaining polishing work on
> tidstore/vacuum.

Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
polish tidstore and vacuum patches and update commit messages.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] updates to docs about HOT updates for BRIN

2024-03-05 Thread Elizabeth Christensen



> On Mar 4, 2024, at 5:32 PM, Jeff Davis  wrote:

> Attached my own suggested wording that hopefully addresses Stephen and
> Alvaro's concerns. I agree that it's tricky to write so I took a more
> minimalist approach:
> 
> If this is agreeable I can commit it.
> 
> Regards,
>   Jeff Davis
> 
> 


Looks good to me. Thanks!

Elizabeth




Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2024-03-05 Thread Dean Rasheed
On Tue, 5 Mar 2024 at 12:36, Alvaro Herrera  wrote:
>
> Yeah.  As I said upthread, a good fix seems to require no longer relying
> on RelationGetIndexAttrBitmap to obtain the columns in the primary key,
> because that function does not include deferred primary keys.  I came up
> with the attached POC, which seems to fix the reported problem, but of
> course it needs more polish, a working test case, and verifying whether
> the new function should be used in more places -- in particular, whether
> it can be used to revert the changes to RelationGetIndexList that
> b0e96f311985 did.
>

Looking at the other places that call RelationGetIndexAttrBitmap()
with INDEX_ATTR_BITMAP_PRIMARY_KEY, they all appear to want to include
deferrable PKs, since they are relying on the result to see which
columns are not nullable.

So there are other bugs here. For example:

CREATE TABLE foo (id int PRIMARY KEY DEFERRABLE, val text);
CREATE TABLE bar (LIKE foo);

now fails to mark bar.id as not nullable, whereas prior to
b0e96f311985 it would have been.

So I think RelationGetIndexAttrBitmap() should include deferrable PKs,
but not all the changes made to RelationGetIndexList() by b0e96f311985
need reverting.

Regards,
Dean




Re: Improve readability by using designated initializers when possible

2024-03-05 Thread Japin Li

On Tue, 05 Mar 2024 at 22:03, Jelte Fennema-Nio  wrote:
> On Tue, 5 Mar 2024 at 14:50, Japin Li  wrote:
>> Attach a patch to rewrite dispatch_table array using C99-designated
>> initializer syntax.
>
> Looks good. Two small things:

Thanks for the review.

>
> +   [EEOP_LAST] = &_EEOP_LAST,
>
> Is EEOP_LAST actually needed in this array? It seems unused afaict. If
> indeed not needed, that would be good to remove in an additional
> commit.

There is a warning if remove it, so I keep it.

/home/japin/Codes/postgres/build/../src/backend/executor/execExprInterp.c:118:33:
 warning: label ‘CASE_EEOP_LAST’ defined but not used [-Wunused-label]
  118 | #define EEO_CASE(name)  CASE_##name:
  | ^
/home/japin/Codes/postgres/build/../src/backend/executor/execExprInterp.c:1845:17:
 note: in expansion of macro ‘EEO_CASE’
 1845 | EEO_CASE(EEOP_LAST)
  | ^~~~

>
> - *
> - * The order of entries needs to be kept in sync with the dispatch_table[]
> - * array in execExprInterp.c:ExecInterpExpr().
>
> I think it would be good to at least keep the comment saying that this
> array should be updated (only the order doesn't need to be strictly
> kept in sync anymore).

Fixed.

>From 20a72f2a5e0b282812ecde5c6aef2ce4ab118003 Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Tue, 5 Mar 2024 21:32:46 +0800
Subject: [PATCH v2 1/1] Use C99-designated initializer syntax for
 dispatch_table array

---
 src/backend/executor/execExprInterp.c | 195 +-
 src/include/executor/execExpr.h   |   2 +-
 2 files changed, 97 insertions(+), 100 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 7c1f51e2e0..e4ad522312 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -400,107 +400,104 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	TupleTableSlot *outerslot;
 	TupleTableSlot *scanslot;
 
-	/*
-	 * This array has to be in the same order as enum ExprEvalOp.
-	 */
 #if defined(EEO_USE_COMPUTED_GOTO)
 	static const void *const dispatch_table[] = {
-		&_EEOP_DONE,
-		&_EEOP_INNER_FETCHSOME,
-		&_EEOP_OUTER_FETCHSOME,
-		&_EEOP_SCAN_FETCHSOME,
-		&_EEOP_INNER_VAR,
-		&_EEOP_OUTER_VAR,
-		&_EEOP_SCAN_VAR,
-		&_EEOP_INNER_SYSVAR,
-		&_EEOP_OUTER_SYSVAR,
-		&_EEOP_SCAN_SYSVAR,
-		&_EEOP_WHOLEROW,
-		&_EEOP_ASSIGN_INNER_VAR,
-		&_EEOP_ASSIGN_OUTER_VAR,
-		&_EEOP_ASSIGN_SCAN_VAR,
-		&_EEOP_ASSIGN_TMP,
-		&_EEOP_ASSIGN_TMP_MAKE_RO,
-		&_EEOP_CONST,
-		&_EEOP_FUNCEXPR,
-		&_EEOP_FUNCEXPR_STRICT,
-		&_EEOP_FUNCEXPR_FUSAGE,
-		&_EEOP_FUNCEXPR_STRICT_FUSAGE,
-		&_EEOP_BOOL_AND_STEP_FIRST,
-		&_EEOP_BOOL_AND_STEP,
-		&_EEOP_BOOL_AND_STEP_LAST,
-		&_EEOP_BOOL_OR_STEP_FIRST,
-		&_EEOP_BOOL_OR_STEP,
-		&_EEOP_BOOL_OR_STEP_LAST,
-		&_EEOP_BOOL_NOT_STEP,
-		&_EEOP_QUAL,
-		&_EEOP_JUMP,
-		&_EEOP_JUMP_IF_NULL,
-		&_EEOP_JUMP_IF_NOT_NULL,
-		&_EEOP_JUMP_IF_NOT_TRUE,
-		&_EEOP_NULLTEST_ISNULL,
-		&_EEOP_NULLTEST_ISNOTNULL,
-		&_EEOP_NULLTEST_ROWISNULL,
-		&_EEOP_NULLTEST_ROWISNOTNULL,
-		&_EEOP_BOOLTEST_IS_TRUE,
-		&_EEOP_BOOLTEST_IS_NOT_TRUE,
-		&_EEOP_BOOLTEST_IS_FALSE,
-		&_EEOP_BOOLTEST_IS_NOT_FALSE,
-		&_EEOP_PARAM_EXEC,
-		&_EEOP_PARAM_EXTERN,
-		&_EEOP_PARAM_CALLBACK,
-		&_EEOP_CASE_TESTVAL,
-		&_EEOP_MAKE_READONLY,
-		&_EEOP_IOCOERCE,
-		&_EEOP_IOCOERCE_SAFE,
-		&_EEOP_DISTINCT,
-		&_EEOP_NOT_DISTINCT,
-		&_EEOP_NULLIF,
-		&_EEOP_SQLVALUEFUNCTION,
-		&_EEOP_CURRENTOFEXPR,
-		&_EEOP_NEXTVALUEEXPR,
-		&_EEOP_ARRAYEXPR,
-		&_EEOP_ARRAYCOERCE,
-		&_EEOP_ROW,
-		&_EEOP_ROWCOMPARE_STEP,
-		&_EEOP_ROWCOMPARE_FINAL,
-		&_EEOP_MINMAX,
-		&_EEOP_FIELDSELECT,
-		&_EEOP_FIELDSTORE_DEFORM,
-		&_EEOP_FIELDSTORE_FORM,
-		&_EEOP_SBSREF_SUBSCRIPTS,
-		&_EEOP_SBSREF_OLD,
-		&_EEOP_SBSREF_ASSIGN,
-		&_EEOP_SBSREF_FETCH,
-		&_EEOP_DOMAIN_TESTVAL,
-		&_EEOP_DOMAIN_NOTNULL,
-		&_EEOP_DOMAIN_CHECK,
-		&_EEOP_CONVERT_ROWTYPE,
-		&_EEOP_SCALARARRAYOP,
-		&_EEOP_HASHED_SCALARARRAYOP,
-		&_EEOP_XMLEXPR,
-		&_EEOP_JSON_CONSTRUCTOR,
-		&_EEOP_IS_JSON,
-		&_EEOP_AGGREF,
-		&_EEOP_GROUPING_FUNC,
-		&_EEOP_WINDOW_FUNC,
-		&_EEOP_SUBPLAN,
-		&_EEOP_AGG_STRICT_DESERIALIZE,
-		&_EEOP_AGG_DESERIALIZE,
-		&_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
-		&_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
-		&_EEOP_AGG_PLAIN_PERGROUP_NULLCHECK,
-		&_EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYREF,
-		&_EEOP_AGG_PLAIN_TRANS_STRICT_BYREF,
-		&_EEOP_AGG_PLAIN_TRANS_BYREF,
-		&_EEOP_AGG_PRESORTED_DISTINCT_SINGLE,
-		&_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
-		&_EEOP_AGG_ORDERED_TRANS_DATUM,
-		&_EEOP_AGG_ORDERED_TRANS_TUPLE,
-		&_EEOP_LAST
+		[EEOP_DONE] = &_EEOP_DONE,
+		[EEOP_INNER_FETCHSOME] = &_EEOP_INNER_FETCHSOME,
+		[EEOP_OUTER_FETCHSOME] = &_EEOP_OUTER_FETCHSOME,
+		[EEOP_SCAN_FETCHSOME] = &_EEOP_SCAN_FETCHSOME,
+		[EEOP_INNER_VAR] = &_EEOP_INNER_VAR,
+		[EEOP_OUTER_VAR] = 

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-03-05 Thread Давыдов Виталий

Hi Heikki,

Thank you for the reply.

On Tuesday, March 05, 2024 12:05 MSK, Heikki Linnakangas  
wrote:
> In a nutshell, this changes PREPARE TRANSACTION so that if
> synchronous_commit is 'off', the PREPARE TRANSACTION is not fsync'd to
> disk. So if you crash after the PREPARE TRANSACTION has returned, the
> transaction might be lost. I think that's completely unacceptable.

You are right, the prepared transaction might be lost after a crash. The same
may happen with regular transactions, which are not fsync-ed on the replica in
logical replication by default: the subscription parameter synchronous_commit
is OFF by default. I'm not sure whether there is some automatic recovery for
regular transactions. I think the main difference between these two cases is
how to manually recover when some PREPARE TRANSACTION or COMMIT PREPARED is
lost. For regular transactions, some updates or deletes in tables on the
replica may be enough to fix the problem. For twophase transactions, it may be
harder to fix by hand, but it is possible, I believe. If you create a custom
solution that is based on twophase transactions (like multimaster), such
recovery may happen automatically. Another solution is to ignore errors on
COMMIT PREPARED if the corresponding prepared transaction is missing. I don't
know of other risks that async commit of twophase transactions may introduce.

> If you're ok to lose the prepared state of twophase transactions on
> crash, why don't you create the subscription with 'two_phase=off' to
> begin with?

In normal operation, the subscription has two_phase = on. I would have to
change this option at the catchup stage only, but this parameter cannot be
altered. There was a patch proposal in the past to implement altering of the
two_phase option, but it was rejected. I think recreating the subscription
with two_phase = off will not work.

I believe async commit for twophase transactions during catchup would
significantly improve the catchup performance. It is worth thinking about
such a feature.

P.S. We might introduce a GUC option to allow async commit of twophase
transactions. By default, sync commit would be applied to twophase
transactions, as it is now.

With best regards,
Vitaly Davydov


Re: type cache cleanup improvements

2024-03-05 Thread Teodor Sigaev





> I would like to tweak the patch a little bit - change some comments,
> add some Asserts, etc. Don't you mind?

You are welcome!
--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: Experiments with Postgres and SSL

2024-03-05 Thread Heikki Linnakangas
I hope I didn't joggle your elbow reviewing this, Jacob, but I spent 
some time rebasing and fixing various little things:


- Incorporated Matthias's test changes

- Squashed the client, server and documentation patches. Not much point 
in keeping them separate, as one requires the other, and if you're only 
interested e.g. in the server parts, just look at src/backend.


- Squashed some of my refactorings with the main patches, because I'm 
certain enough that they're desirable. I kept the last libpq state 
machine refactoring separate though. I'm pretty sure we need a 
refactoring like that, but I'm not 100% sure about the details.


- Added some comments to the new state machine logic in fe-connect.c.

- Removed the XXX comments about TLS alerts.

- Removed the "Allow pipelining data after ssl request" patch

- Reordered the patches so that the first two patches add the tests 
different combinations of sslmode, gssencmode and server support. That 
could be committed separately, without the rest of the patches. A later 
patch expands the tests for the new sslnegotiation option.



The tests are still not distinguishing whether a connection was 
established in direct or negotiated mode. So if we e.g. had a bug that 
accidentally disabled direct SSL connection completely and always used 
negotiated mode, the tests would still pass. I'd like to see some tests 
that would catch that.


--
Heikki Linnakangas
Neon (https://neon.tech)
From c3b88ffb05a2a8b50e1af3220bf8f524e8bbae46 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 5 Mar 2024 15:40:30 +0200
Subject: [PATCH v8 1/6] Move Kerberos module

So that we can reuse it in new tests.
---
 src/test/kerberos/t/001_auth.pl   | 174 ++--
 src/test/perl/PostgreSQL/Test/Kerberos.pm | 229 ++
 2 files changed, 240 insertions(+), 163 deletions(-)
 create mode 100644 src/test/perl/PostgreSQL/Test/Kerberos.pm

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 2a81ce8834b..9d3fac83aaa 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -21,6 +21,7 @@ use strict;
 use warnings FATAL => 'all';
 use PostgreSQL::Test::Utils;
 use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Kerberos;
 use Test::More;
 use Time::HiRes qw(usleep);
 
@@ -34,177 +35,27 @@ elsif (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bkerberos\b/)
 	  'Potentially unsafe test GSSAPI/Kerberos not enabled in PG_TEST_EXTRA';
 }
 
-my ($krb5_bin_dir, $krb5_sbin_dir);
-
-if ($^O eq 'darwin' && -d "/opt/homebrew")
-{
-	# typical paths for Homebrew on ARM
-	$krb5_bin_dir = '/opt/homebrew/opt/krb5/bin';
-	$krb5_sbin_dir = '/opt/homebrew/opt/krb5/sbin';
-}
-elsif ($^O eq 'darwin')
-{
-	# typical paths for Homebrew on Intel
-	$krb5_bin_dir = '/usr/local/opt/krb5/bin';
-	$krb5_sbin_dir = '/usr/local/opt/krb5/sbin';
-}
-elsif ($^O eq 'freebsd')
-{
-	$krb5_bin_dir = '/usr/local/bin';
-	$krb5_sbin_dir = '/usr/local/sbin';
-}
-elsif ($^O eq 'linux')
-{
-	$krb5_sbin_dir = '/usr/sbin';
-}
-
-my $krb5_config = 'krb5-config';
-my $kinit = 'kinit';
-my $klist = 'klist';
-my $kdb5_util = 'kdb5_util';
-my $kadmin_local = 'kadmin.local';
-my $krb5kdc = 'krb5kdc';
-
-if ($krb5_bin_dir && -d $krb5_bin_dir)
-{
-	$krb5_config = $krb5_bin_dir . '/' . $krb5_config;
-	$kinit = $krb5_bin_dir . '/' . $kinit;
-	$klist = $krb5_bin_dir . '/' . $klist;
-}
-if ($krb5_sbin_dir && -d $krb5_sbin_dir)
-{
-	$kdb5_util = $krb5_sbin_dir . '/' . $kdb5_util;
-	$kadmin_local = $krb5_sbin_dir . '/' . $kadmin_local;
-	$krb5kdc = $krb5_sbin_dir . '/' . $krb5kdc;
-}
-
-my $host = 'auth-test-localhost.postgresql.example.com';
-my $hostaddr = '127.0.0.1';
-my $realm = 'EXAMPLE.COM';
-
-my $krb5_conf = "${PostgreSQL::Test::Utils::tmp_check}/krb5.conf";
-my $kdc_conf = "${PostgreSQL::Test::Utils::tmp_check}/kdc.conf";
-my $krb5_cache = "${PostgreSQL::Test::Utils::tmp_check}/krb5cc";
-my $krb5_log = "${PostgreSQL::Test::Utils::log_path}/krb5libs.log";
-my $kdc_log = "${PostgreSQL::Test::Utils::log_path}/krb5kdc.log";
-my $kdc_port = PostgreSQL::Test::Cluster::get_free_port();
-my $kdc_datadir = "${PostgreSQL::Test::Utils::tmp_check}/krb5kdc";
-my $kdc_pidfile = "${PostgreSQL::Test::Utils::tmp_check}/krb5kdc.pid";
-my $keytab = "${PostgreSQL::Test::Utils::tmp_check}/krb5.keytab";
-
 my $pgpass = "${PostgreSQL::Test::Utils::tmp_check}/.pgpass";
 
 my $dbname = 'postgres';
 my $username = 'test1';
 my $application = '001_auth.pl';
 
-note "setting up Kerberos";
-
-my ($stdout, $krb5_version);
-run_log [ $krb5_config, '--version' ], '>', \$stdout
-  or BAIL_OUT("could not execute krb5-config");
-BAIL_OUT("Heimdal is not supported") if $stdout =~ m/heimdal/;
-$stdout =~ m/Kerberos 5 release ([0-9]+\.[0-9]+)/
-  or BAIL_OUT("could not get Kerberos version");
-$krb5_version = $1;
-
 # Construct a pgpass file to make sure we don't use it
 append_to_file($pgpass, '*:*:*:*:abc123');
 
 chmod 0600, $pgpass;
 
-# Build the krb5.conf to use.
-#
-# 

Re: Reducing the log spam

2024-03-05 Thread Jelte Fennema-Nio
On Tue, 5 Mar 2024 at 14:55, Jim Jones  wrote:
> > So what about a parameter "log_suppress_sqlstates" that suppresses
> > logging ERROR and FATAL messages with the enumerated SQL states?

Big +1 from me for this idea.




Re: Improve readability by using designated initializers when possible

2024-03-05 Thread Jelte Fennema-Nio
On Tue, 5 Mar 2024 at 14:50, Japin Li  wrote:
> Attach a patch to rewrite dispatch_table array using C99-designated
> initializer syntax.

Looks good. Two small things:

+   [EEOP_LAST] = &_EEOP_LAST,

Is EEOP_LAST actually needed in this array? It seems unused afaict. If
indeed not needed, that would be good to remove in an additional
commit.

- *
- * The order of entries needs to be kept in sync with the dispatch_table[]
- * array in execExprInterp.c:ExecInterpExpr().

I think it would be good to at least keep the comment saying that this
array should be updated (only the order doesn't need to be strictly
kept in sync anymore).




Re: Reducing the log spam

2024-03-05 Thread Pavel Stehule
Hi

út 5. 3. 2024 v 14:55 odesílatel Jim Jones 
napsal:

> Hi Laurenz
>
> On 05.03.24 13:55, Laurenz Albe wrote:
> > Inspired by feedback to [1], I thought about how to reduce log spam.
> >
> > My experience from the field is that a lot of log spam looks like
> >
> >database/table/... "xy" does not exist
> >duplicate key value violates unique constraint "xy"
> >
> > So what about a parameter "log_suppress_sqlstates" that suppresses
> > logging ERROR and FATAL messages with the enumerated SQL states?
> >
> > My idea for a default setting would be something like
> >
> >log_suppress_sqlstates = '23505,3D000,3F000,42601,42704,42883,42P01'
> >
> > but that's of course bikeshedding territory.
> >
> > Yours,
> > Laurenz Albe
> >
> >
> >
> >   [1]:
> https://postgr.es/m/b8b8502915e50f44deb111bc0b43a99e2733e117.camel%40cybertec.at
>
> I like this idea, and I could see myself using it a lot in some projects.
>
> Additionally, it would be nice to also have the possibility to suppress a
> whole class instead of single SQL states, e.g.
>
> log_suppress_sqlstates = 'class_08' to suppress these all at once:
>
> 08000   connection_exception
> 08003   connection_does_not_exist
> 08006   connection_failure
> 08001   sqlclient_unable_to_establish_sqlconnection
> 08004   sqlserver_rejected_establishment_of_sqlconnection
> 08007   transaction_resolution_unknown
> 08P01   protocol_violation
>
>
It can take code from PLpgSQL.

Regards

Pavel



> Best regards,
> Jim
>
>
>
>


Re: Reducing the log spam

2024-03-05 Thread Jim Jones

Hi Laurenz

On 05.03.24 13:55, Laurenz Albe wrote:

Inspired by feedback to [1], I thought about how to reduce log spam.

My experience from the field is that a lot of log spam looks like

   database/table/... "xy" does not exist
   duplicate key value violates unique constraint "xy"

So what about a parameter "log_suppress_sqlstates" that suppresses
logging ERROR and FATAL messages with the enumerated SQL states?

My idea for a default setting would be something like

   log_suppress_sqlstates = '23505,3D000,3F000,42601,42704,42883,42P01'

but that's of course bikeshedding territory.

Yours,
Laurenz Albe



  [1]: 
https://postgr.es/m/b8b8502915e50f44deb111bc0b43a99e2733e117.camel%40cybertec.at


I like this idea, and I could see myself using it a lot in some projects.

Additionally, it would be nice to also have the possibility to suppress a 
whole class instead of single SQL states, e.g.


log_suppress_sqlstates = 'class_08' to suppress these all at once:

08000   connection_exception
08003   connection_does_not_exist
08006   connection_failure
08001   sqlclient_unable_to_establish_sqlconnection
08004   sqlserver_rejected_establishment_of_sqlconnection
08007   transaction_resolution_unknown
08P01   protocol_violation
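
For what it's worth, the filtering step itself could stay very small. A sketch
of the check (all function and parameter names here are invented; it assumes
the suppressed SQLSTATEs and classes have already been parsed into integer
lists, and uses the existing ERRCODE_TO_CATEGORY() macro from elog.h):

static bool
error_is_suppressed(int sqlerrcode, List *suppressed_states,
                    List *suppressed_classes)
{
    ListCell   *lc;

    foreach(lc, suppressed_states)
    {
        if (lfirst_int(lc) == sqlerrcode)
            return true;
    }

    foreach(lc, suppressed_classes)
    {
        if (lfirst_int(lc) == ERRCODE_TO_CATEGORY(sqlerrcode))
            return true;
    }

    return false;
}

The caller (wherever we decide to hook this in, e.g. before writing ERROR or
FATAL messages to the server log) would simply skip the log output when this
returns true.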

Best regards,
Jim





Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-03-05 Thread Dean Rasheed
[cc'ing Alvaro]

On Tue, 5 Mar 2024 at 10:04, zwj  wrote:
>
>  If I only execute the merge, I will get the following error:
> merge into tgt a using src1 c on  a.a = c.a when matched then update set 
> b = c.b when not matched then insert (a,b) values(c.a,c.b);  -- execution fails
> ERROR:  MERGE command cannot affect row a second time
> HINT:  Ensure that not more than one source row matches any one target 
> row.
>
>  But when I execute the update and merge concurrently, I will get the 
> following result set.
>

Yes, this should still produce that error, even after a concurrent update.

In the first example without a concurrent update, when we reach the
tgt.a = 3 row the second time, ExecUpdateAct() returns TM_SelfModified
and we do this:

case TM_SelfModified:

/*
 * The SQL standard disallows this for MERGE.
 */
if (TransactionIdIsCurrentTransactionId(context->tmfd.xmax))
ereport(ERROR,
(errcode(ERRCODE_CARDINALITY_VIOLATION),
/* translator: %s is a SQL command name */
 errmsg("%s command cannot affect row a second time",
"MERGE"),
 errhint("Ensure that not more than one source row
matches any one target row.")));
/* This shouldn't happen */
elog(ERROR, "attempted to update or delete invisible tuple");
break;

whereas in the second case, after a concurrent update, ExecUpdateAct()
returns TM_Updated, we attempt to lock the tuple prior to running EPQ,
and table_tuple_lock() returns TM_SelfModified, which does this:

case TM_SelfModified:

/*
 * This can be reached when following an update
 * chain from a tuple updated by another session,
 * reaching a tuple that was already updated in
 * this transaction. If previously modified by
 * this command, ignore the redundant update,
 * otherwise error out.
 *
 * See also response to TM_SelfModified in
 * ExecUpdate().
 */
if (context->tmfd.cmax != estate->es_output_cid)
ereport(ERROR,
(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
 errmsg("tuple to be updated or deleted
was already modified by an operation triggered by the current
command"),
 errhint("Consider using an AFTER trigger
instead of a BEFORE trigger to propagate changes to other rows.")));
return false;

My initial reaction is that neither of those blocks of code is
entirely correct, and that they should both be doing both of those
checks. I.e., something like the attached (which probably needs some
additional test cases).

Regards,
Dean
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
new file mode 100644
index fcb6133..9351fbc
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -3001,8 +3001,29 @@ lmerge_matched:
 			case TM_SelfModified:
 
 /*
- * The SQL standard disallows this for MERGE.
+ * The target tuple was already updated or deleted by the
+ * current command, or by a later command in the current
+ * transaction.  The former case is explicitly disallowed by
+ * the SQL standard for MERGE, which insists that the MERGE
+ * join condition should not join a target row to more than
+ * one source row.
+ *
+ * The latter case arises if the tuple is modified by a
+ * command in a BEFORE trigger, or perhaps by a command in a
+ * volatile function used in the query.  In such situations we
+ * should not ignore the MERGE action, but it is equally
+ * unsafe to proceed.  We don't want to discard the original
+ * MERGE action while keeping the triggered actions based on
+ * it; and it would be no better to allow the original MERGE
+ * action while discarding the updates that it triggered.  So
+ * throwing an error is the only safe course.
  */
+if (context->tmfd.cmax != estate->es_output_cid)
+	ereport(ERROR,
+			(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
+			 errmsg("tuple to be updated or deleted was already modified by an operation triggered by the current command"),
+			 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
+
 if (TransactionIdIsCurrentTransactionId(context->tmfd.xmax))
 	ereport(ERROR,
 			(errcode(ERRCODE_CARDINALITY_VIOLATION),
@@ -3010,6 +3031,7 @@ lmerge_matched:
 			 errmsg("%s command cannot affect row a second time",
 	"MERGE"),
 			 errhint("Ensure that not more than one source row matches any one target row.")));
+
 /* This shouldn't happen */
 elog(ERROR, "attempted to update or delete invisible 

Re: Improve readability by using designated initializers when possible

2024-03-05 Thread Japin Li

On Mon, 04 Mar 2024 at 07:46, Michael Paquier  wrote:
> On Fri, Mar 01, 2024 at 06:30:10AM +0100, Jelte Fennema-Nio wrote:
>> On Fri, 1 Mar 2024 at 06:08, Michael Paquier  wrote:
>>> Mostly OK to me.  Just note that this comment is incorrect because
>>> pg_enc2gettext_tbl[] includes elements in the range
>>> [PG_SJIS,_PG_LAST_ENCODING_[  ;)
>>
>> fixed
>
> (Forgot to update this thread.)
> Thanks, applied this one.  I went over a few versions of the comment
> in pg_wchar.h, and tweaked it to something that was of one of the
> previous versions, I think.

Hi,

Attach a patch to rewrite dispatch_table array using C99-designated
initializer syntax.

>From f398d6d310e9436c5e415baa6fd273981a9e181f Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Tue, 5 Mar 2024 21:32:46 +0800
Subject: [PATCH v1 1/1] Use C99-designated initializer syntax for
 dispatch_table array

---
 src/backend/executor/execExprInterp.c | 195 +-
 src/include/executor/execExpr.h   |   3 -
 2 files changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 7c1f51e2e0..e4ad522312 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -400,107 +400,104 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	TupleTableSlot *outerslot;
 	TupleTableSlot *scanslot;
 
-	/*
-	 * This array has to be in the same order as enum ExprEvalOp.
-	 */
 #if defined(EEO_USE_COMPUTED_GOTO)
 	static const void *const dispatch_table[] = {
-		&_EEOP_DONE,
-		&_EEOP_INNER_FETCHSOME,
-		&_EEOP_OUTER_FETCHSOME,
-		&_EEOP_SCAN_FETCHSOME,
-		&_EEOP_INNER_VAR,
-		&_EEOP_OUTER_VAR,
-		&_EEOP_SCAN_VAR,
-		&_EEOP_INNER_SYSVAR,
-		&_EEOP_OUTER_SYSVAR,
-		&_EEOP_SCAN_SYSVAR,
-		&_EEOP_WHOLEROW,
-		&_EEOP_ASSIGN_INNER_VAR,
-		&_EEOP_ASSIGN_OUTER_VAR,
-		&_EEOP_ASSIGN_SCAN_VAR,
-		&_EEOP_ASSIGN_TMP,
-		&_EEOP_ASSIGN_TMP_MAKE_RO,
-		&_EEOP_CONST,
-		&_EEOP_FUNCEXPR,
-		&_EEOP_FUNCEXPR_STRICT,
-		&_EEOP_FUNCEXPR_FUSAGE,
-		&_EEOP_FUNCEXPR_STRICT_FUSAGE,
-		&_EEOP_BOOL_AND_STEP_FIRST,
-		&_EEOP_BOOL_AND_STEP,
-		&_EEOP_BOOL_AND_STEP_LAST,
-		&_EEOP_BOOL_OR_STEP_FIRST,
-		&_EEOP_BOOL_OR_STEP,
-		&_EEOP_BOOL_OR_STEP_LAST,
-		&_EEOP_BOOL_NOT_STEP,
-		&_EEOP_QUAL,
-		&_EEOP_JUMP,
-		&_EEOP_JUMP_IF_NULL,
-		&_EEOP_JUMP_IF_NOT_NULL,
-		&_EEOP_JUMP_IF_NOT_TRUE,
-		&_EEOP_NULLTEST_ISNULL,
-		&_EEOP_NULLTEST_ISNOTNULL,
-		&_EEOP_NULLTEST_ROWISNULL,
-		&_EEOP_NULLTEST_ROWISNOTNULL,
-		&_EEOP_BOOLTEST_IS_TRUE,
-		&_EEOP_BOOLTEST_IS_NOT_TRUE,
-		&_EEOP_BOOLTEST_IS_FALSE,
-		&_EEOP_BOOLTEST_IS_NOT_FALSE,
-		&_EEOP_PARAM_EXEC,
-		&_EEOP_PARAM_EXTERN,
-		&_EEOP_PARAM_CALLBACK,
-		&_EEOP_CASE_TESTVAL,
-		&_EEOP_MAKE_READONLY,
-		&_EEOP_IOCOERCE,
-		&_EEOP_IOCOERCE_SAFE,
-		&_EEOP_DISTINCT,
-		&_EEOP_NOT_DISTINCT,
-		&_EEOP_NULLIF,
-		&_EEOP_SQLVALUEFUNCTION,
-		&_EEOP_CURRENTOFEXPR,
-		&_EEOP_NEXTVALUEEXPR,
-		&_EEOP_ARRAYEXPR,
-		&_EEOP_ARRAYCOERCE,
-		&_EEOP_ROW,
-		&_EEOP_ROWCOMPARE_STEP,
-		&_EEOP_ROWCOMPARE_FINAL,
-		&_EEOP_MINMAX,
-		&_EEOP_FIELDSELECT,
-		&_EEOP_FIELDSTORE_DEFORM,
-		&_EEOP_FIELDSTORE_FORM,
-		&_EEOP_SBSREF_SUBSCRIPTS,
-		&_EEOP_SBSREF_OLD,
-		&_EEOP_SBSREF_ASSIGN,
-		&_EEOP_SBSREF_FETCH,
-		&_EEOP_DOMAIN_TESTVAL,
-		&_EEOP_DOMAIN_NOTNULL,
-		&_EEOP_DOMAIN_CHECK,
-		&_EEOP_CONVERT_ROWTYPE,
-		&_EEOP_SCALARARRAYOP,
-		&_EEOP_HASHED_SCALARARRAYOP,
-		&_EEOP_XMLEXPR,
-		&_EEOP_JSON_CONSTRUCTOR,
-		&_EEOP_IS_JSON,
-		&_EEOP_AGGREF,
-		&_EEOP_GROUPING_FUNC,
-		&_EEOP_WINDOW_FUNC,
-		&_EEOP_SUBPLAN,
-		&_EEOP_AGG_STRICT_DESERIALIZE,
-		&_EEOP_AGG_DESERIALIZE,
-		&_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
-		&_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
-		&_EEOP_AGG_PLAIN_PERGROUP_NULLCHECK,
-		&_EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_BYVAL,
-		&_EEOP_AGG_PLAIN_TRANS_INIT_STRICT_BYREF,
-		&_EEOP_AGG_PLAIN_TRANS_STRICT_BYREF,
-		&_EEOP_AGG_PLAIN_TRANS_BYREF,
-		&_EEOP_AGG_PRESORTED_DISTINCT_SINGLE,
-		&_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
-		&_EEOP_AGG_ORDERED_TRANS_DATUM,
-		&_EEOP_AGG_ORDERED_TRANS_TUPLE,
-		&_EEOP_LAST
+		[EEOP_DONE] = &_EEOP_DONE,
+		[EEOP_INNER_FETCHSOME] = &_EEOP_INNER_FETCHSOME,
+		[EEOP_OUTER_FETCHSOME] = &_EEOP_OUTER_FETCHSOME,
+		[EEOP_SCAN_FETCHSOME] = &_EEOP_SCAN_FETCHSOME,
+		[EEOP_INNER_VAR] = &_EEOP_INNER_VAR,
+		[EEOP_OUTER_VAR] = &_EEOP_OUTER_VAR,
+		[EEOP_SCAN_VAR] = &_EEOP_SCAN_VAR,
+		[EEOP_INNER_SYSVAR] = &_EEOP_INNER_SYSVAR,
+		[EEOP_OUTER_SYSVAR] = &_EEOP_OUTER_SYSVAR,
+		[EEOP_SCAN_SYSVAR] = &_EEOP_SCAN_SYSVAR,
+		[EEOP_WHOLEROW] = &_EEOP_WHOLEROW,
+		[EEOP_ASSIGN_INNER_VAR] = &_EEOP_ASSIGN_INNER_VAR,
+		[EEOP_ASSIGN_OUTER_VAR] = &_EEOP_ASSIGN_OUTER_VAR,
+		[EEOP_ASSIGN_SCAN_VAR] = &_EEOP_ASSIGN_SCAN_VAR,
+		[EEOP_ASSIGN_TMP] = &_EEOP_ASSIGN_TMP,
+		[EEOP_ASSIGN_TMP_MAKE_RO] = &_EEOP_ASSIGN_TMP_MAKE_RO,
+		[EEOP_CONST] = &_EEOP_CONST,
+		[EEOP_FUNCEXPR] = &_EEOP_FUNCEXPR,
+		[EEOP_FUNCEXPR_STRICT] = &_EEOP_FUNCEXPR_STRICT,

Re: pg_upgrade --copy-file-range

2024-03-05 Thread Peter Eisentraut

On 05.01.24 13:40, Jakub Wartak wrote:

Random patch notes:
- main meat is in v3-0002*, I hope I did not screw something up seriously
- in worst case: it is opt-in through switch, so the user always can
stick to the classic copy
- no docs so far
- pg_copyfile_offload_supported() should actually be fixed if it is a
good path forward
- pgindent actually indents larger areas of code than I would like it to,
any ideas or is it ok?
- not tested on Win32/MacOS/FreeBSD
- I've tested pg_upgrade manually and it seems to work and issue
correct syscalls, however some tests are failing(?). I haven't
investigated why yet due to lack of time.


Something is wrong with the pgindent in your patch set.  Maybe you used 
a wrong version.  You should try to fix that, because it is hard to 
process your patch with that amount of unrelated reformatting.


As far as I can tell, the original pg_upgrade patch has been ready to 
commit since October.  Unless Thomas has any qualms that have not been 
made explicit in this thread, I suggest we move ahead with that.


And then Jakub could rebase his patch set on top of that.  It looks like 
if the formatting issues are fixed, the remaining pg_combinebackup 
support isn't that big.






Re: Preserve subscription OIDs during pg_upgrade

2024-03-05 Thread Aleksander Alekseev
Hi,

> It will be better to preserve them as it will be easier to
> compare subscription related objects in pg_subscription and
> pg_subscription_rel in the old and new clusters.

IMO it would be helpful if you could give a little bit more context on
why/when this is useful. Personally I find it somewhat difficult to
imagine a case when I really need to compare Oids of subscriptions
between old and new clusters.

If we commit to such a guarantee it will lay a certain burden on the
community in the long run and the benefits are not quite clear, to me
at least. If we are talking about giving such a guarantee only once
the value of this is arguably low.

-- 
Best regards,
Aleksander Alekseev




Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-05 Thread Bertrand Drouvot
Hi,

On Tue, Mar 05, 2024 at 02:19:19PM +0530, shveta malik wrote:
> On Tue, Mar 5, 2024 at 1:25 PM Heikki Linnakangas  wrote:
> 
> > SearchNamedReplicationSlot() will also acquire the lock in LW_SHARED
> > mode, when you pass need_lock=true. So that at least should be changed
> > to false.
> >
> 
> Also don't we need to release the lock when we return here:
> 
> /*
> * Nothing to do for physical slots as we collect stats only for logical
> * slots.
> */
> if (SlotIsPhysical(slot))
> return;

D'oh! Thanks! Fixed in v2 shared up-thread.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-05 Thread Bertrand Drouvot
Hi,

On Tue, Mar 05, 2024 at 09:55:32AM +0200, Heikki Linnakangas wrote:
> On 01/03/2024 12:15, Bertrand Drouvot wrote:
> > Hi hackers,
> > 
> > I think that pgstat_reset_replslot() is missing LWLock protection. Indeed, 
> > we
> > don't have any guarantee that the slot is active (then preventing it to be
> > dropped/recreated) when this function is executed.
> 
> Yes, so it seems at quick glance.

Thanks for looking at it!

> We have a similar issue in
> pgstat_fetch_replslot(); it might return stats for wrong slot, if the slot
> is dropped/recreated concurrently.

Good catch! 

> Do we care?

Yeah, I think we should: done in v2 attached.

> > --- a/src/backend/utils/activity/pgstat_replslot.c
> > +++ b/src/backend/utils/activity/pgstat_replslot.c
> > @@ -46,6 +46,8 @@ pgstat_reset_replslot(const char *name)
> > Assert(name != NULL);
> > +   LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> > +
> > /* Check if the slot exits with the given name. */
> > slot = SearchNamedReplicationSlot(name, true);
> 
> SearchNamedReplicationSlot() will also acquire the lock in LW_SHARED mode,
> when you pass need_lock=true. So that at least should be changed to false.
>

Right, done in v2. Also had to add an extra "need_lock" argument to
get_replslot_index() for the same reason while taking care of 
pgstat_fetch_replslot().

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From fabee924c3131692addfe8941e4bdb3dc5540aae Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 1 Mar 2024 10:04:17 +
Subject: [PATCH v2] Adding LWLock protection in pgstat_reset_replslot() and
 pgstat_fetch_replslot()

pgstat_reset_replslot() and pgstat_fetch_replslot() are missing a LWLock
protection as we don't have any guarantee that the slot is active (then
preventing it to be dropped/recreated) when those functions are executed.
---
 src/backend/utils/activity/pgstat_replslot.c | 35 +++-
 1 file changed, 27 insertions(+), 8 deletions(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_replslot.c b/src/backend/utils/activity/pgstat_replslot.c
index 70cabf2881..7c409af670 100644
--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -30,7 +30,7 @@
 #include "utils/pgstat_internal.h"
 
 
-static int	get_replslot_index(const char *name);
+static int	get_replslot_index(const char *name, bool need_lock);
 
 
 /*
@@ -46,8 +46,10 @@ pgstat_reset_replslot(const char *name)
 
 	Assert(name != NULL);
 
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
 	/* Check if the slot exits with the given name. */
-	slot = SearchNamedReplicationSlot(name, true);
+	slot = SearchNamedReplicationSlot(name, false);
 
 	if (!slot)
 		ereport(ERROR,
@@ -60,11 +62,16 @@ pgstat_reset_replslot(const char *name)
 	 * slots.
 	 */
 	if (SlotIsPhysical(slot))
+	{
+		LWLockRelease(ReplicationSlotControlLock);
 		return;
+	}
 
 	/* reset this one entry */
 	pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
  ReplicationSlotIndex(slot));
+
+	LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
@@ -164,13 +171,25 @@ pgstat_drop_replslot(ReplicationSlot *slot)
 PgStat_StatReplSlotEntry *
 pgstat_fetch_replslot(NameData slotname)
 {
-	int			idx = get_replslot_index(NameStr(slotname));
+	int			idx;
+	PgStat_StatReplSlotEntry *slotentry;
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	idx = get_replslot_index(NameStr(slotname), false);
 
 	if (idx == -1)
+	{
+		LWLockRelease(ReplicationSlotControlLock);
 		return NULL;
+	}
+
+	slotentry = (PgStat_StatReplSlotEntry *) pgstat_fetch_entry(PGSTAT_KIND_REPLSLOT,
+InvalidOid, idx);
+
+	LWLockRelease(ReplicationSlotControlLock);
 
-	return (PgStat_StatReplSlotEntry *)
-		pgstat_fetch_entry(PGSTAT_KIND_REPLSLOT, InvalidOid, idx);
+	return slotentry;
 }
 
 void
@@ -189,7 +208,7 @@ pgstat_replslot_to_serialized_name_cb(const PgStat_HashKey *key, const PgStatSha
 bool
 pgstat_replslot_from_serialized_name_cb(const NameData *name, PgStat_HashKey *key)
 {
-	int			idx = get_replslot_index(NameStr(*name));
+	int			idx = get_replslot_index(NameStr(*name), true);
 
 	/* slot might have been deleted */
 	if (idx == -1)
@@ -209,13 +228,13 @@ pgstat_replslot_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts)
 }
 
 static int
-get_replslot_index(const char *name)
+get_replslot_index(const char *name, bool need_lock)
 {
 	ReplicationSlot *slot;
 
 	Assert(name != NULL);
 
-	slot = SearchNamedReplicationSlot(name, true);
+	slot = SearchNamedReplicationSlot(name, need_lock);
 
 	if (!slot)
 		return -1;
-- 
2.34.1



Re: [PATCH] Exponential backoff for auth_delay

2024-03-05 Thread Robert Haas
On Mon, Mar 4, 2024 at 4:58 PM Michael Banck  wrote:
> Alright, I have changed it so that auth_delay.milliseconds and
> auth_delay.max_milliseconds are the only GUCs, their default being 0. If
> the latter is 0, the former's value is always taken. If the latter is
> non-zero and larger than the former, exponential backoff is applied with
> the latter's value as maximum delay.
>
> If the latter is smaller than the former then auth_delay just sets the
> delay to the latter, I don't think this is problem or confusing, or
> should this be considered a misconfiguration?

Seems fine to me. We may need to think about what the documentation
should say about this, if anything.
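
Spelled out, the behaviour being agreed on is roughly this (a hypothetical
helper with made-up names, not the actual auth_delay code; it assumes a
counter of consecutive failed attempts for the connecting host):

static int
auth_delay_ms(int consecutive_failures, int milliseconds, int max_milliseconds)
{
	int			delay = milliseconds;

	if (max_milliseconds <= 0)
		return delay;			/* backoff disabled: fixed delay */

	/* double the delay for each prior failure, stopping at the maximum */
	for (int i = 1; i < consecutive_failures; i++)
	{
		delay *= 2;
		if (delay >= max_milliseconds)
			break;
	}

	return (delay > max_milliseconds) ? max_milliseconds : delay;
}

In particular, when max_milliseconds is smaller than milliseconds, the cap
simply wins from the first failure on, which matches the behaviour described
above.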

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: a wrong index choose when statistics is out of date

2024-03-05 Thread Andy Fan

Hi,

>
>> We should do anything like add column options in the meantime. Those
>> are hard to remove once added.
>
> I will try it very soon.

Attached is a PoC version, and here is the test case.

create table t(a int, b int, c int) with (autovacuum_enabled=off);
create index on t(a, b);
create index on t(a, c);

insert into t select floor(random() * 100 + 1)::int, i, i
from generate_series(1, 10) i;

analyze t;

insert into t
select floor(random() * 10 + 1)::int + 100 , i, i
from generate_series(1, 1000) i;

-- one of the below queries would choose a wrong index.
-- here is the result from my test.
explain (costs off) select * from t where a = 109 and c = 8;
  QUERY PLAN   
---
 Index Scan using t_a_c_idx on t
   Index Cond: ((a = 109) AND (c = 8))
(2 rows)

explain (costs off) select * from t where a = 109 and b = 8;
   QUERY PLAN
-
 Index Scan using t_a_c_idx on t
   Index Cond: (a = 109)
   Filter: (b = 8)
(3 rows)

Wrong index is chosen for the second case!

-- After applying the new API.

alter table t alter column a set (force_generic=on);

explain (costs off) select * from t where a = 109 and c = 8;
  QUERY PLAN   
---
 Index Scan using t_a_c_idx on t
   Index Cond: ((a = 109) AND (c = 8))
(2 rows)

explain (costs off) select * from t where a = 109 and b = 8;
  QUERY PLAN   
---
 Index Scan using t_a_b_idx on t
   Index Cond: ((a = 109) AND (b = 8))
(2 rows)

Then both cases can choose a correct index.

commit f8cca472479c50ba73479ec387882db43d203522 (HEAD -> shared_detoast_value)
Author: yizhi.fzh 
Date:   Tue Mar 5 18:27:48 2024 +0800

Add a "force_generic" attoptions for selfunc.c

Sometimes users only care about recent data, and the optimizer
statistics for such data have not been gathered yet, so bad planning
decisions can happen. Before this patch, we had to run autoanalyze very
frequently, which is not only expensive but may also be too late.

This patch introduces a new attoption like this:

ALTER TABLE t ALTER COLUMN col set (force_generic=true);

Then selfuncs.c recognizes this and ignores the specific Const value, so
the average selectivity is chosen. This falls back to the weakness of a
generic plan, but the patch doesn't introduce any new weakness, and we
leave the decision to the user, which can resolve some problems. Also,
this logic only applies to eqsel, since ineqsel has the
get_actual_variable_range mechanism, which helps the index choice case
at least.
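
In selectivity terms, the idea boils down to something like this (schematic
calculation only, with invented names; the real selfuncs.c hunk is in the
attached patch):

static double
eq_selectivity(double ndistinct, bool force_generic, double mcv_based_sel)
{
	/*
	 * With the hypothetical force_generic option set, ignore the specific
	 * constant and fall back to the flat 1/ndistinct estimate that a
	 * generic plan would use.
	 */
	if (force_generic)
		return 1.0 / ndistinct;

	/* otherwise keep the MCV/histogram-based estimate for this constant */
	return mcv_based_sel;
}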

I think it is OK for a design review; on the implementation side, the
known issues include:

1. Support grabbing such information from its parent for a partitioned
table if the child doesn't have such information.
2. Built-in documentation and testing.

Any feedback is welcome.

-- 
Best Regards
Andy Fan

>From f8cca472479c50ba73479ec387882db43d203522 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Tue, 5 Mar 2024 18:27:48 +0800
Subject: [PATCH v0 1/1] Add a "force_generic" attoptions for selfunc.c

Sometimes users only care about recent data, and the optimizer
statistics for such data have not been gathered yet, so bad planning
decisions can happen. Before this patch, we had to run autoanalyze very
frequently, which is not only expensive but may also be too late.

This patch introduces a new attoption like this:

ALTER TABLE t ALTER COLUMN col set (force_generic=true);

Then selfuncs.c recognizes this and ignores the specific Const value, so
the average selectivity is chosen. This falls back to the weakness of a
generic plan, but the patch doesn't introduce any new weakness, and we
leave the decision to the user, which can resolve some problems. Also,
this logic only applies to eqsel, since ineqsel has the
get_actual_variable_range mechanism, which helps the index choice case
at least.
---
 src/backend/access/common/reloptions.c | 12 -
 src/backend/utils/adt/selfuncs.c   | 35 ++
 src/bin/psql/tab-complete.c|  2 +-
 src/include/utils/attoptcache.h|  1 +
 4 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 0921a736ab..c8444ea577 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -131,6 +131,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"force_generic",
+			"estimate the selectivity like generic plan",
+			RELOPT_KIND_ATTRIBUTE,
+			ShareUpdateExclusiveLock
+		},
+		false
+	},
 	{
 		{
 			"security_barrier",
@@ -2072,7 +2081,8 @@ attribute_reloptions(Datum reloptions, bool validate)
 {
 	static const relopt_parse_elt tab[] = {
 		{"n_distinct", RELOPT_TYPE_REAL, offsetof(AttributeOpts, n_distinct)},
-		{"n_distinct_inherited", RELOPT_TYPE_REAL, 

Re: Reducing the log spam

2024-03-05 Thread Pavel Stehule
Hi

On Tue, 5 Mar 2024 at 13:55, Laurenz Albe  wrote:

> Inspired by feedback to [1], I thought about how to reduce log spam.
>
> My experience from the field is that a lot of log spam looks like
>
>   database/table/... "xy" does not exist
>   duplicate key value violates unique constraint "xy"
>
> So what about a parameter "log_suppress_sqlstates" that suppresses
> logging ERROR and FATAL messages with the enumerated SQL states?
>
> My idea for a default setting would be something like
>
>   log_suppress_sqlstates = '23505,3D000,3F000,42601,42704,42883,42P01'
>

+1 in this form

the overhead of this implementation should be small

Regards

Pavel


> but that's of course bikeshedding territory.
>
> Yours,
> Laurenz Albe
>
>
>
>  [1]:
> https://postgr.es/m/b8b8502915e50f44deb111bc0b43a99e2733e117.camel%40cybertec.at
>
>
>


Remove unnecessary code from psql's watch command

2024-03-05 Thread Yugo NAGATA
Hi,

In the current code of do_watch(), sigsetjmp is called if WIN32
is defined, but siglongjmp is not called in the signal handler
in this condition. On Windows, currently, cancellation is checked
only by cancel_pressed, and  calling sigsetjmp in do_watch() is
unnecessary. Therefore, we can remove code around sigsetjmp in
do_watch(). I've attached the patch for this fix.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 5c906e4806..c03e47744e 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5312,20 +5312,10 @@ do_watch(PQExpBuffer query_buf, double sleep, int iter, int min_rows)
 			continue;
 
 #ifdef WIN32
-
-		/*
-		 * Set up cancellation of 'watch' via SIGINT.  We redo this each time
-		 * through the loop since it's conceivable something inside
-		 * PSQLexecWatch could change sigint_interrupt_jmp.
-		 */
-		if (sigsetjmp(sigint_interrupt_jmp, 1) != 0)
-			break;
-
 		/*
-		 * Enable 'watch' cancellations and wait a while before running the
-		 * query again.  Break the sleep into short intervals (at most 1s).
+		 * Wait a while before running the query again.  Break the sleep into
+		 * short intervals (at most 1s).
 		 */
-		sigint_interrupt_enabled = true;
 		for (long i = sleep_ms; i > 0;)
 		{
 			long		s = Min(i, 1000L);
@@ -5335,7 +5325,6 @@ do_watch(PQExpBuffer query_buf, double sleep, int iter, int min_rows)
 break;
 			i -= s;
 		}
-		sigint_interrupt_enabled = false;
 #else
 		/* sigwait() will handle SIGINT. */
 		sigprocmask(SIG_BLOCK, , NULL);


Re: index prefetching

2024-03-05 Thread Jakub Wartak
On Fri, Mar 1, 2024 at 3:58 PM Tomas Vondra
 wrote:
[..]
> TBH I don't have a clear idea what to do. It'd be cool to have at least
> some benefits in v17, but I don't know how to do that in a way that
> would be useful in the future.
>
> For example, the v20240124 patch implements this in the executor, but
> based on the recent discussions it seems that's not the right layer -
> the index AM needs to have some control, and I'm not convinced it's
> possible to improve it in that direction (even ignoring the various
> issues we identified in the executor-based approach).
>
> I think it might be more practical to do this from the index AM, even if
> it has various limitations. Ironically, that's what I proposed at pgcon,
> but mostly because it was the quick way to do this.

... that's a pity! :( Well, then let's just finish that subthread, I
gave some explanations, but I'll try to take a look in future
revisions.

> > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
> > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> > (allowing it to follow dynamically; the more prefetches the user wants
> > to perform, the more you spread them across shared LRUs and the more
> > memory for history is required?)
> >
> > + * XXX Maybe we could consider effective_cache_size when sizing the 
> > cache?
> > + * Not to size the cache for that, ofc, but maybe as a guidance of how 
> > many
> > + * heap pages it might keep. Maybe just a fraction fraction of the 
> > value,
> > + * say Max(8MB, effective_cache_size / max_connections) or something.
> > + */
> > +#definePREFETCH_LRU_SIZE8/* slots in one LRU */
> > +#definePREFETCH_LRU_COUNT128 /* number of LRUs */
> > +#definePREFETCH_CACHE_SIZE(PREFETCH_LRU_SIZE *
> > PREFETCH_LRU_COUNT)
> >
>
> I don't see why would this be related to effective_io_concurrency? It's
> merely about how many recently accessed pages we expect to find in the
> page cache. It's entirely separate from the prefetch distance.

Well, my thought was that the higher eic is - the more I/O parallelism
we are introducing - the more requests we need to remember from the past
to avoid prefetching the same pages again (N * eic, where N would be
some multiplier).

> > 7. in IndexPrefetchComputeTarget()
> >
> > + * XXX We cap the target to plan_rows, becausse it's pointless to 
> > prefetch
> > + * more than we expect to use.
> >
> > That's a nice fact that's already in patch, so XXX isn't needed?
> >
>
> Right, which is why it's not a TODO/FIXME.

OH! That explains it to me. I had taken all of the XXXs as literal
FIXMEs that you wanted to go away (things to be removed before the
patch is considered mature).

> But I think it's good to
> point this out - I'm not 100% convinced we should be using plan_rows
> like this (because what happens if the estimate happens to be wrong?).

Well, a somewhat similar problematic pattern was present in a different
codepath - get_actual_variable_endpoint() - see [1], 9c6ad5eaa95.  So
the final fix was to get away without adding a new GUC (which is always
an option...), and just introduce a sensible hard limit (fence) and
stick to the 100-visited-heap-pages limit. Here we could have a similar
heuristic from the start: if (plan_rows <
we_have_already_visited_pages * avgRowsPerBlock) --> ignore plan_rows
and ramp up prefetches back to the full eic value.
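
Written out as a purely illustrative helper (the names here are invented,
not taken from the patch):

static int
prefetch_target(double plan_rows, double pages_visited,
				double avg_rows_per_block,
				int eic_target, int plan_rows_target)
{
	/*
	 * The planner's estimate has already been proven too low: stop
	 * trusting it and go back to the full effective_io_concurrency-based
	 * target.
	 */
	if (plan_rows < pages_visited * avg_rows_per_block)
		return eic_target;

	/* Otherwise keep the target capped by the planner's row estimate. */
	return plan_rows_target;
}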

> > Some further tests, given data:
> >
> > CREATE TABLE test (id bigint, val bigint, str text);
> > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
> > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int),
> > 3000) FROM generate_series(1, 1) g;
> > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 +
> > (10*random())::int), 3000) from (select 1 * random() as r from
> > generate_series(1, 1)) x;
> > VACUUM ANALYZE test;
> > CREATE INDEX on test (id) ;
> >
>
> It's not clear to me what's the purpose of this test? Can you explain?

It's just schema preparation for the tests below:

> >
> > 2. Prefetching for TOASTed heap seems to be not implemented at all,
> > correct? (Is my assumption that we should go like this:
> > t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually
> > see the code path where it could be added - certainly it's not blocker
> > -- but maybe in commit message a list of improvements for future could
> > be listed?):
> >
>
> Yes, that's true. I haven't thought about TOAST very much, but with
> prefetching happening in executor, that does not work. There'd need to
> be some extra code for TOAST prefetching. I'm not sure how beneficial
> that would be, considering most TOAST values tend to be stored on
> consecutive heap pages.

Assuming that in the above I've generated data using cyclic / random
version and I run:

SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

(btw: I wanted to use octet_length() at first instead of 

Reducing the log spam

2024-03-05 Thread Laurenz Albe
Inspired by feedback to [1], I thought about how to reduce log spam.

My experience from the field is that a lot of log spam looks like

  database/table/... "xy" does not exist
  duplicate key value violates unique constraint "xy"

So what about a parameter "log_suppress_sqlstates" that suppresses
logging ERROR and FATAL messages with the enumerated SQL states?

My idea for a default setting would be something like

  log_suppress_sqlstates = '23505,3D000,3F000,42601,42704,42883,42P01'

but that's of course bikeshedding territory.
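
One way to prototype this without touching elog.c at all would be an
emit_log_hook filter in a small extension. A rough sketch follows; the GUC
name, function names, and the naive substring matching are all invented for
illustration (a real implementation would parse the SQLSTATE list once at
assign time):

#include "postgres.h"

#include <string.h>

#include "fmgr.h"
#include "utils/elog.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static emit_log_hook_type prev_emit_log_hook = NULL;
static char *suppress_sqlstates = NULL;		/* e.g. "23505,42P01" */

static void
suppress_sqlstates_hook(ErrorData *edata)
{
	/* keep configured ERROR/FATAL SQLSTATEs out of the server log only */
	if (edata->elevel >= ERROR &&
		suppress_sqlstates != NULL &&
		strstr(suppress_sqlstates,
			   unpack_sql_state(edata->sqlerrcode)) != NULL)
		edata->output_to_server = false;

	if (prev_emit_log_hook)
		prev_emit_log_hook(edata);
}

void
_PG_init(void)
{
	DefineCustomStringVariable("log_suppress.sqlstates",
							   "SQLSTATEs whose ERROR/FATAL messages are not logged",
							   NULL,
							   &suppress_sqlstates,
							   "",
							   PGC_SIGHUP,
							   0,
							   NULL, NULL, NULL);

	prev_emit_log_hook = emit_log_hook;
	emit_log_hook = suppress_sqlstates_hook;
}

A built-in GUC could do the equivalent check directly in errfinish(),
before any output is assembled.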

Yours,
Laurenz Albe



 [1]: 
https://postgr.es/m/b8b8502915e50f44deb111bc0b43a99e2733e117.camel%40cybertec.at




Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2024-03-05 Thread Alvaro Herrera
On 2024-Mar-04, Dean Rasheed wrote:

> I don't think that this is the right fix. ISTM that the real issue is
> that dropping a NOT NULL constraint should not mark the column as
> nullable if it is part of a PK, whether or not that PK is deferrable
> -- a deferrable PK still marks a  column as not nullable.

Yeah.  As I said upthread, a good fix seems to require no longer relying
on RelationGetIndexAttrBitmap to obtain the columns in the primary key,
because that function does not include deferred primary keys.  I came up
with the attached POC, which seems to fix the reported problem, but of
course it needs more polish, a working test case, and verifying whether
the new function should be used in more places -- in particular, whether
it can be used to revert the changes to RelationGetIndexList that
b0e96f311985 did.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La persona que no quería pecar / estaba obligada a sentarse
 en duras y empinadas sillas/ desprovistas, por cierto
 de blandos atenuantes"  (Patricio Vogel)
>From cc7032c3f62949d20bc98d4b3c70cb2cb8a0dd5e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 5 Mar 2024 13:32:50 +0100
Subject: [PATCH] Don't lose attnotnull if a deferred PK supports it

When dropping a NOT NULL constraint on a column that has a deferred
primary key, we would reset attnotnull, but that's bogus.

XXX this patch is nowhere near final.

Reported-by: Amul Sul 
Discussion: https://postgr.es/m/caaj_b94qonkgsbdxofakhdnorqngafd1y3oa5qxfpqnjyxy...@mail.gmail.com
---
 src/backend/commands/tablecmds.c | 71 ++--
 1 file changed, 68 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c61f9305c2..84c7871dda 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -12704,6 +12704,73 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 	table_close(conrel, RowExclusiveLock);
 }
 
+/*
+ * dropconstraint_getpkcols -- subroutine for dropconstraint_internal
+ *
+ * This is a workaround to the fact that RelationGetIndexAttrBitmap does
+ * not consider a DEFERRABLE PRIMARY KEY to be a real primary key.  Maybe
+ * this code should be elsewhere.
+ */
+static Bitmapset *
+dropconstraint_getpkcols(Relation rel)
+{
+	Relation	indrel;
+	SysScanDesc	indscan;
+	ScanKeyData	skey;
+	HeapTuple	htup;
+	Bitmapset  *b = NULL;
+
+	/*
+	 * We try first to obtain a list of PK columns from the cache; if there
+	 * is one, we're done.
+	 */
+	b = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_PRIMARY_KEY);
+	if (b != NULL)
+		return b;
+
+	/* Prepare to scan pg_index for entries having indrelid = this rel. */
+	ScanKeyInit(,
+Anum_pg_index_indrelid,
+BTEqualStrategyNumber, F_OIDEQ,
+ObjectIdGetDatum(RelationGetRelid(rel)));
+
+	indrel = table_open(IndexRelationId, AccessShareLock);
+	indscan = systable_beginscan(indrel, IndexIndrelidIndexId, true,
+ NULL, 1, );
+
+	while (HeapTupleIsValid(htup = systable_getnext(indscan)))
+	{
+		Form_pg_index index = (Form_pg_index) GETSTRUCT(htup);
+
+		/*
+		 * Ignore any indexes that are currently being dropped and those that
+		 * aren't a primary key.
+		 */
+		if (!index->indislive)
+			continue;
+		if (!index->indisprimary)
+			continue;
+
+		/*
+		 * If this primary key was IMMEDIATE, it would have been returned
+		 * by RelationGetIndexAttrBitmap.
+		 */
+		Assert(!index->indimmediate);
+
+		for (int i = 0; i < index->indnatts; i++)
+		{
+			int			attrnum = index->indkey.values[i];
+
+			b = bms_add_member(b, attrnum - FirstLowInvalidHeapAttributeNumber);
+		}
+	}
+
+	systable_endscan(indscan);
+	table_close(indrel, AccessShareLock);
+
+	return b;
+}
+
 /*
  * Remove a constraint, using its pg_constraint tuple
  *
@@ -12837,9 +12904,7 @@ dropconstraint_internal(Relation rel, HeapTuple constraintTup, DropBehavior beha
 		 * We want to test columns for their presence in the primary key, but
 		 * only if we're not dropping it.
 		 */
-		pkcols = dropping_pk ? NULL :
-			RelationGetIndexAttrBitmap(rel,
-	   INDEX_ATTR_BITMAP_PRIMARY_KEY);
+		pkcols = dropping_pk ? NULL : dropconstraint_getpkcols(rel);
 		ircols = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_IDENTITY_KEY);
 
 		foreach(lc, unconstrained_cols)
-- 
2.39.2



Re: type cache cleanup improvements

2024-03-05 Thread Aleksander Alekseev
Hi,

> Thank you for interesting in it!
>
> > These changes look very promising. Unfortunately the proposed patches
> > conflict with each other regardless the order of applying:
> >
> > ```
> > error: patch failed: src/backend/utils/cache/typcache.c:356
> > error: src/backend/utils/cache/typcache.c: patch does not apply
> > ```
> Try increase -F option of patch.
>
> Anyway, union of both patches in attachment

Thanks for the quick update.

I tested the patch on an Intel MacBook. A release build was used with
my typical configuration, TWIMC see single-install-meson.sh [1]. The
speedup I got on the provided benchmark is about 150 times. cfbot
seems to be happy with the patch.

I would like to tweak the patch a little bit - change some comments,
add some Asserts, etc. I hope you don't mind?

[1]: https://github.com/afiskon/pgscripts/

-- 
Best regards,
Aleksander Alekseev




Re: Avoid is possible a expensive function call (src/backend/utils/adt/varlena.c)

2024-03-05 Thread Ranier Vilela
Em seg., 4 de mar. de 2024 às 20:28, Tom Lane  escreveu:

> Michael Paquier  writes:
> > On Mon, Mar 04, 2024 at 03:08:03PM -0300, Ranier Vilela wrote:
> >> I can't see any user validation at the function
> pg_newlocale_from_collation.
>
> > Matthias is right, look closer.  I can see more than one check,
> > especially note the one related to the collation version mismatch that
> > should not be silently ignored.
>
> The fast path through that code doesn't include any checks, true,
> but the point is that finding the entry proves that we made those
> checks previously.  I can't agree with making those semantics
> squishy in order to save a few cycles in the exact-equality case.
>
Robustness is a fair point.

best regards,
Ranier Vilela


Get rid of the excess semicolon in planner.c

2024-03-05 Thread Richard Guo
I think this is a typo introduced in 0452b461b.

 +root->processed_groupClause = list_copy(parse->groupClause);;

The extra empty statement is harmless most of the time, but I still think
it would be better to get rid of it.

Attached is a trivial patch to do that.

Thanks
Richard


v1-0001-Get-rid-of-the-excess-semicolon-in-planner.c.patch
Description: Binary data


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-03-05 Thread Alvaro Herrera
On 2024-Mar-04, Tom Lane wrote:

> In hopes of noticing whether there are other similar thinkos,
> I permuted the order of the SlruPageStatus enum values, and
> now I get the expected warnings from gcc:

Thanks for checking!  I pushed the fixes.

Maybe we should assign a nonzero value (= 1) to the first element of
enums, to avoid this kind of mistake.
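
For instance, the SLRU page status enum would then start at 1 (sketch
only, with paraphrased comments):

typedef enum SlruPageStatus
{
	SLRU_PAGE_EMPTY = 1,			/* buffer is not in use */
	SLRU_PAGE_READ_IN_PROGRESS,		/* page is being read in */
	SLRU_PAGE_VALID,				/* page is valid and not being written */
	SLRU_PAGE_WRITE_IN_PROGRESS		/* page is being written out */
} SlruPageStatus;

so a zero-initialized (or never-assigned) status field no longer silently
compares equal to the first legitimate value.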

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-03-05 Thread Nazir Bilal Yavuz
Hi,

Thanks for working on this!

The patches are cleanly applied on top of the current master and all
tests are passed.

On Thu, 4 Jan 2024 at 02:23, Cedric Villemain
 wrote:
>
> Hi,
>
> I wonder what you think of making pg_prewarm use recent addition on
> smgrprefetch and readv ?
>
>
> In order to try, I did it anyway in the attached patches. They contain
> no doc update, but I will proceed if it is of interest.
>
> In summary:
>
> 1. The first one adds a new check on parameters (checking last block is
> indeed not before first block).
> Consequence is an ERROR is raised instead of silently doing nothing.

This is a general improvement and can be committed without other patches.

> 2. The second one does implement smgrprefetch with range and loops by
> default per segment to still have a check for interrupts.

It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
page of posix_fadvise [1] states that: "The amount of data read may be
decreased by the kernel depending on virtual memory load. (A few
megabytes will usually be fully satisfied, and more is rarely
useful.)". It is trying to prefetch 1GB data now. That could explain
your observation about differences between nr_cache numbers.
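
To make the comparison concrete, the two access patterns being compared
look roughly like this (illustrative snippet only, not the patch itself;
BLCKSZ here just stands in for PostgreSQL's 8kB block size):

#include <stdbool.h>
#include <fcntl.h>

#define BLCKSZ 8192

static void
prewarm_advise(int fd, unsigned int nblocks, bool whole_range)
{
	if (whole_range)
	{
		/*
		 * One syscall covering the whole range; the kernel is free to
		 * satisfy only part of it under memory pressure.
		 */
		(void) posix_fadvise(fd, 0, (off_t) nblocks * BLCKSZ,
							 POSIX_FADV_WILLNEED);
	}
	else
	{
		/*
		 * One syscall per block, as pg_prewarm does today; the kernel
		 * ends up keeping essentially everything resident, at a much
		 * higher per-call cost.
		 */
		for (unsigned int blk = 0; blk < nblocks; blk++)
			(void) posix_fadvise(fd, (off_t) blk * BLCKSZ, BLCKSZ,
								 POSIX_FADV_WILLNEED);
	}
}

The first form is what the patch moves towards; how much of it the kernel
actually honours depends on memory pressure, as the man page quote above
notes.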

> 3. The third one provides smgrreadv instead of smgrread,  by default on
> a range of 8 buffers. I am absolutely unsure that I used readv correctly...

Looks good to me.

> Q: posix_fadvise may not work exactly the way you think it does, or does
> it ?
>
>
> In details, and for the question:
>
> It's not so obvious that the "feature" is really required or wanted,
> depending on what are the expectations from user point of view.
>
> The kernel decides on what to do with posix_fadvise calls, and how we
> pass parameters does impact the decision.
> With the current situation where prefetch is done step by step, block by
> block, they are very probably most of the time all loaded even if those
> from the beginning of the relation can be discarded at the end of the
> prefetch.
>
> However,  if instead you provide a real range, or the magic len=0 to
> posix_fadvise, then blocks are "more" loaded according to effective vm
> pressure (which is not the case on the previous example).
> As a result only a small part of the relation might be loaded, and this
> is probably not what end-users expect despite being probably a good
> choice (you can still free cache beforehand to help the kernel).
>
> An example, below I'm using vm_relation_cachestat() which provides linux
> cachestat output, and vm_relation_fadvise() to unload cache, and
> pg_prewarm for the demo:
>
> # clear cache: (nr_cache is the number of file system pages in cache,
> not postgres blocks)
>
> ```
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -+-+--+--
>0 |   32768 |65536 |0
>32768 |   32768 |65536 |0
>65536 |   32768 |65536 |0
>98304 |   32768 |65536 |0
>   131072 |1672 | 3344 |0
> ```
>
> # load full relation with pg_prewarm (patched)
>
> ```
> postgres=# select pg_prewarm('foo','prefetch');
> pg_prewarm
> 
>  132744
> (1 row)
> ```
>
> # Checking results:
>
> ```
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -+-+--+--
>0 |   32768 |65536 |  320
>32768 |   32768 |65536 |0
>65536 |   32768 |65536 |0
>98304 |   32768 |65536 |0
>   131072 |1672 | 3344 |  320  <-- segment 1
>
> ```
>
> # Load block by block and check:
>
> ```
> postgres=# select from generate_series(0, 132743) g(n), lateral
> pg_prewarm('foo','prefetch', 'main', n, n);
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -+-+--+--
>0 |   32768 |65536 |65536
>32768 |   32768 |65536 |65536
>65536 |   32768 |65536 |65536
>98304 |   32768 |65536 |65536
>   131072 |1672 | 3344 | 3344
>
> ```
>
> The duration of the last example is also really significant: full
> relation is 0.3ms and block by block is 1550ms!
> You might think it's because of generate_series or whatever, but I have
> the exact same behavior with pgfincore.
> I can compare loading and unloading duration for similar "async" work,
> here each call is from block 0 with len of 132744 and a range of 1 block
> (i.e. posix_fadvise on 8kB at a time).
> So they have exactly the same number of operations doing 

Re: Adding deprecation notices to pgcrypto documentation

2024-03-05 Thread Daniel Gustafsson
> On 4 Mar 2024, at 23:49, Nathan Bossart  wrote:
> 
> On Mon, Mar 04, 2024 at 10:03:13PM +0100, Daniel Gustafsson wrote:
>> Cleaning out my inbox I came across this which I still think is worth doing,
>> any objections to going ahead with it?
> 
> I think the general idea is reasonable, but I have two small comments:
> 
> * Should this be a "Warning" and/or moved to the top of the page?  This
>  seems like a relatively important notice that folks should see when
>  beginning to use pgcrypto.

Good question.  If we do, we'd probably need to move other equally important
bits of information from "Security Limitations" as well, so perhaps it's best
to keep it as is for now, or put it under Notes.

> * Should we actually document the exact list of algorithms along with
>  detailed reasons?  This list seems prone to becoming outdated.

If we don't detail the list then I think it's not worth doing; doing the
research isn't entirely trivial, as one might not even know where to look or
what to look for.

I don't think this list will move faster than we can keep up with it,
especially since it's more or less listing everything that pgcrypto supports at
this point.

Looking at this some more I propose that we also remove the table of hash
benchmarks, as it's widely misleading.  Modern hardware can generate far more
than what we list here, and it gives the impression that these algorithms can
only be broken with brute force which is untrue.  The table was first published
in 2008 and hasn't been updated since.

Attached is an updated patchset.

--
Daniel Gustafsson



v3-0002-pgcrypto-Document-deprecation-notices-against-alg.patch
Description: Binary data


v3-0001-pgcrypto-Remove-hash-speed-comparison-table.patch
Description: Binary data


Re: Change GUC hashtable to use simplehash?

2024-03-05 Thread John Naylor
On Tue, Jan 30, 2024 at 5:04 PM John Naylor  wrote:
>
> On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma  wrote:
> > But given that we know the data length and we have it in a register
> > already, it's easy enough to just mask out data past the end with a
> > shift. See patch 1. Performance benefit is about 1.5x Measured on a
> > small test harness that just hashes and finalizes an array of strings,
> > with a data dependency between consecutive hashes (next address
> > depends on the previous hash output).
>
> Interesting work! I've taken this idea and (I'm guessing, haven't
> tested) improved it by re-using an intermediate step for the
> conditional, simplifying the creation of the mask, and moving the
> bitscan out of the longest dependency chain.

This needed a rebase, and is now 0001. I plan to push this soon.

I also went and looked at the simplehash instances and found a few
that would be easy to switch over. Rather than try to figure out which
could benefit from shaving cycles, I changed all the string hashes,
and one more, in 0002 so they can act as examples.

0003 uses fasthash for resowner, as suggested by Heikki upthread.  Now
murmur64 has no callers, but it (or similar *) could be used in
pg_dump/common.c for hashing CatalogId (8 bytes).

Commit 42a1de3013 added a new use for string_hash, but I can't tell
from a quick glance whether it uses the truncation, so I'm going to
take a closer look before re-attaching the proposed dynahash change
again.

* some examples here:
https://www.boost.org/doc/libs/1_84_0/boost/container_hash/detail/hash_mix.hpp
From 63d8140f146b58ea044f3516ae5472febd6d1caf Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 6 Feb 2024 13:11:33 +0700
Subject: [PATCH v19 1/3] Speed up tail processing when hashing aligned C
 strings

After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated the
terminator's position and re-loaded the remaining bytes from the input
string. We already have all the data we need in a register, so let's
just mask off the bytes we need and hash them immediately. The mask can
be cheaply computed without knowing the terminator's position. We still
need that position for the length calculation, but the CPU can now
do that in parallel with other work, shortening the dependency chain.

Ants Aasma and John Naylor

Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com
---
 src/include/common/hashfn_unstable.h | 44 +---
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 791750d136..bd7323fe05 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -219,8 +219,9 @@ static inline size_t
 fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 {
 	const char *const start = str;
-	size_t		remainder;
+	uint64		chunk;
 	uint64		zero_byte_low;
+	uint64		mask;
 
 	Assert(PointerIsAligned(start, uint64));
 
@@ -239,7 +240,7 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	 */
 	for (;;)
 	{
-		uint64		chunk = *(uint64 *) str;
+		chunk = *(uint64 *) str;
 
 #ifdef WORDS_BIGENDIAN
 		zero_byte_low = haszero64(pg_bswap64(chunk));
@@ -254,14 +255,37 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 		str += FH_SIZEOF_ACCUM;
 	}
 
-	/*
-	 * The byte corresponding to the NUL will be 0x80, so the rightmost bit
-	 * position will be in the range 7, 15, ..., 63. Turn this into byte
-	 * position by dividing by 8.
-	 */
-	remainder = pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
-	fasthash_accum(hs, str, remainder);
-	str += remainder;
+	if (zero_byte_low & 0xFF)
+	{
+		/*
+		 * The next byte in the input is the NUL terminator, so we have
+		 * nothing to do.
+		 */
+	}
+	else
+	{
+		/*
+		 * Create a mask for the remaining bytes so we can combine them into
+		 * the hash. The mask also covers the NUL terminator, but that's
+		 * harmless. The mask could contain 0x80 in bytes corresponding to the
+		 * input past the terminator, but only where the input byte is zero or
+		 * one, so also harmless.
+		 */
+		mask = zero_byte_low | (zero_byte_low - 1);
+#ifdef WORDS_BIGENDIAN
+		/* need to mask the upper bytes */
+		mask = pg_bswap64(mask);
+#endif
+		hs->accum = chunk & mask;
+		fasthash_combine(hs);
+
+		/*
+		 * The byte corresponding to the NUL will be 0x80, so the rightmost
+		 * bit position will be in the range 15, 23, ..., 63. Turn this into
+		 * byte position by dividing by 8.
+		 */
+		str += pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
+	}
 
 	return str - start;
 }
-- 
2.43.0

From 5dad8c783bc5d0d5f573ea136b13e79aa22d0371 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 5 Mar 2024 17:22:16 +0700
Subject: [PATCH v19 3/3] Use fasthash for hash_resource_elem

---
 src/backend/utils/resowner/resowner.c | 11 

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-03-05 Thread Bertrand Drouvot
Hi,

On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
> On Mon, Feb 26, 2024 at 02:01:45PM +, Bertrand Drouvot wrote:
> > Though [1] mentioned up-thread is not pushed yet, I'm Sharing the POC patch 
> > now
> > (see the attached).
> 
> I have looked at what you have here.

Thanks!

> First, in a build where 818fefd8fd is included, this makes the test
> script a lot slower.  Most of the logic is quick, but we're spending
> 10s or so checking that catalog_xmin has advanced.  Could it be
> possible to make that faster?

Yeah, the attached v2 changes this. It moves the injection point to after the
process has been killed, so that another process can decode at will (without
the need to wait for a walsender timeout) to reach
LogicalConfirmReceivedLocation().

> A second issue is the failure mode when 818fefd8fd is reverted.  The
> test is getting stuck when we are waiting on the standby to catch up,
> until a timeout decides to kick in to fail the test, and all the
> previous tests pass.  Could it be possible to make that more
> responsive?  I assume that in the failure mode we would get an
> incorrect conflict_reason for injection_inactiveslot, succeeding in
> checking the failure.

I tried to simulate a revert of 818fefd8fd (replacing "!terminated" by "1 == 1"
before the initial_* assignments). The issue is that the new Assert is then
triggered, leading to the standby shutting down. So, I'm not sure how to
improve this case.

> +my $terminated = 0;
> +for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; 
> $i++)
> +{
> +if ($node_standby->log_contains(
> +'terminating process .* to release replication slot 
> \"injection_activeslot\"', $logstart))
> +{
> +$terminated = 1;
> +last;
> +}
> +usleep(100_000);
> +}
> +ok($terminated, 'terminating process holding the active slot is logged 
> with injection point');
> 
> The LOG exists when we are sure that the startup process is waiting
> in the injection point, so this loop could be replaced with something
> like:
> +   $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
> +   ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. 
> ';)
> 

Yeah, now that wait_for_event() is there, let's use it: done in v2.

> Nit: the name of the injection point should be
> terminate-process-holding-slot rather than
> TerminateProcessHoldingSlot, to be consistent with the other ones. 

Done in v2.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From bca1159e246287dfd6469b2cc7e53609de25fae7 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 20 Feb 2024 08:11:47 +
Subject: [PATCH v2] Add regression test for 818fefd8fd

---
 src/backend/replication/slot.c|   8 ++
 .../t/035_standby_logical_decoding.pl | 106 +-
 2 files changed, 112 insertions(+), 2 deletions(-)
   6.9% src/backend/replication/
  93.0% src/test/recovery/t/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 2614f98ddd..9d379f154d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -53,6 +53,7 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
+#include "utils/injection_point.h"
 
 /*
  * Replication slot on-disk data structure.
@@ -1658,6 +1659,13 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 last_signaled_pid = active_pid;
 terminated = true;
 conflict_prev = conflict;
+/*
+ * This injection needs to be after the kill to ensure that
+ * the slot is not "active" anymore. It also has to be after
+ * ReportSlotInvalidation() to ensure the invalidation message
+ * is reported.
+ */
+INJECTION_POINT("terminate-process-holding-slot");
 			}
 
 			/* Wait until the slot is released. */
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 0710bccc17..4204be60f5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -733,14 +733,116 @@ change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
 ##
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
-# Scenario 6: incorrect wal_level on primary.
+# Scenario 6: Ensure race condition described in 818fefd8fd is fixed.
+##
+SKIP:
+{
+skip "Injection points not supported by this build", 1
+  if ($ENV{enable_injection_points} ne 'yes');
+
+	# Get the position to search from in the standby logfile
+	$logstart = -s $node_standby->logfile;
+
+	# Drop the slots, re-create them, change hot_standby_feedback,
+	# check xmin and catalog_xmin values, make slot active and reset 

Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-03-05 Thread zwj
Hi hackers,

I may have discovered another issue in the concurrency scenario of 
merge, and I am currently not sure if this new issue is related to the 
previous one.
It seems that it may also be an issue with the EPQ mechanism in the merge 
scenario?
I will provide this test case, hoping it will be helpful for you to fix 
related issues in the future.



 DROP TABLE IF EXISTS src1, tgt;
 CREATE TABLE src1 (a int, b text);
 CREATE TABLE tgt (a int, b text);
 INSERT INTO src1 SELECT x, 'Src1 '||x FROM 
generate_series(1, 3) g(x);
 INSERT INTO tgt SELECT x, 'Tgt '||x FROM generate_series(1, 
6, 2) g(x);
 insert into src1 values(3,'src1 33');



If I only execute the merge, I will get the following error:
 merge into tgt a using src1 c on a.a = c.a when
matched then update set b = c.b when not matched then insert (a,b)
values(c.a,c.b); -- execution fails
 ERROR: MERGE command cannot affect row a second time
 HINT: Ensure that not more than one source row
matches any one target row.



But when I execute the update and merge concurrently, I will get the 
following result set.

 --session1
 begin;

 update tgt set b = 'tgt333' where a =3;

 --session2
 merge into tgt a using src1 c on a.a = c.a when
matched then update set b = c.b when not matched then insert (a,b)
values(c.a,c.b); -- executes successfully
 --session1
 commit;
 select * from tgt;
  a | b 
 ---+-
 5 | Tgt 5
 1 | Src1 1
 2 | Src1 2
 3 | Src1 3
 3 | src1 33

 I think even if the tuple with id:3 is updated, merge should still be
able to retrieve new tuples with id:3, and report the same error as
above?


Regards,
wenjiang zhang


------------------ Original Message ------------------
From: "jian he"


