Re: Synchronizing slots from primary to standby

2024-02-15 Thread Amit Kapila
On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila  wrote:
>
> Thanks for noticing this. I have pushed all your debug patches. Let's
> hope that if there is a BF failure next time, we can gather enough
> information to know its cause.
>

There is a new BF failure [1] after adding these LOGs and I think I
know what is going wrong. First, let's look at standby LOGs:

2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG:
segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4
available
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG:
xmin required by slots: data 0, catalog 741
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG:
 could not sync slot information as remote slot precedes local slot: remote
slot "lsub1_slot": LSN (0/4000168), catalog xmin (739) local slot: LSN
(0/4000168), catalog xmin (741)

So, from the above LOG, it is clear that the remote slot's catalog
xmin (739) precedes the local catalog xmin (741), which prevents the
sync on the standby from completing.

Next, let's look at the LOG from the primary during the nearby time:
2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG:
 analyzing "pg_catalog.pg_depend"
2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG:
 "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0
dead rows; 1709 rows in sample, 1709 estimated total rows
...
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG:
Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes,
cost_limit=200, cost_delay=2 active=yes failsafe=no)
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG:
 analyzing "information_schema.sql_features"
2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG:
 "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0
dead rows; 756 rows in sample, 756 estimated total rows

This shows that an autovacuum worker analyzed a catalog table and, to
update its statistics in the pg_statistic table, would have acquired a
new transaction id. So, after the slot creation, a new transaction id
that updated the catalog was generated on the primary and would have
been replicated to the standby. Due to this, the catalog_xmin of the
primary's slot would precede the standby's catalog_xmin and we see
this failure.

As per this theory, we should disable autovacuum on the primary to
avoid updates to the catalog_xmin values.




[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae=2024-02-16%2006%3A12%3A59

-- 
With Regards,
Amit Kapila.




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-15 Thread Bharath Rupireddy
On Wed, Feb 14, 2024 at 6:59 AM Jeff Davis  wrote:
>
> Attached 2 patches.
>
> Per Andres's suggestion, 0001 adds an:
>   Assert(startptr + count <= LogwrtResult.Write)
>
> Though if we want to allow the caller (e.g. in an extension) to
> determine the valid range, perhaps using WaitXLogInsertionsToFinish(),
> then the check is wrong.

Right.

> Maybe we should just get rid of that code
> entirely and trust the caller to request a reasonable range?

I'd suggest we strike a balance here - error out in assert builds if
startptr + count is past the current insert position and trust the
callers in production builds. It has a couple of advantages over
doing just Assert(startptr + count <= LogwrtResult.Write):
1) It allows the caller to read unflushed WAL directly from WAL
buffers, see the attached 0005 for an example.
2) All the existing callers where WALReadFromBuffers() is expected to
be used ensure WAL availability by reading up to the flush position,
so there is no problem with it.
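
A minimal sketch of what that could look like (assuming
GetXLogInsertRecPtr() as the way to obtain the current insert position;
this is not the actual patch code):

```c
#include "postgres.h"
#include "access/xlog.h"

/*
 * Hedged sketch only: error out in assert-enabled builds if the caller
 * requests WAL beyond the current insert position; trust the caller in
 * production builds.
 */
static inline void
check_requested_wal_range(XLogRecPtr startptr, Size count)
{
#ifdef USE_ASSERT_CHECKING
    Assert(startptr + count <= GetXLogInsertRecPtr());
#else
    /* Production builds: trust the caller to request a sane range. */
    (void) startptr;
    (void) count;
#endif
}
```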

Also, we could add a note before WALRead() stating that the caller
must request only WAL that has already been written out (up to
LogwrtResult.Write). I'm not so sure about this; perhaps we don't need
this comment at all.

Here's the v23 patch set:

0001 - Adds assertion in WALReadFromBuffers() to ensure the requested
WAL isn't beyond the current insert position.
0002 - Adds a new test module to demonstrate how one can use
WALReadFromBuffers(), using WaitXLogInsertionsToFinish() if need be.
0003 - Uses WALReadFromBuffers in more places like logical walsenders
and backends.
0004 - Removes zero-padding related stuff as discussed in
https://www.postgresql.org/message-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3=aeezpjmqhpeobg...@mail.gmail.com.
This is needed in this patch set otherwise the assertion added in 0001
fails after 0003.
0005 - Adds a page_read callback for reading from WAL buffers in the
new test module added in 0002. Also, adds tests.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 5c0f95acda494904d02593e6ba305717b61c44b5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 16 Feb 2024 06:54:53 +
Subject: [PATCH v23 2/5] Add test module for verifying read from WAL buffers

---
 src/test/modules/Makefile |  1 +
 src/test/modules/meson.build  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 ++
 .../modules/read_wal_from_buffers/Makefile| 23 ++
 .../modules/read_wal_from_buffers/meson.build | 33 +
 .../read_wal_from_buffers--1.0.sql| 14 
 .../read_wal_from_buffers.c   | 54 ++
 .../read_wal_from_buffers.control |  4 ++
 .../read_wal_from_buffers/t/001_basic.pl  | 72 +++
 9 files changed, 206 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..864a3dd72b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..4f3dd69e58 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -33,6 +33,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 00..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+

Re: Fix a typo in pg_rotate_logfile

2024-02-15 Thread Bharath Rupireddy
On Thu, Feb 15, 2024 at 2:18 AM Daniel Gustafsson  wrote:
>
> >>> Searching on Github and Debian Codesearch I cannot find any reference to 
> >>> anyone
> >>> using any function from adminpack.  With pgAdminIII being EOL it might be 
> >>> time to
> >>> remove it now rather than be on the hook to maintain it for another 5 
> >>> years
> >>> until v17 goes EOL.  It'll still be around for years in V16->.
> >>
> >> Works for me.
> >>
> >>> Attached is a diff to show what it would look like to remove adminpack 
> >>> (catalog
> >>> version bump omitted on purpose to avoid conflicts until commit).
> >>
> >> I don't see any references you missed, so +1.
> >
> > Seems reasonable to me, too.
>
> Thanks!  I'll put this in the next CF to keep it open for comments a bit
> longer, but will close it early in the CF.

LGTM.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-02-15 Thread Bertrand Drouvot
Hi,

On Fri, Feb 16, 2024 at 12:32:45AM +, Zhijie Hou (Fujitsu) wrote:
> Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it
> could provide more information about communication between primary and 
> standby.
> This also doesn't increase noticeable testing time on my machine for debug
> build.

Same here, and there is no big diff as far as the amount of log generated goes:

Without the patch:

$ du -sh ./src/test/recovery/tmp_check/log/*040*
4.0K    ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_cascading_standby.log
24K     ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_publisher.log
16K     ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_standby1.log
4.0K    ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_subscriber1.log
12K     ./src/test/recovery/tmp_check/log/regress_log_040_standby_failover_slots_sync

With the patch:

$ du -sh ./src/test/recovery/tmp_check/log/*040*
4.0K    ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_cascading_standby.log
36K     ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_publisher.log
48K     ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_standby1.log
4.0K    ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_subscriber1.log
12K     ./src/test/recovery/tmp_check/log/regress_log_040_standby_failover_slots_sync

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: speed up a logical replica setup

2024-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Thanks for updating the patch!
Before reviewing deeply, here are replies for your comments.

>
> > Points raised by me [1] are not solved yet.
> > 
> > * What if the target version is PG16-?
pg_ctl and pg_resetwal won't work.
$ pg_ctl start -D /tmp/blah
waiting for server to start
2024-02-15 23:50:03.448 -03 [364610] FATAL:  database files are incompatible 
with server
2024-02-15 23:50:03.448 -03 [364610] DETAIL:  The data directory was 
initialized by PostgreSQL version 16, which is not compatible with this version 
17devel.
stopped waiting
pg_ctl: could not start server
Examine the log output.

$ pg_resetwal -D /tmp/blah
pg_resetwal: error: data directory is of wrong version
pg_resetwal: detail: File "PG_VERSION" contains "16", which is not compatible 
with this program's version "17".

> > * What if the found executables have a different version from
> > pg_createsubscriber?

The new code takes care of it.
>

I preferred to have a common path and test one by one, but agreed this worked 
well.
Let's keep it and hear opinions from others.

>
> > * What if the target is sending WAL to another server?
> >I.e., there are clusters like `node1->node2->node3`, and the target is 
> > node2.
The new code detects if the server is in recovery and aborts as you suggested.
A new option can be added to ignore the fact there are servers receiving WAL
from it.
>

Confirmed that it can detect this.

>
> > * Can we really cleanup the standby in case of failure?
> >Shouldn't we suggest to remove the target once?

If it finishes the promotion, no. I adjusted the cleanup routine a bit to avoid
it. However, we should provide instructions to inform the user that it should
create a fresh standby and try again.
>

I think the cleanup function does not look sufficient. In v21,
recovery_ended is kept true even after drop_publication() is done in
setup_subscriber(). I think that made_subscription is not needed, and
the function should output some messages when recovery_ended is on.
Besides, in the case of pg_upgrade, it always reports the message
below before starting the migration. I think this is more helpful for
users.

```
pg_log(PG_REPORT, "\n"
       "If pg_upgrade fails after this point, you must re-initdb the\n"
       "new cluster before continuing.");
```

>
> > * Can we move outputs to stdout?

Are you suggesting to use another logging framework? It is not a good idea
because each client program is already using common/logging.c.
>

Hmm, indeed. Other programs in pg_basebackup seem to use the framework.

>
v20-0011: Do we really want to interrupt the recovery if the network was
momentarily interrupted or if the OS killed walsender? Recovery is critical for
the process. I think we should do our best to be resilient and deliver all
changes required by the new subscriber.
>

It might be too strict to raise an ERROR as soon as we hit a
disconnection. And at least 0011 was wrong - it should be
PQgetvalue(res, 0, 1) for still_alive.

>
The proposal is not correct because the
query return no tuples if it is disconnected so you cannot PQgetvalue().
>

Sorry for the misunderstanding, but you might be confused.
pg_createsubscriber sends a query to the target, and we are discussing
the disconnection between the target and source instances. I think the
connection which pg_createsubscriber has is still alive, so
PQgetvalue() can get a value.

(BTW, callers of server_is_in_recovery() have not considered a disconnection from
the target...)

>
The
retry interval (the time that ServerLoop() will create another walreceiver) is
determined by DetermineSleepTime() and it is a maximum of 5 seconds
(SIGKILL_CHILDREN_AFTER_SECS). One idea is to retry 2 or 3 times before giving up
using the pg_stat_wal_receiver query. Do you have a better plan?
>

It's good to determine the threshold. It can define the number of
acceptable walreceiver losses during the loop.
I think we should retry at least until the postmaster revives the
walreceiver.
The check would run once per second, so more than 5 (or 10) retries
may be better.
Thoughts?
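
To make the idea concrete, here is a hedged sketch (illustrative only,
not the actual pg_createsubscriber code; the retry count and one-second
interval are assumptions) of retrying the pg_stat_wal_receiver check a
few times before giving up:

```c
#include <stdbool.h>
#include <unistd.h>
#include <libpq-fe.h>

#define NUM_WALRCV_CHECKS   10  /* ~10 seconds at one check per second */

static bool
wait_for_walreceiver(PGconn *conn)
{
    for (int i = 0; i < NUM_WALRCV_CHECKS; i++)
    {
        PGresult *res = PQexec(conn, "SELECT 1 FROM pg_stat_wal_receiver;");
        bool      alive = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                           PQntuples(res) > 0);

        PQclear(res);
        if (alive)
            return true;        /* walreceiver is (back) up */

        /* Give the postmaster time to restart the walreceiver, then retry. */
        sleep(1);
    }
    return false;               /* give up after repeated misses */
}
```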

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



Re: 035_standby_logical_decoding unbounded hang

2024-02-15 Thread Bertrand Drouvot
Hi,

On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote:
> On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote:
> > What about creating a sub, say wait_for_restart_lsn_calculation() in 
> > Cluster.pm
> > and then make use of it in create_logical_slot_on_standby() and above? 
> > (something
> > like wait_for_restart_lsn_calculation-v1.patch attached).
> 
> Waiting for restart_lsn is just a prerequisite for calling
> pg_log_standby_snapshot(), so I wouldn't separate those two.

Yeah, makes sense.

> If we're
> extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
> and make the API like one of these:
> 
>   $standby->wait_for_subscription_starting_point($primary, $slot_name);
>   $primary->log_standby_snapshot($standby, $slot_name);
> 
> Would you like to finish the patch in such a way?

Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used
log_standby_snapshot() as it seems more generic to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 07da74cf56..5d3174d0b5 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3181,6 +3181,37 @@ $SIG{TERM} = $SIG{INT} = sub {
 
 =pod
 
+=item $node->log_standby_snapshot(self, standby, slot_name)
+
+Log a standby snapshot on primary once the slot restart_lsn is determined on
+the standby.
+
+=cut
+
+sub log_standby_snapshot
+{
+	my ($self, $standby, $slot_name) = @_;
+	my ($stdout, $stderr);
+
+	# Once the slot's restart_lsn is determined, the standby looks for
+	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
+	# until the slot restart_lsn is determined.
+
+	$standby->poll_query_until(
+		'postgres', qq[
+		SELECT restart_lsn IS NOT NULL
+		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
+	])
+	  or die
+	  "timed out waiting for logical slot to calculate its restart_lsn";
+
+	# Then arrange for the xl_running_xacts record for which the standby is
+	# waiting.
+	$self->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');
+}
+
+=pod
+
 =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname)
 
 Create logical replication slot on given standby
@@ -3206,21 +3237,9 @@ sub create_logical_slot_on_standby
 		'2>',
 		\$stderr);
 
-	# Once the slot's restart_lsn is determined, the standby looks for
-	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
-	# until the slot restart_lsn is determined.
-
-	$self->poll_query_until(
-		'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
-	])
-	  or die
-	  "timed out waiting for logical slot to calculate its restart_lsn";
-
-	# Then arrange for the xl_running_xacts record for which pg_recvlogical is
+	# Arrange for the xl_running_xacts record for which pg_recvlogical is
 	# waiting.
-	$primary->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');
+	$primary->log_standby_snapshot($self, $slot_name);
 
 	$handle->finish();
 
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b020361b29..0710bccc17 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -465,8 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n";
 
 $psql_subscriber{run}->pump_nb();
 
-# Speed up the subscription creation
-$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+# Log the standby snapshot to speed up the subscription creation
+$node_primary->log_standby_snapshot($node_standby, 'tap_sub');
 
 # Explicitly shut down psql instance gracefully - to avoid hangs
 # or worse on windows


Re: Synchronizing slots from primary to standby

2024-02-15 Thread Amit Kapila
On Fri, Feb 16, 2024 at 11:12 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, February 16, 2024 8:33 AM Zhijie Hou (Fujitsu) 
>  wrote:
> > >
> > > Yeah, we can consider outputting some information via this function
> > > like how many slots are synced and persisted but not sure what would
> > > be appropriate here. Because one can anyway find that or more
> > > information by querying pg_replication_slots. I think we can keep
> > > discussing what makes more sense as a return value but let's commit
> > > the debug/log patches as they will be helpful to analyze BF failures or 
> > > any
> > other issues reported.
> >
> > Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it 
> > could
> > provide more information about communication between primary and standby.
> > This also doesn't increase noticeable testing time on my machine for debug
> > build.
>
> Sorry, there was a mistake in the DEBUG message, I should have used
> UINT64_FORMAT for XLogSegNo(uint64) instead of %ld. Here is a small patch
> to fix this.
>

Thanks for noticing this. I have pushed all your debug patches. Let's
hope that if there is a BF failure next time, we can gather enough
information to know its cause.

-- 
With Regards,
Amit Kapila.




RE: Synchronizing slots from primary to standby

2024-02-15 Thread Zhijie Hou (Fujitsu)
On Friday, February 16, 2024 8:33 AM Zhijie Hou (Fujitsu) 
 wrote:
> On Thursday, February 15, 2024 8:29 PM Amit Kapila
>  wrote:
> 
> >
> > On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot
> >  wrote:
> > >
> > > On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote:
> > > > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu)
> > > >  wrote:
> > > > > Attach the v2 patch here.
> > > > >
> > > > > Apart from the new log message. I think we can add one more
> > > > > debug message in reserve_wal_for_local_slot, this could be
> > > > > useful to analyze the
> > failure.
> > > >
> > > > Yeah, that can also be helpful, but the added message looks naive to me.
> > > > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno,
> > > > + segno);
> > > >
> > > > Instead of the above, how about something like: "segno: %ld of
> > > > purposed restart_lsn for the synced slot, oldest_segno: %ld
> > > > available"?
> > >
> > > Looks good to me. I'm not sure if it would make more sense to elog
> > > only if segno < oldest_segno means just before the
> > XLogSegNoOffsetToRecPtr() call?
> > >
> > > But I'm fine with the proposed location too.
> > >
> >
> > I am also fine either way but the current location gives required
> > information in more number of cases and could be helpful in debugging this
> new facility.
> >
> > > >
> > > > > And we
> > > > > can also enable the DEBUG log in the 040 tap-test, I see we have
> > > > > similar setting in 010_logical_decoding_timelines and logging
> > > > > debug1 message doesn't increase noticeable time on my machine.
> > > > > These are done
> > in 0002.
> > > > >
> > > >
> > > > I haven't tested it but I think this can help in debugging BF
> > > > failures, if any. I am not sure if to keep it always like that but
> > > > till the time these tests are stabilized, this sounds like a good
> > > > idea. So, how, about just making test changes as a separate patch
> > > > so that later if required we can revert/remove it easily?
> > > > Bertrand, do you have any thoughts on this?
> > >
> > > +1 on having DEBUG log in the 040 tap-test until it's stabilized (I
> > > +think we
> > > took the same approach for 035_standby_logical_decoding.pl IIRC) and
> > > then revert it back.
> > >
> >
> > Good to know!
> >
> > > Also I was thinking: what about adding an output to
> > pg_sync_replication_slots()?
> > > The output could be the number of sync slots that have been created
> > > and are not considered as sync-ready during the execution.
> > >
> >
> > Yeah, we can consider outputting some information via this function
> > like how many slots are synced and persisted but not sure what would
> > be appropriate here. Because one can anyway find that or more
> > information by querying pg_replication_slots. I think we can keep
> > discussing what makes more sense as a return value but let's commit
> > the debug/log patches as they will be helpful to analyze BF failures or any
> other issues reported.
> 
> Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it 
> could
> provide more information about communication between primary and standby.
> This also doesn't increase noticeable testing time on my machine for debug
> build.

Sorry, there was a mistake in the DEBUG message, I should have used
UINT64_FORMAT for XLogSegNo(uint64) instead of %ld. Here is a small patch
to fix this.

Best Regards,
Hou zj


0001-fix-the-format-for-uint64.patch
Description: 0001-fix-the-format-for-uint64.patch


Re: POC, WIP: OR-clause support for indexes

2024-02-15 Thread Andrei Lepikhov

On 16/2/2024 07:00, jian he wrote:

On Wed, Feb 14, 2024 at 11:21 AM Andrei Lepikhov
 wrote:
My OS: Ubuntu 22.04.3 LTS
I already set the max_parallel_workers_per_gather to 10.
So for all cases, it should use parallelism first?

a better question would be:
how to make queries with fewer than 29 ORs still faster when
enable_or_transformation is ON, by only setting parameters?
In my test environment this example shows a subtle advantage for ORs
over ANY only with 3 ORs or fewer.
Please provide the following EXPLAIN ANALYZE results for the case you
want to discuss here:

1. with enable_or_transformation enabled
2. with enable_or_transformation disabled
3. with enable_or_transformation disabled but with manual transformation 
OR -> ANY done, to check the overhead of this optimization.


--
regards,
Andrei Lepikhov
Postgres Professional





RE: pg_upgrade and logical replication

2024-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for reviewing! PSA new version.

> 
> Thanks for the updated patch, few suggestions:
> 1) Can we use a new publication for this subscription too so that the
> publication and subscription naming will become consistent throughout
> the test case:
> +# Table will be in 'd' (data is being copied) state as table sync will fail
> +# because of primary key constraint error.
> +my $started_query =
> +  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd'";
> +$old_sub->poll_query_until('postgres', $started_query)
> +  or die
> +  "Timed out while waiting for the table state to become 'd' (datasync)";
> +
> +# Create another subscription and drop the subscription's replication origin
> +$old_sub->safe_psql('postgres',
> +   "CREATE SUBSCRIPTION regress_sub3 CONNECTION '$connstr'
> PUBLICATION regress_pub2 WITH (enabled = false)"
> +);
>
> So after the change it will become like subscription regress_sub3 for
> publication regress_pub3, subscription regress_sub4 for publication
> regress_pub4 and subscription regress_sub5 for publication
> regress_pub5.

A new publication was defined.

> 2) The tab_upgraded1 table can be created along with create
> publication and create subscription itself:
> $publisher->safe_psql('postgres',
> "CREATE PUBLICATION regress_pub3 FOR TABLE tab_upgraded1");
> $old_sub->safe_psql('postgres',
> "CREATE SUBSCRIPTION regress_sub4 CONNECTION '$connstr' PUBLICATION
> regress_pub3 WITH (failover = true)"
> );

The definition of tab_upgraded1 was moved to the place you pointed out.

> 3) The tab_upgraded2 table can be created along with create
> publication and create subscription itself to keep it consistent:
>  $publisher->safe_psql('postgres',
> -   "ALTER PUBLICATION regress_pub2 ADD TABLE tab_upgraded2");
> +   "CREATE PUBLICATION regress_pub4 FOR TABLE tab_upgraded2");
>  $old_sub->safe_psql('postgres',
> -   "ALTER SUBSCRIPTION regress_sub2 REFRESH PUBLICATION");
> +   "CREATE SUBSCRIPTION regress_sub5 CONNECTION '$connstr'
> PUBLICATION regress_pub4"
> +);

Ditto.

> With above fixes, the following can be removed:
> # Initial setup
> $publisher->safe_psql(
> 'postgres', qq[
> CREATE TABLE tab_upgraded1(id int);
> CREATE TABLE tab_upgraded2(id int);
> ]);
> $old_sub->safe_psql(
> 'postgres', qq[
> CREATE TABLE tab_upgraded1(id int);
> CREATE TABLE tab_upgraded2(id int);
> ]);

Yes, earlier definitions were removed instead.
Also, some comments were adjusted based on these fixes.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v4-0001-Fix-testcase.patch
Description: v4-0001-Fix-testcase.patch


Re: pg_upgrade and logical replication

2024-02-15 Thread vignesh C
On Fri, 16 Feb 2024 at 08:22, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> > This sounds like a reasonable way to address the reported problem.
>
> OK, thanks!
>
> > Justin, do let me know if you think otherwise?
> >
> > Comment:
> > ===
> > *
> > -# Setup an enabled subscription to verify that the running status and 
> > failover
> > -# option are retained after the upgrade.
> > +# Setup a subscription to verify that the failover option are retained 
> > after
> > +# the upgrade.
> >  $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
> >  $old_sub->safe_psql('postgres',
> > - "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION
> > regress_pub1 WITH (failover = true)"
> > + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr'
> > PUBLICATION
> > regress_pub1 WITH (failover = true, enabled = false)"
> >  );
> >
> > I think it is better not to create a subscription in the early stage
> > which we wanted to use for the success case. Let's have separate
> > subscriptions for failure and success cases. I think that will avoid
> > the newly added ALTER statements in the patch.
>
> I made a patch to avoid creating objects as much as possible, but it
> may lead to some confusion. I recreated the patch so that it creates pub/sub
> and drops them at cleanup for every test case.
>
> PSA a new version.

Thanks for the updated patch, few suggestions:
1) Can we use a new publication for this subscription too so that the
publication and subscription naming will become consistent throughout
the test case:
+# Table will be in 'd' (data is being copied) state as table sync will fail
+# because of primary key constraint error.
+my $started_query =
+  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd'";
+$old_sub->poll_query_until('postgres', $started_query)
+  or die
+  "Timed out while waiting for the table state to become 'd' (datasync)";
+
+# Create another subscription and drop the subscription's replication origin
+$old_sub->safe_psql('postgres',
+   "CREATE SUBSCRIPTION regress_sub3 CONNECTION '$connstr'
PUBLICATION regress_pub2 WITH (enabled = false)"
+);

So after the change it will become like subscription regress_sub3 for
publication regress_pub3, subscription regress_sub4 for publication
regress_pub4 and subscription regress_sub5 for publication
regress_pub5.

2) The tab_upgraded1 table can be created along with create
publication and create subscription itself:
$publisher->safe_psql('postgres',
"CREATE PUBLICATION regress_pub3 FOR TABLE tab_upgraded1");
$old_sub->safe_psql('postgres',
"CREATE SUBSCRIPTION regress_sub4 CONNECTION '$connstr' PUBLICATION
regress_pub3 WITH (failover = true)"
);

3) The tab_upgraded2 table can be created along with create
publication and create subscription itself to keep it consistent:
 $publisher->safe_psql('postgres',
-   "ALTER PUBLICATION regress_pub2 ADD TABLE tab_upgraded2");
+   "CREATE PUBLICATION regress_pub4 FOR TABLE tab_upgraded2");
 $old_sub->safe_psql('postgres',
-   "ALTER SUBSCRIPTION regress_sub2 REFRESH PUBLICATION");
+   "CREATE SUBSCRIPTION regress_sub5 CONNECTION '$connstr'
PUBLICATION regress_pub4"
+);

With above fixes, the following can be removed:
# Initial setup
$publisher->safe_psql(
'postgres', qq[
CREATE TABLE tab_upgraded1(id int);
CREATE TABLE tab_upgraded2(id int);
]);
$old_sub->safe_psql(
'postgres', qq[
CREATE TABLE tab_upgraded1(id int);
CREATE TABLE tab_upgraded2(id int);
]);

Regards,
Vignesh




PGC_SIGHUP shared_buffers?

2024-02-15 Thread Robert Haas
Hi,

I remember Magnus making a comment many years ago to the effect that
every setting that is PGC_POSTMASTER is a bug, but some of those bugs
are very difficult to fix. Perhaps the use of the word bug is
arguable, but I think the sentiment is apt, especially with regard to
shared_buffers. Changing it without a server restart would be really
nice, but it's hard to figure out how to do it. I can think of a few
basic approaches, and I'd like to know (a) which ones people think are
good and which ones people think suck (maybe they all suck) and (b) if
anybody's got any other ideas not mentioned here.

1. Complicate the Buffer->pointer mapping. Right now, BufferGetBlock()
is basically just BufferBlocks + (buffer - 1) * BLCKSZ, which means
that we're expecting to find all of the buffers in a single giant
array. Years ago, somebody proposed changing the implementation to
essentially WhereIsTheBuffer[buffer], which was heavily criticized on
performance grounds, because it requires an extra memory access. A
gentler version of this might be something like
WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ;
i.e. instead of allowing every single buffer to be at some random
address, manage chunks of the buffer pool. This makes the lookup array
potentially quite a lot smaller, which might mitigate performance
concerns. For example, if you had one chunk per GB of shared_buffers,
your mapping array would need only a handful of cache lines, or a few
handfuls on really big systems.

(I am here ignoring the difficulties of how to orchestrate addition of
or removal of chunks as a SMOP[1]. Feel free to criticize that
hand-waving, but as of this writing, I feel like moderate
determination would suffice.)
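
A rough sketch of that chunked lookup (CHUNK_BLOCKS, NUM_CHUNKS, and
ChunkBase are illustrative names; none of this exists in PostgreSQL
today):

```c
#include "postgres.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"

/* One chunk per GB of shared_buffers, assuming the default 8kB BLCKSZ. */
#define CHUNK_BLOCKS    ((1024 * 1024 * 1024) / BLCKSZ)
#define NUM_CHUNKS      128     /* arbitrary upper bound for illustration */

/* Small array of chunk base addresses; a handful of cache lines at most. */
static char *ChunkBase[NUM_CHUNKS];

static inline Block
ChunkedBufferGetBlock(Buffer buffer)
{
    int     idx = buffer - 1;

    /* Two-step lookup: find the chunk, then the offset within the chunk. */
    return (Block) (ChunkBase[idx / CHUNK_BLOCKS] +
                    (Size) (idx % CHUNK_BLOCKS) * BLCKSZ);
}
```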

2. Make a Buffer just a disguised pointer. Imagine something like
typedef struct { Page bp; } *buffer. With this approach,
BufferGetBlock() becomes trivial. The tricky part with this approach
is that you still need a cheap way of finding the buffer header. What
I imagine might work here is to again have some kind of chunked
representation of shared_buffers, where each chunk contains a bunch of
buffer headers at, say, the beginning, followed by a bunch of buffers.
Theoretically, if the chunks are sufficiently strongly aligned, you can
figure out what offset you're at within the chunk without any
additional information and the whole process of locating the buffer
header is just math, with no memory access. But in practice, getting
the chunks to be sufficiently strongly aligned sounds hard, and this
also makes a Buffer 64 bits rather than the current 32. A variant on
this concept might be to make the Buffer even wider and include two
pointers in it i.e. typedef struct { Page bp; BufferDesc *bd; }
Buffer.
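
For illustration, a sketch of the two-pointer variant at the end of
that paragraph (FatBuffer and the accessors are hypothetical names, not
existing PostgreSQL APIs):

```c
#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/bufpage.h"

/* Hypothetical 128-bit "Buffer" carrying both pointers directly. */
typedef struct FatBuffer
{
    Page        bp;             /* the page contents in shared memory */
    BufferDesc *bd;             /* the corresponding buffer header */
} FatBuffer;

/* Both lookups become trivial pointer reads: no math, no extra lookup. */
static inline Page
FatBufferGetBlock(FatBuffer buf)
{
    return buf.bp;
}

static inline BufferDesc *
FatBufferGetDescriptor(FatBuffer buf)
{
    return buf.bd;
}
```

The obvious costs are the wider Buffer being passed around by value and
the loss of the compact 32-bit representation, which is what the
paragraph above is weighing.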

3. Reserve lots of address space and then only use some of it. I hear
rumors that some forks of PG have implemented something like this. The
idea is that you convince the OS to give you a whole bunch of address
space, but you try to avoid having all of it be backed by physical
memory. If you later want to increase shared_buffers, you then get the
OS to back more of it by physical memory, and if you later want to
decrease shared_buffers, you hopefully have some way of giving the OS
the memory back. As compared with the previous two approaches, this
seems less likely to be noticeable to most PG code. Problems include
(1) you have to somehow figure out how much address space to reserve,
and that forms an upper bound on how big shared_buffers can grow at
runtime and (2) you have to figure out ways to reserve address space
and back more or less of it with physical memory that will work on all
of the platforms that we currently support or might want to support in
the future.
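
A minimal, Linux-flavored sketch of approach 3 (single private mapping,
no error handling; a real shared_buffers would need a shared mapping
and portable equivalents, so treat this purely as an illustration of
the reserve-then-back idea):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Reserve 'reserved' bytes of address space; back only 'active' of them. */
static char *
reserve_buffer_pool(size_t reserved, size_t active)
{
    char   *base;

    /* Reserve address space only; no physical memory is committed yet. */
    base = mmap(NULL, reserved, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    /* Back the first 'active' bytes with usable memory. */
    mprotect(base, active, PROT_READ | PROT_WRITE);

    return base;
}

/*
 * Growing later:  mprotect(base + old, new - old, PROT_READ | PROT_WRITE);
 * Shrinking:      madvise(base + new, old - new, MADV_DONTNEED);
 *                 mprotect(base + new, old - new, PROT_NONE);
 */
```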

4. Give up on actually changing the size of shared_buffer per se, but
stick some kind of resizable secondary cache in front of it. Data that
is going to be manipulated gets brought into a (perhaps small?) "real"
shared_buffers that behaves just like today, but you have some larger
data structure which is designed to be easier to resize and maybe
simpler in some other ways that sits between shared_buffers and the OS
cache. This doesn't seem super-appealing because it requires a lot of
data copying, but maybe it's worth considering as a last resort.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://en.wikipedia.org/wiki/Small_matter_of_programming




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread John Naylor
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada  wrote:
> > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > reduce memory regressions, so should be considered. Up to 3 tids can
> > be stored in the last level child pointer. It's not polished, but I'll
> > only proceed with that if we think we need this. "flags" is called
> > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > the future, in addition to reserving space for the pointer tag. Note:
> > I hacked the tests to only have 2 offsets per block to demo, but of
> > course both paths should be tested.
>
> Interesting. I've run the same benchmark tests we did[1][2] (the
> median of 3 runs):
>
> monotonically ordered int column index:
>
> master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
> v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
> v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

Hmm, that's strange -- this test is intended to delete all records
from the last 20% of the blocks, so I wouldn't expect any improvement
here, only in the sparse case. Maybe something is wrong. All the more
reason to put it off...

> I'm happy to see a huge improvement. While it's really fascinating to
> me, I'm concerned about the time left until the feature freeze. We
> need to polish both tidstore and vacuum integration patches in 5
> weeks. Personally I'd like to have it as a separate patch for now, and
> focus on completing the main three patches since we might face some
> issues after pushing these patches. I think with 0007 patch it's a big
> win but it's still a win even without 0007 patch.

Agreed to not consider it for initial commit. I'll hold on to it for
some future time.

> > 2. Management of memory contexts. It's pretty verbose and messy. I
> > think the abstraction could be better:
> > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> > can't destroy or reset it. That means we have to do a lot of manual
> > work.
> > B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> > it seems the wrong responsibility. Not all uses will have a
> > work_mem-type limit, I'm guessing. We only use it for limiting the max
> > block size, and aset's default 8MB is already plenty small for
> > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> > smaller, and there it makes sense to limit the max blocksize this way.
> > C: The context for values has complex #ifdefs based on the value
> > length/varlen, but it's both too much and not enough. If we get a bump
> > context, how would we shoehorn that in for values for vacuum but not
> > for tidbitmap?
> >
> > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> > TidStoreCreate(), and then to RT_CREATE. That context will contain the
> > values (for local mem), and the node slabs will be children of the
> > value context. That way, measuring memory usage and free-ing can just
> > call with this parent context, and let recursion handle the rest.
> > Perhaps the passed context can also hold the radix-tree struct, but
> > I'm not sure since I haven't tried it. What do you think?
>
> If I understand your idea correctly, RT_CREATE() creates the context
> for values as a child of the passed context and the node slabs as
> children of the value context. That way, measuring memory usage can
> just call with the value context. It sounds like a good idea. But it
> was not clear to me how to address point B and C.

For B & C, vacuum would create a context to pass to TidStoreCreate,
and it wouldn't need to bother changing max block size. RT_CREATE
would use that directly for leaves (if any), and would only create
child slab contexts under it. It would not need to know about
max_bytes. Modifying your diagram a bit, something like:

- caller-supplied radix tree memory context (the 3 structs -- and
leaves, if any) (aset (or future bump?))
- node slab contexts

This might only be workable with aset, if we need to individually free
the structs. (I haven't studied this, it was a recent idea)
It's simpler, because with small fixed length values, we don't need to
detect that and avoid creating a leaf context. All leaves would live
in the same context as the structs.
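
A hedged sketch of that layout using the existing memory context APIs
(the context names and the 64-byte node size are placeholders, and this
is not the patch's actual code):

```c
#include "postgres.h"
#include "utils/memutils.h"

static void
radix_tree_context_sketch(void)
{
    MemoryContext rt_context;
    MemoryContext node_slab;
    Size        total;

    /* The caller (e.g. vacuum) supplies one context for structs and leaves. */
    rt_context = AllocSetContextCreate(CurrentMemoryContext,
                                       "radix tree",
                                       ALLOCSET_DEFAULT_SIZES);

    /* RT_CREATE() would then only add node slab contexts as its children. */
    node_slab = SlabContextCreate(rt_context,
                                  "radix tree node slab",
                                  SLAB_DEFAULT_BLOCK_SIZE,
                                  64);  /* placeholder node size */
    (void) node_slab;

    /* Measuring and freeing recurse from the single parent context. */
    total = MemoryContextMemAllocated(rt_context, true);
    (void) total;

    MemoryContextDelete(rt_context);
}
```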

> Another variant of this idea would be that RT_CREATE() creates the
> parent context of the value context to store radix-tree struct. That
> is, the hierarchy would be like:
>
> A MemoryContext (passed by vacuum through tidstore)
> - radix tree memory context (store radx-tree struct, control
> struct, and iterator)
> - value context (aset, slab, or bump)
> - node slab contexts

The template handling the value context here is complex, and is what I
meant by 'C' above. Most fixed length allocations in all of the
backend are aset, so it seems fine to use it always.

> Freeing can just call with the radix tree memory context. And perhaps
> it works even if tidstore passes 

RE: pg_upgrade and logical replication

2024-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Justin,

Thanks for replying!

> What optimizations?  I can't see them, and since the patch is described
> as rearranging test cases (and therefore already difficult to read), I
> guess they should be a separate patch, or the optimizations described.

The basic idea was to reduce the number of CREATE/DROP statements,
but it was changed for now - publications and subscriptions are created
and dropped per test case.

E.g., In case of successful upgrade, below steps were done:

1. create two publications
2. create a subscription with failover = true
3. avoid further initial sync by setting max_logical_replication_workers = 0
4. create another subscription
5. confirm statuses of tables are either of 'i' or 'r'
6. run pg_upgrade
7. confirm table statuses are preserved
8. confirm replication origins are preserved.

New patch is available in [1].

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB12077B16EEDA360BA645B96F8F54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/





Re: speed up a logical replica setup

2024-02-15 Thread Euler Taveira
On Thu, Feb 15, 2024, at 8:23 AM, Hayato Kuroda (Fujitsu) wrote:
> > Points raised by me [1] are not solved yet.
> > 
> > * What if the target version is PG16-?

pg_ctl and pg_resetwal won't work.

$ pg_ctl start -D /tmp/blah
waiting for server to start
2024-02-15 23:50:03.448 -03 [364610] FATAL:  database files are incompatible 
with server
2024-02-15 23:50:03.448 -03 [364610] DETAIL:  The data directory was 
initialized by PostgreSQL version 16, which is not compatible with this version 
17devel.
stopped waiting
pg_ctl: could not start server
Examine the log output.

$ pg_resetwal -D /tmp/blah
pg_resetwal: error: data directory is of wrong version
pg_resetwal: detail: File "PG_VERSION" contains "16", which is not compatible 
with this program's version "17".

> > * What if the found executables have a different version from
> > pg_createsubscriber?

The new code takes care of it.

> > * What if the target is sending WAL to another server?
> >I.e., there are clusters like `node1->node2->node3`, and the target is 
> > node2.

The new code detects if the server is in recovery and aborts as you suggested.
A new option can be added to ignore the fact there are servers receiving WAL
from it.

> > * Can we really cleanup the standby in case of failure?
> >Shouldn't we suggest to remove the target once?

If it finishes the promotion, no. I adjusted the cleanup routine a bit to avoid
it. However, we should provide instructions to inform the user that it should
create a fresh standby and try again.

> > * Can we move outputs to stdout?

Are you suggesting to use another logging framework? It is not a good idea
because each client program is already using common/logging.c.

> 1.
> Cfbot got angry [1]. This is because WIFEXITED and others are defined in 
> ,
> but the inclusion was removed per comment. Added the inclusion again.

Ok.

> 2.
> As Shubham pointed out [3], when we convert an intermediate node of cascading 
> replication,
> the last node would stuck. This is because a walreciever process requires 
> nodes have the same
> system identifier (in WalReceiverMain), but it would be changed by 
> pg_createsubscriebr.

Hopefully it was fixed.

> 3.
> Moreover, when we convert the last node of a cascade, it won't work well, 
> because we cannot create
> publications on the standby node.

Ditto.

> 4.
> If the standby server was initialized as PG16-, this command would fail,
> because the API of pg_logical_create_replication_slot() was changed.

See comment above.

> 5.
> Also, the pg_ctl commands used must have the same version as the instance.
> I think we should require that all the executables and servers be the same 
> major version.

It is enforced by the new code. See find_other_exec() in get_exec_path().

> Based on them, below part describes attached ones:

Thanks for another review. I'm sharing a new patch to merge a bunch of
improvements and fixes. Comments are below.

v20-0002: I did some extensive documentation changes (including some of them
related to the changes in the new patch). I will defer its update to check
v20-0002. It will be included in the next one.

v20-0003: I included most of it. There are a few things that pgindent reverted
so I didn't apply. I also didn't like some SQL commands that were broken into
multiple lines with spaces at the beginning. It seems nice in the code but it
is not in the output.

v20-0004: Nice catch. Applied.

v20-0005: Applied.

v20-0006: I prefer to remove the comment.

v20-0007: I partially applied it. I only removed the states that were not used
and proposed another dry run mode message. Maybe it is clearer than it was.

v20-0008: I refactored the get_bin_directory code. Under reflection, I reverted
the unified binary directory that we agreed a few days ago. The main reason is
to provide a specific error message for each program it is using. The
get_exec_path will check if the program is available in the same directory as
pg_createsubscriber and if it has the same version. An absolute path is
returned and is used by some functions.

v20-0009: to be reviewed.

v20-0010: As I said above, this code was refactored so I didn't apply this one.

v20-0011: Do we really want to interrupt the recovery if the network was
momentarily interrupted or if the OS killed walsender? Recovery is critical for
the process. I think we should do our best to be resilient and deliver all
changes required by the new subscriber. The proposal is not correct because the
query return no tuples if it is disconnected so you cannot PQgetvalue(). The
retry interval (the time that ServerLoop() will create another walreceiver) is
determined by DetermineSleepTime() and it is a maximum of 5 seconds
(SIGKILL_CHILDREN_AFTER_SECS). One idea is to retry 2 or 3 times before giving up
using the pg_stat_wal_receiver query. Do you have a better plan?

v20-0012: I applied a different patch to accomplish the same thing. I included
a refactor around pg_is_in_recovery() function to be used in other 2 points.


Re: Memory consumed by paths during partitionwise join planning

2024-02-15 Thread Andrei Lepikhov

On 15/2/2024 19:06, Ashutosh Bapat wrote:

On Thu, Feb 15, 2024 at 9:41 AM Andrei Lepikhov

But I'm not sure about freeing unreferenced paths. I would have to see
alternatives in the pathlist.


I didn't understand this. Can you please elaborate? A path in any
pathlist is referenced. An unreferenced path should not be in any
pathlist.
I mean that at some point, an extension can reconsider the path tree 
after building the top node of this path. I vaguely recall that we 
already have (or had) that kind of optimization in the core, where part 
of the plan changes after it has been built.
Live example: right now, I am working on code like MSSQL has - a 
combination of NestLoop and HashJoin paths with switching between them 
at run time. It requires both paths in the path list at the moment when 
extensions come in. Even if one of them isn't referenced from the 
upper pathlist, it may still be helpful for the extension.



About partitioning. As I discovered planning issues connected to
partitions, the painful problem is a rule, according to which we are
trying to use all nomenclature of possible paths for each partition.
With indexes, it quickly increases optimization work. IMO, this can help
a 'symmetrical' approach, which could restrict the scope of possible
pathways for upcoming partitions if we filter some paths in a set of
previously planned partitions.


filter or free?

Filter.
I meant that Postgres tries to apply IndexScan, BitmapScan, 
IndexOnlyScan, and other strategies, going through the partition 
indexes. The optimizer spends a lot of time doing that. So, why not 
introduce a symmetrical strategy and exclude from the search some scan 
types based on the path-building experience of previous partitions of 
the same table: if you have dozens of partitions, is it beneficial for 
the system to find a slightly more optimal IndexScan on one partition 
while having SeqScans on the 999 others?


--
regards,
Andrei Lepikhov
Postgres Professional





Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread Masahiko Sawada
On Thu, Feb 15, 2024 at 8:26 PM John Naylor  wrote:
>
> On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada  
> wrote:
> >
> > On Sat, Feb 10, 2024 at 9:29 PM John Naylor  wrote:
>
> > I've also run the same scripts in my environment just in case and got
> > similar results:
>
> Thanks for testing, looks good as well.
>
> > > There are still some micro-benchmarks we could do on tidstore, and
> > > it'd be good to find out worse-case memory use (1 dead tuple each on
> > > spread-out pages), but this is decent demonstration.
> >
> > I've tested a simple case where vacuum removes 33k dead tuples spread
> > about every 10 pages.
> >
> > master:
> > 198,000 bytes (=33000 * 6)
> > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
> >
> > v-59:
> > 2,834,432 bytes (reported by TidStoreMemoryUsage())
> > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s
>
> The memory usage for the sparse case may be a concern, although it's
> not bad -- a multiple of something small is probably not huge in
> practice. See below for an option we have for this.
>
> > > > > I'm pretty sure there's an
> > > > > accidental memset call that crept in there, but I'm running out of
> > > > > steam today.
> > >
> > > I have just a little bit of work to add for v59:
> > >
> > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> > > any bitmapwords. That can only happen if e.g. there is an offset > 128
> > > and there are none between 64 and 128, so not a huge deal but I think
> > > it's a bit nicer in this patch.
> >
> > LGTM.
>
> Okay, I've squashed this.
>
> > I've drafted the commit message.
>
> Thanks, this is a good start.
>
> > I've run regression tests with valgrind and run the coverity scan, and
> > I don't see critical issues.
>
> Great!
>
> Now, I think we're in pretty good shape. There are a couple of things
> that might be objectionable, so I want to try to improve them in the
> little time we have:
>
> 1. Memory use for the sparse case. I shared an idea a few months ago
> of how runtime-embeddable values (true combined pointer-value slots)
> could work for tids. I don't think this is a must-have, but it's not a
> lot of code, and I have this working:
>
> v61-0006: Preparatory refactoring -- I think we should do this anyway,
> since the intent seems more clear to me.

Looks good refactoring to me.

> v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> reduce memory regressions, so should be considered. Up to 3 tids can
> be stored in the last level child pointer. It's not polished, but I'll
> only proceed with that if we think we need this. "flags" is called
> that because it could hold tidbitmap.c booleans (recheck, lossy) in
> the future, in addition to reserving space for the pointer tag. Note:
> I hacked the tests to only have 2 offsets per block to demo, but of
> course both paths should be tested.

Interesting. I've run the same benchmark tests we did[1][2] (the
median of 3 runs):

monotonically ordered int column index:

master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

uuid column index:

master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
v-59: system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
v-62: system usage: CPU: user: 4.06 s, system: 0.98 s, elapsed: 5.06 s

int & uuid indexes in parallel:

master: system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s
v-59: system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
v-62: system usage: CPU: user: 2.54 s, system: 0.94 s, elapsed: 6.00 s

sparse case:

master:
198,000 bytes (=33000 * 6)
system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s

v-59:
2,834,432 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

v-62:
729,088 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 4.63 s, system: 0.86 s, elapsed: 5.50 s

I'm happy to see a huge improvement. While it's really fascinating to
me, I'm concerned about the time left until the feature freeze. We
need to polish both tidstore and vacuum integration patches in 5
weeks. Personally I'd like to have it as a separate patch for now, and
focus on completing the main three patches since we might face some
issues after pushing these patches. I think with 0007 patch it's a big
win but it's still a win even without 0007 patch.

>
> 2. Management of memory contexts. It's pretty verbose and messy. I
> think the abstraction could be better:
> A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> can't destroy or reset it. That means we have to do a lot of manual
> work.
> B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> it seems the wrong responsibility. Not all uses will have a
> work_mem-type limit, I'm 

RE: pg_upgrade and logical replication

2024-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> This sounds like a reasonable way to address the reported problem.

OK, thanks!

> Justin, do let me know if you think otherwise?
> 
> Comment:
> ===
> *
> -# Setup an enabled subscription to verify that the running status and 
> failover
> -# option are retained after the upgrade.
> +# Setup a subscription to verify that the failover option are retained after
> +# the upgrade.
>  $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
>  $old_sub->safe_psql('postgres',
> - "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION
> regress_pub1 WITH (failover = true)"
> + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr'
> PUBLICATION
> regress_pub1 WITH (failover = true, enabled = false)"
>  );
> 
> I think it is better not to create a subscription in the early stage
> which we wanted to use for the success case. Let's have separate
> subscriptions for failure and success cases. I think that will avoid
> the newly added ALTER statements in the patch.

I made a patch to avoid creating objects as much as possible, but it
may lead to some confusion. I recreated the patch so that it creates pub/sub
and drops them at cleanup for every test case.

PSA a new version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v3-0001-Fix-testcase.patch
Description: v3-0001-Fix-testcase.patch


Re: make add_paths_to_append_rel aware of startup cost

2024-02-15 Thread David Rowley
On Fri, 16 Feb 2024 at 01:09, David Rowley  wrote:
>
> On Thu, 15 Feb 2024 at 21:42, Andy Fan  wrote:
> > I found the both plans have the same cost, I can't get the accurate
> > cause of this after some hours research, but it is pretty similar with
> > 7516056c584e3, so I uses a similar strategy to stable it. is it
> > acceptable?
>
> It's pretty hard to say.  I can only guess why this test would be
> flapping like this. I see it's happened before on mylodon, so probably
> not a cosmic ray.  It's not like add_path() chooses a random path when
> the costs are the same, so I wondered if something similar to what led
> to f03a9ca4 is going on here. In particular, see [1].

While it's not conclusive proof, the following demonstrates that
relpages dropping by just 1 page causes the join order to change.

regression=# explain
regression-# select t1.unique1 from tenk1 t1
regression-# inner join tenk2 t2 on t1.tenthous = t2.tenthous
regression-# union all
regression-# (values(1)) limit 1;
  QUERY PLAN
--
 Limit  (cost=0.00..150.08 rows=1 width=4)
   ->  Append  (cost=0.00..1500965.01 rows=10001 width=4)
 ->  Nested Loop  (cost=0.00..1500915.00 rows=1 width=4)
   Join Filter: (t1.tenthous = t2.tenthous)
   ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=1 width=8)
   ->  Materialize  (cost=0.00..495.00 rows=1 width=4)
 ->  Seq Scan on tenk2 t2  (cost=0.00..445.00
rows=1 width=4)
 ->  Result  (cost=0.00..0.01 rows=1 width=4)

regression=# update pg_class set relpages=relpages - 1 where relname = 'tenk2';
UPDATE 1
regression=# explain
regression-# select t1.unique1 from tenk1 t1
regression-# inner join tenk2 t2 on t1.tenthous = t2.tenthous
regression-# union all
regression-# (values(1)) limit 1;
  QUERY PLAN
--
 Limit  (cost=0.00..150.52 rows=1 width=4)
   ->  Append  (cost=0.00..1505315.30 rows=10001 width=4)
 ->  Nested Loop  (cost=0.00..1505265.29 rows=1 width=4)
   Join Filter: (t1.tenthous = t2.tenthous)
   ->  Seq Scan on tenk2 t2  (cost=0.00..445.29 rows=10029 width=4)
   ->  Materialize  (cost=0.00..495.00 rows=1 width=8)
 ->  Seq Scan on tenk1 t1  (cost=0.00..445.00
rows=1 width=8)
 ->  Result  (cost=0.00..0.01 rows=1 width=4)

I tried this with the proposed changes to the test and the plan did not change.

I've pushed the change now.

David




Re: Do away with zero-padding assumption before WALRead()

2024-02-15 Thread Kyotaro Horiguchi
At Tue, 13 Feb 2024 11:47:06 +0530, Bharath Rupireddy 
 wrote in 
> Hi,
> 
> I noticed an assumption [1] at WALRead() call sites expecting the
> flushed WAL page to be zero-padded after the flush LSN. I think this
> can't always be true as the WAL can get flushed after determining the
> flush LSN before reading it from the WAL file using WALRead(). I've
> hacked the code up a bit to check if that's true -

Good catch! The comment also seems wrong to me. The subsequent bytes
can be written simultaneously, and it's perfectly normal for unflushed
bytes to be sitting in the OS's page buffer. The objective of the comment
seems to be to declare that there's no need to clear out the remaining
bytes here. I agree that it's not a problem for now. However, I think
we need two fixes here.

1. It's useless to copy the whole page regardless of the 'count'. It's
  enough to copy only up to the 'count'. The patch looks good in this
  regard.

2. Maybe we need a comment that states the page_read callback
  functions leave garbage bytes beyond the returned count, due to
  partial copying without clearing the unused portion.
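
To illustrate point 1, here is a minimal sketch (hypothetical names, not the
real xlogreader API) of a page_read-style routine that copies only the bytes
the caller asked for:

#include <string.h>

#define XLOG_BLCKSZ 8192		/* normally comes from pg_config.h */

/* Purely illustrative source descriptor, not a PostgreSQL type. */
typedef struct DemoWalSource
{
	const char *page_start;		/* start of the flushed WAL page in memory */
} DemoWalSource;

/*
 * Copy only 'count' bytes.  Bytes in readBuf[count .. XLOG_BLCKSZ - 1] are
 * left untouched, so per point 2 above, callers must treat them as garbage
 * rather than assuming zero padding.
 */
static int
demo_read_page(DemoWalSource *src, char *readBuf, int count)
{
	memcpy(readBuf, src->page_start, count);
	return count;
}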

> I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
> despite knowing exactly how much to read. Is it to tell the OS to
> explicitly fetch the whole page from the disk? If yes, the OS will do
> that anyway because the page transfers from disk to OS page cache are
> always in terms of disk block sizes, no?

If I understand your question correctly, I guess that the whole-page
copy was expected to clear out the remaining bytes, as I mentioned
earlier.

> Although, there's no immediate problem with it right now, the
> assumption is going to be incorrect when reading WAL from WAL buffers
> using WALReadFromBuffers -
> https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hla_hsgwnx2y5-g+mswd...@mail.gmail.com.
>
> If we have no reason, can the WALRead() callers just read how much
> they want like walsender for physical replication? Attached a patch
> for the change.
> 
> Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fix incorrect PG_GETARG in pgcrypto

2024-02-15 Thread shihao zhong
On Tue, Feb 13, 2024 at 7:08 PM Michael Paquier  wrote:
>
> On Tue, Feb 13, 2024 at 05:36:36PM +0900, Michael Paquier wrote:
> > You've indeed grabbed some historical inconsistencies here.  Please
> > note that your patch has reversed diffs (for example, the SQL
> > definition of pgp_sym_encrypt_bytea uses bytea,text,text as arguments
> > and your resulting patch shows how HEAD does the job with
> > bytea,bytea,bytea), but perhaps you have generated it with a command
> > like `git diff -R`?
>
> The reversed part of the patch put aside, I've double-checked
> your patch and the inconsistencies seem to be all addressed in this
> area.
Thanks for fixing and merging this patch, I appreciate it!

Thanks,
Shihao


> The symmetry that we have now between the bytea and text versions of
> the functions is stunning, but I cannot really get excited about
> merging all of them either as it would imply a bump of pgcrypto to
> update the prosrc of these functions, and we have to maintain runtime
> compatibility with older versions.
> --
> Michael




Re: make add_paths_to_append_rel aware of startup cost

2024-02-15 Thread Andy Fan


David Rowley  writes:

> I'd be more happy using this one as percentage-wise, the cost
> difference is much larger.

+1 for the percentage-wise.
>
> I checked that the t2.thounsand = 0 query still tests the cheap
> startup paths in add_paths_to_append_rel().

I reach the same conclusion here.
>
> So, if nobody has any better ideas, I'm just going to push the " and
> t2.thousand = 0" adjustment.

LGTM.

-- 
Best Regards
Andy Fan





Re: When extended query protocol ends?

2024-02-15 Thread Tatsuo Ishii
Hi Dave,

Oh, I see.

> Hi Tatsuo,
> 
> Actually no need, I figured it out.
> 
> I don't have a solution yet though.
> 
> Dave Cramer
> www.postgres.rocks
> 
> 
> On Thu, 15 Feb 2024 at 19:43, Tatsuo Ishii  wrote:
> 
>> > Can you ask the OP what they are doing in the startup. I'm trying to
>> > replicate their situation.
>> > Looks like possibly 'setReadOnly' and 'select version()'
>>
>> Sure I will. By the way 'select version()' may be issued by Pgpool-II
>> itself. In this case it should be 'SELECT version()', not 'select
>> version()'.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS LLC
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread John Naylor
v61 had a brown-paper-bag bug in the embedded tids patch that didn't
show up in the tidstore test but caused vacuum to fail; fixed in v62.


v62-ART.tar.gz
Description: application/gzip


Re: When extended query protocol ends?

2024-02-15 Thread Dave Cramer
Hi Tatsuo,

Actually no need, I figured it out.

I don't have a solution yet though.

Dave Cramer
www.postgres.rocks


On Thu, 15 Feb 2024 at 19:43, Tatsuo Ishii  wrote:

> > Can you ask the OP what they are doing in the startup. I'm trying to
> > replicate their situation.
> > Looks like possibly 'setReadOnly' and 'select version()'
>
> Sure I will. By the way 'select version()' may be issued by Pgpool-II
> itself. In this case it should be 'SELECT version()', not 'select
> version()'.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


Re: When extended query protocol ends?

2024-02-15 Thread Tatsuo Ishii
> Can you ask the OP what they are doing in the startup. I'm trying to
> replicate their situation.
> Looks like possibly 'setReadOnly' and 'select version()'

Sure I will. By the way 'select version()' may be issued by Pgpool-II
itself. In this case it should be 'SELECT version()', not 'select
version()'.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




RE: Synchronizing slots from primary to standby

2024-02-15 Thread Zhijie Hou (Fujitsu)
On Thursday, February 15, 2024 8:29 PM Amit Kapila  
wrote:

> 
> On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot
>  wrote:
> >
> > On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote:
> > > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > > Attach the v2 patch here.
> > > >
> > > > Apart from the new log message. I think we can add one more debug
> > > > message in reserve_wal_for_local_slot, this could be useful to analyze 
> > > > the
> failure.
> > >
> > > Yeah, that can also be helpful, but the added message looks naive to me.
> > > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno);
> > >
> > > Instead of the above, how about something like: "segno: %ld of
> > > purposed restart_lsn for the synced slot, oldest_segno: %ld
> > > available"?
> >
> > Looks good to me. I'm not sure if it would make more sense to elog
> > only if segno < oldest_segno means just before the
> XLogSegNoOffsetToRecPtr() call?
> >
> > But I'm fine with the proposed location too.
> >
> 
> I am also fine either way but the current location gives required information 
> in
> more number of cases and could be helpful in debugging this new facility.
> 
> > >
> > > > And we
> > > > can also enable the DEBUG log in the 040 tap-test, I see we have
> > > > similar setting in 010_logical_decoding_timline and logging debug1
> > > > message doesn't increase noticable time on my machine. These are done
> in 0002.
> > > >
> > >
> > > I haven't tested it but I think this can help in debugging BF
> > > failures, if any. I am not sure if to keep it always like that but
> > > till the time these tests are stabilized, this sounds like a good
> > > idea. So, how, about just making test changes as a separate patch so
> > > that later if required we can revert/remove it easily? Bertrand, do
> > > you have any thoughts on this?
> >
> > +1 on having DEBUG log in the 040 tap-test until it's stabilized (I
> > +think we
> > took the same approach for 035_standby_logical_decoding.pl IIRC) and
> > then revert it back.
> >
> 
> Good to know!
> 
> > Also I was thinking: what about adding an output to
> pg_sync_replication_slots()?
> > The output could be the number of sync slots that have been created
> > and are not considered as sync-ready during the execution.
> >
> 
> Yeah, we can consider outputting some information via this function like how
> many slots are synced and persisted but not sure what would be appropriate
> here. Because one can anyway find that or more information by querying
> pg_replication_slots. I think we can keep discussing what makes more sense as 
> a
> return value but let's commit the debug/log patches as they will be helpful to
> analyze BF failures or any other issues reported.

Agreed. Here is a new patch set as suggested. I used debug2 in the 040 test as
it could provide more information about the communication between primary and
standby. This also doesn't noticeably increase testing time on my machine for
a debug build.

Best Regards,
Hou zj


v3-0002-Add-a-DEBUG1-message-for-slot-sync.patch
Description: v3-0002-Add-a-DEBUG1-message-for-slot-sync.patch


v3-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch
Description:  v3-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch


v3-0003-change-the-log-level-of-sync-slot-test-to-debug2.patch
Description:  v3-0003-change-the-log-level-of-sync-slot-test-to-debug2.patch


Re: POC, WIP: OR-clause support for indexes

2024-02-15 Thread jian he
On Wed, Feb 14, 2024 at 11:21 AM Andrei Lepikhov
 wrote:
>
> So, this example is more about the subtle balance between
> parallel/sequential execution, which can vary from one platform to another.
>

Hi, here I have attached two files, expression_num_or_1_100.sql and
expression_num_or_1_1.sql. They contain step-by-step test cases, along with
the test output.

For both SQL files, I already set max_parallel_workers_per_gather to
10 and work_mem to 4GB.
I think the parameter settings should be fine.

in expression_num_or_1_100.sql:
main test table:
create table test_1_100 as (select
(random()*1000)::int x, (random()*1000) y from
generate_series(1,1_000_000) i);

If the number of ORs exceeds 29,
the performance with enable_or_transformation (ON) begins to outpace
enable_or_transformation (OFF).

If the number of ORs is less than 29,
the performance with enable_or_transformation (OFF) is better than with
enable_or_transformation (ON).

In expression_num_or_1_1.sql:
enable_or_transformation (ON) is always better than
enable_or_transformation (OFF).

My OS: Ubuntu 22.04.3 LTS
I already set the max_parallel_workers_per_gather to 10.
So for all cases, it should use parallelism first?

A better question would be:
how can the cases with fewer than 29 ORs still be made faster when
enable_or_transformation is ON, by only setting parameters?


expression_num_or_1_100.sql
Description: application/sql


expression_num_or_1_1.sql
Description: application/sql


Re: glibc qsort() vulnerability

2024-02-15 Thread Nathan Bossart
Here is what I have staged for commit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 8e5a66fb3a0787f15a900a89742862c89da38a1d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 15 Feb 2024 10:53:11 -0600
Subject: [PATCH v8 1/3] Remove direct calls to pg_qsort().

Calls to this function might give the idea that pg_qsort() is
somehow different than qsort(), when in fact all calls to qsort()
in the Postgres tree are redirected to pg_qsort() via a macro.
This commit removes all direct calls to pg_qsort() in favor of
letting the macro handle the conversions.

Discussion: https://postgr.es/m/CA%2B14426g2Wa9QuUpmakwPxXFWG_1FaY0AsApkvcTBy-YfS6uaw%40mail.gmail.com
---
 contrib/pg_prewarm/autoprewarm.c|  4 ++--
 src/backend/access/brin/brin_minmax_multi.c |  2 +-
 src/backend/optimizer/plan/analyzejoins.c   |  4 ++--
 src/backend/storage/buffer/bufmgr.c |  4 ++--
 src/backend/utils/cache/syscache.c  | 10 +-
 src/include/port.h  |  3 +++
 src/port/qsort.c|  3 +++
 7 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 06ee21d496..248b9914a3 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -346,8 +346,8 @@ apw_load_buffers(void)
 	FreeFile(file);
 
 	/* Sort the blocks to be loaded. */
-	pg_qsort(blkinfo, num_elements, sizeof(BlockInfoRecord),
-			 apw_compare_blockinfo);
+	qsort(blkinfo, num_elements, sizeof(BlockInfoRecord),
+		  apw_compare_blockinfo);
 
 	/* Populate shared memory state. */
 	apw_state->block_info_handle = dsm_segment_handle(seg);
diff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c
index 3ffaad3e42..2c29aa3d4e 100644
--- a/src/backend/access/brin/brin_minmax_multi.c
+++ b/src/backend/access/brin/brin_minmax_multi.c
@@ -1369,7 +1369,7 @@ build_distances(FmgrInfo *distanceFn, Oid colloid,
 	 * Sort the distances in descending order, so that the longest gaps are at
 	 * the front.
 	 */
-	pg_qsort(distances, ndistances, sizeof(DistanceValue), compare_distances);
+	qsort(distances, ndistances, sizeof(DistanceValue), compare_distances);
 
 	return distances;
 }
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 7dcb74572a..e494acd51a 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -2307,8 +2307,8 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove)
 		j++;
 	}
 
-	pg_qsort(candidates, numRels, sizeof(SelfJoinCandidate),
-			 self_join_candidates_cmp);
+	qsort(candidates, numRels, sizeof(SelfJoinCandidate),
+		  self_join_candidates_cmp);
 
 	/*
 	 * Iteratively form a group of relation indexes with the same oid and
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index eb1ec3b86d..07575ef312 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3915,7 +3915,7 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
 
 	/* sort the list of rlocators if necessary */
 	if (use_bsearch)
-		pg_qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator);
+		qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator);
 
 	for (i = 0; i < NBuffers; i++)
 	{
@@ -4269,7 +4269,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 
 	/* sort the list of SMgrRelations if necessary */
 	if (use_bsearch)
-		pg_qsort(srels, nrels, sizeof(SMgrSortArray), rlocator_comparator);
+		qsort(srels, nrels, sizeof(SMgrSortArray), rlocator_comparator);
 
 	for (i = 0; i < NBuffers; i++)
 	{
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 162855b158..662f930864 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -146,14 +146,14 @@ InitCatalogCache(void)
 	Assert(SysCacheSupportingRelOidSize <= lengthof(SysCacheSupportingRelOid));
 
 	/* Sort and de-dup OID arrays, so we can use binary search. */
-	pg_qsort(SysCacheRelationOid, SysCacheRelationOidSize,
-			 sizeof(Oid), oid_compare);
+	qsort(SysCacheRelationOid, SysCacheRelationOidSize,
+		  sizeof(Oid), oid_compare);
 	SysCacheRelationOidSize =
 		qunique(SysCacheRelationOid, SysCacheRelationOidSize, sizeof(Oid),
 oid_compare);
 
-	pg_qsort(SysCacheSupportingRelOid, SysCacheSupportingRelOidSize,
-			 sizeof(Oid), oid_compare);
+	qsort(SysCacheSupportingRelOid, SysCacheSupportingRelOidSize,
+		  sizeof(Oid), oid_compare);
 	SysCacheSupportingRelOidSize =
 		qunique(SysCacheSupportingRelOid, SysCacheSupportingRelOidSize,
 sizeof(Oid), oid_compare);
@@ -668,7 +668,7 @@ RelationSupportsSysCache(Oid relid)
 
 
 /*
- * OID comparator for pg_qsort
+ * OID comparator for qsort
  */
 static int
 oid_compare(const void *a, const void *b)
diff --git a/src/include/port.h 

Re: Transaction timeout

2024-02-15 Thread Andres Freund
Hi,

On 2024-02-13 23:42:35 +0200, Alexander Korotkov wrote:
> diff --git a/src/backend/access/transam/xact.c 
> b/src/backend/access/transam/xact.c
> index 464858117e0..a124ba59330 100644
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -2139,6 +2139,10 @@ StartTransaction(void)
>*/
>   s->state = TRANS_INPROGRESS;
>  
> + /* Schedule transaction timeout */
> + if (TransactionTimeout > 0)
> + enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
> +
>   ShowTransactionState("StartTransaction");
>  }

Isn't it a problem that all uses of StartTransaction() trigger a timeout, but
transaction commit/abort don't?  What if e.g. logical replication apply starts
a transaction, commits it, and then goes idle? The timer would still be
active, afaict?

I don't think it works well to enable timeouts in xact.c and to disable them
in PostgresMain().
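
For instance, a rough sketch (the helper name is made up; TRANSACTION_TIMEOUT,
TransactionTimeout and get_timeout_active() come from the patch under
discussion) of letting transaction end own the disarm step:

/*
 * Hypothetical helper, called from CommitTransaction()/AbortTransaction(),
 * so the timer cannot stay armed after e.g. a logical replication apply
 * worker commits and then goes idle.
 */
static void
DisarmTransactionTimeout(void)
{
	if (TransactionTimeout > 0 && get_timeout_active(TRANSACTION_TIMEOUT))
		disable_timeout(TRANSACTION_TIMEOUT, false);
}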


> @@ -4491,12 +4511,18 @@ PostgresMain(const char *dbname, const char *username)
>   
> pgstat_report_activity(STATE_IDLEINTRANSACTION_ABORTED, NULL);
>  
>   /* Start the idle-in-transaction timer */
> - if (IdleInTransactionSessionTimeout > 0)
> + if (IdleInTransactionSessionTimeout > 0
> + && (IdleInTransactionSessionTimeout < 
> TransactionTimeout || TransactionTimeout == 0))
>   {
>   idle_in_transaction_timeout_enabled = 
> true;
>   
> enable_timeout_after(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
>   
>  IdleInTransactionSessionTimeout);
>   }
> +
> + /* Schedule or reschedule transaction timeout */
> + if (TransactionTimeout > 0 && 
> !get_timeout_active(TRANSACTION_TIMEOUT))
> + 
> enable_timeout_after(TRANSACTION_TIMEOUT,
> + 
>  TransactionTimeout);
>   }
>   else if (IsTransactionOrTransactionBlock())
>   {
> @@ -4504,12 +4530,18 @@ PostgresMain(const char *dbname, const char *username)
>   pgstat_report_activity(STATE_IDLEINTRANSACTION, 
> NULL);
>  
>   /* Start the idle-in-transaction timer */
> - if (IdleInTransactionSessionTimeout > 0)
> + if (IdleInTransactionSessionTimeout > 0
> + && (IdleInTransactionSessionTimeout < 
> TransactionTimeout || TransactionTimeout == 0))
>   {
>   idle_in_transaction_timeout_enabled = 
> true;
>   
> enable_timeout_after(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
>   
>  IdleInTransactionSessionTimeout);
>   }
> +
> + /* Schedule or reschedule transaction timeout */
> + if (TransactionTimeout > 0 && 
> !get_timeout_active(TRANSACTION_TIMEOUT))
> + 
> enable_timeout_after(TRANSACTION_TIMEOUT,
> + 
>  TransactionTimeout);
>   }
>   else
>   {

Why do we need to do anything in these cases if the timer is started in
StartTransaction()?


> new file mode 100644
> index 000..ce2c9a43011
> --- /dev/null
> +++ b/src/test/isolation/specs/timeouts-long.spec
> @@ -0,0 +1,35 @@
> +# Tests for transaction timeout that require long wait times
> +
> +session s7
> +step s7_begin
> +{
> +BEGIN ISOLATION LEVEL READ COMMITTED;
> +SET transaction_timeout = '1s';
> +}
> +step s7_commit_and_chain { COMMIT AND CHAIN; }
> +step s7_sleep{ SELECT pg_sleep(0.6); }
> +step s7_abort{ ABORT; }
> +
> +session s8
> +step s8_begin
> +{
> +BEGIN ISOLATION LEVEL READ COMMITTED;
> +SET transaction_timeout = '900ms';
> +}
> +# to test that quick query does not restart transaction_timeout
> +step s8_select_1 { SELECT 1; }
> +step s8_sleep{ SELECT pg_sleep(0.6); }
> +
> +session checker
> +step checker_sleep   { SELECT pg_sleep(0.3); }

Isn't this test going to be very fragile on busy / slow machines? What if the
pg_sleep() takes one second, because there were other tasks to schedule?  I'd
be surprised if this didn't fail under valgrind, for example.


Greetings,

Andres Freund




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-02-15 Thread Peter Geoghegan
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent
 wrote:
> Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental
> on top of your earlier set, and both don't allocate additional memory
> in the merge operation in non-assertion builds.
>
> patch-a is a trivial and clean implementation of mergesort, which
> tends to reduce the total number of compare operations if both arrays
> are of similar size and value range. It writes data directly back into
> the main array on non-assertion builds, and with assertions it reuses
> your binary search join on scratch space for algorithm validation.

This patch fails some of my tests on non-assert builds only (assert
builds pass all my tests, though). I'm using the first patch on its
own here.

While I tend to be relatively in favor of complicated assertions (I
tend to think the risk of introducing side-effects is worth it), it
looks like you're only performing certain steps in release builds.
This is evident from just looking at the code (there is an #else block
just for the release build in the loop). Note also that
"Assert(_bt_compare_array_elements([merged_nelems++], orig,
) == 0)" has side-effects in assert-enabled builds only (it
increments merged_nelems). While it's true that you *also* increment
merged_nelems *outside* of the assertion (or in an #else block used
during non-assert builds), that is conditioned on some other thing (so
it's in no way equivalent to the debug #ifdef USE_ASSERT_CHECKING
case). It's also just really hard to understand what's going on here.

If I was going to do this kind of thing, I'd use two completely
separate loops, that were obviously completely separate (maybe even
two functions). I'd then memcmp() each array at the end.
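
For example, a sketch of that structure (the merge helpers are made-up names,
standing in for the release-build merge and a straightforward reference
implementation; the point is only the shape of the cross-check):

static int
merge_arrays_checked(Datum *dst, const Datum *a, int na,
					 const Datum *b, int nb)
{
	int			nmerged = merge_arrays(dst, a, na, b, nb);

#ifdef USE_ASSERT_CHECKING
	{
		Datum	   *ref = palloc(sizeof(Datum) * (na + nb));
		int			nref = merge_arrays_reference(ref, a, na, b, nb);

		/* both implementations must agree exactly */
		Assert(nref == nmerged);
		Assert(memcmp(ref, dst, sizeof(Datum) * nref) == 0);
		pfree(ref);
	}
#endif

	return nmerged;
}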

-- 
Peter Geoghegan




RE: Psql meta-command conninfo+

2024-02-15 Thread Maiquel Grassi
Hi!

(v16)

In this version, I made a small adjustment to the indentation
of the \conninfo code and described the columns as returned
by \conninfo+ as suggested by Jim Jones.

Regards,
Maiquel Grassi.


v16-0001-psql-meta-command-conninfo-plus.patch
Description: v16-0001-psql-meta-command-conninfo-plus.patch


Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

2024-02-15 Thread Andres Freund
Hi,

On 2024-02-11 13:19:00 -0500, David Benjamin wrote:
> I've attached a patch for the master branch to fix up the custom BIOs used
> by PostgreSQL, in light of the issues with the OpenSSL update recently.
> While c82207a548db47623a2bfa2447babdaa630302b9 (switching from BIO_get_data
> to BIO_get_app_data) resolved the immediate conflict, I don't think it
> addressed the root cause, which is that PostgreSQL was mixing up two BIO
> implementations, each of which expected to be driving the BIO.

Yea, that's certainly not nice - and I think we've been somewhat lucky it
hasn't caused more issues. There's some nasty possibilities, e.g. sock_ctrl()
partially enabling ktls without our initialization having called
ktls_enable(). Right now that just means ktls is unusable, but it's not hard
to imagine accidentally ending up sending unencrypted data.



I've in the past looked into not using a custom BIO [1], but I have my doubts
about that being a good idea. I think medium term we want to be able to do
network IO asynchronously, which seems quite hard to do when using openssl's
socket BIO.



> Once we've done that, we're free to use BIO_set_data. While BIO_set_app_data
> works fine, I've reverted back to BIO_set_data because it's more commonly 
> used.
> app_data depends on OpenSSL's "ex_data" mechanism, which is a tad heavier 
> under
> the hood.

At first I was a bit wary of that, because it requires us to bring back the
fallback implementation. But you're right, it's noticeably heavier than
BIO_get_data(), and we do call BIO_get_app_data() fairly frequently.



> That leaves ctrl. ctrl is a bunch of operations (it's ioctl). The only
> operation that matters is BIO_CTRL_FLUSH, which is implemented as a no-op. All
> other operations are unused. It's once again good that they're unused because
> otherwise OpenSSL might mess with postgres's socket, break the assumptions
> around interrupt handling, etc.

How did you determine that only FLUSH is required? I didn't even really find
documentation about what the intended semantics actually are.

E.g. should we implement BIO_CTRL_EOF? Sure, it wasn't really supported so
far, because we never set it, but is that right? What about
BIO_CTRL_GET_CLOSE/BIO_CTRL_SET_CLOSE?


Another issue is that 0 doesn't actually seem like the universal error return
- e.g. BIO_C_GET_FD seems to return -1, because 0 is a valid fd.


As of your patch the bio doesn't actually have an FD anymore, should it still
set BIO_TYPE_DESCRIPTOR?



> +static long
> +my_sock_ctrl(BIO *h, int cmd, long num, void *ptr)
> +{
> + long        res = 0;
> +
> + switch (cmd)
> + {
> + case BIO_CTRL_FLUSH:
> + /* libssl expects all BIOs to support BIO_flush. */
> + res = 1;
> + break;
> + }
> +
> + return res;
> +}

I'd move the res = 0 into a default: block. That way the compiler can warn if
some case doesn't set it in all branches.
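
I.e. something like this (same function as in the patch, just restructured;
untested):

static long
my_sock_ctrl(BIO *h, int cmd, long num, void *ptr)
{
	long		res;

	switch (cmd)
	{
		case BIO_CTRL_FLUSH:
			/* libssl expects all BIOs to support BIO_flush. */
			res = 1;
			break;
		default:
			res = 0;
			break;
	}

	return res;
}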


>  static BIO_METHOD *
>  my_BIO_s_socket(void)
>  {

Wonder if we should rename this. It's pretty odd that we still call it that
when it's not really related to s_socket anymore, and it doesn't even really
implement the same interface (e.g. get_fd doesn't work anymore).  Similarly,
my_SSL_set_fd() doesn't actually call set_fd() anymore, which sure seems odd.


Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20210715021747.h2hih7mw56ivyntt%40alap3.anarazel.de




Re: 035_standby_logical_decoding unbounded hang

2024-02-15 Thread Noah Misch
On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote:
> On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> > The 035_standby_logical_decoding.pl hang is
> > a race condition arising from an event sequence like this:
> > 
> > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> > - Test script calls pg_log_standby_snapshot() on primary.  Emits 
> > XLOG_RUNNING_XACTS.
> > - checkpoint_timeout makes a primary checkpoint finish.  Emits 
> > XLOG_RUNNING_XACTS.
> > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits 
> > XLOG_RUNNING_XACTS.
> > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> > 
> > Other test code already has a solution for this, so the attached patches 
> > add a
> > timeout and copy the existing solution.  I'm also attaching the hack that
> > makes it 100% reproducible.

> I did a few tests and confirm that the proposed solution fixes the corner 
> case.

Thanks for reviewing.

> What about creating a sub, say wait_for_restart_lsn_calculation() in 
> Cluster.pm
> and then make use of it in create_logical_slot_on_standby() and above? 
> (something
> like wait_for_restart_lsn_calculation-v1.patch attached).

Waiting for restart_lsn is just a prerequisite for calling
pg_log_standby_snapshot(), so I wouldn't separate those two.  If we're
extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
and make the API like one of these:

  $standby->wait_for_subscription_starting_point($primary, $slot_name);
  $primary->log_standby_snapshot($standby, $slot_name);

Would you like to finish the patch in such a way?




Re: index prefetching

2024-02-15 Thread Peter Geoghegan
On Thu, Feb 15, 2024 at 3:13 PM Andres Freund  wrote:
> > This is why I don't think that the tuples with lower page offset
> > numbers are in any way significant here.  The significant part is
> > whether or not you'll actually need to visit more than one leaf page
> > in the first place (plus the penalty from not being able to reorder
> > the work across page boundaries in your initial v1 of prefetching).
>
> To me this your phrasing just seems to reformulate the issue.

What I said to Tomas seems very obvious to me. I think that there
might have been some kind of miscommunication (not a real
disagreement). I was just trying to work through that.

> In practical terms you'll have to wait for the full IO latency when fetching
> the table tuple corresponding to the first tid on a leaf page. Of course
> that's also the moment you had to visit another leaf page. Whether the stall
> is due to visit another leaf page or due to processing the first entry on such
> a leaf page is a distinction without a difference.

I don't think anybody said otherwise?

> > > That's certainly true / helpful, and it makes the "first entry" issue
> > > much less common. But the issue is still there. Of course, this says
> > > nothing about the importance of the issue - the impact may easily be so
> > > small it's not worth worrying about.
> >
> > Right. And I want to be clear: I'm really *not* sure how much it
> > matters. I just doubt that it's worth worrying about in v1 -- time
> > grows short. Although I agree that we should commit a v1 that leaves
> > the door open to improving matters in this area in v2.
>
> I somewhat doubt that it's realistic to aim for 17 at this point.

That's a fair point. Tomas?

> We seem to
> still be doing fairly fundamental architectural work. I think it might be the
> right thing even for 18 to go for the simpler only-a-single-leaf-page
> approach though.

I definitely think it's a good idea to have that as a fall back
option. And to not commit ourselves to having something better than
that for v1 (though we probably should commit to making that possible
in v2).

> I wonder if there are prerequisites that can be tackled for 17. One idea is to
> work on infrastructure to provide executor nodes with information about the
> number of tuples likely to be fetched - I suspect we'll trigger regressions
> without that in place.

I don't think that there'll be regressions if we just take the simpler
only-a-single-leaf-page approach. At least it seems much less likely.

> One way to *sometimes* process more than a single leaf page, without having to
> redesign kill_prior_tuple, would be to use the visibilitymap to check if the
> target pages are all-visible. If all the table pages on a leaf page are
> all-visible, we know that we don't need to kill index entries, and thus can
> move on to the next leaf page

It's possible that we'll need a variety of different strategies.
nbtree already has two such strategies in _bt_killitems(), in a way.
Though its "Modified while not pinned means hinting is not safe" path
(LSN doesn't match canary value path) seems pretty naive. The
prefetching stuff might present us with a good opportunity to replace
that with something fundamentally better.

-- 
Peter Geoghegan




Re: logical decoding and replication of sequences, take 2

2024-02-15 Thread Tomas Vondra
On 2/15/24 05:16, Robert Haas wrote:
> On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
>  wrote:
>> The way I think about non-transactional sequence changes is as if they
>> were tiny transactions that happen "fully" (including commit) at the LSN
>> where the LSN change is logged.
> 
> 100% this.
> 
>> It does not mean we can arbitrarily reorder the changes, it only means
>> the changes are applied as if they were independent transactions (but in
>> the same order as they were executed originally). Both with respect to
>> the other non-transactional changes, and to "commits" of other stuff.
> 
> Right, this is very important and I agree completely.
> 
> I'm feeling more confident about this now that I heard you say that
> stuff -- this is really the key issue I've been worried about since I
> first looked at this, and I wasn't sure that you were in agreement,
> but it sounds like you are. I think we should (a) fix the locking bug
> I found (but that can be independent of this patch) and (b) make sure
> that this patch documents the points from the quoted material above so
> that everyone who reads the code (and maybe tries to enhance it) is
> clear on what the assumptions are.
> 
> (I haven't checked whether it documents that stuff or not. I'm just
> saying it should, because I think it's a subtlety that someone might
> miss.)
> 

Thanks for thinking about these issues with reordering events. Good we
seem to be in agreement and that you feel more confident about this.
I'll check if there's a good place to document this.

For me, the part that I feel most uneasy about is the decoding while the
snapshot is still being built (and can flip to a consistent snapshot
between the relfilenode creation and the sequence change, confusing the
logic that decides which changes are transactional).

It seems "a bit weird" that we keep the "simple" logic that may end up
with an incorrect "non-transactional" result, but happens to then work
fine because we immediately discard the change.

But it still feels better than the alternative, which requires us to
start decoding stuff (relfilenode creation) before the snapshot being
built is consistent, which we didn't do before - or at least not in
this particular way. While I don't have a practical example where it
would cause trouble now, I have a nagging feeling it might easily cause
trouble in the future by making some new features harder to implement.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-02-15 Thread Andres Freund
Hi,

On 2024-02-15 12:53:10 -0500, Peter Geoghegan wrote:
> On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra
>  wrote:
> > I may be missing something, but it seems fairly self-evident to me an
> > entry at the beginning of an index page won't get prefetched (assuming
> > the page-at-a-time thing).
> 
> Sure, if the first item on the page is also the first item that we
> need the scan to return (having just descended the tree), then it
> won't get prefetched under a scheme that sticks with the current
> page-at-a-time behavior (at least in v1). Just like when the first
> item that we need the scan to return is from the middle of the page,
> or more towards the end of the page.
> 
> It is of course also true that we can't prefetch the next page's
> first item until we actually visit the next page -- clearly that's
> suboptimal. Just like we can't prefetch any other, later tuples from
> the next page (until such time as we have determined for sure that
> there really will be a next page, and have called _bt_readpage for
> that next page.)
>
> This is why I don't think that the tuples with lower page offset
> numbers are in any way significant here.  The significant part is
> whether or not you'll actually need to visit more than one leaf page
> in the first place (plus the penalty from not being able to reorder
> the work across page boundaries in your initial v1 of prefetching).

To me this your phrasing just seems to reformulate the issue.

In practical terms you'll have to wait for the full IO latency when fetching
the table tuple corresponding to the first tid on a leaf page. Of course
that's also the moment you had to visit another leaf page. Whether the stall
is due to visit another leaf page or due to processing the first entry on such
a leaf page is a distinction without a difference.


> > That's certainly true / helpful, and it makes the "first entry" issue
> > much less common. But the issue is still there. Of course, this says
> > nothing about the importance of the issue - the impact may easily be so
> > small it's not worth worrying about.
> 
> Right. And I want to be clear: I'm really *not* sure how much it
> matters. I just doubt that it's worth worrying about in v1 -- time
> grows short. Although I agree that we should commit a v1 that leaves
> the door open to improving matters in this area in v2.

I somewhat doubt that it's realistic to aim for 17 at this point. We seem to
still be doing fairly fundamental architectural work. I think it might be the
right thing even for 18 to go for the simpler only-a-single-leaf-page
approach though.

I wonder if there are prerequisites that can be tackled for 17. One idea is to
work on infrastructure to provide executor nodes with information about the
number of tuples likely to be fetched - I suspect we'll trigger regressions
without that in place.



One way to *sometimes* process more than a single leaf page, without having to
redesign kill_prior_tuple, would be to use the visibilitymap to check if the
target pages are all-visible. If all the table pages on a leaf page are
all-visible, we know that we don't need to kill index entries, and thus can
move on to the next leaf page.
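
A rough sketch of that check (hypothetical helper, not from any posted patch;
it assumes the heap TIDs referenced by the leaf page have already been
collected into an array):

static bool
leaf_heap_blocks_all_visible(Relation heaprel, ItemPointerData *htids,
							 int nhtids, Buffer *vmbuffer)
{
	for (int i = 0; i < nhtids; i++)
	{
		BlockNumber blkno = ItemPointerGetBlockNumber(&htids[i]);

		if (!VM_ALL_VISIBLE(heaprel, blkno, vmbuffer))
			return false;
	}

	return true;
}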

Greetings,

Andres Freund




RE: Psql meta-command conninfo+

2024-02-15 Thread Maiquel Grassi
>v14 applies cleanly and the SSL info is now shown as previously
>suggested. Here is a more comprehensive test:
>
>
>$ /usr/local/postgres-dev/bin/psql -x "\
>host=server.uni-muenster.de
>hostaddr=172.19.42.1
>user=jim dbname=postgres
>sslrootcert=server-certificates/server.crt
>sslcert=jim-certificates/jim.crt
>sslkey=jim-certificates/jim.key"
>
>psql (17devel)
>SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
>compression: off)
>Type "help" for help.
>
>postgres=# SET ROLE foo;
>SET
>postgres=> \conninfo+
>Current Connection Information
>-[ RECORD 1
>]--+---
>Database   | postgres
>Authenticated User | jim
>System User| cert:emailAddress=j...@uni-muenster.de,CN=jim,OU=WWU
>IT,O=Universitaet Muenster,L=Muenster,ST=Nordrhein-Westfalen,C=DE
>Current User   | foo
>Session User   | jim
>Backend PID| 278294
>Server Address | 172.19.42.1
>Server Port| 5432
>Client Address | 192.168.178.27
>Client Port| 54948
>Socket Directory   |
>Host   | server.uni-muenster.de
>Encryption | SSL
>Protocol   | TLSv1.3
>Cipher | TLS_AES_256_GCM_SHA384
>Compression| off
>
>postgres=> \conninfo
>You are connected to database "postgres" as user "jim" on host
>"server.uni-muenster.de" (address "127.0.0.1") at port "5432".
>SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
>compression: off)

>A little odd is that "Server Address" of \conninfo+ differs from
>"address" of \conninfo...

//

Hi Jim!


Tests performed on CentOS Linux 7.


I made some further observations and concluded that there will
be more cases where the "address" from \conninfo will differ from
the "Server Address" from \conninfo+. Below is a more detailed example.

The input of "hostaddr" or "host" in the psql call can be anything pointing to
the local loopback, and the connection will still be established. For example,
all of these are accepted:

Case (inet):
psql -x --host 0
psql -x --host 0.0.0.0
psql -x hostaddr=0
psql -x hostaddr=0.0.0.0
All these examples will have "Server Address" = 127.0.0.1

Case (inet6):
psql -x --host ::
psql -x --host * (this entry is not accepted)
psql -x --host \*

psql -x --host "*"
psql -x hostaddr=::
psql -x hostaddr=*
All these examples will have "Server Address" = ::1

Thus, the inet_server_addr() function will return 127.0.0.1 or ::1 which in 
some cases will differ from the "address" from \conninfo.

[postgres@localhost bin]$ ./psql -x hostaddr=0

Password for user postgres:
psql (17devel)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, 
compression: off)
Type "help" for help.

postgres=# SET ROLE maiquel;
SET
postgres=> \conninfo
You are connected to database "postgres" as user "postgres" on host "0" 
(address "0.0.0.0") at port "5432".
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, 
compression: off)
postgres=> \conninfo+
Current Connection Information
-[ RECORD 1 ]--+
Database   | postgres
Authenticated User | postgres
System User| scram-sha-256:postgres
Current User   | maiquel
Session User   | postgres
Backend PID| 15205
Server Address | 127.0.0.1
Server Port| 5432
Client Address | 127.0.0.1
Client Port| 57598
Socket Directory   |
Host   | 0
Encryption | SSL
Protocol   | TLSv1.2
Cipher | ECDHE-RSA-AES256-GCM-SHA384
Compression| off

postgres=> \q
[postgres@localhost bin]$ ping 0.0.0.0
PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.061 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.069 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.071 ms
64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=0.107 ms
^C
--- 0.0.0.0 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 0.061/0.077/0.107/0.017 ms

As demonstrated above, "address" = 0.0.0.0 and "Server Address" = 127.0.0.1
are divergent.

In practice, these IPs are the "same", and the ping from the example proves it.

However, we are concerned here with the psql user, and this may seem confusing
to them at first glance. I would like to propose a change to "address" so that
it always returns the same value as "Server Address", that is, to use the
inet_server_addr() function for "address".

Result:

[postgres@localhost bin]$ ./psql -x hostaddr=0

psql (17devel)
Type "help" for help.

postgres=# \conninfo
You are connected to database "postgres" as user "postgres" on host "0" 
(address "127.0.0.1") at port "5432".
postgres=# \conninfo+
Current Connection Information
-[ RECORD 1 ]--+--
Database   | postgres
Authenticated User | postgres
System User|
Current User   | postgres

Re: RFC: Logging plan of the running query

2024-02-15 Thread Andres Freund
Hi,

On 2024-02-15 14:42:11 +0530, Robert Haas wrote:
> I think the issue is very general. We have lots of subsystems that
> both (a) use global variables and (b) contain CHECK_FOR_INTERRUPTS().
> If we process an interrupt while that code is in the middle of
> manipulating its global variables and then again re-enter that code,
> bad things might happen. We could try to rule that out by analyzing
> all such subsystems and all places where CHECK_FOR_INTERRUPTS() may
> appear, but that's very difficult. Suppose we took the alternative
> approach recommended by Andrey Lepikhov in
> http://postgr.es/m/b1b110ae-61f6-4fd9-9b94-f967db9b5...@app.fastmail.com
> and instead set a flag that gets handled in some suitable place in the
> executor code, like ExecProcNode(). If we did that, then we would know
> that we're not in the middle of a syscache lookup, a catcache lookup,
> a heavyweight lock acquisition, an ereport, or any other low-level
> subsystem call that might create problems for the EXPLAIN code.
> 
> In that design, the hack shown above would go away, and we could be
> much more certain that we don't need any other similar hacks for other
> subsystems. The only downside is that we might experience a slightly
> longer delay before a requested EXPLAIN plan actually shows up, but
> that seems like a pretty small price to pay for being able to reason
> about the behavior of the system.

I am very wary of adding overhead to ExecProcNode() - I'm quite sure that
adding code there would trigger visible overhead for query times.

If we went with something like that approach, I think we'd have to do something
like redirecting node->ExecProcNode to a wrapper, presumably from within a
CFI. That wrapper could then implement the explain support, without slowing
down the normal execution path.
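
Roughly this shape (LogRunningQueryPlan() is a made-up stand-in for the
patch's logging; a real version would also have to preserve any
instrumentation wrapper instead of always restoring ExecProcNodeReal):

static TupleTableSlot *
ExecProcNodeWithExplain(PlanState *node)
{
	/* restore the fast path first; this wrapper should run at most once */
	node->ExecProcNode = node->ExecProcNodeReal;

	LogRunningQueryPlan(node);	/* hypothetical logging routine */

	return node->ExecProcNode(node);
}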


> I don't *think* there are any cases where we run in the executor for a
> particularly long time without a new call to ExecProcNode().

I guess it depends on what you call a long time. A large sort, for example,
could spend a fair amount of time inside tuplesort, similarly, a gather node
might need to wait for a worker for a while etc.


> It's really hard for me to accept that the heavyweight lock problem
> for which the patch contains a workaround is the only one that exists.
> I can't see any reason why that should be true.

I suspect you're right.

Greetings,

Andres Freund




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-15 Thread Corey Huinker
>
> > As for whether it's commonplace, when I was a consultant I had a number
> > of customers that I had who bemoaned how large updates caused big
> > replica lag, basically punishing access to records they did care about
> > in order to properly archive or backfill records they don't care about.
> > I used the technique a lot, putting the update/delete in a loop, and
> > often running multiple copies of the same script at times when I/O
> > contention was low, but if load levels rose it was trivial to just kill
> > a few of the scripts until things calmed down.
>
> I've also used the technique quite a lot, but only using the PK,
> didn't know about the ctid trick, so many thanks for documenting it.


tid-scans only became a thing a few versions ago (12?). Prior to that, PK
was the only way to go.


Re: partitioning and identity column

2024-02-15 Thread Alexander Lakhin

Hello Ashutosh,

24.01.2024 09:34, Ashutosh Bapat wrote:



There's another thing I found. The file isn't using
check_stack_depth() in the function which traverse inheritance
hierarchies. This isn't just a problem of the identity related
function but most of the functions in that file. Do you think it's
worth fixing it?

I suppose the number of inheritance levels is usually not a problem for
stack depth?


Practically it should not. I would rethink the application design if
it requires so many inheritance or partition levels. But functions in
optimizer like try_partitionwise_join() and set_append_rel_size() call

/* Guard against stack overflow due to overly deep inheritance tree. */
check_stack_depth();

I am fine if we want to skip this.


I've managed to reach stack overflow inside ATExecSetIdentity() with
the following script:
(echo "CREATE TABLE tp0 (a int PRIMARY KEY,
    b int GENERATED ALWAYS AS IDENTITY) PARTITION BY RANGE (a);";
for ((i=1;i<=8;i++)); do
  echo "CREATE TABLE tp$i PARTITION OF tp$(( $i - 1 ))
 FOR VALUES FROM ($i) TO (100) PARTITION BY RANGE (a);";
done;
echo "ALTER TABLE tp0 ALTER COLUMN b SET GENERATED BY DEFAULT;") | psql 
>psql.log

(with max_locks_per_transaction = 400 in the config)

It runs about 15 minutes for me and ends with:
Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x55a8ced20de9 in LWLockAcquire (lock=0x7faec200b900, 
mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1169
1169    {
(gdb) bt
#0  0x55a8ced20de9 in LWLockAcquire (lock=0x7faec200b900, 
mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1169
#1  0x55a8cea0342d in WALInsertLockAcquire () at xlog.c:1389
#2  XLogInsertRecord (rdata=0x55a8cf1ccee8 , fpw_lsn=fpw_lsn@entry=1261347512, flags=0 '\000', 
num_fpi=num_fpi@entry=0, topxid_included=false) at xlog.c:817

#3  0x55a8cea1396e in XLogInsert (rmid=rmid@entry=11 '\v', info=) at xloginsert.c:524
#4  0x55a8ce9c1541 in _bt_insertonpg (rel=0x7faeb8478c98, heaprel=0x7faecf63d378, 
itup_key=itup_key@entry=0x55a8d5064678, buf=3210, cbuf=cbuf@entry=0, stack=stack@entry=0x55a8d1063d08, 
itup=0x55a8d5064658, itemsz=16,

    newitemoff=, postingoff=0, split_only_page=) 
at nbtinsert.c:1389
#5  0x55a8ce9bf9a7 in _bt_doinsert (rel=, rel@entry=0x7faeb8478c98, itup=, 
itup@entry=0x55a8d5064658, checkUnique=, checkUnique@entry=UNIQUE_CHECK_YES, indexUnchanged=,

    heapRel=, heapRel@entry=0x7faecf63d378) at nbtinsert.c:260
#6  0x55a8ce9c92ad in btinsert (rel=0x7faeb8478c98, values=, isnull=, 
ht_ctid=0x55a8d50643cc, heapRel=0x7faecf63d378, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=,

    indexInfo=) at nbtree.c:205
#7  0x55a8cea41391 in CatalogIndexInsert (indstate=indstate@entry=0x55a8d0fc03e8, heapTuple=, 
heapTuple@entry=0x55a8d50643c8, updateIndexes=) at indexing.c:170
#8  0x55a8cea4172c in CatalogTupleUpdate (heapRel=heapRel@entry=0x7faecf63d378, otid=0x55a8d50643cc, 
tup=tup@entry=0x55a8d50643c8) at indexing.c:324
#9  0x55a8ceb18173 in ATExecSetIdentity (rel=0x7faeab1288a8, colName=colName@entry=0x55a8d0fbc2b8 "b", 
def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, recursing=) at tablecmds.c:8307
#10 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab127f28, colName=colName@entry=0x55a8d0fbc2b8 "b", 
def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, recursing=) at tablecmds.c:8337
#11 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab1275a8, colName=colName@entry=0x55a8d0fbc2b8 "b", 
def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, recursing=) at tablecmds.c:8337
#12 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab126c28, colName=colName@entry=0x55a8d0fbc2b8 "b", 
def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, recursing=) at tablecmds.c:8337

...

Functions ATExecAddIdentity() and ATExecDropIdentity() are recursive too,
so I think they can be exploited as well.
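
A minimal sketch of the guard (the walker below is purely illustrative; the
actual change would just add the same check_stack_depth() call at the top of
ATExecSetIdentity(), ATExecAddIdentity() and ATExecDropIdentity()):

#include "postgres.h"
#include "miscadmin.h"			/* check_stack_depth() */

/* Purely illustrative tree node, not a PostgreSQL type. */
typedef struct DemoNode
{
	int			nchildren;
	struct DemoNode **children;
} DemoNode;

static void
walk_demo_tree(DemoNode *node)
{
	/* Guard against stack overflow due to overly deep inheritance tree. */
	check_stack_depth();

	for (int i = 0; i < node->nchildren; i++)
		walk_demo_tree(node->children[i]);
}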

Best regards,
Alexander




Re: index prefetching

2024-02-15 Thread Peter Geoghegan
On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra
 wrote:
> I may be missing something, but it seems fairly self-evident to me an
> entry at the beginning of an index page won't get prefetched (assuming
> the page-at-a-time thing).

Sure, if the first item on the page is also the first item that we
need the scan to return (having just descended the tree), then it
won't get prefetched under a scheme that sticks with the current
page-at-a-time behavior (at least in v1). Just like when the first
item that we need the scan to return is from the middle of the page,
or more towards the end of the page.

It is of course also true that we can't prefetch the next page's
first item until we actually visit the next page -- clearly that's
suboptimal. Just like we can't prefetch any other, later tuples from
the next page (until such time as we have determined for sure that
there really will be a next page, and have called _bt_readpage for
that next page.)

This is why I don't think that the tuples with lower page offset
numbers are in any way significant here.  The significant part is
whether or not you'll actually need to visit more than one leaf page
in the first place (plus the penalty from not being able to reorder
the work across page boundaries in your initial v1 of prefetching).

> If I understand your point about boundary cases / suffix truncation,
> that helps us by (a) picking the split in a way to minimize a single key
> spanning multiple pages, if possible and (b) increasing the number of
> entries that fit onto a single index page.

More like it makes the boundaries between leaf pages (i.e. high keys)
align with the "natural boundaries of the key space". Simple point
queries should practically never require more than a single leaf page
access as a result. Even somewhat complicated index scans that are
reasonably selective (think tens to low hundreds of matches) don't
tend to need to read more than a single leaf page match, at least with
equality type scan keys for the index qual.

> That's certainly true / helpful, and it makes the "first entry" issue
> much less common. But the issue is still there. Of course, this says
> nothing about the importance of the issue - the impact may easily be so
> small it's not worth worrying about.

Right. And I want to be clear: I'm really *not* sure how much it
matters. I just doubt that it's worth worrying about in v1 -- time
grows short. Although I agree that we should commit a v1 that leaves
the door open to improving matters in this area in v2.

> One case I've been thinking about is sorting using index, where we often
> read large part of the index.

That definitely seems like a case where reordering
work/desynchronization of the heap and index scans might be relatively
important.

> > I think that there is no question that this will need to not
> > completely disable kill_prior_tuple -- I'd be surprised if one single
> > person disagreed with me on this point. There is also a more nuanced
> > way of describing this same restriction, but we don't necessarily need
> > to agree on what exactly that is right now.
> >
>
> Even for the page-at-a-time approach? Or are you talking about the v2?

I meant that the current kill_prior_tuple behavior isn't sacred, and
can be revised in v2, for the benefit of lifting the restriction on
prefetching. But that's going to involve a trade-off of some kind. And
not a particularly simple one.

> Yeah. The basic idea was that by moving this above index AM it will work
> for all indexes automatically - but given the current discussion about
> kill_prior_tuple, locking etc. I'm not sure that's really feasible.
>
> The index AM clearly needs to have more control over this.

Cool. I think that that makes the layering question a lot clearer, then.


--
Peter Geoghegan




Re: Add trim_trailing_whitespace to editorconfig file

2024-02-15 Thread Jelte Fennema-Nio
On Thu, 15 Feb 2024 at 16:57, Peter Eisentraut  wrote:
> Is there a command-line tool to verify the syntax of .editorconfig and
> check compliance of existing files?
>
> I'm worried that expanding .editorconfig with detailed per-file rules
> will lead to a lot of mistakes and blind editing, if we don't have
> verification tooling.

I tried this one just now:
https://github.com/editorconfig-checker/editorconfig-checker.javascript

I fixed all the issues by updating my patchset to use "unset" for
insert_final_newline instead of "false".

All other files were already clean, which makes sense because the new
editorconfig rules are exactly the same as gitattributes (which I'm
guessing we are checking in CI/buildfarm). So I don't think it makes
sense to introduce another tool to check the same thing again.


v3-0002-Require-final-newline-in-.po-files.patch
Description: Binary data


v3-0004-Add-note-about-keeping-.editorconfig-and-.gitattr.patch
Description: Binary data


v3-0001-Remove-non-existing-file-from-.gitattributes.patch
Description: Binary data


v3-0005-Add-indent-information-about-gitattributes-to-edi.patch
Description: Binary data


v3-0003-Bring-editorconfig-in-line-with-gitattributes.patch
Description: Binary data


Re: index prefetching

2024-02-15 Thread Tomas Vondra



On 2/15/24 17:42, Peter Geoghegan wrote:
> On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra
>  wrote:
>> On 2/15/24 00:06, Peter Geoghegan wrote:
>>> I suppose that it might be much more important than I imagine it is
>>> right now, but it'd be nice to have something a bit more concrete to
>>> go on.
>>>
>>
>> This probably depends on which corner cases are considered important.
>>
>> The page-at-a-time approach essentially means index items at the
>> beginning of the page won't get prefetched (or vice versa, prefetch
>> distance drops to 0 when we get to end of index page).
> 
> I don't think that's true. At least not for nbtree scans.
> 
> As I went into last year, you'd get the benefit of the work I've done
> on "boundary cases" (most recently in commit c9c0589f from just a
> couple of months back), which helps us get the most out of suffix
> truncation. This maximizes the chances of only having to scan a single
> index leaf page in many important cases. So I can see no reason why
> index items at the beginning of the page are at any particular
> disadvantage (compared to those from the middle or the end of the
> page).
> 

I may be missing something, but it seems fairly self-evident to me an
entry at the beginning of an index page won't get prefetched (assuming
the page-at-a-time thing).

If I understand your point about boundary cases / suffix truncation,
that helps us by (a) picking the split in a way to minimize a single key
spanning multiple pages, if possible and (b) increasing the number of
entries that fit onto a single index page.

That's certainly true / helpful, and it makes the "first entry" issue
much less common. But the issue is still there. Of course, this says
nothing about the importance of the issue - the impact may easily be so
small it's not worth worrying about.

> Where you might have a problem is cases where it's just inherently
> necessary to visit more than a single leaf page, despite the best
> efforts of the nbtsplitloc.c logic -- cases where the scan just
> inherently needs to return tuples that "straddle the boundary between
> two neighboring pages". That isn't a particularly natural restriction,
> but it's also not obvious that it's all that much of a disadvantage in
> practice.
> 

One case I've been thinking about is sorting using index, where we often
read large part of the index.

>> It certainly was a great improvement, no doubt about that. I dislike the
>> restriction, but that's partially for aesthetic reasons - it just seems
>> it'd be nice to not have this.
>>
>> That being said, I'd be OK with having this restriction if it makes v1
>> feasible. For me, the big question is whether it'd mean we're stuck with
>> this restriction forever, or whether there's a viable way to improve
>> this in v2.
> 
> I think that there is no question that this will need to not
> completely disable kill_prior_tuple -- I'd be surprised if one single
> person disagreed with me on this point. There is also a more nuanced
> way of describing this same restriction, but we don't necessarily need
> to agree on what exactly that is right now.
> 

Even for the page-at-a-time approach? Or are you talking about the v2?

>> And I don't have answer to that :-( I got completely lost in the ongoing
>> discussion about the locking implications (which I happily ignored while
>> working on the PoC patch), layering tensions and questions which part
>> should be "in control".
> 
> Honestly, I always thought that it made sense to do things on the
> index AM side. When you went the other way I was surprised. Perhaps I
> should have said more about that, sooner, but I'd already said quite a
> bit at that point, so...
> 
> Anyway, I think that it's pretty clear that "naive desynchronization"
> is just not acceptable, because that'll disable kill_prior_tuple
> altogether. So you're going to have to do this in a way that more or
> less preserves something like the current kill_prior_tuple behavior.
> It's going to have some downsides, but those can be managed. They can
> be managed from within the index AM itself, a bit like the
> _bt_killitems() no-pin stuff does things already.
> 
> Obviously this interpretation suggests that doing things at the index
> AM level is indeed the right way to go, layering-wise. Does it make
> sense to you, though?
> 

Yeah. The basic idea was that by moving this above index AM it will work
for all indexes automatically - but given the current discussion about
kill_prior_tuple, locking etc. I'm not sure that's really feasible.

The index AM clearly needs to have more control over this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Synchronizing slots from primary to standby

2024-02-15 Thread Bertrand Drouvot
Hi,

On Thu, Feb 15, 2024 at 05:58:47PM +0530, Amit Kapila wrote:
> On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot
>  wrote:
> > Also I was thinking: what about adding an output to 
> > pg_sync_replication_slots()?
> > The output could be the number of sync slots that have been created and are
> > not considered as sync-ready during the execution.
> >
> 
> Yeah, we can consider outputting some information via this function
> like how many slots are synced and persisted but not sure what would
> be appropriate here. Because one can anyway find that or more
> information by querying pg_replication_slots.

Right, so maybe just return a bool that would indicate whether at least one
newly created slot is not sync-ready? (If so, then the details could be found
in pg_replication_slots).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-02-15 Thread Bertrand Drouvot
Hi,

On Thu, Feb 15, 2024 at 06:13:38PM +0530, Amit Kapila wrote:
> On Thu, Feb 15, 2024 at 12:07 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Since the slotsync function is committed, I rebased remaining patches.
> > And here is the V88 patch set.
> >

Thanks!

> 
> Please find the improvements in some of the comments in v88_0001*
> attached. Kindly include these in next version, if you are okay with
> it.

Looking at v88_0001, random comments:

1 ===

Commit message "Be enabling slot synchronization"

Typo? s:Be/By

2 ===

+It enables a physical standby to synchronize logical failover slots
+from the primary server so that logical subscribers are not blocked
+after failover.

Not sure "not blocked" is the right wording.
"can be resumed from the new primary" maybe? (was discussed in [1])

3 ===

+#define SlotSyncWorkerAllowed()\
+   (sync_replication_slots && pmState == PM_HOT_STANDBY && \
+SlotSyncWorkerCanRestart())

Maybe add a comment above the macro explaining the logic?

4 ===

+#include "replication/walreceiver.h"
 #include "replication/slotsync.h"

should be reverse order?

5 ===

+   if (SlotSyncWorker->syncing)
{
-   SpinLockRelease(>mutex);
+   SpinLockRelease(>mutex);
ereport(ERROR,

errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("cannot synchronize replication slots 
concurrently"));
}

worth adding a test in 040_standby_failover_slots_sync.pl for it?

6 ===

+static void
+slotsync_reread_config(bool restart)
+{

worth adding test(s) in 040_standby_failover_slots_sync.pl for it?

[1]: 
https://www.postgresql.org/message-id/CAA4eK1JcBG6TJ3o5iUd4z0BuTbciLV3dK4aKgb7OgrNGoLcfSQ%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: planner chooses incremental but not the best one

2024-02-15 Thread Tomas Vondra



On 2/15/24 13:45, Andrei Lepikhov wrote:
> On 15/2/2024 18:10, Tomas Vondra wrote:
>>
>>
>> On 2/15/24 07:50, Andrei Lepikhov wrote:
>>> On 18/12/2023 19:53, Tomas Vondra wrote:
>>>> On 12/18/23 11:40, Richard Guo wrote:
>>>> The challenge is where to get usable information about correlation
>>>> between columns. I only have a couple very rought ideas of what might
>>>> try. For example, if we have multi-column ndistinct statistics, we
>>>> might
>>>> look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from
>>>>
>>>>   ndistinct(b,c,d) / ndistinct(b,c)
>>>>
>>>> If we know how many distinct values we have for the predicate
>>>> column, we
>>>> could then estimate the number of groups. I mean, we know that for the
>>>> restriction "WHERE b = 3" we only have 1 distinct value, so we could
>>>> estimate the number of groups as
>>>>
>>>>   1 * ndistinct(b,c)
>>> Did you mean here ndistinct(c,d) and the formula:
>>> ndistinct(b,c,d) / ndistinct(c,d) ?
>>
>> Yes, I think that's probably a more correct ... Essentially, the idea is
>> to estimate the change in number of distinct groups after adding a
>> column (or restricting it in some way).
> Thanks, I got it. I just think how to implement such techniques with
> extensions just to test the idea in action. In the case of GROUP-BY we
> can use path hook, of course. But what if to invent a hook on clauselist
> estimation?

Maybe.

I have thought about introducing such a hook to alter estimation of
clauses, so I'm not opposed to it. Ofc, it depends on where the hook
would be, what it would be allowed to do, etc. And as it doesn't exist
yet, it'd be more of a "local" improvement to separate the changes into
an extension.
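
[Illustrative only -- a sketch of what such a hook could look like if it
mirrored clauselist_selectivity()'s signature. No such hook exists in core
today; the type and variable names below are made up.]

/* assumes nodes/pathnodes.h and optimizer/optimizer.h */
typedef Selectivity (*clauselist_estimation_hook_type) (PlannerInfo *root,
                                                        List *clauses,
                                                        int varRelid,
                                                        JoinType jointype,
                                                        SpecialJoinInfo *sjinfo);

extern PGDLLIMPORT clauselist_estimation_hook_type clauselist_estimation_hook;

/*
 * clauselist_selectivity() could then consult the hook first and fall back
 * to the built-in estimate when the extension declines (returns a negative
 * value, say):
 *
 *    if (clauselist_estimation_hook)
 *    {
 *        Selectivity s = clauselist_estimation_hook(root, clauses, varRelid,
 *                                                   jointype, sjinfo);
 *        if (s >= 0)
 *            return s;
 *    }
 */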

>>> Do you implicitly bear in mind here the necessity of tracking clauses
>>> that were applied to the data up to the moment of grouping?
>>>
>>
>> I don't recall what exactly I considered two months ago when writing the
>> message, but I don't see why we would need to track that beyond what we
>> already have. Shouldn't it be enough for the grouping to simply inspect
>> the conditions on the lower levels?
> Yes, exactly. I've thought about looking into baserestrictinfos and, if
> group-by references a subquery targetlist, into subqueries too.
> 

True. Something like that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-02-15 Thread Peter Geoghegan
On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra
 wrote:
> On 2/15/24 00:06, Peter Geoghegan wrote:
> > I suppose that it might be much more important than I imagine it is
> > right now, but it'd be nice to have something a bit more concrete to
> > go on.
> >
>
> This probably depends on which corner cases are considered important.
>
> The page-at-a-time approach essentially means index items at the
> beginning of the page won't get prefetched (or vice versa, prefetch
> distance drops to 0 when we get to end of index page).

I don't think that's true. At least not for nbtree scans.

As I went into last year, you'd get the benefit of the work I've done
on "boundary cases" (most recently in commit c9c0589f from just a
couple of months back), which helps us get the most out of suffix
truncation. This maximizes the chances of only having to scan a single
index leaf page in many important cases. So I can see no reason why
index items at the beginning of the page are at any particular
disadvantage (compared to those from the middle or the end of the
page).

Where you might have a problem is cases where it's just inherently
necessary to visit more than a single leaf page, despite the best
efforts of the nbtsplitloc.c logic -- cases where the scan just
inherently needs to return tuples that "straddle the boundary between
two neighboring pages". That isn't a particularly natural restriction,
but it's also not obvious that it's all that much of a disadvantage in
practice.

> It certainly was a great improvement, no doubt about that. I dislike the
> restriction, but that's partially for aesthetic reasons - it just seems
> it'd be nice to not have this.
>
> That being said, I'd be OK with having this restriction if it makes v1
> feasible. For me, the big question is whether it'd mean we're stuck with
> this restriction forever, or whether there's a viable way to improve
> this in v2.

I think that there is no question that this will need to not
completely disable kill_prior_tuple -- I'd be surprised if one single
person disagreed with me on this point. There is also a more nuanced
way of describing this same restriction, but we don't necessarily need
to agree on what exactly that is right now.

> And I don't have answer to that :-( I got completely lost in the ongoing
> discussion about the locking implications (which I happily ignored while
> working on the PoC patch), layering tensions and questions which part
> should be "in control".

Honestly, I always thought that it made sense to do things on the
index AM side. When you went the other way I was surprised. Perhaps I
should have said more about that, sooner, but I'd already said quite a
bit at that point, so...

Anyway, I think that it's pretty clear that "naive desynchronization"
is just not acceptable, because that'll disable kill_prior_tuple
altogether. So you're going to have to do this in a way that more or
less preserves something like the current kill_prior_tuple behavior.
It's going to have some downsides, but those can be managed. They can
be managed from within the index AM itself, a bit like the
_bt_killitems() no-pin stuff does things already.

Obviously this interpretation suggests that doing things at the index
AM level is indeed the right way to go, layering-wise. Does it make
sense to you, though?

-- 
Peter Geoghegan




Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-02-15 Thread Nathan Bossart
On Wed, Feb 14, 2024 at 01:02:26PM -0600, Nathan Bossart wrote:
> BTW I have been testing reverting commit 151c22d (i.e., un-reverting
> MAINTAIN) every month or two, and last I checked, it still applies pretty
> cleanly.  The only changes I've needed to make are to the catversion and to
> a hard-coded version in a test (16 -> 17).

Posting to get some cfbot coverage.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From ce4f0c8cc87c2534d46060dc0725783ad6275f21 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 15 Feb 2024 10:02:41 -0600
Subject: [PATCH v1 1/1] Revert "Revert MAINTAIN privilege and pg_maintain
 predefined role."

XXX: NEEDS CATVERSION BUMP

This reverts commit 151c22deee66a3390ca9a1c3675e29de54ae73fc.
---
 doc/src/sgml/ddl.sgml |  35 --
 doc/src/sgml/func.sgml|   2 +-
 .../sgml/ref/alter_default_privileges.sgml|   4 +-
 doc/src/sgml/ref/analyze.sgml |   6 +-
 doc/src/sgml/ref/cluster.sgml |  10 +-
 doc/src/sgml/ref/grant.sgml   |   3 +-
 doc/src/sgml/ref/lock.sgml|   4 +-
 .../sgml/ref/refresh_materialized_view.sgml   |   5 +-
 doc/src/sgml/ref/reindex.sgml |  23 ++--
 doc/src/sgml/ref/revoke.sgml  |   2 +-
 doc/src/sgml/ref/vacuum.sgml  |   6 +-
 doc/src/sgml/user-manag.sgml  |  12 ++
 src/backend/catalog/aclchk.c  |  15 +++
 src/backend/commands/analyze.c|  13 +-
 src/backend/commands/cluster.c|  43 +--
 src/backend/commands/indexcmds.c  |  34 ++---
 src/backend/commands/lockcmds.c   |   2 +-
 src/backend/commands/matview.c|   3 +-
 src/backend/commands/tablecmds.c  |  16 +--
 src/backend/commands/vacuum.c |  65 +-
 src/backend/utils/adt/acl.c   |   8 ++
 src/bin/pg_dump/dumputils.c   |   1 +
 src/bin/pg_dump/t/002_pg_dump.pl  |   2 +-
 src/bin/psql/tab-complete.c   |   6 +-
 src/include/catalog/pg_authid.dat |   5 +
 src/include/commands/tablecmds.h  |   5 +-
 src/include/commands/vacuum.h |   5 +-
 src/include/nodes/parsenodes.h|   3 +-
 src/include/utils/acl.h   |   5 +-
 .../expected/cluster-conflict-partition.out   |   8 +-
 .../specs/cluster-conflict-partition.spec |   2 +-
 .../perl/PostgreSQL/Test/AdjustUpgrade.pm |  11 ++
 src/test/regress/expected/cluster.out |   7 ++
 src/test/regress/expected/create_index.out|   4 +-
 src/test/regress/expected/dependency.out  |  22 ++--
 src/test/regress/expected/privileges.out  | 116 ++
 src/test/regress/expected/rowsecurity.out |  34 ++---
 src/test/regress/sql/cluster.sql  |   5 +
 src/test/regress/sql/dependency.sql   |   2 +-
 src/test/regress/sql/privileges.sql   |  68 ++
 40 files changed, 444 insertions(+), 178 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 9d7e2c756b..8616a8e9cc 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1868,8 +1868,8 @@ ALTER TABLE products RENAME TO items;
INSERT, UPDATE, DELETE,
TRUNCATE, REFERENCES, TRIGGER,
CREATE, CONNECT, TEMPORARY,
-   EXECUTE, USAGE, SET
-   and ALTER SYSTEM.
+   EXECUTE, USAGE, SET,
+   ALTER SYSTEM, and MAINTAIN.
The privileges applicable to a particular
object vary depending on the object's type (table, function, etc.).
More detail about the meanings of these privileges appears below.
@@ -2160,7 +2160,19 @@ REVOKE ALL ON accounts FROM PUBLIC;
   
  
 
-   
+
+   
+MAINTAIN
+
+ 
+  Allows VACUUM, ANALYZE,
+  CLUSTER, REFRESH MATERIALIZED VIEW,
+  REINDEX, and LOCK TABLE on a
+  relation.
+ 
+
+   
+  
 
The privileges required by other commands are listed on the
reference page of the respective command.
@@ -2309,6 +2321,11 @@ REVOKE ALL ON accounts FROM PUBLIC;
   A
   PARAMETER
  
+ 
+  MAINTAIN
+  m
+  TABLE
+ 
  

   
@@ -2399,7 +2416,7 @@ REVOKE ALL ON accounts FROM PUBLIC;
  
  
   TABLE (and table-like objects)
-  arwdDxt
+  arwdDxtm
   none
   \dp
  
@@ -2465,11 +2482,11 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO miriam_rw;
 
 = \dp mytable
   Access privileges
- Schema |  Name   | Type  |   Access privileges   |   Column privileges   | Policies
-+-+---+---+---+--
- public | mytable | table | miriam=arwdDxt/miriam+| col1:+|
-| |   | =r/miriam+|   miriam_rw=rw/miriam |
-| |   | admin=arw/miriam  |   |
+ Schema |  Name   | Type  |   

Re: Add trim_trailing_whitespace to editorconfig file

2024-02-15 Thread Peter Eisentraut

On 15.02.24 10:26, Jelte Fennema-Nio wrote:

On Wed, 14 Feb 2024 at 23:19, Daniel Gustafsson  wrote:

+1 from me. But when do we want it to be false? That is, why not
declare it true for all file types?


Regression test .out files commonly have spaces at the end of the line.  (Not
to mention the ECPG .c files but they probably really shouldn't have.)


Attached is v2, which now makes the rules between gitattributes and
editorconfig completely identical. As well as improving two minor
things about .gitattributes before doing that.


Is there a command-line tool to verify the syntax of .editorconfig and 
check compliance of existing files?


I'm worried that expanding .editorconfig with detailed per-file rules 
will lead to a lot of mistakes and blind editing, if we don't have 
verification tooling.






Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-15 Thread Peter Eisentraut

On 15.02.24 13:42, Koshi Shibagaki (Fujitsu) wrote:

However, crypt() and gen_salt() do not use OpenSSL as mentioned in [2].
Therefore, if we run crypt() and gen_salt() on a machine with FIPS mode enabled,
they are not affected by FIPS mode. This means we can use encryption algorithms
disallowed in FIPS.

I would like to change the proprietary implementations of crypt() and gen_salt()
to use OpenSSL API.
If it's not a problem, I am going to create a patch, but if you have a better
approach, please let me know.


The problems are:

1. All the block ciphers currently supported by crypt() and gen_salt() 
are not FIPS-compliant.


2. The crypt() and gen_salt() methods built on top of them (modes of 
operation, kind of) are not FIPS-compliant.


3. The implementations (crypt-blowfish.c, crypt-des.c, etc.) are not 
structured in a way that OpenSSL calls can easily be patched in.


So if you want FIPS-compliant cryptography, these interfaces look like a 
dead end.  I don't know if there are any modern equivalents of these 
functions that we should be supplying instead.






Re: Allow passing extra options to initdb for tests

2024-02-15 Thread Tom Lane
Daniel Gustafsson  writes:
> On 15 Feb 2024, at 11:38, Peter Eisentraut  wrote:
>> We don't have a man page for pg_regress, so there is no place to 
>> comprehensively document all the options and their interactions.

> This comes up every now and again, just yesterday there was a question on
> -general [0] about alternate output files which also are
> undocumented.

Really?

https://www.postgresql.org/docs/devel/regress-variant.html

> Maybe it's time to add documentation for pg_regress?

I'm inclined to think that a formal man page wouldn't be that useful,
since nobody ever invokes pg_regress directly; as Peter says, what
is of interest is the "make check" targets and the corresponding meson
behaviors.  I think 32.1 is already reasonably thorough about the
make targets; but the complete lack of equivalent info about what
to do in a meson build clearly needs to be rectified.

regards, tom lane




Re: Reducing output size of nodeToString

2024-02-15 Thread Matthias van de Meent
On Thu, 15 Feb 2024 at 13:59, Peter Eisentraut  wrote:
>
> Thanks, this patch set is a good way to incrementally work through these
> changes.
>
> I have looked at
> v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch today.
> Here are my thoughts:
>
> I believe we had discussed offline to not omit enum fields with value 0
> (WRITE_ENUM_FIELD).  This is because the values of enum fields are
> implementation artifacts, and this could be confusing for readers.

Thanks for reminding me, I didn't remember this when I worked on
updating the patchset. I'll update this soon.

> I have some concerns about the round-trippability of float values.  If
> we do, effectively, if (node->fldname != 0.0), then I think this would
> also match negative zero, but when we read it back it, it would get
> assigned positive zero.  Maybe there are other edge cases like this.
> Might be safer to not mess with this.

That's a good point. Would an additional check that the sign of the
field equals the default's sign be enough for this? As for other
cases, I'm not sure we currently want to support non-normal floats,
even if it is technically possible to do the round-trip in the current
format.
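
[A tiny illustration of that idea -- hypothetical helper, not from the patch:
only treat the value as the default when both the comparison and the sign bit
match, so -0.0 would still get serialized.]

#include <math.h>
#include <stdbool.h>

static inline bool
float_field_is_default(double val, double dflt)
{
    /* val == dflt also matches -0.0 vs 0.0, so compare the sign bits too */
    return val == dflt && (signbit(val) != 0) == (signbit(dflt) != 0);
}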

> On the reading side, the macro nesting has gotten a bit out of hand. :)
> We had talked earlier in the thread about the _DIRECT macros and you
> said there were left over from something else you want to try, but I see
> nothing else in this patch set uses this.  I think this could all be
> much simpler, like (omitting required punctuation)
[...]
> Not only is this simpler, but it might also have better performance,
> because we don't have separate pg_strtok_next() and pg_strtok() calls in
> sequence.

Good points. I'll see what I can do here.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: index prefetching

2024-02-15 Thread Tomas Vondra
On 2/15/24 00:06, Peter Geoghegan wrote:
> On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
>  wrote:
>
>> ...
> 
> 2. Are you sure that the leaf-page-at-a-time thing is such a huge
> hindrance to effective prefetching?
> 
> I suppose that it might be much more important than I imagine it is
> right now, but it'd be nice to have something a bit more concrete to
> go on.
> 

This probably depends on which corner cases are considered important.

The page-at-a-time approach essentially means index items at the
beginning of the page won't get prefetched (or vice versa, prefetch
distance drops to 0 when we get to end of index page).

That may be acceptable, considering we can usually fit 200+ index items
on a single page. Even then it limits what effective_io_concurrency
values are sensible, but in my experience the benefits quickly diminish
past ~32.


> 3. Even if it is somewhat important, do you really need to get that
> part working in v1?
> 
> Tomas' original prototype worked with the leaf-page-at-a-time thing,
> and that still seemed like a big improvement to me. While being less
> invasive, in effect. If we can agree that something like that
> represents a useful step in the right direction (not an evolutionary
> dead end), then we can make good incremental progress within a single
> release.
> 

It certainly was a great improvement, no doubt about that. I dislike the
restriction, but that's partially for aesthetic reasons - it just seems
it'd be nice to not have this.

That being said, I'd be OK with having this restriction if it makes v1
feasible. For me, the big question is whether it'd mean we're stuck with
this restriction forever, or whether there's a viable way to improve
this in v2.

And I don't have answer to that :-( I got completely lost in the ongoing
discussion about the locking implications (which I happily ignored while
working on the PoC patch), layering tensions and questions which part
should be "in control".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: When extended query protocol ends?

2024-02-15 Thread Dave Cramer
On Wed, 14 Feb 2024 at 17:55, Tatsuo Ishii  wrote:

> >>> From [1] I think the JDBC driver sends something like below if
> >>> autosave=always option is specified.
> >>>
> >>> "BEGIN READ ONLY" Parse/Bind/Eexecute (in the extended query protocol)
> >>> "SAVEPOINT PGJDBC_AUTOSAVE" (in the simple query protocol)
> >>>
> >>> It seems the SAVEPOINT is sent without finishing the extended query
> >>> protocol (i.e. without Sync message). Is it possible for the JDBC
> >>> driver to issue a Sync message before sending SAVEPOINT in simple
> >>> query protocol? Or you can send SAVEPOINT using the extended query
> >>> protocol.
> >>>
> >>> [1]
> >>>
> https://www.pgpool.net/pipermail/pgpool-general/2023-December/009051.html
> >>
> >>
> >> Can you ask the OP what version of the driver they are using. From what
> I
> >> can tell we send BEGIN using SimpleQuery.
> >
> > Sure. I will get back once I get the JDBC version.
>
> Here it is:
> > JDBC driver version used:42.5.1 Regards, Karel.
>

Can you ask the OP what they are doing in the startup? I'm trying to
replicate their situation.
Looks like possibly 'setReadOnly' and 'select version()'.

Thanks,
Dave

>
>


Re: Reducing output size of nodeToString

2024-02-15 Thread Peter Eisentraut
Thanks, this patch set is a good way to incrementally work through these 
changes.


I have looked at 
v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch today. 
Here are my thoughts:


I believe we had discussed offline to not omit enum fields with value 0 
(WRITE_ENUM_FIELD).  This is because the values of enum fields are 
implementation artifacts, and this could be confusing for readers. 
(This could be added as a squeeze-out-every-byte change later, but if 
we're going to keep the format fit for human reading, I think we should 
skip this.)


I have some concerns about the round-trippability of float values.  If 
we do, effectively, if (node->fldname != 0.0), then I think this would 
also match negative zero, but when we read it back it, it would get 
assigned positive zero.  Maybe there are other edge cases like this. 
Might be safer to not mess with this.


On the reading side, the macro nesting has gotten a bit out of hand. :) 
We had talked earlier in the thread about the _DIRECT macros and you 
said there were left over from something else you want to try, but I see 
nothing else in this patch set uses this.  I think this could all be 
much simpler, like (omitting required punctuation)


#define READ_INT_FIELD(fldname, default)
if ((token = next_field(fldname, )))
local_node->fldname = atoi(token);
else
local_node->fldname = default;

where next_field() would

1. read the next token
2. if it is ":fldname", continue;
   else rewind the read pointer and return NULL
3. read the next token and return that

Not only is this simpler, but it might also have better performance, 
because we don't have separate pg_strtok_next() and pg_strtok() calls in 
sequence.
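
[Filling in the punctuation, a hypothetical version of that shape could look
roughly like the following. next_field() and the save/rewind helpers are
illustrative only, not existing readfuncs.c code; the macro assumes the usual
local_node variable is in scope.]

/* assumes <stdlib.h>, <string.h> and the pg_strtok() tokenizer from read.c */
static const char *
next_field(const char *fldname, int *length)
{
    const char *save_pos = read_save_position();    /* hypothetical helper */
    const char *token = pg_strtok(length);

    /* 1. read the next token; 2. if it is ":fldname", continue */
    if (token == NULL || *length < 1 || token[0] != ':' ||
        strlen(fldname) != (size_t) (*length - 1) ||
        strncmp(token + 1, fldname, *length - 1) != 0)
    {
        /* else rewind the read pointer and return NULL */
        read_restore_position(save_pos);            /* hypothetical helper */
        return NULL;
    }

    /* 3. read the next token and return that */
    return pg_strtok(length);
}

#define READ_INT_FIELD(fldname, defaultval) \
    do { \
        int         length; \
        const char *token = next_field(CppAsString(fldname), &length); \
        if (token != NULL) \
            local_node->fldname = atoi(token); \
        else \
            local_node->fldname = (defaultval); \
    } while (0)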






Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

2024-02-15 Thread Daniel Gustafsson
> On 15 Feb 2024, at 04:58, David Benjamin  wrote:
> 
> By the way, I'm unable to add the patch to the next commitfest due to the 
> cool off period for new accounts. How long is that period? I don't suppose 
> there's a way to avoid it?

There is a way to expedite the cooling-off period (it's a SPAM prevention
measure), but I don't have access to it.  In the meantime I've added the patch
for you, and once the cooling off is over we can add your name as author.

https://commitfest.postgresql.org/47/4835/

I'm currently reviewing it and will get back to you soon on that.

--
Daniel Gustafsson





Re: planner chooses incremental but not the best one

2024-02-15 Thread Andrei Lepikhov

On 15/2/2024 18:10, Tomas Vondra wrote:



On 2/15/24 07:50, Andrei Lepikhov wrote:

On 18/12/2023 19:53, Tomas Vondra wrote:

On 12/18/23 11:40, Richard Guo wrote:
The challenge is where to get usable information about correlation
between columns. I only have a couple very rought ideas of what might
try. For example, if we have multi-column ndistinct statistics, we might
look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from

  ndistinct(b,c,d) / ndistinct(b,c)

If we know how many distinct values we have for the predicate column, we
could then estimate the number of groups. I mean, we know that for the
restriction "WHERE b = 3" we only have 1 distinct value, so we could
estimate the number of groups as

  1 * ndistinct(b,c)

Did you mean here ndistinct(c,d) and the formula:
ndistinct(b,c,d) / ndistinct(c,d) ?


Yes, I think that's probably a more correct ... Essentially, the idea is
to estimate the change in number of distinct groups after adding a
column (or restricting it in some way).
Thanks, I got it. I'm just thinking about how to implement such techniques
with extensions, just to test the idea in action. In the case of GROUP-BY we
can use the path hook, of course. But what about inventing a hook for
clauselist estimation?

Do you implicitly bear in mind here the necessity of tracking clauses
that were applied to the data up to the moment of grouping?



I don't recall what exactly I considered two months ago when writing the
message, but I don't see why we would need to track that beyond what we
already have. Shouldn't it be enough for the grouping to simply inspect
the conditions on the lower levels?
Yes, exactly. I've thought about looking into baserestrictinfos and, if 
group-by references a subquery targetlist, into subqueries too.


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Synchronizing slots from primary to standby

2024-02-15 Thread Amit Kapila
On Thu, Feb 15, 2024 at 12:07 PM Zhijie Hou (Fujitsu)
 wrote:
>
> Since the slotsync function is committed, I rebased remaining patches.
> And here is the V88 patch set.
>

Please find the improvements in some of the comments in v88_0001*
attached. Kindly include these in next version, if you are okay with
it.

-- 
With Regards,
Amit Kapila.
diff --git a/src/backend/access/transam/xlogrecovery.c 
b/src/backend/access/transam/xlogrecovery.c
index 68f48c052c..6173143bcf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1469,16 +1469,16 @@ FinishWalRecovery(void)
XLogShutdownWalRcv();
 
/*
-* Shutdown the slot sync workers to prevent potential conflicts between
-* user processes and slotsync workers after a promotion.
+* Shutdown the slot sync worker to prevent it from keep trying to fetch
+* slots and drop any temporary slots.
 *
 * We do not update the 'synced' column from true to false here, as any
 * failed update could leave 'synced' column false for some slots. This
 * could cause issues during slot sync after restarting the server as a
-* standby. While updating after switching to the new timeline is an
-* option, it does not simplify the handling for 'synced' column.
-* Therefore, we retain the 'synced' column as true after promotion as 
it
-* may provide useful information about the slot origin.
+* standby. While updating the 'synced' column after switching to the 
new
+* timeline is an option, it does not simplify the handling for the
+* 'synced' column. Therefore, we retain the 'synced' column as true 
after
+* promotion as it may provide useful information about the slot origin.
 */
ShutDownSlotSync();
 
diff --git a/src/backend/postmaster/postmaster.c 
b/src/backend/postmaster/postmaster.c
index b0334ce449..011b5f4828 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -168,7 +168,7 @@
  * they will never become live backends.  dead_end children are not assigned a
  * PMChildSlot.  dead_end children have bkend_type NORMAL.
  *
- * "Special" children such as the startup, bgwriter, autovacuum launcher and
+ * "Special" children such as the startup, bgwriter, autovacuum launcher, and
  * slot sync worker tasks are not in this list.  They are tracked via 
StartupPID
  * and other pid_t variables below.  (Thus, there can't be more than one of any
  * given "special" child process type.  We use BackendList entries for any
@@ -3415,7 +3415,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, archiver, slot sync worker or background worker.
+ * walwriter, autovacuum, archiver, slot sync worker, or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c 
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e40cdbe4d9..04271ee703 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -6,8 +6,8 @@
  * loaded as a dynamic module to avoid linking the main server binary with
  * libpq.
  *
- * Apart from walreceiver, the libpq-specific routines here are now being used
- * by logical replication workers and slot sync worker as well.
+ * Apart from walreceiver, the libpq-specific routines are now being used by
+ * logical replication workers and slot synchronization.
  *
  * Portions Copyright (c) 2010-2024, PostgreSQL Global Development Group
  *
diff --git a/src/backend/replication/logical/slotsync.c 
b/src/backend/replication/logical/slotsync.c
index 6492582156..56eb0ea11b 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -22,10 +22,10 @@
  * which we can call pg_sync_replication_slots() periodically to perform
  * syncs.
  *
- * The slot sync worker worker waits for a period of time before the next
- * synchronization, with the duration varying based on whether any slots were
- * updated during the last cycle. Refer to the comments above
- * wait_for_slot_activity() for more details.
+ * The slot sync worker waits for some time before the next synchronization,
+ * with the duration varying based on whether any slots were updated during
+ * the last cycle. Refer to the comments above wait_for_slot_activity() for
+ * more details.
  *
  * Any standby synchronized slots will be dropped if they no longer need
  * to be synchronized. See comment atop drop_local_obsolete_slots() for more
@@ -1449,9 +1449,8 @@ ShutDownSlotSync(void)
 /*
  * SlotSyncWorkerCanRestart
  *
- * Returns true if the worker is allowed to restart if enough time has

Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-15 Thread Koshi Shibagaki (Fujitsu)
Hi
This is Shibagaki.

When FIPS mode is enabled, some encryption algorithms cannot be used.
Since PostgreSQL 15, pgcrypto requires OpenSSL [1], so digest() and other
functions follow this policy as well.

However, crypt() and gen_salt() do not use OpenSSL as mentioned in [2].
Therefore, if we run crypt() and gen_salt() on a machine with FIPS mode enabled,
they are not affected by FIPS mode. This means we can use encryption algorithms 
disallowed in FIPS.

I would like to change the proprietary implementations of crypt() and gen_salt()
to use OpenSSL API.
If it's not a problem, I am going to create a patch, but if you have a better 
approach, please let me know.
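
[Illustrative sketch only, not the proposed patch: with OpenSSL 3.x the
availability check could be delegated to the library, so that the active
provider set (e.g. the FIPS provider) decides which algorithms are allowed.
The function names below are real OpenSSL 3.x EVP APIs, but how they would be
wired into pgcrypto is an assumption.]

#include <stdbool.h>
#include <openssl/evp.h>

static bool
cipher_allowed(const char *name)
{
    EVP_CIPHER *cipher = EVP_CIPHER_fetch(NULL, name, NULL);

    if (cipher == NULL)
        return false;           /* e.g. "DES-ECB" under the FIPS provider */
    EVP_CIPHER_free(cipher);
    return true;
}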

Thank you


[1] 
https://github.com/postgres/postgres/commit/db7d1a7b0530e8cbd045744e1c75b0e63fb6916f
[2] https://peter.eisentraut.org/blog/2023/12/05/postgresql-and-fips-mode

crypt() and gen_salt() are run in the example below.

/

-- OS RHEL8.6

$openssl version
OpenSSL 1.1.1k  FIPS 25 Mar 2021

$fips-mode-setup --check
FIPS mode is enabled.

$./pgsql17/bin/psql
psql (17devel)
Type "help" for help.

postgres=# SHOW server_version;
 server_version 

 17devel
(1 row)

postgres=# SELECT digest('data','md5');
ERROR:  Cannot use "md5": Cipher cannot be initialized

postgres=# SELECT crypt('new password',gen_salt('md5')); -- md5 is not 
available when fips mode is turned on. This is a normal behavior
ERROR:  crypt(3) returned NULL

postgres=# SELECT crypt('new password',gen_salt('des')); -- however, des is 
available. This may break a FIPS rule
 crypt
---
 32REGk7H6dSnE
(1 row)

/

FYI - OpenSSL itself cannot use DES algorithm while encrypting files. This is 
an expected behavior.

---
Fujitsu Limited
Shibagaki Koshi
shibagaki.ko...@fujitsu.com






Re: Synchronizing slots from primary to standby

2024-02-15 Thread Amit Kapila
On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot
 wrote:
>
> On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote:
> > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu)
> >  wrote:
> > > Attach the v2 patch here.
> > >
> > > Apart from the new log message. I think we can add one more debug message 
> > > in
> > > reserve_wal_for_local_slot, this could be useful to analyze the failure.
> >
> > Yeah, that can also be helpful, but the added message looks naive to me.
> > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno);
> >
> > Instead of the above, how about something like: "segno: %ld of
> > purposed restart_lsn for the synced slot, oldest_segno: %ld
> > available"?
>
> Looks good to me. I'm not sure if it would make more sense to elog only if
> segno < oldest_segno means just before the XLogSegNoOffsetToRecPtr() call?
>
> But I'm fine with the proposed location too.
>

I am also fine either way, but the current location gives the required
information in more cases and could be helpful in debugging this new
facility.

> >
> > > And we
> > > can also enable the DEBUG log in the 040 tap-test, I see we have similar
> > > setting in 010_logical_decoding_timline and logging debug1 message doesn't
> > > increase noticable time on my machine. These are done in 0002.
> > >
> >
> > I haven't tested it but I think this can help in debugging BF
> > failures, if any. I am not sure if to keep it always like that but
> > till the time these tests are stabilized, this sounds like a good
> > idea. So, how, about just making test changes as a separate patch so
> > that later if required we can revert/remove it easily? Bertrand, do
> > you have any thoughts on this?
>
> +1 on having DEBUG log in the 040 tap-test until it's stabilized (I think we
> took the same approach for 035_standby_logical_decoding.pl IIRC) and then 
> revert
> it back.
>

Good to know!

> Also I was thinking: what about adding an output to 
> pg_sync_replication_slots()?
> The output could be the number of sync slots that have been created and are
> not considered as sync-ready during the execution.
>

Yeah, we can consider outputting some information via this function
like how many slots are synced and persisted but not sure what would
be appropriate here. Because one can anyway find that or more
information by querying pg_replication_slots. I think we can keep
discussing what makes more sense as a return value but let's commit
the debug/log patches as they will be helpful to analyze BF failures
or any other issues reported.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-02-15 Thread Bertrand Drouvot
Hi,

On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote:
> On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu)
>  wrote:
> > Attach the v2 patch here.
> >
> > Apart from the new log message. I think we can add one more debug message in
> > reserve_wal_for_local_slot, this could be useful to analyze the failure.
> 
> Yeah, that can also be helpful, but the added message looks naive to me.
> + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno);
> 
> Instead of the above, how about something like: "segno: %ld of
> purposed restart_lsn for the synced slot, oldest_segno: %ld
> available"?

Looks good to me. I'm not sure if it would make more sense to elog only if
segno < oldest_segno, i.e. just before the XLogSegNoOffsetToRecPtr() call?

But I'm fine with the proposed location too.

> 
> > And we
> > can also enable the DEBUG log in the 040 tap-test, I see we have similar
> > setting in 010_logical_decoding_timline and logging debug1 message doesn't
> > increase noticable time on my machine. These are done in 0002.
> >
> 
> I haven't tested it but I think this can help in debugging BF
> failures, if any. I am not sure if to keep it always like that but
> till the time these tests are stabilized, this sounds like a good
> idea. So, how, about just making test changes as a separate patch so
> that later if required we can revert/remove it easily? Bertrand, do
> you have any thoughts on this?

+1 on having DEBUG log in the 040 tap-test until it's stabilized (I think we
took the same approach for 035_standby_logical_decoding.pl IIRC) and then revert
it back.

Also I was thinking: what about adding an output to pg_sync_replication_slots()?
The output could be the number of sync slots that have been created and are
not considered as sync-ready during the execution. I think that could be a good
addition to v2-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch
proposed here (should trigger special attention in case of non zero value).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add system identifier to backup manifest

2024-02-15 Thread Robert Haas
On Thu, Feb 15, 2024 at 3:05 PM Amul Sul  wrote:
> Kindly have a look at the attached version.

IMHO, 0001 looks fine, except probably the comment could be phrased a
bit more nicely. That can be left for whoever commits this to
wordsmith. Michael, what are your plans?

0002 seems like a reasonable approach, but there's a hunk in the wrong
patch: 0004 modifies pg_combinebackup's check_control_files to use
get_dir_controlfile() rather than get_controlfile(), but it looks to
me like that should be part of 0002.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: make add_paths_to_append_rel aware of startup cost

2024-02-15 Thread David Rowley
On Thu, 15 Feb 2024 at 21:42, Andy Fan  wrote:
> I found the both plans have the same cost, I can't get the accurate
> cause of this after some hours research, but it is pretty similar with
> 7516056c584e3, so I uses a similar strategy to stable it. is it
> acceptable?

It's pretty hard to say.  I can only guess why this test would be
flapping like this. I see it's happened before on mylodon, so probably
not a cosmic ray.  It's not like add_path() chooses a random path when
the costs are the same, so I wondered if something similar to what led
to f03a9ca4 is going on here. In particular, see [1].

On master, I set a breakpoint in try_nestloop_path() to break on
"outerrel->relid==1 && innerrel->relid==2". I see the total Nested
Loop cost comes out the same with the join order reversed.

Which is:

 ->  Nested Loop  (cost=0.00..1500915.00 rows=1 width=4)

Doing the same with your patch applied, I get:

->  Nested Loop  (cost=0.00..600925.00 rows=4000 width=4)

and forcing the join order to swap with the debugger, I see:

->  Nested Loop  (cost=0.00..600940.00 rows=4000 width=4)

So there's a difference now, but it's quite small. If it was a problem
like we had on [1], then since tenk1 and tenk2 have 345 pages (on my
machine), if relpages is down 1 or 2 pages, we'll likely get more of a
costing difference than 600925 vs 600940.

If I use:

explain
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;

I get:

->  Nested Loop  (cost=0.00..2415.03 rows=10 width=4)

and with the join order reversed, I get:

 ->  Nested Loop  (cost=0.00..2440.00 rows=10 width=4)

I'd be more happy using this one as percentage-wise, the cost
difference is much larger.  I don't quite have the will to go through
proving what the actual problem is here. I think [1] already proved
the relpages problem can (or could) happen.

I checked that the t2.thousand = 0 query still tests the cheap
startup paths in add_paths_to_append_rel() and it does. If I flip
startup_subpaths_valid to false in the debugger, the plan flips to:

QUERY PLAN
---
 Limit  (cost=470.12..514.00 rows=1 width=4)
   ->  Append  (cost=470.12..952.79 rows=11 width=4)
 ->  Hash Join  (cost=470.12..952.73 rows=10 width=4)
   Hash Cond: (t1.tenthous = t2.tenthous)
   ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=1 width=8)
   ->  Hash  (cost=470.00..470.00 rows=10 width=4)
 ->  Seq Scan on tenk2 t2  (cost=0.00..470.00
rows=10 width=4)
   Filter: (thousand = 0)
 ->  Result  (cost=0.00..0.01 rows=1 width=4)

So, if nobody has any better ideas, I'm just going to push the " and
t2.thousand = 0" adjustment.

David

[1] https://www.postgresql.org/message-id/4174.1563239552%40sss.pgh.pa.us




Re: Memory consumed by paths during partitionwise join planning

2024-02-15 Thread Ashutosh Bapat
On Thu, Feb 15, 2024 at 9:41 AM Andrei Lepikhov
 wrote:
>
> On 6/2/2024 19:51, Ashutosh Bapat wrote:
> > On Fri, Dec 15, 2023 at 5:22 AM Ashutosh Bapat
> > The patches are raw. make check has some crashes that I need to fix. I
> > am waiting to hear whether this is useful and whether the design is on
> > the right track.
> Let me write words of opinion on that feature.
> I generally like the idea of a refcount field. We had problems
> generating alternative paths in extensions and designing the Asymmetric
> Join feature [1] when we proposed an alternative path one level below
> and called the add_path() routine. We had lost the path in the path list
> referenced from the upper RelOptInfo. This approach allows us to make
> more handy solutions instead of hacking with a copy/restore pathlist.

Thanks for your comments. I am glad that you find this useful.

> But I'm not sure about freeing unreferenced paths. I would have to see
> alternatives in the pathlist.

I didn't understand this. Can you please elaborate? A path in any
pathlist is referenced. An unreferenced path should not be in any
pathlist.

> About partitioning. As I discovered planning issues connected to
> partitions, the painful problem is a rule, according to which we are
> trying to use all nomenclature of possible paths for each partition.
> With indexes, it quickly increases optimization work. IMO, this can help
> a 'symmetrical' approach, which could restrict the scope of possible
> pathways for upcoming partitions if we filter some paths in a set of
> previously planned partitions.

filter or free?

> Also, I am glad to see a positive opinion about the path_walker()
> routine. Somewhere else, for example, in [2], it seems we need it too.
>

I agree that we should introduce path_walker() yes.

I am waiting for David's opinion about the performance numbers shared
in [1] before working further on the patches and bringing them out of
WIP state.

[1] 
postgr.es/m/CAExHW5uDkMQL8SicV3_=aycswwmtnr8gzjo310j7sv7tmlo...@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat




Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

2024-02-15 Thread David Benjamin
By the way, I'm unable to add the patch to the next commitfest due to the
cool off period for new accounts. How long is that period? I don't suppose
there's a way to avoid it?

On Mon, Feb 12, 2024 at 11:31 AM David Benjamin  wrote:

> On Mon, Feb 12, 2024 at 9:38 AM Daniel Gustafsson  wrote:
>
>> > On 11 Feb 2024, at 19:19, David Benjamin  wrote:
>> > It turns out the parts that came from the OpenSSL socket BIO were a
>> no-op, and in fact PostgreSQL is relying on it being a no-op. Instead, it's
>> cleaner to just define a custom BIO the normal way, which then leaves the
>> more standard BIO_get_data mechanism usable. This also avoids the risk that
>> a future OpenSSL will add a now BIO_ctrl to the socket type, with libssl
>> calling into it, and then break some assumptions made by PostgreSQL.
>>
>> +   case BIO_CTRL_FLUSH:
>> +   /* libssl expects all BIOs to support BIO_flush.
>> */
>> +   res = 1;
>> +   break;
>>
>> Will this always be true?  Returning 1 implies that we have flushed all
>> data on
>> the socket, but what if we just before called BIO_set_retry..XX()?
>>
>
> The real one is also just a no-op. :-)
> https://github.com/openssl/openssl/blob/master/crypto/bio/bss_sock.c#L215
>
> This is used in places like buffer BIO or the FILE* BIO, where BIO_write
> might accept data, but stage it into a buffer, to be flushed later. libssl
> ends up calling BIO_flush at the end of every flight, which in turn means
> that BIOs used with libssl need to support it, even if to return true
> because there's nothing to flush. (Arguably TCP sockets could have used a
> flush concept, to help control Nagle's algorithm, but for better or worse,
> that's a socket-wide TCP_NODELAY option, rather than an explicit flush
> call.)
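
[For illustration of the quoted explanation only -- a minimal custom
BIO_METHOD sketch with the kind of no-op flush handling described above.
Names are made up; this is not the patch's code.]

#include <openssl/bio.h>

static long
example_sock_ctrl(BIO *b, int cmd, long num, void *ptr)
{
    /* nothing is ever buffered in a raw socket BIO, so "flush" succeeds */
    if (cmd == BIO_CTRL_FLUSH)
        return 1;
    return 0;
}

static BIO_METHOD *
example_bio_method(void)
{
    BIO_METHOD *method = BIO_meth_new(BIO_get_new_index() | BIO_TYPE_SOURCE_SINK,
                                      "example socket BIO");

    BIO_meth_set_ctrl(method, example_sock_ctrl);
    /* BIO_meth_set_read()/BIO_meth_set_write() would wrap the real socket
     * I/O; per-connection state is attached with BIO_set_data() and read
     * back inside the callbacks with BIO_get_data(). */
    return method;
}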
>
> BIO_set_retry.. behaves like POSIX I/O, where a failed EWOULDBLOCK write
> is as if you never wrote to the socket at all and doesn't impact
> socket state. That is, the data hasn't been accepted yet. It's not expected
> for BIO_flush to care about the rejected write data. (Also I don't believe
> libssl will ever trigger this case.) It's confusing because unlike an
> EWOULDBLOCK errno, BIO_set_retry.. *is* itself BIO state, but that's just
> because the BIO calling convention is goofy and didn't just return the
> error out of the return value. So OpenSSL just stashes the bit on the BIO
> itself, for you to query out immediately afterwards.
>
>
>> > I've attached a patch which does that. The existing SSL tests pass with
>> it, tested on Debian stable. (Though it took me a few iterations to figure
>> out how to run the SSL tests, so it's possible I've missed something.)
>>
>> We've done a fair bit of work on making them easier to run, so I'm
>> curious if
>> you saw any room for improvements there as someone coming to them for the
>> first
>> time?
>>
>
>  A lot of my time was just trying to figure out how to run the tests in
> the first place, so perhaps documentation? But I may just have been looking
> in the wrong spot and honestly didn't really know what I was doing. I can
> try to summarize what I did (from memory), and perhaps that can point to
> possible improvements?
>
> - I looked in the repository for instructions on running the tests and
> couldn't find any. At this point, I hadn't found src/test/README.
> - I found
> https://wiki.postgresql.org/wiki/Developer_FAQ#How_do_I_test_my_changes.3F,
> linked from https://www.postgresql.org/developer/
> - I ran the configure build with --enable-cassert, ran make check, tests
> passed.
> - I wrote my patch and then spent a while intentionally adding bugs to see
> if the tests would catch it (I wasn't sure whether there was ssl test
> coverage), finally concluding that I wasn't running any ssl tests
> - I looked some more and found src/test/ssl/README
> - I reconfigured with --enable-tap-tests and ran make check
> PG_TEST_EXTRA=ssl per those instructions, but the SSL tests still weren't
> running
> - I grepped for PG_TEST_EXTRA and found references in the CI config, but
> using the meson build
> - I installed meson, mimicked a few commands from the CI. That seemed to
> work.
> - I tried running only the ssl tests, looking up how you specify
> individual tests in meson, to make my compile/test cycles a bit faster, but
> they failed.
> - I noticed that the first couple "tests" were named like setup tasks, and
> guessed that the ssl tests depended on this setup to run. But by then I
> just gave up and waited out the whole test suite per run. :-)
>
> Once I got it running, it was quite smooth. I just wasn't sure how to do
> it.
>
> David
>


Re: Synchronizing slots from primary to standby

2024-02-15 Thread Amit Kapila
On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, February 15, 2024 5:20 PM Amit Kapila  
> wrote:
> > On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) 
> > 
> > wrote:
> > >
> > > On Thursday, February 15, 2024 10:49 AM Amit Kapila
> >  wrote:
> > > >
> > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
> > > >
> > > > Right, we can do that or probably this test would have made more
> > > > sense with a worker patch where we could wait for the slot to be synced.
> > > > Anyway, let's try to recreate the slot/subscription idea. BTW, do
> > > > you think that adding a LOG when we are not able to sync will help
> > > > in debugging such problems? I think eventually we can change it to
> > > > DEBUG1 but for now, it can help with stabilizing BF and or some other
> > reported issues.
> > >
> > > Here is the patch that attempts the re-create sub idea.
> > >
> >
> > Pushed this.
> >
> > >
> >  I also think that a LOG/DEBUG
> > > would be useful for such analysis, so the 0002 is to add such a log.
> > >
> >
> > I feel such a LOG would be useful.
> >
> > + ereport(LOG,
> > + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin"
> > +" (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)",
> >
> > I think waiting is a bit misleading here, how about something like:
> > "could not sync slot information as remote slot precedes local slot:
> > remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X),
> > catalog xmin (%u)"
>
> Changed.
>
> Attach the v2 patch here.
>
> Apart from the new log message. I think we can add one more debug message in
> reserve_wal_for_local_slot, this could be useful to analyze the failure.

Yeah, that can also be helpful, but the added message looks naive to me.
+ elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno);

Instead of the above, how about something like: "segno: %ld of
purposed restart_lsn for the synced slot, oldest_segno: %ld
available"?

> And we
> can also enable the DEBUG log in the 040 tap-test, I see we have similar
> setting in 010_logical_decoding_timline and logging debug1 message doesn't
> increase noticable time on my machine. These are done in 0002.
>

I haven't tested it, but I think this can help in debugging BF
failures, if any. I am not sure if we should keep it that way
permanently, but until these tests are stabilized, this sounds like a
good idea. So, how about just making the test changes a separate patch
so that later, if required, we can revert/remove it easily? Bertrand,
do you have any thoughts on this?

-- 
With Regards,
Amit Kapila.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread John Naylor
On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada  wrote:
>
> On Sat, Feb 10, 2024 at 9:29 PM John Naylor  wrote:

> I've also run the same scripts in my environment just in case and got
> similar results:

Thanks for testing, looks good as well.

> > There are still some micro-benchmarks we could do on tidstore, and
> > it'd be good to find out worse-case memory use (1 dead tuple each on
> > spread-out pages), but this is decent demonstration.
>
> I've tested a simple case where vacuum removes 33k dead tuples spread
> about every 10 pages.
>
> master:
> 198,000 bytes (=33000 * 6)
> system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
>
> v-59:
> 2,834,432 bytes (reported by TidStoreMemoryUsage())
> system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

The memory usage for the sparse case may be a concern, although it's
not bad -- a multiple of something small is probably not huge in
practice. See below for an option we have for this.

> > > > I'm pretty sure there's an
> > > > accidental memset call that crept in there, but I'm running out of
> > > > steam today.
> >
> > I have just a little bit of work to add for v59:
> >
> > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> > any bitmapwords. That can only happen if e.g. there is an offset > 128
> > and there are none between 64 and 128, so not a huge deal but I think
> > it's a bit nicer in this patch.
>
> LGTM.

Okay, I've squashed this.

> I've drafted the commit message.

Thanks, this is a good start.

> I've run regression tests with valgrind and run the coverity scan, and
> I don't see critical issues.

Great!

Now, I think we're in pretty good shape. There are a couple of things
that might be objectionable, so I want to try to improve them in the
little time we have:

1. Memory use for the sparse case. I shared an idea a few months ago
of how runtime-embeddable values (true combined pointer-value slots)
could work for tids. I don't think this is a must-have, but it's not a
lot of code, and I have this working:

v61-0006: Preparatory refactoring -- I think we should do this anyway,
since the intent seems more clear to me.
v61-0007: Runtime-embeddable tids -- Optional for v17, but should
reduce memory regressions, so should be considered. Up to 3 tids can
be stored in the last level child pointer. It's not polished, but I'll
only proceed with that if we think we need this. "flags" is called
that because it could hold tidbitmap.c booleans (recheck, lossy) in
the future, in addition to reserving space for the pointer tag. Note:
I hacked the tests to only have 2 offsets per block to demo, but of
course both paths should be tested.

2. Management of memory contexts. It's pretty verbose and messy. I
think the abstraction could be better:
A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
can't destroy or reset it. That means we have to do a lot of manual
work.
B: Passing "max_bytes" to the radix tree was my idea, I believe, but
it seems the wrong responsibility. Not all uses will have a
work_mem-type limit, I'm guessing. We only use it for limiting the max
block size, and aset's default 8MB is already plenty small for
vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
smaller, and there it makes sense to limit the max blocksize this way.
C: The context for values has complex #ifdefs based on the value
length/varlen, but it's both too much and not enough. If we get a bump
context, how would we shoehorn that in for values for vacuum but not
for tidbitmap?

Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
TidStoreCreate(), and then to RT_CREATE. That context will contain the
values (for local mem), and the node slabs will be children of the
value context. That way, measuring memory usage and free-ing can just
call with this parent context, and let recursion handle the rest.
Perhaps the passed context can also hold the radix-tree struct, but
I'm not sure since I haven't tried it. What do you think?
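
[A rough sketch of that arrangement -- TidStoreCreate()'s signature here is an
assumption, just to illustrate the ownership tree; the memory-context calls
are the standard ones from utils/memutils.h.]

static TidStore *
create_vacuum_tidstore(void)
{
    /* one caller-owned context for the whole TID store */
    MemoryContext tidstore_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                       "TID store",
                                                       ALLOCSET_DEFAULT_SIZES);

    /* values live in tidstore_cxt; RT_CREATE would make the node slab
     * contexts children of the same context */
    return TidStoreCreate(tidstore_cxt);
}

/*
 * Measuring and freeing then only deal with the parent context:
 *   MemoryContextMemAllocated(cxt, true)   -- recursive memory usage
 *   MemoryContextDelete(cxt)               -- frees values and nodes alike
 */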

With this resolved, I think the radix tree is pretty close to
committable. The tid store will likely need some polish yet, but no
major issues I know of.

(And, finally, a small thing that I wanted to share just so I don't
forget, but maybe not worth the attention: In Andres's prototype,
there is a comment wondering if an update can skip checking if it
first need to create a root node. This is pretty easy, and done in
v61-0008.)


v61-ART.tar.gz
Description: application/gzip


Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-02-15 Thread Bertrand Drouvot
Hi,

On Thu, Feb 15, 2024 at 02:09:45PM +0900, Michael Paquier wrote:
> On Thu, Jan 18, 2024 at 02:20:28PM +, Bertrand Drouvot wrote:
> > On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> >> I'm not sure if it
> >> can happen at all, but I think we can rely on previous conflict reason
> >> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.
> > 
> > I'm not sure what you mean here. I think we should still keep the "initial" 
> > LSN
> > so that the next check done with it still makes sense. The previous conflict
> > reason as you're proposing also makes sense to me but for another reason: 
> > PANIC
> > in case the issue still happen (for cases we did not think about, means not
> > covered by what the added previous LSNs are covering).
> 
> Using a PANIC seems like an overreaction to me for this path.  Sure, 
> we should not record twice a conflict reason, but wouldn't an
> assertion be more adapted enough to check that a termination is not
> logged twice for this code path run in the checkpointer?

Thanks for looking at it!

Agree, using an assertion instead in v3 attached.

> 
> +if (!terminated)
> +{
> +initial_restart_lsn = s->data.restart_lsn;
> +initial_effective_xmin = s->effective_xmin;
> +initial_catalog_effective_xmin = s->effective_catalog_xmin;
> +}
> 
> It seems important to document why we are saving this data here; while
> we hold the LWLock for repslots, before we perform any termination,
> and we  want to protect conflict reports with incorrect data if the
> slot data got changed while the lwlock is temporarily released while
> there's a termination.

Yeah, comments added in v3.

> >> 3. Is there a way to reproduce this racy issue, may be by adding some
> >> sleeps or something? If yes, can you share it here, just for the
> >> records and verify the whatever fix provided actually works?
> > 
> > Alexander was able to reproduce it on a slow machine and the issue was not 
> > there
> > anymore with v1 in place. I think it's tricky to reproduce as it would need 
> > the
> > slot to advance between the 2 checks.
> 
> I'd really want a test for that in the long term.  And an injection
> point stuck between the point where ReplicationSlotControlLock is
> released then re-acquired when there is an active PID should be
> enough, isn't it? 

Yeah, that should be enough.

> For example just after the kill()?  It seems to me
> that we should stuck the checkpointer, perform a forced upgrade of the
> xmins, release the checkpointer and see the effects of the change in
> the second loop.  Now, modules/injection_points/ does not allow that,
> yet, but I've posted a patch among these lines that can stop a process
> and release it using a condition variable (should be 0006 on [1]).  I
> was planning to start a new thread with a patch posted for the next CF
> to add this kind of facility with a regression test for an old bug,
> the patch needs a few hours of love, first.  I should be able to post
> that next week.

Great, that looks like a good idea!

Are we going to fix this on master and 16 stable first and then later on add a
test on master with the injection points?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From eb8bb9eda4c1ea5fe2abe87df66ef2b86cdb2441 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 11 Jan 2024 13:57:53 +
Subject: [PATCH v3] Fix race condition in InvalidatePossiblyObsoleteSlot()

In case of an active slot we proceed in 2 steps:
- terminate the backend holding the slot
- report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that time.

Holding the mutex longer is not an option (given what we'd have to do while
holding it), so instead record the previous LSNs that were used during the
backend termination.
---
 src/backend/replication/slot.c | 39 --
 1 file changed, 33 insertions(+), 6 deletions(-)
 100.0% src/backend/replication/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 2180a38063..33f76c4acc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1454,6 +1454,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 {
 	int			last_signaled_pid = 0;
 	bool		released_lock = false;
+	bool		terminated = false;
+	XLogRecPtr	initial_effective_xmin = InvalidXLogRecPtr;
+	XLogRecPtr	initial_catalog_effective_xmin = InvalidXLogRecPtr;
+	XLogRecPtr	initial_restart_lsn = InvalidXLogRecPtr;
+	ReplicationSlotInvalidationCause conflict_prev PG_USED_FOR_ASSERTS_ONLY = RS_INVAL_NONE;
 
 	for (;;)
 	{
@@ -1488,11 +1493,24 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 		 */
 		if (s->data.invalidated == 

Re: planner chooses incremental but not the best one

2024-02-15 Thread Tomas Vondra



On 2/15/24 07:50, Andrei Lepikhov wrote:
> On 18/12/2023 19:53, Tomas Vondra wrote:
>> On 12/18/23 11:40, Richard Guo wrote:
>> The challenge is where to get usable information about correlation
>> between columns. I only have a couple of very rough ideas of what we might
>> try. For example, if we have multi-column ndistinct statistics, we might
>> look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from
>>
>>  ndistinct(b,c,d) / ndistinct(b,c)
>>
>> If we know how many distinct values we have for the predicate column, we
>> could then estimate the number of groups. I mean, we know that for the
>> restriction "WHERE b = 3" we only have 1 distinct value, so we could
>> estimate the number of groups as
>>
>>  1 * ndistinct(b,c)
> Did you mean here ndistinct(c,d) and the formula:
> ndistinct(b,c,d) / ndistinct(c,d) ?

Yes, I think that's probably a more correct formula ... Essentially, the
idea is to estimate the change in the number of distinct groups after
adding a column (or restricting it in some way).
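
To illustrate with made-up numbers: if ndistinct(c,d) = 200 and
ndistinct(b,c,d) = 1000, then each (c,d) combination co-occurs with about 5
distinct values of b on average. With a restriction like "WHERE b = 3"
pinning b to a single value, we'd estimate roughly 1 * ndistinct(c,d) = 200
groups, instead of the unrestricted 1000.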

> 
> Do you implicitly bear in mind here the necessity of tracking clauses
> that were applied to the data up to the moment of grouping?
> 

I don't recall what exactly I considered two months ago when writing the
message, but I don't see why we would need to track that beyond what we
already have. Shouldn't it be enough for the grouping to simply inspect
the conditions on the lower levels?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: Synchronizing slots from primary to standby

2024-02-15 Thread Zhijie Hou (Fujitsu)
On Thursday, February 15, 2024 5:20 PM Amit Kapila  
wrote:
> On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Thursday, February 15, 2024 10:49 AM Amit Kapila
>  wrote:
> > >
> > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
> > >
> > > Right, we can do that or probably this test would have made more
> > > sense with a worker patch where we could wait for the slot to be synced.
> > > Anyway, let's try to recreate the slot/subscription idea. BTW, do
> > > you think that adding a LOG when we are not able to sync will help
> > > in debugging such problems? I think eventually we can change it to
> > > DEBUG1 but for now, it can help with stabilizing BF and or some other
> reported issues.
> >
> > Here is the patch that attempts the re-create sub idea.
> >
> 
> Pushed this.
> 
> >
>  I also think that a LOG/DEBUG
> > would be useful for such analysis, so the 0002 is to add such a log.
> >
> 
> I feel such a LOG would be useful.
> 
> + ereport(LOG,
> + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin"
> +" (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)",
> 
> I think waiting is a bit misleading here, how about something like:
> "could not sync slot information as remote slot precedes local slot:
> remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X),
> catalog xmin (%u)"

Changed.

Attached is the v2 patch.

Apart from the new log message, I think we can add one more debug message in
reserve_wal_for_local_slot; this could be useful for analyzing the failure. We
can also enable the DEBUG log in the 040 tap-test. I see we have a similar
setting in 010_logical_decoding_timline, and logging debug1 messages doesn't
add noticeable time on my machine. These are done in 0002.

Best Regards,
Hou zj


v2-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch
Description:  v2-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch


v2-0002-Add-a-debug-message-and-enable-DEBUG-log-in-slots.patch
Description:  v2-0002-Add-a-debug-message-and-enable-DEBUG-log-in-slots.patch


Re: Allow passing extra options to initdb for tests

2024-02-15 Thread Daniel Gustafsson
> On 15 Feb 2024, at 11:38, Peter Eisentraut  wrote:

> We don't have a man page for pg_regress, so there is no place to 
> comprehensively document all the options and their interactions.

This comes up every now and again; just yesterday there was a question on
-general [0] about alternate output files, which are also undocumented.  Maybe
it's time to add documentation for pg_regress?

--
Daniel Gustafsson

[0] CACX+KaPOzzRHEt4w_=iqkbtpmkjyrugvng1c749yp3r6dpr...@mail.gmail.com



Re: Allow passing extra options to initdb for tests

2024-02-15 Thread Peter Eisentraut

On 14.02.24 06:22, Ian Lawrence Barwick wrote:

I had a longer look at this and can't find any issues with the code or
documentation changes.


Thanks, committed.


I did wonder whether it would be worth mentioning that any initdb
options set in "PG_TEST_INITDB_EXTRA_OPTS" will override those
which can be set by pg_regress, but of the four ("--no-clean", "--no-sync",
"--debug" and "--no-locale"), only the optional  "--no-locale" can actually
be overridden, so it doesn't seem important.


I thought about this.  We don't have a man page for pg_regress, so there 
is no place to comprehensively document all the options and their 
interactions.  The documentation in regress.sgml works on a makefile 
level.  AFAICT, the --debug option is not exposed via the makefiles at 
all, while --no-locale can be requested by the environment variable 
NOLOCALE, but that is not documented, and also not ported to meson.  I 
think if you want tweaks on this level, there is some expectation right 
now that you might need to study the source code a bit.


One thing that might be interesting would be to allow running initdb 
without the --no-sync option, to exercise fsync a bit.  But there is no 
"--yes-sync" option to override --no-sync, otherwise you could do it 
with the just-committed feature.






Re: Do away with zero-padding assumption before WALRead()

2024-02-15 Thread Nazir Bilal Yavuz
Hi,

On Tue, 13 Feb 2024 at 09:17, Bharath Rupireddy
 wrote:
>
> Hi,
>
> I noticed an assumption [1] at WALRead() call sites expecting the
> flushed WAL page to be zero-padded after the flush LSN. I think this
> can't always be true as the WAL can get flushed after determining the
> flush LSN before reading it from the WAL file using WALRead(). I've
> hacked the code up a bit to check if that's true -
> https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
> the tests hit the Assert(false); added. Which means, the zero-padding
> comment around WALRead() call sites isn't quite right.
>
> I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
> despite knowing exactly how much to read. Is it to tell the OS to
> explicitly fetch the whole page from the disk? If yes, the OS will do
> that anyway because the page transfers from disk to OS page cache are
> always in terms of disk block sizes, no?

I am curious about the same. The page size and disk block size could
be different, so the reason could be explicitly fetching the whole
page from the disk as you said. Is this the reason or are there any
other benefits of always reading XLOG_BLCKSZ instead of reading just
the required part? I tried to search in older threads and code comments
but I could not find an explanation.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Returning non-terminated string in ECPG Informix-compatible function

2024-02-15 Thread Oleg Tselebrovskiy

Thanks for review!

I added a regression test that is based on code from previous email

New patch is attached
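
Just as a side note, here is a minimal standalone illustration of why the
memcpy() needs to be a strcpy() (made-up strings, not part of the patch):
memcpy() copies strlen(tmp) bytes but no terminating NUL, so whatever was
already in the destination buffer stays visible after the copied bytes.

#include <stdio.h>
#include <string.h>

int
main(void)
{
	char		dst[32] = "XXXXXXXXXXXXXXXXXXX";	/* "dirty" destination */
	const char *tmp = "@ 20 years";

	memcpy(dst, tmp, strlen(tmp));
	printf("memcpy: %s\n", dst);	/* prints "@ 20 years" plus leftover Xs */

	strcpy(dst, "XXXXXXXXXXXXXXXXXXX");	/* dirty it again */
	strcpy(dst, tmp);
	printf("strcpy: %s\n", dst);	/* prints just "@ 20 years" */
	return 0;
}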

Oleg Tselebrovskiy, Postgres Pro

diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c
index dccf39582da..80d40aa3e09 100644
--- a/src/interfaces/ecpg/compatlib/informix.c
+++ b/src/interfaces/ecpg/compatlib/informix.c
@@ -654,7 +654,7 @@ intoasc(interval * i, char *str)
 	if (!tmp)
 		return -errno;
 
-	memcpy(str, tmp, strlen(tmp));
+	strcpy(str, tmp);
 	free(tmp);
 	return 0;
 }
diff --git a/src/interfaces/ecpg/pgtypeslib/dt_common.c b/src/interfaces/ecpg/pgtypeslib/dt_common.c
index 99bdc94d6d7..d4ca0cbff6e 100644
--- a/src/interfaces/ecpg/pgtypeslib/dt_common.c
+++ b/src/interfaces/ecpg/pgtypeslib/dt_common.c
@@ -2659,6 +2659,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
  */
 pfmt++;
 tmp = pgtypes_alloc(strlen("%m/%d/%y") + strlen(pstr) + 1);
+if(!tmp)
+	return 1;
 strcpy(tmp, "%m/%d/%y");
 strcat(tmp, pfmt);
 err = PGTYPEStimestamp_defmt_scan(, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2784,6 +2786,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'r':
 pfmt++;
 tmp = pgtypes_alloc(strlen("%I:%M:%S %p") + strlen(pstr) + 1);
+if(!tmp)
+	return 1;
 strcpy(tmp, "%I:%M:%S %p");
 strcat(tmp, pfmt);
 err = PGTYPEStimestamp_defmt_scan(, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2792,6 +2796,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'R':
 pfmt++;
 tmp = pgtypes_alloc(strlen("%H:%M") + strlen(pstr) + 1);
+if(!tmp)
+	return 1;
 strcpy(tmp, "%H:%M");
 strcat(tmp, pfmt);
 err = PGTYPEStimestamp_defmt_scan(, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2837,6 +2843,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'T':
 pfmt++;
 tmp = pgtypes_alloc(strlen("%H:%M:%S") + strlen(pstr) + 1);
+if(!tmp)
+	return 1;
 strcpy(tmp, "%H:%M:%S");
 strcat(tmp, pfmt);
 err = PGTYPEStimestamp_defmt_scan(, tmp, d, year, month, day, hour, minute, second, tz);
diff --git a/src/interfaces/ecpg/test/compat_informix/.gitignore b/src/interfaces/ecpg/test/compat_informix/.gitignore
index f97706ba4be..6967ae77cd2 100644
--- a/src/interfaces/ecpg/test/compat_informix/.gitignore
+++ b/src/interfaces/ecpg/test/compat_informix/.gitignore
@@ -4,6 +4,8 @@
 /dec_test.c
 /describe
 /describe.c
+/intoasc
+/intoasc.c
 /rfmtdate
 /rfmtdate.c
 /rfmtlong
diff --git a/src/interfaces/ecpg/test/compat_informix/Makefile b/src/interfaces/ecpg/test/compat_informix/Makefile
index d50fdc29fd1..638b4e0af78 100644
--- a/src/interfaces/ecpg/test/compat_informix/Makefile
+++ b/src/interfaces/ecpg/test/compat_informix/Makefile
@@ -16,7 +16,8 @@ TESTS = test_informix test_informix.c \
 rnull rnull.c \
 sqlda sqlda.c \
 describe describe.c \
-charfuncs charfuncs.c
+charfuncs charfuncs.c \
+intoasc intoasc.c
 
 all: $(TESTS)
 
diff --git a/src/interfaces/ecpg/test/compat_informix/intoasc.pgc b/src/interfaces/ecpg/test/compat_informix/intoasc.pgc
new file mode 100644
index 000..d13c83bb7a7
--- /dev/null
+++ b/src/interfaces/ecpg/test/compat_informix/intoasc.pgc
@@ -0,0 +1,21 @@
+#include 
+#include 
+
+#include "pgtypes_interval.h"
+
+EXEC SQL BEGIN DECLARE SECTION;
+char dirty_str[100] = "a__c_d_";
+interval *interval_ptr;
+EXEC SQL END DECLARE SECTION;
+
+int main()
+{
+interval_ptr = (interval *) malloc(sizeof(interval));
+interval_ptr->time = 1;
+interval_ptr->month = 240;
+
+printf("dirty_str contents before intoasc: %s\n", dirty_str);
+intoasc(interval_ptr, dirty_str);
+printf("dirty_str contents after intoasc: %s\n", dirty_str);
+return 0;
+}
diff --git a/src/interfaces/ecpg/test/compat_informix/meson.build b/src/interfaces/ecpg/test/compat_informix/meson.build
index e2f8802330d..7e4790933ad 100644
--- a/src/interfaces/ecpg/test/compat_informix/meson.build
+++ b/src/interfaces/ecpg/test/compat_informix/meson.build
@@ -4,6 +4,7 @@ pgc_files = [
   'charfuncs',
   'dec_test',
   'describe',
+  'intoasc',
   'rfmtdate',
   'rfmtlong',
   'rnull',
diff --git a/src/interfaces/ecpg/test/ecpg_schedule b/src/interfaces/ecpg/test/ecpg_schedule
index 39814a39c17..f9c0a0e3c00 100644
--- a/src/interfaces/ecpg/test/ecpg_schedule
+++ b/src/interfaces/ecpg/test/ecpg_schedule
@@ -7,6 +7,7 @@ test: compat_informix/sqlda
 test: compat_informix/describe
 test: compat_informix/test_informix
 test: compat_informix/test_informix2
+test: compat_informix/intoasc
 test: compat_oracle/char_array
 test: connect/test2
 test: connect/test3
diff --git a/src/interfaces/ecpg/test/expected/compat_informix-intoasc.c b/src/interfaces/ecpg/test/expected/compat_informix-intoasc.c
new file mode 100644
index 000..30988809e92
--- 

Re: A new strategy for pull-up correlated ANY_SUBLINK

2024-02-15 Thread Andy Fan


Hi Alexander,

> Hi!
>
>> If the changes of Alena are ok, can you merge the changes and post an
>> updated version so that CFBot can apply the patch and verify the
>> changes. As currently CFBot is trying to apply only Alena's changes
>> and failing with the following at [1]:
>
> I think this is a very nice and pretty simple optimization.  I've
> merged the changes by Alena, and slightly revised the code changes in
> convert_ANY_sublink_to_join().  I'm going to push this if there are no
> objections.

Thanks for picking this up! I double-checked the patch; it looks good to
me.

-- 
Best Regards
Andy Fan





Re: pg_upgrade and logical replication

2024-02-15 Thread Justin Pryzby
On Wed, Feb 14, 2024 at 03:37:03AM +, Hayato Kuroda (Fujitsu) wrote:
> Attached patch modified the test accordingly. Also, it contains some 
> optimizations.
> This can pass the test on my env:

What optimizations?  I can't see them, and since the patch is described
as rearranging test cases (and therefore already difficult to read), I
guess they should be a separate patch, or the optimizations described.

-- 
Justin




Re: 039_end_of_wal: error in "xl_tot_len zero" test

2024-02-15 Thread Anton Voloshin

Hello, Thomas,

On 19/01/2024 01:35, Thomas Munro wrote:

I don't yet have an opinion on the best way to
do it though.  Would it be enough to add emit_message($node, 0) after
advance_out_of_record_splitting_zone()?


Yes, indeed that seems to be enough. At least I could not produce any 
more "xl_tot_len zero" failures with that addition.


I like this solution the best.


Tolerating the two different messages would weaken the test.


I agree, I also don't really like this option.

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru




Re: Synchronizing slots from primary to standby

2024-02-15 Thread Bertrand Drouvot
Hi,

On Thu, Feb 15, 2024 at 02:49:54PM +0530, Amit Kapila wrote:
> On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Thursday, February 15, 2024 10:49 AM Amit Kapila 
> >  wrote:
> > >
> > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
> > >
> > > Right, we can do that or probably this test would have made more sense 
> > > with a
> > > worker patch where we could wait for the slot to be synced.
> > > Anyway, let's try to recreate the slot/subscription idea. BTW, do you 
> > > think that
> > > adding a LOG when we are not able to sync will help in debugging such
> > > problems? I think eventually we can change it to DEBUG1 but for now, it 
> > > can help
> > > with stabilizing BF and or some other reported issues.
> >
> > Here is the patch that attempts the re-create sub idea.
> >
> 
> Pushed this.
> 
> >
>  I also think that a LOG/DEBUG
> > would be useful for such analysis, so the 0002 is to add such a log.
> >
> 
> I feel such a LOG would be useful.

Same here.

> + ereport(LOG,
> + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin"
> +" (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)",
> 
> I think waiting is a bit misleading here, how about something like:
> "could not sync slot information as remote slot precedes local slot:
> remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN
> (%X/%X), catalog xmin (%u)"
> 

This wording works for me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add system identifier to backup manifest

2024-02-15 Thread Amul Sul
On Thu, Feb 15, 2024 at 7:18 AM Michael Paquier  wrote:

> On Wed, Feb 14, 2024 at 12:29:07PM +0530, Amul Sul wrote:
> > Ok, I did that way in the attached version, I have passed the control
> file's
> > full path as a second argument to verify_system_identifier() what we
> gets in
> > verify_backup_file(), but that is not doing any useful things with it,
> > since we
> > were using get_controlfile() to open the control file, which takes the
> > directory as an input and computes the full path on its own.
>
> I've read through the patch, and that's pretty cool.
>

Thank you for looking into this.


> -static void
> -parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
> -   manifest_wal_range
> **first_wal_range_p)
> +static manifest_data *
> +parse_manifest_file(char *manifest_path)
>
> In 0001, should the comment describing this routine be updated as
> well?
>

Ok, updated in the attached version.


>
> +   identifier with pg_control of the backup directory or fails
> verification
>
> This is missing a  markup here.
>

Done, in the attached version.


>
> +  PostgreSQL 17, it is 2; in older
> versions,
> +  it is 1.
>
> Perhaps a couple of s here.
>
Done.


> +   if (strcmp(parse->manifest_version, "1") != 0 &&
> +   strcmp(parse->manifest_version, "2") != 0)
> +   json_manifest_parse_failure(parse->context,
> +
>  "unexpected manifest version");
> +
> +   /* Parse version. */
> +   version = strtoi64(parse->manifest_version, , 10);
> +   if (*ep)
> +   json_manifest_parse_failure(parse->context,
> +
>  "manifest version not an integer");
> +
> +   /* Invoke the callback for version */
> +   context->version_cb(context, version);
>
> Shouldn't these two checks be reversed?  And is there actually a need
> for the first check at all knowing that the version callback should be
> in charge of performing the validation based on the version received?
>

Makes sense, reversed the order.

I think the allowed versions should be listed in a central place, and the
callback can check and react to the versions suitable for it, IMHO.


> +my $node2;
> +{
> +   local $ENV{'INITDB_TEMPLATE'} = undef;
>
> Not sure that it is a good idea to duplicate this pattern twice.
> Shouldn't this be something we'd want to control with an option in the
> init() method instead?
>

Yes, I did that in a separate patch; see the 0001 patch.


> +static void
> +verify_system_identifier(verifier_context *context, char *controlpath)
>
> Relying both on controlpath, being a full path to the control file
> including the data directory, and context->backup_directory to read
> the contents of the control file looks a bit weird.  Wouldn't it be
> cleaner to just use one of them?
>

Well, yes, I had the same feeling. How about having another function
that accepts the full path of pg_control?

I tried that in the 0002 patch: the original function is renamed to
get_dir_controlfile(), which accepts the data directory path as before, while
get_controlfile() now accepts the full path of the pg_control file.
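
Roughly the shape I have in mind (just a sketch, not the actual patch; it
assumes the new path-taking get_controlfile() signature and keeps the
directory-based variant as a thin wrapper):

/* sketch only: wrapper that keeps the old directory-based interface */
ControlFileData *
get_dir_controlfile(const char *DataDir, bool *crc_ok_p)
{
	char	   *path = psprintf("%s/%s", DataDir, XLOG_CONTROL_FILE);
	ControlFileData *ControlFile = get_controlfile(path, crc_ok_p);

	pfree(path);
	return ControlFile;
}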

Kindly have a look at the attached version.

Regards,
Amul


v7-0004-Add-system-identifier-to-backup-manifest.patch
Description: Binary data


v7-0003-pg_verifybackup-code-refactor.patch
Description: Binary data


v7-0001-Add-option-to-force-initdb-in-PostgreSQL-Test-Clu.patch
Description: Binary data


v7-0002-Code-refactor-get_controlfile-to-accept-full-path.patch
Description: Binary data


Re: Add trim_trailing_whitespace to editorconfig file

2024-02-15 Thread Jelte Fennema-Nio
On Wed, 14 Feb 2024 at 23:19, Daniel Gustafsson  wrote:
> > +1 from me. But when do we want it to be false? That is, why not
> > declare it true for all file types?
>
> Regression test .out files commonly have spaces at the end of the line.  (Not
> to mention the ECPG .c files but they probably really shouldn't have.)

Attached is v2, which now makes the rules between gitattributes and
editorconfig completely identical, as well as improving two minor
things about .gitattributes before doing that.


v2-0002-Require-final-newline-in-.po-files.patch
Description: Binary data


v2-0003-Bring-editorconfig-in-line-with-gitattributes.patch
Description: Binary data


v2-0004-Add-note-about-keeping-.editorconfig-and-.gitattr.patch
Description: Binary data


v2-0001-Remove-non-existing-file-from-.gitattributes.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-02-15 Thread Amit Kapila
On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, February 15, 2024 10:49 AM Amit Kapila  
> wrote:
> >
> > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
> >
> > Right, we can do that or probably this test would have made more sense with 
> > a
> > worker patch where we could wait for the slot to be synced.
> > Anyway, let's try to recreate the slot/subscription idea. BTW, do you think 
> > that
> > adding a LOG when we are not able to sync will help in debugging such
> > problems? I think eventually we can change it to DEBUG1 but for now, it can 
> > help
> > with stabilizing BF and or some other reported issues.
>
> Here is the patch that attempts the re-create sub idea.
>

Pushed this.

>
 I also think that a LOG/DEBUG
> would be useful for such analysis, so the 0002 is to add such a log.
>

I feel such a LOG would be useful.

+ ereport(LOG,
+ errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin"
+" (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)",

I think waiting is a bit misleading here, how about something like:
"could not sync slot information as remote slot precedes local slot:
remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN
(%X/%X), catalog xmin (%u)"

-- 
With Regards,
Amit Kapila.




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-15 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Thu, 15 Feb 2024 17:09:20 +0800,
  jian he  wrote:

> My environment is slow (around 10x) but consistent.
> I see around 2-3 percent increase consistently.
> (with patch 7369.068 ms, without patch 7574.802 ms)

Thanks for sharing your numbers! They will help us to
determine whether these changes improve performance or not.

> The patchset looks good to me; I can understand it.
> However, I cannot apply it cleanly against HEAD.

Hmm, I used 9bc1eee988c31e66a27e007d41020664df490214 as the
base version. But both patches are based on that same
revision, so we may not be able to apply both patches at
once cleanly.

> +/*
> + * Prepare callinfo for InputFunctionCallSafeWithInfo to reuse one callinfo
> + * instead of initializing it for each call. This is for performance.
> + */
> +FunctionCallInfoBaseData *
> +PrepareInputFunctionCallInfo(void)
> +{
> + FunctionCallInfoBaseData *fcinfo;
> +
> + fcinfo = (FunctionCallInfoBaseData *) palloc(SizeForFunctionCallInfo(3));
> 
> just wondering, I saw other similar places using palloc0,
> do we need to use palloc0?

I think that we don't need to use palloc0() here because the
following InitFunctionCallInfoData() call initializes all
members explicitly.


Thanks,
-- 
kou




Re: RFC: Logging plan of the running query

2024-02-15 Thread Robert Haas
Hi,

I've just been catching up on this thread.

+ if (MyProc->heldLocks)
+ {
+ ereport(LOG_SERVER_ONLY,
+ errmsg("ignored request for logging query plan due to lock conflicts"),
+ errdetail("You can try again in a moment."));
+ return;
+ }

I don't like this for several reasons.

First, I think it's not nice to have a request just get ignored. A
user will expect that if we cannot immediately respond to some
request, we'll respond later at the first time that it's safe to do
so, rather than just ignoring it and telling them to retry.

Second, I don't think that the error message is very good. It talks
about lock conflicts, but we don't know that there is any real
problem. We know that, if we enter this block, the server is in the
middle of trying to acquire some lock, and we also know that we could
attempt to acquire a lock as part of generating the EXPLAIN output,
and maybe that's an issue. But that's not a lock conflict. That's a
re-entrancy problem. I don't know that we want to talk about
re-entrancy problems in an error message, and I don't really think
this error message should exist in the first place, but if we're going
to error out in this case surely we shouldn't do so with an error
message that describes a problem we don't have.

Third, I think that the re-entrancy problems with this patch may
extend well beyond lock acquisition. This is one example of how it can
be unsafe to do something as complicated as EXPLAIN at any arbitrary
CHECK_FOR_INTERRUPTS(). It is not correct to say, as
http://postgr.es/m/caaaqye9euuzd8bkjxtvcd9e4n5c7kzhzcvucjxt-xds9x4c...@mail.gmail.com
does, that the problems with running EXPLAIN at an arbitrary point are
specific to running this code in an aborted transaction. The existence
of this code shows that there is at least one hazard even if the
current transaction is not aborted, and I see no analysis on this
thread indicating whether there are any more such hazards, or how we
could go about finding them all.

I think the issue is very general. We have lots of subsystems that
both (a) use global variables and (b) contain CHECK_FOR_INTERRUPTS().
If we process an interrupt while that code is in the middle of
manipulating its global variables and then again re-enter that code,
bad things might happen. We could try to rule that out by analyzing
all such subsystems and all places where CHECK_FOR_INTERRUPTS() may
appear, but that's very difficult. Suppose we took the alternative
approach recommended by Andrey Lepikhov in
http://postgr.es/m/b1b110ae-61f6-4fd9-9b94-f967db9b5...@app.fastmail.com
and instead set a flag that gets handled in some suitable place in the
executor code, like ExecProcNode(). If we did that, then we would know
that we're not in the middle of a syscache lookup, a catcache lookup,
a heavyweight lock acquisition, an ereport, or any other low-level
subsystem call that might create problems for the EXPLAIN code.
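
As a minimal, self-contained illustration of that general pattern (plain C
with made-up names, not PostgreSQL code): the interrupt handler only sets a
flag, and the potentially complicated work happens at a point in the main
loop that is known to be safe, rather than wherever the interrupt happens to
be noticed.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t explain_requested = 0;

static void
handle_sigusr1(int signo)
{
	(void) signo;
	explain_requested = 1;		/* nothing complicated in the handler */
}

int
main(void)
{
	signal(SIGUSR1, handle_sigusr1);

	for (;;)
	{
		/* ... one unit of "executor" work per iteration ... */
		sleep(1);

		/*
		 * Safe point, analogous to a check in ExecProcNode(): we know no
		 * lower-level subsystem is halfway through updating its state here.
		 */
		if (explain_requested)
		{
			explain_requested = 0;
			printf("would produce the EXPLAIN output here\n");
		}
	}
	return 0;
}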

In that design, the hack shown above would go away, and we could be
much more certain that we don't need any other similar hacks for other
subsystems. The only downside is that we might experience a slightly
longer delay before a requested EXPLAIN plan actually shows up, but
that seems like a pretty small price to pay for being able to reason
about the behavior of the system. I don't *think* there are any cases
where we run in the executor for a particularly long time without a
new call to ExecProcNode().

I think this case is a bit like vacuum_delay_point(). You might think
that vacuum_delay_point() could be moved inside of
CHECK_FOR_INTERRUPTS(), but we've made the opposite decision: every
vacuum_delay_point() calls CHECK_FOR_INTERRUPTS() but not every
CHECK_FOR_INTERRUPTS() calls vacuum_delay_point(). That means that we
can allow vacuum_delay_point() only at cases where we know it's safe,
rather than at every CHECK_FOR_INTERRUPTS(). I think that's a pretty
smart decision, even for vacuum_delay_point(), and AFAICS the code
we're proposing to run here does things that are substantially more
complicated than what vacuum_delay_point() does. That code just does a
bit of reading of shared memory, reports a wait event, and sleeps.
That's really simple compared to code that could try to do relcache
builds, which can trigger syscache lookups, which can both trigger
heavyweight lock acquisition, lightweight lock acquisition, bringing
pages into shared_buffers possibly through physical I/O, processing of
invalidation messages, and a bunch of other stuff.

It's really hard for me to accept that the heavyweight lock problem
for which the patch contains a workaround is the only one that exists.
I can't see any reason why that should be true.

...Robert




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-15 Thread jian he
On Thu, Feb 15, 2024 at 2:34 PM Sutou Kouhei  wrote:
>
>
> Thanks for the info. Let's use InputFunctionCallSafeWithInfo().
> See that attached patch:
> v2-0001-Reuse-fcinfo-used-in-COPY-FROM.patch
>
> I also attach a patch for COPY TO:
> v1-0001-Reuse-fcinfo-used-in-COPY-TO.patch
>
> I measured the COPY TO patch on my environment with:
> COPY (SELECT 
> 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2,
>  generate_series(1, 100::int4)) TO '/dev/null' \watch c=5
>
> master:
> 740.066ms
> 734.884ms
> 738.579ms
> 734.170ms
> 727.953ms
>
> patched:
> 730.714ms
> 741.483ms
> 714.149ms
> 715.436ms
> 713.578ms
>
> It seems that it improves performance a bit but my
> environment isn't suitable for benchmark. So they may not
> be valid numbers.

My environment is slow (around 10x) but consistent.
I see around 2-3 percent increase consistently.
(with patch 7369.068 ms, without patch 7574.802 ms)

The patchset looks good to me; I can understand it.
However, I cannot apply it cleanly against HEAD.

+/*
+ * Prepare callinfo for InputFunctionCallSafeWithInfo to reuse one callinfo
+ * instead of initializing it for each call. This is for performance.
+ */
+FunctionCallInfoBaseData *
+PrepareInputFunctionCallInfo(void)
+{
+ FunctionCallInfoBaseData *fcinfo;
+
+ fcinfo = (FunctionCallInfoBaseData *) palloc(SizeForFunctionCallInfo(3));

just wondering, I saw other similar places using palloc0,
do we need to use palloc0?




Re: [PATCH] Expose port->authn_id to extensions and triggers

2024-02-15 Thread Andy Fan

Hi,

Michael Paquier  writes:

> [[PGP Signed Part:Undecided]]
> On Tue, Aug 23, 2022 at 10:04:30AM -0700, Jacob Champion wrote:
>
> My main worry here is EXEC_BACKEND, where we would just use our own
> implementation of fork(), and it is a bad idea at the end to leave
> that untouched while we could have code paths that attempt to access
> it.  At the end, I have moved the initialization at the same place as
> where we set MyProcPort for a backend in BackendInitialize(), mainly
> as a matter of consistency because ClientConnectionInfo is aimed at
> being a subset of that.  And applied.

I found a compiler complaint about this patch. The attached patch fixes that.

-- 
Best Regards
Andy Fan

From 9e8a3fb7a044704fbfcd682a897f72260266bd54 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 15 Feb 2024 16:46:57 +0800
Subject: [PATCH v1 1/1] Remove the "parameter 'maxsize' set but not used"
 error.

maxsize is only used in assert-enabled builds; to keep the compiler quiet, it
is better to maintain it only in assert builds.
---
 src/backend/utils/init/miscinit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 23f77a59e5..0e72fcefab 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1050,7 +1050,9 @@ SerializeClientConnectionInfo(Size maxsize, char *start_address)
 	Assert(maxsize >= sizeof(serialized));
 	memcpy(start_address, , sizeof(serialized));
 
+#ifdef USE_ASSERT_CHECKING
 	maxsize -= sizeof(serialized);
+#endif
 	start_address += sizeof(serialized);
 
 	/* Copy authn_id into the space after the struct */
-- 
2.34.1



Re: make add_paths_to_append_rel aware of startup cost

2024-02-15 Thread Andy Fan

Andy Fan  writes:

> Thomas Munro  writes:
>
>> On Thu, Oct 5, 2023 at 9:07 PM David Rowley  wrote:
>>> Thanks. Pushed.
>>
>> FYI somehow this plan from a8a968a8212e flipped in this run:
>>
>> === dumping 
>> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/regression.diffs
>> ===
>> diff -U3 
>> /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out
>> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
>> --- /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out
>> 2024-01-15 00:31:13.947555940 +
>> +++ 
>> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
>> 2024-02-14 00:06:17.075584839 +
>> @@ -1447,9 +1447,9 @@
>> ->  Append
>>   ->  Nested Loop
>> Join Filter: (t1.tenthous = t2.tenthous)
>> -   ->  Seq Scan on tenk1 t1
>> +   ->  Seq Scan on tenk2 t2
>> ->  Materialize
>> - ->  Seq Scan on tenk2 t2
>> + ->  Seq Scan on tenk1 t1
>>   ->  Result
>>  (8 rows)
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2024-02-14%2000%3A01%3A03
>
> Thanks for this information! I will take a look at this.

I found that both plans have the same cost. I couldn't pin down the exact
cause of this after some hours of research, but it is pretty similar to
7516056c584e3, so I used a similar strategy to stabilize it. Is it
acceptable? 

-- 
Best Regards
Andy Fan

From 01c2b5ae76a4493f33b894afdd4a8d3ce31e1a3e Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 15 Feb 2024 16:29:30 +0800
Subject: [PATCH v1 1/1] Attempt to stabilize new unionall + limit regression
 test

This test was recently added in a8a968a8212e, and sometimes it appears
to be unstable with regard to the join order, presumably due to the
relations on either side of the join being equal in size.  Here we add a
qual to make one of them smaller so the planner is more likely to choose
to materialize the smaller of the two.

Reported-by: Thomas Munro
Author: Andy Fan

Discussion: https://postgr.es/m/CA%2BhUKGLqC-NobKYfjxNM3Gexv9OJ-Fhvy9bugUcXsZjTqH7W%3DQ%40mail.gmail.com#88e6420b59a863403b9b67a0128fdacc
---
 src/test/regress/expected/union.out | 11 ++-
 src/test/regress/sql/union.sql  |  4 ++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 73e320bad4..0e527b65b3 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1438,8 +1438,8 @@ where (x = 0) or (q1 >= q2 and q1 <= q2);
 -- Ensure we get a Nested Loop join between tenk1 and tenk2
 explain (costs off)
 select t1.unique1 from tenk1 t1
-inner join tenk2 t2 on t1.tenthous = t2.tenthous
-   union all
+inner join tenk2 t2 on t1.tenthous = t2.tenthous and t1.ten > 5
+union all
 (values(1)) limit 1;
QUERY PLAN   
 
@@ -1447,11 +1447,12 @@ inner join tenk2 t2 on t1.tenthous = t2.tenthous
->  Append
  ->  Nested Loop
Join Filter: (t1.tenthous = t2.tenthous)
-   ->  Seq Scan on tenk1 t1
+   ->  Seq Scan on tenk2 t2
->  Materialize
- ->  Seq Scan on tenk2 t2
+ ->  Seq Scan on tenk1 t1
+   Filter: (ten > 5)
  ->  Result
-(8 rows)
+(9 rows)
 
 -- Ensure there is no problem if cheapest_startup_path is NULL
 explain (costs off)
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index 6c509ac80c..13ba9f21ad 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -548,8 +548,8 @@ where (x = 0) or (q1 >= q2 and q1 <= q2);
 -- Ensure we get a Nested Loop join between tenk1 and tenk2
 explain (costs off)
 select t1.unique1 from tenk1 t1
-inner join tenk2 t2 on t1.tenthous = t2.tenthous
-   union all
+inner join tenk2 t2 on t1.tenthous = t2.tenthous and t1.ten > 5
+union all
 (values(1)) limit 1;
 
 -- Ensure there is no problem if cheapest_startup_path is NULL
-- 
2.34.1