Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-11 Thread Bertrand Drouvot
Hi,

On Fri, Jan 12, 2024 at 07:01:55AM +0900, Michael Paquier wrote:
> On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> > Bertrand, I've relaunched tests in the same slowed down VM with both
> > patches applied (but with no other modifications) and got a failure
> > with pg_class, similar to what we had seen before:
> > 9   #   Failed test 'activeslot slot invalidation is logged with vacuum 
> > on pg_class'
> > 9   #   at t/035_standby_logical_decoding.pl line 230.
> > 
> > Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> > 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Thanks! 

For this one, the "good" news is that we no longer see a "terminating"
message that is not followed by an "obsolete" message (so the engine behaves
correctly).

There is simply nothing related to row_removal_activeslot at all (the
catalog_xmin advanced and there is no conflict).

And I agree that this is due to the Standby/RUNNING_XACTS record "advancing"
the catalog_xmin of the active slot.

> Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
> = minimal, because these lead to unpredictable records being inserted,
> impacting the reliability of the tests.  We cannot do that here,
> obviously.  That may be a long shot, but could it be possible to tweak
> the test with a retry logic, retrying things if such a standby
> snapshot is found because we know that the invalidation is not going
> to work anyway?

I think it all depends on what the xl_running_xacts record contains (i.e.,
whether or not it "advances" the catalog_xmin in our case).

In our case it does advance it (should one occur), due to the "select
txid_current()" that is done in wait_until_vacuum_can_remove() in
035_standby_logical_decoding.pl.

I suggest making use of txid_current_snapshot() instead (which does not
produce a Transaction/COMMIT WAL record, as opposed to txid_current()).
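
For illustration, the relevant difference between the two (a sketch of my
understanding; easy to double-check with pg_waldump):

-- txid_current() assigns an XID if the transaction does not already have
-- one, so its commit writes a Transaction/COMMIT WAL record; a concurrent
-- Standby/RUNNING_XACTS snapshot can then advance the slot's catalog_xmin.
SELECT txid_current();

-- txid_current_snapshot() only reads the current snapshot: no XID is
-- assigned and no COMMIT record is written.
SELECT txid_current_snapshot();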

I think that could be "enough" for our case here, and it's what the attached
v5 now does.

Let's give v5 a try? (Please also apply
v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch.)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b1c316a85a2dd43a5ab33adbda1ee82020d84bb1 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 11 Jan 2024 13:36:57 +
Subject: [PATCH v5] Fix 035_standby_logical_decoding.pl race conditions

We want to ensure that vacuum was able to remove dead rows (i.e., that no
other transaction was holding back the global xmin) before testing for slot
invalidation on the standby.

Also, get rid of the active slot invalidation check during the on-access
pruning test. That check is racy for active slots, and active slot
invalidation is well covered by other tests.
---
 .../t/035_standby_logical_decoding.pl | 114 +-
 1 file changed, 59 insertions(+), 55 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..dd4149c8bc 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -191,9 +191,9 @@ sub check_slots_conflict_reason
 
 # Drop the slots, re-create them, change hot_standby_feedback,
 # check xmin and catalog_xmin values, make slot active and reset stat.
-sub reactive_slots_change_hfs_and_wait_for_xmins
+sub recreate_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
+	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated, $activate) = @_;
 
 	# drop the logical slots
 	drop_logical_slots($previous_slot_prefix);
@@ -203,8 +203,11 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($activate)
+	{
+		$handle =
+			make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -213,7 +216,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $activated) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -230,14 +233,33 @@ sub check_for_invalidation
 		"activeslot slot invalidation is logged $test_name");
 
 	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from 

Re: Synchronizing slots from primary to standby

2024-01-11 Thread Bertrand Drouvot
Hi,

On Fri, Jan 12, 2024 at 03:46:00AM +, Zhijie Hou (Fujitsu) wrote:
> On Thursday, January 11, 2024 11:42 PM Bertrand Drouvot 
>  wrote:
> 
> Hi,
>  
> > On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote:
> > IIUC, this would be a sync slot (so not usable until promotion) that could
> > not be used anyway (invalidated), so I'll vote for drop / re-create then.
> 
> Such a race can happen when a user drops and re-creates the same failover
> slot on the primary as well. For example, the user dropped one failover slot
> and then immediately created a new one by copying from an old slot (using
> pg_copy_logical_replication_slot). Then the slotsync worker will find that
> the restart_lsn of this slot has gone backwards.
> 
> The steps:
> 
> SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 
> 'pgoutput', false, false, true);
> SELECT 'init' FROM pg_create_logical_replication_slot('test', 'pgoutput', 
> false, false, true);
> 
> - Advance the restart_lsn of 'test' slot
> CREATE TABLE test2(a int);
> INSERT INTO test2 SELECT generate_series(1,1,1);
> SELECT slot_name FROM pg_replication_slot_advance('test', 
> pg_current_wal_lsn());
> 
> - re-create the test slot but based on the old isolation_slot.
> SELECT pg_drop_replication_slot('test');
> SELECT 'copy' FROM pg_copy_logical_replication_slot('isolation_slot', 'test');
> 
> Then the restart_lsn of 'test' slot will go backwards.

Yeah, that's right.

BTW, I think it's worth adding those "corner cases" to the TAP tests related to
the sync slot feature (the more coverage the better).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-11 Thread Junwang Zhao
Hi,

On Wed, Jan 10, 2024 at 2:20 PM Sutou Kouhei  wrote:
>
> Hi,
>
> In 
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Fri, 22 Dec 2023 10:58:05 +0800,
>   Junwang Zhao  wrote:
>
> >> 1. Add an opaque space for custom COPY TO handler
> >>* Add CopyToState{Get,Set}Opaque()
> >>
> >> https://github.com/kou/postgres/commit/5a610b6a066243f971e029432db67152cfe5e944
> >>
> >> 2. Export CopyToState::attnumlist
> >>* Add CopyToStateGetAttNumList()
> >>
> >> https://github.com/kou/postgres/commit/15fcba8b4e95afa86edb3f677a7bdb1acb1e7688
> >>
> >> 3. Export CopySend*()
> >>* Rename CopySend*() to CopyToStateSend*() and export them
> >>* Exception: CopySendEndOfRow() to CopyToStateFlush() because
> >>  it just flushes the internal buffer now.
> >>
> >> https://github.com/kou/postgres/commit/289a5640135bde6733a1b8e2c412221ad522901e
> >>
> > I guess the purpose of these helpers is to avoid exposing CopyToState in
> > copy.h,
>
> Yes.
>
> > but I think exposing CopyToState to users might make life easier; users
> > might want to use the memory contexts of the structure (though I agree not
> > all the fields are necessary for extension handlers).
>
> OK. I don't object to it, as I said in another e-mail:
> https://www.postgresql.org/message-id/flat/20240110.120644.1876591646729327180.kou%40clear-code.com#d923173e9625c20319750155083cbd72
>
> >> 2. Need an opaque space like IndexScanDesc::opaque does
> >>
> >>* A custom COPY TO handler needs to keep its data
> >
> > I once thought users might want to parse their own options, maybe this
> > is a use case for this opaque space.
>
> Good catch! I forgot to suggest a callback for custom format
> options. How about the following API?
>
> 
> ...
> typedef bool (*CopyToProcessOption_function) (CopyToState cstate, DefElem 
> *defel);
>
> ...
> typedef bool (*CopyFromProcessOption_function) (CopyFromState cstate, DefElem 
> *defel);
>
> typedef struct CopyToFormatRoutine
> {
> ...
> CopyToProcessOption_function process_option_fn;
> } CopyToFormatRoutine;
>
> typedef struct CopyFromFormatRoutine
> {
> ...
> CopyFromProcessOption_function process_option_fn;
> } CopyFromFormatRoutine;
> 
>
> 
> diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
> index e7597894bf..1aa8b62551 100644
> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -416,6 +416,7 @@ void
>  ProcessCopyOptions(ParseState *pstate,
>CopyFormatOptions *opts_out,
>bool is_from,
> +  void *cstate, /* CopyToState* for !is_from, CopyFromState* for is_from */
>List *options)
>  {
> boolformat_specified = false;
> @@ -593,11 +594,19 @@ ProcessCopyOptions(ParseState *pstate,
>  parser_errposition(pstate, defel->location)));
> }
> else
> -   ereport(ERROR,
> -   (errcode(ERRCODE_SYNTAX_ERROR),
> -errmsg("option \"%s\" not recognized",
> -   defel->defname),
> -parser_errposition(pstate, defel->location)));
> +   {
> +   bool processed;
> +   if (is_from)
> +   processed = opts_out->from_ops->process_option_fn(cstate, defel);
> +   else
> +   processed = opts_out->to_ops->process_option_fn(cstate, defel);
> +   if (!processed)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_SYNTAX_ERROR),
> +errmsg("option \"%s\" not recognized",
> +   defel->defname),
> +parser_errposition(pstate, defel->location)));
> +   }
> }
>
> /*
> 

Looks good.

>
> > For the name, I thought private_data might be a better candidate than
> > opaque, but I do not insist.
>
> I don't have a strong opinion on this. Here are the numbers
> of headers that use "private_data" and "opaque":
>
> $ grep -r private_data --files-with-matches src/include | wc -l
> 6
> $ grep -r opaque --files-with-matches src/include | wc -l
> 38
>
> It seems that we use "opaque" more than "private_data" in our code.
>
> > Do you use the arrow library to control the memory?
>
> Yes.
>
> > Is there a way that
> > we can let the arrow use postgres' 

Re: Synchronizing slots from primary to standby

2024-01-11 Thread Bertrand Drouvot
Hi,

On Fri, Jan 12, 2024 at 08:42:39AM +0530, Amit Kapila wrote:
> On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot
>  wrote:
> >
> > On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote:
> > > >
> > > > To close the above race, I could think of the following ways:
> > > > 1. Drop and re-create the slot.
> > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves
> > > > ahead of local_slot's LSN then we can update it; but as mentioned in
> > > > your previous comment, we need to update all other fields as well. If
> > > > we follow this then we probably need to have a check for catalog_xmin
> > > > as well.
> >
> > IIUC, this would be a sync slot (so not usable until promotion) that could
> > not be used anyway (invalidated), so I'll vote for drop / re-create then.
> >
> 
> No, it can happen for non-sync slots as well.

Yeah, I meant that we could decide to drop/re-create only for sync slots.

> 
> > > > Now, related to this the other case which needs some handling is what
> > > > if the remote_slot's restart_lsn is greater than local_slot's
> > > > restart_lsn but it is a re-created slot with the same name. In that
> > > > case, I think the other properties like 'two_phase', 'plugin' could be
> > > > different. So, is simply copying those sufficient or do we need to do
> > > > something else as well?
> > > >
> > >
> >
> I'm not sure I follow here. If the remote slot is re-created then it would
> also be dropped / re-created locally, or am I missing something?
> >
> 
> As our slot-syncing mechanism is asynchronous (from time to time we
> check the slot information on the primary), isn't it possible that a
> slot with the same name is dropped and recreated between the slot-sync
> worker's checks?
> 

Yeah, I should have thought harder ;-) So for this case: if we had an easy way
to detect that a remote slot has been dropped/re-created, then I think we
would also drop and re-create it on the standby.

If so, I think we should then update all the fields (that we're currently
updating in the "create locally" case) when we detect that (at least) one of
the following differs:

- dboid
- plugin
- two_phase

Maybe the "best" approach would be to have a way to detect that a slot has been
re-created on the primary (but that would mean rely on more than the slot name
to "identify" a slot and probably add a new member to the struct to do so).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Make NUM_XLOGINSERT_LOCKS configurable

2024-01-11 Thread Bharath Rupireddy
On Wed, Jan 10, 2024 at 11:43 AM Tom Lane  wrote:
>
> Bharath Rupireddy  writes:
> > On Wed, Jan 10, 2024 at 10:00 AM Tom Lane  wrote:
> >> Maybe.  I bet just bumping up the constant by 2X or 4X or so would get
> >> most of the win for far less work; it's not like adding a few more
> >> LWLocks is expensive.  But we need some evidence about what to set it to.
>
> > I previously made an attempt to improve WAL insertion performance with
> > varying NUM_XLOGINSERT_LOCKS. IIRC, we will lose what we get by
> > increasing insertion locks (reduction in WAL insertion lock
> > acquisition time) to the CPU overhead of flushing the WAL in
> > WaitXLogInsertionsToFinish as referred to by the following comment.
>
> Very interesting --- this is at variance with what the OP said, so
> we definitely need details about the test conditions in both cases.
>
> > Unfortunately, I've lost the test results, I'll run them up again and
> > come back.
>
> Please.

Okay, I'm back with some testing.

Test case:
./pgbench --initialize --scale=100 --username=ubuntu postgres
./pgbench --progress=10 --client=64 --time=300 --builtin=tpcb-like
--username=ubuntu postgres

Setup:
./configure --prefix=$PWD/inst/ CFLAGS="-ggdb3 -O3" > install.log &&
make -j 8 install > install.log 2>&1 &

shared_buffers = '8GB'
max_wal_size = '32GB'
track_wal_io_timing = on

Stats measured:
I've used the attached patch to measure WAL Insert Lock Acquire Time
(wal_insert_lock_acquire_time)  and WAL Wait for In-progress Inserts
to Finish Time (wal_wait_for_insert_to_finish_time).

Results with varying NUM_XLOGINSERT_LOCKS (note that we can't allow it
to be more than MAX_SIMUL_LWLOCKS):

Locks  TPS    WAL Insert Lock       WAL Wait for In-progress Inserts
              Acquire Time (ms)     to Finish Time (ms)
8      18669  12532                 8775
16     18076  10641                 13491
32     18034  6635                  13997
64     17582  3937                  14718
128    17782  4563                  20145

Also, check the attached graph. Clearly there's an increase in the time
spent waiting for in-progress insertions to finish in
WaitXLogInsertionsToFinish, from 8.7 seconds to 20 seconds, whereas the time
spent acquiring WAL insertion locks decreased from 12.5 seconds to 4.5
seconds. Overall, this hasn't resulted in any improvement in TPS; in fact, I
observed a slight reduction.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-WAL-insertion-stats.patch
Description: Binary data


RE: speed up a logical replica setup

2024-01-11 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Sorry for disturbing your work and thanks for updates.
I will review your patch again.

>
* 0001 patch
It is almost the same as v3-0001, which was posted by Euler.
An unnecessary change for Mkvcbuild.pm (this file was removed) was ignored.

v5 removes the MSVC support.
>

Confirmed that the patch could be applied.

>
* 0002 patch
This contains small fixes to keep the compiler quiet.

I applied it, although I used a different approach for the format specifier.
>

Good, all warnings were removed. However, the patch failed to pass the tests
on FreeBSD twice.
I'm not quite sure about the ERROR, but is it related to us?

>
* 0003 patch
This addresses comments posted to -hackers. For now, this does not contain
docs. I will add them if everyone agrees with these ideas.

I didn't review all items but ...

1.
An option --port was added to control the port number for the physical standby.
Users can specify a port number via the option, or an environment variable
PGSUBPORT. If not specified, a fixed value (50111) would be used.

My first reaction as a new user would be: why do I need to specify a port if my
--subscriber-conninfo already contains a port? Ugh. I'm wondering if we can do
it behind the scenes. Try a range of ports.
>

My initial motivation for the setting was to avoid establishing connections
during pg_subscriber. While considering it more, I started to think that
--subscriber-conninfo may not be needed. pg_upgrade does not require such a
string: it requires a username, and optionally a port number (which would be
used during the upgrade) instead. The advantage of this approach is that we
do not have to parse the connection string.
What do you think?

>
2.
A FATAL error would be raised if --subscriber-conninfo specifies a non-local
server.

Extra protection is always good. However, let's make sure this code path is
really useful. I'll think a bit about it.
>

OK, I can wait for your consideration. Note that if we follow the pg_upgrade
style, we may be able to reuse check_pghost_envvar().

>
3. 
Options -o/-O were added to specify options for publications/subscriptions.

Flexibility is cool. However, I think the cost-benefit of it is not good. You
have to parse the options to catch preliminary errors. Things like publishing
only deletes, and subscription options that conflict with the embedded ones,
are additional sources of failure.
>

As I already replied, let's drop it for now. We can resume based on user
demand.

>
4. 
Made the standby save its output to a log file.

It was already done in v5. I did it in a different way.
>

Good, I felt that yours was better. BTW, can we record the output of
pg_subscriber to a file as well?
pg_upgrade does a similar thing. Thoughts?

>
5. 
Unnecessary Assert in drop_replication_slot() was removed.

Instead, I fixed the code and kept the assert.
>

Cool.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: doc: add LITERAL tag to RETURNING

2024-01-11 Thread Ashutosh Bapat
On Fri, Jan 12, 2024 at 11:27 AM torikoshia  wrote:
>
> Hi,
>
> RETURNING is usually tagged with appropriate tags, such as <literal>,
> but not in the 'query' section of COPY.

I have the same observation.

>
> https://www.postgresql.org/docs/devel/sql-copy.html
>
> Would it be better to put <literal> here as well?
>

The patch looks good.

-- 
Best Wishes,
Ashutosh Bapat




doc: add LITERAL tag to RETURNING

2024-01-11 Thread torikoshia

Hi,

RETURNING is usually tagged with appropriate tags, such as <literal>,
but not in the 'query' section of COPY.


https://www.postgresql.org/docs/devel/sql-copy.html

Would it be better to put <literal> here as well?

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

From 3c9efe404310bf01d79b2f0f006541ebc0b170a0 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Fri, 12 Jan 2024 14:33:47 +0900
Subject: [PATCH v1] Added literal tag for RETURNING.

---
 doc/src/sgml/ref/copy.sgml | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69c33..e2ffbbdf84 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -128,10 +128,10 @@ COPY { table_name [ ( 
  
   For INSERT, UPDATE and
-  DELETE queries a RETURNING clause must be provided,
-  and the target relation must not have a conditional rule, nor
-  an ALSO rule, nor an INSTEAD rule
-  that expands to multiple statements.
+  DELETE queries a RETURNING clause
+  must be provided, and the target relation must not have a conditional
+  rule, nor an ALSO rule, nor an
+  INSTEAD rule that expands to multiple statements.
  
 


base-commit: 08c3ad27eb5348d0cbffa843a3edb11534f9904a
-- 
2.39.2



Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-11 Thread Sutou Kouhei
Hi,

Here is the current summary of this discussion about making
COPY format extendable. It's for reaching consensus and
starting to implement the feature. (I'll start implementing
the feature once we reach consensus.) If you have any
opinions, please share them.

Confirmed:

1.1 Making COPY format extendable will not reduce performance.
[1]

Decisions:

2.1 Use separate handlers for COPY TO and COPY FROM because
our COPY TO implementation (copyto.c) and COPY FROM
implementation (copyfrom.c) are separated.
[2]

2.2 Don't use system catalog for COPY TO/FROM handlers. We can
just use a function(internal) that returns a handler instead.
[3]

2.3 The implementation must include documentation.
[5]

2.4 The implementation must include tests.
[6]

2.5 The implementation should consist of small patches
that are easy to review.
[6]

2.7 Copy{To,From}State must have an opaque space for
handlers.
[8]

2.8 Export CopySendData() and CopySendEndOfRow() for COPY TO
handlers.
[8]

2.9 Make "format" in PgMsg_CopyOutResponse message
extendable.
[9]

2.10 Make makeNode() call avoidable in function(internal)
 that returns COPY TO/FROM handler.
 [9]

2.11 Custom COPY TO/FROM handlers must be able to parse
 their options.
 [11]
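
To illustrate 2.9/2.11 from the user's side, the envisioned usage would be
something like the following ("arrow" and "compression" are hypothetical
names here, just for illustration):

-- The extension's handler would be looked up by the format name, and its
-- process_option_fn (2.11) would be responsible for parsing "compression":
COPY my_table TO '/tmp/my_table.data' WITH (FORMAT arrow, compression 'zstd');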

Discussing:

3.1 Should we use one function(internal) for COPY TO/FROM
handlers or two function(internal)s (one is for COPY TO
handler and another is for COPY FROM handler)?
[4]

3.2 If we use separate function(internal)s for COPY TO/FROM
handlers, we need to define a naming rule. For example,
<format>_to(internal) for the COPY TO handler and
<format>_from(internal) for the COPY FROM handler.
[4]

3.3 Should we use prefix or suffix for function(internal)
name to avoid name conflict with other handlers such as
tablesample handlers?
[7]

3.4 Should we export Copy{To,From}State? Or should we just
provide getters/setters to access Copy{To,From}State
internals?
[10]


[1] 
https://www.postgresql.org/message-id/flat/20231204.153548.2126325458835528809.kou%40clear-code.com
[2] https://www.postgresql.org/message-id/flat/ZXEUIy6wl4jHy6Nm%40paquier.xyz
[3] 
https://www.postgresql.org/message-id/flat/CAD21AoAhcZkAp_WDJ4sSv_%2Bg2iCGjfyMFgeu7MxjnjX_FutZAg%40mail.gmail.com
[4] 
https://www.postgresql.org/message-id/flat/CAD21AoDkoGL6yJ_HjNOg9cU%3DaAdW8uQ3rSQOeRS0SX85LPPNwQ%40mail.gmail.com
[5] 
https://www.postgresql.org/message-id/flat/TY3PR01MB9889C9234CD220A3A7075F0DF589A%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[6] https://www.postgresql.org/message-id/flat/ZXbiPNriHHyUrcTF%40paquier.xyz
[7] 
https://www.postgresql.org/message-id/flat/20231214.184414.2179134502876898942.kou%40clear-code.com
[8] 
https://www.postgresql.org/message-id/flat/20231221.183504.1240642084042888377.kou%40clear-code.com
[9] https://www.postgresql.org/message-id/flat/ZYTfqGppMc9e_w2k%40paquier.xyz
[10] 
https://www.postgresql.org/message-id/flat/CAD21AoD%3DUapH4Wh06G6H5XAzPJ0iJg9YcW8r7E2UEJkZ8QsosA%40mail.gmail.com
[11] 
https://www.postgresql.org/message-id/flat/20240110.152023.1920937326588672387.kou%40clear-code.com


Thanks,
-- 
kou




Re: Synchronizing slots from primary to standby

2024-01-11 Thread Amit Kapila
On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot
 wrote:
>
> On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote:
> > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila  wrote:
> > >
> > > +static bool
> > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot)
> > > {
> > > ...
> > > + /* Slot ready for sync, so sync it. */
> > > + else
> > > + {
> > > + /*
> > > + * Sanity check: With hot_standby_feedback enabled and
> > > + * invalidations handled appropriately as above, this should never
> > > + * happen.
> > > + */
> > > + if (remote_slot->restart_lsn < slot->data.restart_lsn)
> > > + elog(ERROR,
> > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> > > + " to remote slot's LSN(%X/%X) as synchronization"
> > > + " would move it backwards", remote_slot->name,
> > > + LSN_FORMAT_ARGS(slot->data.restart_lsn),
> > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn));
> > > ...
> > > }
> > >
> > > I was thinking about the above code in the patch and as far as I can
> > > think this can only occur if the same name slot is re-created with
> > > prior restart_lsn after the existing slot is dropped. Normally, the
> > > newly created slot (with the same name) will have higher restart_lsn
> > > but one can mimic it by copying some older slot by using
> > > pg_copy_logical_replication_slot().
> > >
> > > I don't think as mentioned in comments even if hot_standby_feedback is
> > > temporarily set to off, the above shouldn't happen. It can only lead
> > > to invalidated slots on standby.
>
> I also think so.
>
> > >
> > > To close the above race, I could think of the following ways:
> > > 1. Drop and re-create the slot.
> > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves
> > > ahead of local_slot's LSN then we can update it; but as mentioned in
> > > your previous comment, we need to update all other fields as well. If
> > > we follow this then we probably need to have a check for catalog_xmin
> > > as well.
>
> IIUC, this would be a sync slot (so not usable until promotion) that could
> not be used anyway (invalidated), so I'll vote for drop / re-create then.
>

One more drawback I see in such a case (where the slot could have been
dropped on the primary) is that it is not advisable to keep it on the
standby. So, I also think we should drop and re-create the slots in
this case, unless I am missing something here.

-- 
With Regards,
Amit Kapila.




Re: Compile warnings in dbcommands.c building with meson

2024-01-11 Thread jian he
On Fri, Jan 12, 2024 at 1:05 AM Magnus Hagander  wrote:
>
> On Wed, Jan 10, 2024 at 1:16 PM Aleksander Alekseev
>  wrote:
> >
> > Hi,
> >
> > > When building current head on debian bullseye I get this compile warning:
> > >
> > > In file included from ../src/backend/commands/dbcommands.c:20:
> > > ../src/backend/commands/dbcommands.c: In function ‘createdb’:
> > > ../src/include/postgres.h:104:9: warning: ‘src_hasloginevt’ may be
> > > used uninitialized in this function [-Wmaybe-uninitialized]
> > >   104 |  return (Datum) (X ? 1 : 0);
> > >   | ^~~
> > > ../src/backend/commands/dbcommands.c:683:8: note: ‘src_hasloginevt’
> > > was declared here
> > >   683 |  bool  src_hasloginevt;
> > >   |^~~
> > >
> > >
> > > I only get this when building with meson, not when building with
> > > autotools. AFAICT, I have the same config:
> > >
> > > ./configure --enable-debug --enable-depend --with-python
> > > --enable-cassert --with-openssl --enable-tap-tests --with-icu
> > >
> > > vs
> > >
> > > meson setup build -Ddebug=true -Dpython=true -Dcassert=true
> > > -Dssl=openssl -Dtap-test=true -Dicu=enabled -Dnls=disabled
> > >
> > >
> > > in both cases the compiler is:
> > > gcc (Debian 10.2.1-6) 10.2.1 20210110
> >
> > Seems to me that the compiler is not smart enough to process:
> >
> > ```
> > if (!get_db_info(dbtemplate, ShareLock,
> >  _dboid, _owner, _encoding,
> >  _istemplate, _allowconn, _hasloginevt,
> >  _frozenxid, _minmxid, _deftablespace,
> >  _collate, _ctype, _iculocale,
> > _icurules, _locprovider,
> >  _collversion))
> > ereport(ERROR, ...
> > ```
> >
> > Should we just silence the warning like this - see attachment? I don't
> > think createdb() is called often enough to worry about a slight
> > performance change, if there is any.
>
> Certainly looks that way, but I'm curious as to why nobody else has seen 
> this..
>

I saw it sometimes, and sometimes not.
Now I think the reason is:
it appears when you build with `-Dbuildtype=release`,
but it does not occur when I build with `-Dbuildtype=debug`.

My current meson version is 1.3.1, and my ninja version is 1.10.1.




RE: speed up a logical replica setup

2024-01-11 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

> >
> >
> > 3.
> > Options -o/-O were added to specify options for publications/subscriptions.
> >
> >
> > Flexibility is cool. However, I think the cost-benefit of it is not good.
> > You have to parse the options to catch preliminary errors. Things like
> > publishing only deletes, and subscription options that conflict with the
> > embedded ones, are additional sources of failure.
> >
> 
> Yeah, I am also not sure we need those. Did we discuss that
> previously? OTOH, we may consider enhancing this tool later if we
> have user demand for such options.

OK, so let's drop it for now and consider it later as an enhancement.
As Euler said, it leads to additional ERRORs.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: pub/sub - specifying optional parameters without values.

2024-01-11 Thread Peter Smith
On Mon, Jan 30, 2023 at 8:36 AM Tom Lane  wrote:
>
> Zheng Li  writes:
> > The behavior is due to the following code
> > https://github.com/postgres/postgres/blob/master/src/backend/commands/define.c#L113
>
> Yeah, so you can grep for places that have this behavior by looking
> for defGetBoolean calls ... and there are quite a few.  That leads
> me to the conclusion that we'd better invent a fairly stylized
> documentation solution that we can plug into a lot of places,
> rather than thinking of slightly different ways to say it and
> places to say it.  I'm not necessarily opposed to Peter's desire
> to fix replication-related commands first, but we have more to do
> later.
>
> I'm also not that thrilled with putting the addition up at the top
> of the relevant text.  This behavior is at least two decades old,
> so if we've escaped documenting it at all up to now, it can't be
> that important to most people.
>
> I also notice that ALTER SUBSCRIPTION has fully three different
> sub-sections with about equal claims on this note, if we're going
> to stick it directly into the affected option lists.
>
> That all leads me to propose that we add the new text at the end of
> the Parameters  in the affected man pages.  So about
> like the attached.  (I left out alter_publication.sgml, as I'm not
> sure it needs its own copy of this text --- it doesn't describe
> individual parameters at all, just refer to CREATE PUBLICATION.)
>
> regards, tom lane
>

Hi,

Here is a similar update for another page: "55.4 Streaming Replication
Protocol" [0]. This patch was prompted by a review comment reply at
[1] (#2).

I've used text almost the same as the boilerplate text added by the
previous commit [2]
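
For reference, the behavior being documented is that a boolean option
specified without a value defaults to true (that's what defGetBoolean does),
e.g.:

-- These two commands are equivalent:
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres host=primary'
    PUBLICATION mypub WITH (copy_data);
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres host=primary'
    PUBLICATION mypub WITH (copy_data = true);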

~

PSA patch v4.

==
[0] https://www.postgresql.org/docs/devel/protocol-replication.html
[1] 
https://www.postgresql.org/message-id/OS0PR01MB571663BCE8B28597D462FADE946A2%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] 
https://github.com/postgres/postgres/commit/ec7e053a98f39a9e3c7e6d35f0d2e83933882399

Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Describe-default-if-boolean-value-omitted.patch
Description: Binary data


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-11 Thread Michael Paquier
On Thu, Jan 11, 2024 at 01:43:27PM -0500, Robert Haas wrote:
> On Tue, Jan 9, 2024 at 2:35 PM Andres Freund  wrote:
>> I don't have that strong feelings about it. If both of you think it
>> looks good, go ahead...
> 
> Done.

Thanks for e2d5b3b9b643.
--
Michael


signature.asc
Description: PGP signature


Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-11 Thread Michael Paquier
On Fri, Jan 12, 2024 at 08:35:42AM +0900, Michael Paquier wrote:
> It also seems like you've missed this message, where this has been
> mentioned (spoiler: first version of the patch used an alternate
> output):
> https://www.postgresql.org/message-id/zunuzpimkzcov...@paquier.xyz 

The refactoring of 0001 has now been applied as of e72a37528dda, and
the buildfarm looks stable (at least for now).

Here is a rebased patch set of the rest.
--
Michael
From 0265596520b4b39dcb871a524e5715f5c18a8c37 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 9 Jan 2024 13:20:01 +0900
Subject: [PATCH v9 1/4] Add backend facility for injection points

This adds a set of routines allowing developers to attach, detach and
run custom code based on arbitrary code paths set with a centralized
macro called INJECTION_POINT().  Injection points are registered in a
shared hash table.  Processes also use a local cache to avoid loading
callbacks more than necessary, cleaning up their cache if a callback is
found to have been removed.
---
 src/include/pg_config.h.in|   3 +
 src/include/utils/injection_point.h   |  37 ++
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/storage/lmgr/lwlocknames.txt  |   1 +
 .../utils/activity/wait_event_names.txt   |   1 +
 src/backend/utils/misc/Makefile   |   1 +
 src/backend/utils/misc/injection_point.c  | 334 ++
 src/backend/utils/misc/meson.build|   1 +
 doc/src/sgml/installation.sgml|  30 ++
 doc/src/sgml/xfunc.sgml   |  54 +++
 src/makefiles/meson.build |   1 +
 configure |  34 ++
 configure.ac  |   7 +
 meson.build   |   1 +
 meson_options.txt |   3 +
 src/Makefile.global.in|   1 +
 src/tools/pgindent/typedefs.list  |   2 +
 17 files changed, 514 insertions(+)
 create mode 100644 src/include/utils/injection_point.h
 create mode 100644 src/backend/utils/misc/injection_point.c

diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5f16918243..288bb9cb42 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -698,6 +698,9 @@
 /* Define to build with ICU support. (--with-icu) */
 #undef USE_ICU
 
+/* Define to 1 to build with injection points. (--enable-injection-points) */
+#undef USE_INJECTION_POINTS
+
 /* Define to 1 to build with LDAP support. (--with-ldap) */
 #undef USE_LDAP
 
diff --git a/src/include/utils/injection_point.h b/src/include/utils/injection_point.h
new file mode 100644
index 00..55524b568f
--- /dev/null
+++ b/src/include/utils/injection_point.h
@@ -0,0 +1,37 @@
+/*-
+ * injection_point.h
+ *	  Definitions related to injection points.
+ *
+ * Copyright (c) 2001-2024, PostgreSQL Global Development Group
+ *
+ * src/include/utils/injection_point.h
+ *-
+ */
+
+#ifndef INJECTION_POINT_H
+#define INJECTION_POINT_H
+
+/*
+ * Injections points require --enable-injection-points.
+ */
+#ifdef USE_INJECTION_POINTS
+#define INJECTION_POINT(name) InjectionPointRun(name)
+#else
+#define INJECTION_POINT(name) ((void) name)
+#endif
+
+/*
+ * Typedef for callback function launched by an injection point.
+ */
+typedef void (*InjectionPointCallback) (const char *name);
+
+extern Size InjectionPointShmemSize(void);
+extern void InjectionPointShmemInit(void);
+
+extern void InjectionPointAttach(const char *name,
+ const char *library,
+ const char *function);
+extern void InjectionPointRun(const char *name);
+extern void InjectionPointDetach(const char *name);
+
+#endif			/* INJECTION_POINT_H */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e5119ed55d..6b9dcd8057 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -50,6 +50,7 @@
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
 #include "utils/guc.h"
+#include "utils/injection_point.h"
 #include "utils/snapmgr.h"
 #include "utils/wait_event.h"
 
@@ -149,6 +150,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
+	size = add_size(size, InjectionPointShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -351,6 +353,7 @@ CreateOrAttachShmemStructs(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
+	InjectionPointShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index d621f5507f..0a0f02e055 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ 

reorganize "Shared Memory and LWLocks" section of docs

2024-01-11 Thread Nathan Bossart
I recently began trying to write documentation for the dynamic shared
memory registry feature [0], and I noticed that the "Shared Memory and
LWLocks" section of the documentation might need some improvement.  At
least, I felt that it would be hard to add any new content to this section
without making it very difficult to follow.

Concretely, I am proposing breaking it into two sections: one for shared
memory and one for LWLocks.  Furthermore, the LWLocks section will be split
into two: one for requesting locks at server startup and one for requesting
locks after server startup.  I intend to also split the shared memory
section into at-startup and after-startup sections if/when the dynamic
shared memory registry feature is committed.

Besides this restructuring, I felt that certain parts of this documentation
could benefit from rephrasing and/or additional detail.

Thoughts?

[0] https://postgr.es/m/20231205034647.GA2705267%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From dc5c1b37ced883021f4a92a17aa50b9d80d73fc6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v1 1/1] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 180 +---
 1 file changed, 112 insertions(+), 68 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..a758384a9a 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,134 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
 void RequestAddinShmemSpace(int size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend should obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+boolfound;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Any registered
+  shmem_startup_hook will be executed shortly after each
+  backend attaches to shared memory.  Note that add-ins should still
+  acquire AddinShmemInitLock within this hook, as
+  shown in the example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  Like with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)
 
- from your shmem_request_hook.  This will ensure that an array of
- num_lwlocks LWLocks is available under the name
- tranche_name.  Use GetNamedLWLockTranche
- to get a pointer to this array.
-
-
- An example of a shmem_request_hook can 

Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-11 Thread Ashutosh Bapat
On Fri, Jan 12, 2024 at 5:05 AM Michael Paquier  wrote:
>
> On Thu, Jan 11, 2024 at 02:47:27PM +0530, Ashutosh Bapat wrote:
> > On Thu, Jan 11, 2024 at 9:42 AM Michael Paquier  wrote:
> >> I don't really aim for complicated here, just useful.
> >
> > It isn't complicated. Such simple error check improve user's
> > confidence on the feature and better be part of the 1st cut.
>
> I'm really not sure about that, because it does not impact the scope
> of the facility even with all the use cases I've seen where injection
> points could be used.  It could always be added later if there's a
> strong push for it.  For testing, I'm biased about attempting to load
> callbacks in the process attaching them.
>

I am not able to understand the objection to adding another handful of
lines of code. The core code is quite minimal, and it had better be robust.
We may seek someone else's opinion to break the tie.

> >>> With v6 I could run the test when built with enable_injection_point
> >>> false. I just ran make check in that folder. Didn't test meson build.
> >>
> >> The CI has been failing because 041_invalid_checkpoint_after_promote
> >> was loading Time::HiRes::nanosleep and Windows does not support it.
> >
> > Some miscommunication here. The SQL test under the injection_point module
> > can be run in a build without injection points, and it fails. I think
> > it's better to have an alternate output for it or prohibit the test from
> > running at all.
>
> The same problem exists if you try to run the SSL tests in
> src/test/ssl/ without support build for them.  Protections at the
> upper levels are good enough for the CI and the buildfarm, while
> making the overall maintenance cheaper, so I'm happy with just these.
>
> It also seems like you've missed this message, where this has been
> mentioned (spoiler: first version of the patch used an alternate
> output):
> https://www.postgresql.org/message-id/zunuzpimkzcov...@paquier.xyz

Ah! Sorry for missing that. If there's a precedent, I am OK with it. If
confusion arises, we can fix it later.

-- 
Best Wishes,
Ashutosh Bapat




Re: A failure in t/038_save_logical_slots_shutdown.pl

2024-01-11 Thread Amit Kapila
On Thu, Jan 11, 2024 at 10:03 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 11, 2024 at 4:35 PM vignesh C  wrote:
> >
> > On further analysis, it was found that in the failing test,
> > CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL
> > page header present just before the CHECKPOINT_SHUTDOWN which was
> > causing the failure. We could alternatively reproduce the issue by
> > switching the WAL file before restarting the server like in the
> > attached test change patch.
> > There are a couple of ways to fix this issue a) one by switching the
> > WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN
> > does not get inserted in a new page as in the attached test_fix.patch
> > b) by using pg_walinspect to check that the next WAL record is
> > CHECKPOINT_SHUTDOWN. I have to try this approach.
> >
> > Thanks to Bharath and Kuroda-san for offline discussions and helping
> > in getting to the root cause.
>
> IIUC, the problem that commit e0b2eed tries to solve is to ensure there
> are no left-over decodable WAL records between the confirmed_flush LSN and
> a shutdown checkpoint, which is what is expected by
> t/038_save_logical_slots_shutdown.pl. How about we have a PG function
> returning true if there are any decodable WAL records between the
> given start_lsn and end_lsn?
>

But we already test this in 003_logical_slot during a successful
upgrade. Having an explicit test to do the same thing has some merit,
but I'm not sure it is worth it. The current test tries to ensure that
during shutdown, after we shut down the walsender and ensure that it has
sent all the WAL records and received an ack for them, there is no
other WAL except shutdown_checkpoint. Vignesh's suggestion (a) makes
the test robust enough that it shouldn't show spurious failures like
the current one reported by you, so shall we try to proceed with that?
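
For reference, the function Bharath suggests could presumably be prototyped
with pg_walinspect. An untested sketch, with placeholder LSNs, and column
names as in recent pg_walinspect versions:

-- true if nothing but the shutdown checkpoint remains in the range:
SELECT count(*) = 0 AS no_decodable_wal_left
FROM pg_get_wal_records_info('0/1548000'::pg_lsn, '0/1549000'::pg_lsn)
WHERE NOT (resource_manager = 'XLOG' AND record_type = 'CHECKPOINT_SHUTDOWN');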

-- 
With Regards,
Amit Kapila.




Re: Recovering from detoast-related catcache invalidations

2024-01-11 Thread Xiaoran Wang
> Also, I'm pretty dubious that GetNonHistoricCatalogSnapshot rather
> than GetCatalogSnapshot is the right thing, because the catcaches
> use the latter.
Yes, you are right; we should use GetCatalogSnapshot here.

> Maybe, but that undocumented hack in SetHintBits seems completely
> unacceptable.  Isn't there a cleaner way to make this check?
Maybe we don't need to call 'HeapTupleSatisfiesVisibility' to check if the
tuple has been deleted.
As the tuple's xmin must have been committed, we just need to check whether
its xmax is committed,
like below:


@@ -1956,9 +1956,11 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	 */
 	if (HeapTupleHasExternal(ntp))
 	{
+		TransactionId xmax;

 		dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
-		if (!HeapTupleSatisfiesVisibility(ntp, GetNonHistoricCatalogSnapshot(cache->cc_reloid), InvalidBuffer))
+		xmax = HeapTupleHeaderGetUpdateXid(ntp->t_data);
+		if (TransactionIdIsValid(xmax) && TransactionIdDidCommit(xmax))
 		{
 			heap_freetuple(dtp);
 			return NULL;


I'm not quite sure the code is correct; I cannot clearly understand
'HeapTupleHeaderGetUpdateXid', and I need more time to dive into it.

Any thoughts?


Tom Lane  wrote on Friday, January 12, 2024 at 06:21:

> Xiaoran Wang  writes:
> >>> The detection of "get an invalidation" could be refined: what I did
> >>> here is to check for any advance of SharedInvalidMessageCounter,
> >>> which clearly will have a significant number of false positives.
>
> > I have reviewed your patch, and it looks good.  But instead of checking
> > for any advance of SharedInvalidMessageCounter (if the invalidation
> > message is not related to the current tuple, it is a little expensive), I
> > have another idea: we can recheck the visibility of the tuple with the
> > CatalogSnapshot (the CatalogSnapshot must be refreshed if there are any
> > SharedInvalidMessages); if it is not visible, we re-fetch the tuple;
> > otherwise, we can continue to use it as it is not outdated.
>
> Maybe, but that undocumented hack in SetHintBits seems completely
> unacceptable.  Isn't there a cleaner way to make this check?
>
> Also, I'm pretty dubious that GetNonHistoricCatalogSnapshot rather
> than GetCatalogSnapshot is the right thing, because the catcaches
> use the latter.
>
> regards, tom lane
>


RE: Synchronizing slots from primary to standby

2024-01-11 Thread Zhijie Hou (Fujitsu)
On Thursday, January 11, 2024 11:42 PM Bertrand Drouvot 
 wrote:

Hi,
 
> On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote:
> > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila 
> wrote:
> > >
> > > +static bool
> > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot
> > > +*remote_slot)
> > > {
> > > ...
> > > + /* Slot ready for sync, so sync it. */ else {
> > > + /*
> > > + * Sanity check: With hot_standby_feedback enabled and
> > > + * invalidations handled appropriately as above, this should never
> > > + * happen.
> > > + */
> > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) elog(ERROR,
> > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> > > + " to remote slot's LSN(%X/%X) as synchronization"
> > > + " would move it backwards", remote_slot->name,
> > > + LSN_FORMAT_ARGS(slot->data.restart_lsn),
> > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn));
> > > ...
> > > }
> > >
> > > I was thinking about the above code in the patch and as far as I can
> > > think this can only occur if the same name slot is re-created with
> > > prior restart_lsn after the existing slot is dropped. Normally, the
> > > newly created slot (with the same name) will have higher restart_lsn
> > > but one can mimic it by copying some older slot by using
> > > pg_copy_logical_replication_slot().
> > >
> > > I don't think as mentioned in comments even if hot_standby_feedback
> > > is temporarily set to off, the above shouldn't happen. It can only
> > > lead to invalidated slots on standby.
> 
> I also think so.
> 
> > >
> > > To close the above race, I could think of the following ways:
> > > 1. Drop and re-create the slot.
> > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves
> > > ahead of local_slot's LSN then we can update it; but as mentioned in
> > > your previous comment, we need to update all other fields as well.
> > > If we follow this then we probably need to have a check for
> > > catalog_xmin as well.
> 
> IIUC, this would be a sync slot (so not usable until promotion) that could
> not be used anyway (invalidated), so I'll vote for drop / re-create then.

Such a race can happen when a user drops and re-creates the same failover
slot on the primary as well. For example, the user dropped one failover slot
and then immediately created a new one by copying from an old slot (using
pg_copy_logical_replication_slot). Then the slotsync worker will find that
the restart_lsn of this slot has gone backwards.

The steps:

SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 
'pgoutput', false, false, true);
SELECT 'init' FROM pg_create_logical_replication_slot('test', 'pgoutput', 
false, false, true);

- Advance the restart_lsn of 'test' slot
CREATE TABLE test2(a int);
INSERT INTO test2 SELECT generate_series(1,1,1);
SELECT slot_name FROM pg_replication_slot_advance('test', pg_current_wal_lsn());

- re-create the test slot but based on the old isolation_slot.
SELECT pg_drop_replication_slot('test');
SELECT 'copy' FROM pg_copy_logical_replication_slot('isolation_slot', 'test');

Then the restart_lsn of 'test' slot will go backwards.

Best Regards,
Hou zj





Re: speed up a logical replica setup

2024-01-11 Thread Amit Kapila
On Fri, Jan 12, 2024 at 4:30 AM Euler Taveira  wrote:
>
>
> 3.
> Options -o/-O were added to specify options for publications/subscriptions.
>
>
> Flexibility is cool. However, I think the cost-benefit of it is not good.
> You have to parse the options to catch preliminary errors. Things like
> publishing only deletes, and subscription options that conflict with the
> embedded ones, are additional sources of failure.
>

Yeah, I am also not sure we need those. Did we discuss that previously?
OTOH, we may consider enhancing this tool later if we have user demand for
such options.

BTW, I think we need some way to at least drop the existing
subscriptions; otherwise, the newly created subscriber will attempt to
fetch data which may not be intended. Ashutosh made an argument in the
above thread that we need an option for publications as well.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-01-11 Thread Amit Kapila
On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot
 wrote:
>
> On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote:
> > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila  wrote:
> > >
> > > +static bool
> > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot)
> > > {
> > > ...
> > > + /* Slot ready for sync, so sync it. */
> > > + else
> > > + {
> > > + /*
> > > + * Sanity check: With hot_standby_feedback enabled and
> > > + * invalidations handled appropriately as above, this should never
> > > + * happen.
> > > + */
> > > + if (remote_slot->restart_lsn < slot->data.restart_lsn)
> > > + elog(ERROR,
> > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> > > + " to remote slot's LSN(%X/%X) as synchronization"
> > > + " would move it backwards", remote_slot->name,
> > > + LSN_FORMAT_ARGS(slot->data.restart_lsn),
> > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn));
> > > ...
> > > }
> > >
> > > I was thinking about the above code in the patch and as far as I can
> > > think this can only occur if the same name slot is re-created with
> > > prior restart_lsn after the existing slot is dropped. Normally, the
> > > newly created slot (with the same name) will have higher restart_lsn
> > > but one can mimic it by copying some older slot by using
> > > pg_copy_logical_replication_slot().
> > >
> > > I don't think as mentioned in comments even if hot_standby_feedback is
> > > temporarily set to off, the above shouldn't happen. It can only lead
> > > to invalidated slots on standby.
>
> I also think so.
>
> > >
> > > To close the above race, I could think of the following ways:
> > > 1. Drop and re-create the slot.
> > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves
> > > ahead of local_slot's LSN then we can update it; but as mentioned in
> > > your previous comment, we need to update all other fields as well. If
> > > we follow this then we probably need to have a check for catalog_xmin
> > > as well.
>
> IIUC, this would be a sync slot (so not usable until promotion) that could
> not be used anyway (invalidated), so I'll vote for drop / re-create then.
>

No, it can happen for non-sync slots as well.

> > > Now, related to this the other case which needs some handling is what
> > > if the remote_slot's restart_lsn is greater than local_slot's
> > > restart_lsn but it is a re-created slot with the same name. In that
> > > case, I think the other properties like 'two_phase', 'plugin' could be
> > > different. So, is simply copying those sufficient or do we need to do
> > > something else as well?
> > >
> >
>
> I'm not sure I follow here. If the remote slot is re-created then it would
> also be dropped / re-created locally, or am I missing something?
>

As our slot-syncing mechanism is asynchronous (from time to time we
check the slot information on the primary), isn't it possible that a
slot with the same name is dropped and recreated between the slot-sync
worker's checks?

-- 
With Regards,
Amit Kapila.




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-11 Thread torikoshia
On Wed, Jan 10, 2024 at 4:42 PM Masahiko Sawada  
wrote:



Yeah, I'm still thinking it's better to implement this feature
incrementally. Given we're closing in on feature freeze, I think it's
unlikely to get the whole feature into PG17 since there are still many
design discussions we need in addition to what Torikoshi-san pointed
out. Features like "ignore errors" or "logging errors" would have
higher possibilities. Even if we get only these parts of the whole
"error table" feature into PG17, it will make it much easier to
implement the "error tables" feature.

+1.
I'm also going to make a patch for "logging errors", since this
functionality is isolated from the v7 patch.



Seems promising. I'll look at the patch.

Thanks a lot!
Sorry for attaching v2 if you have already reviewed v1.

On 2024-01-11 12:13, jian he wrote:
On Tue, Jan 9, 2024 at 10:36 PM torikoshia  
wrote:


On Tue, Dec 19, 2023 at 10:14 AM Masahiko Sawada 


wrote:
> If we want only such a feature we need to implement it together (the
> patch could be split, though). But if some parts of the feature are
> useful for users as well, I'd recommend implementing it incrementally.
> That way, the patches can get small and it would be easy for reviewers
> and committers to review/commit them.

Jian, what do you think about this comment?

Looking back at the discussion so far, it seems that not everyone thinks
saving table information is the best idea[1] and some people think just
skipping error data is useful.[2]

Since there are issues to be considered in the design, such as
physical/logical replication treatment, putting error information into a
table is likely to take time for consensus building and development.

Wouldn't it be better to follow the advice below and develop the
functionality incrementally?

On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada
 wrote:
> So I'm thinking we may be able to implement this
> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would
> be to support logging destinations such as server logs and tables.


Attached a patch for this "first step", made with reference to the v7 patch
(which logged errors) and simpler than the latest one.
- This patch adds a new option SAVE_ERROR_TO, but currently only supports
'none', which means it just skips error data. It is expected to support
'log' and 'table' later.
- This patch skips just soft errors and doesn't handle other errors such
as missing column data.


Hi.
I made the following change based on your patch
(v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch)

* when specified SAVE_ERROR_TO, move the initialization of
ErrorSaveContext to the function BeginCopyFrom.
I think that's the right place to initialize struct CopyFromState 
field.

* I think with your patch, when N rows have malformed data, it
will initialize N ErrorSaveContexts.
In the struct CopyFromStateData, I changed it to ErrorSaveContext
*escontext.

So if an error occurred, you can just set the escontext accordingly.
* doc: mention "If this option is omitted, COPY
stops operation at the first error."
* Since we only support 'none' for now, 'none' means we don't want
ErrorSaveContext metadata,
 so we should set cstate->escontext->details_wanted to false.


BTW I have question and comment about v15 patch:

> +   {
> +   /*
> +   *
> +   * InputFunctionCall is more faster than
> InputFunctionCallSafe.
> +   *
> +   */

Have you measured this?
When I tested it in an older patch, there was no big difference[3].

Thanks for pointing it out; I was probably overthinking it.

  > -   SAVEPOINT SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P SECURITY SELECT
  > +   SAVEPOINT SAVE_ERROR SCALAR SCHEMA SCHEMAS SCROLL SEARCH SECOND_P SECURITY SELECT

There was a comment that we shouldn't add new keyword for this[4].


Thanks for pointing it out.


Thanks for reviewing!

Updated the patch, merging your suggestions except for the below points:


+   cstate->num_errors = 0;


Since cstate is already initialized in the lines below, this may be
redundant.


| /* Allocate workspace and zero all fields */
| cstate = (CopyFromStateData *) palloc0(sizeof(CopyFromStateData));



 +   Assert(!cstate->escontext->details_wanted);


I'm not sure this is necessary, considering we're going to add other 
options like 'table' and 'log', which need details_wanted soon.
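
For readers following along, here is a minimal sketch of the soft-error
pattern this work builds on (the in-core ErrorSaveContext /
InputFunctionCallSafe API added in PG16). in_function, field_string,
typioparam and typmod are stand-ins for the corresponding COPY state, and
the fragment assumes it runs inside the per-field input loop:

    ErrorSaveContext escontext = {T_ErrorSaveContext};
    Datum       value;

    if (!InputFunctionCallSafe(&in_function, field_string,
                               typioparam, typmod,
                               (Node *) &escontext, &value))
    {
        /*
         * Soft error: the input function recorded the problem in
         * escontext instead of throwing, so a SAVE_ERROR_TO 'none'
         * style implementation can simply skip this row.
         */
        continue;
    }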



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

From a3f14a0e7e9a7b5fb961ad6b6b7b163cf6534a26 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Fri, 12 Jan 2024 11:32:00 +0900
Subject: [PATCH v2] Add new COPY option SAVE_ERROR_TO

Currently, when the source data contains unexpected data regarding data type
or range, the entire COPY fails. However, in some cases such data can be
ignored and just copying the normal data is preferable.

This patch adds a new option 

Re: initdb --no-locale=C doesn't work as specified when the environment is not C

2024-01-11 Thread Kyotaro Horiguchi
At Wed, 10 Jan 2024 18:16:03 -0500, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > It seems somewhat intentional that only lc_messages references the
> > environment at boot time. On the other hand, previously, in the
> > absence of a specified locale, initdb would embed the environmental
> > value in the configuration file, as it seems to be documented. Given
> > that initdb is always used for cluster creation, it's unlikely that
> > systems depend on this boot-time default for their operation.
> 
> Yeah, after further reflection there doesn't seem to be a lot of value
> in leaving these entries commented-out, even in the cases where that's
> technically correct.  Let's just go back to the old behavior of always
> uncommenting them; that stood for years without complaints.  So I
> committed your latest patch as-is.

I'm glad you understand. Thank you for committing.

> > By the way, the lines around lc_* in the sample file seem to have
> > somewhat inconsistent indentations. Wouldn't it be preferable to fix
> > this? (The attached doesn't do that.)
> 
> They look all right if you assume the tab width is 8, which seems to
> be what is used elsewhere in the file.  I think there's been some
> prior discussion about whether to ban use of tabs at all in these
> sample files, so as to reduce confusion about how wide the tabs are.
> But I'm not touching that question today.

Ah, I see. Understood.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve the connection failure error messages

2024-01-11 Thread Peter Smith
Thanks for the patch! Here are a couple of review comments for it.

==
src/backend/commands/subscriptioncmds.c

1.
@@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate,
CreateSubscriptionStmt *stmt,
  if (!wrconn)
  ereport(ERROR,
  (errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, err)));

In practice, these commands give errors like:

test_sub=# create subscription sub1 connection 'dbname=bogus' publication pub1;
ERROR:  could not connect to the publisher: connection to server on
socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "bogus" does not
exist

and logs like:

2024-01-12 12:45:05.177 AEDT [13108] ERROR:  could not connect to the
publisher: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL:  database "bogus" does not exist
2024-01-12 12:45:05.177 AEDT [13108] STATEMENT:  create subscription
sub1 connection 'dbname=bogus' publication pub1;

Since the subscription name is already obvious from the statement that
caused the error I'm not sure it benefits much to add this to the
error message (but maybe it is useful if the error message was somehow
read in isolation from the statement).

Anyway, I felt at least it should include the word "subscription" for
better consistency with the other messages in this patch:

SUGGESTION
subscription \"%s\" could not connect to the publisher: %s

==

2.
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
  /* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, false, false, appname, &err);
  if (!wrconn)
  ereport(ERROR,
  (errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ errmsg("%s could not connect to the primary server: %s", appname, err)));

I think your new %s should be quoted according to the guidelines at [1].
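
Concretely, applying that guideline to the hunk above would give something
like:

    ereport(ERROR,
            (errcode(ERRCODE_CONNECTION_FAILURE),
             errmsg("\"%s\" could not connect to the primary server: %s",
                    appname, err)));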

==
src/test/regress/expected/subscription.out

3.
Apparently, there is no existing regression test case for the ALTER
"could not connect" message because if there was, it would have
failed. Maybe a test should be added?

==
[1] https://www.postgresql.org/docs/current/error-style-guide.html

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Error "initial slot snapshot too large" in create replication slot

2024-01-11 Thread Kyotaro Horiguchi
Thank you for the comments. This will help move the discussion
forward.

At Fri, 5 Jan 2024 15:17:11 -0500, Robert Haas  wrote in 
> On Thu, Mar 23, 2023 at 11:02 PM Kyotaro Horiguchi
>  wrote:
> > [ new patch ]
> 
> Well, I guess nobody is too excited about fixing this, because it's
> been another 10 months with no discussion. Andres doesn't even seem to
> think this is as much a bug as it is a limitation, for all that it's
> filed in the CF application under bug fixes. I kind of wonder if we
> should just close this entry in the CF, but I'll hold off on that for
> now.

Perhaps you are correct. Ultimately, this issue is largely
theoretical, and I don't believe anyone would be inconvenienced by
imposing this constraint.

>   * note: all ids in xip[] satisfy xmin <= xip[i] < xmax
>   */
> 
> I have to say that I don't like this at all. It's bad enough that we
> already use the xip/subxip arrays in two different ways depending on
> the situation. Increasing that to three different ways seems painful.
> How is anyone supposed to keep track of how the array is being used at
> which point in the code?

I understand. So, summarizing the current state briefly, I believe it
is as follows:

a. snapbuild lacks a means to differentiate between top and sub xids
  during snapshot building.

b. Abusing takenDuringRecovery could lead to potential issues, so it
  has been rejected.

c. Dynamic sizing of xip is likely to have a significant impact on
  performance (as mentioned in the comments of GetSnapshotData).

d. (new!) Adding a third storing method is not favored.

As a method to satisfy these prerequisites, I think one direction
could be to split takenDuringRecovery into flags indicating the
storage method and creation timing. I present this as a last-ditch
effort. It's a rough proposal, and further validation should be
necessary. If this direction is also not acceptable, I'll proceed with
removing the CF entry.
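
Roughly, the direction sketched above would leave SnapshotData with two
independent flags. The attached patch names the second one "mixed"; the
comments here are my reading of that split, not the patch's wording:

    bool        takenDuringRecovery;    /* when: snapshot taken in recovery */
    bool        mixed;                  /* how: top-level xids may spill from
                                         * xip[] into subxip[] */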

> If we are going to do that, I suspect it needs comment updates in more
> places than what the patch does currently. For instance:
> 
> + if (newxcnt < xiplen)
> + newxip[newxcnt++] = xid;
> + else
> + newsubxip[newsubxcnt++] = xid;
> 
> Just imagine coming across this code in 5 or 10 years and finding that
> it had no comment explaining anything. Yikes!

^^;

> Aside from the details of the patch, and perhaps more seriously, I'm
> not really clear that we have consensus on an approach. A few
> different proposals seem to have been floated, and it doesn't seem
> like anybody hates anybody else's idea completely, but it doesn't
> really seem like everyone agrees on what to do, either.

I don't fully agree with that. It's not so much that I dislike other
proposals, but rather that we haven't been able to find a definitive
solution that stands out.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From cd310711c1565b686416e8bf5a3a5105002e35c3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 12 Jan 2024 11:36:41 +0900
Subject: [PATCH v9] Lift initial snapshot limit for logical replication

The replication command CREATE_REPLICATION_SLOT often fails with
"snapshot too large" error when numerous subtransactions are
active. This issue stems from SnapBuildInitialSnapshot attempting to
generate a snapshot that includes all active XIDs, including subxids,
stored within the xip array. This patch mitigates the constraint by
utilizing the same storing method as takenDuringRecovery.
---
 src/backend/replication/logical/snapbuild.c | 22 +
 src/backend/storage/ipc/procarray.c |  1 +
 src/backend/utils/time/snapmgr.c| 15 +-
 src/include/utils/snapshot.h|  3 ++-
 4 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a0b7947d2f..528bdbb662 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -388,6 +388,7 @@ SnapBuildFreeSnapshot(Snapshot snap)
 	Assert(snap->curcid == FirstCommandId);
 	Assert(!snap->suboverflowed);
 	Assert(!snap->takenDuringRecovery);
+	Assert(!snap->mixed);
 	Assert(snap->regd_count == 0);
 
 	/* slightly more likely, so it's checked even without c-asserts */
@@ -464,6 +465,7 @@ SnapBuildSnapDecRefcount(Snapshot snap)
 	Assert(snap->curcid == FirstCommandId);
 	Assert(!snap->suboverflowed);
 	Assert(!snap->takenDuringRecovery);
+	Assert(!snap->mixed);
 
 	Assert(snap->regd_count == 0);
 
@@ -550,6 +552,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 
 	snapshot->suboverflowed = false;
 	snapshot->takenDuringRecovery = false;
+	snapshot->mixed = false;
 	snapshot->copied = false;
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
@@ -572,8 +575,8 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	Snapshot	snap;
 	TransactionId xid;
 	TransactionId safeXid;
-	TransactionId *newxip;
-	int			newxcnt = 0;
+	TransactionId *newsubxip;
+	int			

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-11 Thread Melanie Plageman
On Thu, Jan 11, 2024 at 4:49 PM Robert Haas  wrote:
>
> On Thu, Jan 11, 2024 at 2:30 PM Melanie Plageman
>  wrote:
>
> I'm still kind of hung up on the changes that 0001 makes to vacuumlazy.c.
>
> Say we didn't add the recordfreespace parameter to lazy_scan_prune().
> Couldn't the caller just compute it? lpdead_items goes out of scope,
> but there's also prstate.has_lpdead_items.
>
> Pressing that gripe a bit more, I just figured out that "Wait until
> lazy_vacuum_heap_rel() to save free space" gets turned into "If we
> will likely do index vacuuming, wait until lazy_vacuum_heap_rel() to
> save free space." That follows the good practice of phrasing the
> comment conditionally when the comment is outside the if-statement.
> But the if-statement says merely "if (recordfreespace)", which is not
> obviously related to "If we will likely do index vacuuming" but under
> the hood actually is. But it seems like an unpleasant amount of action
> at a distance. If that condition said if (vacrel->nindexes > 0 &&
> vacrel->do_index_vacuuming && prstate.has_lpdead_items)
> UnlockReleaseBuffer(); else { PageGetHeapFreeSpace;
> UnlockReleaseBuffer; RecordPageWithFreeSpace } it would be a lot more
> obvious that the code was doing what the comment says.

Yes, this is a good point. Seems like writing maintainable code is
really about never lying and then figuring out when hiding the truth
is also lying. Darn!

My only excuse is that lazy_scan_noprune() has a similarly confusing
output parameter, recordfreespace, both of which I removed in a later
patch in the set. I think we should change it as you say.

> That's a refactoring question, but I'm also wondering if there's a
> functional issue. Pre-patch, if the table has no indexes and some
> items are left LP_DEAD, then we mark them unused, forget them,
> RecordPageWithFreeSpace(), and if enough pages have been visited,
> FreeSpaceMapVacuumRange(). Post-patch, if the table has no indexes, no
> items will be left LP_DEAD, and *recordfreespace will be set to true,
> so we'll PageRecordFreeSpace(). But this seems to me to mean that
> post-patch we'll PageRecordFreeSpace() in more cases than we do now.
> Specifically, imagine a case where the current code wouldn't mark any
> items LP_DEAD and the page also had no pre-existing items that were
> LP_DEAD and the table also has no indexes. Currently, I believe we
> wouldn't PageRecordFreeSpace(), but with the patch, I think we would.
> Am I wrong?

Ah! I think you are right. Good catch. I could fix this with logic like this:

bool space_freed = vacrel->tuples_deleted > tuples_already_deleted;
if ((vacrel->nindexes == 0 && space_freed) ||
(vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming)))

I think I made this mistake when working on a different version of
this that combined the prune and no prune cases. I noticed that
lazy_scan_noprune() updates the FSM whenever there are no indexes. I
wonder why this is (and why we don't do it in the prune case).

> Note that the call to FreeSpaceMapVacuumRange() seems to try to guard
> against this problem by testing for vacrel->tuples_deleted >
> tuples_already_deleted. I haven't tried to verify whether that guard
> is correct, but the fact that FreeSpaceMapVacuumRange() has such a
> guard and RecordPageWithFreeSpace() does not have one makes me
> suspicious.

FreeSpaceMapVacuumRange() is not called for the no prune case, so I
think this is right.

> Another interesting effect of the patch is that instead of getting
> lazy_vacuum_heap_page()'s handling of the all-visible status of the
> page, we get the somewhat more complex handling done by
> lazy_scan_heap(). I haven't fully thought through the consequences of that,
> but if you have, I'd be interested to hear your analysis.

lazy_vacuum_heap_page() calls heap_page_is_all_visible() which does
another HeapTupleSatisfiesVacuum() call -- which is definitely going
to be more expensive than not doing that. In one case, in
lazy_scan_heap(), we might do a visibilitymap_get_status() (via
VM_ALL_FROZEN()) to avoid calling visibilitymap_set() if the page is
already marked all frozen in the VM. But that would pale in comparison
to another HeapTupleSatisfiesVacuum() (I think).

The VM update code in lazy_scan_heap() looks complicated but two out
of four cases are basically to deal with uncommon data corruption.

- Melanie




RE: cleanup patches for incremental backup

2024-01-11 Thread Shinoda, Noriyoshi (HPE Services Japan - FSIP)
Hi,
Thank you for developing the new tool. I have attached a patch that corrects 
the spelling of the --individual option in the SGML file.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Robert Haas  
Sent: Friday, January 12, 2024 3:13 AM
To: PostgreSQL Hackers ; Jakub Wartak 

Subject: Re: cleanup patches for incremental backup

On Tue, Jan 9, 2024 at 1:18 PM Robert Haas  wrote:
> Here's v2. I plan to commit the rest of this fairly soon if there are 
> no comments.

Done, with a minor adjustment to 0003.

--
Robert Haas
EDB: http://www.enterprisedb.com 




pg_walsummary_sgml_v1.diff
Description: pg_walsummary_sgml_v1.diff


Re: Fix log_line_prefix to display the transaction id (%x) for statements not in a transaction block

2024-01-11 Thread jian he
Hi
Another big difference compared to HEAD:

select name, setting
from pg_settings
where name in
('auto_explain.log_timing','auto_explain.log_analyze',
 'auto_explain.log_min_duration','log_statement','log_line_prefix');
 name  |  setting
---+
 auto_explain.log_analyze  | on
 auto_explain.log_min_duration | 0
 auto_explain.log_timing   | on
 log_line_prefix   | %m [%p] %q%u@%d/%a XID:%x
 log_statement | all

simple query: explain(analyze, costs off) select 1 from pg_sleep(10);

with patch:
src3=# explain(analyze, costs off) select 1 from pg_sleep(10);
2024-01-12 08:43:14.750 CST [5739] jian@src3/psql XID:0 LOG:
duration: 10010.167 ms  plan:
Query Text: explain(analyze, costs off) select 1 from pg_sleep(10);
Function Scan on pg_sleep  (cost=0.00..0.01 rows=1 width=4)
(actual time=10010.155..10010.159 rows=1 loops=1)
2024-01-12 08:43:14.750 CST [5739] jian@src3/psql XID:0 LOG:
statement: explain(analyze, costs off) select 1 from pg_sleep(10);
 QUERY PLAN
-
 Function Scan on pg_sleep (actual time=10010.155..10010.159 rows=1 loops=1)
 Planning Time: 0.115 ms
 Execution Time: 10010.227 ms
(3 rows)

without patch:

src4=#
src4=# explain(analyze, costs off) select 1 from pg_sleep(10);
2024-01-12 08:43:13.462 CST [5869] jian@src4/psql XID:0 LOG:
statement: explain(analyze, costs off) select 1 from pg_sleep(10);
2024-01-12 08:43:23.473 CST [5869] jian@src4/psql XID:0 LOG:
duration: 10010.133 ms  plan:
Query Text: explain(analyze, costs off) select 1 from pg_sleep(10);
Function Scan on pg_sleep  (cost=0.00..0.01 rows=1 width=4)
(actual time=10010.126..10010.128 rows=1 loops=1)
 QUERY PLAN
-
 Function Scan on pg_sleep (actual time=10010.126..10010.128 rows=1 loops=1)
 Planning Time: 0.031 ms
 Execution Time: 10010.172 ms
(3 rows)




Re: Confine vacuum skip logic to lazy_scan_skip

2024-01-11 Thread Melanie Plageman
On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz  wrote:
>
> On Fri, 5 Jan 2024 at 02:25, Jim Nasby  wrote:
> >
> > On 1/4/24 2:23 PM, Andres Freund wrote:
> >
> > On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
> >
> > Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var 
> > rel_pages
> > Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
> >  nskippable_blocks
> >
> > I think these may lead to worse code - the compiler has to reload
> > vacrel->rel_pages/next_unskippable_block for every loop iteration, because 
> > it
> > can't guarantee that they're not changed within one of the external 
> > functions
> > called in the loop body.
> >
> > Admittedly I'm not up to speed on recent vacuum changes, but I have to 
> > wonder if the concept of skipping should go away in the context of vector 
> > IO? Instead of thinking about "we can skip this range of blocks", why not 
> > maintain a list of "here's the next X number of blocks that we need to 
> > vacuum"?
>
> Sorry if I misunderstood. AFAIU, with the help of the vectored IO;
> "the next X number of blocks that need to be vacuumed" will be
> prefetched by calculating the unskippable blocks ( using the
> lazy_scan_skip() function ) and the X will be determined by Postgres
> itself. Do you have something different in your mind?

I think you are both right. As we gain more control of readahead from
within Postgres, we will likely want to revisit this heuristic as it
may not serve us anymore. But the streaming read interface/vectored
I/O is also not a drop-in replacement for it. To change anything and
ensure there is no regression, we will probably have to do
cross-platform benchmarking, though.

That being said, I would absolutely love to get rid of the skippable
ranges because I find them very error-prone and confusing. Hopefully
now that the skipping logic is isolated to a single function, it will
be easier not to trip over it when working on lazy_scan_heap().

- Melanie




Re: Confine vacuum skip logic to lazy_scan_skip

2024-01-11 Thread Melanie Plageman
v3 attached

On Thu, Jan 4, 2024 at 3:23 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
> > Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var 
> > rel_pages
> > Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
> >  nskippable_blocks
>
> I think these may lead to worse code - the compiler has to reload
> vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
> can't guarantee that they're not changed within one of the external functions
> called in the loop body.

I buy that for 0001 but 0002 is still using local variables.
nskippable_blocks was just another variable to keep track of even
though we could already get that info from local variables
next_unskippable_block and next_block.

In light of this comment, I've refactored 0003/0004 (0002 and 0003 in
this version [v3]) to use local variables in the loop as well. I had
started using the members of the VacSkipState which I introduced.

> > Subject: [PATCH v2 3/6] Add lazy_scan_skip unskippable state
> >
> > Future commits will remove all skipping logic from lazy_scan_heap() and
> > confine it to lazy_scan_skip(). To make those commits more clear, first
> > introduce the struct, VacSkipState, which will maintain the variables
> > needed to skip ranges less than SKIP_PAGES_THRESHOLD.
>
> Why not add this to LVRelState, possibly as a struct embedded within it?

Done in attached.

> > From 335faad5948b2bec3b83c2db809bb9161d373dcb Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Sat, 30 Dec 2023 16:59:27 -0500
> > Subject: [PATCH v2 4/6] Confine vacuum skip logic to lazy_scan_skip
>
> > By always calling lazy_scan_skip() -- instead of only when we have reached 
> > the
> > next unskippable block, we no longer need the skipping_current_range 
> > variable.
> > lazy_scan_heap() no longer needs to manage the skipped range -- checking if 
> > we
> > reached the end in order to then call lazy_scan_skip(). And lazy_scan_skip()
> > can derive the visibility status of a block from whether or not we are in a
> > skippable range -- that is, whether or not the next_block is equal to the 
> > next
> > unskippable block.
>
> I wonder if it should be renamed as part of this - the name is somewhat
> confusing now (and perhaps before)? lazy_scan_get_next_block() or such?

Why stop there! I've removed lazy and called it
heap_vac_scan_get_next_block() -- a little long, but...

> > + while (true)
> >   {
> >   Buffer  buf;
> >   Pagepage;
> > - boolall_visible_according_to_vm;
> >   LVPagePruneState prunestate;
> >
> > - if (blkno == vacskip.next_unskippable_block)
> > - {
> > - /*
> > -  * Can't skip this page safely.  Must scan the page.  
> > But
> > -  * determine the next skippable range after the page 
> > first.
> > -  */
> > - all_visible_according_to_vm = 
> > vacskip.next_unskippable_allvis;
> > - lazy_scan_skip(vacrel, &vacskip, blkno + 1);
> > -
> > - Assert(vacskip.next_unskippable_block >= blkno + 1);
> > - }
> > - else
> > - {
> > - /* Last page always scanned (may need to set 
> > nonempty_pages) */
> > - Assert(blkno < rel_pages - 1);
> > -
> > - if (vacskip.skipping_current_range)
> > - continue;
> > + blkno = lazy_scan_skip(vacrel, &vacskip, blkno + 1,
> > +        &all_visible_according_to_vm);
> >
> > - /* Current range is too small to skip -- just scan 
> > the page */
> > - all_visible_according_to_vm = true;
> > - }
> > + if (blkno == InvalidBlockNumber)
> > + break;
> >
> >   vacrel->scanned_pages++;
> >
>
> I don't like that we still do determination about the next block outside of
> lazy_scan_skip() and have duplicated exit conditions between lazy_scan_skip()
> and lazy_scan_heap().
>
> I'd probably change the interface to something like
>
> while (lazy_scan_get_next_block(vacrel, &blkno))
> {
> ...
> }

I've done this. I do now find the parameter names a bit confusing.
There is next_block (which is the "next block in line" and is an input
parameter) and blkno, which is an output parameter with the next block
that should actually be processed. Maybe it's okay?
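
For reference, the loop shape after this refactoring is roughly as follows
(the signature is reconstructed from the discussion, so details may differ
from the attached patch):

    static bool heap_vac_scan_get_next_block(LVRelState *vacrel,
                                             BlockNumber next_block,
                                             BlockNumber *blkno,
                                             bool *all_visible_according_to_vm);

    while (heap_vac_scan_get_next_block(vacrel, blkno + 1, &blkno,
                                        &all_visible_according_to_vm))
    {
        /* process block blkno */
    }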

> > From b6603e35147c4bbe3337280222e6243524b0110e Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Sun, 31 Dec 2023 09:47:18 -0500
> > Subject: [PATCH v2 5/6] VacSkipState saves reference to LVRelState
> >
> > The streaming read interface can only give pgsr_next callbacks access to
> > two pieces of private data. As such, move a reference to the LVRelState
> > into 

Re: A recent message added to pg_upgade

2024-01-11 Thread Michael Paquier
On Thu, Jan 11, 2024 at 10:01:16AM -0500, Tom Lane wrote:
> Alvaro Herrera  writes:
>> "The logical replication launcher is disabled during binary upgrades, to
>> avoid logical replication workers running on the source cluster.  That
>> would cause replication origins to move forward after having been copied
>> to the target cluster, potentially creating conflicts with the copied
>> data files."
> 
> "avoid logical replication workers running" still seems like shaky
> grammar.  Perhaps s/avoid/avoid having/, or write "to prevent logical
> replication workers from running ...".

After sleeping on it, your last suggestion sounds better to me, so
I've incorporated that with Alvaro's wording (also cleaner than what I
have posted), and applied the patch on HEAD.

> Also perhaps s/would/could/.

Yep.
--
Michael


signature.asc
Description: PGP signature


Re: Built-in CTYPE provider

2024-01-11 Thread Jeff Davis
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote:
> I think we missed something in psql, pretty sure I applied all the
> patches but I see this error:
> 
> =# \l
> ERROR:  42703: column d.datlocale does not exist
> LINE 8:   d.datlocale as "Locale",
>   ^
> HINT:  Perhaps you meant to reference the column "d.daticulocale".
> LOCATION:  errorMissingColumn, parse_relation.c:3720

I think you're connecting to a patched server with an older version of
psql, so it doesn't know the catalog column was renamed.

pg_dump and pg_upgrade don't have that problem because they throw an
error when connecting to a newer server.

But for psql, it's perfectly reasonable to connect to a newer server.
Have we dropped or renamed catalog columns used by psql backslash
commands before, and if so, how do we handle that?

I can just not rename that column, but that's a bit frustrating because
it means I'd need to add a new column to pg_database, which seems
redundant.


Regards,
Jeff Davis





Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-11 Thread Michael Paquier
On Thu, Jan 11, 2024 at 02:47:27PM +0530, Ashutosh Bapat wrote:
> On Thu, Jan 11, 2024 at 9:42 AM Michael Paquier  wrote:
>> I don't really aim for complicated here, just useful.
> 
> It isn't complicated. Such simple error check improve user's
> confidence on the feature and better be part of the 1st cut.

I'm really not sure about that, because it does not impact the scope
of the facility even with all the use cases I've seen where injection
points could be used.  It could always be added later if there's a
strong push for it.  For testing, I'm biased about attempting to load
callbacks in the process attaching them.

>>> With v6 I could run the test when built with enable_injection_point
>>> false. I just ran make check in that folder. Didn't test meson build.
>>
>> The CI has been failing because 041_invalid_checkpoint_after_promote
>> was loading Time::HiRes::nanosleep and Windows does not support it.
> 
> Some miscommunication here. The SQL test under injection_point module
> can be run in a build without injection_point and it fails. I think
> it's better to have an alternate output for the same or prohibit the
> test running itself.

The same problem exists if you try to run the SSL tests in
src/test/ssl/ without support build for them.  Protections at the
upper levels are good enough for the CI and the buildfarm, while
making the overall maintenance cheaper, so I'm happy with just these.

It also seems like you've missed this message, where this has been
mentioned (spoiler: first version of the patch used an alternate
output):
https://www.postgresql.org/message-id/zunuzpimkzcov...@paquier.xyz 
--
Michael


signature.asc
Description: PGP signature


Re: Streaming I/O, vectored I/O (WIP)

2024-01-11 Thread Thomas Munro
On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas  wrote:
> Ok. It feels surprising to have three steps. I understand that you need
> two steps, one to start the I/O and another to wait for them to finish,
> but why do you need separate Prepare and Start steps? What can you do in
> between them? (You explained that. I'm just saying that that's my
> initial reaction when seeing that API. It is surprising.)

Actually I don't think I explained that very well.  First, some more
detail about how a two-step version would work:

* we only want to start one I/O in an StartReadBuffers() call, because
otherwise it is hard/impossible for the caller to cap concurrency I/O
* therefore StartReadBuffers() can handle sequences matching /^H*I*H*/
("H" = hit, "I" = miss, I/O) in one call
* in the asynchronous version, "I" in that pattern means we got
BM_IO_IN_PROGRESS
* in the synchronous version, "I" means that it's not valid, not
BM_IO_IN_PROGRESS, but we won't actually try to get BM_IO_IN_PROGRESS
until the later Complete/Wait call (and then the answer might change,
but we'll just deal with that by looping in the synchronous version)
* streaming_read.c has to deal with buffering up work that
StartReadBuffers() didn't accept
* that's actually quite easy, you just use the rest to create a new
range in the next slot

Previously I thought the requirement to deal with buffering future
stuff that StartReadBuffers() couldn't accept yet was a pain, and life
became so much simpler once I deleted all that and exposed
PrepareReadBuffer() to the calling code.  Perhaps I just hadn't done a
good enough job of that.

The other disadvantage you reminded me of was the duplicate buffer
lookup in certain unlucky patterns, which I had forgotten about in my
previous email.  But I guess it's not fatal to the idea and there is a
potential partial mitigation.  (See below).

A third thing was the requirement for communication between
StartReadBuffers() and CompleteReadBuffers() which I originally had an
"opaque" object that the caller has to keep around that held private
state.  It seemed nice to go back to talking just about buffer
numbers, but that's not really an argument for anything...

OK, I'm going to try the two-step version (again) with interfaces
along the lines you sketched out...  more soon.

> I see. When you're about to zero the page, there's not much point in
> splitting the operation into Prepare/Start/Complete stages anyway.
> You're not actually doing any I/O. Perhaps it's best to have a separate
> "Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)"
> function that does the same as
> ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

That makes sense, but... hmm, sometimes just allocating a page
generates I/O if it has to evict a dirty buffer.  Nothing in this code
does anything fancy about that, but imagine some hypothetical future
thing that manages to do that asynchronously -- then we might want to
take advantage of the ability to stream even a zeroed page, ie doing
something ahead of time?  Just a thought for another day, and perhaps
that is just an argument for including it in the streaming read API,
but it doesn't mean that the bufmgr.c API can't be as you say.

> One weakness is that if StartReadBufferRange() finds that the range is
> "chopped up", it needs to return and throw away the work it had to do to
> look up the next buffer. So in the extreme case that every other block
> in the range is in the buffer cache, each call would look up two buffers
> in the buffer cache, startBlk and startBlk + 1, but only return one
> buffer to the caller.

Yeah, right.  This was one of the observations that influenced my
PrepareReadBuffer() three-step thing that I'd forgotten.  To spell
that out with an example, suppose the buffer pool contains every odd
numbered block.  Successive StartReadBuffers() calls would process
"HMHm", "MHm", "MHm"... where "m" represents a miss that we can't do
anything with for a block we'll look up in the buffer pool again in
the next call.  With the PrepareReadBuffer() design, that miss just
starts a new range and we don't have to look it up again.  Hmm, I
suppose that could be mitigated somewhat with ReadRecentBuffer() if we
can find somewhere decent to store it.

BTW it was while thinking about and testing cases like that that I
found Palak Chaturvedi's https://commitfest.postgresql.org/46/4426/
extremely useful.  It can kick out every second page or any other
range-chopping scenario you can express in a WHERE clause.  I would
quite like to get that tool into the tree...




Re: speed up a logical replica setup

2024-01-11 Thread Euler Taveira
On Thu, Jan 11, 2024, at 9:18 AM, Hayato Kuroda (Fujitsu) wrote:
> 
> I have been concerned that the patch has not been tested by cfbot due to the
> application error. Also, some comments were raised. Therefore, I created a 
> patch
> to move forward.
> I also tried to address some comments which is not so claimed by others.
> They were included in 0003 patch.

[I removed the following part in the previous email and couldn't reply to it...]

> * 0001 patch
> It is almost the same as v3-0001, which was posted by Euler.
> An unnecessary change for Mkvcbuild.pm (this file was removed) was ignored.

v5 removes the MSVC support.

> * 0002 patch
> This contains small fixes to keep complier quiet.

I applied it. Although, I used a different approach for format specifier.

> * 0003 patch
> This addresses comments posted to -hackers. For now, this does not contain a 
> doc.
> Will add if everyone agrees these idea.

I didn't review all items but ...

> 1.
> An option --port was added to control the port number for physical standby.
> Users can specify a port number via the option, or an environment variable 
> PGSUBPORT.
> If not specified, a fixed value (50111) would be used.

My first reaction as a new user would be: why do I need to specify a port if my
--subscriber-conninfo already contains a port? Ugh. I'm wondering if we can do
it behind the scenes. Try a range of ports.

> 2.
> A FATAL error would be raised if --subscriber-conninfo specifies non-local 
> server.

Extra protection is always good. However, let's make sure this code path is
really useful. I'll think a bit about it.

> 3. 
> Options -o/-O were added to specify options for publications/subscriptions.

Flexibility is cool. However, I think the cost benefit of it is not good. You
have to parse the options to catch preliminary errors. Things like publish only
delete and subscription options that conflict with the embedded ones are
additional sources of failure.

> 4. 
> Made standby to save their output to log file.

It was already done in v5. I did in a different way.

> 5. 
> Unnecessary Assert in drop_replication_slot() was removed.

Instead, I fixed the code and keep the assert.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Recovering from detoast-related catcache invalidations

2024-01-11 Thread Tom Lane
Xiaoran Wang  writes:
>>> The detection of "get an invalidation" could be refined: what I did
>>> here is to check for any advance of SharedInvalidMessageCounter,
>>> which clearly will have a significant number of false positives.

> I have reviewed your patch, and it looks good. But instead of checking for
> any advance of SharedInvalidMessageCounter (if the invalidation message is
> not related to the current tuple, that is a little expensive), I have another
> idea: we can recheck the visibility of the tuple with the CatalogSnapshot
> (the CatalogSnapshot must be refreshed if there are any SharedInvalidMessages);
> if it is not visible, we re-fetch the tuple; otherwise, we can continue to use
> it, as it is not outdated.

Maybe, but that undocumented hack in SetHintBits seems completely
unacceptable.  Isn't there a cleaner way to make this check?

Also, I'm pretty dubious that GetNonHistoricCatalogSnapshot rather
than GetCatalogSnapshot is the right thing, because the catcaches
use the latter.

regards, tom lane




Re: speed up a logical replica setup

2024-01-11 Thread Euler Taveira
On Thu, Jan 11, 2024, at 9:18 AM, Hayato Kuroda (Fujitsu) wrote:
> I have been concerned that the patch has not been tested by cfbot due to the
> application error. Also, some comments were raised. Therefore, I created a 
> patch
> to move forward.

Let me send an updated patch to hopefully keep the CF bot happy. The following
items are included in this patch:

* drop physical replication slot if standby is using one [1].
* cleanup small changes (copyright, .gitignore) [2][3]
* fix getopt_long() options [2]
* fix format specifier for some messages
* move doc to Server Application section [4]
* fix assert failure
* ignore duplicate database names [2]
* store subscriber server log into a separate file
* remove MSVC support

I'm still addressing other reviews and I'll post another version that includes
it soon.

[1] 
https://www.postgresql.org/message-id/e02a2c17-22e5-4ba6-b788-de696ab74f1e%40app.fastmail.com
[2] 
https://www.postgresql.org/message-id/CALDaNm1joke42n68LdegN5wCpaeoOMex2EHcdZrVZnGD3UhfNQ%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/TY3PR01MB98895BA6C1D72CB8582CACC4F5682%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[4] 
https://www.postgresql.org/message-id/TY3PR01MB988978C7362A101927070D29F56A2%40TY3PR01MB9889.jpnprd01.prod.outlook.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From 2341ef3dc26d48b51b13ad01ac38c79d24b1e461 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 5 Jun 2023 14:39:40 -0400
Subject: [PATCH v5] Creates a new logical replica from a standby server

A new tool called pg_subscriber can convert a physical replica into a
logical replica. It runs on the target server and should be able to
connect to the source server (publisher) and the target server
(subscriber).

The conversion requires a few steps. Check if the target data directory
has the same system identifier as the source data directory. Stop the
target server if it is running as a standby server. Create one
replication slot per specified database on the source server. One
additional replication slot is created at the end to get the consistent
LSN (This consistent LSN will be used as (a) a stopping point for the
recovery process and (b) a starting point for the subscriptions). Write
recovery parameters into the target data directory and start the target
server (Wait until the target server is promoted). Create one
publication (FOR ALL TABLES) per specified database on the source
server. Create one subscription per specified database on the target
server (Use replication slot and publication created in a previous step.
Don't enable the subscriptions yet). Set the replication progress to
the consistent LSN obtained in a previous step. Enable the
subscription for each specified database on the target server. Remove
the additional replication slot that was used to get the consistent LSN.
Stop the target server. Change the system identifier from the target
server.

Depending on your workload and database size, creating a logical replica
might not be an option due to resource constraints (the WAL backlog must
remain available until all table data is synchronized). The initial data copy
and the replication progress tend to be faster on a physical replica.
The purpose of this tool is to speed up a logical replica setup.
---
 doc/src/sgml/ref/allfiles.sgml|1 +
 doc/src/sgml/ref/pg_subscriber.sgml   |  284 +++
 doc/src/sgml/reference.sgml   |1 +
 src/bin/pg_basebackup/.gitignore  |1 +
 src/bin/pg_basebackup/Makefile|8 +-
 src/bin/pg_basebackup/meson.build |   19 +
 src/bin/pg_basebackup/pg_subscriber.c | 1657 +
 src/bin/pg_basebackup/t/040_pg_subscriber.pl  |   44 +
 .../t/041_pg_subscriber_standby.pl|  139 ++
 9 files changed, 2153 insertions(+), 1 deletion(-)
 create mode 100644 doc/src/sgml/ref/pg_subscriber.sgml
 create mode 100644 src/bin/pg_basebackup/pg_subscriber.c
 create mode 100644 src/bin/pg_basebackup/t/040_pg_subscriber.pl
 create mode 100644 src/bin/pg_basebackup/t/041_pg_subscriber_standby.pl

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..3862c976d7 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -214,6 +214,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/pg_subscriber.sgml b/doc/src/sgml/ref/pg_subscriber.sgml
new file mode 100644
index 00..553185c35f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_subscriber.sgml
@@ -0,0 +1,284 @@
+
+
+
+ 
+  pg_subscriber
+ 
+
+ 
+  pg_subscriber
+  1
+  Application
+ 
+
+ 
+  pg_subscriber
+  create a new logical replica from a standby server
+ 
+
+ 
+  
+   pg_subscriber
+   option
+  
+ 
+
+ 
+  Description
+  
+   pg_subscriber takes the publisher and subscriber
+   connection strings, a cluster directory from a standby server and a list of
+   database names and it 

Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-11 Thread Michael Paquier
On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> Bertrand, I've relaunched tests in the same slowed down VM with both
> patches applied (but with no other modifications) and got a failure
> with pg_class, similar to what we had seen before:
> 9   #   Failed test 'activeslot slot invalidation is logged with vacuum 
> on pg_class'
> 9   #   at t/035_standby_logical_decoding.pl line 230.
> 
> Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
= minimal, because these lead to unpredictable records inserted,
impacting the reliability of the tests.  We cannot do that here,
obviously.  That may be a long shot, but could it be possible to tweak
the test with a retry logic, retrying things if such a standby
snapshot is found because we know that the invalidation is not going
to work anyway?
--
Michael


signature.asc
Description: PGP signature


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-11 Thread Robert Haas
On Thu, Jan 11, 2024 at 2:30 PM Melanie Plageman
 wrote:
> Attached v7 is rebased over the commit you just made to remove
> LVPagePruneState->hastup.

Apologies for making you rebase but I was tired of thinking about that patch.

I'm still kind of hung up on the changes that 0001 makes to vacuumlazy.c.

Say we didn't add the recordfreespace parameter to lazy_scan_prune().
Couldn't the caller just compute it? lpdead_items goes out of scope,
but there's also prstate.has_lpdead_items.

Pressing that gripe a bit more, I just figured out that "Wait until
lazy_vacuum_heap_rel() to save free space" gets turned into "If we
will likely do index vacuuming, wait until lazy_vacuum_heap_rel() to
save free space." That follows the good practice of phrasing the
comment conditionally when the comment is outside the if-statement.
But the if-statement says merely "if (recordfreespace)", which is not
obviously related to "If we will likely do index vacuuming" but under
the hood actually is. But it seems like an unpleasant amount of action
at a distance. If that condition said if (vacrel->nindexes > 0 &&
vacrel->do_index_vacuuming && prstate.has_lpdead_items)
UnlockReleaseBuffer(); else { PageGetHeapFreeSpace;
UnlockReleaseBuffer; RecordPageWithFreeSpace } it would be a lot more
obvious that the code was doing what the comment says.
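
Spelled out, that suggested restructuring would look something like this
(a sketch only; buf, page and blkno are assumed to be the surrounding
vacuumlazy.c locals):

    if (vacrel->nindexes > 0 && vacrel->do_index_vacuuming &&
        prstate.has_lpdead_items)
    {
        /* wait until lazy_vacuum_heap_rel() to save free space */
        UnlockReleaseBuffer(buf);
    }
    else
    {
        Size        freespace = PageGetHeapFreeSpace(page);

        UnlockReleaseBuffer(buf);
        RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
    }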

That's a refactoring question, but I'm also wondering if there's a
functional issue. Pre-patch, if the table has no indexes and some
items are left LP_DEAD, then we mark them unused, forget them,
RecordPageWithFreeSpace(), and if enough pages have been visited,
FreeSpaceMapVacuumRange(). Post-patch, if the table has no indexes, no
items will be left LP_DEAD, and *recordfreespace will be set to true,
so we'll PageRecordFreeSpace(). But this seems to me to mean that
post-patch we'll PageRecordFreeSpace() in more cases than we do now.
Specifically, imagine a case where the current code wouldn't mark any
items LP_DEAD and the page also had no pre-existing items that were
LP_DEAD and the table also has no indexes. Currently, I believe we
wouldn't PageRecordFreeSpace(), but with the patch, I think we would.
Am I wrong?

Note that the call to FreeSpaceMapVacuumRange() seems to try to guard
against this problem by testing for vacrel->tuples_deleted >
tuples_already_deleted. I haven't tried to verify whether that guard
is correct, but the fact that FreeSpaceMapVacuumRange() has such a
guard and RecordPageWithFreeSpace() does not have one makes me
suspicious.

Another interesting effect of the patch is that instead of getting
lazy_vacuum_heap_page()'s handling of the all-visible status of the
page, we get the somewhat more complex handling done by
lazy_scan_heap(). I haven't fully thought through the consequences of that,
but if you have, I'd be interested to hear your analysis.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-11 Thread Melanie Plageman
Attached v7 is rebased over the commit you just made to remove
LVPagePruneState->hastup.

On Thu, Jan 11, 2024 at 11:54 AM Robert Haas  wrote:
>
> On Wed, Jan 10, 2024 at 5:28 PM Melanie Plageman
>  wrote:
> > Yes, the options I can think of are:
> >
> > 1) rename the parameter to "immed_reap" or similar and make very clear
> > in heap_page_prune_opt() that you are not to pass true.
> > 2) make immediate reaping work for on-access pruning. I would need a
> > low cost way to find out if there are any indexes on the table. Do you
> > think this is possible? Should we just rename the parameter for now
> > and think about that later?
>
> I don't think we can implement (2), because:
>
> robert.haas=# create table test (a int);
> CREATE TABLE
> robert.haas=# begin;
> BEGIN
> robert.haas=*# select * from test;
>  a
> ---
> (0 rows)
>
> 
>
> robert.haas=# create index on test (a);
> CREATE INDEX
>
> In English, the lock we hold during regular table access isn't strong
> enough to foreclose concurrent addition of an index.

Ah, I see.

> So renaming the parameter seems like the way to go. How about 
> "mark_unused_now"?

I've done this in attached v7.

> > > -   if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> > > +   if (!ItemIdIsUsed(itemid))
> > > continue;
> > >
> > > There is a bit of overhead here for the !no_indexes case. I assume it
> > > doesn't matter.
> >
> > Right. Should be okay. Alternatively, and I'm not saying this is a
> > good idea, but we could throw this into the loop in heap_page_prune()
> > which calls heap_prune_chain():
> >
> > +   if (ItemIdIsDead(itemid) && prstate.no_indexes)
> > +   {
> > +   heap_prune_record_unused(, offnum);
> > +   continue;
> > +   }
> >
> > I think that is correct?
>
> Wouldn't that be wrong in any case where heap_prune_chain() calls
> heap_prune_record_dead()?

Hmm. But previously, we skipped heap_prune_chain() for already dead
line pointers. In my patch, I rely on the test in the loop in
heap_prune_chain() to set those LP_UNUSED if mark_unused_now is true
(previously the below code just broke out of the loop).

/*
 * Likewise, a dead line pointer can't be part of the chain. (We
 * already eliminated the case of dead root tuple outside this
 * function.)
 */
if (ItemIdIsDead(lp))
{
/*
 * If the caller set mark_unused_now true, we can set dead line
 * pointers LP_UNUSED now. We don't increment ndeleted here since
 * the LP was already marked dead.
 */
if (unlikely(prstate->mark_unused_now))
heap_prune_record_unused(prstate, offnum);

break;
}

so wouldn't what I suggested simply set the item LP_UNUSED before
invoking heap_prune_chain(), thereby achieving the same result without
invoking heap_prune_chain() for already dead line pointers? I could be
missing something. That heap_prune_chain() logic sometimes gets me
turned around.

> > Yes, so I preferred it in the body of heap_prune_chain() (the caller).
> > Andres didn't like the extra level of indentation. I could wrap
> > heap_record_dead() and heap_record_unused(), but I couldn't really
> > think of a good wrapper name. heap_record_dead_or_unused() seems long
> > and literal. But, it might be the best I can do. I can't think of a
> > general word which encompasses both the transition to death and to
> > disposal.
>
> I'm sure we could solve the wordsmithing problem but I think it's
> clearer if the heap_prune_record_* functions don't get clever.

I've gone with my heap_record_dead_or_unused() suggestion.

- Melanie
From dedbd202dfd800c3ac8b06b58c5d16a036b4a3e2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 5 Jan 2024 11:12:58 -0500
Subject: [PATCH v7 2/4] Move VM update code to lazy_scan_prune()

The visibility map is updated after lazy_scan_prune() returns in
lazy_scan_heap(). Most of the output parameters of lazy_scan_prune() are
used to update the VM in lazy_scan_heap(). Moving that code into
lazy_scan_prune() simplifies lazy_scan_heap() and requires less
communication between the two. This is a pure lift and shift. No VM
update logic is changed. All of the members of the prunestate are now
obsolete. It will be removed in a future commit.
---
 src/backend/access/heap/vacuumlazy.c | 229 ++-
 1 file changed, 116 insertions(+), 113 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 85c0919dc71..82495ef8082 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -249,8 +249,8 @@ static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
    bool sharelock, Buffer vmbuffer);
 static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 			BlockNumber blkno, Page page,
-			LVPagePruneState *prunestate,
-			bool 

buildfarm failures in pg_walsummary checks

2024-01-11 Thread Robert Haas
I tried to post this elsewhere but it got moderated, so retrying:

On Thu, Jan 11, 2024 at 1:49 PM Robert Haas  wrote:
> On Thu, Jan 11, 2024 at 12:56 PM Robert Haas  wrote:
> > Add new pg_walsummary tool.
>
> culicidae is unhappy with this, but I don't yet understand why. The output is:
>
> #   Failed test 'stdout shows block 0 modified'
> #   at t/002_blocks.pl line 85.
> #   'TS 1663, DB 5, REL 16384, FORK main: blocks 0..1'
> # doesn't match '(?^m:FORK main: block 0$)'
>
> The test is expecting block 0 to be modified, but block 1 to be
> unmodified, but here, both blocks are modified. That would maybe make
> sense if this machine had a really big block size, but that doesn't
> seem to be the case. Or, maybe the test has erred in failing to
> disable autovacuum -- though it does take other precautions to try to
> prevent that from interfering.

It's not autovacuum, the test is flaky. I ran it in a loop locally
until it failed, and then ran pg_waldump, finding this:

rmgr: Heap    len (rec/tot): 73/8249, tx: 738, lsn:
0/0158AEE8, prev 0/01588EB8, desc: UPDATE old_xmax: 738, old_off: 2,
old_infobits: [], flags: 0x03, new_xmax: 0, new_off: 76, blkref #0:
rel 1663/5/16384 blk 1 FPW, blkref #1: rel 1663/5/16384 blk 0

I'm slightly puzzled, here. I would have expected that if I inserted a
bunch of records into the table and then updated one of them, the new
record would have gone into a new page at the end of the table, and
also that even if it didn't extend the relation, it would go into the
same page every time the test was run. But here the behavior seems to
be nondeterministic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Built-in CTYPE provider

2024-01-11 Thread Jeff Davis
On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote:
> $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
>  
>   The database cluster will be initialized with this locale
> configuration:
>     default collation provider:  builtin
>     default collation locale:    C.UTF-8
>     LC_COLLATE:  C.UTF-8
>     LC_CTYPE:    C.UTF-8
>     LC_MESSAGES: C.UTF-8
>     LC_MONETARY: C.UTF-8
>     LC_NUMERIC:  C.UTF-8
>     LC_TIME: C.UTF-8
>   The default database encoding has accordingly been set to "UTF8".
>   The default text search configuration will be set to "english".
> 
> This is from an environment where LANG=fr_FR.UTF-8
> 
> I would expect all LC_* variables to be fr_FR.UTF-8, and the default
> text search configuration to be "french".

You can get the behavior you want by doing:

  initdb --builtin-locale=C.UTF-8 --locale-provider=builtin \
-D/tmp/pgdata

where "--builtin-locale" is analogous to "--icu-locale".

It looks like I forgot to document the new initdb option, which seems
to be the source of the confusion. Sorry, I'll fix that in the next
patch set. (See examples in the initdb tests.)

I think this answers some of your follow-up questions as well.

> A related comment is about naming the builtin locale C.UTF-8, the
> same
> name as in libc. On one hand this is semantically sound, but on the
> other hand, it's likely to confuse people. What about using
> completely
> different names, like "pg_unicode" or something else prefixed by
> "pg_"
> both for the locale name and the collation name (currently
> C.UTF-8/c_utf8)?

I'm flexible on naming, but here are my thoughts:

* A "pg_" prefix makes sense.

* If we named it something like "pg_unicode" someone might expect it to
sort using the root collation.

* The locale name "C.UTF-8" is nice because it implies things about
both the collation and the character behavior. It's also nice because
on at least some platforms, the behavior is almost identical to the
libc locale of the same name.

* UCS_BASIC might be a good name, because it also seems to carry the
right meanings, but that name is already taken.

* We might also want to support variations, such as full case mapping (which
uppercases "ß" to "SS", as the SQL standard requires), or perhaps the
"standard" flavor of regexes (which don't count all symbols as
punctuation). Leaving some room to name those variations would be a
good idea.

Regards,
Jeff Davis





Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-11 Thread Robert Haas
On Tue, Jan 9, 2024 at 2:35 PM Andres Freund  wrote:
> I don't have that strong feelings about it. If both of you think it looks 
> good, go ahead...

Done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-11 Thread Robert Haas
On Tue, Jan 9, 2024 at 1:18 PM Robert Haas  wrote:
> Here's v2. I plan to commit the rest of this fairly soon if there are
> no comments.

Done, with a minor adjustment to 0003.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Recovering from detoast-related catcache invalidations

2024-01-11 Thread Xiaoran Wang
Hi,


>> BTW, while nosing around I found what seems like a very nasty related
>> bug. Suppose that a catalog tuple being loaded into syscache contains
>> some toasted fields. CatalogCacheCreateEntry will flatten the tuple,
>> involving fetches from toast tables that will certainly cause
>> AcceptInvalidationMessages calls. What if one of those should have
>> invalidated this tuple? We will not notice, because it's not in
>> the hashtable yet. When we do add it, we will mark it not-dead,
>> meaning that the stale entry looks fine and could persist for a long
>> while.
I spent some time trying to understand the bug, and I can finally reproduce
it locally with the following steps:

step 1:
Create a function called 'test' with a long body that must be stored in a
toast table, and put it in schema 'yy' with: "alter function test set schema yy";

step 2:
I added a breakpoint at 'toast_flatten_tuple' for session 1,
then executed the following SQL:
--
set search_path='public';
alter function test set schema xx;
--
step 3:
When session 1 stops at the breakpoint, I open session 2 and execute:
---
set search_path = 'yy';
alter function test set schema public;
---
step 4:
Resume session 1; it reports the error "ERROR:  could not find a
function named "test"".

step 5:
Continue to execute "alter function test set schema xx;" in session 1, but
it still fails with the above error, although the function 'test' now
belongs to schema 'public'.

Obviously, in session 1, the "test" proc tuple in the cache is outdated.

>> The detection of "get an invalidation" could be refined: what I did
>> here is to check for any advance of SharedInvalidMessageCounter,
>> which clearly will have a significant number of false positives.
>> However, the only way I see to make that a lot better is to
>> temporarily create a placeholder catcache entry (probably a negative
>> one) with the same keys, and then see if it got marked dead.
>> This seems a little expensive, plus I'm afraid that it'd be actively
>> wrong in the recursive-lookup cases that the existing comment in
>> SearchCatCacheMiss is talking about (that is, the catcache entry
>> might mislead any recursive lookup that happens).

I have reviewed your patch, and it looks good.  But instead of checking for
any advance of SharedInvalidMessageCounter (which is a little expensive when
the invalidation message is not related to the current tuple), I have another
idea: we can recheck the visibility of the tuple with the CatalogSnapshot
(the CatalogSnapshot must be refreshed if there are any SharedInvalidMessages).
If it is not visible, we re-fetch the tuple; otherwise, we can continue to use
it, as it is not outdated.
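
To sketch the idea (an illustration of the approach only, not the attached
patch; it assumes we are inside CatalogCacheCreateEntry() with the on-disk
tuple "ntp" still pinned in "buffer"):

	/* Flattening can fetch from toast tables and thus process invals. */
	dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);

	/* The catalog snapshot is refreshed whenever invals were received. */
	if (!HeapTupleSatisfiesVisibility(ntp,
									  GetCatalogSnapshot(cache->cc_reloid),
									  buffer))
	{
		/* Tuple was invalidated while detoasting: discard and re-fetch. */
		heap_freetuple(dtp);
		return NULL;
	}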

I added a commit based on your patch and attached it.


0001-Recheck-the-tuple-visibility.patch
Description: Binary data


Re: Stack overflow issue

2024-01-11 Thread Robert Haas
On Wed, Jan 10, 2024 at 4:25 PM Heikki Linnakangas  wrote:
> The problem with CommitTransactionCommand (or rather
> AbortCurrentTransaction() which has the same problem)
> and ShowTransactionStateRec is that they get called in a state where
> aborting can lead to a panic. If you add a "check_stack_depth()" to them
> and try the reproducer scripts that Egor posted, you still get a panic.

Hmm, that's unfortunate. I'm not sure what to do about that. But I'd
still suggest looking for a goto-free approach.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Compile warnings in dbcommands.c building with meson

2024-01-11 Thread Magnus Hagander
On Wed, Jan 10, 2024 at 1:16 PM Aleksander Alekseev
 wrote:
>
> Hi,
>
> > When building current head on debian bullseye I get this compile warning:
> >
> > In file included from ../src/backend/commands/dbcommands.c:20:
> > ../src/backend/commands/dbcommands.c: In function ‘createdb’:
> > ../src/include/postgres.h:104:9: warning: ‘src_hasloginevt’ may be
> > used uninitialized in this function [-Wmaybe-uninitialized]
> >   104 |  return (Datum) (X ? 1 : 0);
> >   | ^~~
> > ../src/backend/commands/dbcommands.c:683:8: note: ‘src_hasloginevt’
> > was declared here
> >   683 |  bool  src_hasloginevt;
> >   |^~~
> >
> >
> > I only get this when building with meson, not when building with
> > autotools. AFAICT, I have the same config:
> >
> > ./configure --enable-debug --enable-depend --with-python
> > --enable-cassert --with-openssl --enable-tap-tests --with-icu
> >
> > vs
> >
> > meson setup build -Ddebug=true -Dpython=true -Dcassert=true
> > -Dssl=openssl -Dtap-test=true -Dicu=enabled -Dnls=disabled
> >
> >
> > in both cases the compiler is:
> > gcc (Debian 10.2.1-6) 10.2.1 20210110
>
> Seems to me that the compiler is not smart enough to process:
>
> ```
> if (!get_db_info(dbtemplate, ShareLock,
>                  &src_dboid, &src_owner, &src_encoding,
>                  &src_istemplate, &src_allowconn, &src_hasloginevt,
>                  &src_frozenxid, &src_minmxid, &src_deftablespace,
>                  &src_collate, &src_ctype, &src_iculocale,
>                  &src_icurules, &src_locprovider,
>                  &src_collversion))
>     ereport(ERROR, ...
> ```
>
> Should we just silence the warning like this - see attachment? I don't
> think createdb() is called that often to worry about slight
> performance change, if there is any.

Certainly looks that way, but I'm curious as to why nobody else has seen this...

That said, it appears to be gone in current master, even though nothing
changed in that file.  Must've been some transient effect that somehow
didn't get blown away by doing a clean build.
-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Suggest two small improvements for PITR.

2024-01-11 Thread Yura Sokolov

Good day, hackers.

Here I am to suggest two small improvements to Point In Time Recovery.

First is the ability to recover to a recovery-target-time using the
timestamp stored in XLOG_RESTORE_POINT. It looks like historically this
ability did exist and was removed unintentionally during refactoring in
commit [1]

c945af80 "Refactor checking whether we've reached the recovery target."

Second is extending the XLOG_BACKUP_END record with a timestamp, so that a
backup has its own timestamp as well. This is a backward-compatible change,
since there was no record-length check before.


Both changes help slightly on mostly idle systems, where no commits at all
may happen between several backups, so there's no timestamp to recover to.
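
For illustration, with these changes a plain recovery-target setup (the
values below are only examples) can stop at the timestamp of a restore
point or of a backup-end record, even if no commit happened near the target:

restore_command      = 'cp /mnt/archive/%f %p'
recovery_target_time = '2024-01-11 03:00:00+00'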


Attached sample patches are made in reverse order:
- XLOG_BACKUP_END first, then XLOG_RESTORE_POINT.
The second patch was made by a colleague based on my idea.
Publishing of both is permitted.

If the idea is accepted, patches for tests will be added as well.

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=c945af80

---

Yura Sokolov.From 173cfc3762a97c300b618f863fd23433909cdb81 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Wed, 3 May 2023 18:48:46 +0300
Subject: [PATCH] PGPRO-8083: record timestamp in XLOG_BACKUP_END for
 recovery_target_time

It is useful for pg_probackup to recover on backup end.

Tags: pg_probackup
---
 src/backend/access/rmgrdesc/xlogdesc.c| 15 +++--
 src/backend/access/transam/xlog.c |  6 ++--
 src/backend/access/transam/xlogrecovery.c | 39 +++
 src/include/access/xlog_internal.h|  7 
 4 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 3fd7185f217..e1114168239 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -86,10 +86,19 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 	}
 	else if (info == XLOG_BACKUP_END)
 	{
-		XLogRecPtr	startpoint;
+		xl_backup_end	xlrec = {0, 0};
+		size_t			rec_len = XLogRecGetDataLen(record);
 
-		memcpy(&startpoint, rec, sizeof(XLogRecPtr));
-		appendStringInfo(buf, "%X/%X", LSN_FORMAT_ARGS(startpoint));
+		if (rec_len == sizeof(XLogRecPtr))
+			memcpy(&xlrec, rec, sizeof(XLogRecPtr));
+		else if (rec_len >= sizeof(xl_backup_end))
+			memcpy(&xlrec, rec, sizeof(xl_backup_end));
+
+		appendStringInfo(buf, "%X/%X", LSN_FORMAT_ARGS(xlrec.startpoint));
+
+		if (rec_len >= sizeof(xl_backup_end))
+			appendStringInfo(buf, " at %s",
+			 timestamptz_to_str(xlrec.end_time));
 	}
 	else if (info == XLOG_PARAMETER_CHANGE)
 	{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5ebb9783e2f..42cd06cd7d8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8680,14 +8680,16 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	}
 	else
 	{
+		xl_backup_end	xlrec;
 		char	   *history_file;
 
 		/*
 		 * Write the backup-end xlog record
 		 */
+		xlrec.startpoint = state->startpoint;
+		xlrec.end_time = GetCurrentTimestamp();
 		XLogBeginInsert();
-		XLogRegisterData((char *) (&state->startpoint),
-						 sizeof(state->startpoint));
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
 		state->stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);
 
 		/*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 11747a6ff13..42d5b59ac25 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2312,6 +2312,13 @@ getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
 		*recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
 		return true;
 	}
+	if (rmid == RM_XLOG_ID && info == XLOG_BACKUP_END)
+	{
+		if (XLogRecGetDataLen(record) < sizeof(xl_backup_end))
+			return false;
+		*recordXtime = ((xl_backup_end *) XLogRecGetData(record))->end_time;
+		return true;
+	}
 	if (rmid == RM_XACT_ID && (xact_info == XLOG_XACT_COMMIT ||
 			   xact_info == XLOG_XACT_COMMIT_PREPARED))
 	{
@@ -2640,6 +2647,38 @@ recoveryStopsAfter(XLogReaderState *record)
 		}
 	}
 
+	if (recoveryTarget == RECOVERY_TARGET_TIME &&
+		rmid == RM_XLOG_ID && info == XLOG_BACKUP_END)
+	{
+		bool	stopsHere = false;
+
+		if (getRecordTimestamp(record, &recordXtime))
+		{
+			/*
+			 * Use same conditions as in recoveryStopsBefore for transaction
+			 * records to not override transactions time handling.
+			 */
+			if (recoveryTargetInclusive)
+stopsHere = recordXtime > recoveryTargetTime;
+			else
+stopsHere = recordXtime >= recoveryTargetTime;
+		}
+
+		if (stopsHere)
+		{
+			recoveryStopAfter = true;
+			recoveryStopXid = InvalidTransactionId;
+			recoveryStopLSN = InvalidXLogRecPtr;
+			recoveryStopTime = recordXtime;
+			recoveryStopName[0] = '\0';
+
+			ereport(LOG,
+	(errmsg("recovery stopping at backup end at time %s",
+					timestamptz_to_str(recoveryStopTime))));
+			return true;
+		}
+	}
+
 	/* Check 

Re: recovery modules

2024-01-11 Thread Nathan Bossart
rebased

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 6cee7d220b886e9be309929da5274c5770e59845 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 14:28:53 -0800
Subject: [PATCH v17 1/5] introduce routine for checking mutually exclusive
 string GUCs

---
 src/backend/postmaster/pgarch.c |  8 +++-
 src/backend/utils/misc/guc.c| 22 ++
 src/include/utils/guc.h |  3 +++
 3 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 67693b0580..6ae8bb53c4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -824,11 +824,9 @@ LoadArchiveLibrary(void)
 {
 	ArchiveModuleInit archive_init;
 
-	if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("both archive_command and archive_library set"),
- errdetail("Only one of archive_command, archive_library may be set.")));
+	(void) CheckMutuallyExclusiveStringGUCs(XLogArchiveLibrary, "archive_library",
+			XLogArchiveCommand, "archive_command",
+			ERROR);
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8f65ef3d89..9d21c16169 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2640,6 +2640,28 @@ ReportGUCOption(struct config_generic *record)
 	pfree(val);
 }
 
+/*
+ * If both parameters are set, emits a log message at 'elevel' and returns
+ * false.  Otherwise, returns true.
+ */
+bool
+CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+ const char *p2val, const char *p2name,
+ int elevel)
+{
+	if (p1val[0] != '\0' && p2val[0] != '\0')
+	{
+		ereport(elevel,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot set both %s and %s", p1name, p2name),
+ errdetail("Only one of %s or %s may be set.",
+		   p1name, p2name)));
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Convert a value from one of the human-friendly units ("kB", "min" etc.)
  * to the given base unit.  'value' and 'unit' are the input value and unit
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 471d53da8f..6543e61463 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -375,6 +375,9 @@ extern int	NewGUCNestLevel(void);
 extern void AtEOXact_GUC(bool isCommit, int nestLevel);
 extern void BeginReportingGUCOptions(void);
 extern void ReportChangedGUCOptions(void);
+extern bool CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+			 const char *p2val, const char *p2name,
+			 int elevel);
 extern void ParseLongOption(const char *string, char **name, char **value);
 extern const char *get_config_unit_name(int flags);
 extern bool parse_int(const char *value, int *result, int flags,
-- 
2.25.1
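
As a usage sketch (a hypothetical caller, not part of this patch set): with
a non-ERROR elevel, the boolean return value lets the caller skip work
instead of failing:

	/* Hypothetical soft caller: warn and bail out rather than erroring. */
	if (!CheckMutuallyExclusiveStringGUCs(restoreLibrary, "restore_library",
										  recoveryRestoreCommand, "restore_command",
										  WARNING))
		return;		/* both set; a warning has already been emitted */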

>From 9f5c2602baa8e29058bc976c01516d04e3c1f901 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 10:36:00 -0800
Subject: [PATCH v17 2/5] refactor code for restoring via shell

---
 src/backend/Makefile  |   2 +-
 src/backend/access/transam/timeline.c |  12 +-
 src/backend/access/transam/xlog.c |  50 -
 src/backend/access/transam/xlogarchive.c  | 167 ---
 src/backend/access/transam/xlogrecovery.c |   3 +-
 src/backend/meson.build   |   1 +
 src/backend/postmaster/startup.c  |  16 +-
 src/backend/restore/Makefile  |  18 ++
 src/backend/restore/meson.build   |   5 +
 src/backend/restore/shell_restore.c   | 245 ++
 src/include/Makefile  |   2 +-
 src/include/access/xlogarchive.h  |   9 +-
 src/include/meson.build   |   1 +
 src/include/postmaster/startup.h  |   1 +
 src/include/restore/shell_restore.h   |  26 +++
 src/tools/pgindent/typedefs.list  |   1 +
 16 files changed, 407 insertions(+), 152 deletions(-)
 create mode 100644 src/backend/restore/Makefile
 create mode 100644 src/backend/restore/meson.build
 create mode 100644 src/backend/restore/shell_restore.c
 create mode 100644 src/include/restore/shell_restore.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 7d2ea7d54a..cbeb8ac93c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -19,7 +19,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access archive backup bootstrap catalog parser commands executor \
 	foreign lib libpq \
 	main nodes optimizer partitioning port postmaster \
-	regex replication rewrite \
+	regex replication restore rewrite \
 	statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
 	jit
 
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 146751ae37..3269547de7 100644
--- 

Re: System username in pg_stat_activity

2024-01-11 Thread Bertrand Drouvot
Hi,

On Thu, Jan 11, 2024 at 02:24:58PM +0100, Magnus Hagander wrote:
> On Wed, Jan 10, 2024 at 3:12 PM Bertrand Drouvot
>  wrote:
> >
> > If we go the 2 fields way, then what about auth_identity and auth_method 
> > then?
> 
> 
> Here is an updated patch based on this idea.

Thanks!

+ 
+  
+   auth_method text
+  
+  
+   The authentication method used for authenticating the connection, or
+   NULL for background processes.
+  

I'm wondering if it would make sense to populate it for parallel workers too.
I think it's doable thanks to d951052, but I'm not sure it's worth it (one could
join based on the leader_pid though). OTOH that would be consistent with
how the SYSTEM_USER behaves with parallel workers (it's populated).
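
For example (illustrative only, using the columns proposed here), one could
resolve a parallel worker's authentication info through its leader:

SELECT w.pid, w.backend_type, l.auth_method, l.auth_identity
FROM pg_stat_activity w
JOIN pg_stat_activity l ON l.pid = COALESCE(w.leader_pid, w.pid);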

+  
+   auth_identity text
+  
+  
+   The identity (if any) that the user presented during the authentication
+   cycle before they were assigned a database role.  Contains the same
+   value as 

Same remark regarding the parallel workers case, plus:

- Would it be better to use the `name` datatype for auth_identity?
- What about "Contains the same value as the identity part in "?

+	/*
+	 * Trust doesn't set_authn_id(), but we still need to store the
+	 * auth_method
+	 */
+	MyClientConnectionInfo.auth_method = uaTrust;

+1, I think it is useful here to provide "trust" and not a NULL value in the
context of this patch.

+# pg_stat_activity shold contain trust and empty string for trust auth

typo: s/shold/should/

+# Users with md5 auth should show both auth method and name in pg_stat_activity

what about "show both auth method and identity"?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-11 Thread Robert Haas
On Wed, Jan 10, 2024 at 5:28 PM Melanie Plageman
 wrote:
> Yes, the options I can think of are:
>
> 1) rename the parameter to "immed_reap" or similar and make very clear
> in heap_page_prune_opt() that you are not to pass true.
> 2) make immediate reaping work for on-access pruning. I would need a
> low cost way to find out if there are any indexes on the table. Do you
> think this is possible? Should we just rename the parameter for now
> and think about that later?

I don't think we can implement (2), because:

robert.haas=# create table test (a int);
CREATE TABLE
robert.haas=# begin;
BEGIN
robert.haas=*# select * from test;
 a
---
(0 rows)



robert.haas=# create index on test (a);
CREATE INDEX

In English, the lock we hold during regular table access isn't strong
enough to foreclose concurrent addition of an index.

So renaming the parameter seems like the way to go. How about "mark_unused_now"?

> > -   if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> > +   if (!ItemIdIsUsed(itemid))
> > continue;
> >
> > There is a bit of overhead here for the !no_indexes case. I assume it
> > doesn't matter.
>
> Right. Should be okay. Alternatively, and I'm not saying this is a
> good idea, but we could throw this into the loop in heap_page_prune()
> which calls heap_prune_chain():
>
> +   if (ItemIdIsDead(itemid) && prstate.no_indexes)
> +   {
> +   heap_prune_record_unused(, offnum);
> +   continue;
> +   }
>
> I think that is correct?

Wouldn't that be wrong in any case where heap_prune_chain() calls
heap_prune_record_dead()?

> Yes, so I preferred it in the body of heap_prune_chain() (the caller).
> Andres didn't like the extra level of indentation. I could wrap
> heap_record_dead() and heap_record_unused(), but I couldn't really
> think of a good wrapper name. heap_record_dead_or_unused() seems long
> and literal. But, it might be the best I can do. I can't think of a
> general word which encompasses both the transition to death and to
> disposal.

I'm sure we could solve the wordsmithing problem but I think it's
clearer if the heap_prune_record_* functions don't get clever.

> > +   boolrecordfreespace;
> >
> > Not sure if it's necessary to move this to an outer scope like this?
> > The whole handling of this looks kind of confusing. If we're going to
> > do it this way, then I think lazy_scan_prune() definitely needs to
> > document how it handles this function (i.e. might set true to false,
> > won't set false to true, also meaning). But are we sure we want to let
> > a local variable with a weird name "leak out" like this?
>
> Which function do you mean when you say "document how
> lazy_scan_prune() handles this function".

"function" was a thinko for "variable".

> And no we definitely don't
> want a variable like this to be hanging out in lazy_scan_heap(), IMHO.
> The other patches in the stack move the FSM updates into
> lazy_scan_[no]prune() and eliminate this parameter. I moved up the
> scope because lazy_scan_noprune() already had recordfreespace as an
> output parameter and initialized it unconditionally inside. I
> initialize it unconditionally in lazy_scan_prune() as well. I mention
> in the commit message that this is temporary because we plan to
> eliminate recordfreespace as an output parameter by updating the FSM
> in lazy_scan_[no]prune(). I could have stuck recordfreespace into the
> LVPagePruneState with the other output parameters. But, leaving it as
> a separate output parameter made the diffs lovely for the rest of the
> patches in the stack.

I guess I'll have to look at more of the patches to see what I think
of this. Looking at this patch in isolation, it's ugly, IMHO, but it
seems we agree on that.

> > +   Assert(vacrel->do_index_vacuuming);
> >
> > Is this related?
>
> Yes, previously the assert was:
> Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
> And we eliminated the caller of lazy_vacuum_heap_page() with
> vacrel->nindexes == 0. Now it should only be called after doing index
> vacuuming (thus index vacuuming should definitely be enabled).

Ah.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: A failure in t/038_save_logical_slots_shutdown.pl

2024-01-11 Thread Bharath Rupireddy
On Thu, Jan 11, 2024 at 4:35 PM vignesh C  wrote:
>
> On further analysis, it was found that in the failing test,
> CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL
> page header present just before the CHECKPOINT_SHUTDOWN which was
> causing the failure. We could alternatively reproduce the issue by
> switching the WAL file before restarting the server like in the
> attached test change patch.
> There are a couple of ways to fix this issue a) one by switching the
> WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN
> does not get inserted in a new page as in the attached test_fix.patch
> b) by using pg_walinspect to check that the next WAL record is
> CHECKPOINT_SHUTDOWN. I have to try this approach.
>
> Thanks to Bharath and Kuroda-san for offline discussions and helping
> in getting to the root cause.

IIUC, the problem the commit e0b2eed tries to solve is to ensure there
are no left-over decodable WAL records between confirmed_flush LSN and
a shutdown checkpoint, which is what is expected by the
t/038_save_logical_slots_shutdown.pl test. How about we have a PG function
returning true if there are any decodable WAL records between the
given start_lsn and end_lsn? Usage of this new function will make the
tests more concrete and stable. This function doesn't have to be
something really new; we can just turn
binary_upgrade_logical_slot_has_caught_up into a general, non-binary PG
function; this idea has come up before
https://www.postgresql.org/message-id/CAA4eK1KZXaBgVOAdV8ZfG6AdDbKYFVz7teDa7GORgQ3RVYS93g%40mail.gmail.com.
If okay, I can offer to write a patch.
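
Until such a function exists, something along these lines with pg_walinspect
gives a crude approximation (the LSNs are placeholders, and filtering by
resource manager is only a rough stand-in for "decodable"):

SELECT count(*) = 0 AS caught_up
FROM pg_get_wal_records_info('0/1548910', '0/1548A18')
WHERE resource_manager NOT IN ('XLOG', 'Standby');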

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-11 Thread Jelte Fennema-Nio
On Wed, 10 Jan 2024 at 23:53, Jelte Fennema-Nio  wrote:
> Honestly, I care more about patch 0010 than patch 0008. Patch 0008
> simply seemed like the easiest way to demonstrate the ParameterSet
> message.

So to be clear, if you consider 0008 the most controversial/risky part
of this patchset (which it sounds like that's the case). I'd be fine
with removing that for now. IMHO the first 7 patches would be very
useful on their own, because they unblock any other patches that want
to introduce protocol changes (examples of those are 0008 and 0010).

Do you think that is a good idea? I could fairly easily modify the
tests in 0009 to remove any things related to 0008.




Re: Synchronizing slots from primary to standby

2024-01-11 Thread Bertrand Drouvot
Hi,

On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote:
> On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila  wrote:
> >
> > +static bool
> > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot)
> > {
> > ...
> > + /* Slot ready for sync, so sync it. */
> > + else
> > + {
> > + /*
> > + * Sanity check: With hot_standby_feedback enabled and
> > + * invalidations handled appropriately as above, this should never
> > + * happen.
> > + */
> > + if (remote_slot->restart_lsn < slot->data.restart_lsn)
> > + elog(ERROR,
> > + "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> > + " to remote slot's LSN(%X/%X) as synchronization"
> > + " would move it backwards", remote_slot->name,
> > + LSN_FORMAT_ARGS(slot->data.restart_lsn),
> > + LSN_FORMAT_ARGS(remote_slot->restart_lsn));
> > ...
> > }
> >
> > I was thinking about the above code in the patch and as far as I can
> > think this can only occur if the same name slot is re-created with
> > prior restart_lsn after the existing slot is dropped. Normally, the
> > newly created slot (with the same name) will have higher restart_lsn
> > but one can mimic it by copying some older slot by using
> > pg_copy_logical_replication_slot().
> >
> > I don't think as mentioned in comments even if hot_standby_feedback is
> > temporarily set to off, the above shouldn't happen. It can only lead
> > to invalidated slots on standby.

I also think so.

> >
> > To close the above race, I could think of the following ways:
> > 1. Drop and re-create the slot.
> > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves
> > ahead of local_slot's LSN then we can update it; but as mentioned in
> > your previous comment, we need to update all other fields as well. If
> > we follow this then we probably need to have a check for catalog_xmin
> > as well.

IIUC, this would be a sync slot (so not usable until promotion) that could
not be used anyway (invalidated), so I'll vote for drop / re-create then.

> > Now, related to this the other case which needs some handling is what
> > if the remote_slot's restart_lsn is greater than local_slot's
> > restart_lsn but it is a re-created slot with the same name. In that
> > case, I think the other properties like 'two_phase', 'plugin' could be
> > different. So, is simply copying those sufficient or do we need to do
> > something else as well?
> >
>

I'm not sure I follow here. If the remote slot is re-created then it would
also be dropped / re-created locally, or am I missing something?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [BUG] autovacuum may skip tables when session_authorization/role is set on database

2024-01-11 Thread vignesh C
On Thu, 14 Dec 2023 at 02:13, Imseih (AWS), Sami  wrote:
>
> Hi,
>
> A recent case in the field in which a database session_authorization is
> altered to a non-superuser, non-owner of tables via alter database .. set
> session_authorization .. caused autovacuum to skip tables.
>
> The issue was discovered on 13.10, and the logs show such messages:
>
> warning:  skipping "table1" --- only table or database owner can vacuum it
>
> In HEAD, I can repro, but the message is now a bit different due to [1].
>
> WARNING:  permission denied to vacuum "table1", skipping it
>
> It seems to me we should force an autovacuum worker to set the session
> userid to a superuser.
>
> Attached is a repro and a patch which sets the session user to the
> BOOTSTRAP superuser at the start of the autovac worker.

Since there is not much interest in this patch, I have changed the status
to "Returned with Feedback". Feel free to propose a stronger use case for
the patch and add a new entry for it.

Regards,
Vignesh




Re: [PATCH] LockAcquireExtended improvement

2024-01-11 Thread vignesh C
On Tue, 28 Nov 2023 at 18:23, Jingxian Li  wrote:
>
> Hi hackers,
>
> I found a problem when doing the test shown below:
>
> T1 (Session A):
>   postgres=# create table test(a int);
>   CREATE TABLE
>   postgres=# insert into test values (1);
>   INSERT 0 1
>
> T2 (Session A):
>   postgres=# begin;
>   BEGIN
>   postgres=*# lock table test in access exclusive mode;
>   LOCK TABLE
>
> T3 (Session B):
>   postgres=# begin;
>   BEGIN
>   postgres=*# lock table test in exclusive mode;   -- blocks, waiting for session A
>
> T4 (Session A):
>   Case 1:
>   postgres=*# lock table test in share row exclusive mode nowait;
>   ERROR:  could not obtain lock on relation "test"
>
>   Case 2:
>   postgres=*# lock table test in share row exclusive mode;
>   LOCK TABLE
>
> At moment T4 in session A, (case 1) executing the SQL "lock table test in
> share row exclusive mode nowait;" fails with the error "could not obtain
> lock on relation "test""; however, (case 2) executing the same SQL without
> nowait obtains the lock successfully.
>
> Digging into the source code, I found that in case 2 the lock is obtained
> in the function ProcSleep instead of LockAcquireExtended. Because the
> nowait logic is processed before WaitOnLock->ProcSleep, acquiring the lock
> fails in case 1. Can any changes be made so that such a lock is granted
> before WaitOnLock?
>
>
> Providing a more universal case:
>
> Transaction A already holds an n-mode lock on table test. If transaction
> A then requests an m-mode lock on table test, m and n have the following
> constraints:
>
> (lockMethodTable->conflictTab[n] & lockMethodTable->conflictTab[m]) == 
> lockMethodTable->conflictTab[m]
>
> Obviously, in this case, m<=n.
>
> Should the m-mode lock be granted before WaitOnLock?
>
>
>
> In the case of m=n (i.e. we already hold the lock), the m-mode lock is
> immediately granted in the LocalLock path, without the need of a lock
> conflict check.
>
> Based on the facts above, can we obtain a weaker lock (m<n) on the same
> object within the same transaction without doing a lock conflict check?
>
> Since m=n works, m<n should work too.
>
>
> I am attaching a patch here with which the problem in case 1 fixed.

I did not see any test added for this; should we add a test case for it?

Regards,
Vignesh




Re: Wrong results with grouping sets

2024-01-11 Thread vignesh C
On Thu, 7 Dec 2023 at 13:52, Richard Guo  wrote:
>
>
> On Mon, Sep 25, 2023 at 3:11 PM Richard Guo  wrote:
>>
>> If the grouping expression is a Var or PHV, we can just set its
>> nullingrels, very straightforward.   For an expression that is neither a
>> Var nor a PHV, I'm not quite sure how to set the nullingrels.  I tried
>> the idea of wrapping it in a new PHV to carry the nullingrels, but that
>> may cause some unnecessary plan diffs.  In the patch for such an
>> expression I just set the nullingrels of Vars or PHVs that are contained
>> in it.  This is not really 'correct' in theory, because it is the whole
>> expression that can be nullable by grouping sets, not its individual
>> vars.  But it works in practice, because what we need is that the
>> expression can be somehow distinguished from the same expression in ECs,
>> and marking its vars is sufficient for this purpose.  But what if the
>> expression is variable-free?  This is the point that needs more work.
>> Fow now the patch just handles variable-free expressions of type Const,
>> by wrapping it in a new PHV.
>
>
> For a variable-free expression, if it contains volatile functions, SRFs,
> aggregates, or window functions, it would not be treated as a member of
> EC that is redundant (see get_eclass_for_sort_expr()).  That means it
> would not be removed from the pathkeys list, so we do not need to set
> the nullingrels for it.  Otherwise we can just wrap the expression in a
> PlaceHolderVar.  Attached is an updated patch to do that.
>
> BTW, this wrong results issue has existed ever since grouping sets was
> introduced in v9.5, and there were field reports complaining about this
> issue.  I think it would be great if we can get rid of it.  I'm still
> not sure what we should do in back branches.  But let's fix it at least
> on v16 and later.

I have changed the status of the patch to "Waiting on Author" as Tom
Lane's comments have not yet been addressed. Feel free to address them
and update the commitfest entry accordingly.

Regards,
Vignesh




Re: Should the archiver process always make sure that the timeline history files exist in the archive?

2024-01-11 Thread vignesh C
On Tue, 29 Aug 2023 at 06:29, Jimmy Yih  wrote:
>
> Thanks for the insightful response! I have attached an updated patch
> that moves the proposed logic to the end of StartupXLOG where it seems
> more correct to do this. It also helps with backporting (if it's
> needed) since the archiver process only has access to shared memory
> starting from Postgres 14.
>
> Kyotaro Horiguchi  wrote:
> > A. The OP suggests archiving the timeline history file for the current
> >  timeline every time the archiver starts. However, I don't think we
> >  want to keep archiving the same file over and over. (Granted, we're
> >  not always perfect at avoiding that..)
>
> With the updated proposed patch, we'll be checking if the current
> timeline history file needs to be archived at the end of StartupXLOG
> if archiving is enabled. If it detects that a .ready or .done file
> already exists, then it won't do anything (which will be the common
> case). I agree though that this may be an excessive check since it'll
> be a no-op the majority of the time. However, it shouldn't execute
> often and seems like a quick safe preventive measure. Could you give
> more details on why this would be too cumbersome?
>
> Kyotaro Horiguchi  wrote:
> > B. Given that the steps valid, I concur to what is described in the
> >  test script provided: standbys don't really need that history file
> >  for the initial TLI (though I have yet to fully verify this).  If the
> >  walreceiver just overlooks a fetch error for this file, the standby
> >  can successfully start. (Just skipping the first history file seems
> >  to work, but it feels a tad aggressive to me.)
>
> This was my initial thought as well but I wasn't sure if it was okay
> to overlook the fetch error. Initial testing and brainstorming seems
> to show that it's okay. I think the main bad thing is that these new
> standbys will not have their initial timeline history files which can
> be useful for administration. I've attached a patch that attempts this
> approach if we want to switch to this approach as the solution. The
> patch contains an updated TAP test as well to better showcase the
> issue and fix.
>
> Kyotaro Horiguchi  wrote:
> > C. If those steps aren't valid, we might want to add a note stating
> >  that -X none basebackups do need the timeline history file for the
> >  initial TLI.
>
> The difficult thing about only documenting this is that it forces the
> user to manually store and track the timeline history files. It can be
> a bit cumbersome for WAL archiving users to recognize this scenario
> when they're just trying to optimize their basebackups by using
> -Xnone. But then again -Xnone does seem like it's designed for
> advanced users so this might be okay.
>
> Kyotaro Horiguchi  wrote:
> > And don't forget to enable archive mode before the latest timeline
> > switch if any.
>
> This might not be reasonable since a user could've been using
> streaming replication and doing failover/failbacks as part of general
> high availability to manage their Postgres without knowing they were
> going to enable WAL archiving later on. The user would need to
> configure archiving and force a failover which may not be
> straightforward.

I have changed the status of the patch to "Waiting on Author" as
Robert's suggestions have not yet been addressed. Feel free to address
the suggestions and update the status accordingly.

Regards,
Vignesh




Re: Postgres Partitions Limitations (5.11.2.3)

2024-01-11 Thread Laurenz Albe
On Thu, 2024-01-11 at 14:44 +0100, Magnus Hagander wrote:
> Thanks, applied and backpatched all the way.

Thanks for taking care of that!

Yours,
Laurenz Albe




Re: A recent message added to pg_upgade

2024-01-11 Thread Tom Lane
Alvaro Herrera  writes:
> "The logical replication launcher is disabled during binary upgrades, to
> avoid logical replication workers running on the source cluster.  That
> would cause replication origins to move forward after having been copied
> to the target cluster, potentially creating conflicts with the copied
> data files."

"avoid logical replication workers running" still seems like shaky
grammar.  Perhaps s/avoid/avoid having/, or write "to prevent logical
replication workers from running ...".

Also perhaps s/would/could/.

Otherwise +1

regards, tom lane




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-11 Thread Bertrand Drouvot
Hi,

On Thu, Jan 11, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> 10.01.2024 19:32, Bertrand Drouvot wrote:
> > 
> > > is an absent message 'obsolete replication slot "row_removal_activeslot"'
> > Looking at 
> > recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:
> > 
> > Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().
> > 
> > In case of an active slot we first call ReportSlotInvalidation with the 
> > second
> > parameter set to true (to emit the "terminating" message), then SIGTERM the 
> > active
> > process and then (later) we should call the other ReportSlotInvalidation()
> > call with the second parameter set to false (to emit the message that we 
> > don't
> > see here).
> > 
> > So it means InvalidatePossiblyObsoleteSlot() did not trigger the second 
> > ReportSlotInvalidation()
> > call.
> 
> I've found a way to reproduce the issue without slowing down a machine or
> running multiple tests in parallel. It's enough for this to add a delay to
> allow BackgroundWriterMain() to execute LogStandbySnapshot():
> @@ -692,2 +690,3 @@
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
> +$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
> 
> With this delay, I get the failure immediately:
> $ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C 
> src/test/recovery
> # +++ tap check in src/test/recovery +++
> t/035_standby_logical_decoding.pl .. 47/?
> #   Failed test 'activeslot slot invalidation is logged with on-access 
> pruning'
> #   at t/035_standby_logical_decoding.pl line 227.

Thanks a lot for the testing!

So I think we have 2 issues here:

1) The one you're mentioning above related to the on-access pruning test:

I think the engine behavior is right here and that the test is racy. I'm
proposing to bypass the active slot invalidation check for this particular
test (as I don't see any "easy" way to take care of this race condition).
The active slot invalidation is already well covered in the other tests
anyway.

I'm proposing the attached 
v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch
for it.

2) The fact that sometimes we're getting a termination message which is not
followed by an obsolete one (as discussed in [1]).

For this one, I think that InvalidatePossiblyObsoleteSlot() is racy:

In case of an active slot we proceed in 2 steps:
 - terminate the backend holding the slot
 - report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that 
time.

I'm proposing the attached 
v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
for it.
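
Paraphrasing the relevant flow (a simplified sketch, not the actual code),
the window looks like this:

	SpinLockAcquire(&s->mutex);
	/* conflict decided from effective_xmin / effective_catalog_xmin */
	SpinLockRelease(&s->mutex);

	ReportSlotInvalidation(conflict, true, active_pid);	/* "terminating" */
	(void) kill(active_pid, SIGTERM);

	/*
	 * Race: the mutex is no longer held, so the slot's horizons can advance
	 * here, and the later re-check may no longer see a conflict -- then the
	 * "obsolete" message is never emitted.
	 */

	ReportSlotInvalidation(conflict, false, active_pid);	/* "obsolete" */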

Would it be possible to re-launch your repro (the slow one, not the pg_sleep()
one) with both patches applied and see how it goes? (Please note that v4
replaces v3 that you're already using in your tests.)

If it helps, I'll propose 
v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
into a dedicated hackers thread.

[1]: 
https://www.postgresql.org/message-id/ZZ7GpII4bAYN%2BjT5%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 8347851b3cc42655cfd81914f0c2f5cc1d22e2b8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 11 Jan 2024 13:36:57 +
Subject: [PATCH v4] Fix 035_standby_logical_decoding.pl race conditions

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.

Also, get rid of the active slot invalidation check during the on-access pruning
check. This test is racy for active slots and active slots invalidations are well
covered in other tests.
---
 .../t/035_standby_logical_decoding.pl | 114 +-
 1 file changed, 59 insertions(+), 55 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..7a50187326 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -191,9 +191,9 @@ sub check_slots_conflict_reason
 
 # Drop the slots, re-create them, change hot_standby_feedback,
 # check xmin and catalog_xmin values, make slot active and reset stat.
-sub reactive_slots_change_hfs_and_wait_for_xmins
+sub recreate_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
+	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated, $activate) = @_;
 
 	# drop the logical slots
 	drop_logical_slots($previous_slot_prefix);
@@ -203,8 +203,11 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	

Re: heavily contended lwlocks with long wait queues scale badly

2024-01-11 Thread Jonathan S. Katz

On 1/10/24 10:45 PM, Michael Paquier wrote:

On Wed, Jan 10, 2024 at 09:17:47PM -0600, Nathan Bossart wrote:

Now that commit a4adc31 has had some time to bake and concerns about
unintended consequences may have abated, I wanted to revive this
back-patching discussion.  I see a few possibly-related reports [0] [1]
[2], and I'm now seeing this in the field, too.  While it is debatable
whether this is a bug, it's a quite nasty issue for users, and it's both
difficult to detect and difficult to work around.


+1, I've seen this becoming a PITA for a few things.  Knowing that the
size of PGPROC does not change at all, I would be in favor of a
backpatch, especially since it's been in the tree for more than 1
year, and even more knowing that we have 16 released with this stuff
in.


I have similar data sources to Nathan/Michael and I'm trying to avoid 
piling on, but one case that's interesting occurred after a major 
version upgrade from PG10 to PG14 on a database supporting a very 
active/highly concurrent workload. On inspection, it seems like
backpatching would help this particular case.


With 10/11 EOL, I do wonder if we'll see more of these reports on 
upgrade to < PG16.


(I was in favor of backpatching prior; opinion is unchanged).

Thanks,

Jonathan





Re: SQL:2011 application time

2024-01-11 Thread Peter Eisentraut

On 31.12.23 09:51, Paul Jungwirth wrote:
On Wed, Dec 6, 2023 at 12:59 AM Peter Eisentraut  
wrote:

 >
 > On 02.12.23 19:41, Paul Jungwirth wrote:
 > > So what do you think of this idea instead?:
 > >
 > > We could add a new (optional) support function to GiST that translates
 > > "well-known" strategy numbers into the opclass's own strategy numbers.
 >
 > I had some conversations about this behind the scenes.  I think this
 > idea makes sense.

Here is a patch series with the GiST stratnum support function added. I 
put this into a separate patch (before all the temporal ones), so it's 
easier to review. Then in the PK patch (now #2) we call that function to 
figure out the = and && operators. I think this is a big improvement.


I like this solution.

Here is some more detailed review of the first two patches.  (I reviewed 
v20; I see you have also posted v21, but they don't appear very 
different for this purpose.)


v20-0001-Add-stratnum-GiST-support-function.patch

* contrib/btree_gist/Makefile

Needs corresponding meson.build updates.

* contrib/btree_gist/btree_gist--1.7--1.8.sql

Should gist_stratnum_btree() live in contrib/btree_gist/ or in core?
Are there other extensions that use the btree strategy numbers for
gist?

+ALTER OPERATOR FAMILY gist_vbit_ops USING gist ADD
+   FUNCTION  12 (varbit, varbit) gist_stratnum_btree (int2) ;

Is there a reason for the extra space after FUNCTION here (repeated
throughout the file)?

+-- added in 1.4:

What is the purpose of these "added in" comments?


v20-0002-Add-temporal-PRIMARY-KEY-and-UNIQUE-constraints.patch

* contrib/btree_gist/Makefile

Also update meson.build.

* contrib/btree_gist/sql/without_overlaps.sql

Maybe also insert a few values, to verify that the constraint actually
does something?

* doc/src/sgml/ref/create_table.sgml

Is "must have a range type" still true?  With the changes to the
strategy number mapping, any type with a supported operator class
should work?

* src/backend/utils/adt/ruleutils.c

Is it actually useful to add an argument to
decompile_column_index_array()?  Wouldn't it be easier to just print
the " WITHOUT OVERLAPS" in the caller after returning from it?

* src/include/access/gist_private.h

The added function gistTranslateStratnum() isn't really "private" to
gist.  So access/gist.h would be a better place for it.

Also, most other functions there appear to be named "GistSomething",
so a more consistent name might be GistTranslateStratnum.

* src/include/access/stratnum.h

The added StrategyIsValid() doesn't seem that useful?  Plenty of
existing code just compares against InvalidStrategy, and there is only
one caller for the new function.  I suggest to do without it.

* src/include/commands/defrem.h

We are using two terms here, well-known strategy number and canonical
strategy number, to mean the same thing (I think?).  Let's try to
stick with one.  Or explain the relationship?


If these points are addressed, and maybe with another round of checking 
that all corner cases are covered, I think these patches (0001 and 0002) 
are close to ready.






Re: Streaming I/O, vectored I/O (WIP)

2024-01-11 Thread Heikki Linnakangas

On 11/01/2024 05:19, Thomas Munro wrote:

On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas  wrote:

On 10/01/2024 06:13, Thomas Munro wrote:

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.


How will these functions work in the brave new async I/O world? I assume
PrepareReadBuffer() will initiate the async I/O, and
CompleteReadBuffers() will wait for it to complete. How about
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
WaitBufferRead()?


What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

  * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
  * StartReadBuffers() -> starts the I/O for n contiguous !found buffers
  * CompleteReadBuffers() -> waits, completes as necessary


Ok. It feels surprising to have three steps. I understand that you need 
two steps, one to start the I/O and another to wait for them to finish, 
but why do you need separate Prepare and Start steps? What can you do in 
between them? (You explained that. I'm just saying that that's my 
initial reaction when seeing that API. It is surprising.)



If StartReadBuffer() starts the async I/O, the idea that you can call
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
you in RBM_ZERO_* modes.


Yeah, good thoughts, and topics that have occupied me for some time
now.  I also thought that StartReadBuffer() should take
ReadBufferMode, but I came to the conclusion that it probably shouldn't,
for the following reason:

I started working on all this by trying to implement the most
complicated case I could imagine, streaming recovery, and then working
back to the easy cases that just do scans with RBM_NORMAL.  In
recovery, we can predict that a block will be zeroed using WAL flags,
and pre-existing cross-checks at redo time that enforce that the flags
and redo code definitely agree on that, but we can't predict which
exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
get_cleanup_lock=true, as the comment warns them not to, but I
digress).

That's OK, because we can't take locks while looking ahead in recovery
anyway (the redo routine carefully controls lock order/protocol), so
the code to actually do the locking needs to be somewhere near the
output end of the stream when the redo code calls
XLogReadBufferForRedoExtended().  But if you try to use RBM_XXX in
these interfaces, it begins to look pretty funny: the streaming
callback needs to be able to say which ReadBufferMode, but anywhere
near Prepare*(), Start*() or even Complete*() is too soon, so maybe we
need to invent a new value RBM_WILL_ZERO that doesn't yet say which of
the zero modes to use, and then the redo routine needs to pass in the
RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
XLogReadBufferForRedoExtended() and it does it in a separate step
anyway, so we are ignoring ReadBufferMode.  But that feels just wrong
-- we'd be using RBM but implementing them only partially.


I see. When you're about to zero the page, there's not much point in 
splitting the operation into Prepare/Start/Complete stages anyway. 
You're not actually doing any I/O. Perhaps it's best to have a separate 
"Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)" 
function that does the same as 
ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.



The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
right time and makes sense to process as a batch, so it becomes a
flag.


+1


Putting all that together, I propose:

/*
   * Initiate reading a block from disk to the buffer cache.
   *
   * XXX: Until we have async I/O, this just allocates the buffer in the
   * buffer cache. The actual I/O happens in WaitReadBuffer().
   */
Buffer
StartReadBuffer(BufferManagerRelation bmr,
 ForkNumber forkNum,
 BlockNumber blockNum,
 BufferAccessStrategy strategy,
 ReadBufferMode mode,
 bool *foundPtr);

/*
   * Wait for a read that was started earlier with StartReadBuffer() to
   * finish.
   *
   * XXX: Until we have async I/O, this is the function that actually
   * performs the I/O. If multiple I/Os have been started with
   * StartReadBuffer(), this
   * will try to perform all of them in one syscall. Subsequent calls to
   * WaitReadBuffer(), for those other buffers, will finish quickly.
   */
void
WaitReadBuffer(Buffer buf);
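
A sketch of a caller under this proposed API (signatures as above; "rel",
"first_blk" and NBLOCKS are illustrative): start several reads so the I/Os
can be combined, then wait for each buffer before using it.

Buffer		bufs[NBLOCKS];
bool		found;

for (int i = 0; i < NBLOCKS; i++)
	bufs[i] = StartReadBuffer(BMR_REL(rel), MAIN_FORKNUM, first_blk + i,
							  NULL, RBM_NORMAL, &found);

for (int i = 0; i < NBLOCKS; i++)
{
	WaitReadBuffer(bufs[i]);
	/* ... use the page ... */
	ReleaseBuffer(bufs[i]);
}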


I'm confused about where the extra state lives that would allow the
communication required to build a larger I/O.  In the AIO branch, it
does look a little more like that, but there is more magic state and
machinery hiding behind the curtain: the backend's pending I/O list
builds up a chain of I/Os, 

Re: Issue in postgres_fdw causing unnecessary wait for cancel request reply

2024-01-11 Thread vignesh C
On Thu, 13 Apr 2023 at 23:34, Fujii Masao  wrote:
>
>
>
> On 2023/04/13 11:00, Kyotaro Horiguchi wrote:
> > Agreed, it seems to be a leftover when we moved to parse_int_param()
> > in that area.
>
> It looks like there was an oversight in commit e7a2217978. I've attached a 
> patch (0002) that updates PQconnectPoll() to use parse_int_param() for 
> parsing the keepalives parameter.
>
> As this change is not directly related to the bug fix, it may not be 
> necessary to back-patch it to the stable versions, I think. Thought?
>
>
> >> To clarify, are you suggesting that PQgetCancel() should
> >> only parse the parameters for TCP connections
> >> if cancel->raddr.addr.ss_family != AF_UNIX?
> >> If so, I think that's a good idea.
> >
> > You're right. I used connip in the diff because I thought it provided
> > the same condition, but in a simpler way.
>
> I made a modification to the 0001 patch. It will now allow PQgetCancel() to 
> parse and interpret TCP connection parameters only when the connection is not 
> made through a Unix-domain socket.
>
>
> > However, I notcied that PQgetCancel() doesn't set errbuf.. So, I'm
> > fine with your proposal.
>
> Ok.
>
>
> >> I think it is important to inform the user when an error
> >> occurs and a cancel request cannot be sent, as this information
> >> can help them identify the cause of the problem (such as
> >> setting an overly large value for the keepalives parameter).
> >
> > Although I view it as an internal error, I agree with emitting some
> > error messages in that situation.
>
> Ok.

I have changed the status of the patch to "Waiting on Author" as not all
the issues have been addressed. Feel free to address them and change the
status accordingly.

Regards,
Vignesh




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-11 Thread Melanie Plageman
On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz  wrote:
>
> On Thu, 11 Jan 2024 at 08:01, Michael Paquier  wrote:
> >
> > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  
> > > wrote:
> > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  
> > >> wrote:
> > >> Oh, I understand it now. Yes, that makes sense.
> > >> I thought removing op_bytes completely ( as you said "This patch
> > >> extends it with two more operation sizes, and there are even cases
> > >> where it may be a variable" ) from pg_stat_io view then adding
> > >> something like {read | write | extend}_bytes and {read | write |
> > >> extend}_calls could be better, so that we don't lose any information.
> > >
> > > Upthread, Michael says:
> > >
> > >> I find the use of 1 in this context a bit confusing, because when
> > >> referring to a counter at N, then it can be understood as doing N
> > >> times a operation,
> > >
> > > I didn't understand this argument, so I'm not sure if I agree or
> > > disagree with it.
> >
> > Nazir has mentioned upthread one thing: what should we do for the case
> > where a combination of (io_object,io_context) does I/O with a
> > *variable* op_bytes, because that may be the case for the WAL
> > receiver?  For this case, he has mentioned that we should set op_bytes
> > to 1, but that's something I find confusing because it would mean that
> > we are doing read, writes or extends 1 byte at a time.  My suggestion
> > would be to use op_bytes = -1 or NULL for the variable case instead,
> > with reads, writes and extends referring to a number of bytes rather
> > than a number of operations.
>
> I agree but we can't do this only for the *variable* cases since
> B_WAL_RECEIVER and other backends use the same
> pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
> is, if two backends use the same pgstat_count_io_op_time() function
> call in the code, they must count the same thing (number of calls,
> bytes, etc.). It could be better to count the number of bytes for all
> the IOOBJECT_WAL IOs.

I'm a bit confused by this. pgstat_count_io_op_time() can check
MyBackendType. In fact, you do this to ban the wal receiver already.
It is true that you would need to count all wal receiver normal
context wal object IOOps in the variable way, but I don't see how
pgstat_count_io_op_time() is the limiting factor as long as the
callsite is always doing either the number of bytes or the number of
calls.

> > > I think these are the three proposals for handling WAL reads:
> > >
> > > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > > number of calls to pg_pread() or similar
> > > 3) setting op_bytes to NULL and the number of reads is the number of
> > > calls to pg_pread() or similar
> >
> > 3) could be a number of bytes, actually.
>
> One important point is that we can't change only reads, if we decide
> to count the number of bytes for the reads; writes and extends should
> be counted as a number of bytes as well.

Yes, that is true.

> > > Looking at the patch, I think it is still doing 2.
> >
> > The patch disables stats for the WAL receiver, while the startup
> > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> > discard the variable case.
> >
> > > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > > all those IOOps for which it would be relevant. It kind of stinks but
> > > it would give us the freedom to document exactly what a single IOOp
> > > means for each combination of BackendType, IOContext, IOObject, and
> > > IOOp (as relevant) and still have an accurate number in the *bytes
> > > columns. Everyone will probably hate us if we do that, though.
> > > Especially because having bytes for the existing IOObjects is an
> > > existing feature.
> >
> > An issue I have with this one is that having multiple tuples for
> > each (object,context) if they have multiple op_bytes leads to
> > potentially a lot of bloat in the view.  That would be up to 8k extra
> > tuples just for the sake of op_byte's variability.
>
> Yes, that doesn't seem applicable to me.

My suggestion (again not sure it is a good idea) was actually that we
remove op_bytes and add "write_bytes", "read_bytes", and
"extend_bytes". AFAICT, this would add columns not rows. In this
schema, read bytes for wal receiver could be counted in one way and
writes in another. We could document that, for wal receiver, the reads
are not always done in units of the same size, so the read_bytes /
reads could be thought of as an average size of read.

Even if we made a separate view for WAL I/O stats, we would still have
this issue of variable sized I/O vs block sized I/O and would probably
end up solving it with separate columns for the number of bytes and
number of operations.
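
For what it's worth, a sketch of that layout (the column names come from the
proposal above; the struct itself is invented for illustration):

#include <stdint.h>

/* one row per (backend_type, io_object, io_context); reads/writes/extends
 * count operations, the *_bytes fields count bytes, so read_bytes / reads
 * gives an average read size even when the sizes vary */
typedef struct
{
    uint64_t reads;
    uint64_t read_bytes;
    uint64_t writes;
    uint64_t write_bytes;
    uint64_t extends;
    uint64_t extend_bytes;
} IOCountersSketch;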

> > > A separate question: suppose [1] goes in (to read WAL from WAL 

Re: Assertion failure in SnapBuildInitialSnapshot()

2024-01-11 Thread vignesh C
On Thu, 9 Feb 2023 at 12:02, Masahiko Sawada  wrote:
>
> On Wed, Feb 8, 2023 at 1:13 PM Amit Kapila  wrote:
> >
> > On Wed, Feb 8, 2023 at 1:19 AM Andres Freund  wrote:
> > >
> > > On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:
> > > > On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > Attached updated patches.
> > > > >
> > > >
> > > > Thanks, Andres, others, do you see a better way to fix this problem? I
> > > > have reproduced it manually and the steps are shared at [1] and
> > > > Sawada-San also reproduced it, see [2].
> > > >
> > > > [1] - 
> > > > https://www.postgresql.org/message-id/CAA4eK1KDFeh%3DZbvSWPx%3Dir2QOXBxJbH0K8YqifDtG3xJENLR%2Bw%40mail.gmail.com
> > > > [2] - 
> > > > https://www.postgresql.org/message-id/CAD21AoDKJBB6p4X-%2B057Vz44Xyc-zDFbWJ%2Bg9FL6qAF5PC2iFg%40mail.gmail.com
> > >
> > > Hm. It's worrisome to now hold ProcArrayLock exclusively while iterating 
> > > over
> > > the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
> > > non-negligible frequency.  Callers like CreateInitDecodingContext(), that 
> > > pass
> > > already_locked=true worry me a lot less, because obviously that's not a 
> > > very
> > > frequent operation.
> > >
> > > This is particularly not great because we need to acquire
> > > ReplicationSlotControlLock while already holding ProcArrayLock.
> > >
> > >
> > > But clearly there's a pretty large hole in the lock protection right now. 
> > > I'm
> > > a bit confused about why we (Robert and I, or just I) thought it's ok to 
> > > do it
> > > this way.
> > >
> > >
> > > I wonder if we could instead invert the locks, and hold
> > > ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), 
> > > and
> > > acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().
> > >
> >
> > Along with inverting, doesn't this mean that we need to acquire
> > ReplicationSlotControlLock in Exclusive mode instead of acquiring it
> > in shared mode? My understanding of the above locking scheme is that
> > in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
> > in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
> > release it after releasing ProcArrayLock. Then,
> > ReplicationSlotsComputeRequiredXmin() acquires
> > ReplicationSlotControlLock in Exclusive mode only when already_locked
> > is false and releases it after a call to
> > ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
> > won't change.
>
> I've attached the patch of this idea for discussion. In
> GetOldestSafeDecodingTransactionId() called by
> CreateInitDecodingContext(), we hold ReplicationSlotControlLock,
> ProcArrayLock, and XidGenLock at a time. So we would need to be
> careful about the ordering.
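
For illustration, the inverted ordering discussed above would look roughly
like this (a self-contained sketch using pthread mutexes as stand-ins for
ReplicationSlotControlLock and ProcArrayLock):

#include <pthread.h>

static pthread_mutex_t slot_control_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t proc_array_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned computed_xmin;
static unsigned slot_xmin_in_procarray;

/* hold the slot-control lock across the whole computation and take the
 * procarray lock only for the short publication step; every path must
 * then acquire the two locks in this same order to avoid deadlock */
static void
compute_and_set_required_xmin(void)
{
    pthread_mutex_lock(&slot_control_lock);

    computed_xmin = 0;   /* stand-in: aggregate the xmin over all slots */

    pthread_mutex_lock(&proc_array_lock);
    slot_xmin_in_procarray = computed_xmin;
    pthread_mutex_unlock(&proc_array_lock);

    pthread_mutex_unlock(&slot_control_lock);
}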

I have changed the status of the patch to "Waiting on Author" as
Robert's issues were not addressed yet. Feel free to change the status
accordingly after addressing them.

Regards,
Vignesh




Re: Set log_lock_waits=on by default

2024-01-11 Thread Michael Banck
Hi,

On Thu, Dec 21, 2023 at 02:29:05PM +0100, Laurenz Albe wrote:
> Here is a patch to implement this.
> Being stuck behind a lock for more than a second is almost
> always a problem, so it is reasonable to turn this on by default.

I also think that this should be set to on.

I had a look at the patch and it works fine. Regarding the
documentation, maybe the back-reference at deadlock_timeout could then
be made a bit more explicit as well, as in the attached patch, but
this is mostly bikeshedding.


Michael
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6d975ef7ab..7f9bdea50b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10185,11 +10185,12 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'

 

-When <xref linkend="guc-log-lock-waits"/> is set,
-this parameter also determines the amount of time to wait before
-a log message is issued about the lock wait.  If you are trying
-to investigate locking delays you might want to set a shorter than
-normal <varname>deadlock_timeout</varname>.
+When <xref linkend="guc-log-lock-waits"/> is set to
+<literal>on</literal> (which is the default), this parameter also
+determines the amount of time to wait before a log message is issued
+about the lock wait.  If you are trying to investigate locking delays
+you might want to set a shorter than normal
+<varname>deadlock_timeout</varname>.

   
  


Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2024-01-11 Thread vignesh C
On Tue, 17 Oct 2023 at 04:18, Thomas Munro  wrote:
>
> I pushed the retry-loop-in-frontend-executables patch and the
> missing-locking-in-SQL-functions patch yesterday.  That leaves the
> backup ones, which I've rebased and attached, no change.  It sounds
> like we need some more healthy debate about that backup label idea
> that would mean we don't need these (two birds with one stone), so
> I'll just leave these here to keep the cfbot happy in the meantime.

I have changed the status of this to "Waiting on Author" as Robert's
comments have not yet been handled. Feel free to post an updated
version and change the status accordingly.

Regards,
Vignesh




Re: Error "initial slot snapshot too large" in create replication slot

2024-01-11 Thread vignesh C
On Sat, 6 Jan 2024 at 01:47, Robert Haas  wrote:
>
> On Thu, Mar 23, 2023 at 11:02 PM Kyotaro Horiguchi
>  wrote:
> > [ new patch ]
>
> Well, I guess nobody is too excited about fixing this, because it's
> been another 10 months with no discussion. Andres doesn't even seem to
> think this is as much a bug as it is a limitation, for all that it's
> filed in the CF application under bug fixes. I kind of wonder if we
> should just close this entry in the CF, but I'll hold off on that for
> now.

I have changed the status of the patch to "Waiting on Author" as we
don't have a concrete patch, with an accepted design, that is in a
reviewable shape. We can consider whether we want to pursue this patch
further, or close it in the current commitfest and start again when
someone wants to work on it more actively.

Regards,
Vignesh




Re: Slightly improved meson error for docs tools

2024-01-11 Thread Magnus Hagander
On Wed, Jan 10, 2024 at 1:05 PM Aleksander Alekseev
 wrote:
>
> Hi,
>
> > least lists what's expected -- I'm not sure if this is the way to go,
> > or if we should somehow list the individual tools that are failing
> > here?
>
> Personally I think the patch is OK.

Thanks. I've pushed this one for now, we can always adjust further
later if needed.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Postgres Partitions Limitations (5.11.2.3)

2024-01-11 Thread Magnus Hagander
On Thu, Jan 11, 2024 at 11:24 AM Ashutosh Bapat
 wrote:
>
> On Wed, Jan 10, 2024 at 10:38 PM Laurenz Albe  
> wrote:
> >
> > On Wed, 2024-01-10 at 13:41 +0100, Magnus Hagander wrote:
> > > It still reads a bit weird to me. How about the attached wording instead?
> >
> > Thanks!  I am fine with your wording.
>
> Works for me too.

Thanks, applied and backpatched all the way.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: System username in pg_stat_activity

2024-01-11 Thread Magnus Hagander
On Wed, Jan 10, 2024 at 3:12 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Wed, Jan 10, 2024 at 02:59:42PM +0100, Magnus Hagander wrote:
> > On Wed, Jan 10, 2024 at 2:56 PM Bertrand Drouvot
> > I definitely think it should be the same. If it's not exactly the
> > same, then it should be *two* columns, one with auth method and one
> > with the name.
> >
> > And thinking more about it maybe that's cleaner, because that makes it
> > easier to do things like filter based on auth method?
>
> Yeah, that sounds even better.
>
> >
> > > > + 
> > > > +  
> > > > +   authname name
> > > > +  
> > > > +  
> > > > +   The authentication method and identity (if any) that the user
> > > > +   used to log in. It contains the same value as
> > > > +returns in the backend.
> > > > +  
> > > > + 
> > >
> > > I'm fine with auth_method:identity.
> > >
> > > > +S.authname,
> > >
> > > What about using system_user as the field name? (because if we keep
> > > auth_method:identity it's not really the authname anyway).
> >
> > I was worried system_user or sysuser would both be confusing with the
> > fact that we have usesysid -- which would reference a *different*
> > sys...
>
> If we go the two-fields way, then what about auth_identity and auth_method?


Here is an updated patch based on this idea.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index de78d58d4b..ec6a25495f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23462,7 +23462,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
 
   

-
+
  system_user
 
 system_user
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b804eb8b5e..9a3ab8d06f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -784,6 +784,28 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ <row>
+  <entry role="catalog_table_entry"><para role="column_definition">
+   <structfield>auth_method</structfield> <type>text</type>
+  </para>
+  <para>
+   The authentication method used for authenticating the connection, or
+   NULL for background processes.
+  </para></entry>
+ </row>
+
+ <row>
+  <entry role="catalog_table_entry"><para role="column_definition">
+   <structfield>auth_identity</structfield> <type>text</type>
+  </para>
+  <para>
+   The identity (if any) that the user presented during the authentication
+   cycle before they were assigned a database role.  Contains the same
+   value as <function>system_user()</function> returns in the backend, or NULL
+   for background processes.
+  </para></entry>
+ </row>
+
  
   
application_name text
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e43e36f5ac..b68c382de1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -873,6 +873,8 @@ CREATE VIEW pg_stat_activity AS
 S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
+S.auth_method,
+S.auth_identity,
 S.application_name,
 S.client_addr,
 S.client_hostname,
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 9bbdc4beb0..6621969154 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -625,9 +625,20 @@ ClientAuthentication(Port *port)
 			status = CheckRADIUSAuth(port);
 			break;
 		case uaCert:
-			/* uaCert will be treated as if clientcert=verify-full (uaTrust) */
+
+			/*
+			 * uaCert will be treated as if clientcert=verify-full further
+			 * down
+			 */
+			break;
 		case uaTrust:
 			status = STATUS_OK;
+
+			/*
+			 * Trust doesn't set_authn_id(), but we still need to store the
+			 * auth_method
+			 */
+			MyClientConnectionInfo.auth_method = uaTrust;
 			break;
 	}
 
@@ -645,6 +656,12 @@ ClientAuthentication(Port *port)
 #endif
 	}
 
+	/*
+	 * All auth methods should have either called set_authn_id() or manually
+	 * set the auth_method if they were successful.
+	 */
+	Assert(status != STATUS_OK || MyClientConnectionInfo.auth_method != 0);
+
 	if (Log_connections && status == STATUS_OK &&
 		!MyClientConnectionInfo.authn_id)
 	{
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 3dac79c30e..4bb4435c81 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -357,6 +357,12 @@ pgstat_bestart(void)
 	else
		MemSet(&lbeentry.st_clientaddr, 0, sizeof(lbeentry.st_clientaddr));
 
+	lbeentry.st_auth_method = MyClientConnectionInfo.auth_method;
+	if (MyProcPort && MyClientConnectionInfo.authn_id)
+		strlcpy(lbeentry.st_auth_identity, MyClientConnectionInfo.authn_id, NAMEDATALEN);
+	else
		MemSet(&lbeentry.st_auth_identity, 0, sizeof(lbeentry.st_auth_identity));
+
 #ifdef USE_SSL
 	if (MyProcPort && MyProcPort->ssl_in_use)
 	{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 30a2063505..24490f7ae8 100644
--- 

Re: Fix log_line_prefix to display the transaction id (%x) for statements not in a transaction block

2024-01-11 Thread jian he
Hi.
+ /* Log immediately if dictated by log_statement and XID assigned. */
+ if (GetTopTransactionIdIfAny() != InvalidTransactionId &&
+ check_log_statement(parsetree_list))

change to

+ /* Log immediately if dictated by log_statement and XID assigned. */
+ if (check_log_statement(parsetree_list) &&
+ GetTopTransactionIdIfAny() != InvalidTransactionId)

I think it would reduce GetTopTransactionIdIfAny() calls, since &&
short-circuits and the XID lookup then only runs when
check_log_statement() passes.

I guess people will have different opinions about one consequence: for a
simple query like
`explain(analyze) select g from generate_series(1,1e6) g, pg_sleep(10);`
the log output will only be generated after 10 seconds.
Of course, there is pg_stat_activity and other ways to view the running query.

Playing around with the patch: it behaves better than the current HEAD
in some cases. Both runs below use the following settings:
alter system set log_line_prefix = '%m [%p] %q%u@%d/%a XID:%x ';
set log_statement = 'all';
select pg_reload_conf();

With Patch:
src3=# create table x1(a int);
2024-01-11 17:11:51.150 CST [115782] jian@src3/psql XID:814 LOG:
statement: create table x1(a int);
CREATE TABLE
src3=#
src3=# insert into x1 select 100;
2024-01-11 17:12:06.953 CST [115782] jian@src3/psql XID:815 LOG:
statement: insert into x1 select 100;
INSERT 0 1
src3=# begin;
2024-01-11 17:12:17.543 CST [115782] jian@src3/psql XID:0 LOG:
statement: begin;
BEGIN
src3=*# insert into x1 select 100;
2024-01-11 17:12:24.779 CST [115782] jian@src3/psql XID:816 LOG:
statement: insert into x1 select 100;
INSERT 0 1
src3=*# commit;
2024-01-11 17:12:29.851 CST [115782] jian@src3/psql XID:816 LOG:
statement: commit;
COMMIT
src3=# select 11;
2024-01-11 17:14:22.909 CST [115782] jian@src3/psql XID:0 LOG:
statement: select 11;
 ?column?
----------
       11
(1 row)
src3=# drop table x1;
2024-01-11 17:15:01.409 CST [115782] jian@src3/psql XID:817 LOG:
statement: drop table x1;
DROP TABLE
src3=# select pg_current_xact_id();
2024-01-11 17:21:55.602 CST [115782] jian@src3/psql XID:818 LOG:
statement: select pg_current_xact_id();
 pg_current_xact_id
--------------------
                818
(1 row)
-
without patch:

src4=# insert into x1 select 100;
2024-01-11 17:07:13.556 CST [115240] jian@src4/psql XID:0 LOG:
statement: insert into x1 select 100;
INSERT 0 1
src4=# begin;
2024-01-11 17:07:31.345 CST [115240] jian@src4/psql XID:0 LOG:
statement: begin;
BEGIN
src4=*# insert into x1 select 100;
2024-01-11 17:07:35.475 CST [115240] jian@src4/psql XID:0 LOG:
statement: insert into x1 select 100;
INSERT 0 1
src4=*# commit;
2024-01-11 17:07:39.095 CST [115240] jian@src4/psql XID:863 LOG:
statement: commit;
COMMIT
src4=# show logging_collector;
2024-01-11 17:09:59.307 CST [115240] jian@src4/psql XID:0 LOG:
statement: show logging_collector;
 logging_collector
-------------------
 off
(1 row)
src4=# select 11;
2024-01-11 17:14:30.001 CST [115240] jian@src4/psql XID:0 LOG:
statement: select 11;
 ?column?
----------
       11
(1 row)
src4=# drop table x1;
2024-01-11 17:15:08.010 CST [115240] jian@src4/psql XID:0 LOG:
statement: drop table x1;
DROP TABLE
src4=# select pg_current_xact_id();
2024-01-11 17:21:22.085 CST [115240] jian@src4/psql XID:0 LOG:
statement: select pg_current_xact_id();
 pg_current_xact_id
--------------------
                867
(1 row)




Re: tablecmds.c/MergeAttributes() cleanup

2024-01-11 Thread Alvaro Herrera
On 2023-Dec-06, Peter Eisentraut wrote:

> One of your (Álvaro's) comments about this earlier was
> 
> > Hmm, crazy.  I'm not sure I like this, because it seems much too clever.
> > The number of lines that are deleted is alluring, though.
> >
> > Maybe it'd be better to create a separate routine that takes a single
> > ColumnDef and returns the Form_pg_attribute element for it; then use
> > that in both BuildDescForRelation and ATExecAddColumn.
> 
> which was also my thought at the beginning.  However, this wouldn't quite
> work that way, for several reasons.  For instance, BuildDescForRelation()
> also needs to keep track of the has_not_null property across all fields, and
> so if you split that function up, you would have to somehow make that an
> output argument and have the caller keep track of it.  Also, the output of
> BuildDescForRelation() in ATExecAddColumn() is passed into
> InsertPgAttributeTuples(), which requires a tuple descriptor anyway, so
> splitting this up into a per-attribute function would then require
> ATExecAddColumn() to re-assemble a tuple descriptor anyway, so this wouldn't
> save anything.

Hmm.  Well, if this state of affairs is useful to you, then I withdraw
my objection, because with this patch we're not really adding any new
weirdness, just moving around already-existing weirdness.  So let's
press ahead with 0001.  (I didn't look at 0002 this time, since
apparently you'd like to process the other patch first and then come
back here.)


If you look closely at InsertPgAttributeTuples and accompanying code, it
all looks a bit archaic.  They seem to be treating TupleDesc as a
glorified array of Form_pg_attribute elements in a convenient packaging.
It's probably cleaner to change these APIs so that they deal with a
Form_pg_attribute array, and not TupleDesc anymore.  But we can hack on
that some other day.
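
A rough sketch of the API shape suggested above, with stand-in types
(InsertPgAttributeEntries is an invented name, not an existing function):

typedef struct FormData_pg_attribute FormData_pg_attribute; /* opaque here */
typedef FormData_pg_attribute *Form_pg_attribute;

/* instead of handing over a TupleDesc that is really just used as a
 * container, callers would pass the attribute array directly */
extern void InsertPgAttributeEntries(Form_pg_attribute *attrs, int natts);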

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)




Re: make pg_ctl more friendly

2024-01-11 Thread Nazir Bilal Yavuz
Hi,

On Wed, 10 Jan 2024 at 06:33, Junwang Zhao  wrote:
>
> Hi Nazir,
>
> On Tue, Jan 9, 2024 at 9:23 PM Nazir Bilal Yavuz  wrote:
> >
> > v2-0001-PITR-shutdown-should-not-report-error-by-pg_ctl.patch:
> >
> > -   "Examine the log output.\n"),
> > +   "Examine the log output\n"),
> >   progname);
> >
> > I don't think that this is needed.
> There seems to be no consistent convention for the ending dot when using
> write_stderr, so I will leave this unchanged.
>
> >
> > Other than that, the patch looks good and I confirm that after PITR 
> > shutdown:
> >
> > "PITR shutdown"
> > "update configuration for startup again if needed"
> >
> > message shows up, instead of:
> >
> > "pg_ctl: could not start server"
> > "Examine the log output.".
> >
> > nitpick: it would be better if the order of the error message cases
> > matched the enum (i.e. 'POSTMASTER_RECOVERY_SHUTDOWN' before
> > 'POSTMASTER_FAILED' in the enum).
> Agreed, fixed in V3

Thank you for the update. It looks good to me.

--
Regards,
Nazir Bilal Yavuz
Microsoft




RE: speed up a logical replica setup

2024-01-11 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

I have been concerned that the patch has not been tested by cfbot due to an
application error. Also, some comments were raised. Therefore, I created a
patch set to move things forward.
I also tried to address some comments that were not strongly contested by
others; those changes are included in the 0003 patch.

* 0001 patch
It is almost the same as v3-0001, which was posted by Euler.
An unnecessary change to Mkvcbuild.pm (that file has since been removed) was dropped.

* 0002 patch
This contains small fixes to keep the compiler quiet.

* 0003 patch
This addresses comments posted to -hackers. For now, this does not contain
documentation.
I will add it if everyone agrees with these ideas.

1.
An option --port was added to control the port number used for the physical
standby. Users can specify a port number via the option, or via the
environment variable PGSUBPORT. If neither is specified, a fixed value
(50111) is used.

SOURCE: [1]

2.
A FATAL error would be raised if --subscriber-conninfo specifies a
non-local server.

SOURCE: [2]

3. 
Options -o/-O were added to specify options for publications/subscriptions.

SOURCE: [2]

4. 
Made the standby save its output to a log file.

SOURCE: [2]

5. 
Unnecessary Assert in drop_replication_slot() was removed.

SOURCE: [3]

What do you think?
Thanks to Shlok and Vignesh for working with me offline.

[1]: 
https://www.postgresql.org/message-id/TY3PR01MB988978C7362A101927070D29F56A2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/TY3PR01MB9889593399165B9A04106741F5662%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[3]: 
https://www.postgresql.org/message-id/CALDaNm098Jkbh%2Bye6zMj9Ro9j1bBe6FfPV80BFbs1%3DpUuTJ07g%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


v4-0001-Creates-a-new-logical-replica-from-a-standby-serv.patch
Description:  v4-0001-Creates-a-new-logical-replica-from-a-standby-serv.patch


v4-0002-Fixed-small-bugs-to-keep-compiler-quiet.patch
Description: v4-0002-Fixed-small-bugs-to-keep-compiler-quiet.patch


v4-0003-Address-some-comments-proposed-on-hackers.patch
Description: v4-0003-Address-some-comments-proposed-on-hackers.patch


Re: A recent message added to pg_upgade

2024-01-11 Thread Alvaro Herrera
On 2024-Jan-11, Michael Paquier wrote:

> Hence, how about something like that:
> "The logical replication launcher is disabled during binary upgrades,
> as a logical replication workers running on the cluster upgrading from
> could cause replication origins to move forward after they are copied
> to the cluster upgrading to, creating potentially conflicts with the
> physical files copied over." 

"The logical replication launcher is disabled during binary upgrades, to
avoid logical replication workers running on the source cluster.  That
would cause replication origins to move forward after having been copied
to the target cluster, potentially creating conflicts with the copied
data files."

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Linux transformó mi computadora, de una `máquina para hacer cosas',
en un aparato realmente entretenido, sobre el cual cada día aprendo
algo nuevo" (Jaime Salinas)




Re: Synchronizing slots from primary to standby

2024-01-11 Thread Amit Kapila
On Thu, Jan 11, 2024 at 3:42 PM Dilip Kumar  wrote:
>
> On Wed, Jan 10, 2024 at 5:53 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
>
> > > IIUC on the standby we just want to overwrite what we get from primary 
> > > no? If
> > > so why we are using those APIs that are meant for the actual decoding 
> > > slots
> > > where it needs to take certain logical decisions instead of mere 
> > > overwriting?
> >
> > I think we don't have a strong reason to use these APIs, but it was 
> > convenient to
> > use these APIs as they can take care of updating the slots info and will 
> > call
> > functions like, ReplicationSlotsComputeRequiredXmin,
> > ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly 
> > overwriting
> > the fields and calling these manually?
>
> I might be missing something but do you want to call
> ReplicationSlotsComputeRequiredXmin() kind of functions in standby?  I
> mean those will ultimately update the catalog xmin and replication
> xmin in Procarray and that prevents Vacuum from cleaning up some of
> the required xids.  But on standby, those shared memory parameters are
> not used IIUC.
>

These xmins are required for logical slots as we allow logical
decoding on standby (see GetOldestSafeDecodingTransactionId()). We
also invalidate such slots if the required rows are removed on
standby. Similarly, we need ReplicationSlotsComputeRequiredLSN() to
avoid getting the required WAL removed.

> In my opinion on standby, we just need to update the values in the
> local slots and whatever we get from remote slots without taking all
> the logical decisions in the hope that they will all fall into a
> particular path, for example, if you see LogicalIncreaseXminForSlot(),
> it is doing the following steps of operations as shown below [1].  These
> all make sense when you are doing candidate-based updates where we
> first mark the candidates and then update the candidate to real value
> once you get the confirmation for the LSN.  Now following all this
> logic looks completely weird unless this can fall in a different path
> I feel it will do some duplicate steps as well.  For example in
> local_slot_update(), first you call LogicalConfirmReceivedLocation()
> which will set the 'data.confirmed_flush' and then you will call
> LogicalIncreaseXminForSlot() which will set the 'updated_xmin = true;'
> and will again call LogicalConfirmReceivedLocation().
>

In case (else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)),
even the updated_xmin is not getting set to true which means there is
a chance that we will never update the required xmin values.

> I don't think
> this is the correct way of reusing the function unless you need to go
> through those paths and I am missing something.
>

I agree with this conclusion and also think that we should directly
update the required fields and call functions like
ReplicationSlotsComputeRequiredLSN() wherever required.
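
A minimal sketch of that direct-update flow, with simplified stand-in types
(only the names of the two recompute routines come from the thread; they are
stubbed out here):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

typedef struct
{
    XLogRecPtr    restart_lsn;
    XLogRecPtr    confirmed_flush;
    TransactionId catalog_xmin;
} SlotContentsSketch;

/* stubs standing in for the real routines */
static void ReplicationSlotsComputeRequiredXmin(bool already_locked) { (void) already_locked; }
static void ReplicationSlotsComputeRequiredLSN(void) { }

/* overwrite the local slot with what the primary reported, then refresh
 * the global horizons; no candidate-based logic involved */
static void
sync_slot_fields(SlotContentsSketch *local, const SlotContentsSketch *remote)
{
    local->restart_lsn = remote->restart_lsn;
    local->confirmed_flush = remote->confirmed_flush;
    local->catalog_xmin = remote->catalog_xmin;

    ReplicationSlotsComputeRequiredXmin(false);
    ReplicationSlotsComputeRequiredLSN();
}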

-- 
With Regards,
Amit Kapila.




Re: POC: GROUP BY optimization

2024-01-11 Thread Alexander Korotkov
Hi!

On Tue, Jan 9, 2024 at 1:14 PM Pavel Borisov  wrote:
>> Hmm, I don't see this old code in these patches. Resend 0002-* because
>> of trailing spaces.
>
>
> AFAIK, cfbot does not seek old versions of patchset parts in previous 
> messages. So for it to run correctly, a new version of the whole patchset 
> should be sent even if most patches are unchanged.

Please find the revised patchset with some refactoring and comment
improvements from me.  I'll continue looking into this.

--
Regards,
Alexander Korotkov


0001-Generalize-common-code-of-adding-sort-befor-20240111.patch
Description: Binary data


0002-Explore-alternative-orderings-of-group-by-p-20240111.patch
Description: Binary data


Re: Make all Perl warnings fatal

2024-01-11 Thread Bharath Rupireddy
On Sat, Dec 30, 2023 at 12:57 AM Peter Eisentraut  wrote:
>
> committed

With the commit c5385929 converting perl warnings to FATAL, use of
psql/safe_psql with timeout parameters [1] fails with the following
error:

Use of uninitialized value $ret in bitwise and (&) at
/home/ubuntu/postgres/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
line 2015.

Perhaps assigning a default error code to $ret instead of undef in
PostgreSQL::Test::Cluster's psql() function is the solution.

[1]
use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('test');
$node->init;
$node->start;

my ($stdout, $stderr, $timed_out);
my $cmdret = $node->psql('postgres', q[SELECT pg_sleep(600)],
  stdout => \$stdout, stderr => \$stderr,
  timeout => 5,
  timed_out => \$timed_out,
  extra_params => ['--single-transaction'],
  on_error_die => 1);
print "pg_sleep timed out" if $timed_out;

done_testing();

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-11 Thread Nazir Bilal Yavuz
Hi,

On Thu, 11 Jan 2024 at 08:01, Michael Paquier  wrote:
>
> On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > I have code review feedback as well, but I've saved that for my next email.
>
> Ah, cool.
>
> > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  wrote:
> >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
> >> Oh, I understand it now. Yes, that makes sense.
> >> I thought removing op_bytes completely ( as you said "This patch
> >> extends it with two more operation sizes, and there are even cases
> >> where it may be a variable" ) from pg_stat_io view then adding
> >> something like {read | write | extend}_bytes and {read | write |
> >> extend}_calls could be better, so that we don't lose any information.
> >
> > Upthread, Michael says:
> >
> >> I find the use of 1 in this context a bit confusing, because when
> >> referring to a counter at N, then it can be understood as doing N
> >> times an operation,
> >
> > I didn't understand this argument, so I'm not sure if I agree or
> > disagree with it.
>
> Nazir has mentioned upthread one thing: what should we do for the case
> where a combination of (io_object,io_context) does I/O with a
> *variable* op_bytes, because that may be the case for the WAL
> receiver?  For this case, he has mentioned that we should set op_bytes
> to 1, but that's something I find confusing because it would mean that
> we are doing reads, writes or extends 1 byte at a time.  My suggestion
> would be to use op_bytes = -1 or NULL for the variable case instead,
> with reads, writes and extends referring to a number of bytes rather
> than a number of operations.

I agree but we can't do this only for the *variable* cases since
B_WAL_RECEIVER and other backends use the same
pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
is, if two backends use the same pgstat_count_io_op_time() function
call in the code, they must count the same thing (number of calls,
bytes, etc.). It could be better to count the number of bytes for all
the IOOBJECT_WAL IOs.

> > I think these are the three proposals for handling WAL reads:
> >
> > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > number of calls to pg_pread() or similar
> > 3) setting op_bytes to NULL and the number of reads is the number of
> > calls to pg_pread() or similar
>
> 3) could be a number of bytes, actually.

One important point is that we can't change only reads, if we decide
to count the number of bytes for the reads; writes and extends should
be counted as a number of bytes as well.

> > Looking at the patch, I think it is still doing 2.
>
> The patch disables stats for the WAL receiver, while the startup
> process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> discard the variable case.
>
> > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > all those IOOps for which it would be relevant. It kind of stinks but
> > it would give us the freedom to document exactly what a single IOOp
> > means for each combination of BackendType, IOContext, IOObject, and
> > IOOp (as relevant) and still have an accurate number in the *bytes
> > columns. Everyone will probably hate us if we do that, though.
> > Especially because having bytes for the existing IOObjects is an
> > existing feature.
>
> An issue I have with this one is that having multiple tuples for
> each (object,context) if they have multiple op_bytes leads to
> potentially a lot of bloat in the view.  That would be up to 8k extra
> tuples just for the sake of op_byte's variability.

Yes, that doesn't seem applicable to me.

> > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > directly). Now, WAL reads are not from permanent storage anymore. Are
> > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > question for some of the WAL receiver functions. Should we track any
> > I/O other than permanent storage I/O? Or did I miss this being
> > addressed upthread?
>
> That's a good point.  I guess that this should just be a different
> IOOp?  That's not an IOOP_READ.  An IOOP_HIT is also different.

I think a different IOContext, rather than a different IOOp, suits this better.

> > In terms of what I/O we should track in a streaming/asynchronous
> > world, the options would be:
> >
> > 1) track read/write syscalls
> > 2) track blocks of BLCKSZ submitted to the kernel
> > 3) track bytes submitted to the kernel
> > 4) track merged I/Os (after doing any merging in the application)
> >
> > I think the debate was largely between 2 and 4. There was some
> > disagreement, but I think we landed on 2 because there is merging that
> > can happen at many levels in the storage stack (even the storage
> > controller). Distinguishing between whether or not Postgres submitted
> > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> > but I think it might be confusing 

Re: proposal: psql: show current user in prompt

2024-01-11 Thread Jelte Fennema-Nio
On Tue, 12 Sept 2023 at 09:46, Peter Eisentraut  wrote:
> ISTM that for a purpose like pgbouncer, it would be simpler to add a new
> GUC "report these variables" and send that in the startup message?  That
> might not help with the psql use case, but it would be much simpler.

FYI I implemented it that way yesterday on this other thread (patch
0010 of that patchset):
https://www.postgresql.org/message-id/flat/CAGECzQScQ3N-Ykv2j4NDyDtrPPc3FpRoa%3DLZ-2Uj2ocA4zr%3D4Q%40mail.gmail.com#cd9e8407820d492e8f677ee6a67c21ce




Re: Clean up some signal usage mainly related to Windows

2024-01-11 Thread vignesh C
On Thu, 7 Dec 2023 at 04:50, Nathan Bossart  wrote:
>
> On Wed, Dec 06, 2023 at 11:30:02AM -0600, Nathan Bossart wrote:
> > On Wed, Dec 06, 2023 at 06:27:04PM +0100, Peter Eisentraut wrote:
> >> Makes sense.  Can you commit that?
> >
> > Yes, I will do so shortly.
>
> Committed.  Apologies for the delay.

I have marked the commitfest entry as committed as the patch has been committed.

Regards,
Vignesh




Re: A failure in t/038_save_logical_slots_shutdown.pl

2024-01-11 Thread vignesh C
On Wed, 10 Jan 2024 at 18:37, vignesh C  wrote:
>
> On Wed, 10 Jan 2024 at 14:08, Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > I've been observing a failure in t/038_save_logical_slots_shutdown.pl
> > of late on my developer system:
> >
> > t/038_save_logical_slots_shutdown.pl .. 1/?
> > #   Failed test 'Check that the slot's confirmed_flush LSN is the same
> > as the latest_checkpoint location'
> > #   at t/038_save_logical_slots_shutdown.pl line 35.
> > # Looks like you failed 1 test of 2.
> > t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1
> > (wstat 256, 0x100)
> > Failed 1/2 subtests
> >
> > I did a quick analysis of the failure and commit
> > https://github.com/postgres/postgres/commit/e0b2eed047df9045664da6f724cb42c10f8b12f0
> > that introduced this test. I think the issue is that the slot's
> > confirmed_flush LSN (0/1508000) and shutdown checkpoint LSN
> > (0/1508018) are not the same:
> >
> > tmp_check/log/038_save_logical_slots_shutdown_pub.log:
> >
> > 2024-01-10 07:55:44.539 UTC [57621] sub LOG:  starting logical
> > decoding for slot "sub"
> > 2024-01-10 07:55:44.539 UTC [57621] sub DETAIL:  Streaming
> > transactions committing after 0/1508000, reading WAL from 0/1507FC8.
> > 2024-01-10 07:55:44.539 UTC [57621] sub STATEMENT:  START_REPLICATION
> > SLOT "sub" LOGICAL 0/0 (proto_version '4', origin 'any',
> > publication_names '"pub"')
> >
> > ubuntu:~/postgres$ pg17/bin/pg_controldata -D
> > src/test/recovery/tmp_check/t_038_save_logical_slots_shutdown_pub_data/pgdata/
> > Database cluster state:   in production
> > pg_control last modified: Wed Jan 10 07:55:44 2024
> > Latest checkpoint location:   0/1508018
> > Latest checkpoint's REDO location:0/1508018
> >
> > But the tests added by t/038_save_logical_slots_shutdown.pl expects
> > both LSNs to be same:
> >
> > sub compare_confirmed_flush
> > {
> > # Is it same as the value read from log?
> > ok( $latest_checkpoint eq $confirmed_flush_from_log,
> > "Check that the slot's confirmed_flush LSN is the same as the
> > latest_checkpoint location"
> > );
> >
> > I suspect that it's not quite right to expect the slot's
> > confirmed_flush and latest checkpoint location to be the same in the test.
> > This is because the shutdown checkpoint gets an LSN that's greater
> > than the slot's confirmed_flush LSN - see the shutdown checkpoint
> > record getting inserted into WAL after the slot is marked dirty in
> > CheckPointReplicationSlots().
> >
> > With this analysis in mind, I think the tests need to do something
> > like the following:
> >
> > diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> > b/src/test/recovery/t/038_save_logical_slots_shut
> > down.pl
> > index 5a4f5dc1d4..d49e6014fc 100644
> > --- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> > +++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
> > @@ -32,7 +32,7 @@ sub compare_confirmed_flush
> >   unless defined($latest_checkpoint);
> >
> > # Is it same as the value read from log?
> > -   ok( $latest_checkpoint eq $confirmed_flush_from_log,
> > +   ok( $latest_checkpoint ge $confirmed_flush_from_log,
> > "Check that the slot's confirmed_flush LSN is the same
> > as the latest_checkpoint location"
> > );
> >
> > Thoughts?
>
> I got the log files from Bharath offline. Thanks Bharath for sharing
> them.
> The WAL record sequence is exactly the same in the failing test and in
> the tests which are passing.
> One observation: in the passing case the confirmed_flush LSN points
> exactly to the shutdown checkpoint, but in the failing test the LSN it
> points to is invalid; pg_waldump says that address is invalid, skips
> about 24 bytes, and then sees a valid record.
>
> Passing case confirm flush lsn(0/150D158) from my machine:
> pg_waldump 000000010000000000000001 -s 0/150D158
> rmgr: XLOG  len (rec/tot):    114/   114, tx:          0, lsn:
> 0/0150D158, prev 0/0150D120, desc: CHECKPOINT_SHUTDOWN redo 0/150D158;
> tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0;
> oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
> timestamp xid: 0/0; oldest running xid 0; shutdown
>
> Failing case confirm flush lsn( 0/1508000) from failing tests log file:
> pg_waldump 000000010000000000000001 -s 0/1508000
> pg_waldump: first record is after 0/1508000, at 0/1508018, skipping
> over 24 bytes
> rmgr: XLOG  len (rec/tot):    114/   114, tx:          0, lsn:
> 0/01508018, prev 0/01507FC8, desc: CHECKPOINT_SHUTDOWN redo 0/1508018;
> tli 1; prev tli 1; fpw true; xid 0:739; oid 16388; multi 1; offset 0;
> oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
> timestamp xid: 0/0; oldest running xid 0; shutdown
>
> I'm still not sure why, in this case, it is not pointing exactly to a
> valid WAL record and has to skip 24 bytes to find the valid shutdown
> checkpoint record.
> I will investigate this 

Re: Synchronizing slots from primary to standby

2024-01-11 Thread Amit Kapila
On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila  wrote:
>
> +static bool
> +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot)
> {
> ...
> + /* Slot ready for sync, so sync it. */
> + else
> + {
> + /*
> + * Sanity check: With hot_standby_feedback enabled and
> + * invalidations handled appropriately as above, this should never
> + * happen.
> + */
> + if (remote_slot->restart_lsn < slot->data.restart_lsn)
> + elog(ERROR,
> + "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> + " to remote slot's LSN(%X/%X) as synchronization"
> + " would move it backwards", remote_slot->name,
> + LSN_FORMAT_ARGS(slot->data.restart_lsn),
> + LSN_FORMAT_ARGS(remote_slot->restart_lsn));
> ...
> }
>
> I was thinking about the above code in the patch, and as far as I can
> tell this can only occur if a slot with the same name is re-created with
> a prior restart_lsn after the existing slot is dropped.
> newly created slot (with the same name) will have higher restart_lsn
> but one can mimic it by copying some older slot by using
> pg_copy_logical_replication_slot().
>
> As mentioned in the comments, I don't think the above should happen
> even if hot_standby_feedback is temporarily set to off. That can only
> lead to invalidated slots on standby.
>
> To close the above race, I could think of the following ways:
> 1. Drop and re-create the slot.
> 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves
> ahead of local_slot's LSN then we can update it; but as mentioned in
> your previous comment, we need to update all other fields as well. If
> we follow this then we probably need to have a check for catalog_xmin
> as well.
>

The second point as mentioned is slightly misleading, so let me try to
rephrase it once again: Emit LOG/WARNING in this case and once
remote_slot's LSN moves ahead of local_slot's LSN then we can update
it; additionally, we need to update all other fields like two_phase as
well. If we follow this, then we probably need to have a check for
catalog_xmin as well, along with remote_slot's restart_lsn.

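For illustration, the extended check could look roughly like this (simplified
stand-in types; real code would compare xids with TransactionIdPrecedes() to
be wraparound-safe):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

/* true if accepting the remote values would move the local slot backwards */
static bool
sync_would_move_backwards(XLogRecPtr remote_restart_lsn,
                          XLogRecPtr local_restart_lsn,
                          TransactionId remote_catalog_xmin,
                          TransactionId local_catalog_xmin)
{
    return remote_restart_lsn < local_restart_lsn ||
           remote_catalog_xmin < local_catalog_xmin;
}
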
> Now, related to this the other case which needs some handling is what
> if the remote_slot's restart_lsn is greater than local_slot's
> restart_lsn but it is a re-created slot with the same name. In that
> case, I think the other properties like 'two_phase', 'plugin' could be
> different. So, is simply copying those sufficient or do we need to do
> something else as well?
>

Bertrand, Dilip, Sawada-San, and others, please share your opinion on
this problem as I think it is important to handle this race condition.

-- 
With Regards,
Amit Kapila.




Re: Postgres Partitions Limitations (5.11.2.3)

2024-01-11 Thread Ashutosh Bapat
On Wed, Jan 10, 2024 at 10:38 PM Laurenz Albe  wrote:
>
> On Wed, 2024-01-10 at 13:41 +0100, Magnus Hagander wrote:
> > It still reads a bit weird to me. How about the attached wording instead?
>
> Thanks!  I am fine with your wording.

Works for me too.

-- 
Best Wishes,
Ashutosh Bapat




Re: Make attstattarget nullable

2024-01-11 Thread Peter Eisentraut

On 10.01.24 14:16, Alvaro Herrera wrote:
>> Here is an updated patch rebased over 3e2e0d5ad7.
>>
>> The 0001 patch stands on its own, but I also tacked on two additional WIP
>> patches that simplify some pg_attribute handling and make these kinds of
>> refactorings simpler in the future.  See description in the patches.
>
> I didn't look at 0002 and 0003, since they're marked as WIP.  (But I did
> like the removal that happens in 0003, so I hope these two also make it
> to 17).

Here is an updated patch set.  I have addressed your comments on 0001.
I looked again at 0002 and 0003 and I was quite happy with them, so I
just removed the WIP label and also added a few more code comments, but
otherwise didn't change anything.

> Seems reasonable.  Do we really need a catversion bump for this?

Yes, this changes the order of the fields in pg_attribute.

> I like that we now have SET STATISTICS DEFAULT rather than -1 to reset
> to default.  Do we want to document that setting explicitly to -1
> continues to have that behavior?  (I would add something like "Setting
> to a value of -1 is an obsolete spelling to get the same behavior."
> after the phrase that explains DEFAULT in the ALTER TABLE manpage.)

done

> I noticed that equalTupleDescs no longer compares attstattarget, and
> this is because the field is not in TupleDesc anymore.  I looked at the
> callers of equalTupleDescs and I think this is exactly what we want
> (precisely because attstattarget is no longer in TupleDesc.)

Yes, I had investigated that in some detail, and I think it's ok.  I
think equalTupleDescs() is actually mostly useless and I plan to start a
separate discussion on that.

>> This changes the pg_attribute field attstattarget into a nullable field
>> in the variable-length part of the row.
>
> I don't think we use "the variable-length part of the row" as a term
> anywhere.  We only have the variable-length columns, and we made a bit
> of a mistake in using CATALOG_VARLEN to differentiate the part of the
> catalogs that are not mapped to the structs (because at the time those
> were in effect only the variable length fields).  I think this is
> largely not a problem, but let's be careful with how we word the related
> comments.  So:

Yeah, there are multiple ways to interpret this.  There are fields with
varlena headers, but there are also fields that are not fixed-length as
far as struct access to catalog tuples is concerned, and the two are not
the same.

> I think the comment next to "#ifdef CATALOG_VARLEN" is now a bit
> misleading, because the field immediately below is effectively not
> varlena.  Maybe make it
> #ifdef CATALOG_VARLEN   /* nullable/varlena fields start here */

done

> In RemoveAttributeById, a comment says
> "Clear the other variable-length fields."
> but this is no longer fully correct.  Again maybe make it "... the other
> nullable or variable-length fields".

done

> In get_attstattarget() I think we should return 0 for dropped columns
> without reading attstattarget, which is useless anyway, and if it did
> happen to return non-null, it might cause us to do stuff, which would be
> a waste.

I ended up deciding to get rid of get_attstattarget() altogether and
just do the fetching inline in examine_attribute().  The previous API,
and what you are discussing here, is over-designed, since the only
caller doesn't call it with dropped columns or system columns anyway.
This way these issues are contained in the ANALYZE code, not in a very
general place like lsyscache.c.

> It's annoying that the new code in index_concurrently_swap() is more
> verbose than the code being replaced, but it seems OK to me, since it
> allows us to distinguish a null value in attstattarget from actual 0
> without complicating the get_attstattarget API (which I think we would
> have to do if we wanted to use it here.)

Yeah, this was annoying.  Originally, I had it even more complicated,
because I was trying to check if the actual (non-null) values are the
same.  But then I realized the new value is never set at this point.  I
think what the code is actually about is clearer now.  And of course the
0003 patch gets rid of it anyway.
From d937c26d8c471c999aa53c96dce86c68fad71a7a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 11 Jan 2024 10:09:02 +0100
Subject: [PATCH v3 1/3] Make attstattarget nullable

This changes the pg_attribute field attstattarget into a nullable
field in the variable-length part of the row.  If no value is set by
the user for attstattarget, it is now null instead of previously -1.
This saves space in pg_attribute and tuple descriptors for most
practical scenarios.  (ATTRIBUTE_FIXED_PART_SIZE is reduced from 108
to 104.)  Also, null is the semantically more correct value.

The ANALYZE code internally continues to represent the default
statistics target by -1, so that that code can avoid having to deal
with null values.  But that is now contained to the ANALYZE code.
Only the DDL code deals with attstattarget possibly null.

For system 

Re: Synchronizing slots from primary to standby

2024-01-11 Thread Dilip Kumar
On Wed, Jan 10, 2024 at 5:53 PM Zhijie Hou (Fujitsu)
 wrote:
>

> > IIUC on the standby we just want to overwrite what we get from primary no? 
> > If
> > so why we are using those APIs that are meant for the actual decoding slots
> > where it needs to take certain logical decisions instead of mere 
> > overwriting?
>
> I think we don't have a strong reason to use these APIs, but it was 
> convenient to
> use these APIs as they can take care of updating the slots info and will call
> functions like, ReplicationSlotsComputeRequiredXmin,
> ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly 
> overwriting
> the fields and calling these manually?

I might be missing something but do you want to call
ReplicationSlotsComputeRequiredXmin() kind of functions in standby?  I
mean those will ultimately update the catalog xmin and replication
xmin in Procarray and that prevents Vacuum from cleaning up some of
the required xids.  But on standby, those shared memory parameters are
not used IIUC.

In my opinion on standby, we just need to update the values in the
local slots and whatever we get from remote slots without taking all
the logical decisions in the hope that they will all fall into a
particular path, for example, if you see LogicalIncreaseXminForSlot(),
it is doing the following steps of operations as shown below [1].  These
all make sense when you are doing candidate-based updates where we
first mark the candidates and then update the candidate to real value
once you get the confirmation for the LSN.  Now following all this
logic looks completely weird unless this can fall in a different path
I feel it will do some duplicate steps as well.  For example in
local_slot_update(), first you call LogicalConfirmReceivedLocation()
which will set the 'data.confirmed_flush' and then you will call
LogicalIncreaseXminForSlot() which will set the 'updated_xmin = true;'
and will again call LogicalConfirmReceivedLocation().  I don't think
this is the correct way of reusing the function unless you need to go
through those paths and I am missing something.

[1]
LogicalIncreaseXminForSlot()
{
    if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
    {
    }
    else if (current_lsn <= slot->data.confirmed_flush)
    {
    }
    else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
    {
    }

    if (updated_xmin)
        LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
}


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Exposing the lock manager's WaitForLockers() to SQL

2024-01-11 Thread Will Mortensen
Here is a new series adding a single pg_wait_for_lockers() function
that takes a boolean argument to control the interpretation of the
lock mode. It omits LOCK's handling of descendant tables so it
requires permissions directly on descendants in order to wait for
locks on them. Not sure if that would be a problem for anyone.
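
Conceptually the function does something like the following (a self-contained
sketch; the stand-ins play the roles of GetLockConflicts() and the
per-transaction wait primitive):

#include <stddef.h>

typedef struct { unsigned proc; unsigned local_xid; } VxidSketch;

/* stand-in: return the transactions currently holding conflicting locks */
static VxidSketch *
get_conflicting_holders(int lockmode, size_t *n)
{
    (void) lockmode;
    *n = 0;
    return (VxidSketch *) 0;
}

/* stand-in: block until the given transaction finishes */
static void
wait_for_vxact(VxidSketch vxid)
{
    (void) vxid;
}

/* wait for everyone currently holding a conflicting lock, without ever
 * acquiring the lock ourselves, so later acquirers are not blocked */
static void
wait_for_lockers_sketch(int lockmode)
{
    size_t n;
    VxidSketch *holders = get_conflicting_holders(lockmode, &n);

    for (size_t i = 0; i < n; i++)
        wait_for_vxact(holders[i]);
}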


v6-0001-Refactor-GetLockConflicts-into-more-general-GetLo.patch
Description: Binary data


v6-0002-Allow-specifying-single-lockmode-in-WaitForLocker.patch
Description: Binary data


v6-0003-Add-pg_wait_for_lockers-function.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-01-11 Thread shveta malik
On Thu, Jan 11, 2024 at 7:28 AM Peter Smith  wrote:
>
> Here are some review comments for patch v58-0002

Thank you for the feedback. These are addressed in v60. Please find my
responses inline for a few of them.

> (FYI - I quickly checked with the latest v59-0002 and AFAIK all these
> review comments below are still relevant)
>
> ==
> Commit message
>
> 1.
> If a logical slot is invalidated on the primary, slot on the standby is also
> invalidated.
>
> ~
>
> /slot on the standby/then that slot on the standby/
>
> ==
> doc/src/sgml/logicaldecoding.sgml
>
> 2.
> In order to resume logical replication after failover from the synced
> logical slots, it is required that 'conninfo' in subscriptions are
> altered to point to the new primary server using ALTER SUBSCRIPTION
> ... CONNECTION. It is recommended that subscriptions are first
> disabled before promoting the standby and are enabled back once these
> are altered as above after failover.
>
> ~
>
> Minor rewording mainly to reduce a long sentence.
>
> SUGGESTION
> To resume logical replication after failover from the synced logical
> slots, the subscription's 'conninfo' must be altered to point to the
> new primary server. This is done using ALTER SUBSCRIPTION ...
> CONNECTION. It is recommended that subscriptions are first disabled
> before promoting the standby and are enabled back after altering the
> connection string.
>
> ==
> doc/src/sgml/system-views.sgml
>
> 3.
> +  
> +   synced bool
> +  
> +  
> +  True if this logical slot was synced from a primary server.
> +  
> +  
>
> SUGGESTION
> True if this is a logical slot that was synced from a primary server.
>
> ==
> src/backend/access/transam/xlogrecovery.c
>
> 4.
> + /*
> + * Shutdown the slot sync workers to prevent potential conflicts between
> + * user processes and slotsync workers after a promotion.
> + *
> + * We do not update the 'synced' column from true to false here, as any
> + * failed update could leave some slot's 'synced' column as false. This
> + * could cause issues during slot sync after restarting the server as a
> + * standby. While updating after switching to the new timeline is an
> + * option, it does not simplify the handling for 'synced' column.
> + * Therefore, we retain the 'synced' column as true after promotion as they
> + * can provide useful information about their origin.
> + */
>
> Minor comment wording changes.
>
> BEFORE
> ...any failed update could leave some slot's 'synced' column as false.
> SUGGESTION
> ...any failed update could leave 'synced' column false for some slots.
>
> ~
>
> BEFORE
> Therefore, we retain the 'synced' column as true after promotion as
> they can provide useful information about their origin.
> SUGGESTION
> Therefore, we retain the 'synced' column as true after promotion as it
> may provide useful information about the slot origin.
>
> ==
> src/backend/replication/logical/slotsync.c
>
> 5.
> + * While creating the slot on physical standby, if the local restart_lsn 
> and/or
> + * local catalog_xmin is ahead of those on the remote then the worker cannot
> + * create the local slot in sync with the primary server because that would
> + * mean moving the local slot backwards and the standby might not have WALs
> + * retained for old LSN. In this case, the worker will mark the slot as
> + * RS_TEMPORARY. Once the primary server catches up, it will move the slot to
> + * RS_PERSISTENT and will perform the sync periodically.
>
> /will move the slot to RS_PERSISTENT/will mark the slot as RS_PERSISTENT/
>
> ~~~
>
> 6. drop_synced_slots_internal
> +/*
> + * Helper function for drop_obsolete_slots()
> + *
> + * Drops synced slot identified by the passed in name.
> + */
> +static void
> +drop_synced_slots_internal(const char *name, bool nowait)
> +{
> + Assert(MyReplicationSlot == NULL);
> +
> + ReplicationSlotAcquire(name, nowait);
> +
> + Assert(MyReplicationSlot->data.synced);
> +
> + ReplicationSlotDropAcquired();
> +}
>
> IMO you don't need this function. AFAICT it is only called from one
> place and does not result in fewer lines of code.
>
> ~~~
>
> 7. get_local_synced_slots
>
> + /* Check if it is logical synchronized slot */
> + if (s->in_use && SlotIsLogical(s) && s->data.synced)
> + {
> + local_slots = lappend(local_slots, s);
> + }
>
> Do you need to check SlotIsLogical(s) here? I thought s->data.synced
> can never be true for physical slots. I felt you could write this like
> below:
>
> if (s->in_use && s->data.synced)
> {
>   Assert(SlotIsLogical(s));
>   local_slots = lappend(local_slots, s);
> }
>
> ~~~
>
> 8. check_sync_slot_on_remote
>
> +static bool
> +check_sync_slot_on_remote(ReplicationSlot *local_slot, List *remote_slots,
> +   bool *locally_invalidated)
> +{
> + ListCell   *lc;
> +
> + foreach(lc, remote_slots)
> + {
> + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);
>
> I think you can use the new style foreach_ptr list macros here.
>
> ~~~
>
> 9. drop_obsolete_slots
>
> 

Re: Random pg_upgrade test failure on drongo

2024-01-11 Thread Amit Kapila
On Thu, Jan 11, 2024 at 8:15 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > But tomorrow it could be for other tables and if we change this
> > > TRUNCATE logic for pg_largeobject (of which chances are less) then
> > > there is always a chance that one misses changing this comment. I feel
> > > keeping it generic in this case would be better as the problem is
> > > generic but it is currently shown for pg_largeobject.
> >
> > Yes, for sure. So let's keep it generic as you prefer.
> >
> > Thank you!
>
> Thanks for working on the patch. I'm also OK to push Amit's fix patch.
>

Thanks to both of you. I have pushed the patch.

-- 
With Regards,
Amit Kapila.




Fix bugs not to discard statistics when changing stats_fetch_consistency

2024-01-11 Thread Shinya Kato

Hi, hackers

There is the following description in the docs for stats_fetch_consistency:
"Changing this parameter in a transaction discards the statistics
snapshot."


However, if stats_fetch_consistency is changed in a transaction, the
statistics snapshot is not discarded in some cases.


Example:
--
* session 1
=# SET stats_fetch_consistency TO snapshot;
=# BEGIN;
=*# SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
 wal_records | wal_fpi | wal_bytes
-------------+---------+-----------
       23592 |     628 |   5939027
(1 row)

* session 2
=# CREATE TABLE test (i int); -- generate WAL records
=# SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
 wal_records | wal_fpi | wal_bytes
-------------+---------+-----------
       23631 |     644 |   6023411
(1 row)

* session 1
=*# -- snapshot is not discarded, which is correct
=*# SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
 wal_records | wal_fpi | wal_bytes
-------------+---------+-----------
       23592 |     628 |   5939027
(1 row)

=*# SET stats_fetch_consistency TO cache;

=*# -- snapshot is not discarded, which is wrong
=*# SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
 wal_records | wal_fpi | wal_bytes
-------------+---------+-----------
       23592 |     628 |   5939027
(1 row)
--

I can see similar cases in pg_stat_archiver, pg_stat_bgwriter, 
pg_stat_checkpointer, pg_stat_io, and pg_stat_slru.

Is this a bug? I have fixed it in the attached patch; what do you think?
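
With the fix applied, repeating the last step of the example above should
show the snapshot being discarded, as the docs describe (an illustrative
continuation of session 1):

=*# SET stats_fetch_consistency TO cache;
=*# -- with the fix, this re-reads the current counters instead of the
=*# -- stale snapshot values
=*# SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;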

--
Regards,
Shinya Kato
NTT DATA GROUP CORPORATION

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 125ca97b18..40ef6ef321 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1714,3 +1714,11 @@ assign_stats_fetch_consistency(int newval, void *extra)
 	if (pgstat_fetch_consistency != newval)
 		force_stats_snapshot_clear = true;
 }
+
+
+void
+pgstat_clear_snapshot_if_needed(void)
+{
+	if (force_stats_snapshot_clear)
+		pgstat_clear_snapshot();
+}
diff --git a/src/backend/utils/activity/pgstat_archiver.c b/src/backend/utils/activity/pgstat_archiver.c
index 66398b20e5..d4979e9fd0 100644
--- a/src/backend/utils/activity/pgstat_archiver.c
+++ b/src/backend/utils/activity/pgstat_archiver.c
@@ -57,6 +57,7 @@ pgstat_report_archiver(const char *xlog, bool failed)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
+	pgstat_clear_snapshot_if_needed();
 	pgstat_snapshot_fixed(PGSTAT_KIND_ARCHIVER);
 
 	return &pgStatLocal.snapshot.archiver;
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index 7d2432e4fa..931d0f17e7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -70,6 +70,7 @@ pgstat_report_bgwriter(void)
 PgStat_BgWriterStats *
 pgstat_fetch_stat_bgwriter(void)
 {
+	pgstat_clear_snapshot_if_needed();
 	pgstat_snapshot_fixed(PGSTAT_KIND_BGWRITER);
 
 	return &pgStatLocal.snapshot.bgwriter;
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index 30a8110e38..ac700072ed 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -79,6 +79,7 @@ pgstat_report_checkpointer(void)
 PgStat_CheckpointerStats *
 pgstat_fetch_stat_checkpointer(void)
 {
+	pgstat_clear_snapshot_if_needed();
 	pgstat_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
 
 	return &pgStatLocal.snapshot.checkpointer;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 43c393d6fe..13712aa4af 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -156,6 +156,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 PgStat_IO *
 pgstat_fetch_stat_io(void)
 {
+	pgstat_clear_snapshot_if_needed();
 	pgstat_snapshot_fixed(PGSTAT_KIND_IO);
 
 	return &pgStatLocal.snapshot.io;
diff --git a/src/backend/utils/activity/pgstat_slru.c b/src/backend/utils/activity/pgstat_slru.c
index 56ea1c3378..7f8a97e8ca 100644
--- a/src/backend/utils/activity/pgstat_slru.c
+++ b/src/backend/utils/activity/pgstat_slru.c
@@ -104,6 +104,7 @@ pgstat_count_slru_truncate(int slru_idx)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
+	pgstat_clear_snapshot_if_needed();
 	pgstat_snapshot_fixed(PGSTAT_KIND_SLRU);
 
 	return pgStatLocal.snapshot.slru;
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 1a3c0e1a66..f8fcf909f0 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -66,6 +66,7 @@ pgstat_report_wal(bool force)
 PgStat_WalStats *
 pgstat_fetch_stat_wal(void)
 {
+	pgstat_clear_snapshot_if_needed();
 	pgstat_snapshot_fixed(PGSTAT_KIND_WAL);
 
 	return &pgStatLocal.snapshot.wal;


Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-11 Thread Ashutosh Bapat
On Thu, Jan 11, 2024 at 9:42 AM Michael Paquier  wrote:
>
> >> Yeah, that's an intended design choice to keep the code simpler and
> >> faster as there is no need to track the library and function names in
> >> the local caches or implement something similar to invalidation
> >> messages for this facility because it would impact performance anyway
> >> in the call paths.  In short, just don't do that, or use two distinct
> >> points.
> >
> > In practice the InjectionPointDetach() and InjectionPointAttach()
> > calls may not be close by and the user may not be able to figure out
> > why the injection points are behaviour weird. It may impact
> > performance but unexpected behaviour should be avoided, IMO.
> >
> > If nothing else this should be documented.
>
> In all the infrastructures I've looked at, folks did not really care
> about having an invalidation for the callbacks loaded.  Still I'm OK
> to add something in the documentation about that, say among the lines
> of an extra sentence like:
> "The callbacks loaded by a process are cached within each process.
> There is no invalidation facility for the callbacks attached to
> injection points, hence updating a callback for an injection point
> requires a restart of the process to release its cache and the
> previous callbacks attached to it."

It doesn't behave exactly like that either. If the INJECTION_POINT is
run after a detach (but before a re-attach), the local cache is updated.
A subsequent attach and INJECTION_POINT call would then fetch the new
callback.
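
To illustrate the sequence (assuming the attach/run/detach SQL wrappers from
the patch's test module; the point name is made up):

SELECT injection_points_attach('my-point', 'notice');
SELECT injection_points_run('my-point');    -- loads and caches the callback
SELECT injection_points_detach('my-point');
SELECT injection_points_run('my-point');    -- no-op, local cache refreshed
SELECT injection_points_attach('my-point', 'notice');
SELECT injection_points_run('my-point');    -- picks up the new callback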

>
> > I am ok with not populating the cache but checking with just
> > load_external_function(). This is again another ease-of-use scenario
> > where a silly mistake by the user is caught earlier, making the user's
> > life easier. That at least should be the goal of the first cut.
>
> I don't really aim for complicated here, just useful.

It isn't complicated. Such a simple error check improves users'
confidence in the feature, and it should be part of the first cut.

>
> > With v6 I could run the test when built with enable_injection_point
> > false. I just ran make check in that folder. Didn't test meson build.
>
> The CI has been failing because 041_invalid_checkpoint_after_promote
> was loading Time::HiRes::nanosleep and Windows does not support it.

Some miscommunication here. The SQL test under the injection_point module
can be run in a build without injection points, and it fails. I think
it's better to have an alternate expected output for it, or to prevent
the test from running at all in that case.

-- 
Best Wishes,
Ashutosh Bapat




Re: Update docs for default value of fdw_tuple_cost

2024-01-11 Thread Umair Shahid
On Thu, Jan 11, 2024 at 7:07 AM John Naylor  wrote:

> On Tue, Dec 26, 2023 at 11:27 PM Umair Shahid
>  wrote:
> >
> > Commit cac169d686eddb277880a0d8a760ac3007b4846a updated the default
> value of fdw_tuple_cost from 0.01 to 0.2. The attached patch updates the
> docs to reflect this change.
>
> Pushed, thanks!
>

Thank you, John.

And thank you Chris Travers for the review.


Re: Is the subtype_diff function in CREATE TYPE only can be C function?

2024-01-11 Thread ddme


> On Jan 10, 2024, at 18:04, Ashutosh Bapat wrote:
> 
> On Wed, Jan 10, 2024 at 1:49 PM ddme wrote:
>> 
>> Hi all,
>> 
>> I notice that the CREATE TYPE syntax can specify subtype_diff function
>> 
>> CREATE TYPE name AS RANGE (
>>SUBTYPE = subtype
>>[ , SUBTYPE_OPCLASS = subtype_operator_class ]
>>[ , COLLATION = collation ]
>>[ , CANONICAL = canonical_function ]
>>[ , SUBTYPE_DIFF = subtype_diff_function ] <— here
>>[ , MULTIRANGE_TYPE_NAME = multirange_type_name ]
>> )
>> 
>> And an example is
>> ```sql
>> 
>> CREATE TYPE float8_range AS RANGE (subtype = float8, subtype_diff = 
>> float8mi);
>> 
>> ```
>> 
>> I notice that float8mi is a C function, and from call_subtype_diff() in
>> the source code it seems that it can only call C functions.
> 
> call_subtype_diff() invokes FunctionCall2Coll() which in turn invokes
> the function handler for non-C functions. See
> fmgr_info_cxt_security() for example. So subtype_diff can be a SQL
> callable function written in any supported language.
> 
>> 
>> I want to know
>> 
>> 1. Can the subtype_diff function in CREATE TYPE be a SQL or PL/pgSQL function?
> 
> I think so.
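>
> For illustration, a minimal sketch (hypothetical names) of a range type
> whose subtype_diff is a SQL-language function:
>
> CREATE FUNCTION my_f8_diff(a float8, b float8) RETURNS float8
>     LANGUAGE sql IMMUTABLE AS 'SELECT a - b';
> CREATE TYPE my_f8_range AS RANGE (
>     subtype = float8, subtype_diff = my_f8_diff);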
> 
>> 2. How is the subtype_diff function called? I know it is related to GiST
>> indexes; I need an example of how to trigger the subtype_diff function.
> 
> I am not familiar with GiST code enough to answer that question. But
> looking at the places where call_subtype_diff() is called, esp. the
> comments there might give you hints.
> OR somebody more familiar with GiST code will give you a direct answer.
> 
> -- 
> Best Wishes,
> Ashutosh Bapat

Thank you!


I know that range_gist_picksplit() calls call_subtype_diff(), but I cannot
find a call path that reaches range_gist_picksplit().
I have tried to exercise the GiST index with `CREATE INDEX … USING GIST` and
a SELECT with a filter to trigger the index. With the help of EXPLAIN, I can
see that the GiST index is used, but the subtype_diff function is not called.


```sql
create function float4mi(a float8, b float8) RETURNS float8 LANGUAGE SQL … …

create type float8range as range (subtype=float8, subtype_diff=float4mi);
create table float8range_test(f8r float8range);
insert into float8range_test values('[1.111,2.344]'::float8range), ('[1.111, 
4.567]'::float8range);
create index my_index on float8range_test using gist(f8r);
SET enable_seqscan = off;
select * from float8range_test ORDER BY f8r;
```

Is more setup SQL needed, such as `CREATE OPERATOR CLASS … USING gist`, to
trigger it?



Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-01-11 Thread Kartyshov Ivan

Rebased and ready for review.
I have kept only the following versions (the others had irreparable problems):

1) Classic (wait_classic_v4.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar



WAIT FOR  [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
 [ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp
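
For example (illustrative LSN; statement forms as in the patch):

WAIT FOR LSN '0/3F07A6B1' TIMEOUT 1000;
BEGIN WAIT FOR LSN '0/3F07A6B1';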



2) After style: Kyotaro and Freund (wait_after_within_v3.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==
advantages: no new words in grammar
disadvantages: a little harder to understand



AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
 [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
 [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
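
For example, the same wait expressed in this style (a sketch; the LSN is
illustrative):

AFTER '0/3F07A6B1' WITHIN 1000;
BEGIN AFTER '0/3F07A6B1' WITHIN 1000;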


--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..410018b79b 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..b3af16c09f 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_for_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_for_event is:
+WAIT FOR [ANY | ALL] event [, ...]
+
+and event is one of:
+LSN lsn_value
+TIMEOUT number_of_milliseconds
+timestamp
 
  
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..1b54ed2084 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] wait_for_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_for_event is:
+WAIT FOR [ANY | ALL] event [, ...]
+
+and event is one of:
+LSN lsn_value
+TIMEOUT number_of_milliseconds
+timestamp
 
  
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 00..26cae3ad85
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,146 @@
+
+
+
+ 
+  WAIT FOR
+ 
+
+ 
+  WAIT FOR
+  7
+  SQL - Language Statements
+ 
+
+ 
+  WAIT FOR
+  wait for the target LSN to be replayed or for specified time to pass
+ 
+
+ 
+
+WAIT FOR [ANY | ALL] event [, ...]
+
+where event is one of:
+LSN value
+TIMEOUT number_of_milliseconds
+timestamp
+
+WAIT FOR LSN 'lsn_number'
+WAIT FOR LSN 'lsn_number' TIMEOUT wait_timeout
+WAIT FOR LSN 'lsn_number', TIMESTAMP wait_time
+WAIT FOR TIMESTAMP wait_time
+WAIT FOR ALL LSN 'lsn_number' TIMEOUT wait_timeout, TIMESTAMP wait_time
+WAIT FOR ANY LSN 'lsn_number', TIMESTAMP wait_time
+
+ 
+
+ 
+  Description
+
+  
+   WAIT FOR provides a simple interprocess communication
+   mechanism to wait for timestamp, timeout or the target log sequence number
+   (LSN) on standby in PostgreSQL
+   databases with master-standby asynchronous replication. When run with the
+   LSN option, the WAIT FOR
+   command waits for the specified LSN to be replayed.
+   If no timestamp or timeout is specified, the wait time is unlimited.
+   Waiting can be interrupted using Ctrl+C, or
+   by shutting down the postgres server.
+  
+
+ 
+
+ 
+  Parameters
+
+  
+   
+LSN
+
+ 
+  Specify the target log sequence number to wait for.
+ 
+
+   
+
+   
+TIMEOUT wait_timeout
+
+ 
+  Limit the time interval to wait for the LSN to be replayed.
+  The specified wait_timeout must be an integer
+  and is measured in milliseconds.
+ 
+
+   
+
+   
+UNTIL TIMESTAMP wait_time
+
+ 
+  Limit the time to wait for the LSN to be replayed.
+  The specified wait_time must be a timestamp.
+ 
+
+   
+
+  
+ 
+
+ 
+  Examples
+
+  
+   Run WAIT FOR from psql,
+   limiting the wait time to 1 millisecond:
+
+
+WAIT FOR LSN '0/3F07A6B1' TIMEOUT 1;
+NOTICE:  LSN is not reached. Try to increase wait time.
+ LSN reached
+-------------
+ f
+(1 row)
+
+  
+
+  
+   Wait until the specified LSN is replayed:
+
+WAIT FOR LSN 

Improve the connection failure error messages

2024-01-11 Thread Nisha Moond
Hi Hackers,

Various sections of the code utilize the walrcv_connect() function,
employed by various processes such as walreceiver, logical replication
apply worker, etc., to establish connections with other hosts.
Presently, in case of connection failures, the error message lacks
information about the specific process attempting to connect and
encountering the failure.

The provided patch enhances error messages for such connection
failures, offering more details on the processes that failed to
establish a connection.

--
Thanks,
Nisha


v1-0001-Improve-the-connection-failure-error-messages.patch
Description: Binary data


Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-01-11 Thread Kyotaro Horiguchi
Thanks for restarting this thread.

At Tue, 9 Jan 2024 09:40:23 +0900, Michael Paquier wrote:
> On Fri, Jan 05, 2024 at 02:58:55PM -0500, Robert Haas wrote:
> > I'm not a Windows expert, but my guess is that 0001 is a very good
> > idea. I hope someone who is a Windows expert will comment on that.
> 
> I am +1 on 0001.  It is just something we've never anticipated when
> these wrappers around cmd in pg_ctl were written.

Thanks for committing it.

> > 0002 seems problematic to me. One potential issue is that it would
> > break if someone renamed postgres.exe to something else -- although
> > that's probably not really a serious problem.
> 
> We do a find_other_exec_or_die() on "postgres" with what could be a
> custom execution path.  So we're actually sure that the binary will be
> there in the start path, no?  I don't like much the hardcoded
> dependency to .exe here.

The patch doesn't care about the path to postgres.exe. If you are referring to
the code you cited below, that exists for another reason. I'll describe it there.

> > A bigger issue is that
> > it seems like it would break if someone used pg_ctl to start several
> > instances in different data directories on the same machine. If I'm
> > understanding correctly, that currently mostly works, and this would
> > break it.
> 
> Not having the guarantee that a single shell_pid is associated to a
> single postgres.exe would be a problem.  Now the patch includes this
> code:
> + if (ppe.th32ParentProcessID == shell_pid &&
> + strcmp("postgres.exe", ppe.szExeFile) == 0)
> + {
> + if (pm_pid != ppe.th32ProcessID && pm_pid != 0)
> + multiple_children = true;
> + pm_pid = ppe.th32ProcessID;
> + }
> 
> Which is basically giving this guarantee?  multiple_children should
> never happen once the autorun part is removed.  Is that right?

The patch indeed ensures the relationship between the parent
pg_ctl.exe and postgres.exe.  However, for some reason, in my Windows
11 environment with the /D option specified, I observed that another
cmd.exe is spawned as the second child process of the parent
cmd.exe. This is why there is a need to verify the executable file
name. I have no idea how the second cmd.exe is being spawned.
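
For reference, here is a minimal standalone sketch (plain Win32 Toolhelp API;
error handling trimmed, and not the patch itself) of the
parent-PID-plus-image-name walk described above:

#include <windows.h>
#include <tlhelp32.h>
#include <string.h>

static DWORD
find_postmaster_pid(DWORD shell_pid)
{
	HANDLE		snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
	PROCESSENTRY32 ppe;
	DWORD		pm_pid = 0;

	if (snap == INVALID_HANDLE_VALUE)
		return 0;

	ppe.dwSize = sizeof(PROCESSENTRY32);
	if (Process32First(snap, &ppe))
	{
		do
		{
			/* match on the parent PID *and* the image name, as discussed */
			if (ppe.th32ParentProcessID == shell_pid &&
				strcmp("postgres.exe", ppe.szExeFile) == 0)
				pm_pid = ppe.th32ProcessID;
		} while (Process32Next(snap, &ppe));
	}
	CloseHandle(snap);

	return pm_pid;				/* 0 if no matching child was found */
}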

> +* The launcher shell might start other cmd.exe instances or programs
> +* besides postgres.exe. Veryfying the program file name is essential.
> 
> With the autorun part of cmd.exe removed, what's still relevant here?

Yes, if the strcmp() is commented out, multiple_children sometimes
becomes true.

> s/Veryfying/Verifying/.

Oops!

> Perhaps 0002 should make more efforts in documenting things like
> th32ProcessID and th32ParentProcessID.

Is it correct to understand that you are requesting changes as follows?

--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -1995,11 +1995,14 @@ pgwin32_find_postmaster_pid(pid_t shell_pid)
 *
 * Check for duplicate processes to ensure reliability.
 *
-* The launcher shell might start other cmd.exe instances or programs
-* besides postgres.exe. Verifying the program file name is essential.
-*
-* The launcher shell process isn't checked in this function.  It will be
-* checked by the caller.
+* The ppe entry to be examined is identified by th32ParentProcessID, which
+* should correspond to the cmd.exe process that executes the postgres.exe
+* binary. Additionally, th32ProcessID in the same entry should be the PID
+* of the launched postgres.exe. However, even though we have launched the
+* parent cmd.exe with the /D option specified, it is sometimes observed
+* that another cmd.exe is launched for unknown reasons. Therefore, it is
+* crucial to verify the program file name to avoid returning the wrong
+* PID.
 */


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


 
>From f9f2486f18502357fb76f2f1da00e19db40b2bc9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 24 Oct 2023 14:46:50 +0900
Subject: [PATCH v6 1/2] Improve pg_ctl postmaster process check on Windows

Currently pg_ctl on Windows does not verify that it actually executed
a postmaster process due to lack of process ID knowledge. This can
lead to false positives in cases where another pg_ctl instance starts
a different server simultaneously.
This patch adds the capability to identify the process ID of the
launched postmaster on Windows, similar to other OS versions, ensuring
more reliable detection of concurrent server startups.
---
 src/bin/pg_ctl/pg_ctl.c | 105 
 1 file changed, 96 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 6900b27675..dec0afdb29 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -131,6 +131,7 @@ 
