Re: Synchronizing slots from primary to standby

2023-11-09 Thread Amit Kapila
On Thu, Nov 9, 2023 at 9:15 PM Drouvot, Bertrand
 wrote:
>
>
> You mean here?
>
> /*
>   * Check to see if promotion is requested. Note that we do
>   * this only after failure, so when you promote, we still
>   * finish replaying as much as we can from archive and
>   * pg_wal before failover.
>   */
> if (StandbyMode && CheckForStandbyTrigger())
> {
> XLogShutdownWalRcv();
>  return XLREAD_FAIL;
> }
>
> If so, that sounds like a good place to me.
>

One more thing to think about is whether we want to shut down slot sync
workers as well on promotion, similar to the walreceiver, because we don't
want them to attempt even a single sync after promotion.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2023-11-09 Thread Amit Kapila
On Fri, Nov 10, 2023 at 12:50 PM Drouvot, Bertrand
 wrote:
>
> On 11/10/23 6:41 AM, Amit Kapila wrote:
> > On Thu, Nov 9, 2023 at 7:29 PM Drouvot, Bertrand
> >  wrote:
> >
> > Are you saying that we change the state of the already existing slot
> > on standby?
>
> Yes.
>
> > And, such a state would indicate that we are trying to
> > sync the slot with the same name from the primary. Is that what you
> > have in mind?
>
> Yes.
>
> > If so, it appears quite odd to me to have such a state
> > and also set it in some unrelated slot that just has the same name.
> >
>
> > I understand your point that we can allow other slots to proceed but
> > it is also important to not create any sort of inconsistency that can
> > surprise user after failover.
>
> But even if we ERROR out instead of emitting a WARNING, the user would still
> need to be notified of/monitor such errors. I agree that then they will probably
> come to know earlier because the slot sync mechanism would be stopped, but still
> it is not "guaranteed" (especially if there are no other "working" synced slots
> around).

>
> And if they do not, then there is still a risk of using this slot after a
> failover thinking it is a "synced" slot.
>

I think this is another reason that giving an ERROR probably has a better
chance of the user noticing the problem before failover. If, knowing about such
errors, the user still proceeds with the failover, the onus is on her. We can
probably document this hazard along with the failover feature so that users are
aware that they either need to be careful while creating slots on the standby or
consult the ERROR logs. I guess we can even make it visible in the view.

> Giving more thoughts, what about using a dedicated/reserved naming convention
> for synced slots like synced_<slot_name> or such and then:
>
> - prevent users from creating sync_<slot_name> slots on the standby
> - sync <slot_name> on the primary to sync_<slot_name> on the standby
> - during failover, rename sync_<slot_name> to <slot_name> and if <slot_name> exists then
> emit a WARNING and keep sync_<slot_name> in place.
>
> That way both slots are still in place (the manually created <slot_name> and
> the sync_<slot_name>

Hmm, I think after failover, users need to rename all slots or we need
to provide a way to rename them so that they can be used by
subscribers which sounds like much more work.

> > Also, the current coding doesn't ensure
> > we will always give WARNING. If we see the below code that deals with
> > this WARNING,
> >
> > +  /* User created slot with the same name exists, emit WARNING. */
> > +  else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE)
> > +  {
> > +ereport(WARNING,
> > +errmsg("not synchronizing slot %s; it is a user created slot",
> > +     remote_slot->name));
> > +  }
> > +  /* Otherwise create the slot first. */
> > +  else
> > +  {
> > +TransactionId xmin_horizon = InvalidTransactionId;
> > +ReplicationSlot *slot;
> > +
> > +ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
> > +    remote_slot->two_phase, false);
> >
> > I think this is not a solid check to ensure that the slot existed
> > before. Because it could be created as soon as the slot sync worker
> > invokes ReplicationSlotCreate() here.
>
> Agree.
>

So, having a concrete check to give the WARNING would require some more
logic, which I don't think is worth it just to handle this boundary case.

-- 
With Regards,
Amit Kapila.




Re: Tab completion for CREATE TABLE ... AS

2023-11-09 Thread Jim Jones
Hi

On 02.11.23 17:27, Gilles Darold wrote:
> Hi,
>
>
> Looks like the tab completion for CREATE TABLE ... AS is not proposed.
>
>
> gilles=# CREATE TABLE test
> ( OF    PARTITION OF
>
>  The attached patch fixes that and also proposes further completion
> after the AS keyword.
>
>
> gilles=# CREATE TABLE test
> ( AS    OF    PARTITION OF
> gilles=# CREATE TABLE test AS
> SELECT  WITH
>
> Adding the patch to current commitfest.
>
>
> Best regards,
>

Thanks for the patch!
It applies and builds cleanly, and it works as expected.

"AS" is suggested after "CREATE TABLE t":

postgres=# CREATE TABLE t 
( AS    OF    PARTITION OF


-- 
Jim





Re: Remove MSVC scripts from the tree

2023-11-09 Thread Peter Eisentraut

On 09.11.23 00:05, Michael Paquier wrote:

Attached is a Perl version of the sed script, converted by hand (so not the
super-verbose s2p thing).  It's basically just the sed script with
semicolons added and the backslashes in the regular expressions moved
around.  I think we could use something like that for all platforms now.


Sounds like a good idea to me now that perl is a hard requirement.
+1.


How about this patch as a comprehensive solution?
From 47731e486d7356462ce610a6a76c21a0b3619823 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 10 Nov 2023 08:07:11 +0100
Subject: [PATCH v1] Replace Gen_dummy_probes.sed with Gen_dummy_probes.pl

---
 .gitattributes   |   1 -
 src/backend/utils/Gen_dummy_probes.pl| 275 ++-
 src/backend/utils/Gen_dummy_probes.pl.prolog |  19 --
 src/backend/utils/Gen_dummy_probes.sed   |  24 --
 src/backend/utils/Makefile   |  15 +-
 src/backend/utils/README.Gen_dummy_probes|  27 --
 src/include/utils/meson.build|   2 +-
 src/tools/msvc/Solution.pm   |   2 +-
 8 files changed, 26 insertions(+), 339 deletions(-)
 delete mode 100644 src/backend/utils/Gen_dummy_probes.pl.prolog
 delete mode 100644 src/backend/utils/Gen_dummy_probes.sed
 delete mode 100644 src/backend/utils/README.Gen_dummy_probes

diff --git a/.gitattributes b/.gitattributes
index 2384956d88..55e6060405 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -14,7 +14,6 @@ README.*  conflict-marker-size=32
 *.data -whitespace
 contrib/pgcrypto/sql/pgp-armor.sql whitespace=-blank-at-eol
 src/backend/catalog/sql_features.txt   
whitespace=space-before-tab,blank-at-eof,-blank-at-eol
-src/backend/utils/Gen_dummy_probes.pl.prolog   whitespace=-blank-at-eof
 
 # Test output files that contain extra whitespace
 *.out  -whitespace
diff --git a/src/backend/utils/Gen_dummy_probes.pl 
b/src/backend/utils/Gen_dummy_probes.pl
index f289b19344..f6df82baa5 100644
--- a/src/backend/utils/Gen_dummy_probes.pl
+++ b/src/backend/utils/Gen_dummy_probes.pl
@@ -1,259 +1,28 @@
-#! /usr/bin/perl -w
 #-
+# Perl script to create dummy probes.h file when dtrace is not available
 #
-# Gen_dummy_probes.pl
-#Perl script that generates probes.h file when dtrace is not available
-#
-# Portions Copyright (c) 2008-2023, PostgreSQL Global Development Group
-#
-#
-# IDENTIFICATION
-#src/backend/utils/Gen_dummy_probes.pl
-#
-# This program was generated by running perl's s2p over Gen_dummy_probes.sed
+# Copyright (c) 2008-2023, PostgreSQL Global Development Group
 #
+# src/backend/utils/Gen_dummy_probes.pl
 #-
 
-# turn off perlcritic for autogenerated code
-## no critic
-
-$0 =~ s/^.*?(\w+)[\.\w+]*$/$1/;
-
 use strict;
-use Symbol;
-use vars qw{ $isEOF $Hold %wFiles @Q $CondReg
-  $doAutoPrint $doOpenWrite $doPrint };
-$doAutoPrint = 1;
-$doOpenWrite = 1;
-
-# prototypes
-sub openARGV();
-sub getsARGV(;\$);
-sub eofARGV();
-sub printQ();
-
-# Run: the sed loop reading input and applying the script
-#
-sub Run()
-{
-   my ($h, $icnt, $s, $n);
-
-   # hack (not unbreakable :-/) to avoid // matching an empty string
-   my $z = "\000";
-   $z =~ /$z/;
-
-   # Initialize.
-   openARGV();
-   $Hold = '';
-   $CondReg = 0;
-   $doPrint = $doAutoPrint;
-  CYCLE:
-   while (getsARGV())
-   {
-   chomp();
-   $CondReg = 0;# cleared on t
- BOS:;
-
-   # /^[   ]*probe /!d
-   unless (m /^[ \t]*probe /s)
-   {
-   $doPrint = 0;
-   goto EOS;
-   }
-
-   # s/^[  ]*probe \([^(]*\)\(.*\);/\1\2/
-   {
-   $s = s /^[ \t]*probe ([^(]*)(.*);/${1}${2}/s;
-   $CondReg ||= $s;
-   }
-
-   # s/__/_/g
-   {
-   $s = s /__/_/sg;
-   $CondReg ||= $s;
-   }
-
-   # y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
-   { y{abcdefghijklmnopqrstuvwxyz}{ABCDEFGHIJKLMNOPQRSTUVWXYZ}; }
-
-   # s/^/#define TRACE_POSTGRESQL_/
-   {
-   $s = s /^/#define TRACE_POSTGRESQL_/s;
-   $CondReg ||= $s;
-   }
-
-   # s/([^,)]\{1,\})/(INT1)/
-   {
-   $s = s /\([^,)]+\)/(INT1)/s;
-   $CondReg ||= $s;
-   }
-
-   # s/([^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2)/
-   {
-   $s = s /\([^,)]+, [^,)]+\)/(INT1, INT2)/s;
-   $CondReg ||= $s;
-   }
-
-   # s/([^,)]\{1,\}, 

Re: Synchronizing slots from primary to standby

2023-11-09 Thread Drouvot, Bertrand

Hi,

On 11/10/23 6:41 AM, Amit Kapila wrote:

> On Thu, Nov 9, 2023 at 7:29 PM Drouvot, Bertrand
>  wrote:
>
> Are you saying that we change the state of the already existing slot
> on standby?


Yes.


> And, such a state would indicate that we are trying to
> sync the slot with the same name from the primary. Is that what you
> have in mind?


Yes.


> If so, it appears quite odd to me to have such a state
> and also set it in some unrelated slot that just has the same name.
>
> I understand your point that we can allow other slots to proceed but
> it is also important to not create any sort of inconsistency that can
> surprise user after failover.


But even if we ERROR out instead of emitting a WARNING, the user would still
need to be notified of/monitor such errors. I agree that then they will probably
come to know earlier because the slot sync mechanism would be stopped, but still
it is not "guaranteed" (especially if there are no other "working" synced slots
around). And if they do not, then there is still a risk of using this slot after a
failover thinking it is a "synced" slot.

Giving more thoughts, what about using a dedicated/reserved naming convention
for synced slots like synced_<slot_name> or such and then:

- prevent users from creating sync_<slot_name> slots on the standby
- sync <slot_name> on the primary to sync_<slot_name> on the standby
- during failover, rename sync_<slot_name> to <slot_name> and if <slot_name> exists then
emit a WARNING and keep sync_<slot_name> in place.

That way both slots are still in place (the manually created <slot_name> and
the sync_<slot_name> 18 upgrades. So it looks like we'd be good as long as we
are able to prevent sync_<slot_name> slots creation on 17.

Thoughts?


> Also, the current coding doesn't ensure
> we will always give WARNING. If we see the below code that deals with
> this WARNING,
>
> +  /* User created slot with the same name exists, emit WARNING. */
> +  else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE)
> +  {
> +ereport(WARNING,
> +errmsg("not synchronizing slot %s; it is a user created slot",
> +     remote_slot->name));
> +  }
> +  /* Otherwise create the slot first. */
> +  else
> +  {
> +TransactionId xmin_horizon = InvalidTransactionId;
> +ReplicationSlot *slot;
> +
> +ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
> +    remote_slot->two_phase, false);
>
> I think this is not a solid check to ensure that the slot existed
> before. Because it could be created as soon as the slot sync worker
> invokes ReplicationSlotCreate() here.


Agree.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2023-11-09 Thread Amit Kapila
On Thu, Nov 9, 2023 at 7:29 PM Drouvot, Bertrand
 wrote:
>
> On 11/9/23 3:41 AM, Amit Kapila wrote:
> > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand
> >  wrote:
> >>
> >>> Unrelated to above, if there is a user slot on standby with the same
> >>> name which the slot-sync worker is trying to create, then shall it
> >>> emit a warning and skip the sync of that slot or shall it throw an
> >>> error?
> >>>
> >>
> >> I'd vote for emit a warning and move on to the next slot if any.
> >>
> >
> > But then it could take time for users to know the actual problem and
> > they probably notice it after failover.
>
> Right, that's not appealing
>
> OTOH the slot has already been created manually on the standby so there is
> probably already a "use case" for it (that is probably unrelated to the
> failover story then).
>
> In V32, the following states have been introduced:
>
> "
> 'n': none for user slots,
> 'i': sync initiated for the slot but waiting for primary to catch up.
> 'r': ready for periodic syncs.
> "
>
> Should we introduce a new state that indicates that a sync slot creation
> has failed because the slot already existed? That would probably
> be simple to monitor instead of looking at the log file.
>

Are you saying that we change the state of the already existing slot
on standby? And, such a state would indicate that we are trying to
sync the slot with the same name from the primary. Is that what you
have in mind? If so, it appears quite odd to me to have such a state
and also set it in some unrelated slot that just has the same name.

I understand your point that we can allow other slots to proceed but
it is also important to not create any sort of inconsistency that can
surprise user after failover. Also, the current coding doesn't ensure
we will always give WARNING. If we see the below code that deals with
this WARNING,

+  /* User created slot with the same name exists, emit WARNING. */
+  else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE)
+  {
+ereport(WARNING,
+errmsg("not synchronizing slot %s; it is a user created slot",
+     remote_slot->name));
+  }
+  /* Otherwise create the slot first. */
+  else
+  {
+TransactionId xmin_horizon = InvalidTransactionId;
+ReplicationSlot *slot;
+
+ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
+    remote_slot->two_phase, false);

I think this is not a solid check to ensure that the slot existed
before. Because it could be created as soon as the slot sync worker
invokes ReplicationSlotCreate() here. So, depending on the timing, we
can either get an ERROR or WARNING. I feel giving an ERROR in this
case should be okay.

-- 
With Regards,
Amit Kapila.




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-09 Thread Dilip Kumar
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera  wrote:
>
> IMO the whole area of SLRU buffering is in horrible shape and many users
> are struggling with overall PG performance because of it.  An
> improvement doesn't have to be perfect -- it just has to be much better
> than the current situation, which should be easy enough.  We can
> continue to improve later, using more scalable algorithms or ones that
> allow us to raise the limits higher.

I agree with this.

> The only point on which we do not have full consensus yet is the need to
> have one GUC per SLRU, and a lot of effort seems focused on trying to
> fix the problem without adding so many GUCs (for example, using shared
> buffers instead, or use a single "scaling" GUC).  I think that hinders
> progress.  Let's just add multiple GUCs, and users can leave most of
> them alone and only adjust the one with which they have a performance
> problem; it's not going to be the same one for everybody.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-09 Thread Dilip Kumar
On Thu, Nov 9, 2023 at 9:39 PM Robert Haas  wrote:
>
> On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar  wrote:
> > Here is the updated version of the patch, here I have taken the
> > approach suggested by Andrey and I discussed the same with Alvaro
> > offlist and he also agrees with it.  So the idea is that we will keep
> > the bank size fixed which is 16 buffers per bank and the allowed GUC
> > value for each slru buffer must be in multiple of the bank size.  We
> > have removed the centralized lock but instead of one lock per bank, we
> > have kept the maximum limit on the number of bank locks which is 128.
> > We kept the max limit as 128 because, in one of the operations (i.e.
> > ActivateCommitTs), we need to acquire all the bank locks (but this is
> > not a performance path at all) and at a time we can acquire a max of
> > 200 LWlocks, so we think this limit of 128 is good.  So now if the
> > number of banks is <= 128 then we will be using one lock per bank
> > otherwise the one lock may protect access of buffer in multiple banks.
>
> Just so I understand, I guess this means that an SLRU is limited to
> 16*128 = 2k buffers = 16MB?

Not really, because 128 is the maximum limit on the number of bank
locks, not on the number of banks.  So for example, if you have 16*128
= 2k buffers then each lock will protect one bank, and likewise when
you have 16 * 512 = 8k buffers then each lock will protect 4 banks.
So in short we can get the lock for each bank by a simple computation
(banklockno = bankno % 128).
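
In other words, something like this (an illustrative sketch only, with made-up
names, not the exact code from the patch):

#define SLRU_BANK_SIZE      16      /* buffers per bank, fixed */
#define SLRU_MAX_BANKLOCKS  128     /* upper bound on the number of bank LWLocks */

/* number of banks for a given buffer pool size (the GUC is a multiple of 16) */
static inline int
slru_nbanks(int nbuffers)
{
    return nbuffers / SLRU_BANK_SIZE;
}

/* which lock protects a given bank; several banks may share one lock */
static inline int
slru_banklockno(int bankno)
{
    return bankno % SLRU_MAX_BANKLOCKS;
}

So for 16 * 512 = 8k buffers, banks 1, 129, 257 and 385 all map to lock 1,
i.e. each lock covers 4 banks.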

> When we were talking about this earlier, I suggested fixing the number
> of banks and allowing the number of buffers per bank to scale
> depending on the setting. That seemed simpler than allowing both the
> number of banks and the number of buffers to vary, and it might allow
> the compiler to optimize some code better, by converting a calculation
> like page_no%number_of_banks into a masking operation like page_no&0xf
> or whatever. However, because it allows an individual bank to become
> arbitrarily large, it more or less requires us to use a buffer mapping
> table. Some of the performance problems mentioned could be alleviated
> by omitting the hash table when the number of buffers per bank is
> small, and we could also create the dynahash with a custom hash
> function that just does modular arithmetic on the page number rather
> than a real hashing operation. However, maybe we don't really need to
> do any of that. I agree that dynahash is clunky on a good day. I
> hadn't realized the impact would be so noticeable.

Yes, so one idea is that we keep the number of banks fixed; with that,
as you pointed out, with a large number of buffers the bank size can be
quite big, and for that we would need a hash table.  OTOH, what I am doing
here is keeping the bank size fixed and small (16 buffers per bank), and
with that we can have a large number of banks when the buffer pool size
is quite big.  But I feel having more banks is not really a problem as
long as we don't grow the number of locks beyond a certain limit, because
in some corner cases we need to acquire all the locks together and there
is a limit on that.  So I like this idea of sharing locks across the
banks, because with that 1) we can have enough locks that lock contention
or cache invalidation due to a common lock should not be a problem
anymore, 2) we can keep a small bank size so that the sequential search
within a bank is quite fast, so reads are fast, and 3) with a small bank
size the victim buffer search, which has to be sequential, is quite fast.

> This proposal takes the opposite approach of fixing the number of
> buffers per bank, letting the number of banks vary. I think that's
> probably fine, although it does reduce the effective associativity of
> the cache. If there are more hot buffers in a bank than the bank size,
> the bank will be contended, even if other banks are cold. However,
> given the way SLRUs are accessed, it seems hard to imagine this being
> a real problem in practice. There aren't likely to be say 20 hot
> buffers that just so happen to all be separated from one another by a
> number of pages that is a multiple of the configured number of banks.
> And in the seemingly very unlikely event that you have a workload that
> behaves like that, you could always adjust the number of banks up or
> down by one, and the problem would go away. So this seems OK to me.

I agree with this

> I also agree with a couple of points that Alvaro made, specifically
> that (1) this doesn't have to be perfect, just better than now and (2)
> separate GUCs for each SLRU is fine. On the latter point, it's worth
> keeping in mind that the cost of a GUC that most people don't need to
> tune is fairly low. GUCs like work_mem and shared_buffers are
> "expensive" because everybody more or less needs to understand what
> they are and how to set them and getting the right value can tricky --
> but a GUC like autovacuum_naptime is a lot cheaper, because almost
> nobody needs to change 

Re: Add recovery to pg_control and remove backup_label

2023-11-09 Thread Michael Paquier
On Tue, Nov 07, 2023 at 05:20:27PM +0900, Michael Paquier wrote:
> On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:
> I've retested today, and do not see the failure.  I'll let you know if I see
> this again.

I've done a few more dozen runs, and still nothing.  I am wondering
what this disturbance was.

>> If we set these fields where backup_label was renamed, the logic would not
>> be exactly the same since pg_control won't be updated until the next time
>> through the loop. Since the fields should be updated before
>> UpdateControlFile() I thought it made sense to keep all the updates
>> together.
>> 
>> Overall I think it is simpler, and we don't need to acquire a lock on
>> ControlFile.
> 
> What you are proposing is the same as what we already do for
> backupEndRequired or backupStartPoint in the control file when
> initializing recovery, so objection withdrawn.
> 
>> Thomas included this change in his pg_basebackup changes so I did the same.
>> Maybe wait a bit before we split this out? Seems like a pretty small
>> change...
> 
> Seems like a pretty good argument for refactoring that now, and let
> any other patches rely on it.  Would you like to send a separate
> patch?

The split still looks worth doing seen from here, so I am marking
the patch as Waiting on Author for now.

>>> Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
>>> reduce more the size with some alignment magic, no?
>> 
>> I thought about this, but it seemed to me that existing fields had been
>> positioned to make the grouping logical rather than to optimize alignment,
>> e.g. minRecoveryPointTLI. Ideally that would have been placed near
>> backupEndRequired (or vice versa). But if the general opinion is to
>> rearrange for alignment, I'm OK with that.
> 
> I've not tested, but it looks like moving backupStartPointTLI after
> backupEndPoint should shave 8 bytes, if you want to maintain a more
> coherent group for the LSNs.

+* backupFromStandby indicates that the backup was taken on a standby. It is
+* require to initialize recovery and set to false afterwards.
s/require/required/.

The term "backup recovery", that we've never used in the tree until
now as far as I know.  Perhaps this recovery method should just be
referred as "recovery from backup"?

By the way, there is another thing that this patch has forgotten: the
SQL functions that display data from the control file.  Shouldn't
pg_control_recovery() be extended with the new fields?  These fields
may be less critical than the other ones related to recovery, but I
suspect that showing them can become handy at least for debugging and
monitoring purposes.

Something in this area is that backupRecoveryRequired is the switch
controlling whether the fields set by the backup are applied at recovery
initialization.  Could it actually be useful to leave the other fields as
they are and only reset backupRecoveryRequired before the first control
file update?  This would leave a trace of the backup history directly in
the control file.

What about pg_resetwal and RewriteControlFile()?  Shouldn't these
recovery fields be reset as well?

git diff --check is complaining a bit.
--
Michael


signature.asc
Description: PGP signature


Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-09 Thread Andres Freund
Hi, 

On November 8, 2023 11:28:08 PM PST, Michael Paquier  
wrote:
>On Thu, Nov 09, 2023 at 01:50:34PM +0900, torikoshia wrote:
>> PGSTAT_KIND_SLRU cannot be reset by pg_stat_reset_shared(), so I feel
>> uncomfortable to delete it all together.
>> It might be better after pg_stat_reset_shared() has been modified to take
>> 'slru' as an argument, though.
>
>Not sure how to feel about that, TBH, but I would not include SLRUs
>here if we have already a separate function.
>
>> I thought it would be better to reset statistics even when null is specified
>> so that users are not confused with the behavior of pg_stat_reset_slru().
>> Attached patch added pg_stat_reset_shared() in system_functions.sql mainly
>> for this reason.
>
>I'm OK with doing what your patch does, aka do the work if the value
>is NULL or if there's no argument given.
>
>-Resets some cluster-wide statistics counters to zero, depending on the
>+Resets cluster-wide statistics counters to zero, depending on the 
>
>This does not need to change, aka SLRUs are for example still global
>and not included here.
>
>+If the argument is NULL or not specified, all counters shown in all
>+of these views are reset.
>
>Missing a  markup around NULL.  I know, we're not consistent
>about that, either, but if we are tweaking the area let's be right at
>least.  Perhaps "all the counters from the views listed above are
>reset"?

I see no reason to not include slrus. We should never have added the ability to 
reset them individually, particularly not without a use case - I couldn't find 
one skimming some discussion. And what's the point in not allowing to reset 
them via pg_stat_reset_shared()?

Greetings,

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-09 Thread Michael Paquier
On Fri, Nov 10, 2023 at 12:33:50PM +0900, torikoshia wrote:
> On 2023-11-09 16:28, Michael Paquier wrote:
>> Not sure how to feel about that, TBH, but I would not include SLRUs
>> here if we have already a separate function.
> 
> IMHO I agree with you.

The comments added could be better grammatically, but basically LGTM.
I'll take care of that if there are no objections.
--
Michael


signature.asc
Description: PGP signature


AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

2023-11-09 Thread Andres Freund
Hi,

I just created a primary with wal_segment_size=512. Then tried to create a
standby via pg_basebackup. The pg_basebackup appeared to just hang, for quite
a while, but did eventually complete. Over a minute for an empty cluster, when
using -c fast.

In this case I had used wal_sync_method=open_datasync - it's often faster and
if we want to scale WAL writes more we'll have to use it more widely (you
can't have multiple fdatasyncs in progress and reason about which one affects
what, but you can have multiple DSYNC writes in progress at the same time).

After a bit of confused staring and debugging I figured out that the problem
is that the RequestXLogSwitch() within the code for starting a basebackup was
triggering writing back the WAL in individual 8kB writes via
GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of these
writes is durable - on this drive each take about 1ms.


Normally we write out WAL in bigger chunks - but as it turns out, we don't
have any logic for doing larger writes when AdvanceXLInsertBuffers() is called
from within GetXLogBuffer(). We just try to make enough space so that one
buffer can be replaced.


The time for a single SELECT pg_switch_wal() on this system, when using
open_datasync and a 512MB segment, are:

wal_buffers   time for pg_switch_xlog()
16            64s
100           53s
400           13s
600           1.3s
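
(As a rough sanity check, assuming each durable 8kB write takes about 1ms as
measured above: a 512MB segment is 65536 pages, i.e. roughly 65s of writes,
which lines up with the 64s seen at wal_buffers=16.)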

That's pretty bad.  We don't really benefit from more buffering here, it just
avoids flushing in tiny increments. With a smaller wal_buffers, the large
record written by pg_switch_xlog() needs to replace buffers it itself inserted, and
does so one-by-one. If we never re-encounter a buffer we inserted ourselves
earlier, due to a larger wal_buffers, the problem isn't present.

This can bite with smaller segments too, it doesn't require large ones.


The reason this doesn't constantly become an issue is that walwriter normally
tries to write out WAL, and if it does, the AdvanceXLInsertBuffers() called in
backends doesn't need to (walsender also calls AdvanceXLInsertBuffers(), but
it won't ever write out data).

In my case, walwriter is actually trying to do something - but it never gets
WALWriteLock. The semaphore does get set after AdvanceXLInsertBuffers()
releases WALWriteLock, but on this system walwriter never succeeds taking the
lwlock before AdvanceXLInsertBuffers() succeeds re-acquiring it.


I think it might be a lucky accident that the problem was visible this
blatantly in this one case - I suspect that this behaviour is encountered
during normal operation in the wild, but much harder to pinpoint, because it
doesn't happen "exclusively".

E.g. I see a lot higher throughput bulk-loading data with larger wal_buffers
when using open_datasync, but basically no difference when using
fdatasync. And there are a lot of wal_buffers_full writes.


To fix this, I suspect we need to make
GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
specific case, we even know for sure that we are going to fill a lot more
buffers, so no heuristic would be needed. In other cases however we need some
heuristic to know how much to write out.

Given how *extremely* aggressive we are about flushing out nearly all pending
WAL in XLogFlush(), I'm not sure there's much point in not also being somewhat
aggressive in GetXLogBuffer()->AdvanceXLInsertBuffer().


Greetings,

Andres Freund




RE: Is this a problem in GenericXLogFinish()?

2023-11-09 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> Next we should add some test code. I will continue considering, but please
> post anything if you have an idea.

And I did; PSA the patch. This patch adds two parts to hash_index.sql.

In the first part, the primary bucket page is filled with live tuples and some
overflow pages with dead ones. This leads to the removal of overflow pages
without moving tuples. HEAD would crash if this were executed, which has
already been reported on hackers.

The second part tests a case (ntups == 0 && is_prim_bucket_same_wrt == false &&
is_prev_bucket_same_wrt == true) which has never been exercised before.



Also, I measured the execution time before/after patching. Below is the median
of 9 measurements.

BEFORE -> AFTER
647 ms -> 710 ms

This means that the execution time increased by about 10%; it will not affect
things so much.

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




fix_hash_squeeze_wal_4.patch
Description: fix_hash_squeeze_wal_4.patch


Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-09 Thread torikoshia

On 2023-11-09 16:28, Michael Paquier wrote:
Thanks for your review.
Attached v2 patch.


On Thu, Nov 09, 2023 at 01:50:34PM +0900, torikoshia wrote:

PGSTAT_KIND_SLRU cannot be reset by pg_stat_reset_shared(), so I feel
uncomfortable to delete it all together.
It might be better after pg_stat_reset_shared() has been modified to 
take

'slru' as an argument, though.


Not sure how to feel about that, TBH, but I would not include SLRUs
here if we have already a separate function.


IMHO I agree with you.

I thought it would be better to reset statistics even when null is 
specified
so that users are not confused with the behavior of 
pg_stat_reset_slru().
Attached patch added pg_stat_reset_shared() in system_functions.sql 
mainly

for this reason.


I'm OK with doing what your patch does, aka do the work if the value
is NULL or if there's no argument given.

-Resets some cluster-wide statistics counters to zero, 
depending on the
+Resets cluster-wide statistics counters to zero, depending on 
the


This does not need to change, aka SLRUs are for example still global
and not included here.

+If the argument is NULL or not specified, all counters shown 
in all

+of these views are reset.

Missing a  markup around NULL.  I know, we're not consistent
about that, either, but if we are tweaking the area let's be right at
least.  Perhaps "all the counters from the views listed above are
reset"?
--
Michael


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

From f01ab1071706bdd472fe6063160379a721763a48 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Fri, 10 Nov 2023 11:56:40 +0900
Subject: [PATCH v2] Add ability to reset all statistics to
 pg_stat_reset_shared()

Currently pg_stat_reset_shared() requires its argument to specify the
statistics to be reset, and there is no way to reset all the statistics
that can be specified by its argument.
This patch makes that possible when no argument or NULL is specified.

In effect this is equal to calling pg_stat_reset_shared() with every
possible argument, but that requires callers to know the set of possible
values.  Considering that nearly all the time the right thing to do is
to reset all shared stats, this API would be consistent with the way
users reset stats.
---
 doc/src/sgml/monitoring.sgml |  2 ++
 src/backend/catalog/system_functions.sql |  7 +++
 src/backend/utils/adt/pgstatfuncs.c  | 17 -
 src/include/catalog/pg_proc.dat  |  5 +++--
 src/test/regress/expected/stats.out  | 14 +++---
 src/test/regress/sql/stats.sql   | 14 +++---
 6 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e068f7e247..16a54179ae 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4730,6 +4730,8 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 pg_stat_wal view or
 recovery_prefetch to reset all the counters shown
 in the pg_stat_recovery_prefetch view.
+If the argument is NULL or not specified, all
+the counters from the views listed above are reset.


 This function is restricted to superusers by default, but other users
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 35d738d576..8079f1cd7f 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -621,6 +621,13 @@ LANGUAGE internal
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'unicode_is_normalized';
 
+CREATE OR REPLACE FUNCTION
+  pg_stat_reset_shared(target text DEFAULT NULL)
+RETURNS void
+LANGUAGE INTERNAL
+CALLED ON NULL INPUT VOLATILE PARALLEL SAFE
+AS 'pg_stat_reset_shared';
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1fb8b31863..a3a92c7b67 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1685,7 +1685,22 @@ pg_stat_reset(PG_FUNCTION_ARGS)
 Datum
 pg_stat_reset_shared(PG_FUNCTION_ARGS)
 {
-	char	   *target = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	char	   *target = NULL;
+
+	if (PG_ARGISNULL(0))
+	{
+		/* Reset all the statistics which can be specified by the argument */
+		pgstat_reset_of_kind(PGSTAT_KIND_ARCHIVER);
+		pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
+		pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
+		pgstat_reset_of_kind(PGSTAT_KIND_IO);
+		pgstat_reset_of_kind(PGSTAT_KIND_WAL);
+		XLogPrefetchResetStats();
+
+		PG_RETURN_VOID();
+	}
+
+	target = text_to_cstring(PG_GETARG_TEXT_PP(0));
 
 	if (strcmp(target, "archiver") == 0)
 		pgstat_reset_of_kind(PGSTAT_KIND_ARCHIVER);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f14aed422a..d36144fbc6 100644

Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

2023-11-09 Thread Thomas Munro
Here is a new attempt to fix this mess.  Disclaimer: this based
entirely on reading the manual and vicariously hacking a computer I
don't have via CI.

The two basic ideas are:

 * keep per-socket event handles in a hash table
 * add our own level-triggered event memory

The socket table entries are reference counted, and exist as long as
the socket is currently in at least one WaitEventSet.  When creating a
new entry, extra polling logic re-checks the initial level-triggered
state (an overhead that we had in an ad-hoc way already, and that can
be avoided by more widespread use of long lived WaitEventSet).  You
are not allowed to close a socket while it's in a WaitEventSet,
because then a new socket could be allocated with the same number and
chaos would ensue.  For example, if we revive the idea of hooking
libpq connections up to long-lived WaitEventSets, we'll probably need
to invent a libpq event callback that says 'I am going to close socket
X!', so you have a chance to remove the socket from any WaitEventSet
*before* it's closed, to maintain that invariant.  Other lazier ideas
are possible, but probably become impossible in a hypothetical
multi-threaded future.
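
To illustrate the shape of this (the names here are made up, not the actual
code from the patches), a per-socket table entry holds roughly:

#include <winsock2.h>   /* SOCKET, HANDLE */

typedef struct SocketTableEntry
{
    SOCKET  sock;       /* hash key: the socket itself */
    int     refcount;   /* number of WaitEventSets currently containing it */
    HANDLE  event;      /* long-lived event handle associated with the socket */
    int     pending;    /* our own level-triggered event memory (FD_READ etc.) */
} SocketTableEntry;

The entry is created when the socket is first added to any WaitEventSet and
freed when the last reference goes away.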

With these changes, AFAIK it should be safe to reinstate graceful
socket shutdowns, to fix the field complaints about FATAL error
messages being eaten by a grue and the annoying random CI/BF failures.

Here are some other ideas that I considered but rejected for now:

1.  We could throw the WAIT_USE_WIN32 code away, and hack
WAIT_USE_POLL to use WSAPoll() on Windows; we could create a
'self-pipe' using a pair of connected AF_UNIX sockets to implement
latches and fake signals.  It seems like a lot of work, and makes
latches a bit worse (instead of "everything is an event!" we have
"everything is a socket!" with a helper thread, and we don't even have
socketpair() on this OS).  Blah.

2.  We could figure out how to do fancy asynchronous sockets and IOCP.
That's how NT really wants to talk to the world, it doesn't really
want to pretend to be Unix.  I expect that is where we'll get to
eventually but it's a much bigger cross-platform R job.

3.  Maybe there is a kind of partial step towards idea 2 that Andres
mentioned on another thread somewhere: one could use an IOCP, and then
use event callbacks that run on system threads to post IOCP messages
(a bit like we do for our fake waitpid()).

What I have here is the simplest way I could see to patch up what we
already have, with the idea that in the fullness of time we'll
eventually get around to idea 2, once someone is ready to do the
press-ups.

Review/poking-with-a-stick/trying-to-break-it most welcome.
From 4614df55bd0e98b70b942543482e6a9eb767f718 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 1 Nov 2023 05:53:12 +1300
Subject: [PATCH v2 1/6] simplehash: Allow raw memory to be freed.

Commit 48995040d5e introduced SH_RAW_ALLOCATOR, but assumed that memory
allocated that way could be freed with pfree().  Allow SH_RAW_FREE to be
defined too, for cases where that isn't true.
---
 src/include/lib/simplehash.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index b7adc16b80..cd354e2f11 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -42,6 +42,7 @@
  *		declarations reside
  *	  - SH_RAW_ALLOCATOR - if defined, memory contexts are not used; instead,
  *	use this to allocate bytes. The allocator must zero the returned space.
+ *	  - SH_RAW_FREE - free operation corresponding to SH_RAW_ALLOCATOR
  *	  - SH_USE_NONDEFAULT_ALLOCATOR - if defined no element allocator functions
  *		are defined, so you can supply your own
  *	  The following parameters are only relevant when SH_DEFINE is defined:
@@ -410,7 +411,11 @@ SH_ALLOCATE(SH_TYPE * type, Size size)
 static inline void
 SH_FREE(SH_TYPE * type, void *pointer)
 {
+#ifdef SH_RAW_FREE
+	SH_RAW_FREE(pointer);
+#else
 	pfree(pointer);
+#endif
 }
 
 #endif
@@ -458,7 +463,11 @@ SH_SCOPE void
 SH_DESTROY(SH_TYPE * tb)
 {
 	SH_FREE(tb, tb->data);
+#ifdef SH_RAW_FREE
+	SH_RAW_FREE(tb);
+#else
 	pfree(tb);
+#endif
 }
 
 /* reset the contents of a previously created hash table */
-- 
2.42.0

From 3ee479b182ac53e4839da1156ef30cd9f2523de7 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 1 Nov 2023 06:51:38 +1300
Subject: [PATCH v2 2/6] simplehash: Allow raw allocation to fail.

Commit 48995040d5e allowed for raw allocators to be used instead of the
MemoryContext API, but didn't contemplate allocation failure.  Teach the
grow and insert operations to report failure to the caller.
---
 src/include/lib/simplehash.h | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index cd354e2f11..d1034baf67 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -205,7 +205,7 @@ SH_SCOPE void SH_DESTROY(SH_TYPE * tb);
 

Re: Synchronizing slots from primary to standby

2023-11-09 Thread shveta malik
On Thu, Nov 9, 2023 at 9:15 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 11/9/23 11:54 AM, shveta malik wrote:
> >
> > PFA v32 patches which has below changes:
>
> Thanks!
>
> > 7) Added warning for cases where a user-slot with the same name is
> > already present which slot-sync worker is trying to create. Sync for
> > such slots is skipped.
>
> I'm seeing assertion and segfault in this case due to ReplicationSlotRelease()
> in synchronize_one_slot().
>
> Adding this extra check prior to it:
>
> -   ReplicationSlotRelease();
> +   if (!(found && s->data.sync_state == SYNCSLOT_STATE_NONE))
> +   ReplicationSlotRelease();
>
> make them disappear.
>

Oh, I see. Thanks for pointing it out. I will fix it in the next version.

> >
> > Open Question:
> > 1) Currently I have put drop slot logic for slots with 'sync_state=i'
> > in slot-sync worker. Do we need to put it somewhere in promotion-logic
> > as well?
>
> Yeah I think so, because there is a time window when one could "use" the slot
> after the promotion and before it is removed. Producing things like:
>
> "
> 2023-11-09 15:16:50.294 UTC [2580462] LOG:  dropped replication slot 
> "logical_slot2" of dbid 5 as it was not sync-ready
> 2023-11-09 15:16:50.295 UTC [2580462] LOG:  dropped replication slot 
> "logical_slot3" of dbid 5 as it was not sync-ready
> 2023-11-09 15:16:50.297 UTC [2580462] LOG:  dropped replication slot 
> "logical_slot4" of dbid 5 as it was not sync-ready
> 2023-11-09 15:16:50.297 UTC [2580462] ERROR:  replication slot 
> "logical_slot5" is active for PID 2594628
> "
>
> After the promotion one was able to use logical_slot5 and now we can now drop 
> it.

Yes, I was suspicious about this small window which may allow others
to use this slot; that is why I was thinking of putting it in the
promotion flow and thus asked that question earlier. But the slot-sync
worker may end up creating it again in case it has not exited. So we
need to carefully decide at which places we need to put 'not-in-recovery'
checks in the slot-sync workers. In the previous version,
synchronize_one_slot() had that check and it skipped the sync if
'!RecoveryInProgress'. But I have removed that check in v32, thinking
that for the slots which the worker has already fetched from the primary,
we should let them all get synced and exit after that, instead of syncing
half and leaving the rest. But now, on rethinking, was the previous
behaviour correct, i.e. skip the sync from the point onward where we see
we are no longer in standby mode, even if a few of the slots have already
been synced in that sync cycle? Thoughts?

>
> > Perhaps in WaitForWALToBecomeAvailable() where we call
> > XLogShutdownWalRcv after checking 'CheckForStandbyTrigger'. Thoughts?
> >
>
> You mean here?
>
> /*
>   * Check to see if promotion is requested. Note that we do
>   * this only after failure, so when you promote, we still
>   * finish replaying as much as we can from archive and
>   * pg_wal before failover.
>   */
> if (StandbyMode && CheckForStandbyTrigger())
> {
> XLogShutdownWalRcv();
>  return XLREAD_FAIL;
> }
>

yes, here.

> If so, that sounds like a good place to me.

okay. Will add it.

>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com




Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Andres Freund
Hi,

On 2023-11-10 08:17:48 +0530, vignesh C wrote:
> On Thu, 9 Nov 2023 at 21:12, Tristan Partin  wrote:
> >
> > On Thu Nov 9, 2023 at 9:31 AM CST, Nazir Bilal Yavuz wrote:
> > > Hi,
> > >
> > > On Thu, 9 Nov 2023 at 18:27, Tristan Partin  wrote:
> > > >
> > > > Can you try with Meson v1.2.3?
> > >
> > > I tried with Meson v1.2.3 and upstream, both failed with the same error.
> >
> > Please open a bug in the Meson repository which also mentions the last
> > known working version. I wonder what versions of Meson we use in the
> > build farm.
> 
> Should we document the supported meson version that should be used for
> building from source?

It should be supported, I think we need to analyze the problem further
first. It's extremely odd that the problem only happens when invoked from one
version of powershell but not the other.  I assume there's a difference in
PATH leading to a different version of *something* being used.

Bilal, you apparently can repro the failure happening in one shell but not the
other? Could you send the PATH variable set in either shell and
meson-logs/meson-log.txt for both the working and non-working case?

Andres




Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread vignesh C
On Thu, 9 Nov 2023 at 21:12, Tristan Partin  wrote:
>
> On Thu Nov 9, 2023 at 9:31 AM CST, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Thu, 9 Nov 2023 at 18:27, Tristan Partin  wrote:
> > >
> > > Can you try with Meson v1.2.3?
> >
> > I tried with Meson v1.2.3 and upstream, both failed with the same error.
>
> Please open a bug in the Meson repository which also mentions the last
> known working version. I wonder what versions of Meson we use in the
> build farm.

Should we document the supported meson version that should be used for
building from source?

Regards,
Vignesh




Re: A recent message added to pg_upgade

2023-11-09 Thread Michael Paquier
On Thu, Nov 09, 2023 at 04:52:32PM +0900, Michael Paquier wrote:
> Thanks!

Please also see the patch about switching the logirep launcher to
rely on IsBinaryUpgrade to prevent its startup.  Any thoughts about
that?
--
Michael
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 501910b445..7eb0a5edc9 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -925,7 +925,8 @@ ApplyLauncherRegister(void)
 {
 	BackgroundWorker bgw;
 
-	if (max_logical_replication_workers == 0)
+	/* not started during upgrades */
+	if (max_logical_replication_workers == 0 || IsBinaryUpgrade)
 		return;
 
 	memset(&bgw, 0, sizeof(bgw));
diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index c24c5f69fc..04722ad306 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -248,19 +248,17 @@ start_postmaster(ClusterInfo *cluster, bool report_and_exit_on_error)
 	 * invalidation of slots during the upgrade. We set this option when
 	 * cluster is PG17 or later because logical replication slots can only be
 	 * migrated since then. Besides, max_slot_wal_keep_size is added in PG13.
-	 *
-	 * Use max_logical_replication_workers as 0 to prevent a startup of the
-	 * logical replication launcher while upgrading because it may start apply
-	 * workers that could start receiving changes from the publisher before
-	 * the physical files are put in place, causing corruption on the new
-	 * cluster upgrading to.  Like the previous parameter, this is set only
-	 * when a cluster is PG17 or later as logical slots can only be migrated
-	 * since this version.
 	 */
 	if (GET_MAJOR_VERSION(cluster->major_version) >= 1700)
-		appendPQExpBufferStr(, " -c max_slot_wal_keep_size=-1 -c max_logical_replication_workers=0");
+		appendPQExpBufferStr(, " -c max_slot_wal_keep_size=-1");
 
-	/* Use -b to disable autovacuum. */
+	/*
+	 * Use -b to disable autovacuum and logical replication launcher
+	 * (effective in PG17 or later for the latter).  Preventing these
+	 * processes from starting while upgrading avoids any activity on the new
+	 * cluster before the physical files are put in place, which could cause
+	 * corruption on the new cluster upgrading to.
+	 */
 	snprintf(cmd, sizeof(cmd),
 			 "\"%s/pg_ctl\" -w -l \"%s/%s\" -D \"%s\" -o \"-p %d -b%s %s%s\" start",
 			 cluster->bindir,


signature.asc
Description: PGP signature


Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-09 Thread Matthias van de Meent
On Fri, 10 Nov 2023 at 00:58, Peter Geoghegan  wrote:
> On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan  wrote:
> > If you end up finding a bug in this v6, it'll most likely be a case
> > where nbtree fails to live up to that. This project is as much about
> > robust/predictable performance as anything else -- nbtree needs to be
> > able to cope with practically anything. I suggest that your review
> > start by trying to break the patch along these lines.
>
> I spent some time on this myself today (which I'd already planned on).
>
> Attached is an adversarial stress-test, which shows something that
> must be approaching the worst case for the patch in terms of time
> spent with a buffer lock held, due to spending so much time evaluating
> unusually expensive SAOP index quals. The array binary searches that
> take place with a buffer lock held aren't quite like anything else
> that nbtree can do right now, so it's worthy of special attention.
>
> I thought of several factors that maximize both the number of binary
> searches within any given _bt_readpage, as well as the cost of each
> binary search -- the SQL file has full details. My test query is
> *extremely* unrealistic, since it combines multiple independent
> unrealistic factors, all of which aim to make life hard for the
> implementation in one way or another. I hesitate to say that it
> couldn't be much worse (I only spent a few hours on this), but I'm
> prepared to say that it seems very unlikely that any real world query
> could make the patch spend as many cycles in
> _bt_readpage/_bt_checkkeys as this one does.
>
> Perhaps you can think of some other factor that would make this test
> case even less sympathetic towards the patch, Matthias? The only thing
> I thought of that I've left out was the use of a custom btree opclass,
> "unrealistically_slow_ops". Something that calls pg_usleep in its
> order proc. (I left it out because it wouldn't prove anything.)

Have you tried using text index columns that are sorted with
non-default locales?
I've seen non-default locales use significantly more resources during
compare operations than any other ordering operation I know of (mostly
in finding the locale), and I use them extensively to test improvements
for worst-case index shapes over at my btree patchsets, because locales
are dynamically loaded in text compare and non-default locales are not
cached at all. I suspect that this would be even worse if a somehow
slower locale path is available than what I'm using for testing right
now; this could be the case with complex custom ICU locales.

> On my machine, custom instrumentation shows that each call to
> _bt_readpage made while this query executes (on a patched server)
> takes just under 1.4 milliseconds. While that is far longer than it
> usually takes, it's basically acceptable IMV. It's not significantly
> longer than I'd expect heap_index_delete_tuples() to take on an
> average day with EBS (or other network-attached storage). But that's a
> process that happens all the time, with an exclusive buffer lock held
> on the leaf page throughout -- whereas this is only a shared buffer
> lock, and involves a query that's just absurd .
>
> Another factor that makes this seem acceptable is just how sensitive
> the test case is to everything going exactly and perfectly wrong, all
> at the same time, again and again. The test case uses a 32 column
> index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP
> clauses (one per index column). If I reduce the number of SAOP clauses
> in the query to (say) 8, I still have a test case that's almost as
> silly as my original -- but now we only spend ~225 microseconds in
> each _bt_readpage call (i.e. we spend over 6x less time in each
> _bt_readpage call). (Admittedly if I also make the CREATE INDEX use
> only 8 columns, we can fit more index tuples on one page, leaving us
> at ~800 microseconds).

A quick update of the table definition to use the various installed
'fr-%-x-icu' locales on text hash columns (instead of numeric), with a
different collation for each column, gets me to EXPLAIN (ANALYZE)
showing 2.07ms spent per buffer hit inside the index scan node, as
opposed to 1.76ms when using numeric. But, as you mention, the value
of this metric is probably not very high.


As for the patch itself, I'm probably about 50% through the patch now.
While reviewing, I noticed the following two user-visible items,
related to SAOP but not broken by or touched upon in this patch:

1. We don't seem to plan `column opr ALL (...)` as index conditions,
while this should be trivial to optimize for at least btree. Example:

SET enable_bitmapscan = OFF;
WITH a AS (select generate_series(1, 1000) a)
SELECT * FROM tenk1
WHERE thousand = ANY (array(table a))
   AND thousand < ALL (array(table a));

This will never return any rows, but it does hit 9990 buffers in the
new btree code, while I expected that to be 0 buffers based on the
query and index (that is, I 

Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-11-09 Thread Andrey Chudnovsky
Hi Jacob,

Wanted to follow up on one of the topics discussed here in the past:
Do you plan to support adding an extension hook to validate the token?

It would allow a more efficient integration than spinning up a separate
process.

Thanks!
Andrey.

On Wed, Nov 8, 2023 at 11:00 AM Jacob Champion  wrote:

> On Fri, Nov 3, 2023 at 4:48 PM Jacob Champion 
> wrote:
> > Thanks for the nudge! Looks like I need to reconcile with the changes
> > to JsonLexContext in 1c99cde2. I should be able to get to that next
> > week; in the meantime I'll mark it Waiting on Author.
>
> v13 rebases over latest. The JsonLexContext changes have simplified
> 0001 quite a bit, and there's probably a bit more minimization that
> could be done.
>
> Unfortunately the configure/Makefile build of libpq now seems to be
> pulling in an `exit()` dependency in a way that Meson does not. (Or
> maybe Meson isn't checking?) I still need to investigate that
> difference and fix it, so I recommend Meson if you're looking to
> test-drive a build.
>
> Thanks,
> --Jacob
>


Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-09 Thread Peter Geoghegan
On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan  wrote:
> If you end up finding a bug in this v6, it'll most likely be a case
> where nbtree fails to live up to that. This project is as much about
> robust/predictable performance as anything else -- nbtree needs to be
> able to cope with practically anything. I suggest that your review
> start by trying to break the patch along these lines.

I spent some time on this myself today (which I'd already planned on).

Attached is an adversarial stress-test, which shows something that
must be approaching the worst case for the patch in terms of time
spent with a buffer lock held, due to spending so much time evaluating
unusually expensive SAOP index quals. The array binary searches that
take place with a buffer lock held aren't quite like anything else
that nbtree can do right now, so it's worthy of special attention.

I thought of several factors that maximize both the number of binary
searches within any given _bt_readpage, as well as the cost of each
binary search -- the SQL file has full details. My test query is
*extremely* unrealistic, since it combines multiple independent
unrealistic factors, all of which aim to make life hard for the
implementation in one way or another. I hesitate to say that it
couldn't be much worse (I only spent a few hours on this), but I'm
prepared to say that it seems very unlikely that any real world query
could make the patch spend as many cycles in
_bt_readpage/_bt_checkkeys as this one does.

Perhaps you can think of some other factor that would make this test
case even less sympathetic towards the patch, Matthias? The only thing
I thought of that I've left out was the use of a custom btree opclass,
"unrealistically_slow_ops". Something that calls pg_usleep in its
order proc. (I left it out because it wouldn't prove anything.)

On my machine, custom instrumentation shows that each call to
_bt_readpage made while this query executes (on a patched server)
takes just under 1.4 milliseconds. While that is far longer than it
usually takes, it's basically acceptable IMV. It's not significantly
longer than I'd expect heap_index_delete_tuples() to take on an
average day with EBS (or other network-attached storage). But that's a
process that happens all the time, with an exclusive buffer lock held
on the leaf page throughout -- whereas this is only a shared buffer
lock, and involves a query that's just absurd.

Another factor that makes this seem acceptable is just how sensitive
the test case is to everything going exactly and perfectly wrong, all
at the same time, again and again. The test case uses a 32 column
index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP
clauses (one per index column). If I reduce the number of SAOP clauses
in the query to (say) 8, I still have a test case that's almost as
silly as my original -- but now we only spend ~225 microseconds in
each _bt_readpage call (i.e. we spend over 6x less time in each
_bt_readpage call). (Admittedly if I also make the CREATE INDEX use
only 8 columns, we can fit more index tuples on one page, leaving us
at ~800 microseconds).

I'm a little surprised that it isn't a lot worse than this, given how
far I went. I was a little concerned that it would prove necessary to
lock this kind of thing down at some higher level (e.g., in the
planner), but that now looks unnecessary. There are much better ways
to DOS the server than this. For example, you could run this same
query while forcing a sequential scan! That appears to be quite a lot
less responsive to interrupts (in addition to being hopelessly slow),
probably because it uses parallel workers, each of which will use
wildly expensive filter quals that just do a linear scan of the SAOP.
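
For anyone reading along without the attached nemesis.sql, here is a very
rough sketch of the general shape of such an adversarial query (illustrative
only -- the table, index and array sizes are invented here, and the real test
uses a 32-column index with far larger arrays):

CREATE TABLE saop_stress (a int, b int, c int, d int);
CREATE INDEX saop_stress_idx ON saop_stress (a, b, c, d);
INSERT INTO saop_stress
SELECT i % 50, i % 40, i % 30, i % 20 FROM generate_series(1, 100000) i;

-- One ScalarArrayOp qual per index column, so each _bt_readpage call has to
-- binary-search several arrays while the leaf page is pinned and locked.
SELECT count(*)
FROM saop_stress
WHERE a = ANY (ARRAY(SELECT generate_series(0, 49)))
  AND b = ANY (ARRAY(SELECT generate_series(0, 39)))
  AND c = ANY (ARRAY(SELECT generate_series(0, 29)))
  AND d = ANY (ARRAY(SELECT generate_series(0, 19)));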

-- 
Peter Geoghegan


nemesis.sql
Description: Binary data


Re: Side effect of CVE-2017-7484 fix?

2023-11-09 Thread Bruce Momjian
On Thu, Nov  9, 2023 at 06:44:42PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Wed, Oct 24, 2018 at 04:01:29PM -0400, Robert Haas wrote:
> >>> This was complained of already,
> >>> https://www.postgresql.org/message-id/flat/3876.1531261875%40sss.pgh.pa.us
> 
> >> I guess you never followed up on that part, though.  Any special
> >> reason for that, or just lack of round tuits?
> 
> > Should this be added as a TODO item?
> 
> Huh?  We dealt with that a long time ago, cf 553d2ec27.

Oh, okay, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Side effect of CVE-2017-7484 fix?

2023-11-09 Thread Tom Lane
Bruce Momjian  writes:
> On Wed, Oct 24, 2018 at 04:01:29PM -0400, Robert Haas wrote:
>>> This was complained of already,
>>> https://www.postgresql.org/message-id/flat/3876.1531261875%40sss.pgh.pa.us

>> I guess you never followed up on that part, though.  Any special
>> reason for that, or just lack of round tuits?

> Should this be added as a TODO item?

Huh?  We dealt with that a long time ago, cf 553d2ec27.

regards, tom lane




Re: remove deprecated @@@ operator ?

2023-11-09 Thread Bruce Momjian
On Sun, Oct 21, 2018 at 04:24:16PM -0400, Tom Lane wrote:
> Oleg Bartunov  writes:
> > The  commit 9b5c8d45f62bd3d243a40cc84deb93893f2f5122 is now 10+ years
> > old, may be we could remove deprecated @@@ operator ?
> 
> Is it actually causing any problem?  AFAICS it's just a couple extra
> pg_operator entries, so why not leave it?
> 
> I'd be +1 for removing it from the docs, though ...

Done in the attached patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d963f0a0a0..e39c815532 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -13124,24 +13124,6 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple

   
 
-  
-   
-tsvector @@@ tsquery
-boolean
-   
-   
-tsquery @@@ tsvector
-boolean
-   
-   
-This is a deprecated synonym for @@.
-   
-   
-to_tsvector('fat cats ate rats') @@@ to_tsquery('cat  rat')
-t
-   
-  
-
   

 tsvector || tsvector
diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml
index d68d12d515..be02966e77 100644
--- a/doc/src/sgml/gin.sgml
+++ b/doc/src/sgml/gin.sgml
@@ -129,9 +129,6 @@
   tsvector_ops
   @@ (tsvector,tsquery)
  
- 
-  @@@ (tsvector,tsquery)
- 
 

   


Re: Side effect of CVE-2017-7484 fix?

2023-11-09 Thread Bruce Momjian
On Wed, Oct 24, 2018 at 04:01:29PM -0400, Robert Haas wrote:
> On Mon, Oct 22, 2018 at 9:47 AM Tom Lane  wrote:
> > Dilip Kumar  writes:
> > > As part of the security fix
> > > (e2d4ef8de869c57e3bf270a30c12d48c2ce4e00c), we have restricted the
> > > users from accessing the statistics of the table if the user doesn't
> > > have privileges on the table and the function is not leakproof.  Now,
> > > as a side effect of this, if the user has the privileges on the root
> > > partitioned table but does not have privilege on the child tables, the
> > > user will be able to access the data of the child table but it won't
> > > be able to access the statistics of the child table. This may result
> > > in a bad plan.
> >
> > This was complained of already,
> > https://www.postgresql.org/message-id/flat/3876.1531261875%40sss.pgh.pa.us
> 
> I guess you never followed up on that part, though.  Any special
> reason for that, or just lack of round tuits?

Should this be added as a TODO item?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: pg_walfile_name_offset can return inconsistent values

2023-11-09 Thread Bruce Momjian
On Fri, Nov 10, 2023 at 08:25:35AM +0900, Michael Paquier wrote:
> On Thu, Nov 09, 2023 at 04:14:07PM -0500, Bruce Momjian wrote:
> > Attached is the full patch that changes pg_walfile_name_offset() and
> > pg_walfile_name().  There is no need for doc changes.  We need to
> > document this as incompatible in case users are relying on the old
> > behavior for WAL archiving purposes.  If they want the old behavior they
> > need to check for an offset of zero and subtract one from the file name.
> 
> FWIW, I am not really convinced that there is a strong need to
> backpatch any of that.  There's a risk that some queries relying on
> the old behavior suddenly break after a minor release, and that's
> always annoying.  A HEAD-only change seems like a safer bet to me.

Yes, this cannot be backpatched, clearly.

> > Can someone check that all other calls to XLByteToPrevSeg() are correct?
> 
> On a quick check, all the other calls use that for end record LSNs, so
> that looks fine.

Thank you.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: pg_walfile_name_offset can return inconsistent values

2023-11-09 Thread Michael Paquier
On Thu, Nov 09, 2023 at 04:14:07PM -0500, Bruce Momjian wrote:
> Attached is the full patch that changes pg_walfile_name_offset() and
> pg_walfile_name().  There is no need for doc changes.  We need to
> document this as incompatible in case users are relying on the old
> behavior for WAL archiving purposes.  If they want the old behavior they
> need to check for an offset of zero and subtract one from the file name.

FWIW, I am not really convinced that there is a strong need to
backpatch any of that.  There's a risk that some queries relying on
the old behavior suddenly break after a minor release, and that's
always annoying.  A HEAD-only change seems like a safer bet to me.

> Can someone check that all other calls to XLByteToPrevSeg() are correct?

On a quick check, all the other calls use that for end record LSNs, so
that looks fine.
--
Michael


signature.asc
Description: PGP signature


Re: speed up a logical replica setup

2023-11-09 Thread Michael Paquier
On Thu, Nov 09, 2023 at 03:41:53PM +0100, Peter Eisentraut wrote:
> On 08.11.23 00:12, Michael Paquier wrote:
>> - Should the subdirectory pg_basebackup be renamed into something more
>> generic at this point?  All these things are frontend tools that deal
>> in some way with the replication protocol to do their work.  Say
>> a replication_tools?
> 
> Seems like unnecessary churn.  Nobody has complained about any of the other
> tools in there.

Not sure.  We rename things across releases in the tree from time to
time, and here that's straight-forward.

>> - And if it would be better to refactor some of the code generic to
>> all these streaming tools to fe_utils.  What makes streamutil.h a bit
>> less pluggable are all its extern variables to control the connection,
>> but perhaps that can be an advantage, as well, in some cases.
> 
> Does anyone outside of pg_basebackup + existing friends + new friend need
> that?  Seems like extra complications.

Actually, yes, I've used these utility routines in some past work, and
having the wrapper routines able to run the replication commands in
fe_utils would have been nicer than having to link to a source tree.
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade failing for 200+ million Large Objects

2023-11-09 Thread Andres Freund
Hi, 

On November 9, 2023 10:41:01 AM PST, Tom Lane  wrote:
>Also, pg_upgrade is often invoked indirectly via scripts, so I do
>not especially buy the idea that we're going to get useful control
>input from some human somewhere.  I think we'd be better off to
>assume that pg_upgrade is on its own to manage the process, so that
>if we need to switch strategies based on object count or whatever,
>we should put in a heuristic to choose the strategy automatically.
>It might not be perfect, but that will give better results for the
>pretty large fraction of users who are not going to mess with
>weird little switches.

+1 - even leaving everything else aside, just about no user would know about 
the option. There are cases where we can't do better than giving the user 
control, but we are certainly adding options at a rate that doesn't seem 
sustainable. And here it doesn't seem that hard to do better. 

Andres 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Atomic ops for unlogged LSN

2023-11-09 Thread Nathan Bossart
On Tue, Nov 07, 2023 at 04:58:16PM -0800, Andres Freund wrote:
> On 2023-11-07 11:02:49 -0600, Nathan Bossart wrote:
>> Is there something special about all other backends being shut down that
>> ensures this returns the most up-to-date value and not something from "some
>> point in the past" as the stated contract for this function seems to
>> suggest?
> 
> Practically yes - getting to the point of writing the shutdown checkpoint
> implies having gone through a bunch of code that implies memory barriers
> (spinlocks, lwlocks).

Sure.

> However, even if there's likely some other implied memory barrier that we
> could piggyback on, the patch much simpler to understand if it doesn't change
> coherency rules. There's no way the overhead could matter.

I wonder if it's worth providing a set of "locked read" functions.  Those
could just do a compare/exchange with 0 in the generic implementation.  For
patches like this one where the overhead really shouldn't matter, I'd
encourage folks to use those to make it easy to reason about correctness.
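
A minimal standalone sketch of that idea, using C11 atomics purely for
illustration (an in-tree version would presumably be built on the
pg_atomic_* wrappers instead):

#include <stdatomic.h>
#include <stdint.h>

/*
 * "Locked read": a compare-and-exchange with 0 behaves as an atomic
 * read-modify-write.  If the value is 0 it is swapped with 0 (a no-op);
 * otherwise the exchange fails and the current value is copied into
 * "expected", which we return.
 */
static inline uint64_t
locked_read_u64(_Atomic uint64_t *ptr)
{
	uint64_t	expected = 0;

	atomic_compare_exchange_strong(ptr, &expected, 0);
	return expected;
}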

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_walfile_name_offset can return inconsistent values

2023-11-09 Thread Bruce Momjian
On Thu, Nov  9, 2023 at 09:49:48PM +0100, Matthias van de Meent wrote:
> > I have attached fix #1 as offset1.diff and fix #2 as offset2.diff.
> 
> I believe you got the references wrong; fix #1 looks like the output
> of offset2's changes, and fix #2 looks like the result of offset1's
> changes.

Sorry, I swapped them around when I realized the order I was posting them
in the email, and got it wrong.

> Either way, I think fix #1 is most correct (as was attached in
> offset2.diff, and quoted verbatim here), because that has no chance of
> having surprising underflowing behaviour when you use '0/0'::lsn as
> input.

Attached is the full patch that changes pg_walfile_name_offset() and
pg_walfile_name().  There is no need for doc changes.  We need to
document this as incompatible in case users are relying on the old
behavior for WAL archiving purposes.  If they want the old behavior they
need to check for an offset of zero and subtract one from the file name.
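
Just to illustrate (this is not part of the patch), a query that wants the
old file name back could do something like the following, assuming the
pg_lsn minus numeric arithmetic available in recent releases; the names and
the '0/0' guard here are mine:

SELECT t.lsn,
       CASE WHEN o.file_offset = 0 AND t.lsn::pg_lsn <> '0/0'
            THEN pg_walfile_name(t.lsn::pg_lsn - 1)
            ELSE o.file_name
       END AS old_style_file_name,
       o.file_offset
FROM (VALUES ('0/16ffffff'), ('0/17000000'), ('0/17000001')) AS t(lsn),
     LATERAL pg_walfile_name_offset(t.lsn::pg_lsn) AS o;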

Can someone check that all other calls to XLByteToPrevSeg() are correct?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 45a70668b1..d8b300638b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -414,7 +414,7 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
 	/*
 	 * xlogfilename
 	 */
-	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
+	XLByteToSeg(locationpoint, xlogsegno, wal_segment_size);
 	XLogFileName(xlogfilename, GetWALInsertionTimeLine(), xlogsegno,
  wal_segment_size);
 
@@ -457,7 +457,7 @@ pg_walfile_name(PG_FUNCTION_ARGS)
  errhint("%s cannot be executed during recovery.",
 		 "pg_walfile_name()")));
 
-	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
+	XLByteToSeg(locationpoint, xlogsegno, wal_segment_size);
 	XLogFileName(xlogfilename, GetWALInsertionTimeLine(), xlogsegno,
  wal_segment_size);
 


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-09 Thread Andres Freund
Hi,

On 2023-11-08 13:10:34 +0530, Bharath Rupireddy wrote:
> > > + /*
> > > +  * The fact that we acquire WALBufMappingLock while reading 
> > > the WAL
> > > +  * buffer page itself guarantees that no one else 
> > > initializes it or
> > > +  * makes it ready for next use in AdvanceXLInsertBuffer(). 
> > > However, we
> > > +  * need to ensure that we are not reading a page that just 
> > > got
> > > +  * initialized. For this, we look at the needed page header.
> > > +  */
> > > + phdr = (XLogPageHeader) page;
> > > +
> > > + /* Return, if WAL buffer page doesn't look valid. */
> > > + if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > > +   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) 
> > > &&
> > > +   phdr->xlp_tli == tli))
> > > + break;
> >
> > I don't think this code should ever encounter a page where this is not the
> > case?  We particularly shouldn't do so silently, seems that could hide all
> > kinds of problems.
> 
> I think it's possible to read a "just got initialized" page with the
> new approach to read WAL buffer pages without WALBufMappingLock if the
> page is read right after it is initialized and xlblocks is filled in
> AdvanceXLInsertBuffer() but before actual WAL is written.

I think the code needs to make sure that *never* happens. That seems unrelated
to holding or not holding WALBufMappingLock. Even if the page header is
already valid, I don't think it's ok to just read/parse WAL data that's
concurrently being modified.

We can never allow WAL being read that's past
  XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
as it does not exist.

And if the to-be-read LSN is between
XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
we need to call WaitXLogInsertionsToFinish() before copying the data.

Greetings,

Andres Freund




Re: pg_walfile_name_offset can return inconsistent values

2023-11-09 Thread Matthias van de Meent
On Thu, 9 Nov 2023 at 20:22, Bruce Momjian  wrote:
> I know this bug report is four years old, but it is still a
> pg_walfile_name_offset() bug.  Here is the bug:
>
> SELECT *
> FROM (VALUES ('0/16ffffff'), ('0/17000000'), ('0/17000001')) AS t(lsn),
>      LATERAL pg_walfile_name_offset(lsn::pg_lsn);
>
>     lsn     |        file_name         | file_offset
> ------------+--------------------------+-------------
>  0/16ffffff | 000000010000000000000016 |    16777215
> -->  0/17000000 | 000000010000000000000016 |           0
>  0/17000001 | 000000010000000000000017 |           1
>
> The bug is in the indicated line --- it shows the filename as 000000010000000000000016 but
> offset as zero, when clearly the LSN is pointing to 17/0.  The bug is
> essentially that the code for pg_walfile_name_offset() uses the exact
> offset from the LSN, but uses the file name from the previous byte of
> the LSN.

Yes, that's definitely not a correct result.

> The fix involves deciding what the description or purpose of
> pg_walfile_name_offset() means, and adjusting it to be clearer.  The
> current documentation says:
>
> Converts a write-ahead log location to a WAL file name and byte
> offset within that file.
>
> Fix #1:  If we assume write-ahead log location means LSN, it is saying
> show the file/offset of the LSN, and that is most clearly:
>
>     lsn     |        file_name         | file_offset
> ------------+--------------------------+-------------
>  0/16ffffff | 000000010000000000000016 |    16777215
>  0/17000000 | 000000010000000000000017 |           0
>  0/17000001 | 000000010000000000000017 |           1
>
> Fix #2:  Now, there are some who have said they want the output to be
> the last written WAL byte (the byte before the LSN), not the current
> LSN, for archiving purposes.  However, if we do that, we have to update
> the docs to clarify it.  Its output would be:
>
>     lsn     |        file_name         | file_offset
> ------------+--------------------------+-------------
>  0/16ffffff | 000000010000000000000016 |    16777214
>  0/17000000 | 000000010000000000000016 |    16777215
>  0/17000001 | 000000010000000000000017 |           0
>
> I have attached fix #1 as offset1.diff and fix #2 as offset2.diff.

I believe you got the references wrong; fix #1 looks like the output
of offset2's changes, and fix #2 looks like the result of offset1's
changes.

Either way, I think fix #1 is most correct (as was attached in
offset2.diff, and quoted verbatim here), because that has no chance of
having surprising underflowing behaviour when you use '0/0'::lsn as
input.

> diff --git a/src/backend/access/transam/xlogfuncs.c 
> b/src/backend/access/transam/xlogfuncs.c
> index 45a70668b1..e65502d51e 100644
> --- a/src/backend/access/transam/xlogfuncs.c
> +++ b/src/backend/access/transam/xlogfuncs.c
> @@ -414,7 +414,7 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
> /*
>  * xlogfilename
>  */
> -XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
> +XLByteToSeg(locationpoint, xlogsegno, wal_segment_size);
> XLogFileName(xlogfilename, GetWALInsertionTimeLine(), xlogsegno,
>  wal_segment_size);

Kind regards,

Matthias van de Meent




Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Andrew Dunstan



On 2023-11-09 Th 10:42, Tristan Partin wrote:

On Thu Nov 9, 2023 at 9:31 AM CST, Nazir Bilal Yavuz wrote:

Hi,

On Thu, 9 Nov 2023 at 18:27, Tristan Partin  wrote:
>
> Can you try with Meson v1.2.3?

I tried with Meson v1.2.3 and upstream, both failed with the same error.


Please open a bug in the Meson repository which also mentions the last 
known working version. I wonder what versions of Meson we use in the 
build farm.



fairywren / drongo have 1.0.1


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_walfile_name_offset can return inconsistent values

2023-11-09 Thread Bruce Momjian
On Fri, Jul 26, 2019 at 11:30:19AM +0200, Jehan-Guillaume de Rorthais wrote:
> On Fri, 26 Jul 2019 17:21:20 +0900 (Tokyo Standard Time)
> Kyotaro Horiguchi  wrote:
> 
> > Hello.
> > 
> > While looking [1], I noticed that pg_walfile_name_offset behaves
> > somewhat oddly at segment boundary.
> > 
> > select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
> > t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
> >     lsn     |        file_name         | file_offset
> > ------------+--------------------------+-------------
> >  0/16ffffff | 000000020000000000000016 |    16777215
> >  0/17000000 | 000000020000000000000016 |           0
> >  0/17000001 | 000000020000000000000017 |           1
> > 
> > 
> > The file names are right as defined, but the return value of the
> > second line wrong, or at least misleading.
> 
> +1
> I noticed it as well and put this report on hold while working on my patch.
> Thanks for reporting this!
> 
> > It should be (16, 1000000) or (16, FFFFFF). The former is out-of-domain so
> > we
> > would have no way than choosing the latter. I'm not sure the purpose of
> > the second output parameter, thus the former might be right
> > decision.
> >
> > The function returns the following result after this patch is
> > applied.
> > 
> > select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
> > t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
> >     lsn     |        file_name         | file_offset
> > ------------+--------------------------+-------------
> >  0/16ffffff | 000000020000000000000016 |    16777214
> >  0/17000000 | 000000020000000000000016 |    16777215
> >  0/17000001 | 000000020000000000000017 |           0
> 
> So you shift the file offset for all LSN by one byte? This could lead to
> regression in various tools relying on this function.
> 
> Moreover, it looks weird as the LSN doesn't reflect the given offset anymore
> (FFFFFF <> 16777214, 01 <> 0, etc).
> 
> Another solution might be to return the same result for both 0/16ffffff
> and 0/17000000, but it doesn't feel right either.
> 
> So in fact, returning 0x1000000 seems to be the cleaner result to me.

I know this bug report is four years old, but it is still a
pg_walfile_name_offset() bug.  Here is the bug:

SELECT *
FROM (VALUES ('0/16ffffff'), ('0/17000000'), ('0/17000001')) AS t(lsn),
     LATERAL pg_walfile_name_offset(lsn::pg_lsn);

    lsn     |        file_name         | file_offset
------------+--------------------------+-------------
 0/16ffffff | 000000010000000000000016 |    16777215
-->  0/17000000 | 000000010000000000000016 |           0
 0/17000001 | 000000010000000000000017 |           1

The bug is in the indicated line --- it shows the filename as 000000010000000000000016 but
offset as zero, when clearly the LSN is pointing to 17/0.  The bug is
essentially that the code for pg_walfile_name_offset() uses the exact
offset from the LSN, but uses the file name from the previous byte of
the LSN.

The fix involves deciding what the description or purpose of
pg_walfile_name_offset() means, and adjusting it to be clearer.  The
current documentation says:

Converts a write-ahead log location to a WAL file name and byte
offset within that file.

Fix #1:  If we assume write-ahead log location means LSN, it is saying
show the file/offset of the LSN, and that is most clearly:

    lsn     |        file_name         | file_offset
------------+--------------------------+-------------
 0/16ffffff | 000000010000000000000016 |    16777215
 0/17000000 | 000000010000000000000017 |           0
 0/17000001 | 000000010000000000000017 |           1

Fix #2:  Now, there are some who have said they want the output to be
the last written WAL byte (the byte before the LSN), not the current
LSN, for archiving purposes.  However, if we do that, we have to update
the docs to clarify it.  Its output would be:

    lsn     |        file_name         | file_offset
------------+--------------------------+-------------
 0/16ffffff | 000000010000000000000016 |    16777214
 0/17000000 | 000000010000000000000016 |    16777215
 0/17000001 | 000000010000000000000017 |           0

The email thread also considered having the second row offset be 16777216
(2^24), which is not a valid offset for a file if we assume a zero-based
offset.

Looking further, pg_walfile_name() also returns the filename for the
previous byte:

SELECT *
FROM (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as t(lsn),
     LATERAL pg_walfile_name_offset(lsn::pg_lsn), LATERAL pg_walfile_name(lsn::pg_lsn);
    lsn     |        file_name         | file_offset |     pg_walfile_name
------------+--------------------------+-------------+--------------------------
 0/16ffffff | 000000010000000000000016 |    16777215 | 

Re: Fix output of zero privileges in psql

2023-11-09 Thread Tom Lane
Laurenz Albe  writes:
> Thanks for the feedback.  I'll set the patch to "ready for committer" then.

So, just to clarify, we're settling on your v4 from [1]?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/d799f996f422231a99655f1223667d6d887e4c95.ca...@cybertec.at




Re: pg_dump needs SELECT privileges on irrelevant extension table

2023-11-09 Thread Tom Lane
I wrote:
> Jacob Champion  writes:
>> Is this approach backportable?

> The code fix would surely work in the back branches.  Whether the
> behavioral change is too big to be acceptable in minor releases
> is something I don't have a strong opinion on.

I'm hearing nothing but crickets :-(

If nobody objects by say Monday, I'm going to go ahead and
commit (and backpatch) the patch I posted at [1].

regards, tom lane

[1] https://www.postgresql.org/message-id/2984517.1697577076%40sss.pgh.pa.us




Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-11-09 Thread Laurenz Albe
On Thu, 2023-11-09 at 15:59 +, Dean Rasheed wrote:
> On Thu, 9 Nov 2023 at 15:16, Laurenz Albe  wrote:
> > I have thought some more about this, and I believe that if FOR SELECT
> > policies are used to check new rows, you should be allowed to specify
> > WITH CHECK on FOR SELECT policies.  Why not allow a user to specify
> > different conditions for fetching from a table and for new rows after
> > an UPDATE?
> > 
> > The attached patch does that.  What do you think?
> 
> So you'd be able to write policies that allowed you to do an
> INSERT/UPDATE ... RETURNING, where the WITH CHECK part of the SELECT
> policy allowed you see the new row, but then if you tried to SELECT it
> later, the USING part of the policy might say no.
> 
> That seems pretty confusing. I would expect a row to either be visible
> or not, consistently across all commands.

I think it can be useful to allow a user an UPDATE where the result
does not satisfy the USING clause of the FOR SELECT policy.

True, it could surprise that you cannot SELECT something you just saw
with UPDATE ... RETURNING, but I would argue that these are different
operations.

The idea that an UPDATE should only produce rows you can SELECT is not
true today: if you run an UPDATE without a WHERE clause, you can
create rows you cannot see.  The restriction is only on UPDATEs with
a WHERE clause.  Weird, isn't it?

Yours,
Laurenz Albe




Re: pg_upgrade failing for 200+ million Large Objects

2023-11-09 Thread Tom Lane
[ Jacob's email address updated ]

"Kumar, Sachin"  writes:
> Hi everyone, I want to continue this thread. I have rebased the patch to the
> latest master and fixed an issue when pg_restore prints to a file.

Um ... you didn't attach the patch?

FWIW, I agree with Jacob's concern about it being a bad idea to let
users of pg_upgrade pass down arbitrary options to pg_dump/pg_restore.
I think we'd regret going there, because it'd hugely expand the set
of cases pg_upgrade has to deal with.

Also, pg_upgrade is often invoked indirectly via scripts, so I do
not especially buy the idea that we're going to get useful control
input from some human somewhere.  I think we'd be better off to
assume that pg_upgrade is on its own to manage the process, so that
if we need to switch strategies based on object count or whatever,
we should put in a heuristic to choose the strategy automatically.
It might not be perfect, but that will give better results for the
pretty large fraction of users who are not going to mess with
weird little switches.

regards, tom lane




Re: meson documentation build open issues

2023-11-09 Thread Andres Freund
Hi,

On 2023-11-09 15:32:39 +0100, Peter Eisentraut wrote:
> On 09.11.23 01:59, Andres Freund wrote:
> > > I think we could build doc/src/sgml/postgres-full.xml by default.  That
> > > takes less than 0.5 seconds here and it's an intermediate target for html
> > > and man.
> > That does require the docbook dtd to be installed, afaict. I think we would
> > need a configure test for that to be present if we want to build it by
> > default, otherwise we'll cause errors on plenty systems that don't get them
> > today.  The docbook dts aren't a huge dependency, but still. Some OSs might
> > not have a particularly install source for them, e.g. windows.
> 
> I was thinking we would do it only if the required tools are found.
> Basically like
> 
>  postgres_full_xml = custom_target('postgres-full.xml',
>input: 'postgres.sgml',
>output: 'postgres-full.xml',
>depfile: 'postgres-full.xml.d',
>command: [xmllint, '--nonet', '--noent', '--valid',
>  '--path', '@OUTDIR@', '-o', '@OUTPUT@', '@INPUT@'],
>depends: doc_generated,
> -  build_by_default: false,
> +  build_by_default: xmllint_bin.found(),
>  )

We don't get to that point if xmllint isn't found...


> Besides giving you a quick validity check of the XML, this also builds the
> doc_generated, which draw from non-doc source files, so this would also
> serve to check that those are sound and didn't mess up the docs.

Unfortunately presence of xmllint doesn't guarantee presence of the relevant
DTDs. Without docbook-xml installed, you'll get something like

../../../../../home/andres/src/postgresql/doc/src/sgml/postgres.sgml:21: 
warning: failed to load external entity 
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd;
]>
  ^

and a bunch of other subsequent errors.


I think if we want to do this, we'd need a configure time check for being able
to validate a document with
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"...

That's certainly doable. If we go there, we imo also should check if the
relevant xslt stylesheets are installed.

Greetings,

Andres Freund




Re: pg_upgrade failing for 200+ million Large Objects

2023-11-09 Thread Kumar, Sachin




Hi everyone, I want to continue this thread. I have rebased the patch to the
latest master and fixed an issue when pg_restore prints to a file.

`
╰─$ pg_restore  dump_small.custom  --restore-blob-batch-size=2 --file=a
--
-- End BLOB restore batch
--
COMMIT;
`

> On 09/11/2023, 17:05, "Jacob Champion"  > wrote:
> To clarify, I agree that pg_dump should contain the core fix. What I'm
> questioning is the addition of --dump-options to make use of that fix
> from pg_upgrade, since it also lets the user do "exciting" new things
> like --exclude-schema and --include-foreign-data and so on. I don't
> think we should let them do that without a good reason.

The earlier idea was to not expose these options to users and instead use [1]:
   --restore-jobs=NUM --jobs parameter passed to pg_restore
   --restore-blob-batch-size=NUM  number of blobs restored in one xact
But this was later expanded to use --dump-options and --restore-options [2].
With --restore-options a user can use --exclude-schema, so maybe we can go
back to [1].

[1] 
https://www.postgresql.org/message-id/a1e200e6-adde-2561-422b-a166ec084e3b%40wi3ck.info
[2] 
https://www.postgresql.org/message-id/8d8d3961-8e8b-3dbe-f911-6f418c5fb1d3%40wi3ck.info

Regards
Sachin
Amazon Web Services: https://aws.amazon.com




Re: SQL:2011 application time

2023-11-09 Thread Paul Jungwirth

On 11/9/23 05:47, Peter Eisentraut wrote:
I went over the patch 
v17-0001-Add-temporal-PRIMARY-KEY-and-UNIQUE-constraints.patch in more 
detail


Thanks Peter!

I'm about halfway through jian he's last two emails. I'll address your 
feedback also. I wanted to reply to this without waiting though:


Overall, with these fixes, I think this patch is structurally okay.  We 
just need to make sure we have all the weird corner cases covered.


One remaining issue I know about is with table partitions whose column 
order has changed. I've got an in-progress fix for that, but I've been 
prioritizing reviewer feedback the last few months. Just want to make 
sure you know about it for now.


Thanks!

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Cleaning up array_in()

2023-11-09 Thread Tom Lane
I wrote:
> This comes out when you write something like '{foo"bar"}', and I'd
> say the choice of message is not great.  On the other hand, it's
> consistent with what you get from '{"foo""bar"}', and if we wanted
> to change that too then some tweaking of the state machine in
> ReadArrayStr would be required (or else modify ReadArrayToken so
> it doesn't return instantly upon seeing the second quote mark).
> I'm not sure that this is worth messing with.

After further thought I concluded that this area is worth spending
a little more code for.  If we have input like '{foo"bar"}' or
'{"foo"bar}' or '{"foo""bar"}', what it most likely means is that
the user misunderstood the quoting rules.  A message like "Unexpected
array element" is pretty completely unhelpful for figuring that out.
The alternative I was considering, "Unexpected """ character", would
not be much better.  What we want to say is something like "Incorrectly
quoted array element", and the attached v12 makes ReadArrayToken do
that for both quoted and unquoted cases.
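
For concreteness, the inputs in question look like this (the exact error
wording is whatever v12 settles on):

-- with v12, each of these should report an "incorrectly quoted array
-- element" style error rather than "Unexpected array element".
SELECT '{foo"bar"}'::text[];
SELECT '{"foo"bar}'::text[];
SELECT '{"foo""bar"}'::text[];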

I also fixed the obsolete documentation that Alexander noted, and
cleaned up a couple other infelicities (notably, I'd blindly written
ereport(ERROR) in one place where ereturn is now the way).

Barring objections, I think v12 is committable.

regards, tom lane

diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml
index 56185b9b03..ce338c770c 100644
--- a/doc/src/sgml/array.sgml
+++ b/doc/src/sgml/array.sgml
@@ -171,7 +171,8 @@ INSERT INTO sal_emp
 VALUES ('Bill',
 '{1, 1, 1, 1}',
 '{{"meeting", "lunch"}, {"meeting"}}');
-ERROR:  multidimensional arrays must have array expressions with matching dimensions
+ERROR:  malformed array literal: "{{"meeting", "lunch"}, {"meeting"}}"
+DETAIL:  Multidimensional arrays must have sub-arrays with matching dimensions.
 
  
 
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index 3ff13eb419..b0532bb45b 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -54,18 +54,16 @@ bool		Array_nulls = true;
 			PG_FREE_IF_COPY(array, n); \
 	} while (0)
 
+/* ReadArrayToken return type */
 typedef enum
 {
-	ARRAY_NO_LEVEL,
-	ARRAY_LEVEL_STARTED,
-	ARRAY_ELEM_STARTED,
-	ARRAY_ELEM_COMPLETED,
-	ARRAY_QUOTED_ELEM_STARTED,
-	ARRAY_QUOTED_ELEM_COMPLETED,
-	ARRAY_ELEM_DELIMITED,
-	ARRAY_LEVEL_COMPLETED,
-	ARRAY_LEVEL_DELIMITED,
-} ArrayParseState;
+	ATOK_LEVEL_START,
+	ATOK_LEVEL_END,
+	ATOK_DELIM,
+	ATOK_ELEM,
+	ATOK_ELEM_NULL,
+	ATOK_ERROR,
+} ArrayToken;
 
 /* Working state for array_iterate() */
 typedef struct ArrayIteratorData
@@ -91,15 +89,21 @@ typedef struct ArrayIteratorData
 	int			current_item;	/* the item # we're at in the array */
 }			ArrayIteratorData;
 
-static int	ArrayCount(const char *str, int *dim, char typdelim,
-	   Node *escontext);
-static bool ReadArrayStr(char *arrayStr, const char *origStr,
-		 int nitems, int ndim, int *dim,
+static bool ReadArrayDimensions(char **srcptr, int *ndim_p,
+int *dim, int *lBound,
+const char *origStr, Node *escontext);
+static bool ReadDimensionInt(char **srcptr, int *result,
+			 const char *origStr, Node *escontext);
+static bool ReadArrayStr(char **srcptr,
 		 FmgrInfo *inputproc, Oid typioparam, int32 typmod,
 		 char typdelim,
 		 int typlen, bool typbyval, char typalign,
-		 Datum *values, bool *nulls,
-		 bool *hasnulls, int32 *nbytes, Node *escontext);
+		 int *ndim_p, int *dim,
+		 int *nitems_p,
+		 Datum **values_p, bool **nulls_p,
+		 const char *origStr, Node *escontext);
+static ArrayToken ReadArrayToken(char **srcptr, StringInfo elembuf, char typdelim,
+ const char *origStr, Node *escontext);
 static void ReadArrayBinary(StringInfo buf, int nitems,
 			FmgrInfo *receiveproc, Oid typioparam, int32 typmod,
 			int typlen, bool typbyval, char typalign,
@@ -185,12 +189,10 @@ array_in(PG_FUNCTION_ARGS)
 	char		typalign;
 	char		typdelim;
 	Oid			typioparam;
-	char	   *string_save,
-			   *p;
-	int			i,
-nitems;
-	Datum	   *dataPtr;
-	bool	   *nullsPtr;
+	char	   *p;
+	int			nitems;
+	Datum	   *values;
+	bool	   *nulls;
 	bool		hasnulls;
 	int32		nbytes;
 	int32		dataoffset;
@@ -233,104 +235,34 @@ array_in(PG_FUNCTION_ARGS)
 	typdelim = my_extra->typdelim;
 	typioparam = my_extra->typioparam;
 
-	/* Make a modifiable copy of the input */
-	string_save = pstrdup(string);
+	/* Initialize dim, lBound for ReadArrayDimensions, ReadArrayStr */
+	for (int i = 0; i < MAXDIM; i++)
+	{
+		dim[i] = -1;			/* indicates "not yet known" */
+		lBound[i] = 1;			/* default lower bound */
+	}
 
 	/*
-	 * If the input string starts with dimension info, read and use that.
-	 * Otherwise, we require the input to be in curly-brace style, and we
-	 * prescan the input to determine dimensions.
+	 * Start processing the input string.
 	 *
-	 * Dimension info takes the form of one or more [n] or [m:n] 

Re: Postgres Partitions Limitations (5.11.2.3)

2023-11-09 Thread shihao zhong
That looks good to me!

The new status of this patch is: Ready for Committer


Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-11-09 Thread Maxim Orlov
Aleksander Alekseev,

> Maxim,
> I see both of us accounted for Alexanders feedback and submitted v59.
> Your newer version seems to have issues on cfbot, so resubmitting the
> previous patchset that passes the tests. Please feel free to add
> changes.

For unknown reasons, I have not received any of your emails sent after
2023-11-07 11:57:12 (Message-ID: CAJ7c6TN1hKqNPGrNcq96SUyD=
z61rakgxf8iqq36qr90oud...@mail.gmail.com).
Even after a resend.

Anyway, PFA patch set version 61.  I've made some minor changes in 0001 and
added 0004 in order to test the actual 64-bit SLRU page case.

As for the CF bot failure on my v59 patch set, this is because of a bug.
It manifested because of the added 64-bit page tests.
The problem was in the segno calculation, since we convert it from the file
name using a strtol call.  But strtol returns long, which is 4 bytes on x86.

-   segno = (int) strtol(clde->d_name, NULL, 16);
+   segno = strtoi64(clde->d_name, NULL, 16);

-- 
Best regards,
Maxim Orlov.


v61-0003-Make-use-FullTransactionId-in-2PC-filenames.patch
Description: Binary data


v61-0004-Add-SLRU-tests-for-64-bit-page-case.patch
Description: Binary data


v61-0002-Use-larger-segment-file-names-for-pg_notify.patch
Description: Binary data


v61-0001-Index-SLRUs-by-64-bit-integers-rather-than-by-32.patch
Description: Binary data


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2023-11-09 Thread Heikki Linnakangas

On 18/09/2023 07:08, David Rowley wrote:

On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas  wrote:

I've added a call to LockAssertNoneHeld(false) in there.


I don't see it in the patch?


hmm. I must've git format-patch before committing that part.

I'll try that again... see attached.


This needed a rebase after my ResourceOwner refactoring. Attached.

A few quick comments:

- It would be nice to add a test for the issue that you fixed in patch 
v7, i.e. if you prepare a transaction while holding session-level locks.


- GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you 
are out of memory. Is that handled gracefully or is the lock leaked?


--
Heikki Linnakangas
Neon (https://neon.tech)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 74ce5f9491c..8c8585be7ab 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2415,6 +2415,7 @@ PrepareTransaction(void)
 	TransactionId xid = GetCurrentTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
+	HTAB		   *sessionandxactlocks;
 
 	Assert(!IsInParallelMode());
 
@@ -2560,7 +2561,17 @@ PrepareTransaction(void)
 	StartPrepare(gxact);
 
 	AtPrepare_Notify();
-	AtPrepare_Locks();
+
+	/*
+	 * Prepare the locks and save the returned hash table that describes if
+	 * the lock is held at the session and/or transaction level.  We need to
+	 * know if we're dealing with session locks inside PostPrepare_Locks(),
+	 * but we're unable to build the hash table there due to that function
+	 * only discovering if we're dealing with a session lock while we're in a
+	 * critical section, in which case we can't allocate memory for the hash
+	 * table.
+	 */
+	sessionandxactlocks = AtPrepare_Locks();
 	AtPrepare_PredicateLocks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
@@ -2587,7 +2598,10 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(xid, sessionandxactlocks);
+
+	/* We no longer need this hash table */
+	hash_destroy(sessionandxactlocks);
 
 	/*
 	 * Let others know about no transaction in progress by me.  This has to be
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2ee..5baf83ac6ce 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,12 @@ DiscardAll(bool isTopLevel)
 	ResetAllOptions();
 	DropAllPreparedStatements();
 	Async_UnlistenAll();
-	LockReleaseAll(USER_LOCKMETHOD, true);
+	LockReleaseSession(USER_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+	LockAssertNoneHeld(false);
+#endif
+
 	ResetPlanCache();
 	ResetTempTableNamespace();
 	ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 501910b4454..835f112f751 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -841,7 +841,11 @@ logicalrep_worker_onexit(int code, Datum arg)
 	 * The locks will be acquired once the worker is initialized.
 	 */
 	if (!InitializingApplyWorker)
-		LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+		LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+	LockAssertNoneHeld(false);
+#endif
 
 	ApplyLauncherWakeup();
 }
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 45de0fd2bd6..e7e0f29347a 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
 subset of the PGPROC object's heldLocks mask (if the PGPROC is
 currently waiting for another lock mode on this lock).
 
-releaseMask -
-A bitmask for the lock modes due to be released during LockReleaseAll.
-This must be a subset of the holdMask.  Note that it is modified without
-taking the partition LWLock, and therefore it is unsafe for any
-backend except the one owning the PROCLOCK to examine/change it.
-
 lockLink -
 List link for shared memory queue of all the PROCLOCK objects for the
 same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b8c57b3e16e..ce54b589b0d 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
  *	Interface:
  *
  *	InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- *	LockAcquire(), LockRelease(), LockReleaseAll(),
- *	LockCheckConflicts(), GrantLock()
+ *	LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
  *
  *-
  */
@@ -162,6 +161,16 @@ typedef struct TwoPhaseLockRecord
 	LOCKMODE	lockmode;
 } TwoPhaseLockRecord;
 
+/*
+ * Used as a hash table entry type in AtPrepare_Locks() to communicate the
+ * session/xact lock status of each held lock for use in PostPrepare_Locks().
+ */
+typedef struct 

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-09 Thread Robert Haas
On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar  wrote:
> Here is the updated version of the patch, here I have taken the
> approach suggested by Andrey and I discussed the same with Alvaro
> offlist and he also agrees with it.  So the idea is that we will keep
> the bank size fixed which is 16 buffers per bank and the allowed GUC
> value for each slru buffer must be in multiple of the bank size.  We
> have removed the centralized lock but instead of one lock per bank, we
> have kept the maximum limit on the number of bank locks which is 128.
> We kept the max limit as 128 because, in one of the operations (i.e.
> ActivateCommitTs), we need to acquire all the bank locks (but this is
> not a performance path at all) and at a time we can acquire a max of
> 200 LWlocks, so we think this limit of 128 is good.  So now if the
> number of banks is <= 128 then we will be using one lock per bank
> otherwise the one lock may protect access of buffer in multiple banks.

Just so I understand, I guess this means that an SLRU is limited to
16*128 = 2k buffers = 16MB?

When we were talking about this earlier, I suggested fixing the number
of banks and allowing the number of buffers per bank to scale
depending on the setting. That seemed simpler than allowing both the
number of banks and the number of buffers to vary, and it might allow
the compiler to optimize some code better, by converting a calculation
like page_no%number_of_banks into a masking operation like page_no&0xf
or whatever. However, because it allows an individual bank to become
arbitrarily large, it more or less requires us to use a buffer mapping
table. Some of the performance problems mentioned could be alleviated
by omitting the hash table when the number of buffers per bank is
small, and we could also create the dynahash with a custom hash
function that just does modular arithmetic on the page number rather
than a real hashing operation. However, maybe we don't really need to
do any of that. I agree that dynahash is clunky on a good day. I
hadn't realized the impact would be so noticeable.
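
Just to make the masking remark above concrete, here is a standalone sketch
of the two mapping styles (illustrative only, not code taken from the patch):

#include <stdint.h>

/* General mapping: works for any number of banks. */
static inline int
page_to_bank(int64_t page_no, int nbanks)
{
	return (int) (page_no % nbanks);
}

/* Cheaper variant, valid only when nbanks is a power of two. */
static inline int
page_to_bank_pow2(int64_t page_no, int nbanks)
{
	return (int) (page_no & (nbanks - 1));
}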

This proposal takes the opposite approach of fixing the number of
buffers per bank, letting the number of banks vary. I think that's
probably fine, although it does reduce the effective associativity of
the cache. If there are more hot buffers in a bank than the bank size,
the bank will be contended, even if other banks are cold. However,
given the way SLRUs are accessed, it seems hard to imagine this being
a real problem in practice. There aren't likely to be say 20 hot
buffers that just so happen to all be separated from one another by a
number of pages that is a multiple of the configured number of banks.
And in the seemingly very unlikely event that you have a workload that
behaves like that, you could always adjust the number of banks up or
down by one, and the problem would go away. So this seems OK to me.

I also agree with a couple of points that Alvaro made, specifically
that (1) this doesn't have to be perfect, just better than now and (2)
separate GUCs for each SLRU is fine. On the latter point, it's worth
keeping in mind that the cost of a GUC that most people don't need to
tune is fairly low. GUCs like work_mem and shared_buffers are
"expensive" because everybody more or less needs to understand what
they are and how to set them and getting the right value can tricky --
but a GUC like autovacuum_naptime is a lot cheaper, because almost
nobody needs to change it. It seems to me that these GUCs will fall
into the latter category. Users can hopefully just ignore them except
if they see a contention on the SLRU bank locks -- and then they can
consider increasing the number of banks for that particular SLRU. That
seems simple enough. As with autovacuum_naptime, there is a danger
that people will configure a ridiculous value of the parameter for no
good reason and get bad results, so it would be nice if someday we had
a magical system that just got all of this right without the user
needing to configure anything. But in the meantime, it's better to
have a somewhat manual system to relieve pressure on these locks than
no system at all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-11-09 Thread Dean Rasheed
On Thu, 9 Nov 2023 at 15:16, Laurenz Albe  wrote:
>
> I have thought some more about this, and I believe that if FOR SELECT
> policies are used to check new rows, you should be allowed to specify
> WITH CHECK on FOR SELECT policies.  Why not allow a user to specify
> different conditions for fetching from a table and for new rows after
> an UPDATE?
>
> The attached patch does that.  What do you think?
>

So you'd be able to write policies that allowed you to do an
INSERT/UPDATE ... RETURNING, where the WITH CHECK part of the SELECT
policy allowed you see the new row, but then if you tried to SELECT it
later, the USING part of the policy might say no.

That seems pretty confusing. I would expect a row to either be visible
or not, consistently across all commands.

Regards,
Dean




Re: Synchronizing slots from primary to standby

2023-11-09 Thread Drouvot, Bertrand

Hi,

On 11/9/23 11:54 AM, shveta malik wrote:


PFA v32 patches which has below changes:


Thanks!
  

7) Added warning for cases where a user-slot with the same name is
already present which slot-sync worker is trying to create. Sync for
such slots is skipped.


I'm seeing assertion and segfault in this case due to ReplicationSlotRelease()
in synchronize_one_slot().

Adding this extra check prior to it:

-   ReplicationSlotRelease();
+   if (!(found && s->data.sync_state == SYNCSLOT_STATE_NONE))
+   ReplicationSlotRelease();

make them disappear.



Open Question:
1) Currently I have put drop slot logic for slots with 'sync_state=i'
in slot-sync worker. Do we need to put it somewhere in promotion-logic
as well? 


Yeah I think so, because there is a time window when one could "use" the slot
after the promotion and before it is removed. Producing things like:

"
2023-11-09 15:16:50.294 UTC [2580462] LOG:  dropped replication slot 
"logical_slot2" of dbid 5 as it was not sync-ready
2023-11-09 15:16:50.295 UTC [2580462] LOG:  dropped replication slot 
"logical_slot3" of dbid 5 as it was not sync-ready
2023-11-09 15:16:50.297 UTC [2580462] LOG:  dropped replication slot 
"logical_slot4" of dbid 5 as it was not sync-ready
2023-11-09 15:16:50.297 UTC [2580462] ERROR:  replication slot "logical_slot5" 
is active for PID 2594628
"

After the promotion one was able to use logical_slot5 and now we can now drop 
it.


Perhaps in WaitForWALToBecomeAvailable() where we call
XLogShutdownWalRcv after checking 'CheckForStandbyTrigger'. Thoughts?



You mean here?

/*
 * Check to see if promotion is requested. Note that we do
 * this only after failure, so when you promote, we still
 * finish replaying as much as we can from archive and
 * pg_wal before failover.
 */
if (StandbyMode && CheckForStandbyTrigger())
{
XLogShutdownWalRcv();
return XLREAD_FAIL;
}

If so, that sounds like a good place to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Tristan Partin

On Thu Nov 9, 2023 at 9:31 AM CST, Nazir Bilal Yavuz wrote:

Hi,

On Thu, 9 Nov 2023 at 18:27, Tristan Partin  wrote:
>
> Can you try with Meson v1.2.3?

I tried with Meson v1.2.3 and upstream, both failed with the same error.


Please open a bug in the Meson repository which also mentions the last 
known working version. I wonder what versions of Meson we use in the 
build farm.


--
Tristan Partin
Neon (https://neon.tech)




Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

On Thu, 9 Nov 2023 at 18:27, Tristan Partin  wrote:
>
> Can you try with Meson v1.2.3?

I tried with Meson v1.2.3 and upstream, both failed with the same error.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Tristan Partin

Can you try with Meson v1.2.3?

--
Tristan Partin
Neon (https://neon.tech)




Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-11-09 Thread Laurenz Albe
On Wed, 2023-10-25 at 09:45 +0200, Laurenz Albe wrote:
> I can accept that the error is intentional, even though it violated the
> POLA for me.  I can buy into the argument that an UPDATE should not make
> a row seem to vanish.
> 
> I cannot buy into the constraint argument.  If the table owner wanted to
> prevent you from causing a constraint violation error with a row you
> cannot see, she wouldn't have given you a FOR UPDATE policy that allows
> you to perform such an UPDATE.
> 
> Anyway, it is probably too late to change a behavior that has been like
> that for a while and is not manifestly buggy.

I have thought some more about this, and I believe that if FOR SELECT
policies are used to check new rows, you should be allowed to specify
WITH CHECK on FOR SELECT policies.  Why not allow a user to specify
different conditions for fetching from a table and for new rows after
an UPDATE?

The attached patch does that.  What do you think?
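
To illustrate the intended usage, a hypothetical example (the table, role and
conditions are invented, not taken from the patch's regression tests):

CREATE TABLE accounts (id integer, owner text, amount numeric);
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

-- The USING clause still controls which rows can be fetched, while the new
-- WITH CHECK clause is what rows written by INSERT/UPDATE are checked
-- against.
CREATE POLICY accounts_sel ON accounts
    FOR SELECT
    USING (owner = current_user)
    WITH CHECK (owner = current_user OR owner = 'auditor');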

Yours,
Laurenz Albe
From c1bf1cb39962690933e4a37af0ab8b00926a0020 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Thu, 9 Nov 2023 16:09:10 +0100
Subject: [PATCH] Allow WITH CHECK on FOR SELECT policies

FOR SELECT or FOR ALL policies are used to check the new row
after an UPDATE, but you could not define a WITH CHECK clause
on FOR SELECT policies, so the USING clause was used.
This patch adds the capability to define WITH CHECK clauses on
FOR SELECT policies, which allows the user to specify different
criteria for fetching rows from the table and for new rows
after an UPDATE.

Author: Laurenz Albe
Discussion: https://postgr.es/m/aee893f1ec3ca8f62a0da2fc2f9f8b73920f9f9d.camel%40cybertec.at
---
 doc/src/sgml/ref/create_policy.sgml   | 35 +++
 src/backend/commands/policy.c | 14 -
 src/backend/rewrite/rowsecurity.c |  2 +-
 src/test/regress/expected/rowsecurity.out | 14 +
 src/test/regress/sql/rowsecurity.sql  | 11 +++
 5 files changed, 55 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/create_policy.sgml b/doc/src/sgml/ref/create_policy.sgml
index e76c342d3d..d473bb1094 100644
--- a/doc/src/sgml/ref/create_policy.sgml
+++ b/doc/src/sgml/ref/create_policy.sgml
@@ -82,7 +82,7 @@ CREATE POLICY name ON 
For policies that can have both USING
-   and WITH CHECK expressions (ALL
+   and WITH CHECK expressions (ALL, SELECT
and UPDATE), if no WITH CHECK
expression is defined, then the USING expression will be
used both to determine which rows are visible (normal
@@ -270,9 +270,11 @@ CREATE POLICY name ON SELECT permissions, such as
  UPDATE, will also only see those records
  that are allowed by the SELECT policy.
- A SELECT policy cannot have a WITH
- CHECK expression, as it only applies in cases where
- records are being retrieved from the relation.
+ In addition, rows written by an INSERT or
+ UPDATE statement are checked against the
+ WITH CHECK expression of SELECT
+ policies on the table (or, if there is no WITH
+ CHECK expression, the USING expression).

   
  
@@ -402,14 +404,17 @@ CREATE POLICY name ON 
 Policies Applied by Command Type
-
- 
- 
+
+ 
+ 
+ 
+ 
+ 
  
  
   
Command
-   SELECT/ALL policy
+   SELECT/ALL policy
INSERT/ALL policy
UPDATE/ALL policy
DELETE/ALL policy
@@ -417,6 +422,7 @@ CREATE POLICY name ON 
USING expression
WITH CHECK expression
+   WITH CHECK expression
USING expression
WITH CHECK expression
USING expression
@@ -430,11 +436,13 @@ CREATE POLICY name ON 


+   
   
   
SELECT FOR UPDATE/SHARE
Existing row

+   
Existing row


@@ -442,6 +450,7 @@ CREATE POLICY name ON 
INSERT / MERGE ... THEN INSERT

+   
New row


@@ -449,6 +458,7 @@ CREATE POLICY name ON 
   
INSERT ... RETURNING
+   

 New row 
  
@@ -465,9 +475,8 @@ CREATE POLICY name ON 
   
UPDATE / MERGE ... THEN UPDATE
-   
-Existing  new rows 
-   
+   Existing row 
+   New row

Existing row
New row
@@ -481,11 +490,13 @@ CREATE POLICY name ON 


+   
Existing row
   
   
ON CONFLICT DO UPDATE
-   Existing  new rows
+   Existing row
+   New row

Existing row
New row
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index 76a45e56bf..e3e19167fe 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -597,13 +597,12 @@ CreatePolicy(CreatePolicyStmt *stmt)
 	polcmd = parse_policy_command(stmt->cmd_name);
 
 	/*
-	 * If the command is SELECT or DELETE then WITH CHECK should be NULL.
+	 * If the command is DELETE then WITH 

Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

Adding Kyotaro to CC because Kyotaro reported a similar issue before [1].

On Thu, 9 Nov 2023 at 11:59, Shlok Kyal  wrote:
>
> Hi,
> I am trying to build postgres with meson on Windows. And I am stuck in
> the process.
>
> Steps I followed:
>
> 1. I clone postgres repo
>
> 2.Installed meson and ninja
> pip install meson ninja
>
> 3. Then running following command:
> meson setup build --buildtype debug
>
> 4. Then I ran
> cd build
> ninja
>
> Got following error
> D:\project\repo\pg_meson\postgres\build>C:\Users\kyals\AppData\Roaming\Python\Python311\Scripts\ninja
> ninja: error: 'src/backend/postgres_lib.a.p/meson_pch-c.obj', needed
> by 'src/backend/postgres.exe', missing and no known rule to make it.

I am able to reproduce the error. This error was introduced in Meson
v1.2.0; v1.1.0 and before work successfully. It seems Meson tries to
use pch files although Postgres is compiled with b_pch=false.
This error occurs when Developer Powershell for VS is used and
Postgres is compiled with b_pch=false option (which is the default on
Postgres). If the -Db_pch=true option or the default powershell is
used, Postgres gets built successfully.

[1] 
https://www.postgresql.org/message-id/20231018.113148.1275969479525954369.horikyota@gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: ensure, not insure

2023-11-09 Thread Tom Lane
David Rowley  writes:
> I've pushed this.  I backpatched due to the typo in the fsync GUC
> description.  I'd have only pushed to master if it were just the
> comment typos.

FTR, I do not think you should have back-patched.  You created extra
work for the translation team, and the mistake is subtle enough that
it wasn't worth that.  (My dictionary says that "insure and ensure
are often interchangeable, particularly in US English".)

regards, tom lane




Re: RFC: Logging plan of the running query

2023-11-09 Thread Ashutosh Bapat
On Thu, Nov 9, 2023 at 4:56 PM torikoshia  wrote:
>
> On 2023-11-09 16:11, Ashutosh Bapat wrote:
> > On Thu, Nov 9, 2023 at 12:03 PM torikoshia 
> > wrote:
> >> >>
> >> >> 1. When a backend is running nested queries, we will see the plan of
> >> >> the innermost query. That query may not be the actual culprit if the
> >> >> user query is running slowly. E.g a query being run as part of inner
> >> >> side nested loop when the nested loop itself is the bottleneck. I
> >> >> think it will be useful to print plans of all the whole query stack.
> >>
> >> This was discussed in previous threads[1] and we thought it'd be
> >> useful
> >> but since it needed some extra works, we stopped widening the scope.
> >>
> >> >
> >> > I think we can start with what auto_explain is doing. Always print the
> >> > plan of the outermost query; the query found in pg_stat_activity. In a
> >> > later version we might find a way to print plans of all the queries in
> >> > the stack and do so in a readable manner.
> >>
> >> Agreed there are cases printing plan of the outermost query is more
> >> useful.
> >>
> >
> > I am fine printing the plan of the outermost query. This will help
> > many cases. Printing plans of the whole query stack can be added as an
> > add on later.
> >
> >> >
> >> > This makes tracking activeQueryDesc a bit tricky. My guess is that the
> >> > outermost query's descriptor will be available through ActivePortal
> >> > most of the time. But there are cases when ExecutorRun is called by
> >> > passing a queryDesc directly. So may be there are some cases where
> >> > that's not true.
> >>
> >> Yeah, actually the original version of the patch got the plan from
> >> ActivePortal, but it failed logging plan when the query was something
> >> like this[2]:
> >>
> >>   DO $$
> >>   BEGIN
> >>   PERFORM pg_sleep(100);
> >>   END$$;
> >
> > References [1] and [2] are not listed in your email.
>
> Oops, sorry. Here are links:
>
> [1]
> https://www.postgresql.org/message-id/64f716c44629e303b66e6c24502147cc%40oss.nttdata.com
> [2]
> https://www.postgresql.org/message-id/632e99eb-8090-53e6-1b1a-101601904cbd%40oss.nttdata.com
>
> > Is that because there was no ActivePortal created or the ActivePortal
> > pointed to DO block instead of PERFORM pg_sleep?
>
> ActivePortal is created but ActivePortal->queryDesc is null.

Thanks.

>
> >> > 2. When a query is running in parallel worker do we want to print that
> >> > query? It may or may not be interesting given the situation. If the
> >> > overall plan itself is faulty, output of the parallel worker query is
> >> > not useful. If the plan is fine but a given worker's leg is running
> >> > slowly it may be interesting.
> >>
> >> I think it can be useful.
> >> I'm wondering if we can add this after the first patch for this
> >> feature
> >> is committed.
> >
> > With the current patches, it will print the query from a parallel
> > backend. If that's not desirable we should prohibit that case at
> > least.
>
> Current patch prohibits printing plan if backend type is parallel worker
> as below:
>
>=# select pg_log_query_plan(pid), backend_type from pg_stat_activity
> where backend_type = 'parallel worker';
>
> pg_log_query_plan |  backend_type
>---+-
> f | parallel worker
> f | parallel worker
>(2 rows)
>
>WARNING:  PID 4618 is not a PostgreSQL client backend process
>WARNING:  PID 4617 is not a PostgreSQL client backend process
>
> Is this the behavior you expect?
>

I misread then. Thanks for correcting me. We could consider plans from
parallel workers in v2 of this feature.

-- 
Best Wishes,
Ashutosh Bapat




Re: speed up a logical replica setup

2023-11-09 Thread Peter Eisentraut

On 08.11.23 00:12, Michael Paquier wrote:

- Should the subdirectory pg_basebackup be renamed into something more
generic at this point?  All these things are frontend tools that deal
in some way with the replication protocol to do their work.  Say
a replication_tools?


Seems like unnecessary churn.  Nobody has complained about any of the 
other tools in there.



- And if it would be better to refactor some of the code generic to
all these streaming tools to fe_utils.  What makes streamutil.h a bit
less pluggable are all its extern variables to control the connection,
but perhaps that can be an advantage, as well, in some cases.


Does anyone outside of pg_basebackup + existing friends + new friend 
need that?  Seems like extra complications.






Re: meson documentation build open issues

2023-11-09 Thread Peter Eisentraut

On 09.11.23 00:21, Andres Freund wrote:

Example output:


This is very nice!


$ ninja help
[0/1 1   0%] Running external command help (wrapped by meson to set env)
Code Targets:
   all  Build everything other than documentation
   backend  Build backend and related modules
   bin  Build frontend binaries
   contrib  Build contrib modules
   pl   Build procedual languages


ok


Documentation Targets:
   docs Build documentation in multi-page HTML format
   doc-html Build documentation in multi-page HTML format
   doc-man  Build documentation in man page format
   doc/src/sgml/postgres-A4.pdf Build documentation in PDF format, with A4 pages
   doc/src/sgml/postgres-US.pdf Build documentation in PDF format, with US 
letter pages
   doc/src/sgml/postgres.html   Build documentation in single-page HTML format
   alldocs  Build documentation in all supported formats

Installation Targets:
   install  Install postgres, excluding documentation


This should probably read "Install everything other than documentation", 
to mirror "all" above.  (Otherwise one might think it installs just the 
backend.)



   install-doc-html Install documentation in multi-page HTML format
   install-doc-man  Install documentation in man page format
   install-docs Install documentation in multi-page HTML and 
man page formats


There is a mismatch between "docs" and "install-docs".  (As was 
previously discussed, I'm in the camp that "docs" should be html + man.)



   install-quietLike "install", but installed files are not 
displayed
   install-worldInstall postgres, including multi-page HTML and 
man page documentation


Suggest "Install everything, including documentation" (matches "world").


   uninstallRemove installed files

Other Targets:
   cleanRemove all build products
   test Run all enabled tests (including contrib)
   worldBuild everything, including documentation


Shouldn't that be under "Code Targets"?


   help List important targets






Re: meson documentation build open issues

2023-11-09 Thread Peter Eisentraut

On 09.11.23 01:59, Andres Freund wrote:

I think we could build doc/src/sgml/postgres-full.xml by default.  That
takes less than 0.5 seconds here and it's an intermediate target for html
and man.

That does require the docbook dtd to be installed, afaict. I think we would
need a configure test for that to be present if we want to build it by
default, otherwise we'll cause errors on plenty systems that don't get them
today.  The docbook dts aren't a huge dependency, but still. Some OSs might
not have a particularly install source for them, e.g. windows.


I was thinking we would do it only if the required tools are found. 
Basically like


 postgres_full_xml = custom_target('postgres-full.xml',
   input: 'postgres.sgml',
   output: 'postgres-full.xml',
   depfile: 'postgres-full.xml.d',
   command: [xmllint, '--nonet', '--noent', '--valid',
 '--path', '@OUTDIR@', '-o', '@OUTPUT@', '@INPUT@'],
   depends: doc_generated,
-  build_by_default: false,
+  build_by_default: xmllint_bin.found(),
 )

Besides giving you a quick validity check of the XML, this also builds 
the doc_generated, which draws from non-doc source files, so this would 
also serve to check that those are sound and didn't mess up the docs.



I don't think that'd detect the missing ids?


Right, it wouldn't do that.





Re: proposal: possibility to read dumped table's name from file

2023-11-09 Thread Daniel Gustafsson
I went and had another look at this.  The patch has been around for 18
commitfests and is widely considered to add a good feature, so it seems about
time to reach closure.

As I've mentioned in the past I'm not a big fan of the parser, but the thread
has overruled me on that.  Another thing I think is a bit overcomplicated is the
layered error handling for printing log messages, and bubbling up of errors to
get around not being able to call exit_nicely.

In the attached version I've boiled down the error logging into a single new
function pg_log_filter_error() which takes a variable format string.  This
removes a fair bit of the extra calls and makes logging easier.  I've also
added a function pointer to the FilterStateData for passing the exit function
via filter_init.  This allows the filtering code to exit gracefully regardless
of which application is using it.  Finally, I've also reimplemented the logic
for checking the parsed tokens into switch statements without defaults in order
to get the compiler warning on a missed case.  It's easy to miss adding code to
handle a state, especially when adding new ones, and this should help highlight
that.
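
To illustrate the idea (this is only a sketch with made-up names, not code
from the patch): without a default: label, -Wswitch makes the compiler point
at any enum value that is not handled.

    typedef enum
    {
        FILTER_COMMAND_INCLUDE,
        FILTER_COMMAND_EXCLUDE
    } FilterCommandType;

    static void
    apply_filter_command(FilterCommandType cmd)
    {
        switch (cmd)
        {
            case FILTER_COMMAND_INCLUDE:
                /* ... handle an include rule ... */
                break;
            case FILTER_COMMAND_EXCLUDE:
                /* ... handle an exclude rule ... */
                break;
                /* no default: the compiler warns if a new enum value is unhandled */
        }
    }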

Overall, this does shave a bit off the patch in size for what IMHO is better
readability and maintainability.  (I've also made a pgindent pass over it of
course).

What are your thoughts on this version?  It's not in a committable state as it
needs a few more comments here and there and a triple-check that nothing was
missed in changing this, but I prefer to get your thoughts before spending the
extra time.

--
Daniel Gustafsson



v20231109-0001-possibility-to-read-options-for-dump-from-.patch
Description: Binary data


Re: A recent message added to pg_upgade

2023-11-09 Thread Alvaro Herrera
On 2023-Nov-09, Amit Kapila wrote:

> These comments appear mostly repetitive to what is already mentioned
> in start_postmaster(). So, I have changed those to refer to the already
> written comments, and slightly adjusted the comments at another place.
> See attached.

I'd still rather mention check_old_cluster_for_valid_slots() just above
the Assert() in InvalidatePossiblyObsoleteSlot().  It looks too bare to
me otherwise.

> Personally, I don't see the need for a test for this, so removed the
> same but can add it back if you or others think so.

I'm neutral on having a test for this.  I'm not sure this is easy to
break unintentionally.  OTOH the test is cheap, since it only has to run
pg_upgrade itself and not, say, another initdb.  On the (as Robert says)
third hand, would we have tests for each possible GUC that we'd like not
to be changed during pg_upgrade?  I suspect not, which suggests we don't
want this one either.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte"
(Ijon Tichy en Viajes, Stanislaw Lem)




Re: Synchronizing slots from primary to standby

2023-11-09 Thread Drouvot, Bertrand

Hi,

On 11/9/23 3:41 AM, Amit Kapila wrote:

On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand
 wrote:



Unrelated to above, if there is a user slot on standby with the same
name which the slot-sync worker is trying to create, then shall it
emit a warning and skip the sync of that slot or shall it throw an
error?



I'd vote for emit a warning and move on to the next slot if any.



But then it could take time for users to know the actual problem and
they probably notice it after failover.


Right, that's not appealing

OTOH the slot has already been created manually on the standby so there is
probably already a "use case" for it (that is probably unrelated to the
failover story then).

In V32, the following states have been introduced:

"
'n': none for user slots,
'i': sync initiated for the slot but waiting for primary to catch up.
'r': ready for periodic syncs.
"

Should we introduce a new state that indicates that a sync slot creation
has failed because the slot already existed? That would probably
be simpler to monitor than looking at the log file.
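
Monitoring would then be a simple catalog query, something like this (the
column and state names here are purely hypothetical, just to illustrate the
idea):

    -- assumes the sync state is exposed through pg_replication_slots and that
    -- a dedicated value marks "sync skipped, user slot with same name exists"
    SELECT slot_name, sync_state
      FROM pg_replication_slots
     WHERE sync_state = 'x';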


OTOH, if we throw an error
then probably they will come to know earlier because the slot sync
mechanism would be stopped.


Right.


Do you have reasons to prefer giving a
WARNING and skipping creating such slots?


My idea was that with a WARNING it won't block the creation of other slots (if any).


I expect this WARNING to
keep getting repeated in LOGs because the consecutive sync tries will
again generate a WARNING.



Yes.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-11-09 Thread Maxim Orlov
On Thu, 9 Nov 2023 at 15:01, Amul Sul  wrote:

>
> Here is the updated version patch. Did minor changes to documents and
> tests.
>
Overall, the patch looks good to me.  Since Peter withdrew his comment on
triggers and no open problems remain, shall we mark this patch RfC?  It would
be nice to get this into the next release.


-- 
Best regards,
Maxim Orlov.


Re: SQL:2011 application time

2023-11-09 Thread Peter Eisentraut

On 02.11.23 21:21, Paul Jungwirth wrote:

New patches attached (rebased to 0bc726d9).


I went over the patch 
v17-0001-Add-temporal-PRIMARY-KEY-and-UNIQUE-constraints.patch in more 
detail.  Attached is a fixup patch that addresses a variety of cosmetic 
issues.


Some details:

- Renamed contemporal to conwithoutoverlaps, as previously discussed. 
Also renamed various variables and function arguments similarly.


- Rearranged text in CREATE TABLE reference page so there are no forward 
references.  (Describe WITHOUT OVERLAPS under UNIQUE and then PRIMARY 
KEy says "see above", rather than describe it under PRIMARY KEY and have 
UNIQUE say "see below.)


- Removed various bits that related to temporal foreign keys, which 
belong in a later patch.


- Reverted some apparently unrelated changes in src/backend/catalog/index.c.

- Removed the "temporal UNIQUE" constraint_type assignment in 
DefineIndex().  This is meant to be used in error messages and should 
refer to actual syntax.  I think it's fine without this change.


- Field contemporal in NewConstraint struct is not used by this patch.

- Rearranged the grammar so that the rule with WITHOUT OVERLAPS is just a 
Boolean attribute rather than column name plus keywords.  This was kind 
of confusing earlier and led to weird error messages for invalid syntax. 
 I kept the restriction that you need at least one non-overlaps column, 
but that is now enforced in parse analysis, not in the grammar.  (But 
maybe we don't need it?)


(After your earlier explanation, I'm content to just allow one WITHOUT 
OVERLAPS column for now.)


- Some places looked at conexclop to check whether something is a 
WITHOUT OVERLAPS constraint, instead of looking at conwithoutoverlaps 
directly.


- Removed some redundant "unlike" entries in the pg_dump tests.  (This 
caused cfbot tests to fail.)


- Moved the "without_overlaps" test later in the schedule.  It should at 
least be after "constraints" so that normal constraints are tested first.



Two areas that could be improved:

1) In src/backend/commands/indexcmds.c, 
get_index_attr_temporal_operator() has this comment:


+* This seems like a hack
+* but I can't find any existing lookup function
+* that knows about pseudotypes.

This doesn't seem very confident. ;-)  I don't quite understand this.  Is 
this a gap in the currently available APIs, do we need to improve 
something here, or does this need more research?


2) In src/backend/parser/parse_utilcmd.c, transformIndexConstraint(), 
there is too much duplication between the normal and the if 
(constraint->without_overlaps) case, like the whole not-null constraints 
stuff at the end.  This should be one code block with a few conditionals 
inside.  Also, the normal case deals with things like table inheritance, 
which the added parts do not.  Is this all complete?


I'm not sure the validateWithoutOverlaps() function is needed at this 
point in the code.  We just need to check that the column exists, which 
the normal code path already does, and then have the index creation code 
later check that an appropriate overlaps operator exists.  We don't even 
need to restrict this to range types.  Consider for example, it's 
possible that a type does not have a btree equality operator.  We don't 
check that here either, but let the index code later check it.



Overall, with these fixes, I think this patch is structurally okay.  We 
just need to make sure we have all the weird corner cases covered.
From 6d7af1f78505c08fb205a56bd1ba3cb951f04002 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 9 Nov 2023 14:11:34 +0100
Subject: [PATCH] fixup! Add temporal PRIMARY KEY and UNIQUE constraints

---
 doc/src/sgml/catalogs.sgml|  8 +--
 doc/src/sgml/ref/create_table.sgml| 68 ---
 src/backend/catalog/heap.c|  4 +-
 src/backend/catalog/index.c   | 14 ++--
 src/backend/catalog/pg_constraint.c   |  4 +-
 src/backend/commands/indexcmds.c  | 30 
 src/backend/commands/tablecmds.c  | 16 ++---
 src/backend/commands/trigger.c|  2 +-
 src/backend/commands/typecmds.c   |  2 +-
 src/backend/parser/gram.y | 19 +++---
 src/backend/parser/parse_utilcmd.c| 28 ++--
 src/backend/utils/adt/ruleutils.c | 27 +++-
 src/backend/utils/cache/relcache.c| 16 +++--
 src/bin/pg_dump/pg_dump.c | 33 +++--
 src/bin/pg_dump/pg_dump.h |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl  | 14 ++--
 src/bin/psql/describe.c   | 10 +--
 src/include/catalog/index.h   |  2 +-
 src/include/catalog/pg_constraint.h   | 11 +--
 src/include/commands/defrem.h |  2 +-
 src/include/nodes/parsenodes.h|  6 +-
 .../regress/expected/without_overlaps.out |  8 +--
 

Re: A recent message added to pg_upgade

2023-11-09 Thread Amit Kapila
On Thu, Nov 9, 2023 at 4:09 PM Alvaro Herrera  wrote:
>
> On 2023-Nov-02, Kyotaro Horiguchi wrote:
>
> > diff --git a/src/backend/access/transam/xlog.c 
> > b/src/backend/access/transam/xlog.c
> > index b541be8eec..46833f6ecd 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -2063,6 +2063,29 @@ check_wal_segment_size(int *newval, void **extra, 
> > GucSource source)
> >   return true;
> >  }
> >
> > +/*
> > + * GUC check_hook for max_slot_wal_keep_size
> > + *
> > + * If WALs needed by logical replication slots are deleted, these slots 
> > become
> > + * inoperable. During a binary upgrade, pg_upgrade sets this variable to 
> > -1 via
> > + * the command line in an attempt to prevent such deletions, but users have
> > + * ways to override it. To ensure the successful completion of the upgrade,
> > + * it's essential to keep this variable unaltered.  See
> > + * InvalidatePossiblyObsoleteSlot() and start_postmaster() in pg_upgrade 
> > for
> > + * more details.
> > + */
> > +bool
> > +check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source)
> > +{
> > + if (IsBinaryUpgrade && *newval != -1)
> > + {
> > + GUC_check_errdetail("\"%s\" must be set to -1 during binary 
> > upgrade mode.",
> > + "max_slot_wal_keep_size");
> > + return false;
> > + }
> > + return true;
> > +}
>
> One sentence in that comment reads weird.  I'd do this:
>
> s/To ensure the ... unaltered/This check callback ensures the value is
> not overridden by the user/
>

These comments appear mostly repetitive to what is already mentioned
in start_postmaster(). So, I have changed those to refer to the already
written comments, and slightly adjusted the comments at another place.
See attached. Personally, I don't see the need for a test for this, so
removed the same but can add it back if you or others think so.

-- 
With Regards,
Amit Kapila.


inhibit_m_s_w_k_s_during_upgrade_6.patch
Description: Binary data


Re: MERGE ... RETURNING

2023-11-09 Thread Dean Rasheed
On Sun, 5 Nov 2023 at 11:52, Dean Rasheed  wrote:
>
> OK, that's a fair point. Attached is a new version, replacing those
> parts of the implementation with a new MergingFunc node. It doesn't
> add that much more complexity, and I think the new code is much
> neater.
>

Rebased version attached, following the changes made in 615f5f6faa and
a4f7d33a90.
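
For reference, the merge support function can then be used like this
(example adapted from the documentation changes in the attached patch):

    MERGE INTO products p
      USING new_products n ON p.product_no = n.product_no
      WHEN NOT MATCHED THEN INSERT VALUES (n.product_no, n.name, n.price)
      WHEN MATCHED THEN UPDATE SET name = n.name, price = n.price
      RETURNING MERGING(ACTION), MERGING(CLAUSE_NUMBER), p.*;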

Regards,
Dean
diff --git a/doc/src/sgml/dml.sgml b/doc/src/sgml/dml.sgml
new file mode 100644
index cbbc5e2..3d95bdb
--- a/doc/src/sgml/dml.sgml
+++ b/doc/src/sgml/dml.sgml
@@ -283,10 +283,15 @@ DELETE FROM products;
RETURNING
   
 
+  
+   MERGE
+   RETURNING
+  
+
   
Sometimes it is useful to obtain data from modified rows while they are
being manipulated.  The INSERT, UPDATE,
-   and DELETE commands all have an
+   DELETE, and MERGE commands all have an
optional RETURNING clause that supports this.  Use
of RETURNING avoids performing an extra database query to
collect the data, and is especially valuable when it would otherwise be
@@ -339,6 +344,21 @@ DELETE FROM products
 
   
 
+  
+   In a MERGE, the data available to RETURNING is
+   the content of the source row plus the content of the inserted, updated, or
+   deleted target row.  Since it is quite common for the source and target to
+   have many of the same columns, specifying RETURNING *
+   can lead to a lot of duplicated columns, so it is often more useful to
+   qualify it so as to return just the source or target row.  For example:
+
+MERGE INTO products p USING new_products n ON p.product_no = n.product_no
+  WHEN NOT MATCHED THEN INSERT VALUES (n.product_no, n.name, n.price)
+  WHEN MATCHED THEN UPDATE SET name = n.name, price = n.price
+  RETURNING p.*;
+
+  
+
   
If there are triggers () on the target table,
the data available to RETURNING is the row as modified by
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index d963f0a..b25de53
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21990,6 +21990,84 @@ SELECT count(*) FROM sometable;
 
  
 
+ 
+  Merge Support Functions
+
+  
+   MERGE
+   RETURNING
+  
+
+  
+   MERGING
+  
+
+  
+   PostgreSQL includes one merge support function
+   that may be used in the RETURNING list of a
+command to return additional information about
+   the action taken for each row:
+
+MERGING ( property )
+
+   The following are valid property values specifying what to return:
+
+   
+
+ ACTION
+ 
+  
+   The merge action command executed for the current row
+   ('INSERT', 'UPDATE', or
+   'DELETE').
+  
+ 
+
+
+
+ CLAUSE_NUMBER
+ 
+  
+   The 1-based index of the WHEN clause executed for
+   the current row.
+  
+ 
+
+   
+  
+
+  
+   The following example illustrates how this may be used to determine
+   which actions were performed for each row affected by the
+   MERGE command:
+
+  
+
+  
+   Note that this function can only be used in the RETURNING
+   list of a MERGE command.  It is an error to use it in any
+   other part of a query.
+  
+
+ 
+
  
   Subquery Expressions
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index fe8def4..51a15ca
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1387,9 +1387,9 @@
  to a client upon the
  completion of an SQL command, usually a
  SELECT but it can be an
- INSERT, UPDATE, or
- DELETE command if the RETURNING
- clause is specified.
+ INSERT, UPDATE,
+ DELETE, or MERGE command if the
+ RETURNING clause is specified.
 
 
  The fact that a result set is a relation means that a query can be used
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 5977534..4ad9894
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -1020,8 +1020,8 @@ INSERT INTO mytable VALUES (1,'one'), (2
 
 
 
- If the command does return rows (for example SELECT,
- or INSERT/UPDATE/DELETE
+ If the command does return rows (for example SELECT, or
+ INSERT/UPDATE/DELETE/MERGE
  with RETURNING), there are two ways to proceed.
  When the command will return at most one row, or you only care about
  the first row of output, write the command as usual but add
@@ -1149,6 +1149,7 @@ SELECT select_expressionsexpressions INTO STRICT target;
 UPDATE ... RETURNING expressions INTO STRICT target;
 DELETE ... RETURNING expressions INTO STRICT target;
+MERGE ... RETURNING expressions INTO STRICT target;
 
 
  where target can be a record variable, a row
@@ -1159,8 +1160,8 @@ DELETE ... RETURNING expres
  INTO clause) just as described above,
  and the plan is cached in the same way.
  This works for SELECT,
- INSERT/UPDATE/DELETE with
- RETURNING, and certain utility commands
+ INSERT/UPDATE/DELETE/MERGE
+ with RETURNING, and certain utility commands
  that return 

Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-11-09 Thread Amul Sul
On Tue, Nov 7, 2023 at 8:21 PM Peter Eisentraut 
wrote:

> On 25.10.23 08:12, Amul Sul wrote:
> > Here is the rebase version for the latest master head(673a17e3120).
> >
> > I haven't done any other changes related to the ON UPDATE trigger since
> that
> > seems non-trivial; need a bit of work to add trigger support in
> > ATRewriteTable().
> > Also, I am not sure yet, if we were doing these changes, and the correct
> > direction
> > for that.
>
> I did some detailed testing on Db2, Oracle, and MariaDB (the three
> existing implementations of this feature that I'm tracking), and none of
> them fire any row or statement triggers when the respective statement to
> alter the generation expression is run.  So I'm withdrawing my comment
> that this should fire triggers.  (I suppose event triggers are available
> if anyone really needs the functionality.)
>

Thank you for the confirmation.

Here is the updated version patch. Did minor changes to documents and tests.

Regards,
Amul


v3-0001-Prerequisite-changes-rename-functions-enum.patch
Description: Binary data


v3-0002-Allow-to-change-generated-column-expression.patch
Description: Binary data


Re: Force the old transactions logs cleanup even if checkpoint is skipped

2023-11-09 Thread Zakhlystov, Daniil (Nebius)
Hello,

> On 9 Nov 2023, at 01:30, Michael Paquier  wrote:
> 
> I am not really convinced that this is worth complicating the skipped
> path for this goal.  In my experience, I've seen complaints where WAL
> archiving bloat was coming from the archive command not able to keep
> up with the amount generated by the backend, particularly because the
> command invocation was taking longer than it takes to generate a new
> segment.  Even if there is a hole of activity in the server, if too
> much WAL has been generated it may not be enough to catch up depending
> on the number of segments that need to be processed.  Others are free
> to chime in with extra opinions, of course.

I agree that there might be multiple reasons for pg_wal bloat. Please note that
I am not addressing the WAL archiving issue at all. My proposal is to add a
small improvement to the WAL cleanup routine for WALs that have already been
archived successfully, in order to free the disk space.

Yes, it might not be a common case, but it is a fairly realistic one. It occurred
multiple times in our production environment when we had temporary issues with
archiving. This small complication of the skipped path will help Postgres return
to a normal operational state without any human operator or external control
routine intervention.

> On 9 Nov 2023, at 01:30, Michael Paquier  wrote:
> 
> While on it, I think that your patch would cause incorrect and early
> removal of segments.  It computes the name of the last segment to
> remove based on last_important_lsn, ignoring KeepLogSeg(), meaning
> that it ignores any WAL retention required by replication slots or
> wal_keep_size.  And this causes the calculation of an incorrect segno
> horizon.

Please check the latest patch version; I believe that it has already been fixed
there.

Thanks,

Daniil Zakhlystov





Re: UniqueKey v2

2023-11-09 Thread zhihuifan1213

zhihuifan1...@163.com writes:

Hi,

Here is v3. The main change is that it maintains the UniqueKey at the
joinrel level, which is probably the most important part of this
feature. It shows how the UniqueKey on a joinrel is generated, how it
is discarded when it is not an interesting UniqueKey, and also gives
more detail about the single-row case.

I will keep README.uniquekey under src/backend/optimizer/path/ up to
date with the latest state of this feature, to save reviewers the time
of going through the thread from the beginning. I also use the words "BAD
CASE" in uniquekey.sql to mark which situations are not handled well so
far; those probably need more attention in the first review.
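
As a quick illustration of the distinct-as-noop case (adapted from the
README below):

    CREATE TABLE t (a int NOT NULL, b int NOT NULL, c int);
    CREATE UNIQUE INDEX ON t (a, b);
    -- with the UniqueKey (a, b) known for t, this DISTINCT can be a no-op
    SELECT DISTINCT a, b FROM t;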

>From 987c1d08bedc0c9c7436ad2daddbab2202b52425 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 9 Nov 2023 17:25:35 +0800
Subject: [PATCH v3 1/1] maintain uniquekey on baserel/joinrel level and use it
 for marking

distinct-as-noop.
---
 src/backend/nodes/list.c|  17 +
 src/backend/optimizer/path/Makefile |   3 +-
 src/backend/optimizer/path/README.uniquekey | 220 +++
 src/backend/optimizer/path/allpaths.c   |   2 +
 src/backend/optimizer/path/equivclass.c |  48 ++
 src/backend/optimizer/path/joinrels.c   |   2 +
 src/backend/optimizer/path/uniquekey.c  | 654 
 src/backend/optimizer/plan/initsplan.c  |   8 +
 src/backend/optimizer/plan/planner.c|  33 +
 src/backend/optimizer/util/plancat.c|  10 +
 src/backend/optimizer/util/relnode.c|   2 +
 src/include/nodes/pathnodes.h   |  19 +
 src/include/nodes/pg_list.h |   2 +
 src/include/optimizer/paths.h   |  13 +
 src/test/regress/expected/join.out  |  11 +-
 src/test/regress/expected/uniquekey.out | 268 
 src/test/regress/parallel_schedule  |   2 +-
 src/test/regress/sql/uniquekey.sql  | 104 
 18 files changed, 1409 insertions(+), 9 deletions(-)
 create mode 100644 src/backend/optimizer/path/README.uniquekey
 create mode 100644 src/backend/optimizer/path/uniquekey.c
 create mode 100644 src/test/regress/expected/uniquekey.out
 create mode 100644 src/test/regress/sql/uniquekey.sql

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 750ee5a7e5..eaeaa19ad2 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -694,6 +694,23 @@ list_member_ptr(const List *list, const void *datum)
 	return false;
 }
 
+/*
+ * Return index of the datum in list if found. otherwise return -1.
+ */
+int
+list_member_ptr_pos(const List *list, const void *datum)
+{
+	ListCell   *lc;
+
+	foreach(lc, list)
+	{
+		if (lfirst(lc) == datum)
+			return foreach_current_index(lc);
+	}
+
+	return -1;
+}
+
 /*
  * Return true iff the integer 'datum' is a member of the list.
  */
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey
new file mode 100644
index 00..be13b113b9
--- /dev/null
+++ b/src/backend/optimizer/path/README.uniquekey
@@ -0,0 +1,220 @@
+Here is a design document and a part of implementation.
+
+What is UniqueKey?
+-
+
+UniqueKey represents a uniqueness information for a RelOptInfo. for
+example:
+
+SELECT id FROM t;
+
+where the ID is the UniqueKey for the RelOptInfo (t).  In the real world,
+it has the following attributes:
+
+1). It should be EquivalenceClass based. for example:
+
+SELECT a FROM t WHERE a = id;
+
+In this case, the UniqueKey should be 1 EC with two members
+- EC(EM(a), EM(id)).
+
+
+2). Each UniqueKey may be made up with 1+ EquivalenceClass. for example:
+
+CREATE TABLE t(a int not null,  b int not null);
+CREATE UNIQUE INDEX on t(a, b);
+SELECT * FROM t;
+
+Where the UniqueKey for RelOptInfo (t) will be 2 ECs with each one has 1
+member.
+
+- EC(em=a), EC(em=b)
+
+3). Each RelOptInfo may have 1+ UniqueKeys.
+
+CREATE TABLE t(a int not null,  b int not null, c int not null);
+CREATE UNIQUE INDEX on t(a, b);
+CREATE UNIQUE INDEX on t(c);
+
+SELECT * FROM t;
+
+Where the UniqueKey for RelOptInfo (t) will be
+- [EC(em=a), EC(em=b)].
+- [EC(em=c)]
+
+4). A special case is about the one-row case. It works like:
+SELECT * FROM t WHERE id = 1;
+Here every single expression in the RelOptInfo (t) is unique since only
+one row here.
+
+Where can we use it?
+
+1. mark the distinct as no-op.  SELECT DISTINCT uniquekey FROM v; This
+  optimization has been required several times in our threads.
+
+2. Figure out more pathkey within the onerow case, then some planning
+  time can be reduced to be big extend. This user case is not 

Re: Show WAL write and fsync stats in pg_stat_io

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 Nov 2023 at 04:19, Andres Freund  wrote:
>
> Hi,
>
> On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:
> > By the way, if the write/sync quantities and times begin to be tracked
> > by pg_stat_io, I'd see a pretty good argument in removing the
> > equivalent columns in pg_stat_wal.  It looks like this would reduce
> > the confusion related to the handling of PendingWalStats added in
> > pgstat_io.c, for one.
>
> Another approach would be to fetch the relevant columns from pg_stat_io in the
> pg_stat_wal view. That'd avoid double accounting and breaking existing
> monitoring.

There are some differences between pg_stat_wal and pg_stat_io in how they
collect WAL stats. For example, in the XLogWrite() function in xlog.c,
pg_stat_wal counts wal_writes as write system calls. This is not something
we want for pg_stat_io, since pg_stat_io counts the number of blocks rather
than system calls, so pg_stat_io is instead incremented by npages.

Could that cause a problem since pg_stat_wal's behaviour will be
changed? Of course, as an alternative we could change pg_stat_io's
behaviour but in the end either pg_stat_wal's or pg_stat_io's
behaviour will be changed.
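
To make the end goal concrete, the idea is to be able to query WAL I/O like
any other I/O object, roughly like this (assuming the patch exposes
object = 'wal' rows in pg_stat_io):

    SELECT backend_type, object, context, writes, write_time, fsyncs, fsync_time
      FROM pg_stat_io
     WHERE object = 'wal';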

Regards,
Nazir Bilal Yavuz
Microsoft




Re: RFC: Logging plan of the running query

2023-11-09 Thread torikoshia

On 2023-11-09 16:11, Ashutosh Bapat wrote:
On Thu, Nov 9, 2023 at 12:03 PM torikoshia  
wrote:

>>
>> 1. When a backend is running nested queries, we will see the plan of
>> the innermost query. That query may not be the actual culprit if the
>> user query is running slowly. E.g a query being run as part of inner
>> side nested loop when the nested loop itself is the bottleneck. I
>> think it will be useful to print plans of all the whole query stack.

This was discussed in previous threads[1] and we thought it'd be 
useful

but since it needed some extra works, we stopped widening the scope.

>
> I think we can start with what auto_explain is doing. Always print the
> plan of the outermost query; the query found in pg_stat_activity. In a
> later version we might find a way to print plans of all the queries in
> the stack and do so in a readable manner.

Agreed there are cases printing plan of the outermost query is more
useful.



I am fine printing the plan of the outermost query. This will help
many cases. Printing plans of the whole query stack can be added as an
add on later.


>
> This makes tracking activeQueryDesc a bit tricky. My guess is that the
> outermost query's descriptor will be available through ActivePortal
> most of the time. But there are cases when ExecutorRun is called by
> passing a queryDesc directly. So may be there are some cases where
> that's not true.

Yeah, actually the original version of the patch got the plan from
ActivePortal, but it failed logging plan when the query was something
like this[2]:

  DO $$
  BEGIN
  PERFORM pg_sleep(100);
  END$$;


References [1] and [2] are not listed in your email.


Oops, sorry. Here are links:

[1] 
https://www.postgresql.org/message-id/64f716c44629e303b66e6c24502147cc%40oss.nttdata.com
[2] 
https://www.postgresql.org/message-id/632e99eb-8090-53e6-1b1a-101601904cbd%40oss.nttdata.com



Is that because there was no ActivePortal created or the ActivePortal
pointed to DO block instead of PERFORM pg_sleep?


ActivePortal is created but ActivePortal->queryDesc is null.


> 2. When a query is running in parallel worker do we want to print that
> query? It may or may not be interesting given the situation. If the
> overall plan itself is faulty, output of the parallel worker query is
> not useful. If the plan is fine but a given worker's leg is running
> slowly it may be interesting.

I think it can be useful.
I'm wondering if we can add this after the first patch for this 
feature

is committed.


With the current patches, it will print the query from a parallel
backend. If that's not desirable we should prohibit that case at
least.


The current patch prohibits printing the plan if the backend type is parallel
worker, as below:


  =# select pg_log_query_plan(pid), backend_type from pg_stat_activity 
where backend_type = 'parallel worker';


   pg_log_query_plan |  backend_type
  ---+-
   f | parallel worker
   f | parallel worker
  (2 rows)

  WARNING:  PID 4618 is not a PostgreSQL client backend process
  WARNING:  PID 4617 is not a PostgreSQL client backend process

Is this the behavior you expect?

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-09 Thread Alvaro Herrera
IMO the whole area of SLRU buffering is in horrible shape and many users
are struggling with overall PG performance because of it.  An
improvement doesn't have to be perfect -- it just has to be much better
than the current situation, which should be easy enough.  We can
continue to improve later, using more scalable algorithms or ones that
allow us to raise the limits higher.

The only point on which we do not have full consensus yet is the need to
have one GUC per SLRU, and a lot of effort seems focused on trying to
fix the problem without adding so many GUCs (for example, using shared
buffers instead, or use a single "scaling" GUC).  I think that hinders
progress.  Let's just add multiple GUCs, and users can leave most of
them alone and only adjust the one with which they have a performance
problem; it's not going to be the same one for everybody.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)




Re: ensure, not insure

2023-11-09 Thread David Rowley
On Thu, 9 Nov 2023 at 14:22, Michael Paquier  wrote:
>
> On Wed, Nov 08, 2023 at 08:31:28PM +1300, David Rowley wrote:
> > Those all look fine to me too.
>
> +1.

I've pushed this.  I backpatched due to the typo in the fsync GUC
description.  I'd have only pushed to master if it were just the
comment typos.

I noticed older versions had another instance of "insure" in a code
comment. I opted to leave that one alone since that file is now gone
in more recent versions.

David




Re: GUC names in messages

2023-11-09 Thread Alvaro Herrera
On 2023-Nov-09, Peter Smith wrote:

> Notice that NOT QUOTED is the far more common pattern, so my vote
> would be just to standardise on making everything this way. I know
> there was some concern raised about ambiguous words like "timezone"
> and "datestyle" etc but in practice, those are rare. Also, those GUCs
> are different in that they are written as camel-case (e.g.
> "DateStyle") in the guc_tables.c, so if they were also written
> camel-case in the messages that could remove ambiguities with normal
> words. YMMV.

Well, I think camel-casing is also a sufficient differentiator for these
identifiers not being English words.  We'd need to ensure they're always
written that way, when not quoted.  However, in cases where arbitrary
values are expanded, I don't know that they would be expanded that way,
so I would still go for quoting in that case.

There's also a few that are not camel-cased nor have any underscores --
looking at postgresql.conf.sample, we have "port", "bonjour", "ssl",
"fsync", "geqo", "jit", "autovacuum", "xmlbinary", "xmloption".  (We also
have "include", but I doubt that's ever used in an error message).  But
actually, there's more: every reloption is a candidate, and there we
have "fillfactor", "autosummarize", "fastupdate", "buffering".  So if we
want to make generic advice on how to deal with option names in error
messages, I think the wording on conditional quoting I proposed should
go in (adding CamelCase as a reason not to quote), and then we can fix
the code to match.  Looking at your list, I think the changes to make
are not too numerous.
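
To make that concrete, the resulting messages would look roughly like this
(hypothetical fragments, not strings taken from the tree):

    /* plain lower-case word that could read as English: quote it */
    errmsg("\"fillfactor\" must be between 10 and 100")

    /* underscores (or CamelCase) already mark it as an identifier: no quotes */
    errmsg("work_mem must be at least 64kB")

    /* the name is expanded at run time, so it could be anything: quote it */
    errmsg("unrecognized value for \"%s\"", name)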

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Nadie está tan esclavizado como el que se cree libre no siéndolo" (Goethe)




Re: A recent message added to pg_upgade

2023-11-09 Thread Alvaro Herrera
On 2023-Nov-02, Kyotaro Horiguchi wrote:

> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c
> index b541be8eec..46833f6ecd 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -2063,6 +2063,29 @@ check_wal_segment_size(int *newval, void **extra, 
> GucSource source)
>   return true;
>  }
>  
> +/*
> + * GUC check_hook for max_slot_wal_keep_size
> + *
> + * If WALs needed by logical replication slots are deleted, these slots 
> become
> + * inoperable. During a binary upgrade, pg_upgrade sets this variable to -1 
> via
> + * the command line in an attempt to prevent such deletions, but users have
> + * ways to override it. To ensure the successful completion of the upgrade,
> + * it's essential to keep this variable unaltered.  See
> + * InvalidatePossiblyObsoleteSlot() and start_postmaster() in pg_upgrade for
> + * more details.
> + */
> +bool
> +check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source)
> +{
> + if (IsBinaryUpgrade && *newval != -1)
> + {
> + GUC_check_errdetail("\"%s\" must be set to -1 during binary 
> upgrade mode.",
> + "max_slot_wal_keep_size");
> + return false;
> + }
> + return true;
> +}

One sentence in that comment reads weird.  I'd do this:

s/To ensure the ... unaltered/This check callback ensures the value is
not overridden by the user/


> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index 99823df3c7..5c3d2b1082 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1424,18 +1424,12 @@ 
> InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
>   SpinLockRelease(&s->mutex);
>  
>   /*
> -  * The logical replication slots shouldn't be invalidated as
> -  * max_slot_wal_keep_size GUC is set to -1 during the upgrade.
> -  *
> -  * The following is just a sanity check.
> +  * check_max_slot_wal_keep_size() ensures 
> max_slot_wal_keep_size is set
> +  * to -1, so, slot invalidation for logical slots shouldn't 
> happen
> +  * during an upgrade. At present, only logical slots really 
> require
> +  * this.
>*/
> - if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)
> - {
> - ereport(ERROR,
> - 
> errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> - errmsg("replication slots must not be 
> invalidated during the upgrade"),
> - errhint("\"max_slot_wal_keep_size\" 
> must be set to -1 during the upgrade"));
> - }
> + Assert (!(*invalidated && SlotIsLogical(s) && IsBinaryUpgrade));

I think it's worth adding a comment here, pointing to
check_old_cluster_for_valid_slots() verifying that no
already-invalidated slots exist before the upgrade starts.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-09 Thread Amit Kapila
On Wed, Nov 8, 2023 at 11:05 PM vignesh C  wrote:
>
> On Wed, 8 Nov 2023 at 08:43, vignesh C  wrote:
>
> Here is a small improvisation where num_slots need not be initialized
> as it will be used only after assigning the result now. The attached
> patch has the changes for the same.
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

Thanks for all the feedback!

On Wed, 8 Nov 2023 at 08:59, Michael Paquier  wrote:
>
> By the way, note that the patch is failing to apply, and that I've
> switched it as waiting on author on 10/26.

Here is an updated patchset in attachment. Rebased on the latest HEAD
and changed 'pgstat_report_wal(false)' to 'pgstat_flush_io(false)' in
xlogrecovery.c. I will share the new version of the patchset once I
address the feedback.

Regards,
Nazir Bilal Yavuz
Microsoft
From e5db5cd6d8c47cadde0539f06bbee22368d17a41 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 26 Oct 2023 12:12:32 +0300
Subject: [PATCH v4 1/2] Show WAL stats (except streaming replication WAL) on
 pg_stat_io

This patch aims to showing WAL stats per backend on pg_stat_io view.

With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

doesn't cover:
- Streaming replication WAL IO.
---
 src/backend/access/transam/xlog.c |  60 -
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/utils/activity/pgstat_io.c| 144 --
 src/backend/utils/adt/pgstatfuncs.c   |  10 +-
 src/include/pgstat.h  |   8 +-
 5 files changed, 178 insertions(+), 54 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..d265b8c032 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2256,38 +2256,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time();
 			do
 			{
 errno = 0;
 
-/* Measure I/O timing to write WAL data */
-if (track_wal_io_timing)
-	INSTR_TIME_SET_CURRENT(start);
-else
-	INSTR_TIME_SET_ZERO(start);
-
 pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 written = pg_pwrite(openLogFile, from, nleft, startoffset);
 pgstat_report_wait_end();
 
-/*
- * Increment the I/O timing and the number of times WAL data
- * were written out to disk.
- */
-if (track_wal_io_timing)
-{
-	instr_time	end;
-
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-}
-
 PendingWalStats.wal_write++;
 
 if (written <= 0)
@@ -2312,6 +2296,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+	IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -3005,6 +2992,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3048,6 +3036,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 (errcode_for_file_access(),
  errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3083,6 +3074,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+			io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3099,6 +3093,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
  errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3111,6 +3108,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+			IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 (errcode_for_file_access(),
@@ -8282,7 +8282,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8295,11 +8295,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8343,18 

Re: Some deleted GUCs are still referred to

2023-11-09 Thread Daniel Gustafsson
> On 9 Nov 2023, at 00:51, Peter Smith  wrote:
> 
> Hi,
> 
> I happened to notice that some GUC names "max_fsm_pages" and
> "max_fsm_relations" are still mentioned in these translation files
> (from the REL_16_1 source zip)
> 
> src\backend\po\fr.po
> src\backend\po\tr.po
> 
> ~~
> 
> Should those be removed?

These mentions are only in comments and not in actual translations, so I don't
think they risk causing any issues.

 $ git grep max_fsm_ |cut -d":" -f 2 |grep -v "^#" | wc -l
   0

I don't know enough about the translation workflow to know how these comments
are handled; sending this to pgsql-translators@ might be a better way to reach
the authors working on this.

> There was a commit [1] that said these all traces of those GUCs had
> been eliminated.

Translations are managed in an external repo and synced with the main repo at
intervals, so such a commit couldn't have updated the master translation work
files anyways.

--
Daniel Gustafsson





Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Shlok Kyal
Hi,
I am trying to build postgres with meson on Windows. And I am stuck in
the process.

Steps I followed:

1. I clone postgres repo

2.Installed meson and ninja
pip install meson ninja

3. Then running following command:
meson setup build --buildtype debug

4. Then I ran
cd build
ninja

Got following error
D:\project\repo\pg_meson\postgres\build>C:\Users\kyals\AppData\Roaming\Python\Python311\Scripts\ninja
ninja: error: 'src/backend/postgres_lib.a.p/meson_pch-c.obj', needed
by 'src/backend/postgres.exe', missing and no known rule to make it.

Any thoughts on how to resolve this error?

Thanks
Shlok Kumar Kyal




Re: Infinite Interval

2023-11-09 Thread Dean Rasheed
On Thu, 9 Nov 2023 at 07:15, Ashutosh Bapat
 wrote:
>
> Just to test whether that bug fix also fixes the failure seen with
> this patchset, I am attaching the patchset including the patch with
> the fix.
>
> 0001 - fix in other thread
> 0002 and 0003 are 0001 and 0002 in the previous patch set.
>

Thanks. That's confirmed, it has indeed turned green!

Regards,
Dean




RE: A recent message added to pg_upgade

2023-11-09 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san, hackers,

> Thanks you for the comments!

Thanks for updating the patch!
I'm not sure whether it was intentional, but you might have missed my post... I suggested adding
a
testcase.

I attached the updated version which is almost the same as Horiguchi-san's one,
but has a test. What do you think? Do you have other ideas for testing?


Best Regards,
Hayato Kuroda
FUJITSU LIMITED

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index b541be8eec..46833f6ecd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2063,6 +2063,29 @@ check_wal_segment_size(int *newval, void **extra, 
GucSource source)
return true;
 }
 
+/*
+ * GUC check_hook for max_slot_wal_keep_size
+ *
+ * If WALs needed by logical replication slots are deleted, these slots become
+ * inoperable. During a binary upgrade, pg_upgrade sets this variable to -1 via
+ * the command line in an attempt to prevent such deletions, but users have
+ * ways to override it. To ensure the successful completion of the upgrade,
+ * it's essential to keep this variable unaltered.  See
+ * InvalidatePossiblyObsoleteSlot() and start_postmaster() in pg_upgrade for
+ * more details.
+ */
+bool
+check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source)
+{
+   if (IsBinaryUpgrade && *newval != -1)
+   {
+   GUC_check_errdetail("\"%s\" must be set to -1 during binary 
upgrade mode.",
+   "max_slot_wal_keep_size");
+   return false;
+   }
+   return true;
+}
+
 /*
  * At a checkpoint, how many WAL segments to recycle as preallocated future
  * XLOG segments? Returns the highest segment that should be preallocated.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 99823df3c7..5c3d2b1082 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1424,18 +1424,12 @@ 
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
SpinLockRelease(&s->mutex);
 
/*
-* The logical replication slots shouldn't be invalidated as
-* max_slot_wal_keep_size GUC is set to -1 during the upgrade.
-*
-* The following is just a sanity check.
+* check_max_slot_wal_keep_size() ensures 
max_slot_wal_keep_size is set
+* to -1, so, slot invalidation for logical slots shouldn't 
happen
+* during an upgrade. At present, only logical slots really 
require
+* this.
 */
-   if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)
-   {
-   ereport(ERROR,
-   
errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-   errmsg("replication slots must not be 
invalidated during the upgrade"),
-   errhint("\"max_slot_wal_keep_size\" 
must be set to -1 during the upgrade"));
-   }
+   Assert (!(*invalidated && SlotIsLogical(s) && IsBinaryUpgrade));
 
if (active_pid != 0)
{
diff --git a/src/backend/utils/misc/guc_tables.c 
b/src/backend/utils/misc/guc_tables.c
index 7605eff9b9..818ffdb2ae 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2845,7 +2845,7 @@ struct config_int ConfigureNamesInt[] =
},
&max_slot_wal_keep_size_mb,
-1, -1, MAX_KILOBYTES,
-   NULL, NULL, NULL
+   check_max_slot_wal_keep_size, NULL, NULL
},
 
{
diff --git a/src/bin/pg_upgrade/t/003_logical_slots.pl 
b/src/bin/pg_upgrade/t/003_logical_slots.pl
index 5b01cf8c40..1a55a75827 100644
--- a/src/bin/pg_upgrade/t/003_logical_slots.pl
+++ b/src/bin/pg_upgrade/t/003_logical_slots.pl
@@ -22,7 +22,36 @@ $oldpub->init(allows_streaming => 'logical');
 my $newpub = PostgreSQL::Test::Cluster->new('newpub');
 $newpub->init(allows_streaming => 'logical');
 
-# Setup a common pg_upgrade command to be used by all the test cases
+# In a VPATH build, we'll be started in the source directory, but we want
+# to run pg_upgrade in the build directory so that any files generated finish
+# in it, like delete_old_cluster.{sh,bat}.
+chdir ${PostgreSQL::Test::Utils::tmp_check};
+
+# --
+# TEST: Confirm max_slot_wal_keep_size must not be overwritten
+
+# pg_upgrade will fail because the GUC max_slot_wal_keep_size is overwritten
+# to a positive value
+command_checks_all(
+   [
+   'pg_upgrade', '--no-sync',
+   '-d', $oldpub->data_dir,
+   '-D', $newpub->data_dir,
+   '-b', $oldpub->config_data('--bindir'),
+   '-B', $newpub->config_data('--bindir'),
+   '-s', $newpub->host,
+   '-p', $oldpub->port,
+   '-P', $newpub->port,
+