Re: Synchronizing slots from primary to standby
Hi,

On 11/21/23 6:16 AM, Amit Kapila wrote:
> On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand wrote:
>> As far the 'i' state here, from what I see, it is currently useful for:
>> 1. Cascading standby to not sync slots with state = 'i' from the first standby.
>> 2. Easily report Slots that did not catch up on the primary yet.
>> 3. Avoid inactive slots to block "active" ones creation.
>>
>> So not creating those slots should not be an issue for 1. (sync are not
>> needed on cascading standby as not created on the first standby yet) but
>> is an issue for 2. (unless we provide another way to keep track and
>> report such slots) and 3. (as I think we should still need to reserve WAL).
>>
>> I've a question: we'd still need to reserve WAL for those slots, no?
>> If that's the case and if we don't call ReplicationSlotCreate() then
>> ReplicationSlotReserveWal() would not work as MyReplicationSlot would be NULL.
>
> Yes, we need to reserve WAL to see if we can sync the slot. We are
> currently creating an RS_EPHEMERAL slot and if we don't explicitly
> persist it when we can't sync, then it will be dropped when we do
> ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the
> loss is probably, the next time we again try to sync the slot, we need
> to again create it and may need to wait for newer restart_lsn on standby

Yeah, and doing so we'd reduce the time window to give the slot a chance
to catch up (as opposed to create it a single time and maintain an 'i' state).

> which could be avoided if we have the slot in 'i' state from the previous run.

Right.

> I don't deny the importance of having 'i' (initialized) state but was
> just trying to say that it has additional code complexity.

Right, and I think it's worth it.

> OTOH, having it may give better visibility to even users about slots
> that are not active (say manually created slots on the primary).

Agree.

All that being said, on my side I'm +1 on keeping the 'i' state behavior
as it is implemented currently (would be happy to hear others' opinions too).
Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Use of backup_label not noted in log
On Mon, 2023-11-20 at 11:03 -0800, Andres Freund wrote:
> > If we add a message for starting with "backup_label", shouldn't
> > we also add a corresponding message for starting from a checkpoint
> > found in the control file? If you see that in a problem report,
> > you immediately know what is going on.
>
> Maybe - the reason I hesitate on that is that emitting an additional log
> message when starting from a base backup just adds something "once once the
> lifetime of a node". Whereas emitting something every start obviously doesn't
> impose any limit.

The message should only be shown if PostgreSQL replays WAL, that is,
after a crash.  That would (hopefully) make it a rare message too.

Yours,
Laurenz Albe
Re: remaining sql/json patches
I looked a bit at the parser additions, because there were some concerns
expressed that they are quite big.

It looks like the parser rules were mostly literally copied from the BNF
in the SQL standard.  That's probably a reasonable place to start, but
now at the end, there is some room for simplification.

Attached are a few patches that apply on top of the 0003 patch.  (I
haven't gotten to 0004 in detail yet.)  Some explanations:

0001-Put-keywords-in-right-order.patch

This is just an unrelated cleanup.

0002-Remove-js_quotes-union-entry.patch

We usually don't want to put every single node type into the gram.y
%union.  This one can be trivially removed.

0003-Move-some-code-from-gram.y-to-parse-analysis.patch

Code like this can be postponed to parse analysis, keeping gram.y
smaller.  The error pointer loses a bit of precision, but I think that's
ok.  (There is similar code in your 0004 patch, which could be similarly
moved.)

0004-Remove-JsonBehavior-stuff-from-union.patch

Similar to my 0002.  This adds a few casts as a result, but that is the
typical style in gram.y.

0005-Get-rid-of-JsonBehaviorClause.patch

I think this two-level wrapping of the behavior clauses is both
confusing and overkill.  I was trying to just list the on-empty and
on-error clauses separately in the top-level productions (JSON_VALUE
etc.), but that led to shift/reduce errors.  So the existing rule
structure is probably ok.  But we don't need a separate node type just
to combine two values and then unpack them again shortly thereafter.
So I just replaced all this with a list.

0006-Get-rid-of-JsonCommon.patch

This is an example where the SQL standard BNF is not sensible to apply
literally.  I moved those clauses up directly into their callers, thus
removing one intermediate level of rules and also nodes.  Also, the path
name (AS name) stuff is only for JSON_TABLE, so it's not needed in this
patch.  I removed it here, but it would have to be readded in your 0004
patch.
Another thing: In your patch, JSON_EXISTS has a RETURNING clause
(json_returning_clause_opt), but I don't see that in the standard, and
also not in the Oracle or Db2 docs.  Where did this come from?

With these changes, I think the grammar complexity in your 0003 patch is
at an acceptable level.  Similar simplification opportunities exist in
the 0004 patch, but I haven't worked on that yet.  I suggest that you
focus on getting 0001..0003 committed around this commit fest and then
deal with 0004 in the next one.  (Also split up the 0005 patch into the
pieces that apply to 0003 and 0004, respectively.)

From 90cd46c91231a29a41118861d5a6122d78f93379 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Tue, 21 Nov 2023 05:19:03 +0100
Subject: [PATCH 1/6] Put keywords in right order

---
 src/backend/parser/gram.y | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 1dc3300fde..9a7058b767 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -735,7 +735,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	JOIN JSON JSON_ARRAY JSON_ARRAYAGG JSON_EXISTS JSON_OBJECT JSON_OBJECTAGG
 	JSON_QUERY JSON_SCALAR JSON_SERIALIZE JSON_VALUE
-	KEY KEYS KEEP
+	KEEP KEY KEYS
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-- 
2.42.1

From b669a45c9603a23db240cf1566b3f2e726254ac4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Tue, 21 Nov 2023 05:28:17 +0100
Subject: [PATCH 2/6] Remove js_quotes %union entry

---
 src/backend/parser/gram.y | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9a7058b767..a8cce5b00e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -280,7 +280,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	struct KeyAction *keyaction;
 	JsonBehavior *jsbehavior;
 	JsonBehaviorClause *jsbehaviorclause;
-	JsonQuotes	js_quotes;
 }

 %type stmt toplevel_stmt schema_stmt routine_body_stmt
@@ -662,12 +661,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type json_encoding_clause_opt
 	json_predicate_type_constraint
 	json_wrapper_behavior
+	json_quotes_clause_opt
 %type json_key_uniqueness_constraint_opt
 	json_object_constructor_null_clause_opt
 	json_array_constructor_null_clause_opt
 %type json_behavior
 %type json_behavior_clause_opt
-%type json_quotes_clause_opt

 /*
-- 
2.42.1

From 7fb1906bab90e539697e6d66d3f2754eb3031603 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Tue, 21 Nov 2023 05:43:10 +0100
Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
On Mon, 2023-11-20 at 22:50 -0600, Nathan Bossart wrote:
> I'm mostly thinking out loud here, but could we just always do this?  I
> guess you might want to avoid it if your SH_EQUAL is particularly expensive
> and you know repeated lookups are rare, but maybe that's uncommon enough
> that we don't really care.

I like that simplehash is simple, so I'm not inclined to introduce an
always-on feature.

It would be interesting to know how often it's a good idea to turn it
on, though.  I could try turning it on for various other uses of
simplehash, and see where it tends to win.

The caller can also save the hash and pass it down, but that's not
always convenient to do.

Regards,
	Jeff Davis
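The optimization under discussion can be sketched in standalone C: remember the slot returned by the previous successful lookup and short-circuit when the same key is probed again. This is a simplified illustration of the idea, not PostgreSQL's actual simplehash.h code; all names here (hashset, last_slot, hits_via_cache) are invented for the example.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64            /* power of two, for cheap masking */

/* Toy open-addressing hash set over uint32 keys, with a one-entry
 * cache of the last successful lookup (the SH_OPTIMIZE_REPEAT idea). */
typedef struct {
    uint32_t keys[TABLE_SIZE];
    uint8_t  used[TABLE_SIZE];
    int      last_slot;          /* -1 means "no cached lookup" */
    int      hits_via_cache;     /* how often the cache short-circuited */
} hashset;

static uint32_t hash_u32(uint32_t k) { return k * 2654435761u; }

void hashset_init(hashset *h) {
    memset(h, 0, sizeof(*h));
    h->last_slot = -1;
}

void hashset_insert(hashset *h, uint32_t key) {
    uint32_t i = hash_u32(key) & (TABLE_SIZE - 1);
    while (h->used[i] && h->keys[i] != key)
        i = (i + 1) & (TABLE_SIZE - 1);    /* linear probing */
    h->used[i] = 1;
    h->keys[i] = key;
}

/* Returns 1 if key is present.  A repeated lookup of the same key pays
 * one equality check instead of hashing and probing; any other lookup
 * pays one extra comparison -- the trade-off discussed in the thread. */
int hashset_lookup(hashset *h, uint32_t key) {
    if (h->last_slot >= 0 && h->keys[h->last_slot] == key) {
        h->hits_via_cache++;
        return 1;
    }
    uint32_t i = hash_u32(key) & (TABLE_SIZE - 1);
    while (h->used[i]) {
        if (h->keys[i] == key) {
            h->last_slot = (int) i;        /* cache for next time */
            return 1;
        }
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

The cost structure makes the "always on?" question concrete: when SH_EQUAL is cheap the extra comparison is nearly free, but an expensive comparator with rare repeats would pay for the cache on every miss.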
Re: Why is hot_standby_feedback off by default?
On Tue, Nov 21, 2023 at 6:49 AM Andres Freund wrote:
>
> On 2023-11-20 16:34:47 +0700, John Naylor wrote:
> > Sounds like a TODO?
>
> WFM. I don't personally use or update TODO, as I have my doubts about its
> usefulness or state of maintenance. But please feel free to add this as a TODO
> from my end...

Yeah, I was hoping to change that, but it's been a long row to hoe.
Anyway, the above idea was added under "administration".
Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500
On Tue, Nov 21, 2023 at 1:46 AM Tom Lane wrote:
> * Do we really need to use make_tlist_from_pathtarget?  Why isn't
> the tlist of the cteplan good enough (indeed, more so)?

I think you are right.  The cteplan->targetlist is built for the CTE's
best path by build_path_tlist(), which is almost the same as
make_tlist_from_pathtarget() except that it also replaces nestloop
params.  So cteplan->targetlist is good enough here.

> * I don't love having this code assume that it knows how to find
> the Path the cteplan was made from.  It'd be better to make
> SS_process_ctes save that somewhere, maybe in a list paralleling
> root->cte_plan_ids.

Fair point.  I've updated the patch to v2 for the changes.

> Alternatively: maybe it's time to do what the comments in
> SS_process_ctes vaguely speculate about, and just save the Path
> at that point, with construction of the plan left for createplan()?
> That might be a lot of refactoring for not much gain, so not sure.

I'm not sure if this is worth the effort.  And it seems that we have
the same situation with SubLinks where we construct the plan in
subselect.c rather than createplan.c.

Thanks
Richard

Attachment: v2-0001-Propagate-pathkeys-from-CTEs-up-to-the-outer-query.patch
Typo with amtype = 's' in opr_sanity.sql
Hi all,

While rebasing a patch from 2016 related to sequence AMs (more about
that later), I've bumped on a mistake from 8586bf7ed888 in
opr_sanity.sql, as of:

+SELECT p1.oid, p1.amname, p2.oid, p2.proname
+FROM pg_am AS p1, pg_proc AS p2
+WHERE p2.oid = p1.amhandler AND p1.amtype = 's' AND

It seems to me that this has been copy-pasted on HEAD from the sequence
AM patch, but forgot to update amtype to 't'.  While that's maybe
cosmetic, I think that this could lead to unexpected results, so perhaps
there is a point in doing a backpatch?

Thoughts?
--
Michael

diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 7a6f36a6a9..7610b011d6 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -1920,7 +1920,7 @@ WHERE p1.oid = a1.amhandler AND a1.amtype = 'i' AND
 -- Check for table amhandler functions with the wrong signature
 SELECT a1.oid, a1.amname, p1.oid, p1.proname
 FROM pg_am AS a1, pg_proc AS p1
-WHERE p1.oid = a1.amhandler AND a1.amtype = 's' AND
+WHERE p1.oid = a1.amhandler AND a1.amtype = 't' AND
     (p1.prorettype != 'table_am_handler'::regtype
      OR p1.proretset
      OR p1.pronargs != 1
diff --git a/src/test/regress/sql/opr_sanity.sql b/src/test/regress/sql/opr_sanity.sql
index e2d2c70d70..2fe7b6dcc4 100644
--- a/src/test/regress/sql/opr_sanity.sql
+++ b/src/test/regress/sql/opr_sanity.sql
@@ -1223,7 +1223,7 @@ WHERE p1.oid = a1.amhandler AND a1.amtype = 'i' AND
 SELECT a1.oid, a1.amname, p1.oid, p1.proname
 FROM pg_am AS a1, pg_proc AS p1
-WHERE p1.oid = a1.amhandler AND a1.amtype = 's' AND
+WHERE p1.oid = a1.amhandler AND a1.amtype = 't' AND
     (p1.prorettype != 'table_am_handler'::regtype
      OR p1.proretset
      OR p1.pronargs != 1
Re: Synchronizing slots from primary to standby
On Tue, Nov 21, 2023 at 10:02 AM Zhijie Hou (Fujitsu) wrote: > > On Friday, November 17, 2023 7:39 PM Amit Kapila > wrote: > > > > On Thu, Nov 16, 2023 at 5:34 PM shveta malik > > wrote: > > > > > > PFA v35. > > > > > > > Review v35-0002* > > == > > Thanks for the comments. > > > 1. > > As quoted in the commit message, > > > > > If a logical slot is invalidated on the primary, slot on the standby is also > > invalidated. If a logical slot on the primary is valid but is invalidated > > on the > > standby due to conflict (say required rows removed on the primary), then > > that > > slot is dropped and recreated on the standby in next sync-cycle. > > It is okay to recreate such slots as long as these are not consumable on the > > standby (which is the case currently). > > > > > > > I think this won't happen normally because of the physical slot and > > hot_standby_feedback but probably can occur in cases like if the user > > temporarily switches hot_standby_feedback from on to off. Are there any > > other > > reasons? I think we can mention the cases along with it as well at least > > for now. > > Additionally, I think this should be covered in code comments as well. > > I will collect all these cases and update in next version. > > > > > 2. > > #include "postgres.h" > > - > > +#include "access/genam.h" > > > > Spurious line removal. > > Removed. > > > > > 3. > >A password needs to be provided too, if the sender demands > > password > >authentication. It can be provided in the > >primary_conninfo string, or in a separate > > - ~/.pgpass file on the standby server (use > > - replication as the database name). > > - Do not specify a database name in the > > - primary_conninfo string. > > + ~/.pgpass file on the standby server. > > + > > + > > + Specify dbname in > > + primary_conninfo string to allow > > synchronization > > + of slots from the primary server to the standby server. > > + This will only be used for slot synchronization. It is ignored > > + for streaming. 
> > > > Is there a reason to remove part of the earlier sentence "use > > replication as the database name"? > > Added it back. > > > > > 4. > > + enable_syncslot configuration > > parameter > > + > > + > > + > > + > > +It enables a physical standby to synchronize logical failover slots > > +from the primary server so that logical subscribers are not blocked > > +after failover. > > + > > + > > +It is enabled by default. This parameter can only be set in the > > +postgresql.conf file or on the server > > command line. > > + > > > > I think you forgot to update the documentation for the default value of this > > variable. > > Updated. > > > > > 5. > > + * a) start the logical replication workers for every enabled > > subscription > > + * when not in standby_mode > > + * b) start the slot-sync worker for logical failover slots > > synchronization > > + * from the primary server when in standby_mode. > > > > Either use a full stop after both lines or none of these. > > Added a full stop. > > > > > 6. > > +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker); > > > > There shouldn't be space between * and the worker. > > Removed, and added the type to typedefs.list. > > > > > 7. > > + if (!SlotSyncWorker->hdr.in_use) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker not initialized, " > > + "cannot attach"))); > > + } > > + > > + if (SlotSyncWorker->hdr.proc) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker is " > > + "already running, cannot attach"))); > > + } > > > > Using slot-sync in the error messages looks a bit odd to me. Can we use > > "replication slot sync worker ..." in both these and other similar > > messages? 
I > > think it would be better if we don't split the messages into multiple lines > > in > > these cases as messages don't appear too long to me. > > Changed as suggested. > > > > > 8. > > +/* > > + * Detach the worker from DSM and update 'proc' and 'in_use'. > > + * Logical replication launcher will come to know using these > > + * that the worker has shutdown. > > + */ > > +void > > +slotsync_worker_detach(int code, Datum arg) { > > > > I think the reference to DSM is leftover from the previous version of the > > patch. > > Can we change the above comments as per the new code? > > Changed. > > > > > 9. > > +static bool > > +slotsync_worker_launch() > > { > > ... > > + /* TODO: do we really need 'generation', analyse more here */ > > + worker->hdr.generation++; > > > > We should do something about this
Re: Simplify if/else logic of walsender CreateReplicationSlot
On Tue, Nov 21, 2023 at 3:57 PM Michael Paquier wrote:
>
> On Mon, Nov 20, 2023 at 05:07:38PM +0900, Michael Paquier wrote:
> > Good idea.  What you are suggesting here improves the readability of
> > this code, so +1.
>
> And applied this one, thanks!

Thanks for pushing.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Synchronizing slots from primary to standby
On Tue, Nov 21, 2023 at 10:01 AM Zhijie Hou (Fujitsu) wrote:
>
> On Saturday, November 18, 2023 6:46 PM Amit Kapila wrote:
> >
> > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand wrote:
> > >
> > > On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
> > > > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand wrote:
> > > >
> > > > I feel the WaitForWALToBecomeAvailable may not be the best place to
> > > > shutdown slotsync worker and drop slots. There could be other
> > > > reasons(other than promotion) as mentioned in comments in case
> > > > XLOG_FROM_STREAM to reach the code there. I thought if the intention
> > > > is to stop slotsync workers on promotion, maybe FinishWalRecovery()
> > > > is a better place to do it as it's indicating the end of recovery and
> > > > XLogShutdownWalRcv is also called in it.
> > >
> > > I can see that slotsync_drop_initiated_slots() has been moved in
> > > FinishWalRecovery() in v35. That looks ok.
> > >
> >
> > More Review for v35-0002*
>
> Thanks for the comments.
>
> > 1.
> > + ereport(WARNING,
> > + errmsg("skipping slots synchronization as primary_slot_name "
> > + "is not set."));
> >
> > There is no need to use a full stop at the end for WARNING messages and as
> > previously mentioned, let's not split message lines in such cases. There are
> > other messages in the patch with similar problems, please fix those as well.
>
> Adjusted.
>
> > 2.
> > +slotsync_checks()
> > {
> > ...
> > ...
> > + /* The hot_standby_feedback must be ON for slot-sync to work */
> > + if (!hot_standby_feedback)
> > + {
> > + ereport(WARNING,
> > + errmsg("skipping slots synchronization as hot_standby_feedback "
> > + "is off."));
> >
> > This message has the same problem as mentioned in the previous comment.
> > Additionally, I think either atop slotsync_checks or along with GUC check we
> > should write comments as to why we expect these values to be set for slot
> > sync to work.
>
> Added comments for these cases.
> > 3.
> > + /* The worker is running already */
> > + if (SlotSyncWorker &>hdr.in_use &&
> > + SlotSyncWorker->hdr.proc)
> >
> > The spacing for both the &&'s has problems. You need a space after the first
> > && and the second && should be in the prior line.
>
> Adjusted.
>
> > 4.
> > + LauncherRereadConfig(_slotsync);
> > +
> > }
> >
> > An empty line after LauncherRereadConfig() is not required.
> >
> > 5.
> > +static void
> > +LauncherRereadConfig(bool *ss_recheck)
> > +{
> > + char*conninfo = pstrdup(PrimaryConnInfo);
> > + char*slotname = pstrdup(PrimarySlotName);
> > + bool syncslot = enable_syncslot;
> > + bool feedback = hot_standby_feedback;
> >
> > Can we change the variable name 'feedback' to 'standbyfeedback' to make it
> > slightly more descriptive?
>
> Changed.
>
> > 6. The logic to recheck the slot_sync related parameters in
> > LauncherMain() is not very clear. IIUC, if after reload config any
> > parameter is changed, we just seem to be checking the validity of the
> > changed parameter but not restarting the slot sync worker, is that correct?
> > If so, what if dbname is changed, don't we need to restart the slot-sync
> > worker and re-initialize the connection; similarly slotname change also
> > needs some thoughts. Also, if all the parameters are valid we seem to be
> > re-launching the slot-sync worker without first stopping it which doesn't
> > seem correct, am I missing something in this logic?
>
> I think the slot sync worker will be stopped in LauncherRereadConfig() if
> GUC changed and new slot sync worker will be started in next loop in
> LauncherMain().

yes, LauncherRereadConfig will stop the worker on any parameter change
and will set recheck_slotsync(). On finding this flag as true,
LauncherMain will redo all the validations and restart slot-sync worker
if needed. Yes, we do need to stop and relaunch slotsync workers on
dbname change as well. This is currently missing in LauncherRereadConfig().
Regarding slot name change, we are already doing it, we are already
checking PrimarySlotName in LauncherRereadConfig()

> > 7.
> > @@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> > errmsg("replication slot \"%s\" was not created in this database",
> > NameStr(slot->data.name;
> >
> > + in_recovery = RecoveryInProgress();
> > +
> > + /*
> > + * Do not allow consumption of a "synchronized" slot until the standby
> > + * gets promoted. Also do not allow consumption of slots with sync_state
> > + * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be
> > + * used.
> > + */
> > + if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) ||
> > + slot->data.sync_state == SYNCSLOT_STATE_INITIATED)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > + errmsg("cannot use replication slot \"%s\" for logical decoding",
> > + NameStr(slot->data.name)),
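The consumability rule in the quoted CreateDecodingContext() hunk can be restated as a standalone predicate. This is an illustrative sketch only: the state names mirror the patch (SYNCSLOT_STATE_*), but the enum and function below are invented for the example and are not the patch's actual code, which operates on the slot struct and raises an ERROR instead of returning false.

```c
#include <stdbool.h>

/* Sync states as described in the thread ('n', 'i', 'r'). */
typedef enum {
    SYNCSLOT_STATE_NONE,        /* 'n': slot not created by slot sync */
    SYNCSLOT_STATE_INITIATED,   /* 'i': sync started, not yet caught up */
    SYNCSLOT_STATE_READY        /* 'r': fully synchronized */
} SyncSlotState;

/* A synced slot must not be consumed before the standby is promoted,
 * and an 'i'-state slot is never consumable because its synchronization
 * is incomplete -- the two conditions in the quoted if() above. */
bool
slot_is_consumable(SyncSlotState sync_state, bool in_recovery)
{
    if (in_recovery && sync_state != SYNCSLOT_STATE_NONE)
        return false;
    if (sync_state == SYNCSLOT_STATE_INITIATED)
        return false;
    return true;
}
```

Note that after promotion (in_recovery = false) a fully synced 'r' slot becomes consumable, while an 'i' slot stays blocked in either case.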
Re: Synchronizing slots from primary to standby
Here are some review comments for the patch v35-0001. == 0. GENERAL documentation I felt that the documentation gave details of the individual changes (e.g. GUC 'standby_slot_names' and API, CREATE SUBSCRIPTION option, and pg_replication_slots 'failover' attribute etc.) but there is nothing that seemed to bring all these parts together to give examples for user "when" and "how" to make all these parts work. I'm not sure if there is some overview missing from this patch 1 or if you are planning that extra documentation for subsequent patches. == Commit message 1. A new property 'failover' is added at the slot level which is persistent information which specifies that this logical slot is enabled to be synced to the physical standbys so that logical replication can be resumed after failover. It is always false for physical slots. ~ SUGGESTION A new property 'failover' is added at the slot level. This is persistent information to indicate that this logical slot... ~~~ 2. Users can set it during the create subscription or during pg_create_logical_replication_slot. Examples: create subscription mysub connection '..' publication mypub WITH (failover = true); --last arg SELECT * FROM pg_create_logical_replication_slot('myslot', 'pgoutput', false, true, true); ~ 2a. Add a blank line before this ~ 2b. Use uppercase for the SQL ~ 2c. SUGGESTION Users can set this flag during CREATE SUBSCRIPTION or during pg_create_logical_replication_slot API. Ex1. CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub WITH (failover = true); Ex2. (failover is the last arg) SELECT * FROM pg_create_logical_replication_slot('myslot', 'pgoutput', false, true, true); ~~~ 3. This 'failover' is displayed as part of pg_replication_slots view. ~ SUGGESTION The value of the 'failover' flag is displayed as part of pg_replication_slots view. ~~~ 4. A new GUC standby_slot_names has been added. It is the list of physical replication slots that logical replication with failover enabled waits for. 
The intent of this wait is that no logical replication subscribers (with failover=true) should go ahead of physical replication standbys (corresponding to the physical slots in standby_slot_names). ~ 4a. SUGGESTION A new GUC standby_slot_names has been added. This is a list of physical replication slots that logical replication with failover enabled will wait for. ~ 4b. /no logical replication subscribers/no logical replication subscriptions/ ~ 4c /should go ahead of physical/should get ahead of physical/ == contrib/test_decoding/sql/slot.sql 5. + +-- Test logical slots creation with 'failover'=true (last arg) +SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', false, false, true); +SELECT slot_name, slot_type, failover FROM pg_replication_slots; + +SELECT pg_drop_replication_slot('failover_slot'); How about a couple more simple tests: a) pass false arg to confirm it is false in the view. b) according to the docs this failover is optional, so try API without passing it c) create a physical slot to confirm it is false in the view. == doc/src/sgml/catalogs.sgml 6. + + + subfailoverstate char + + + State codes for failover mode: + d = disabled, + p = pending enablement, + e = enabled + + + This attribute is very similar to the 'subtwophasestate' so IMO it would be better to be adjacent to that one in the docs. (probably this means putting it in the same order in the catalog also, assuming that is allowed) == doc/src/sgml/config.sgml 7. + +List of physical replication slots that logical replication slots with +failover enabled waits for. If a logical replication connection is +meant to switch to a physical standby after the standby is promoted, +the physical replication slot for the standby should be listed here. + + +The standbys corresponding to the physical replication slots in +standby_slot_names must enable +enable_syncslot for the standbys to receive +failover logical slots changes from the primary. 
+ That sentence mentioning 'enable_syncslot' seems premature because AFAIK that GUC is not introduced until patch 0002. So this part should be moved into the 0002 patch. == doc/src/sgml/ref/alter_subscription.sgml 8. These commands also cannot be executed when the subscription has two_phase - commit enabled, unless + commit enabled or + failover + enabled, unless copy_data is false. See column subtwophasestate - of pg_subscription + and subfailoverstate of + pg_subscription to know the actual two-phase state. I think the last sentence doesn't make sense anymore because it is no longer talking about only two-phase state. BEFORE See column subtwophasestate and subfailoverstate of pg_subscription to know the actual two-phase state. SUGGESTION See column
Re: Synchronizing slots from primary to standby
On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand wrote:
>
> On 11/20/23 11:59 AM, Amit Kapila wrote:
> > On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand wrote:
> >>
> >> On 11/18/23 11:45 AM, Amit Kapila wrote:
> >>> On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand wrote:
> >>>> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
> >>>>> On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand wrote:
> >>>>>
> >>>>> I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown
> >>>>> slotsync worker and drop slots. There could be other reasons(other than
> >>>>> promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code
> >>>>> there. I thought if the intention is to stop slotsync workers on promotion,
> >>>>> maybe FinishWalRecovery() is a better place to do it as it's indicating the end
> >>>>> of recovery and XLogShutdownWalRcv is also called in it.
> >>>>
> >>>> I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery()
> >>>> in v35. That looks ok.
> >>>
> >>> I was thinking what if we just ignore creating such slots (which
> >>> require init state) in the first place? I think that can be
> >>> time-consuming in some cases but it will reduce the complexity and we
> >>> can always improve such cases later if we really encounter them in the
> >>> real world. I am not very sure that added complexity is worth
> >>> addressing this particular case, so I would like to know your and
> >>> others' opinions.
> >>
> >> I'm not sure I understand your point. Are you saying that we should not create
> >> slots on the standby that are "currently" reported in a 'i' state? (so just keep
> >> the 'r' and 'n' states?)
> >
> > Yes.
>
> As far the 'i' state here, from what I see, it is currently useful for:
> 1. Cascading standby to not sync slots with state = 'i' from the first standby.
> 2. Easily report Slots that did not catch up on the primary yet.
> 3. Avoid inactive slots to block "active" ones creation.
>
> So not creating those slots should not be an issue for 1. (sync are
> not needed on cascading standby as not created on the first standby yet)
> but is an issue for 2. (unless we provide another way to keep track and report
> such slots) and 3. (as I think we should still need to reserve WAL).
>
> I've a question: we'd still need to reserve WAL for those slots, no?
>
> If that's the case and if we don't call ReplicationSlotCreate() then
> ReplicationSlotReserveWal() would not work as MyReplicationSlot would be NULL.

Yes, we need to reserve WAL to see if we can sync the slot. We are
currently creating an RS_EPHEMERAL slot and if we don't explicitly
persist it when we can't sync, then it will be dropped when we do
ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the
loss is probably, the next time we again try to sync the slot, we need
to again create it and may need to wait for newer restart_lsn on standby
which could be avoided if we have the slot in 'i' state from the
previous run. I don't deny the importance of having 'i' (initialized)
state but was just trying to say that it has additional code complexity.
OTOH, having it may give better visibility to even users about slots
that are not active (say manually created slots on the primary).

--
With Regards,
Amit Kapila.
Re: Do away with a few backwards compatibility macros
Nathan Bossart writes:
> On Thu, Nov 16, 2023 at 09:46:22AM -0600, Nathan Bossart wrote:
>> I'm fine with this because all of these macros are no-ops for all supported
>> versions of Postgres.  Even if an extension is using them today, you'll get
>> the same behavior as before if you remove the uses and rebuild against
>> v12-v16.

> Barring objections, I'll plan on committing this in the next week or so.

No objection here, but should we try to establish some sort of project
policy around this sort of change (ie, removal of backwards-compatibility
support)?  "Once it no longer matters for any supported version" sounds
about right to me, but maybe somebody has an argument for thinking about
it differently.

			regards, tom lane
How to accurately determine when a relation should use local buffers?
Dear Hackers,

I would like to clarify what the correct way is to determine that a
given relation is using local buffers.  Local buffers, as far as I know,
are used for temporary tables in backends.  There are two
functions/macros to check this (bufmgr.c): SmgrIsTemp and
RelationUsesLocalBuffers.  The first verifies that the current process
is a regular session backend, while the other macro checks the
relation's persistence characteristic.  It seems that using either
function on its own is not correct: I think they should be applied as a
pair to check for local buffer use, but in practice they appear to be
used independently.  This works only as long as temporary tables are
allowed exclusively in session backends.

I'm wondering how to determine the use of local buffers in other,
theoretical cases.  For example, what if we decide to replicate
temporary tables?  Are there other cases in vanilla PostgreSQL where
local buffers can be used with relations?  Do we allow the use of
relations with RELPERSISTENCE_TEMP anywhere other than in session
backends?

Thank you in advance for your help!

With best regards,
Vitaly Davydov
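The pairing being proposed can be restated as a tiny standalone predicate. This follows the email's own description of the two checks and is purely illustrative: the real SmgrIsTemp and RelationUsesLocalBuffers macros live in PostgreSQL's headers and operate on SMgrRelation and Relation respectively, while the helper names below are invented for the sketch.

```c
#include <stdbool.h>

#define RELPERSISTENCE_TEMP 't'   /* as in pg_class.relpersistence */

/* RelationUsesLocalBuffers-style check: a property of the relation alone */
static bool
relation_uses_local_buffers(char relpersistence)
{
    return relpersistence == RELPERSISTENCE_TEMP;
}

/* SmgrIsTemp-style check, per the description above: a property of the
 * accessing process (today only a regular session backend owns temp rels) */
static bool
process_can_have_temp_rels(bool is_session_backend)
{
    return is_session_backend;
}

/* The point of the email: neither test alone is sufficient in general;
 * the robust test is arguably the conjunction of both. */
bool
buffers_are_local(char relpersistence, bool is_session_backend)
{
    return relation_uses_local_buffers(relpersistence)
        && process_can_have_temp_rels(is_session_backend);
}
```

Under today's rules the two factors never disagree (temp relations exist only in session backends), which is why using either check alone currently works; the hypothetical cases in the email are exactly those where they could diverge.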
Re: Do away with a few backwards compatibility macros
On Thu, Nov 16, 2023 at 09:46:22AM -0600, Nathan Bossart wrote:
> On Thu, Nov 16, 2023 at 07:11:41PM +0530, Bharath Rupireddy wrote:
>> After a recent commit 6a72c42f (a related discussion [1]) which
>> removed MemoryContextResetAndDeleteChildren(), I think there are a
>> couple of other backward compatibility macros out there that can be
>> removed. These macros are tuplestore_donestoring() which was
>> introduced by commit dd04e95 21 years ago and SPI_push() and friends
>> which were made no-op macros by commit 1833f1a 7 years ago. Debian
>> code search shows very minimal usage of these macros. Here's a patch
>> attached to remove them.
>
> I'm fine with this because all of these macros are no-ops for all supported
> versions of Postgres. Even if an extension is using them today, you'll get
> the same behavior as before if you remove the uses and rebuild against
> v12-v16.

Barring objections, I'll plan on committing this in the next week or so.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Simplify if/else logic of walsender CreateReplicationSlot
On Mon, Nov 20, 2023 at 05:07:38PM +0900, Michael Paquier wrote:
> Good idea. What you are suggesting here improves the readability of
> this code, so +1.

And applied this one, thanks!

--
Michael
Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
On Mon, Nov 20, 2023 at 06:12:47PM -0800, Jeff Davis wrote:
> The caller could do something similar, so this option is not necessary,
> but it seems like it could be generally useful. It speeds things up for
> the search_path cache (and is an alternative to another patch I have
> that implements the same thing in the caller).

I'm mostly thinking out loud here, but could we just always do this?  I
guess you might want to avoid it if your SH_EQUAL is particularly
expensive and you know repeated lookups are rare, but maybe that's
uncommon enough that we don't really care.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Hide exposed impl detail of wchar.c
On Mon, Nov 20, 2023 at 05:14:17PM -0800, Jubilee Young wrote:
> On Mon, Nov 20, 2023 at 2:52 PM Nathan Bossart wrote:
>> Does pgrx use ascii.h at all?
>
> We don't use utils/ascii.h, no.

Alright.  The next minor release isn't until February, so I'll let this
one sit a little while longer in case anyone objects to back-patching
something like this [0].

[0] https://postgr.es/m/attachment/152305/move_is_valid_ascii_v2.patch

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Annoying build warnings from latest Apple toolchain
Hi,

On 2023-11-20 16:20:20 -0500, Tom Lane wrote:
> I'm generally still using autoconf, I only run meson builds when
> somebody complains about them ;-).  But yeah, I see lots of
> "ld: warning: -undefined error is deprecated" when I do that.
> This seems to have been installed by Andres' 9a95a510a:
>
>     ldflags += ['-isysroot', pg_sysroot]
> +   # meson defaults to -Wl,-undefined,dynamic_lookup for modules, which we
> +   # don't want because a) it's different from what we do for autoconf, b) it
> +   # causes warnings starting in macOS Ventura
> +   ldflags_mod += ['-Wl,-undefined,error']

That's not the sole issue, because meson automatically adds that for
binaries (due to some linker issue that existed in the past, IIRC).

> The autoconf side seems to just be letting this option default.
> I'm not sure what the default choice is, but evidently it's not
> "-undefined error"?  Or were they stupid enough to not allow you
> to explicitly select the default behavior?

I'm somewhat confused by what Apple did. I was just upgrading my M1 Mac
mini to Sonoma, and in one state I recall man ld documenting the option
below "Obsolete Options". But then I also updated Xcode, and now there's
no mention whatsoever in the man page of the option being deprecated.
Perhaps my mind is playing tricks on me.

And yes, it sure looks like everything other than 'dynamic_lookup'
creates a warning. Which seems absurd.

> Also, I *don't* see any complaints about duplicate libraries.
> What build options are you using?

I don't understand what Apple is thinking here. Getting the same library
name twice, e.g. icu once directly and once indirectly via xml2-config
--libs or such, seems very common. To avoid that portably, you basically
need to do a topological sort of the libraries, to figure out an ordering
that deduplicates but doesn't move a library to before where it's used,
with a bunch of complexities (due to -L, which could lead to finding
different libraries for the same library name) thrown in.

WRT Robert seeing those warnings and Tom not: there's something odd going
on. I couldn't immediately reproduce it. Then I realized it reproduces
against a homebrew install but not a macports one. Robert, which are you
using?

Meson actually *tries* to avoid this warning without producing incorrect
results due to ordering. But homebrew does something odd with libxml-2.0,
zlib and a few others: unless you do something to change that, brew has
/opt/homebrew/Library/Homebrew/os/mac/pkgconfig/14/ in its search path,
but those libraries aren't from homebrew, they're referencing macOS
frameworks. Apparently meson isn't able to understand which files those
.pc files link to and gives up on the deduplication.

If I add to the pkg-config search path, e.g. via

  meson configure -Dpkg_config_path=$OTHER_STUFF:/opt/homebrew/opt/zlib/lib/pkgconfig/:/opt/homebrew/opt/libxml2/lib/pkgconfig/

the warnings about duplicate libraries vanish.

Greetings,

Andres Freund
RE: Synchronizing slots from primary to standby
On Friday, November 17, 2023 7:39 PM Amit Kapila wrote:
>
> On Thu, Nov 16, 2023 at 5:34 PM shveta malik wrote:
> >
> > PFA v35.
>
> Review v35-0002*
> ==

Thanks for the comments.

> 1.
> As quoted in the commit message,
> >
> > If a logical slot is invalidated on the primary, slot on the standby is also
> > invalidated. If a logical slot on the primary is valid but is invalidated on the
> > standby due to conflict (say required rows removed on the primary), then that
> > slot is dropped and recreated on the standby in next sync-cycle.
> > It is okay to recreate such slots as long as these are not consumable on the
> > standby (which is the case currently).
> >
>
> I think this won't happen normally because of the physical slot and
> hot_standby_feedback but probably can occur in cases like if the user
> temporarily switches hot_standby_feedback from on to off. Are there any
> other reasons? I think we can mention the cases along with it as well at
> least for now. Additionally, I think this should be covered in code
> comments as well.

I will collect all these cases and update in next version.

> 2.
>  #include "postgres.h"
> -
> +#include "access/genam.h"
>
> Spurious line removal.

Removed.

> 3.
>        A password needs to be provided too, if the sender demands password
>        authentication.  It can be provided in the
>        primary_conninfo string, or in a separate
> -      ~/.pgpass file on the standby server (use
> -      replication as the database name).
> -      Do not specify a database name in the
> -      primary_conninfo string.
> +      ~/.pgpass file on the standby server.
> +
> +      Specify dbname in
> +      primary_conninfo string to allow synchronization
> +      of slots from the primary server to the standby server.
> +      This will only be used for slot synchronization. It is ignored
> +      for streaming.
>
> Is there a reason to remove part of the earlier sentence "use
> replication as the database name"?

Added it back.

> 4.
> +      enable_syncslot configuration parameter
> +
> +      It enables a physical standby to synchronize logical failover slots
> +      from the primary server so that logical subscribers are not blocked
> +      after failover.
> +
> +      It is enabled by default. This parameter can only be set in the
> +      postgresql.conf file or on the server command line.
>
> I think you forgot to update the documentation for the default value of
> this variable.

Updated.

> 5.
> + * a) start the logical replication workers for every enabled subscription
> + *    when not in standby_mode
> + * b) start the slot-sync worker for logical failover slots synchronization
> + *    from the primary server when in standby_mode.
>
> Either use a full stop after both lines or none of these.

Added a full stop.

> 6.
> +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker);
>
> There shouldn't be space between * and the worker.

Removed, and added the type to typedefs.list.

> 7.
> + if (!SlotSyncWorker->hdr.in_use)
> + {
> +     LWLockRelease(SlotSyncWorkerLock);
> +     ereport(ERROR,
> +             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +              errmsg("replication slot-sync worker not initialized, "
> +                     "cannot attach")));
> + }
> +
> + if (SlotSyncWorker->hdr.proc)
> + {
> +     LWLockRelease(SlotSyncWorkerLock);
> +     ereport(ERROR,
> +             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +              errmsg("replication slot-sync worker is "
> +                     "already running, cannot attach")));
> + }
>
> Using slot-sync in the error messages looks a bit odd to me. Can we use
> "replication slot sync worker ..." in both these and other similar
> messages? I think it would be better if we don't split the messages into
> multiple lines in these cases as messages don't appear too long to me.

Changed as suggested.

> 8.
> +/*
> + * Detach the worker from DSM and update 'proc' and 'in_use'.
> + * Logical replication launcher will come to know using these
> + * that the worker has shutdown.
> + */
> +void
> +slotsync_worker_detach(int code, Datum arg)
> +{
>
> I think the reference to DSM is leftover from the previous version of the
> patch. Can we change the above comments as per the new code?

Changed.

> 9.
> +static bool
> +slotsync_worker_launch()
> {
> ...
> + /* TODO: do we really need 'generation', analyse more here */
> + worker->hdr.generation++;
>
> We should do something about this TODO. As per my understanding, we don't
> need a generation number for the slot sync worker as we have one such
> worker but I guess the patch requires it because we are using existing
> logical replication worker infrastructure. This brings the question of
> whether we really need a separate SlotSyncWorkerInfo or if we can use
> existing LogicalRepWorker and distinguish it with
Re: archive modules loose ends
On Mon, Nov 13, 2023 at 03:35:28PM -0800, Andres Freund wrote:
> On 2023-11-13 16:42:31 -0600, Nathan Bossart wrote:
>> There seems to be no interest in this patch, so I plan to withdraw it from
>> the commitfest system by the end of the month unless such interest
>> materializes.
>
> I think it might just have arrived too shortly before the feature freeze to be
> worth looking at at the time, and then it didn't really re-raise attention
> until now. I'm so far behind on keeping up with the list that I rarely end up
> looking far back for things I'd like to have answered... Sorry.

No worries.  I appreciate the review.

>> I see two main options for dealing with this.  One option is to simply have
>> shell_archive (and any other archive modules out there) maintain its own
>> memory context like basic_archive does.  This ends up requiring a whole lot
>> of duplicate code between the two built-in modules, though.  Another option
>> is to have the archiver manage a memory context that it resets after every
>> invocation of the archiving callback, ERROR or not.
>
> I think passing in a short-lived memory context is a lot nicer to deal with.

Cool.

>> This has the advantage of avoiding code duplication and simplifying things
>> for the built-in modules, but any external modules that rely on palloc'd
>> state being long-lived would need to be adjusted to manage their own
>> long-lived context.  (This would need to be appropriately documented.)
>
> Alternatively we could provide a longer-lived memory context in
> ArchiveModuleState, set up by the generic infrastructure. That context would
> obviously still need to be explicitly utilized by a module, but no duplicated
> setup code would be required.

Sure.  Right now, I'm not sure there's too much need for that.  A module
could just throw stuff in TopMemoryContext, and you probably wouldn't
have any leaks because the archiver just restarts on any ERROR or
archive_library change.  But that's probably not a pattern we want to
encourage long-term.  I'll jot this down for a follow-up patch idea.

> I think we should just have the AtEOXact_Files() in pgarch.c, then no
> PG_TRY/CATCH is needed here. At the moment I think just about every possible
> use of an archive module would require using files, so there doesn't seem
> much of a reason to not handle it in pgarch.c.

WFM.

> I'd probably reset a few other subsystems at the same time (there's probably
> more):
> - disable_all_timeouts()
> - LWLockReleaseAll()
> - ConditionVariableCancelSleep()
> - pgstat_report_wait_end()
> - ReleaseAuxProcessResources()

I looked around a bit and thought AtEOXact_HashTables() belonged here as
well.  I'll probably give this one another pass to see if there's
anything else obvious.

> It could be worth setting up an errcontext providing the module and file
> that's being processed. I personally find that at least as important as
> setting up a ps string detailing the log file... But I guess that could be a
> separate patch.

Indeed.  Right now we rely on the module to emit sufficiently-detailed
logs, but it'd be nice if they got that for free.

> It'd be nice to add a comment explaining why pgarch_archiveXlog() is the
> right place to handle errors.

Will do.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Nov 20, 2023 at 4:28 PM Amit Kapila wrote:
>
> 9.
> +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
> +                     bool *slot_updated)
> {
> ...
> + else
> + {
> +     TransactionId xmin_horizon = InvalidTransactionId;
> +     ReplicationSlot *slot;
> +
> +     ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
> +                           remote_slot->two_phase, false);
> +     slot = MyReplicationSlot;
> +
> +     SpinLockAcquire(&slot->mutex);
> +     slot->data.database = get_database_oid(remote_slot->database, false);
> +
> +     /* Mark it as sync initiated by slot-sync worker */
> +     slot->data.sync_state = SYNCSLOT_STATE_INITIATED;
> +     slot->data.failover = true;
> +
> +     namestrcpy(&slot->data.plugin, remote_slot->plugin);
> +     SpinLockRelease(&slot->mutex);
> +
> +     ReplicationSlotReserveWal();
> +
>
> How and when will this init state (SYNCSLOT_STATE_INITIATED) persist to
> disk?

On closer inspection, I see that it is done inside
wait_for_primary_and_sync() when it fails to sync. I think it is better
to refactor the code a bit and persist it in synchronize_one_slot() to
make the code flow easier to understand.

--
With Regards,
Amit Kapila.
Re: [PATCH] pgbench log file headers
Hello,

On Mon, Nov 13, 2023 at 6:01 PM Andres Freund wrote:
> Hi,
>
> On 2023-11-13 11:55:07 -0600, Adam Hendel wrote:
>> Currently, pgbench will log individual transactions to a logfile when the
>> `--log` parameter flag is provided. The logfile, however, does not include
>> column headers. It has become a fairly standard expectation of users to
>> have column headers present in flat files. Without the header in the
>> pgbench log files, new users must navigate to the docs and piece together
>> the column headers themselves. Most industry leading frameworks have
>> tooling built in to read column headers though, for example python/pandas
>> read_csv().
>
> The disadvantage of doing that is that a single pgbench run with --log will
> generate many files when using -j, to avoid contention. The easiest way to
> deal with that after the run is to cat all the log files to a larger one.
> If you do that with headers present, you obviously have a few bogus rows
> (the headers from the various files).
>
> I guess you could avoid the "worst" of that by documenting that the
> combined log file should be built by
>   cat {$log_prefix}.${pid} {$log_prefix}.${pid}.*
> and only outputting the header in the file generated by the main thread.
>
>> We can improve the experience for users by adding column headers to
>> pgbench logfiles with an optional command line flag, `--log-header`. This
>> will keep the API backwards compatible by making users opt in to the
>> column headers. It follows the existing pattern of having conditional
>> flags in pgbench's API; the `--log` option would have both --log-prefix
>> and --log-header if this work is accepted.
>
>> The implementation considers the column headers only when the
>> `--log-header` flag is present. The values for the columns are taken
>> directly from the "Per-Transaction Logging" section in
>> https://www.postgresql.org/docs/current/pgbench.html and takes into
>> account the conditional columns `schedule_lag` and `retries`.
>
>> Below is an example of what that logfile will look like:
>>
>> pgbench postgres://postgres:postgres@localhost:5432/postgres --log --log-header
>>
>> client_id transaction_no time script_no time_epoch time_us
>
> Independent of your patch, but we imo ought to combine time_epoch and
> time_us in the log output. There's no point in forcing consumers to
> combine those fields, and it makes logging more expensive... And if we
> touch this, we should just switch to outputting nanoseconds instead of
> microseconds.
>
> It also is quite odd that we have "time" and "time_epoch", "time_us",
> where time is the elapsed time of an individual "transaction" and
> time_epoch + time_us together are a wall-clock timestamp. Without
> differentiating between those, the column headers seem not very useful,
> because one needs to look in the documentation to understand the fields.
>
> I don't think there's all that strong a need for backward compatibility
> in pgbench. We could just change the columns as I suggest above and then
> always emit the header in the "main" log file.

I updated the patch to always log the header, and only from the main
thread.

As for the time headers, I will work on renaming/combining those in a
separate commit as:

  time -> time_elapsed
  time_epoch + time_us -> time_completion_us

> Greetings,
>
> Andres Freund

Attachment: v2-0001-Adds-colum-headers-to-the-log-file-produced-by-pg.patch
Re: Use of backup_label not noted in log
On Mon, Nov 20, 2023 at 03:31:20PM -0400, David Steele wrote:
> On 11/20/23 15:03, Andres Freund wrote:
>> Besides the phrasing and the additional log message (I have no opinion
>> about whether it should be backpatched or not), I used %u for TimelineID
>> as appropriate, and added a comma before "on timeline".

The "starting/restarting/completed recovery" line sounds better here, so
I'm OK with your suggestions.

> I still wonder if we need "base backup" in the messages?  That sort of
> implies (at least to me) you used pg_basebackup but that may not be the
> case.

Or just s/base backup/backup/?

> Other than that, looks good for HEAD.  Whether we back patch or not is
> another question, of course.

I'd rather see more information in the back-branches more quickly, so
count me in the bucket of folks in favor of a backpatch.

--
Michael
Re: remaining sql/json patches
On Nov 16, 2023, at 17:48, Amit Langote wrote:
> On Thu, Nov 16, 2023 at 2:11 AM Andres Freund wrote:
>> On 2023-11-15 22:00:41 +0900, Amit Langote wrote:
>>>> This causes a nontrivial increase in the size of the parser (~5% in an
>>>> optimized build here), I wonder if we can do better.
>>>
>>> Hmm, sorry if I sound ignorant but what do you mean by the parser here?
>>
>> gram.o, in an optimized build.
>>
>>> I can see that the byte-size of gram.o increases by 1.66% after the
>>> above additions (1.72% with previous versions).
>>
>> I'm not sure anymore how I measured it, but if you just looked at the total
>> file size, that might not show the full gain, because of debug symbols
>> etc. You can use the size command to look at just the code and data size.
>
> $ size /tmp/gram.*
>    text    data     bss     dec     hex filename
>  661808       0       0  661808   a1930 /tmp/gram.o.unpatched
>  672800       0       0  672800   a4420 /tmp/gram.o.patched
>
> That's still a 1.66% increase in the code size:
>
> (672800 - 661808) / 661808 % = 1.66
>
> As for gram.c, the increase is a bit larger:
>
> $ ll /tmp/gram.*
> -rw-rw-r--. 1 amit amit 2605925 Nov 16 16:18 /tmp/gram.c.unpatched
> -rw-rw-r--. 1 amit amit 2709422 Nov 16 16:22 /tmp/gram.c.patched
>
> (2709422 - 2605925) / 2605925 % = 3.97
>
>>> I've also checked
>>> using log_parser_stats that there isn't much slowdown in the
>>> raw-parsing speed.
>>
>> What does "isn't much slowdown" mean in numbers?
>
> Sure, the benchmark I used measured the elapsed time (using
> log_parser_stats) of parsing a simple select statement (*) averaged
> over 1 repetitions of the same query performed with `psql -c`:
>
> Unpatched: 0.61 seconds
> Patched: 0.61 seconds
>
> Here's a look at the perf:
>
> Unpatched:
>   0.59%  [.] AllocSetAlloc
>   0.51%  [.] hash_search_with_hash_value
>   0.47%  [.] base_yyparse
>
> Patched:
>   0.63%  [.] AllocSetAlloc
>   0.52%  [.] hash_search_with_hash_value
>   0.44%  [.] base_yyparse
>
> (*) select 1, 2, 3 from foo where a = 1
>
> Is there a more relevant benchmark I could use?

Thought I'd share a few more numbers I collected to analyze the parser
size increase over releases.

* gram.o text bytes is from the output of `size gram.o`.
* compiled with -O3

version      gram.o text bytes  %change  gram.c bytes  %change
9.6                     534010        -       2108984        -
10                      582554     9.09       2258313     7.08
11                      584596     0.35       2313475     2.44
12                      590957     1.08       2341564     1.21
13                      590381    -0.09       2357327     0.67
14                      600707     1.74       2428841     3.03
15                      633180     5.40       2495364     2.73
16                      653464     3.20       2575269     3.20
17-sqljson              672800     2.95       2709422     3.97

So if we put SQL/JSON (including JSON_TABLE()) into 17, we end up with a
gram.o 2.95% larger than v16, which granted is a somewhat larger bump,
though also smaller than with some recent releases.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 03:58:55PM -0800, Andres Freund wrote:
> I was thinking we'd just set it in the pg_basebackup style path, and we'd
> error out if it's set and backup_label is not present.  But we'd still use
> backup_label without the pg_control flag set.
>
> So it'd just provide a cross-check that backup_label was not removed for
> pg_basebackup style backups, but wouldn't do anything for external
> backups. But imo the proposal to just use pg_control doesn't actually do
> anything for external backups either - which is why I think my proposal
> would achieve as much, for a much lower price.

I don't see why not.  It does not increase the number of steps when doing
a backup, and backupStartPoint alone would not be able to offer this much
protection.

--
Michael
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
Bruce Momjian writes:
> On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote:
>> Bruce Momjian writes:
>>> An alternate approach would be to remove pg_attribute.attndims so we
>>> don't even try to preserve dimensionality.
>
>> I could get behind that, perhaps.  It looks like we're not using the
>> field in any meaningful way, and we could simplify TupleDescInitEntry
>> and perhaps some other APIs.
>
> So should I work on that patch or do you want to try?  I think we should
> do something.

Let's wait for some other opinions, first ...

			regards, tom lane
simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
Patch attached.

The caller could do something similar, so this option is not necessary,
but it seems like it could be generally useful. It speeds things up for
the search_path cache (and is an alternative to another patch I have that
implements the same thing in the caller).

Thoughts?

Regards,
	Jeff Davis

From b878af835da794f3384f870db57b34e236b1efba Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Mon, 20 Nov 2023 17:42:07 -0800
Subject: [PATCH] Add SH_OPTIMIZE_REPEAT option to simplehash.h.

Callers which expect to look up the same value repeatedly can specify
SH_OPTIMIZE_REPEAT. That option causes simplehash to quickly check if the
last entry returned has the same key, and return it immediately if so
(before calculating the hash).
---
 src/include/lib/simplehash.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index b1a3d7f927..9a3dbfecfa 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -49,6 +49,7 @@
  *	  - SH_EQUAL(table, a, b) - compare two table keys
  *	  - SH_HASH_KEY(table, key) - generate hash for the key
  *	  - SH_STORE_HASH - if defined the hash is stored in the elements
+ *	  - SH_OPTIMIZE_REPEAT - optimize for repeated lookups of the same key
  *	  - SH_GET_HASH(tb, a) - return the field to store the hash in
  *
  * The element type is required to contain a "status" member that can store
@@ -163,6 +164,11 @@ typedef struct SH_TYPE
 	/* hash buckets */
 	SH_ELEMENT_TYPE *data;

+#ifdef SH_OPTIMIZE_REPEAT
+	/* last element found */
+	SH_ELEMENT_TYPE *previous;
+#endif
+
 #ifndef SH_RAW_ALLOCATOR
 	/* memory context to use for allocations */
 	MemoryContext ctx;
@@ -777,7 +783,16 @@ restart:
 SH_SCOPE SH_ELEMENT_TYPE *
 SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found)
 {
-	uint32		hash = SH_HASH_KEY(tb, key);
+	uint32		hash;
+
+#ifdef SH_OPTIMIZE_REPEAT
+	if (tb->previous != NULL &&
+		tb->previous->status == SH_STATUS_IN_USE &&
+		SH_EQUAL(tb, key, tb->previous->SH_KEY))
+		return tb->previous;
+#endif
+
+	hash = SH_HASH_KEY(tb, key);

 	return SH_INSERT_HASH_INTERNAL(tb, key, hash, found);
 }
@@ -815,7 +830,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 		Assert(entry->status == SH_STATUS_IN_USE);

 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
+		{
+#ifdef SH_OPTIMIZE_REPEAT
+			tb->previous = entry;
+#endif
 			return entry;
+		}

 		/*
 		 * TODO: we could stop search based on distance. If the current
@@ -834,7 +854,16 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 SH_SCOPE SH_ELEMENT_TYPE *
 SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key)
 {
-	uint32		hash = SH_HASH_KEY(tb, key);
+	uint32		hash;
+
+#ifdef SH_OPTIMIZE_REPEAT
+	if (tb->previous != NULL &&
+		tb->previous->status == SH_STATUS_IN_USE &&
+		SH_EQUAL(tb, key, tb->previous->SH_KEY))
+		return tb->previous;
+#endif
+
+	hash = SH_HASH_KEY(tb, key);

 	return SH_LOOKUP_HASH_INTERNAL(tb, key, hash);
 }
@@ -1152,6 +1181,7 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_DEFINE
 #undef SH_GET_HASH
 #undef SH_STORE_HASH
+#undef SH_OPTIMIZE_REPEAT
 #undef SH_USE_NONDEFAULT_ALLOCATOR
 #undef SH_EQUAL
--
2.34.1
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote:
> Bruce Momjian writes:
>> I would like to apply this patch to master because I think our current
>> deficiencies in this area are unacceptable.
>
> I do not think this is a particularly good idea, because it creates
> the impression in a couple of places that we track this data, when
> we do not really do so to any meaningful extent.

Okay, I thought we could get by without tracking the CREATE TABLE AS
case, but it is inconsistent.  My patch just makes it less inconsistent.

>> An alternate approach would be to remove pg_attribute.attndims so we
>> don't even try to preserve dimensionality.
>
> I could get behind that, perhaps.  It looks like we're not using the
> field in any meaningful way, and we could simplify TupleDescInitEntry
> and perhaps some other APIs.

So should I work on that patch or do you want to try?  I think we should
do something.

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

  Only you can decide what is important to you.
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
Bruce Momjian writes:
> I would like to apply this patch to master because I think our current
> deficiencies in this area are unacceptable.

I do not think this is a particularly good idea, because it creates
the impression in a couple of places that we track this data, when
we do not really do so to any meaningful extent.

> An alternate approach would be to remove pg_attribute.attndims so we
> don't even try to preserve dimensionality.

I could get behind that, perhaps.  It looks like we're not using the
field in any meaningful way, and we could simplify TupleDescInitEntry
and perhaps some other APIs.

			regards, tom lane
Re:Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?
Thanks. After reconsideration, I realized that what I really want is for
other connections to remain unaffected when a process crashes. This is
something that a connection pool cannot solve.

At 2023-11-14 09:41:03, "Thomas wen" wrote:
> Hi yuansong,
>    there is a connection pool patch
> (https://commitfest.postgresql.org/34/3043/), but it has been dormant
> for a few years. You can check this patch to get what you need.
>
> From: yuansong
> Sent: November 13, 2023, 17:13
> To: Laurenz Albe
> Cc: pgsql-hackers@lists.postgresql.org
> Subject: Re: How to solve the problem of one backend process crashing
> and causing other processes to restart?
>
> Enhancing the overall fault tolerance of the entire system for this
> feature is quite important. No one can avoid bugs, and I don't believe
> that this approach is a more advanced one. It might be worth considering
> adding it to the roadmap so that interested parties can conduct relevant
> research.
>
> The current major issue is that when one process crashes, resetting all
> connections has a significant impact on other connections. Is it
> possible to only disconnect the crashed connection and have the other
> connections simply roll back the current transaction without
> reconnecting? Perhaps this problem can be mitigated through the use of a
> connection pool.
>
> If we want to implement this feature, would it be sufficient to clean up
> or restore the shared memory and disk changes caused by the crashed
> backend? Besides clearing various known locks, what else needs to be
> changed? Does anyone have any insights? Please help me. Thank you.
>
> At 2023-11-13 13:53:29, "Laurenz Albe" wrote:
> > On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
> >> yuansong writes:
> >>> In PostgreSQL, when a backend process crashes, it can cause other
> >>> backend processes to also require a restart, primarily to ensure data
> >>> consistency. I understand that the correct approach is to analyze and
> >>> identify the cause of the crash and resolve it. However, it is also
> >>> important to be able to handle a backend process crash without
> >>> affecting the operation of other processes, thus minimizing the scope
> >>> of negative impact and improving availability. To achieve this goal,
> >>> could we mimic the Oracle process by introducing a "pmon" process
> >>> dedicated to rolling back crashed process transactions and performing
> >>> resource cleanup? I wonder if anyone has attempted such a strategy or
> >>> if there have been previous discussions on this topic.
> >>
> >> The reason we force a database-wide restart is that there's no way to
> >> be certain that the crashed process didn't corrupt anything in shared
> >> memory.  (Even with the forced restart, there's a window where bad
> >> data could reach disk before we kill off the other processes that
> >> might write it.  But at least it's a short window.)  "Corruption"
> >> here doesn't just involve bad data placed into disk buffers; more
> >> often it's things like unreleased locks, which would block other
> >> processes indefinitely.
> >>
> >> I seriously doubt that anything like what you're describing
> >> could be made reliable enough to be acceptable.  "Oracle does
> >> it like this" isn't a counter-argument: they have a much different
> >> (and non-extensible) architecture, and they also have an army of
> >> programmers to deal with minutiae like undoing resource acquisition.
> >> Even with that, you'd have to wonder about the number of bugs
> >> existing in such necessarily-poorly-tested code paths.
> >
> > Yes.
> > I think that PostgreSQL's approach is superior: rather than investing
> > in code to mitigate the impact of data corruption caused by a crash,
> > invest in quality code that doesn't crash in the first place.
> >
> > Euphemistically naming a crash "ORA-600 error" seems to be part of
> > their strategy.
> >
> > Yours,
> > Laurenz Albe
Re: meson documentation build open issues
Hi, On 2023-11-20 08:27:48 +0100, Peter Eisentraut wrote: > On 17.11.23 19:53, Andres Freund wrote: > > I pushed the first two commits (the selinux stuff) and worked a bit more on > > the subsequent ones. > > Patches 0001 through 0004 look good to me. Cool, I pushed them now. > Some possible small tweaks in 0004: > > +perl, '-ne', 'next if /^#/; print', > > If you're going for super-brief mode, you could also use "perl -p" and drop > the "print". I thought this didn't add much, so I didn't go there. > Put at least two spaces between the "columns" in targets-meson.txt: > > + doc/src/sgml/postgres-A4.pdf Build documentation in PDF format, with >^^ I did adopt this. One remaining question is whether we should adjust install-doc-{html,man} to be install-{html,man}, to match the docs targets. Greetings, Andres Freund
Re: pg_upgrade and logical replication
On Mon, Nov 20, 2023 at 09:49:41AM +0530, Amit Kapila wrote: > On Tue, Nov 14, 2023 at 7:21 AM vignesh C wrote: >> There are a couple of things happening here: a) In the first part we >> take care of setting the subscription relation to SYNCDONE and dropping >> the replication slot at the publisher node. Only if the drop of the replication slot >> is successful will the relation state be set to SYNCDONE; if the drop of the >> replication slot fails, the relation state will still be >> FINISHEDCOPY. So if there is a failure in dropping the replication slot we >> will not have an issue, as the tablesync worker will be in the >> FINISHEDCOPY state and this state is not allowed for upgrade. When the >> state is SYNCDONE the tablesync slot will not be present. b) In the >> second part we drop the replication origin. Even if there is a chance >> that the drop of the replication origin fails for some reason, there will be >> no problem, as we do not copy the tablesync replication origin to the >> new cluster while upgrading. Since the tablesync replication origin >> is not copied to the new cluster there will be no replication origin >> leaks. > > And, this will work because in the SYNCDONE state, while removing the > origin, we are okay with missing origins. It seems not copying the > origin for tablesync workers in this state (SYNCDONE) relies on the > fact that currently, we don't use those origins once the system > reaches the SYNCDONE state, but I am not sure it is a good idea to have > such a dependency, and an upgrade assuming such things doesn't seem > ideal to me. Hmm, yeah, you mean the replorigin_drop_by_name() calls in tablesync.c. I did not pay much attention to that in the code, but your point sounds sensible. (I have not been able to complete an analysis of the risks behind 's' to convince myself that it is entirely safe, but leaks are scary as hell if this gets automated across a large fleet of nodes..)
> Personally, I think allowing an upgrade in 'i' > (initialize) state or 'r' (ready) state seems safe because in those > states either slots/origins don't exist or are dropped. What do you > think? I share a similar impression about 's'. From a design point of view, making the conditions harder to reach in the first implementation makes the user experience stricter, but that's safer regarding leaks, and it is still possible to relax these choices in the future depending on the improvements we are able to figure out. -- Michael
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
On Fri, Sep 8, 2023 at 05:10:51PM -0400, Bruce Momjian wrote:
> I knew we only considered the array dimension sizes to be documentation
> _in_ the query, but I thought we at least properly displayed the number
> of dimensions specified at creation when we described the table in psql,
> but it seems we don't do that either.
>
> A big question is why we even bother to record the dimensions in
> pg_attribute if it is not accurate for LIKE and not displayed to the user
> in a meaningful way by psql.
>
> I think another big question is whether the structure we are using to
> supply the column information to BuildDescForRelation is optimal. The
> typmod that has to be found for CREATE TABLE uses:
>
> typenameTypeIdAndMod(NULL, entry->typeName, , );
>
> which calls typenameTypeIdAndMod() -> typenameType() -> LookupTypeName()
> -> LookupTypeNameExtended() -> typenameTypeMod(). This seems very
> complicated because the ColumnDef, at least in the LIKE case, already
> has the value. Is there a need to revisit how we handle types in such
> cases?

(Bug report moved to hackers, previous bug reporters added as CCs.)

I looked at this some more and found more fundamental problems. We have pg_attribute.attndims, which does record the array dimensionality:

	CREATE TABLE test (data integer, data_array integer[5][5]);

	SELECT attndims FROM pg_attribute
	WHERE attrelid = 'test'::regclass AND attname = 'data_array';
	 attndims
	----------
	        2

The first new problem I found is that we don't dump the dimensionality:

	$ pg_dump test
	...
	CREATE TABLE public.test (
	    data integer,
-->	    data_array integer[]
	);

and psql doesn't display the dimensionality:

	\d test
	               Table "public.test"
	   Column   |   Type    | Collation | Nullable | Default
	------------+-----------+-----------+----------+---------
	 data       | integer   |           |          |
-->	 data_array | integer[] |           |          |

A report from 2015 says that CREATE TABLE ... LIKE and CREATE TABLE ... AS don't propagate the dimensionality:

	https://www.postgresql.org/message-id/flat/20150707072942.1186.98151%40wrigleys.postgresql.org

and this thread from 2018 supplied a fix:

	https://www.postgresql.org/message-id/flat/7862e882-8b9a-0c8e-4a38-40ad374d3634%40brandwatch.com

though in my testing it only fixes LIKE, not CREATE TABLE ... AS. This report from April of this year also complains about LIKE:

	https://www.postgresql.org/message-id/flat/ZD%2B14YZ4IUue8Rhi%40gendo.asyd.net

Here is the output from master for LIKE:

	CREATE TABLE test2 (LIKE test);

	SELECT attndims FROM pg_attribute
	WHERE attrelid = 'test2'::regclass AND attname = 'data_array';
	 attndims
	----------
-->	        0

and this is the output for CREATE TABLE ... AS:

	CREATE TABLE test3 AS SELECT * FROM test;

	SELECT attndims FROM pg_attribute
	WHERE attrelid = 'test3'::regclass AND attname = 'data_array';
	 attndims
	----------
-->	        0

The attached patch fixes pg_dump:

	$ pg_dump test
	...
	CREATE TABLE public.test2 (
	    data integer,
-->	    data_array integer[][]
	);

It uses repeat() at the SQL level rather than modifying format_type() at the SQL or C level. It seems format_type() is mostly used to get the type name, e.g. int4[], rather than the column definition, so I added the brackets at the call site. I used a similar fix for psql output:

	\d test
	                Table "public.test"
	   Column   |    Type     | Collation | Nullable | Default
	------------+-------------+-----------+----------+---------
	 data       | integer     |           |          |
-->	 data_array | integer[][] |           |          |

The 2018 patch from Alexey Bashtanov fixes the LIKE case:

	CREATE TABLE test2 (LIKE test);

	\d test2
	                Table "public.test2"
	   Column   |    Type     | Collation | Nullable | Default
	------------+-------------+-----------+----------+---------
	 data       | integer     |           |          |
-->	 data_array | integer[][] |           |          |

It does not fix CREATE TABLE ... AS because it looks like fixing that would require adding an ndims column to Var for WITH NO DATA and adding ndims to TupleDesc for WITH DATA. I am not sure if that overhead is warranted to fix this item. I have added C comments where they should be added.
I would like to apply this patch to master because I think our current deficiencies in this area are unacceptable. An alternate approach would be to remove pg_attribute.attndims so we don't even try to preserve dimensionality. -- Bruce Momjian
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Mon, Nov 20, 2023 at 04:53:45PM +0530, Ashutosh Bapat wrote: > On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier wrote: >> I have added some documentation to explain that, as well. I am not >> wedded to the name proposed in the patch, so if you feel there is >> better, feel free to propose ideas. > > Actually with Attach and Detach terminology, INJECTION_POINT becomes > the place where we "declare" the injection point. So the documentation > needs to first explain INJECTION_POINT and then explain the other > operations. Sure. >> This facility is hidden behind a specific configure/Meson switch, >> making it a no-op by default: >> --enable-injection-points >> -Dinjection_points={ true | false } > > That's useful, but we will also see demand to enable those in > production (under controlled circumstances). But let the functionality > mature under a separate flag and developer builds before used in > production. Hmm. Okay. I'd still keep that under a compile switch for now anyway to keep a cleaner separation and avoid issues where it would be possible to inject code in a production build. Note that I was planning to switch one of my buildfarm animals to use it should this stuff make it into the tree. And people would be free to use it if they want. If in production, that would be a risk, IMO. > +/* > + * Local cache of injection callbacks already loaded, stored in > + * TopMemoryContext. > + */ > +typedef struct InjectionPointArrayEntry > +{ > + char name[INJ_NAME_MAXLEN]; > + InjectionPointCallback callback; > +} InjectionPointArrayEntry; > + > +static InjectionPointArrayEntry *InjectionPointCacheArray = NULL; > > Initial size of this array is small, but given the size of code in a > given path to be tested, I fear that this may not be sufficient. I > feel it's better to use hash table whose APIs are already available. I've never seen in recent years a need for a given test to use more than 4~5 points. 
But I guess that you've seen more than that wanted in a prod environment with a fork of Postgres? I'm OK to switch that to use a hash table in the initial implementation; even for a low number of entries with points that are not in hot code paths, that should be OK. At least for my cases related to testing that's OK. Am I right to assume that you mean a HTAB in TopMemoryContext? > I find it pretty useless to expose that as a SQL API. Also adding it > in tests would make it usable as an example. In this patchset > INJECTION_POINT should be the only way to trigger an injection point. That's useful to test the cache logic while providing simple coverage for the core API, and that's cheap. So I'd rather keep this test module around with these tests. > That also brings another question. The INJECTION_POINT needs to be added in > the core code but can only be used through an extension. I think there should be > some rudimentary albeit raw test in core for this. The extension does some good > things like providing built-in actions when the injection is triggered. So, we > should keep those too. Yeah, I'd like to agree with that, but everything I've seen in recent years is that every setup finishes by having different assumptions about what they want to do, so my intention is to trim down the core of the patch to a bare acceptable minimum and then work on top of that. > I am glad that it covers the frequently needed injections: error, notice and > wait. If someone adds multiple injection points for wait and all of them are > triggered (in different backends), test_injection_points_wake() would > wake them all. When debugging cascaded functionality it's not easy to > follow the sequence add, trigger, wake for multiple injections. We should > associate a condition variable with the required injection points. Attach > should create a condition variable in the shared memory for that injection > point and detach should remove it.
I see this being mentioned in the commit > message, but I think it's something we need in the first version of "wait" to > be useful. More to the point, I actually disagree with that, because it could be possible as well that the same condition variable is used across multiple points. At the end, IMHO, the central hash table should hold zero meta-data associated to an injection point like a counter, an elog, a condition variable, a sleep time, etc. or any combination of all these ones, and should just know about how to load a callback with a library path and a routine name. I understand that this is leaving the responsibility of what a callback should do down to the individual developer implementing a callback into their own extension, where they should be free to have conditions of their own. Something that I agree would be very useful for the in-core APIs, depending on the cases, is to be able to push some information to the callback at runtime to let a callback decide what to do depending on a running state, including a condition registered when a point was attached.
Re: Faster "SET search_path"
On Thu, 2023-11-16 at 16:46 -0800, Jeff Davis wrote: > While I considered OOM during hash key initialization, I missed some > other potential out-of-memory hazards. Attached a fixup patch 0003, > which re-introduces one list copy but it simplifies things > substantially in addition to being safer around OOM conditions. Committed 0003 fixup. > > > 0004: Use the same cache to optimize check_search_path(). Committed 0004. > > > 0005: Optimize cache for repeated lookups of the same value. Will commit 0005 soon. I also attached a trivial 0006 patch that uses SH_STORE_HASH. I wasn't able to show much benefit, though, even when there's a bucket collision. Perhaps there just aren't enough elements to matter -- I suppose there would be a benefit if there are lots of unique search_path strings, but that doesn't seem very plausible to me. If someone thinks it's worth committing, then I will, but I don't see any real upside or downside. Regards, Jeff Davis From 5f41c0ecc602dd183b7f6e2f23cd28c9338b3c5b Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Sun, 19 Nov 2023 14:47:04 -0800 Subject: [PATCH v10 2/2] Use SH_STORE_HASH for search_path cache. Does not change performance in expected cases, but makes performance more resilient in case of bucket collisions. 
Discussion: https://postgr.es/m/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com Discussion: https://postgr.es/m/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.ca...@j-davis.com --- src/backend/catalog/namespace.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index 37a69e9023..0b6e0d711c 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -171,11 +171,10 @@ typedef struct SearchPathCacheEntry List *oidlist; /* namespace OIDs that pass ACL checks */ List *finalPath; /* cached final computed search path */ Oid firstNS; /* first explicitly-listed namespace */ + uint32 hash; /* needed for simplehash */ bool temp_missing; bool forceRecompute; /* force recompute of finalPath */ - - /* needed for simplehash */ - char status; + char status; /* needed for simplehash */ } SearchPathCacheEntry; /* @@ -270,6 +269,8 @@ spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b) #define SH_EQUAL(tb, a, b) spcachekey_equal(a, b) #define SH_SCOPE static inline #define SH_DECLARE +#define SH_GET_HASH(tb, a) a->hash +#define SH_STORE_HASH #define SH_DEFINE #include "lib/simplehash.h" -- 2.34.1 From 46e09b225217bef79a57cc4f8450ed19be8f21ba Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Thu, 19 Oct 2023 15:48:16 -0700 Subject: [PATCH v10 1/2] Optimize SearchPathCache by saving the last entry. Repeated lookups are common, so it's worth it to check the last entry before doing another hash lookup. 
Discussion: https://postgr.es/m/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com --- src/backend/catalog/namespace.c | 88 + 1 file changed, 57 insertions(+), 31 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index 5027efc91d..37a69e9023 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -241,7 +241,8 @@ static bool MatchNamedCall(HeapTuple proctup, int nargs, List *argnames, * * The search path cache is based on a wrapper around a simplehash hash table * (nsphash, defined below). The spcache wrapper deals with OOM while trying - * to initialize a key, and also offers a more convenient API. + * to initialize a key, optimizes repeated lookups of the same key, and also + * offers a more convenient API. */ static inline uint32 @@ -281,6 +282,7 @@ spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b) #define SPCACHE_RESET_THRESHOLD 256 static nsphash_hash * SearchPathCache = NULL; +static SearchPathCacheEntry * LastSearchPathCacheEntry = NULL; /* * Create or reset search_path cache as necessary. 
@@ -295,6 +297,7 @@ spcache_init(void) return; MemoryContextReset(SearchPathCacheContext); + LastSearchPathCacheEntry = NULL; /* arbitrary initial starting size of 16 elements */ SearchPathCache = nsphash_create(SearchPathCacheContext, 16, NULL); searchPathCacheValid = true; @@ -307,12 +310,25 @@ spcache_init(void) static SearchPathCacheEntry * spcache_lookup(const char *searchPath, Oid roleid) { - SearchPathCacheKey cachekey = { - .searchPath = searchPath, - .roleid = roleid - }; + if (LastSearchPathCacheEntry && + LastSearchPathCacheEntry->key.roleid == roleid && + strcmp(LastSearchPathCacheEntry->key.searchPath, searchPath) == 0) + { + return LastSearchPathCacheEntry; + } + else + { + SearchPathCacheEntry *entry; + SearchPathCacheKey cachekey = { + .searchPath = searchPath, + .roleid = roleid + }; + + entry = nsphash_lookup(SearchPathCache, cachekey); - return nsphash_lookup(SearchPathCache, cachekey); +
Re: POC, WIP: OR-clause support for indexes
On 20.11.2023 11:52, Andrei Lepikhov wrote: Looking into the patch, I found some trivial improvements (see attachment). Also, it is not obvious that using a string representation of the clause as a hash table key is needed here. Also, by making a copy of the node in the get_key_nconst_node(), you replace the location field, but in the case of complex expression, you don't do the same with other nodes. I propose to generate expression hash instead + prove the equality of two expressions by calling equal(). I was thinking about your last email and a possible error where the location field may not be cleared in complex expressions. Unfortunately, I didn't come up with any examples either, but I think I was able to handle this with a special function that removes location-related patterns. The alternative to this is to bypass this expression, but I think it will be more invasive. In addition, I have added changes related to the hash table: now the key is of type int. All changes are displayed in the attached v9-0001-Replace-OR-clause-to_ANY.diff.txt file. I haven't measured it yet. But what do you think about these changes? -- Regards, Alena Rybakina Postgres Professional From 8e579b059ffd415fc477e0e8930e718084e58683 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Tue, 21 Nov 2023 03:22:13 +0300 Subject: [PATCH] [PATCH] Replace OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the stage of the optimiser when we are still working with a tree expression. Firstly, we do not try to make a transformation for "non-or" expressions or inequalities and the creation of a relation with "or" expressions occurs according to the same scenario. Secondly, we do not make transformations if there are less than set or_transform_limit. Thirdly, it is worth considering that we consider "or" expressions only at the current level. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas --- src/backend/parser/parse_expr.c | 280 +- src/backend/utils/misc/guc_tables.c | 10 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 115 +++ src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/join.out| 50 src/test/regress/expected/partition_prune.out | 156 ++ src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 17 ++ src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 1 + 14 files changed, 703 insertions(+), 3 deletions(-) diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 64c582c344c..290422005a3 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -18,6 +18,7 @@ #include "catalog/pg_aggregate.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" +#include "common/hashfn.h" #include "commands/dbcommands.h" #include "miscadmin.h" #include "nodes/makefuncs.h" @@ -43,6 +44,7 @@ /* GUC parameters */ bool Transform_null_equals = false; +bool enable_or_transformation = false; static Node *transformExprRecurse(ParseState *pstate, Node *expr); @@ -98,7 +100,283 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname, Node *ltree, Node *rtree, int location); static Node *make_nulltest_from_distinct(ParseState *pstate, A_Expr *distincta, Node *arg); +typedef struct OrClauseGroupEntry +{ + int32 hash_leftvar_key; + + Node *node; + List *consts; + Oid scalar_type; + Oid opno; + Node *expr; +} OrClauseGroupEntry; + +/* + * TODO: consider different algorithm to manage complexity + * in the case of many different clauses, + * like Rabin-Karp or Boyer–Moore algorithms.
+ */ +static char * +clear_patterns(const char *str, const char *start_pattern) +{ + int i = 0; + int j = 0; + int start_pattern_len = strlen(start_pattern); + char *res = palloc0(sizeof(*res) * (strlen(str) + 1)); + + for (i = 0; str[i];) + { + if (i >= start_pattern_len && strncmp(&str[i - start_pattern_len], + start_pattern, + start_pattern_len) == 0) + { + while (str[i] && !(str[i] == '{' || str[i] == '}')) +i++; + } + if (str[i]) + res[j++] = str[i++]; + } + + return res; +} + +static int32 +get_key_nconst_node(Node *nconst_node) +{ + char *str = nodeToString(nconst_node); + + str = clear_patterns(str, " :location"); + + if (str == NULL) + return 0; + + return DatumGetInt32(hash_any((unsigned char *)str, strlen(str))); +} + +static Node * +transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig) +{ + List *or_list = NIL; +
Re: Show WAL write and fsync stats in pg_stat_io
On Mon, Nov 20, 2023 at 05:43:17PM +0300, Nazir Bilal Yavuz wrote: > Yes, the timings for the writes and the syncs should work. Another > question I have in mind is the pg_stat_reset_shared() function. When > we call it with 'io' it will reset pg_stat_wal's timings and when we > call it with 'wal' it won't reset them, right? pg_stat_reset_shared() with a target is IMO a very edge case, so I'm OK with the approach of resetting timings in pg_stat_wal even if 'io' was implied because pg_stat_wal would feed partially from pg_stat_io. I'd take that as a side-cost in favor of compatibility while making the stats gathering cheaper overall. I'm OK as well if people counter-argue on this point, though that would mean to keep entirely separate views with duplicated fields that serve the same purpose, impacting all deployments because it would make the stats gathering heavier for all. -- Michael
Re: Hide exposed impl detail of wchar.c
Hi, On 2023-11-16 17:11:03 -0800, Jubilee Young wrote: > We don't directly `#include` C into Rust, but use libclang to preprocess and > compile a wrapping C header into a list of symbols Rust will look for at link > time. Our failure is in libclang and how we steer it: > - The Clang-C API (libclang.so) cannot determine where system headers are. > - A clang executable can determine where system headers are, but our bindgen > may be asked to use a libclang.so without a matching clang executable! > - This is partly because system packagers do not agree on what clang parts > must be installed together, nor even on the clang directory's layout. > - Thus, it is currently impossible to, given a libclang.so, determine with > 100% accuracy where version-appropriate system headers are and include them, > nor does it do so implicitly. I remember battling this in the past, independent of rust :( What I don't quite get is why SIMD headers are particularly more problematic than others - there's other headers that are compiler specific? Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: > On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: > > Given that, I wonder if what we should do is to just add a new field to > > pg_control that says "error out if backup_label does not exist", that we set > > when creating a streaming base backup > > That would mean that one still needs to take an extra step to update a > control file with this byte set, which is something you had a concern > with in terms of compatibility when it comes to external backup > solutions because more steps are necessary to take a backup, no? I was thinking we'd just set it in the pg_basebackup style path, and we'd error out if it's set and backup_label is present. But we'd still use backup_label without the pg_control flag set. So it'd just provide a cross-check that backup_label was not removed for pg_basebackup style backup, but wouldn't do anything for external backups. But imo the proposal to just us pg_control doesn't actually do anything for external backups either - which is why I think my proposal would achieve as much, for a much lower price. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: > Given that, I wonder if what we should do is to just add a new field to > pg_control that says "error out if backup_label does not exist", that we set > when creating a streaming base backup That would mean that one still needs to take an extra step to update a control file with this byte set, which is something you had a concern with in terms of compatibility when it comes to external backup solutions because more steps are necessary to take a backup, no? I don't quite see why it is different than what's proposed on this thread, except that you don't need to write one file to the data folder to store the backup label fields, but two, meaning that there's a risk for more mistakes because a clean backup process would require more steps. With the current position of the fields in ControlFileData, there are three free bytes after backupEndRequired, so it is possible to add that for free. Now, would you actually need an extra field knowing that backupStartPoint is around? -- Michael
Re: Why is hot_standby_feedback off by default?
On 2023-11-20 16:34:47 +0700, John Naylor wrote: > Sounds like a TODO? WFM. I don't personally use or update TODO, as I have my doubts about its usefulness or state of maintenance. But please feel free to add this as a TODO from my end...
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-20 14:18:15 -0700, David G. Johnston wrote: > On Mon, Nov 20, 2023 at 1:37 PM Andres Freund wrote: > > > > > Given that, I wonder if what we should do is to just add a new field to > > pg_control that says "error out if backup_label does not exist", that we > > set > > when creating a streaming base backup > > > > > I thought this was DOA since we don't want to ever leave the cluster in a > state where a crash requires intervention to restart. I was trying to suggest that we'd set the field in-memory, when streaming out a pg_basebackup style backup (by just replacing pg_control with an otherwise identical file that has the flag set). So it'd not have any effect on the primary. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 11:11:13AM -0500, Robert Haas wrote: > I think we need more votes to make a change this big. I have a > concern, which I think I've expressed before, that we keep whacking > around the backup APIs, and that has a cost which is potentially > larger than the benefits. The last time we changed the API, we changed > pg_stop_backup to pg_backup_stop, but this doesn't do that, and I > wonder if that's OK. Even if it is, do we really want to change this > API around again after such a short time? Agreed. > That said, I don't have an intrinsic problem with moving this > information from the backup_label to the backup_manifest file since it > is purely informational. I do think there should perhaps be some > additions to the test cases, though. Yep, cool. Even if we decide to not go with what's discussed in this patch, I think that's useful for some users at the end to get more redundancy, as well. And that's in a format easier to parse. > I am concerned about the interaction of this proposal with incremental > backup. When you take an incremental backup, you get something that > looks a lot like a usable data directory but isn't. To prevent that > from causing avoidable disasters, the present version of the patch > adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the > backup_label. pg_combinebackup knows to look for those fields, and the > server knows that if they are present it should refuse to start. With > this change, though, I suppose those fields would end up in > pg_control. But that does not feel entirely great, because we have a > goal of keeping the amount of real data in pg_control below 512 bytes, > the traditional sector size, and this adds another 12 bytes of stuff > to that file that currently doesn't need to be there. I feel like > that's kind of a problem. I don't recall one time where the addition of new fields to the control file was easy to discuss because of its 512B hard limit. 
Anyway, putting the addition aside for a second (and I've not looked at the incremental backup patch), does the removal of the backup_label make the combine logic more complicated, or is it just moving a chunk of code to do a control file lookup instead of a backup_file parse? Making the information less readable is definitely an issue for me. A different alternative that I've mentioned upthread is to keep an equivalent of the backup_label and rename it to something like backup.debug or similar, with a name good enough to tell people that we don't care about it being removed. -- Michael
Re: PANIC serves too many masters
Hi, On 2023-11-20 17:55:32 -0500, Tom Lane wrote: > Jeff Davis writes: > > Is the error level the right way to express what we want to happen? It > > seems like what we really want is to decide on the behavior, i.e. > > restart or not, and generate core or not. That could be done a > > different way, like: > > > ereport(PANIC, > > (errmsg("could not locate a valid checkpoint record"), > >errabort(false),errrestart(false))); > > Yeah, I was wondering about that too. It feels to me that > PANIC_EXIT is an error level (even more severe than PANIC). > But maybe "no core dump please" should be conveyed separately, > since it's just a minor adjustment that doesn't fundamentally > change what happens. I guess I was thinking of an error level because that'd be easier to search for in logs. It seems reasonable to want to specificially search for errors that cause core dumps, since IMO they should all be "should never happen" kind of paths. > It's plausible that you'd want a core, > or not want one, for different cases that all seem to require > PANIC_EXIT. I can't immediately think of a case where you'd want PANIC_EXIT but also want a core dump? In my mental model to use PANIC_EXIT we'd need to have a decent understanding that the situation isn't going to change after crash-restart - in which case a core dump presumably isn't interesting? > (Need a better name than PANIC_EXIT. OMIGOD?) CRITICAL? I agree with the point made upthread that we'd want leave PANIC around, it's not realistic to annotate everything, and then there's obviously also extensions (although I hope there aren't many PANICs in extensions). If that weren't the case, something like this could make sense: PANIC: crash-restart CRITICAL: crash-shutdown BUG: crash-restart, abort() Greetings, Andres Freund
Re: PANIC serves too many masters
Jeff Davis writes: > Is the error level the right way to express what we want to happen? It > seems like what we really want is to decide on the behavior, i.e. > restart or not, and generate core or not. That could be done a > different way, like: > ereport(PANIC, > (errmsg("could not locate a valid checkpoint record"), >errabort(false),errrestart(false))); Yeah, I was wondering about that too. It feels to me that PANIC_EXIT is an error level (even more severe than PANIC). But maybe "no core dump please" should be conveyed separately, since it's just a minor adjustment that doesn't fundamentally change what happens. It's plausible that you'd want a core, or not want one, for different cases that all seem to require PANIC_EXIT. (Need a better name than PANIC_EXIT. OMIGOD?) regards, tom lane
Re: Hide exposed impl detail of wchar.c
On Mon, Nov 20, 2023 at 10:50:36AM -0800, Jubilee Young wrote: > In that case, I took a look across the codebase and saw a > utils/ascii.h that doesn't > seem to have gotten much love, but I suppose one could argue that it's > intended > to be a backend-only header file? That might work. It's not #included in very many files, so adding port/simd.h shouldn't be too bad. And ascii.h is also pretty inexpensive, so including it in wchar.c seems permissible, too. I'm not certain this doesn't cause problems with libpgcommon, but I don't see why it would, either. > So it should probably end up living somewhere near the UTF-8 support, and > the easiest way to make it not go into something pgrx currently > includes would be > to make it a new header file, though there's a fair amount of API we > don't touch. Does pgrx use ascii.h at all? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/src/common/wchar.c b/src/common/wchar.c index fb9d9f5c85..fbac11deb4 100644 --- a/src/common/wchar.c +++ b/src/common/wchar.c @@ -13,6 +13,7 @@ #include "c.h" #include "mb/pg_wchar.h" +#include "utils/ascii.h" /* diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h index 29cd5732f1..80676d9e02 100644 --- a/src/include/mb/pg_wchar.h +++ b/src/include/mb/pg_wchar.h @@ -22,8 +22,6 @@ #ifndef PG_WCHAR_H #define PG_WCHAR_H -#include "port/simd.h" - /* * The pg_wchar type */ @@ -722,71 +720,4 @@ extern int mic2latin_with_table(const unsigned char *mic, unsigned char *p, extern WCHAR *pgwin32_message_to_UTF16(const char *str, int len, int *utf16len); #endif - -/* - * Verify a chunk of bytes for valid ASCII. - * - * Returns false if the input contains any zero bytes or bytes with the - * high-bit set. Input len must be a multiple of the chunk size (8 or 16). 
- */ -static inline bool -is_valid_ascii(const unsigned char *s, int len) -{ - const unsigned char *const s_end = s + len; - Vector8 chunk; - Vector8 highbit_cum = vector8_broadcast(0); -#ifdef USE_NO_SIMD - Vector8 zero_cum = vector8_broadcast(0x80); -#endif - - Assert(len % sizeof(chunk) == 0); - - while (s < s_end) - { - vector8_load(&chunk, s); - - /* Capture any zero bytes in this chunk. */ -#ifdef USE_NO_SIMD - - /* - * First, add 0x7f to each byte. This sets the high bit in each byte, - * unless it was a zero. If any resulting high bits are zero, the - * corresponding high bits in the zero accumulator will be cleared. - * - * If none of the bytes in the chunk had the high bit set, the max - * value each byte can have after the addition is 0x7f + 0x7f = 0xfe, - * and we don't need to worry about carrying over to the next byte. If - * any input bytes did have the high bit set, it doesn't matter - * because we check for those separately. - */ - zero_cum &= (chunk + vector8_broadcast(0x7F)); -#else - - /* - * Set all bits in each lane of the highbit accumulator where input - * bytes are zero. - */ - highbit_cum = vector8_or(highbit_cum, - vector8_eq(chunk, vector8_broadcast(0))); -#endif - - /* Capture all set bits in this chunk. */ - highbit_cum = vector8_or(highbit_cum, chunk); - - s += sizeof(chunk); - } - - /* Check if any high bits in the high bit accumulator got set. */ - if (vector8_is_highbit_set(highbit_cum)) - return false; - -#ifdef USE_NO_SIMD - /* Check if any high bits in the zero accumulator got cleared. 
*/ - if (zero_cum != vector8_broadcast(0x80)) - return false; -#endif - - return true; -} - #endif /* PG_WCHAR_H */ diff --git a/src/include/utils/ascii.h b/src/include/utils/ascii.h index 630acd9bfd..7df024dad3 100644 --- a/src/include/utils/ascii.h +++ b/src/include/utils/ascii.h @@ -11,6 +11,74 @@ #ifndef _ASCII_H_ #define _ASCII_H_ +#include "port/simd.h" + extern void ascii_safe_strlcpy(char *dest, const char *src, size_t destsiz); +/* + * Verify a chunk of bytes for valid ASCII. + * + * Returns false if the input contains any zero bytes or bytes with the + * high-bit set. Input len must be a multiple of the chunk size (8 or 16). + */ +static inline bool +is_valid_ascii(const unsigned char *s, int len) +{ + const unsigned char *const s_end = s + len; + Vector8 chunk; + Vector8 highbit_cum = vector8_broadcast(0); +#ifdef USE_NO_SIMD + Vector8 zero_cum = vector8_broadcast(0x80); +#endif + + Assert(len % sizeof(chunk) == 0); + + while (s < s_end) + { + vector8_load(&chunk, s); + + /* Capture any zero bytes in this chunk. */ +#ifdef USE_NO_SIMD + + /* + * First, add 0x7f to each byte. This sets the high bit in each byte, + * unless it was a zero. If any resulting high bits are zero, the + * corresponding high bits in the zero accumulator will be cleared. + * + * If none of the bytes in the chunk had the high bit set, the max + * value each byte can have after the addition is 0x7f + 0x7f = 0xfe, + * and we don't need to worry about carrying over to the next byte. If + * any input bytes did have the high bit set, it doesn't
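The USE_NO_SIMD comments in the patch describe a scalar bit trick: adding 0x7f to each byte sets its high bit unless the byte was zero. A standalone sketch of that non-SIMD path, using a plain uint64 where the patch uses the Vector8 type (the function name here is illustrative, not the PostgreSQL API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Scalar version of the zero/high-bit detection described above.
 * len must be a multiple of the 8-byte chunk size, as in the original. */
static bool
ascii_chunk_check(const unsigned char *s, int len)
{
	uint64_t	highbit_cum = 0;
	uint64_t	zero_cum = UINT64_C(0x8080808080808080);

	assert(len % 8 == 0);

	for (int i = 0; i < len; i += 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, s + i, sizeof(chunk));

		/* 0x00 + 0x7f = 0x7f: a zero byte leaves its high bit clear,
		 * which clears the corresponding bit in the accumulator. */
		zero_cum &= chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);

		/* Accumulate set bits; any non-ASCII byte sets a high bit. */
		highbit_cum |= chunk;
	}

	if (highbit_cum & UINT64_C(0x8080808080808080))
		return false;			/* some byte had the high bit set */
	if (zero_cum != UINT64_C(0x8080808080808080))
		return false;			/* some byte was zero */
	return true;
}
```

Note that carries from the 0x7f addition can only cross byte boundaries when an input byte already has its high bit set, and those inputs are rejected by the highbit check anyway — exactly the argument made in the patch's comment.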
Re: Partial aggregates pushdown
On Mon, Nov 20, 2023 at 03:51:33PM -0500, Robert Haas wrote: > On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp > wrote: > > In postgres_fdw.sql, I have corrected the output format for floating point > > numbers > > by extra_float_digits. > > Looking at this, I find that it's not at all clear to me how the > partial aggregate function is defined. Let's look at what we have for > documentation: > > + > + Paraemter AGGPARTIALFUNC optionally defines a > + partial aggregate function used for partial aggregate pushdown; see > +for details. > + > > + Partial aggregate function (zero if none). > + See for the definition > + of partial aggregate function. > > + Partial aggregate pushdown is an optimization for queries that contains > + aggregate expressions for a partitioned table across one or more remote > + servers. If multiple conditions are met, partial aggregate function > > + When partial aggregate pushdown is used for aggregate expressions, > + remote queries replace aggregate function calls with partial > + aggregate function calls. If the data type of the state value is not > > But there's no definition of what the behavior of the function is > anywhere that I can see, not even in id="partial-aggregate-pushdown">. Everywhere it only describes how the > partial aggregate function is used, not what it is supposed to do. Yes, I had to figure that out myself, and I was wondering how much detail to have in our docs vs README files vs. C comments. I think we should put more details somewhere. 
> Looking at the changes in pg_aggregate.dat, it seems like the partial > aggregate function is a second aggregate defined in a way that mostly > matches the original, except that (1) if the original final function > would have returned a data type other than internal, then the final > function is removed; and (2) if the original final function would have > returned a value of internal type, then the final function is the > serialization function of the original aggregate. I think that's a > reasonable definition, but the documentation and code comments need to > be a lot clearer. Agreed. I wasn't sure enough about this to add it when I was reviewing the patch. > I do have a concern about this, though. It adds a lot of bloat. It > adds a whole lot of additional entries to pg_aggregate, and every new > aggregate we add in the future will require a bonus entry for this, > and it needs a bunch of new pg_proc entries as well. One idea that > I've had in the past is to instead introduce syntax that just does > this, without requiring a separate aggregate definition in each case. > For example, maybe instead of changing string_agg(whatever) to > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or > something. Then all aggregates could be treated in a generic way. I'm > not completely sure that's better, but I think it's worth considering. So use an SQL keyword to indicate a pushdown call? We could then automate the behavior rather than requiring special catalog functions? > I think that the control mechanism needs some thought. Right now, > there are two possible behaviors: either we assume that the local and > remote sides are the same unconditionally, or we assume that they're > the same if the remote side is a new enough version. I do like having > those behaviors available, but I wonder if we need to do something > better or different. 
What if somebody wants to push down a > non-built-in aggregate, for example? I realize that we don't have It does allow specification of extensions that can be pushed down. > great solutions to the problem of knowing which functions are > push-downable in general, and I don't know that partial aggregation > needs to be any better than anything else, but it's probably worth > comparing and contrasting the approach we take here with the > approaches we've taken in other, similar cases. From that point of > view, I think check_partial_aggregate_support is a novelty: we don't > do those kinds of checks in other cases, AFAIK. But on the other hand, > there is the 'extensions' argument to postgres_fdw. Right. I am not sure how to improve what the patch does. > I don't think the patch does a good job explaining why HAVING, > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING > shouldn't really be a problem, because HAVING is basically a WHERE > clause that occurs after aggregation is complete, and whether or not > the aggregation is safe shouldn't depend on what we're going to do > with the value afterward. The HAVING clause can't necessarily be > pushed to the remote side, but I don't see how or why it could make > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a > little trickier: if we pushed down DISTINCT, we'd still have to > re-DISTINCT-ify when combining locally, and if we pushed down ORDER
Re: PANIC serves too many masters
On Mon, 2023-11-20 at 17:12 -0500, Tom Lane wrote: > I'd be inclined to keep PANIC with its current meaning, and > incrementally change call sites where we decide that's not the > best behavior. I think those will be a minority, maybe a small > minority. (PANIC_EXIT had darn well better be a small minority.) Is the error level the right way to express what we want to happen? It seems like what we really want is to decide on the behavior, i.e. restart or not, and generate core or not. That could be done a different way, like: ereport(PANIC, (errmsg("could not locate a valid checkpoint record"), errabort(false),errrestart(false))); Regards, Jeff Davis
Re: PSQL error: total cell count of XXX exceeded
Alvaro Herrera writes: > I think we're bound to hit this limit at some point in the future, and > it seems easy enough to solve. I propose the attached, which is pretty > much what Hongxu last submitted, with some minor changes. This bit needs more work: - content->cells = pg_malloc0((ncolumns * nrows + 1) * sizeof(*content->cells)); + total_cells = (int64) ncolumns * nrows; + content->cells = pg_malloc0((total_cells + 1) * sizeof(*content->cells)); You've made the computation of total_cells reliable, but there's nothing stopping the subsequent computation of the malloc argument from overflowing (especially on 32-bit machines). I think we need an explicit test along the lines of if (total_cells >= SIZE_MAX / sizeof(*content->cells)) throw error; (">=" allows not needing to add +1.) Also, maybe total_cells should be uint64? We don't want negative values to pass this test. Alternatively, add a separate check that total_cells >= 0. It should be sufficient to be paranoid about this in printTableInit, since after that we know the product of ncolumns * nrows isn't too big. The rest of this passes an eyeball check. regards, tom lane
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On Mon, 2023-11-20 at 15:52 -0500, Robert Haas wrote: > I agree. Not to burden you, but do you know what the overhead is now, > and do you have any plans to further reduce it? I don't believe > that's > the only thing we ought to be doing here, necessarily, but it is one > thing that we definitely should be doing and probably the least > controversial. Running the simple test described here: https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com The baseline (no "SET search_path" clause on the function) is around 3800ms, and with the clause it shoots up to 8000ms. That's not good, but it is down from about 12000ms before. There are a few patches in the queue to bring it down further. Andres and I were discussing some GUC hashtable optimizations here: https://www.postgresql.org/message-id/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.ca...@j-davis.com which will (if committed) bring it down into the mid 7s. There are also a couple other patches I have here (and intend to commit soon): https://www.postgresql.org/message-id/e6fded24cb8a2c53d4ef069d9f69cc7baaafe9ef.camel%40j-davis.com and those I think will get it into the mid 6s. I think a bit lower combined with the GUC hash table optimizations above. So we are still looking at around 50% overhead for a simple function if all this stuff gets committed. Not great, but a lot better than before. Of course I welcome others to profile and see what they can do. There's a setjmp() call, and a couple allocations, and maybe some other stuff to look at. There are also higher-level ideas, like avoiding calling into guc.c in some cases, but that starts to get tricky as you pointed out: https://www.postgresql.org/message-id/CA%2BTgmoa8uKQgak5wH0%3D7sL-ukqbwnCPMXA2ZW7Ccdt7tdNGkzg%40mail.gmail.com It seems others are also interested in this problem, so I can put some more effort in after this round of patches goes in. 
I don't have a specific target other than "low enough overhead that we can reasonably recommend it as a best practice in multi-user environments". Regards, Jeff Davis
Re: PANIC serves too many masters
Jeff Davis writes: > On Sat, 2023-11-18 at 14:29 -0800, Andres Freund wrote: >> I don't quite know what we should do. But the current situation >> decidedly >> doesn't seem great. > Agreed. +1 > Better classification is nice, but it also requires more > discipline and it might not always be obvious which category something > fits in. What about an error loop resulting in: > ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE exceeded"))); > We'd want a core file, but I don't think we want to restart in that > case, right? Why not restart? There's no strong reason to assume this will repeat. It might be worth having some independent logic in the postmaster that causes it to give up after too many crashes in a row. But with many/most of these call sites, by definition we're not too sure what is wrong. > Also, can we do a change like this incrementally by updating a few > PANIC sites at a time? Is it fine to leave plain PANICs in place for > the foreseeable future, or do you want all of them to eventually move? I'd be inclined to keep PANIC with its current meaning, and incrementally change call sites where we decide that's not the best behavior. I think those will be a minority, maybe a small minority. (PANIC_EXIT had darn well better be a small minority.) regards, tom lane
Re: Annoying build warnings from latest Apple toolchain
I wrote: > The autoconf side seems to just be letting this option default. > I'm not sure what the default choice is, but evidently it's not > "-undefined error"? Or were they stupid enough to not allow you > to explicitly select the default behavior? Seems we are not the only project having trouble with this: https://github.com/mesonbuild/meson/issues/12450 I had not realized that Apple recently wrote themselves a whole new linker, but apparently that's why all these deprecation warnings are showing up. It's not exactly clear whether "deprecation" means they actually plan to remove the feature later, or just that some bozo decided that explicitly specifying the default behavior is bad style. regards, tom lane
Re: PANIC serves too many masters
Hi, On Sat, 2023-11-18 at 14:29 -0800, Andres Freund wrote: > I don't quite know what we should do. But the current situation > decidedly > doesn't seem great. Agreed. Better classification is nice, but it also requires more discipline and it might not always be obvious which category something fits in. What about an error loop resulting in: ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE exceeded"))); We'd want a core file, but I don't think we want to restart in that case, right? Also, can we do a change like this incrementally by updating a few PANIC sites at a time? Is it fine to leave plain PANICs in place for the foreseeable future, or do you want all of them to eventually move? Regards, Jeff Davis
Re: Annoying build warnings from latest Apple toolchain
Robert Haas writes: > On Mon, Nov 20, 2023 at 3:53 PM Tom Lane wrote: >> 13.6.2? longfin's host is on 13.6.1, and the only thing Software >> Update is showing me is an option to upgrade to Sonoma. But anyway... > Uh, I guess Apple made a special version just for me? That's > definitely what it says. Might be for M-series only; longfin's host is still Intel. >> Hmm ... I fixed these things in the autoconf build: neither my >> buildfarm animals nor manual builds show any warnings. I thought >> the problems weren't there in the meson build. Need to take another >> look I guess. > They're definitely there for me, and there are a whole lot of them. I > would have thought that if they were there for you in the meson build > you would have noticed, since ninja suppresses a lot of distracting > output that make prints. I'm generally still using autoconf, I only run meson builds when somebody complains about them ;-). But yeah, I see lots of "ld: warning: -undefined error is deprecated" when I do that. This seems to have been installed by Andres' 9a95a510a: ldflags += ['-isysroot', pg_sysroot] + # meson defaults to -Wl,-undefined,dynamic_lookup for modules, which we + # don't want because a) it's different from what we do for autoconf, b) it + # causes warnings starting in macOS Ventura + ldflags_mod += ['-Wl,-undefined,error'] The autoconf side seems to just be letting this option default. I'm not sure what the default choice is, but evidently it's not "-undefined error"? Or were they stupid enough to not allow you to explicitly select the default behavior? Also, I *don't* see any complaints about duplicate libraries. What build options are you using? regards, tom lane
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 1:37 PM Andres Freund wrote: > > Given that, I wonder if what we should do is to just add a new field to > pg_control that says "error out if backup_label does not exist", that we > set > when creating a streaming base backup > > I thought this was DOA since we don't want to ever leave the cluster in a state where a crash requires intervention to restart. But I agree that it is not possible to fool-proof against a naive backup that copies over the pg_control file as-is if breaking the crashed cluster option is not in play. I agree that this works if the pg_control generated by stop backup produces the line and we retain the label file as a separate and now mandatory component to using the backup. Or is the idea to make v17 error if it sees a backup label unless pg_control has the feature flag field? Which doesn't exist normally, does in the basebackup version, and is removed once the backup is restored? David J.
Re: Annoying build warnings from latest Apple toolchain
On Mon, Nov 20, 2023 at 3:53 PM Tom Lane wrote: > 13.6.2? longfin's host is on 13.6.1, and the only thing Software > Update is showing me is an option to upgrade to Sonoma. But anyway... Uh, I guess Apple made a special version just for me? That's definitely what it says. > > [2264/2287] Linking target src/interfaces/ecpg/test/sql/parser > > ld: warning: -undefined error is deprecated > > ld: warning: ignoring duplicate libraries: '-lz' > > Hmm ... I fixed these things in the autoconf build: neither my > buildfarm animals nor manual builds show any warnings. I thought > the problems weren't there in the meson build. Need to take another > look I guess. They're definitely there for me, and there are a whole lot of them. I would have thought that if they were there for you in the meson build you would have noticed, since ninja suppresses a lot of distracting output that make prints. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Annoying build warnings from latest Apple toolchain
Robert Haas writes: > Is there still outstanding work on this thread? Because I'm just now > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this > kind of thing in a meson build: 13.6.2? longfin's host is on 13.6.1, and the only thing Software Update is showing me is an option to upgrade to Sonoma. But anyway... > [2264/2287] Linking target src/interfaces/ecpg/test/sql/parser > ld: warning: -undefined error is deprecated > ld: warning: ignoring duplicate libraries: '-lz' Hmm ... I fixed these things in the autoconf build: neither my buildfarm animals nor manual builds show any warnings. I thought the problems weren't there in the meson build. Need to take another look I guess. regards, tom lane
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On Tue, Nov 14, 2023 at 11:21 PM Jeff Davis wrote: > After adding the search path cache (recent commit f26c2368dc) hopefully > that helps to make the above suggestion more reasonable performance- > wise. I think we can call that progress. I agree. Not to burden you, but do you know what the overhead is now, and do you have any plans to further reduce it? I don't believe that's the only thing we ought to be doing here, necessarily, but it is one thing that we definitely should be doing and probably the least controversial. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Partial aggregates pushdown
On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp wrote: > In postgres_fdw.sql, I have corrected the output format for floating point > numbers > by extra_float_digits. Looking at this, I find that it's not at all clear to me how the partial aggregate function is defined. Let's look at what we have for documentation: + + Paraemter AGGPARTIALFUNC optionally defines a + partial aggregate function used for partial aggregate pushdown; see +for details. + + Partial aggregate function (zero if none). + See for the definition + of partial aggregate function. + Partial aggregate pushdown is an optimization for queries that contains + aggregate expressions for a partitioned table across one or more remote + servers. If multiple conditions are met, partial aggregate function + When partial aggregate pushdown is used for aggregate expressions, + remote queries replace aggregate function calls with partial + aggregate function calls. If the data type of the state value is not But there's no definition of what the behavior of the function is anywhere that I can see, not even in . Everywhere it only describes how the partial aggregate function is used, not what it is supposed to do. Looking at the changes in pg_aggregate.dat, it seems like the partial aggregate function is a second aggregate defined in a way that mostly matches the original, except that (1) if the original final function would have returned a data type other than internal, then the final function is removed; and (2) if the original final function would have returned a value of internal type, then the final function is the serialization function of the original aggregate. I think that's a reasonable definition, but the documentation and code comments need to be a lot clearer. I do have a concern about this, though. It adds a lot of bloat. 
It adds a whole lot of additional entries to pg_aggregate, and every new aggregate we add in the future will require a bonus entry for this, and it needs a bunch of new pg_proc entries as well. One idea that I've had in the past is to instead introduce syntax that just does this, without requiring a separate aggregate definition in each case. For example, maybe instead of changing string_agg(whatever) to string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or something. Then all aggregates could be treated in a generic way. I'm not completely sure that's better, but I think it's worth considering. I think that the control mechanism needs some thought. Right now, there are two possible behaviors: either we assume that the local and remote sides are the same unconditionally, or we assume that they're the same if the remote side is a new enough version. I do like having those behaviors available, but I wonder if we need to do something better or different. What if somebody wants to push down a non-built-in aggregate, for example? I realize that we don't have great solutions to the problem of knowing which functions are push-downable in general, and I don't know that partial aggregation needs to be any better than anything else, but it's probably worth comparing and contrasting the approach we take here with the approaches we've taken in other, similar cases. From that point of view, I think check_partial_aggregate_support is a novelty: we don't do those kinds of checks in other cases, AFAIK. But on the other hand, there is the 'extensions' argument to postgres_fdw. I don't think the patch does a good job explaining why HAVING, DISTINCT, and ORDER BY are a problem. 
It seems to me that HAVING shouldn't really be a problem, because HAVING is basically a WHERE clause that occurs after aggregation is complete, and whether or not the aggregation is safe shouldn't depend on what we're going to do with the value afterward. The HAVING clause can't necessarily be pushed to the remote side, but I don't see how or why it could make the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a little trickier: if we pushed down DISTINCT, we'd still have to re-DISTINCT-ify when combining locally, and if we pushed down ORDER BY, we'd have to do a merge pass to combine the returned values unless we could prove that the partitions were non-overlapping ranges that would be visited in the correct order. Although that all sounds doable, I think it's probably a good thing that the current patch doesn't try to handle it -- this is complicated already. But it should explain why it's not handling it and maybe even a bit about how it could be handled in the future, rather than just saying "well, this kind of thing is not safe." The trouble with that explanation is that it does nothing to help the reader understand whether the thing in question is *fundamentally* unsafe or whether we just don't have the right code to make it work. Typo: Paraemter I'm so sorry to keep complaining about
Re: PSQL error: total cell count of XXX exceeded
On 2023-Sep-12, Tom Lane wrote: > I'm more than a bit skeptical about trying to do something about this, > simply because this range of query result sizes is far past what is > practical. The OP clearly hasn't tested his patch on actually > overflowing query results, and I don't care to either. I think we're bound to hit this limit at some point in the future, and it seems easy enough to solve. I propose the attached, which is pretty much what Hongxu last submitted, with some minor changes. Having this make a difference requires some 128GB of RAM, so it's not a piece of cake, but it's an amount that can be reasonably expected to be physically installed in real machines nowadays. (I first thought we could just use pg_mul_s32_overflow during printTableInit and raise an error if that returns true, but that just postpones the problem.) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ Subversion to GIT: the shortest path to happiness I've ever heard of (Alexey Klyukin) >From 75abe5a485532cbc03db1eade2e1a6099914f98f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 20 Nov 2023 17:35:18 +0100 Subject: [PATCH v5] Avoid overflow in fe_utils' printTable API We were using 32-bit integer arithmetic to compute the total number of cells, which can overflow when used with large resultsets. We still don't allow more than 4 billion rows, but now the total number of cells can exceed that. 
Author: Hongxu Ma Discussion: https://postgr.es/m/tybp286mb0351b057b101c90d7c1239e6b4...@tybp286mb0351.jpnp286.prod.outlook.com --- src/fe_utils/print.c | 22 ++ src/include/fe_utils/print.h | 2 +- 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/src/fe_utils/print.c b/src/fe_utils/print.c index 7af1ccb6b5..565cbc6d13 100644 --- a/src/fe_utils/print.c +++ b/src/fe_utils/print.c @@ -3172,6 +3172,8 @@ void printTableInit(printTableContent *const content, const printTableOpt *opt, const char *title, const int ncolumns, const int nrows) { + int64 total_cells; + content->opt = opt; content->title = title; content->ncolumns = ncolumns; @@ -3179,7 +3181,8 @@ printTableInit(printTableContent *const content, const printTableOpt *opt, content->headers = pg_malloc0((ncolumns + 1) * sizeof(*content->headers)); - content->cells = pg_malloc0((ncolumns * nrows + 1) * sizeof(*content->cells)); + total_cells = (int64) ncolumns * nrows; + content->cells = pg_malloc0((total_cells + 1) * sizeof(*content->cells)); content->cellmustfree = NULL; content->footers = NULL; @@ -3249,15 +3252,17 @@ void printTableAddCell(printTableContent *const content, char *cell, const bool translate, const bool mustfree) { + int64 total_cells; + #ifndef ENABLE_NLS (void) translate; /* unused parameter */ #endif - if (content->cellsadded >= content->ncolumns * content->nrows) + total_cells = (int64) content->ncolumns * content->nrows; + if (content->cellsadded >= total_cells) { - fprintf(stderr, _("Cannot add cell to table content: " - "total cell count of %d exceeded.\n"), -content->ncolumns * content->nrows); + fprintf(stderr, _("Cannot add cell to table content: total cell count of %lld exceeded.\n"), +(long long int) total_cells); exit(EXIT_FAILURE); } @@ -3273,7 +3278,7 @@ printTableAddCell(printTableContent *const content, char *cell, { if (content->cellmustfree == NULL) content->cellmustfree = -pg_malloc0((content->ncolumns * content->nrows + 1) * sizeof(bool)); +pg_malloc0((total_cells 
+ 1) * sizeof(bool)); content->cellmustfree[content->cellsadded] = true; } @@ -3341,9 +3346,10 @@ printTableCleanup(printTableContent *const content) { if (content->cellmustfree) { - int i; + int64 total_cells; - for (i = 0; i < content->nrows * content->ncolumns; i++) + total_cells = (int64) content->ncolumns * content->nrows; + for (int64 i = 0; i < total_cells; i++) { if (content->cellmustfree[i]) free(unconstify(char *, content->cells[i])); diff --git a/src/include/fe_utils/print.h b/src/include/fe_utils/print.h index 13ebb00362..a697d0ba8d 100644 --- a/src/include/fe_utils/print.h +++ b/src/include/fe_utils/print.h @@ -171,7 +171,7 @@ typedef struct printTableContent const char **cells; /* NULL-terminated array of cell content * strings */ const char **cell; /* Pointer to the last added cell */ - long cellsadded; /* Number of cells added this far */ + int64 cellsadded; /* Number of cells added this far */ bool *cellmustfree; /* true for cells that need to be free()d */ printTableFooter *footers; /* Pointer to the first footer */ printTableFooter *footer; /* Pointer to the last added footer */ -- 2.39.2
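The key detail in the patch above is that the cast happens before the multiply: "(int64) ncolumns * nrows" promotes one operand first, so the product is computed in 64 bits. A minimal standalone illustration (not PostgreSQL code):

```c
#include <stdint.h>

/* Mirrors the patch's pattern: cast one operand to a 64-bit type
 * before multiplying, so two large int counts cannot wrap.  Writing
 * "ncolumns * nrows" without the cast would multiply in 32 bits and
 * overflow, which is the bug being fixed. */
static int64_t
total_cell_count(int ncolumns, int nrows)
{
	return (int64_t) ncolumns * nrows;
}
```

With 100,000 columns and 100,000 rows the true product is 10^10, well past INT32_MAX, and the cast-first version returns it intact.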
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-20 15:56:19 -0400, David Steele wrote: > I understand this is an option -- but does it need to be? What is the > benefit of excluding the manifest? It's not free to create the manifest, particularly if checksums are enabled. Also, for external backups, there's no manifest... - Andres
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-20 11:11:13 -0500, Robert Haas wrote: > I think we need more votes to make a change this big. I have a > concern, which I think I've expressed before, that we keep whacking > around the backup APIs, and that has a cost which is potentially > larger than the benefits. +1. The amount of whacking around in this area has been substantial, and it's hard for operators to keep up. And realistically, with data sizes today, the pressure to do basebackups with disk snapshots etc is not going to shrink. Leaving that concern aside, I am still on the fence about this proposal. I think it does decrease the chance of getting things wrong in the streaming-basebackup case. But for external backups, it seems almost universally worse (with the exception of the torn pg_control issue, that we also can address otherwise): It doesn't reduce the risk of getting things wrong, you can still omit placing a file into the data directory and get silent corruption as a consequence. In addition, it's harder to see when looking at a base backup whether the process was right or not, because now the good and bad state look the same if you just look on the filesystem level! Then there's the issue of making ad-hoc debugging harder by not having a human readable file with information anymore, including when looking at the history, via backup_label.old. Given that, I wonder if what we should do is to just add a new field to pg_control that says "error out if backup_label does not exist", that we set when creating a streaming base backup Greetings, Andres Freund
Re: Implement missing join selectivity estimation for range types
On 14/11/2023 20:46, Tom Lane wrote: > I took a brief look through this very interesting work. I concur > with Tomas that it feels a little odd that range join selectivity > would become smarter than scalar inequality join selectivity, and > that we really ought to prioritize applying these methods to that > case. Still, that's a poor reason to not take the patch. Indeed, we started with ranges as this was the simpler case (no MCV) and was the topic of a course project. The idea is to later write a second patch that applies these ideas to scalar inequality while also handling MCV's correctly. > I also agree with the upthread criticism that having two identical > functions in different source files will be a maintenance nightmare. > Don't do it. When and if there's a reason for the behavior to > diverge between the range and multirange cases, it'd likely be > better to handle that by passing in a flag to say what to do. The duplication is indeed not ideal. However, there are already 8 other duplicate functions between the two files. I would thus suggest to leave the duplication in this patch and create a second one that removes all duplication from the two files, instead of just removing the duplication for our new function. What are your thoughts on this? If we do this, should the function definitions go in rangetypes.h or should we create a new rangetypes_selfuncs.h header? > But my real unhappiness with the patch as-submitted is the test cases, > which require rowcount estimates to be reproduced exactly. > We need a more forgiving test method. Usually the > approach is to set up a test case where the improved accuracy of > the estimate changes the planner's choice of plan compared to what > you got before, since that will normally not be too prone to change > from variations of a percent or two in the estimates. I have changed the test method to produce query plans for a 3-way range join. 
The plans for the different operators differ due to the computed selectivity estimation, which was not the case before this patch.

Regards,
Maxime Schoemans

diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index cefc4710fd..c670d225a0 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1335,3 +1335,558 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * This is a utility function used to estimate the join selectivity of
+ * range attributes using rangebound histogram statistics as described
+ * in this paper:
+ *
+ * Diogo Repas, Zhicheng Luo, Maxime Schoemans and Mahmoud Sakr, 2022
+ * Selectivity Estimation of Inequality Joins In Databases
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * The attributes being joined will be treated as random variables
+ * that follow a distribution modeled by a Probability Density Function (PDF).
+ * Let the two attributes be denoted X, Y.
+ * This function finds the probability P(X < Y).
+ * Note that the PDFs of the two variables can easily be obtained
+ * from their bounds histograms, respectively hist1 and hist2.
+ *
+ * Let the PDF of X, Y be denoted as f_X, f_Y.
+ * The probability P(X < Y) can be formalized as follows:
+ * P(X < Y) = integral_-inf^inf( integral_-inf^y ( f_X(x) * f_Y(y) dx dy ) )
+ *          = integral_-inf^inf( F_X(y) * f_Y(y) dy )
+ * where F_X(y) denotes the Cumulative Distribution Function of X at y.
+ * Note that F_X is the selectivity estimation (non-join),
+ * which is implemented using the function calc_hist_selectivity_scalar.
+ *
+ * Now given the histograms of the two attributes X, Y, we note the following:
+ * - The PDF of Y is a step function
+ *   (constant piece-wise, where each piece is defined in a bin of Y's histogram)
+ * - The CDF of X is linear piece-wise
+ *   (each piece is defined in a bin of X's histogram)
+ * This leads to the conclusion that their product
+ * (used to calculate the equation above) is also linear piece-wise.
+ * A new piece starts whenever either the bin of X or the bin of Y changes.
+ * By parallel scanning the two rangebound histograms of X and Y,
+ * we evaluate one piece of the result between every two consecutive rangebounds
+ * in the union of the two histograms.
+ *
+ * Given that the product F_X * f_Y is linear in the interval
+ * between every two consecutive rangebounds, let them be denoted prev, cur,
+ * it can be shown that the above formula can be discretized into the following:
+ * P(X < Y) =
+ *   0.5 * sum_0^{n+m-1} ( ( F_X(prev) + F_X(cur) ) * ( F_Y(cur) - F_Y(prev) ) )
+ * where n, m are the lengths of the two histograms.
+ *
+ * As such, it is possible to fully compute the join selectivity
+ * as a summation of CDFs, iterating over the bounds of the two histograms.
+ * This maximizes the code reuse, since the CDF is computed using
+ * the calc_hist_selectivity_scalar function, which is the function used
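The discretized summation described in the patch comment above can be exercised outside the server. Below is a self-contained approximation (helper names invented here; the actual patch works on `RangeBound` arrays with typcache comparisons and reuses `calc_hist_selectivity_scalar` for the CDF):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Linearly interpolated CDF over a sorted array of bin bounds, treating
 * the histogram as equi-depth (each of the nbounds-1 bins holds the same
 * probability mass).  This stands in for calc_hist_selectivity_scalar.
 */
static double
hist_cdf(double v, const double *bounds, int nbounds)
{
	if (v <= bounds[0])
		return 0.0;
	if (v >= bounds[nbounds - 1])
		return 1.0;
	for (int i = 0; i < nbounds - 1; i++)
	{
		if (v < bounds[i + 1])
			return (i + (v - bounds[i]) / (bounds[i + 1] - bounds[i])) /
				(nbounds - 1);
	}
	return 1.0;
}

/*
 * P(X < Y), discretized as in the comment above:
 *   0.5 * sum( (F_X(prev) + F_X(cur)) * (F_Y(cur) - F_Y(prev)) )
 * evaluated between consecutive bounds in the union of the two
 * histograms, visited with a parallel (merge-style) scan.
 */
static double
join_sel_lt(const double *hx, int nx, const double *hy, int ny)
{
	double		lo = (hx[0] < hy[0]) ? hx[0] : hy[0];
	double		prev_fx = hist_cdf(lo, hx, nx);
	double		prev_fy = hist_cdf(lo, hy, ny);
	double		sel = 0.0;
	int			ix = 0,
				iy = 0;

	while (ix < nx || iy < ny)
	{
		double		cur;
		double		fx,
					fy;

		/* take the smaller next bound, advancing that histogram's cursor */
		if (iy >= ny || (ix < nx && hx[ix] <= hy[iy]))
			cur = hx[ix++];
		else
			cur = hy[iy++];

		fx = hist_cdf(cur, hx, nx);
		fy = hist_cdf(cur, hy, ny);
		sel += 0.5 * (prev_fx + fx) * (fy - prev_fy);
		prev_fx = fx;
		prev_fy = fy;
	}
	return sel;
}
```

Sanity checks that fall out of the math: two identical distributions give P(X < Y) = 0.5, and disjoint ranges give 1.0 or 0.0 depending on order.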
Re: Annoying build warnings from latest Apple toolchain
Hi, On 2023-11-20 14:46:13 -0500, Robert Haas wrote: > On Mon, Nov 20, 2023 at 2:35 PM Andres Freund wrote: > > > Is there still outstanding work on this thread? Because I'm just now > > > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this > > > kind of thing in a meson build: > > > > Ventura? In that case I assume you installed newer developer tools? Because > > otherwise I think we were talking about issues introduced in Sonoma. > > I don't think I did anything special when installing developer tools. > xcode-select --version reports 2397, if that tells you anything. Odd then. My m1-mini running Ventura, also reporting 2397, doesn't show any of those warnings. I did a CI run with Sonoma, and that does show them. I'm updating said m1-mini it to Sonoma now, but that will take until I have to leave for an appointment. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 15:47, Robert Haas wrote: On Mon, Nov 20, 2023 at 2:41 PM David Steele wrote: I can't see why a backup would continue to be valid without a manifest -- that's not very standard for backup software. If you have the critical info in backup_label, you can't afford to lose that, so why should backup_manifest be any different? I mean, you can run pg_basebackup --no-manifest. Maybe this would be a good thing to disable for page incremental. With all the work being done by pg_combinebackup, it seems like it would be a good idea to be able to verify the final result? I understand this is an option -- but does it need to be? What is the benefit of excluding the manifest? Regards, -David
Re: Parallel CREATE INDEX for BRIN indexes
On Wed, 8 Nov 2023 at 12:03, Tomas Vondra wrote:
>
> Hi,
>
> here's an updated patch, addressing the review comments, and reworking
> how the work is divided between the workers & leader etc.
>
> 0001 is just v2, rebased to current master
>
> 0002 and 0003 address most of the issues, in particular it
>
> - removes the unnecessary spool
> - fixes bs_reltuples type to double
> - a couple comments are reworded to be clearer
> - changes loop/condition in brinbuildCallbackParallel
> - removes asserts added for debugging
> - fixes cast in comparetup_index_brin
> - 0003 then simplifies comparetup_index_brin
> - I haven't inlined the tuplesort_puttuple_common parameter
>   (didn't seem worth it)

OK, thanks

> 0004 Reworks how the work is divided between workers and combined by the
> leader. It undoes the tableam.c changes that attempted to divide the
> relation into chunks matching the BRIN ranges, and instead merges the
> results in the leader (using the BRIN "union" function).

That's OK.

> I haven't done any indentation fixes yet.
>
> I did fairly extensive testing, using pageinspect to compare indexes
> built with/without parallelism. More testing is needed, but it seems to
> work fine (with other opclasses and so on).

After code-only review, here are some comments:

> +++ b/src/backend/access/brin/brin.c
> [...]
> +/* Magic numbers for parallel state sharing */
> +#define PARALLEL_KEY_BRIN_SHARED		UINT64CONST(0xA001)
> +#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xA002)

These shm keys use the same constants also in use in access/nbtree/nbtsort.c. While this shouldn't be an issue in normal operations, I'd prefer if we didn't actively introduce conflicting identifiers when we still have significant amounts of unused values remaining.
> +#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xA003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the others being in access/nbtree/nbtsort.c (value 0xA004, one more than brin's), backend/executor/execParallel.c (0xE008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though I've not checked that their uses are exactly the same, I'd expect at least btree to match mostly, if not fully, 1:1). I think we could probably benefit from a less ad-hoc sharing of query texts. I don't think that needs to happen specifically in this patch, but I think it's something to keep in mind in future efforts.

> +_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
> [...]
> +	BrinSpool  *spool = state->bs_spool;
> [...]
> +	if (!state)
> +		return;

I think the assignment to spool should be moved to below this condition, as _brin_begin_parallel calls this with state=NULL when it can't launch parallel workers, which will cause issues here.

> +	state->bs_numtuples = brinshared->indtuples;

My IDE complains about bs_numtuples being an integer. This is a pre-existing issue, but still valid: we can hit an overflow on tables with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely, but problematic nonetheless.

> + * FIXME This probably needs some memory management fixes - we're reading
> + * tuples from the tuplesort, we're allocating an emty tuple, and so on.
> + * Probably better to release this memory.

This should probably be resolved.

I also noticed that this is likely to execute `union_tuples` many times when pages_per_range is coprime with the parallel table scan's block stride (or when we for other reasons have many tuples returned for each range); and this `union_tuples` internally allocates and deletes its own memory context for its deserialization of the 'b' tuple. I think we should just pass a scratch context instead, so that we don't have the overhead of continuously creating then deleting the same memory context.
> +++ b/src/backend/catalog/index.c
> [...]
> -		indexRelation->rd_rel->relam == BTREE_AM_OID)
> +		(indexRelation->rd_rel->relam == BTREE_AM_OID ||
> +		 indexRelation->rd_rel->relam == BRIN_AM_OID))

I think this needs some more effort. I imagine a new IndexAmRoutine->amcanbuildparallel is more appropriate than this hard-coded list - external indexes may want to utilize the parallel index creation planner facility, too.

Some notes:

As a project PostgreSQL seems to be trying to move away from hardcoding heap into everything in favour of the more AM-agnostic 'table'. I suggest replacing all mentions of "heap" in the arguments with "table", to reduce the work future maintainers need to do to fix this. Even when this AM is mostly targeted towards the heap AM, other AMs (such as those I've heard of that were developed internally at EDB) use the same block-addressing that heap does, and should thus be compatible with BRIN. Thus, "heap" is not a useful name here.

There are 2 new mentions of "tuplestore" in the patch, while the structure used is tuplesort: one on form_and_spill_tuple, and
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 2:41 PM David Steele wrote: > I can't see why a backup would continue to be valid without a manifest > -- that's not very standard for backup software. If you have the > critical info in backup_label, you can't afford to lose that, so why > should backup_manifest be any different? I mean, you can run pg_basebackup --no-manifest. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Annoying build warnings from latest Apple toolchain
On Mon, Nov 20, 2023 at 2:35 PM Andres Freund wrote: > > Is there still outstanding work on this thread? Because I'm just now > > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this > > kind of thing in a meson build: > > Ventura? In that case I assume you installed newer developer tools? Because > otherwise I think we were talking about issues introduced in Sonoma. I don't think I did anything special when installing developer tools. xcode-select --version reports 2397, if that tells you anything. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 14:44, Robert Haas wrote: On Mon, Nov 20, 2023 at 12:54 PM David Steele wrote: Another thing we could do is explicitly error if we see backup_label in PGDATA during recovery. That's just a few lines of code so would not be a big deal to maintain. This error would only be visible on restore, so it presumes that backup software is being tested. I think that if we do decide to adopt this proposal, that would be a smart precaution. I'd be OK with it -- what do you think, Michael? Would this be enough that we would not need to rename the functions, or should we just go with the rename? A little hard to add to the tests, I think, since they are purely informational, i.e. not pushed up by the parser. Maybe we could just grep for the fields? Hmm. Or should they be pushed up by the parser? We could do that. I started on that road, but it's a lot of code for fields that aren't used. I think it would be better if the parser also loaded a data structure that represented the manifest. Seems to me there's a lot of duplicated code between pg_verifybackup and pg_combinebackup the way it is now. I think these fields would be handled the same as the rest of the fields in backup_label: returned from pg_backup_stop() and also stored in backup_manifest. Third-party software can do as they like with them and pg_combinebackup can just read from backup_manifest. I think that would be a bad plan, because this is critical information, and a backup manifest is not a thing that you're required to have. It's not a natural fit at all. We don't want to create a situation where if you nuke the backup_manifest then the server forgets that what it has is an incremental backup rather than a usable data directory. I can't see why a backup would continue to be valid without a manifest -- that's not very standard for backup software. If you have the critical info in backup_label, you can't afford to lose that, so why should backup_manifest be any different? Regards, -David
Re: Annoying build warnings from latest Apple toolchain
Hi, On 2023-11-20 14:14:00 -0500, Robert Haas wrote: > On Sat, Oct 7, 2023 at 12:09 PM Tom Lane wrote: > > Done that way. > > Is there still outstanding work on this thread? Because I'm just now > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this > kind of thing in a meson build: Ventura? In that case I assume you installed newer developer tools? Because otherwise I think we were talking about issues introduced in Sonoma. Greetings, Andres Freund
Re: Use of backup_label not noted in log
On 11/20/23 15:03, Andres Freund wrote:

On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote:
If we add a message for starting with "backup_label", shouldn't we also add a corresponding message for starting from a checkpoint found in the control file? If you see that in a problem report, you immediately know what is going on.

Maybe - the reason I hesitate on that is that emitting an additional log message when starting from a base backup just adds something "once over the lifetime of a node". Whereas emitting something every start obviously doesn't impose any limit.

Hmm, yeah, that would be a bit much.

Here's the state with my updated patch, when starting up from a base backup:

LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 64-bit
LOG:  listening on IPv6 address "::1", port 5441
LOG:  listening on IPv4 address "127.0.0.1", port 5441
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5441"
LOG:  database system was interrupted; last known up at 2023-11-20 10:55:49 PST
LOG:  starting recovery from base backup with redo LSN E/AFF07F20, checkpoint LSN E/B01B17F0, on timeline ID 1
LOG:  entering standby mode
LOG:  redo starts at E/AFF07F20
LOG:  completed recovery from base backup with redo LSN E/AFF07F20
LOG:  consistent recovery state reached at E/B420FC80

Besides the phrasing and the additional log message (I have no opinion about whether it should be backpatched or not), I used %u for TimelineID as appropriate, and added a comma before "on timeline".

I still wonder if we need "base backup" in the messages? That sort of implies (at least to me) you used pg_basebackup, but that may not be the case.

FWIW, I also prefer "backup recovery" over "recovery from backup". "recovery from backup" reads fine here, but it gets more awkward when you want to say something like "recovery from backup settings". In that case, I think "backup recovery settings" reads better.
Not important for this patch, maybe, but the recovery in pg_control patch went the other way and I definitely think it makes sense to keep them consistent, whichever way we go. Other than that, looks good for HEAD. Whether we back patch or not is another question, of course. Regards, -David
Re: trying again to get incremental backup
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas wrote:
> > Is this meant to support multiple timelines each with non-overlapping
> > adjacent ranges, rather than multiple non-adjacent ranges?
>
> Correct. I don't see how non-adjacent LSN ranges could ever be a
> useful thing, but adjacent ranges on different timelines are useful.

Thinking about this a bit more, there are a couple of things we could do here in terms of syntax. One idea is to give up on having a separate MANIFEST-WAL-RANGE command for each range and instead just cram everything either into a single command:

MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}...

Or even into a single option to the BASE_BACKUP command:

BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...'

Or, since we expect adjacent, non-overlapping ranges, you could even arrange to elide the duplicated boundary LSNs, e.g.

MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn}

Or

BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn'

I'm not sure what's best here. Trying to trim out the duplicated boundary LSNs feels a bit like rearrangement for the sake of rearrangement, but maybe it isn't really.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
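None of these spellings exist yet, but if the single-option form were chosen, the server-side parsing of one element would be simple. A hypothetical sketch (names and the `tli@startlsn-endlsn` grammar are taken from the proposal above, everything else is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One WAL range of the proposed option string (struct name invented). */
typedef struct WalRange
{
	uint32_t	tli;
	uint64_t	startlsn;
	uint64_t	endlsn;
} WalRange;

/*
 * Parse one "tli@startlsn-endlsn" element, with LSNs written in the
 * usual PostgreSQL %X/%X notation.  Returns 1 on success, 0 on
 * malformed input.
 */
static int
parse_wal_range(const char *s, WalRange *r)
{
	unsigned int tli,
				s_hi,
				s_lo,
				e_hi,
				e_lo;

	if (sscanf(s, "%u@%X/%X-%X/%X", &tli, &s_hi, &s_lo, &e_hi, &e_lo) != 5)
		return 0;
	r->tli = tli;
	r->startlsn = ((uint64_t) s_hi << 32) | s_lo;
	r->endlsn = ((uint64_t) e_hi << 32) | e_lo;
	return 1;
}
```

Splitting the full option on ',' and feeding each piece to such a helper would also make the adjacency check (each range's end equals the next range's start) straightforward to enforce.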
Re: Annoying build warnings from latest Apple toolchain
On Sat, Oct 7, 2023 at 12:09 PM Tom Lane wrote: > Done that way. Is there still outstanding work on this thread? Because I'm just now using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this kind of thing in a meson build: [2264/2287] Linking target src/interfaces/ecpg/test/sql/parser ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [2266/2287] Linking target src/interfaces/ecpg/test/sql/insupd ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [2273/2287] Linking target src/interfaces/ecpg/test/sql/quote ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [2278/2287] Linking target src/interfaces/ecpg/test/sql/show ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [2280/2287] Linking target src/interfaces/ecpg/test/sql/sqlda ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' -- Robert Haas EDB: http://www.enterprisedb.com
Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs
Hi,

Here's a new version to improve the performance of FETCH_COUNT and extend the cases when it can be used.

Patch 0001 adds a new mode in libpq to allow the app to retrieve larger chunks of results than the single row of the row-by-row mode. The maximum number of rows per PGresult is set by the user.

Patch 0002 uses that mode in psql and gets rid of the cursor implementation as suggested upthread.

The performance numbers look good. For a query retrieving 50M rows of about 200 bytes:
  select repeat('abc', 200) from generate_series(1,500)

/usr/bin/time -v psql -At -c $query reports these metrics (medians of 5 runs):

  version   | fetch_count | clock_time | user_time | sys_time | max_rss_size (kB)
  ----------+-------------+------------+-----------+----------+------------------
  16-stable |           0 |       6.58 |      3.98 |     2.09 |           3446276
  16-stable |         100 |       9.25 |      4.10 |     1.90 |              8768
  16-stable |        1000 |      11.13 |      5.17 |     1.66 |              8904
  17-patch  |           0 |        6.5 |      3.94 |     2.09 |           3442696
  17-patch  |         100 |          5 |      3.56 |     0.93 |              4096
  17-patch  |        1000 |       6.48 |      4.00 |     1.55 |              4344

Interestingly, retrieving by chunks of 100 rows appears to be a bit faster than the default one big chunk. It means that independently of using less memory, FETCH_COUNT implemented that way would be a performance enhancement compared to both not using it and using it in v16 with the cursor implementation.

Best regards,

-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

From 766bbe84def2db494f646caeaf29eefeba893c1a Mon Sep 17 00:00:00 2001
From: Daniel Vérité
Date: Mon, 20 Nov 2023 17:24:55 +0100
Subject: [PATCH v4 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks of results contain up to N rows instead of a single row. It is meant to reduce the overhead of the row-by-row allocations for large result sets.

The mode is selected with PQsetChunkedRowsMode(int maxRows) and results have the new status code PGRES_TUPLES_CHUNK.
--- doc/src/sgml/libpq.sgml | 96 +++-- src/bin/pg_amcheck/pg_amcheck.c | 1 + src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-exec.c | 118 +-- src/interfaces/libpq/libpq-fe.h | 4 +- src/interfaces/libpq/libpq-int.h | 7 +- 6 files changed, 183 insertions(+), 44 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index ed88ac001a..8007bf67d8 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -3537,7 +3537,20 @@ ExecStatusType PQresultStatus(const PGresult *res); The PGresult contains a single result tuple from the current command. This status occurs only when single-row mode has been selected for the query -(see ). +(see ). + + + + + + PGRES_TUPLES_CHUNK + + +The PGresult contains several tuples +from the current command. The count of tuples cannot exceed +the maximum passed to . +This status occurs only when the chunked mode has been selected +for the query (see ). @@ -5187,8 +5200,8 @@ PGresult *PQgetResult(PGconn *conn); Another frequently-desired feature that can be obtained with and - is retrieving large query results a row at a time. This is discussed - in . + is retrieving large query results a limited number of rows at a time. This is discussed + in . @@ -5551,12 +5564,13 @@ int PQflush(PGconn *conn); - To enter single-row mode, call PQsetSingleRowMode - before retrieving results with PQgetResult. - This mode selection is effective only for the query currently - being processed. For more information on the use of - PQsetSingleRowMode, - refer to . + To enter single-row or chunked modes, call + respectively PQsetSingleRowMode + or PQsetChunkedRowsMode before retrieving results + with PQgetResult. This mode selection is effective + only for the query currently being processed. For more information on the + use of these functions refer + to . 
@@ -5895,10 +5909,10 @@ UPDATE mytable SET x = x + 1 WHERE id = 42; - - Retrieving Query Results Row-by-Row + + Retrieving Query Results by chunks - + libpq single-row mode @@ -5909,13 +5923,15 @@ UPDATE mytable SET x = x + 1 WHERE id = 42; PGresult. This can be unworkable for commands that return a large number of rows. For such cases, applications can use and in - single-row mode. In this mode, the result row(s) are - returned to the application
Re: trying again to get incremental backup
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera wrote: > That sounds good to me. Not having to parse the manifest server-side > sounds like a win, as does saving the transfer, for the cases where the > manifest is large. OK. I'll look into this next week, hopefully. > Is this meant to support multiple timelines each with non-overlapping > adjacent ranges, rather than multiple non-adjacent ranges? Correct. I don't see how non-adjacent LSN ranges could ever be a useful thing, but adjacent ranges on different timelines are useful. > Do I have it right that you want to rewrite this bit before considering > this ready to commit? For sure. I don't think this is the only thing that needs to be revised before commit, but it's definitely *a* thing that needs to be revised before commit. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use of backup_label not noted in log
On 11/20/23 14:27, Andres Freund wrote: Hi, On 2023-11-19 14:28:12 -0400, David Steele wrote: On 11/18/23 17:49, Andres Freund wrote: On 2023-11-18 10:01:42 -0800, Andres Freund wrote: Not enamored with the phrasing of the log messages, but here's a prototype: When starting up with backup_label present: LOG: starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1 I'd prefer something like: LOG: starting backup recovery with redo... When restarting before reaching the end of the backup, but after backup_label has been removed: LOG: continuing to start from base backup with redo LSN A/34100028 LOG: entering standby mode LOG: redo starts at A/3954B958 And here: LOG: restarting backup recovery with redo... I like it. Cool. I've wondered whether it's worth also adding an explicit message just after ReachedEndOfBackup(), but it seems far less urgent due to the existing "consistent recovery state reached at %X/%X" message. I think the current message is sufficient, but what do you have in mind? Well, the consistency message is emitted after every restart. Whereas a single instance only should go through backup recovery once. So it seems worthwhile to differentiate the two in log messages. Ah, right. That works for me, then. Regards, -David
Re: Use of backup_label not noted in log
Hi, On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote: > On Mon, 2023-11-20 at 17:30 +0900, Michael Paquier wrote: > > + if (ControlFile->backupStartPoint != InvalidXLogRecPtr) > > + ereport(LOG, > > + (errmsg("continuing to start from base backup with redo > > LSN %X/%X", > > + > > LSN_FORMAT_ARGS(ControlFile->backupStartPoint; > > > > "Continuing to start" sounds a bit weird to me, though, considering > > that there are a few LOGs that say "starting" when there is a signal > > file, but I don't have a better idea on top of my mind. So that > > sounds OK here. > > We can only reach that message in recovery or standby mode, right? > So why not write "continuing to recover from base backup"? It can be reached without either too, albeit much less commonly. > If we add a message for starting with "backup_label", shouldn't > we also add a corresponding message for starting from a checkpoint > found in the control file? If you see that in a problem report, > you immediately know what is going on. Maybe - the reason I hesitate on that is that emitting an additional log message when starting from a base backup just adds something "once once the lifetime of a node". Whereas emitting something every start obviously doesn't impose any limit. You also can extrapolate from the messages absence that we started up without backup_label, it's not like there would be a lot of messages inbetween "database system was interrupted; last ..." and "starting backup recovery ..." (commonly there would be no messages) We can do more on HEAD of course, but we should be wary of just spamming the log unnecessarily as well. I guess we could add this message at the same time, including in the back branches. Initially I thought that might be unwise, because replacing elog(DEBUG1, "end of backup reached"); with a different message could theoretically cause issues, even if unlikely, given that it's a DEBUG1 message. 
But I think we actually want to emit the message a bit later, just *after* we updated the control file, as that's the actually relevant piece after which we won't go back to the "backup recovery" state. I am somewhat agnostic about whether we should add that in the back branches or not. Here's the state with my updated patch, when starting up from a base backup: LOG: starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 64-bit LOG: listening on IPv6 address "::1", port 5441 LOG: listening on IPv4 address "127.0.0.1", port 5441 LOG: listening on Unix socket "/tmp/.s.PGSQL.5441" LOG: database system was interrupted; last known up at 2023-11-20 10:55:49 PST LOG: starting recovery from base backup with redo LSN E/AFF07F20, checkpoint LSN E/B01B17F0, on timeline ID 1 LOG: entering standby mode LOG: redo starts at E/AFF07F20 LOG: completed recovery from base backup with redo LSN E/AFF07F20 LOG: consistent recovery state reached at E/B420FC80 Besides the phrasing and the additional log message (I have no opinion about whether it should be backpatched or not), I used %u for TimelineID as appropriate, and added a comma before "on timeline". 
Greetings,

Andres Freund

From d80e55942a096d9f1ab10b84ab6f2a9d371cf88d Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Sat, 18 Nov 2023 13:27:17 -0800
Subject: [PATCH v2] WIP: Log when starting up from a base backup

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp...@awork3.anarazel.de
Backpatch:
---
 src/backend/access/transam/xlogrecovery.c | 32 +++
 1 file changed, 32 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156aa..9803b481118 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -603,6 +603,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		if (StandbyModeRequested)
 			EnableStandbyMode();
 
+		/*
+		 * Omitting backup_label when creating a new replica, PITR node etc.
+		 * unfortunately is a common cause of corruption. Logging that
+		 * backup_label was used makes it a bit easier to exclude that as the
+		 * cause of observed corruption.
+		 *
+		 * Do so before we try to read the checkpoint record (which can fail),
+		 * as otherwise it can be hard to understand why a checkpoint other
+		 * than ControlFile->checkPoint is used.
+		 */
+		ereport(LOG,
+				(errmsg("starting recovery from base backup with redo LSN %X/%X, checkpoint LSN %X/%X, on timeline ID %u",
+						LSN_FORMAT_ARGS(RedoStartLSN),
+						LSN_FORMAT_ARGS(CheckPointLoc),
+						CheckPointTLI)));
+
 		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
@@ -742,6 +758,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				EnableStandbyMode();
 		}
 
+		/*
+		 * For the
Re: trying again to get incremental backup
On 2023-Nov-16, Robert Haas wrote: > On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera > wrote: > > I don't understand this point. Currently, the protocol is that > > UPLOAD_MANIFEST is used to send the manifest prior to requesting the > > backup. You seem to be saying that you're thinking of removing support > > for UPLOAD_MANIFEST and instead just give the LSN as an option to the > > BASE_BACKUP command? > > I don't think I'd want to do exactly that, because then you could only > send one LSN, and I do think we want to send a set of LSN ranges with > the corresponding TLI for each. I was thinking about dumping > UPLOAD_MANIFEST and instead having a command like: > > INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698 > > The client would execute this command one or more times before > starting an incremental backup. That sounds good to me. Not having to parse the manifest server-side sounds like a win, as does saving the transfer, for the cases where the manifest is large. Is this meant to support multiple timelines each with non-overlapping adjacent ranges, rather than multiple non-adjacent ranges? Do I have it right that you want to rewrite this bit before considering this ready to commit? -- Álvaro Herrera   Breisgau, Deutschland — https://www.EnterpriseDB.com/ "We do not dare many things because they are difficult, but they are difficult because we do not dare to do them" (Seneca)
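To make the proposed wire format concrete, here is a hypothetical parser for the command shown above. The command name and the TLI-plus-LSN-range layout follow Robert's example; nothing here reflects a committed grammar, and the function name is invented:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parser for the proposed replication command
 * "INCREMENTAL_WAL_RANGE tli start_lsn end_lsn", where each LSN uses
 * the usual %X/%X high/low split. */
static int
parse_incremental_wal_range(const char *line, unsigned int *tli,
                            uint64_t *start_lsn, uint64_t *end_lsn)
{
    unsigned int shi, slo, ehi, elo;

    if (sscanf(line, "INCREMENTAL_WAL_RANGE %u %X/%X %X/%X",
               tli, &shi, &slo, &ehi, &elo) != 5)
        return 0;               /* malformed command */

    *start_lsn = ((uint64_t) shi << 32) | slo;
    *end_lsn = ((uint64_t) ehi << 32) | elo;
    return 1;
}
```

A client would issue one such command per timeline range before starting the incremental backup.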
Re: Hide exposed impl detail of wchar.c
On Fri, Nov 17, 2023 at 2:26 AM John Naylor wrote: > > On Fri, Nov 17, 2023 at 5:54 AM Nathan Bossart > wrote: > > > > It looks like is_valid_ascii() was originally added to pg_wchar.h so that > > it could easily be used elsewhere [0] [1], but that doesn't seem to have > > happened yet. > > > > Would moving this definition to a separate header file be a viable option? > > Seems fine to me. (I believe the original motivation for making it an > inline function was for use in pg_mbstrlen_with_len(), but trying that > hasn't been a priority.) In that case, I took a look across the codebase and saw a utils/ascii.h that doesn't seem to have gotten much love, but I suppose one could argue that it's intended to be a backend-only header file? As the codebase is growing some enhanced UTF-8 support, you'll want somewhere that contains the optimized US-ASCII routines, because US-ASCII is a subset of UTF-8 and often faster to handle, so it's typical for such codepaths to look like

```c
while (i < len && no_multibyte_chars)
{
    i = i + ascii_op_version(i, buffer, &no_multibyte_chars);
}
while (i < len)
{
    i = i + utf8_op_version(i, buffer);
}
```

So it should probably end up living somewhere near the UTF-8 support, and the easiest way to make it not go into something pgrx currently includes would be to make it a new header file, though there's a fair amount of API we don't touch. From the pgrx / Rust perspective, Postgres function calls are passed via callback to a "guard function" that guarantees that longjmp and setjmp don't cause trouble (and makes sure we participate in that). So we only want to call Postgres functions if we "can't replace" them, as the overhead is quite a lot. That means UTF-8-per-se functions aren't very interesting to us as the Rust language already supports it, but we do benefit from access to transcoding to/from UTF-8. —Jubilee
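To make the two-loop shape above concrete, here is a self-contained version with character counting standing in for the real per-byte operation. All names are illustrative, not actual PostgreSQL routines; the fast path here is a one-shot ASCII prefix scan that falls back to a minimal UTF-8 walk:

```c
#include <stddef.h>

/* Sketch of the two-loop pattern: handle a run of US-ASCII bytes
 * cheaply, then fall back to multibyte-aware code.  Here the "op" is
 * just counting characters; real callers would do validation,
 * case-mapping, etc. */
static size_t
count_chars(const unsigned char *buf, size_t len)
{
    size_t i = 0, nchars = 0;

    /* fast path: every byte < 0x80 is exactly one ASCII character */
    while (i < len && buf[i] < 0x80)
    {
        i++;
        nchars++;
    }
    /* slow path: count UTF-8 characters by skipping continuation
     * bytes (those of the form 10xxxxxx) */
    while (i < len)
    {
        if ((buf[i] & 0xC0) != 0x80)
            nchars++;
        i++;
    }
    return nchars;
}
```

The point of splitting the loops is that the ASCII case needs no table lookups or state, which is why a dedicated header for the US-ASCII routines near the UTF-8 support makes sense.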
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 12:54 PM David Steele wrote: > Another thing we could do is explicitly error if we see backup_label in > PGDATA during recovery. That's just a few lines of code so would not be > a big deal to maintain. This error would only be visible on restore, so > it presumes that backup software is being tested. I think that if we do decide to adopt this proposal, that would be a smart precaution. > A little hard to add to the tests, I think, since they are purely > informational, i.e. not pushed up by the parser. Maybe we could just > grep for the fields? Hmm. Or should they be pushed up by the parser? > I think these fields would be handled the same as the rest of the fields > in backup_label: returned from pg_backup_stop() and also stored in > backup_manifest. Third-party software can do as they like with them and > pg_combinebackup can just read from backup_manifest. I think that would be a bad plan, because this is critical information, and a backup manifest is not a thing that you're required to have. It's not a natural fit at all. We don't want to create a situation where if you nuke the backup_manifest then the server forgets that what it has is an incremental backup rather than a usable data directory. > We absolutely need more people to look at this and sign off. I'm glad > they have not so far because it has allowed time to whack the patch > around and get it into better shape. Cool. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use of backup_label not noted in log
Hi, On 2023-11-20 17:30:31 +0900, Michael Paquier wrote: > On Sat, Nov 18, 2023 at 01:49:15PM -0800, Andres Freund wrote: > > Note that the LSN in the "continuing" case is the one the backup started at, > > not where recovery will start. > > > > I've wondered whether it's worth also adding an explicit message just after > > ReachedEndOfBackup(), but it seems far less urgent due to the existing > > "consistent recovery state reached at %X/%X" message. > > Upgrading the surrounding DEBUG1 to a LOG is another option, but I > agree that I've seen less of that as being an actual problem in the field > compared to the famous I-removed-a-backup-label-and-I-m-still-up, > until this user sees signs of corruption after recovery was finished, > sometimes days after putting back an instance online. "end of backup reached" could scare users; it doesn't obviously indicate something "good". "completed backup recovery, started at %X/%X" or such would be better imo. > + if (ControlFile->backupStartPoint != InvalidXLogRecPtr) > + ereport(LOG, > + (errmsg("continuing to start from base backup with redo > LSN %X/%X", > + LSN_FORMAT_ARGS(ControlFile->backupStartPoint; > > "Continuing to start" sounds a bit weird to me, though, considering > that there are a few LOGs that say "starting" when there is a signal > file, but I don't have a better idea on top of my mind. So that > sounds OK here. I didn't like it much either - but I like David's proposal in his sibling reply: LOG: starting backup recovery with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1 LOG: restarting backup recovery with redo LSN A/34100028 and adding the message from above: LOG: completing backup recovery with redo LSN A/34100028 Greetings, Andres Freund
Re: Use of backup_label not noted in log
Hi, On 2023-11-20 11:24:25 -0500, Robert Haas wrote: > I do also think it is worth considering how this proposal interacts > with the proposal to remove backup_label. If that proposal goes > through, then this proposal is obsolete, I believe. I think it's the opposite, if anything. Today you can at least tell there was use of a backup_label by looking for backup_label.old and you can verify fairly easily in a restore script that backup_label is present. If we "just" use pg_control, neither of those is as easy. I.e. log messages would be more important, not less. Depending on how things work out, we might need to reformulate and/or move them a bit, but that doesn't seem like a big deal. > But if this is a good idea, does that mean that's not a good idea? Or would > we try to make the pg_control which that patch would drop in place have some > internal difference which we could use to drive a similar log message? I think we absolutely have to. If there's no way to tell whether an "external" pg_backup_start/stop() procedure actually used the proper pg_control, it'd make the situation substantially worse compared to today's, already bad, situation. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 12:11, Robert Haas wrote: On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier wrote: (I am not exactly sure how, but we've lost pgsql-hackers on the way when you sent v5. Now added back in CC with the two latest patches you've proposed attached.) Here is a short summary of what has been missed by the lists: - I've commented that the patch should not create, show up in fields returned by the SQL functions, or stream control files with a size of 512B; it should just stick to 8kB. If this is worth changing, it should be applied consistently across the board, including initdb, and discussed on its own thread. - The backup-related fields in the control file are reset at the end of recovery. I've suggested not doing that, to keep a trace of what was happening during recovery. The latest version of the patch resets the fields. - With the backup_label file gone, we lose some information in the backups themselves, which is not good. Instead, you have suggested an approach where this data is added to the backup manifest, meaning that no information would be lost, particularly useful for self-contained backups. The fields planned to be added to the backup manifest are: -- The start and end time of the backup, the end timestamp being useful to know when stop time can be used for PITR. -- The backup label. I've agreed that it may be the best thing to do at this end to not lose any data related to the removal of the backup_label file. I think we need more votes to make a change this big. I have a concern, which I think I've expressed before, that we keep whacking around the backup APIs, and that has a cost which is potentially larger than the benefits. From my perspective it's not that big a change for backup software but it does bring a lot of benefits, including fixing an outstanding bug in Postgres, i.e. reading pg_control without getting a torn copy. The last time we changed the API, we changed pg_stop_backup to pg_backup_stop, but this doesn't do that, and I wonder if that's OK. 
Even if it is, do we really want to change this API around again after such a short time? This is a good point. We could just rename again, but not sure what names to go for this time. OTOH if the backup software is selecting fields then they will get an error because the names have changed. If the software is grabbing fields by position then they'll get a valid-looking result (even if querying by position is a terrible idea). Another thing we could do is explicitly error if we see backup_label in PGDATA during recovery. That's just a few lines of code so would not be a big deal to maintain. This error would only be visible on restore, so it presumes that backup software is being tested. Maybe just a rename to something like pg_backup_begin/end would be the way to go. That said, I don't have an intrinsic problem with moving this information from the backup_label to the backup_manifest file since it is purely informational. I do think there should perhaps be some additions to the test cases, though. A little hard to add to the tests, I think, since they are purely informational, i.e. not pushed up by the parser. Maybe we could just grep for the fields? I am concerned about the interaction of this proposal with incremental backup. When you take an incremental backup, you get something that looks a lot like a usable data directory but isn't. To prevent that from causing avoidable disasters, the present version of the patch adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the backup_label. pg_combinebackup knows to look for those fields, and the server knows that if they are present it should refuse to start. With this change, though, I suppose those fields would end up in pg_control. But that does not feel entirely great, because we have a goal of keeping the amount of real data in pg_control below 512 bytes, the traditional sector size, and this adds another 12 bytes of stuff to that file that currently doesn't need to be there. 
I feel like that's kind of a problem. I think these fields would be handled the same as the rest of the fields in backup_label: returned from pg_backup_stop() and also stored in backup_manifest. Third-party software can do as they like with them and pg_combinebackup can just read from backup_manifest. As for the pg_control file -- it might be best to give it a different name for backups that are not essentially copies of PGDATA. On the other hand, pgBackRest has included pg_control in incremental backups since day one and we've never had a user mistakenly do a manual restore of one and cause a problem (though manual restores are not the norm). Still, probably can't hurt to be a bit careful. But my main point here is ... if we have a few more senior hackers weigh in and vote in favor of this change, well then that's one thing. But IMHO a discussion that's mostly between 2 people is not nearly a strong enough consensus to justify this amount of disruption.
Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500
Richard Guo writes: > On Fri, Nov 17, 2023 at 11:38 AM Tom Lane wrote: >> That line of argument also leads to the conclusion that it'd be >> okay to expose info about the ordering of the CTE result to the >> upper planner. > In the light of this conclusion, I had a go at propagating the pathkeys > from CTEs up to the outer planner and came up with the attached. Oh, nice! I remembered we had code already to do this for regular SubqueryScans, but I thought we'd need to do some refactoring to apply it to CTEs. I think you are right though that convert_subquery_pathkeys can be used as-is. Some thoughts: * Do we really need to use make_tlist_from_pathtarget? Why isn't the tlist of the cteplan good enough (indeed, more so)? * I don't love having this code assume that it knows how to find the Path the cteplan was made from. It'd be better to make SS_process_ctes save that somewhere, maybe in a list paralleling root->cte_plan_ids. Alternatively: maybe it's time to do what the comments in SS_process_ctes vaguely speculate about, and just save the Path at that point, with construction of the plan left for createplan()? That might be a lot of refactoring for not much gain, so not sure. regards, tom lane
Re: Use of backup_label not noted in log
On 11/20/23 12:24, Robert Haas wrote: On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe wrote: I can accept that adding log messages to back branches is ok. Perhaps I am too nervous about things like that, because as an extension developer I have been bitten too often by ABI breaks in minor releases in the past. I think that adding a log message to the back branches would probably make my life better not worse, because when people do strange things and then send me the log messages to figure out what the heck happened, it would be there, and I'd have a clue. However, the world doesn't revolve around me. I can imagine users getting spooked if a new message appears that they've never seen before, and I think that risk should be considered. There are good reasons for keeping the back-branches stable, and as you said before, this isn't a bug fix. Personally I think that the value of the information outweighs the weirdness of a new message appearing. I do also think it is worth considering how this proposal interacts with the proposal to remove backup_label. If that proposal goes through, then this proposal is obsolete, I believe. Not at all. I don't even think the messages will need to be reworded, or not much, since they don't mention backup_label. But if this is a good idea, does that mean that's not a good idea? Or would we try to make the pg_control which that patch would drop in place have some internal difference which we could use to drive a similar log message? The recovery in pg_control patch has all the same recovery info stored, so a similar (or the same) log message would still be appropriate. Maybe we should, because knowing whether or not the user followed the backup procedure correctly would indeed be a big help and it would be regrettable to gain that capability only to lose it again... The info is certainly valuable and we wouldn't lose it, unless there is something I'm not getting. Regards, -David
Re: On non-Windows, hard depend on uselocale(3)
Thomas Munro writes: > If we are sure that we'll *never* want locale-aware printf-family > functions (ie we *always* want "C" locale), then in the thought > experiment above where I suggested we supply replacement _l() > functions, we could just skip that for the printf family, but make > that above comment actually true. Perhaps with Ryu, but otherwise by > punting to libc _l() or uselocale() save/restore. It is pretty annoying that we've got that shiny Ryu code and can't use it here. From memory, we did look into that and concluded that Ryu wasn't amenable to providing "exactly this many digits" as is required by most variants of printf's conversion specs. But maybe somebody should go try harder. (Worst case, you could do rounding off by hand on the produced digit string, but that's ugly...) regards, tom lane
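The "rounding off by hand" fallback Tom mentions can be illustrated with a toy routine: given the digit string a Ryu-style shortest-representation printer emits, trim it to an exact digit count and propagate the carry. This is purely a sketch of the idea, not code from the tree:

```c
#include <string.h>

/* Toy illustration of rounding a decimal digit string in place to
 * ndigits digits, half-up, propagating carries leftward.  Returns 1
 * if the carry fell off the front (e.g. "9999..." rounding up), in
 * which case the caller would have to prepend a '1' and bump the
 * exponent. */
static int
round_digit_string(char *digits, int ndigits)
{
    int     len = (int) strlen(digits);
    int     carry;

    if (len <= ndigits)
        return 0;               /* already short enough */

    carry = (digits[ndigits] >= '5');
    digits[ndigits] = '\0';     /* truncate to ndigits */

    for (int i = ndigits - 1; carry && i >= 0; i--)
    {
        if (digits[i] == '9')
            digits[i] = '0';    /* carry continues leftward */
        else
        {
            digits[i]++;
            carry = 0;
        }
    }
    return carry;
}
```

A real implementation would also have to handle the overflow case by prepending the carried 1 and adjusting the exponent, which is part of what makes this path ugly.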
Re: should check collations when creating partitioned index
Peter Eisentraut writes: > On 14.11.23 17:15, Tom Lane wrote: >> I don't love the patch details though. It seems entirely wrong to check >> this before we check the opclass match. > Not sure why? The order doesn't seem to matter? The case that was bothering me was if we had a non-collated type versus a collated type. That would result in throwing an error about collation mismatch, when complaining about the opclass seems more apropos. However, if we do this: > I see. That means we shouldn't raise an error on a mismatch but just do > if (key->partcollation[i] != collationIds[j]) > continue; it might not matter much. regards, tom lane
Re: Use of backup_label not noted in log
On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe wrote: > I can accept that adding log messages to back branches is ok. > Perhaps I am too nervous about things like that, because as an extension > developer I have been bitten too often by ABI breaks in minor releases > in the past. I think that adding a log message to the back branches would probably make my life better not worse, because when people do strange things and then send me the log messages to figure out what the heck happened, it would be there, and I'd have a clue. However, the world doesn't revolve around me. I can imagine users getting spooked if a new message appears that they've never seen before, and I think that risk should be considered. There are good reasons for keeping the back-branches stable, and as you said before, this isn't a bug fix. I do also think it is worth considering how this proposal interacts with the proposal to remove backup_label. If that proposal goes through, then this proposal is obsolete, I believe. But if this is a good idea, does that mean that's not a good idea? Or would we try to make the pg_control which that patch would drop in place have some internal difference which we could use to drive a similar log message? Maybe we should, because knowing whether or not the user followed the backup procedure correctly would indeed be a big help and it would be regrettable to gain that capability only to lose it again... -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add recovery to pg_control and remove backup_label
On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier wrote: > (I am not exactly sure how, but we've lost pgsql-hackers on the way > when you sent v5. Now added back in CC with the two latest patches > you've proposed attached.) > > Here is a short summary of what has been missed by the lists: > - I've commented that the patch should not create, show up in > fields returned by the SQL functions, or stream control files with a size > of 512B; it should just stick to 8kB. If this is worth changing, it should be > applied consistently across the board, including initdb, and discussed on > its own thread. > - The backup-related fields in the control file are reset at the end > of recovery. I've suggested not doing that, to keep a trace of what > was happening during recovery. The latest version of the patch resets > the fields. > - With the backup_label file gone, we lose some information in the > backups themselves, which is not good. Instead, you have suggested an > approach where this data is added to the backup manifest, meaning that > no information would be lost, particularly useful for self-contained > backups. The fields planned to be added to the backup manifest are: > -- The start and end time of the backup, the end timestamp being > useful to know when stop time can be used for PITR. > -- The backup label. > I've agreed that it may be the best thing to do at this end to not > lose any data related to the removal of the backup_label file. I think we need more votes to make a change this big. I have a concern, which I think I've expressed before, that we keep whacking around the backup APIs, and that has a cost which is potentially larger than the benefits. The last time we changed the API, we changed pg_stop_backup to pg_backup_stop, but this doesn't do that, and I wonder if that's OK. Even if it is, do we really want to change this API around again after such a short time? 
That said, I don't have an intrinsic problem with moving this information from the backup_label to the backup_manifest file since it is purely informational. I do think there should perhaps be some additions to the test cases, though. I am concerned about the interaction of this proposal with incremental backup. When you take an incremental backup, you get something that looks a lot like a usable data directory but isn't. To prevent that from causing avoidable disasters, the present version of the patch adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the backup_label. pg_combinebackup knows to look for those fields, and the server knows that if they are present it should refuse to start. With this change, though, I suppose those fields would end up in pg_control. But that does not feel entirely great, because we have a goal of keeping the amount of real data in pg_control below 512 bytes, the traditional sector size, and this adds another 12 bytes of stuff to that file that currently doesn't need to be there. I feel like that's kind of a problem. But my main point here is ... if we have a few more senior hackers weigh in and vote in favor of this change, well then that's one thing. But IMHO a discussion that's mostly between 2 people is not nearly a strong enough consensus to justify this amount of disruption. -- Robert Haas EDB: http://www.enterprisedb.com
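The 512-byte concern can be made concrete: pg_control is kept small so that the meaningful portion fits in one traditional disk sector and a write of it is effectively atomic, and the server guards a budget of that kind with a compile-time assertion. Below is an invented mock of the idea; the field names, padding layout, and struct are illustrative only, while the real struct is ControlFileData:

```c
#include <stdint.h>

/* Invented stand-in for ControlFileData, just to show the sector-size
 * budget that 12 more bytes of incremental-backup state would eat into. */
typedef struct
{
    uint64_t checkPoint;
    uint64_t backupStartPoint;
    uint64_t backupEndPoint;
    /* the additions discussed in the thread: 12 bytes */
    uint64_t incrementalFromLSN;
    uint32_t incrementalFromTLI;
    /* remaining budget within one 512-byte sector */
    char     padding[512 - 3 * 8 - 12];
} MockControlFile;

/* Fails to compile if the "real data" ever outgrows a sector. */
_Static_assert(sizeof(MockControlFile) <= 512,
               "control data must fit in a 512-byte sector");
```

With an assertion like this, adding fields is safe only while the padding can absorb them, which is why every extra 12 bytes draws scrutiny.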
[PATCH] fix race condition in libpq (related to ssl connections)
Hi, I've found a race condition in libpq. It is about the initialization of the my_bio_methods static variable in fe-secure-openssl.c, which is not protected by any lock. The race condition may make the initialization of the connection fail, and as an additional weird consequence, it might cause OpenSSL to call close(0), so stdin of the client application gets closed. I've prepared a patch to protect the initialization of my_bio_methods from the race. This is my first patch submission to the PostgreSQL project, so I hope I didn't miss anything. Any comments and suggestions are of course very welcome. I also prepared a testcase. In the testcase tarball, there is a patch that adds sleeps at the right positions to make the close(0) problem occur practically always. It also includes comments to explain how the race can end up calling close(0). Concerning the patch, it is only tested on Linux. I'm unsure whether the simple initialization of the mutex would also work on Windows, or whether the more complicated initialization used for ssl_config_mutex in the same source file is needed. Let me know whether I should adapt that. We discovered the problem with release 11.5, but the patch and the testcase are against the master branch. Regards, Willi -- ___ Dr. 
Willi Mann | Principal Software Engineer, Tech Lead PQL Celonis SE | Theresienstrasse 6 | 80333 Munich, Germany F: +4989416159679 w.m...@celonis.com | www.celonis.com | LinkedIn | Twitter | Xing AG Munich HRB 225439 | Management: Martin Klenk, Bastian Nominacher, Alexander RinkeFrom 721be8d6ea26d78fc086a2e85413858585ca9300 Mon Sep 17 00:00:00 2001 From: Willi Mann Date: Thu, 16 Nov 2023 11:50:27 +0100 Subject: [PATCH] libpq: Fix race condition in initialization of my_bio_methods static variable --- src/interfaces/libpq/fe-secure-openssl.c | 26 ++-- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c index f1192d28f2..e8fa3eb403 100644 --- a/src/interfaces/libpq/fe-secure-openssl.c +++ b/src/interfaces/libpq/fe-secure-openssl.c @@ -81,6 +81,7 @@ static int PQssl_passwd_cb(char *buf, int size, int rwflag, void *userdata); static int my_sock_read(BIO *h, char *buf, int size); static int my_sock_write(BIO *h, const char *buf, int size); static BIO_METHOD *my_BIO_s_socket(void); +static void my_BIO_methods_init(void); static int my_SSL_set_fd(PGconn *conn, int fd); @@ -1882,8 +1883,8 @@ my_sock_write(BIO *h, const char *buf, int size) return res; } -static BIO_METHOD * -my_BIO_s_socket(void) +static void +my_BIO_methods_init(void) { if (!my_bio_methods) { @@ -1893,11 +1894,11 @@ my_BIO_s_socket(void) my_bio_index = BIO_get_new_index(); if (my_bio_index == -1) - return NULL; + return; my_bio_index |= (BIO_TYPE_DESCRIPTOR | BIO_TYPE_SOURCE_SINK); my_bio_methods = BIO_meth_new(my_bio_index, "libpq socket"); if (!my_bio_methods) - return NULL; + return; /* * As of this writing, these functions never fail. 
But check anyway, @@ -1914,17 +1915,30 @@ my_BIO_s_socket(void) { BIO_meth_free(my_bio_methods); my_bio_methods = NULL; - return NULL; + return; } #else my_bio_methods = malloc(sizeof(BIO_METHOD)); if (!my_bio_methods) - return NULL; + return; memcpy(my_bio_methods, biom, sizeof(BIO_METHOD)); my_bio_methods->bread = my_sock_read; my_bio_methods->bwrite = my_sock_write; #endif } +} + +static pthread_mutex_t my_BIO_methods_init_mutex = PTHREAD_MUTEX_INITIALIZER; + +static BIO_METHOD * +my_BIO_s_socket() +{ + if (pthread_mutex_lock(&my_BIO_methods_init_mutex) != 0) + { + return NULL; + } + my_BIO_methods_init(); + pthread_mutex_unlock(&my_BIO_methods_init_mutex); + return my_bio_methods; } -- 2.39.2 testcase.tar.gz Description: application/gzip
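An alternative to the statically-initialized mutex used in the patch is pthread_once, which also guarantees one-shot initialization under concurrent callers. A sketch with placeholder names follows (nothing here is libpq code; "resource" stands in for my_bio_methods). The trade-off is that a failed init cannot easily be retried on a later call, which may be one reason a plain mutex fits this case better:

```c
#include <pthread.h>
#include <stddef.h>

/* Placeholder one-shot initialization via pthread_once. */
static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static void *resource = NULL;
static int init_calls = 0;

static void
init_resource(void)
{
    static int dummy;

    init_calls++;               /* runs exactly once, ever */
    resource = &dummy;          /* real code would allocate here */
}

static void *
get_resource(void)
{
    if (pthread_once(&init_once, init_resource) != 0)
        return NULL;            /* pthread_once itself failed */
    return resource;            /* NULL if the init routine failed */
}
```

Either way, the key property is the same as in the patch: no two threads can observe a half-built my_bio_methods.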
Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans
Thanks. Our project is at https://github.com/tensorchord/pgvecto.rs. A custom index is implemented for the vector similarity search, which implements `amgettuples` with direction support to provide candidates for the order by clause. And we want to inject the filter condition using bitmap into the amgettuples process, instead of checking the tuples one by one to accelerate the whole process. Best Jinjing Zhou > From: "Tomas Vondra" > Date: Mon, Nov 20, 2023, 18:52 > Subject: Re: Inquiry on Generating Bitmaps from Filter Conditions in Index > Scans > To: "Jinjing Zhou", > "pgsql-hackers@lists.postgresql.org" > On 11/19/23 18:19, Jinjing Zhou wrote: > > Hi hackers, > > > > I hope this message finds you well. I am reaching out to seek guidance > > on a specific aspect of PostgreSQL's index scanning functionality. > > > > I am currently working on a vector search extension for postgres, where > > I need to generate bitmaps based on filter conditions during an index > > scan. The goal is to optimize the query performance by efficiently > > identifying the rows that meet the given criteria. > > > > The query plan looks like this > > > > Index Scan using products_feature_idx on products (cost=0.00..27.24 > > rows=495 width=12) > > Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector) > > Filter: ((price > '0.2'::double precision) AND (price <= > > '0.7'::double precision)) > > > > > > We have a custom index for the order by clause on the feature column. > > Now we want to utilize the index on other columns like price column. We > > want to access the bitmap of price column's filter condition in the > > feature column index. Is there any way I can achieve this goal? > > > > Any help or guidance is appreciated! > > > > I suppose you'll need to give more details about what exactly are you > trying to achieve, what you tried, maybe some code examples, etc. Your > question is quite vague, and it's unclear what "bitmaps generated on > filter conditions" or "custom index" means. 
> > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans
Thanks a lot! This is exactly what I'm asking. We've tried the CustomScanAPI at https://github.com/tensorchord/pgvecto.rs/pull/126, but met error with "variable not found in subplan target list". We're still investigating the root cause and thanks for your guidance! Best Jinjing Zhou > From: "Matthias van de Meent" > Date: Mon, Nov 20, 2023, 19:33 > Subject: Re: Inquiry on Generating Bitmaps from Filter Conditions in Index > Scans > To: "Jinjing Zhou" > Cc: "pgsql-hackers@lists.postgresql.org" > On Mon, 20 Nov 2023 at 09:30, Jinjing Zhou wrote: > > > > Hi hackers, > > > > I hope this message finds you well. I am reaching out to seek guidance on a > > specific aspect of PostgreSQL's index scanning functionality. > > > > I am currently working on a vector search extension for postgres, where I > > need to generate bitmaps based on filter conditions during an index scan. > > The goal is to optimize the query performance by efficiently identifying > > the rows that meet the given criteria. > > > > The query plan looks like this > > > > Index Scan using products_feature_idx on products (cost=0.00..27.24 > > rows=495 width=12) > > Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector) > > Filter: ((price > '0.2'::double precision) AND (price <= > > '0.7'::double precision)) > > > > > > We have a custom index for the order by clause on the feature column. Now > > we want to utilize the index on other columns like price column. We want to > > access the bitmap of price column's filter condition in the feature column > > index. Is there any way I can achieve this goal? > > If you mean "I'd like to use bitmaps generated by combining filter > results from index A, B, and C for (pre-)filtering the ordered index > lookups in index D", > then there is no current infrastructure to do this. Bitmap scans > currently generate a data structure that is not indexable, and can > thus not be used efficiently to push an index's generated bitmap into > another bitmap's scans. 
> > There are efforts to improve the data structures we use for storing > TIDs during vacuum [0] which could extend to the TID bitmap structure, > but even then we'd need some significant effort to rewire Postgres' > internals to push down the bitmap filters; and that is even under the > assumption that pushing the bitmap down into the index AM is more > efficient than doing the merges above the index AM and then re-sorting > the data. > > So, in short, it's not currently available in community PostgreSQL. > You could probably create a planner hook + custom executor node that > does this, but it won't be able to use much of the features available > inside PostgreSQL. > > Kind regards, > > Matthias van de Meent > > [0] > https://www.postgresql.org/message-id/flat/CANWCAZbrZ58-w1W_3pg-0tOfbx8K41_n_03_0ndGV78hJWswBA%2540mail.gmail.com
Re: Show WAL write and fsync stats in pg_stat_io
Hi, Thanks for the feedback. On Mon, 20 Nov 2023 at 10:47, Michael Paquier wrote: > > On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote: > > There are some differences between pg_stat_wal and pg_stat_io while > > collecting WAL stats. For example in the XLogWrite() function in the > > xlog.c file, pg_stat_wal counts wal_writes as write system calls. This > > is not something we want for pg_stat_io since pg_stat_io counts the > > number of blocks rather than the system calls, so instead incremented > > pg_stat_io by npages. > > > > Could that cause a problem since pg_stat_wal's behaviour will be > > changed? Of course, as an alternative we could change pg_stat_io's > > behaviour but in the end either pg_stat_wal's or pg_stat_io's > > behaviour will be changed. > > Yep, that could be confusing for existing applications that track the > information of pg_stat_wal. The number of writes is not something > that can be correctly shared between both. The timings for the writes > and the syncs could be shared at least, right? Yes, the timings for the writes and the syncs should work. Another question I have in mind is the pg_stat_reset_shared() function. When we call it with 'io' it will reset pg_stat_wal's timings and when we call it with 'wal' it won't reset them, right? > > This slightly relates to pgstat_count_io_op_n() in your latest patch, > where it feels a bit weird to see an update of > PendingWalStats.wal_sync sit in the middle of a routine dedicated to > pg_stat_io.. I am not completely sure what's the right balance here, > but I would try to implement things so as pg_stat_io paths does not > need to know about PendingWalStats. Write has block vs system calls differentiation but it is the same for sync. Because of that I put PendingWalStats.wal_sync to pg_stat_io but I agree that it looks a bit weird. -- Regards, Nazir Bilal Yavuz Microsoft
Re: Use of backup_label not noted in log
On 11/20/23 06:35, Laurenz Albe wrote: If we add a message for starting with "backup_label", shouldn't we also add a corresponding message for starting from a checkpoint found in the control file? If you see that in a problem report, you immediately know what is going on. +1. It is easier to detect the presence of a message than the absence of one. Regards, -David
Re: Use of backup_label not noted in log
[Resending since I accidentally replied off-list]

On 11/18/23 17:49, Andres Freund wrote:

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:

What about adding it to the "redo starts at" message, something like

redo starts at 12/12345678, taken from control file
or
redo starts at 12/12345678, taken from backup label

I think it'd make sense to log use of backup_label earlier than that - the locations from backup_label might end up not being available in the archive, the primary or locally, and we'll error out with "could not locate a valid checkpoint record". I'd probably just do it within the if (read_backup_label()) block in InitWalRecovery(), *before* the ReadCheckpointRecord().

Not enamored with the phrasing of the log messages, but here's a prototype. When starting up with backup_label present:

LOG: starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1

I'd prefer something like:

LOG: starting backup recovery with redo...

When restarting before reaching the end of the backup, but after backup_label has been removed:

LOG: continuing to start from base backup with redo LSN A/34100028
LOG: entering standby mode
LOG: redo starts at A/3954B958

And here:

LOG: restarting backup recovery with redo...

Note that the LSN in the "continuing" case is the one the backup started at, not where recovery will start.

I've wondered whether it's worth also adding an explicit message just after ReachedEndOfBackup(), but it seems far less urgent due to the existing "consistent recovery state reached at %X/%X" message.

I think the current message is sufficient, but what do you have in mind?

We are quite inconsistent about how we spell LSNs. Sometimes with LSN preceding, sometimes not. Sometimes with (LSN). Etc.

Well, this could be improved in HEAD for sure.

I do like the idea of expanding the "redo starts at" message though. E.g.
including minRecoveryLSN, ControlFile->backupStartPoint, and ControlFile->backupEndPoint would provide information about when the node might become consistent.

+1

Playing around with this a bit, I'm wondering if we instead should remove that message, and emit something more informative earlier on. If there's a problem, you kinda want the information before we finally get to the loop in PerformWalRecovery(). If e.g. there's no WAL you'll only get

LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record

which is delightfully lacking in details.

I've been thinking about improving this myself. It would probably also help a lot to hint that restore_command may be missing or not returning results (but also not erroring). But there are a bunch of ways to get to this message, so we'd need to be careful.

There also are some other oddities: If the primary is down when starting up, and we'd need WAL from the primary for the first record, the "redo starts at" message is delayed until that happens, because we emit the message not before we read the first record, but after. That's just plain odd.

Agreed. Moving it up would be better.

And sometimes we'll start referencing the LSN at which we are starting recovery before the "redo starts at" message. If e.g. we shut down at a restart point, we'll emit

LOG: consistent recovery state reached at ...

before

LOG: redo starts at ...

Huh, I haven't seen that one. Definitely confusing.

But that's all clearly just material for HEAD.

Absolutely. I've been thinking about some of this as well, but want to see if we can remove the backup label first so we don't have to rework a bunch of stuff. Of course, that shouldn't stop you from proceeding. I'm sure anything you are thinking of here could be adapted.

Regards,
-David
Re: Add recovery to pg_control and remove backup_label
On 11/19/23 21:15, Michael Paquier wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way when you sent v5. Now added back in CC with the two latest patches you've proposed attached.)

Ugh, I must have hit reply instead of reply-all. It's a rookie error and you hate to see it.

Here is a short summary of what has been missed by the lists:

- I've commented that the patch should not create control files with a size of 512B, show them in fields returned by the SQL functions, or stream them; just stick to 8kB. If this is worth changing, it should be applied consistently across the board including initdb, and discussed on its own thread.

- The backup-related fields in the control file are reset at the end of recovery. I've suggested not doing that, to keep a trace of what was happening during recovery. The latest version of the patch resets the fields.

- With the backup_label file gone, we lose some information in the backups themselves, which is not good. Instead, you have suggested an approach where this data is added to the backup manifest, meaning that no information would be lost; this is particularly useful for self-contained backups. The fields planned to be added to the backup manifest are:
  -- The start and end time of the backup, the end timestamp being useful to know when the stop time can be used for PITR.
  -- The backup label.

I've agreed that it may be the best thing to do in the end, so as not to lose any data related to the removal of the backup_label file.

This looks right to me.

On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote:

On 11/15/23 20:03, Michael Paquier wrote:

As the label is only an informational field, the parsing added to pg_verifybackup is not really needed because it is used nowhere in the validation process, so keeping the logic simpler would be the way to go IMO. This is contrary to the WAL range, for example, where the start and end LSNs are used for validation with a pg_waldump command.
Robert, any comments about the addition of the label in the manifest?

I'm sure Robert will comment on this when he gets the time, but for now I have backed off on passing the new info to pg_verifybackup and added start/stop time.

FWIW, I'm OK with the bits for the backup manifest as presented. So if there are no remarks and/or no objections, I'd like to apply it, but let's give some room to others to comment on that, as there's been a gap in the emails exchanged on pgsql-hackers. I hope that the summary I've posted above covers everything. So let's see about doing something around the middle of next week. With Thanksgiving in the US, a lot of folks will not have the time to monitor what's happening on this thread.

Timing sounds good to me.

+ The end time for the backup. This is when the backup was stopped in
+ PostgreSQL and represents the earliest time
+ that can be used for time-based Point-In-Time Recovery.

This one is actually a very good point. With the backup_label file gone, we'd lose this capability without the end timestamp in the control file.

Yeah, the end time is very important for recovery scenarios. We definitely need that recorded somewhere.

I've noticed on the other thread the remark about being less aggressive with the fields related to recovery in the control file, so I assume that this patch should leave the fields be after the end of recovery from the start, and only rely on backupRecoveryRequired to decide if the recovery should use the fields or not: https://www.postgresql.org/message-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@pgmasters.net

+ ControlFile->backupCheckPoint = InvalidXLogRecPtr;
  ControlFile->backupStartPoint = InvalidXLogRecPtr;
+ ControlFile->backupStartPointTLI = 0;
  ControlFile->backupEndPoint = InvalidXLogRecPtr;
+ ControlFile->backupFromStandby = false;
  ControlFile->backupEndRequired = false;

Still, I get the temptation of being consistent with the current style on HEAD to reset everything, as well.
I'd rather reset everything for now (as we do now) and think about keeping these values as a separate patch. It may be that we don't want to keep all of them, or we need a separate flag to say recovery was completed. We are accumulating a lot of booleans here, maybe we need a state var (recoveryRequired, recoveryInProgress, recoveryComplete) and then define which other vars are valid in each state. Regards, -David
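The "state var" idea above could look something like this minimal sketch; RecoveryState and recovery_state_from_flags() are invented names for illustration, not PostgreSQL code, and the real design would have to decide which control-file fields are valid in each state.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: collapse the accumulating recovery booleans
 * into a single state, as suggested in the discussion above.
 */
typedef enum RecoveryState
{
    RECOVERY_NOT_REQUIRED,  /* no backup recovery pending */
    RECOVERY_REQUIRED,      /* backupRecoveryRequired set, redo not started */
    RECOVERY_IN_PROGRESS,   /* redo running, backup fields still in use */
    RECOVERY_COMPLETE       /* end of backup reached, fields historical */
} RecoveryState;

/* Map the existing boolean flags onto the proposed single state. */
static RecoveryState
recovery_state_from_flags(bool backup_recovery_required,
                          bool redo_started,
                          bool reached_end_of_backup)
{
    if (!backup_recovery_required)
        return RECOVERY_NOT_REQUIRED;
    if (reached_end_of_backup)
        return RECOVERY_COMPLETE;
    return redo_started ? RECOVERY_IN_PROGRESS : RECOVERY_REQUIRED;
}
```

One advantage of a single enum over independent booleans is that impossible combinations (e.g. "end of backup reached" without "recovery required") simply cannot be represented.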
Re: Synchronizing slots from primary to standby
On 11/20/23 11:59 AM, Amit Kapila wrote: On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand wrote: On 11/18/23 11:45 AM, Amit Kapila wrote: On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand wrote: On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand wrote:

I feel WaitForWALToBecomeAvailable may not be the best place to shut down the slotsync worker and drop slots. There could be other reasons (other than promotion), as mentioned in the comments in case XLOG_FROM_STREAM, to reach the code there. I thought that if the intention is to stop slotsync workers on promotion, maybe FinishWalRecovery() is a better place to do it, as it indicates the end of recovery and XLogShutdownWalRcv is also called in it.

I can see that slotsync_drop_initiated_slots() has been moved to FinishWalRecovery() in v35. That looks ok.

I was thinking: what if we just ignore creating such slots (which require the init state) in the first place? I think that can be time-consuming in some cases, but it will reduce the complexity and we can always improve such cases later if we really encounter them in the real world. I am not very sure that the added complexity is worth addressing this particular case, so I would like to know your and others' opinions.

I'm not sure I understand your point. Are you saying that we should not create slots on the standby that are "currently" reported in an 'i' state? (so just keep the 'r' and 'n' states?)

Yes.

As far as the 'i' state here, from what I see, it is currently useful for:

1. Cascading standbys, to not sync slots with state = 'i' from the first standby.
2. Easily reporting slots that did not catch up on the primary yet.
3. Avoiding inactive slots blocking the creation of "active" ones.

So not creating those slots should not be an issue for 1. (syncs are not needed on a cascading standby, as the slots are not created on the first standby yet) but is an issue for 2. (unless we provide another way to keep track of and report such slots) and 3.
(as I think we should still need to reserve WAL). I've a question: we'd still need to reserve WAL for those slots, no? If that's the case and if we don't call ReplicationSlotCreate() then ReplicationSlotReserveWal() would not work as MyReplicationSlot would be NULL. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Flushing large data immediately in pqcomm
Hi hackers,

I've been looking into ways to reduce the overhead we're having in pqcomm, and I'd like to propose a small patch to modify how socket_putmessage works. Currently socket_putmessage copies any input data into the pqcomm send buffer (PqSendBuffer), whose size is 8K. When the send buffer gets full, it's flushed, and we continue to copy more data into the send buffer until we have no data left to be sent. Since the send buffer is flushed whenever it's full, I think we are safe to say that if the size of the input data is larger than the buffer size (8K), then the buffer will be flushed at least once (or more, depending on the input size) to store all the input data.

The proposed change modifies socket_putmessage to send any data larger than 8K immediately, without copying it into the send buffer. Assuming that the send buffer would be flushed anyway due to reaching its limit, the patch just gets rid of the copy part, which seems unnecessary, and sends the data without waiting. This change affects places where pq_putmessage is used, such as pg_basebackup, COPY TO, walsender, etc.

I did some experiments to see how the patch performs. Firstly, I loaded ~5GB of data into a table [1], then ran "COPY test TO STDOUT". Here are perf results of both the patch and HEAD.

HEAD:

-   94,13%  0,22%  postgres  postgres  [.] DoCopyTo
   - 93,90% DoCopyTo
      - 91,80% CopyOneRowTo
         + 47,35% CopyAttributeOutText
         - 26,49% CopySendEndOfRow
            - 25,97% socket_putmessage
               - internal_putbytes
                  - 24,38% internal_flush
                     + secure_write
                  + 1,47% memcpy (inlined)
         + 14,69% FunctionCall1Coll
         + 1,94% appendBinaryStringInfo
         + 0,75% MemoryContextResetOnly
      + 1,54% table_scan_getnextslot (inlined)

Patch:

-   94,40%  0,30%  postgres  postgres  [.] DoCopyTo
   - 94,11% DoCopyTo
      - 92,41% CopyOneRowTo
         + 51,20% CopyAttributeOutText
         - 20,87% CopySendEndOfRow
            - 20,45% socket_putmessage
               - internal_putbytes
                  - 18,50% internal_flush (inlined)
                       internal_flush_buffer
                     + secure_write
                  + 1,61% memcpy (inlined)
         + 17,36% FunctionCall1Coll
         + 1,33% appendBinaryStringInfo
         + 0,93% MemoryContextResetOnly
      + 1,36% table_scan_getnextslot (inlined)

The patch brings a ~5% gain in socket_putmessage. I also timed pg_basebackup like:

time pg_basebackup -p 5432 -U replica_user -X none -c fast --no-manifest -D test

HEAD:
real    0m10,040s
user    0m0,768s
sys     0m7,758s

Patch:
real    0m8,882s
user    0m0,699s
sys     0m6,980s

It seems ~11% faster in this specific case. I'd appreciate any feedback/thoughts.

[1] CREATE TABLE test(id int, name text, time TIMESTAMP);
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) AS name, NOW() AS time FROM generate_series(1, 1) AS i;

Thanks,
--
Melih Mutlu
Microsoft

0001-Flush-large-data-immediately-in-pqcomm.patch
Description: Binary data
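The core of the proposed behaviour can be modelled in miniature as follows. All names and the structure here are invented for illustration; the real change lives in src/backend/libpq/pqcomm.c. Messages at least as large as the buffer skip the memcpy into the staging buffer entirely and go out directly after any pending bytes are flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of an 8K send buffer.  Small messages are memcpy'd into
 * the buffer and flushed when it fills; messages of BUF_SIZE or more
 * bypass the buffer.  flushes stands in for the secure_write() calls,
 * bytes_copied tracks the copying the proposed patch avoids.
 */
#define BUF_SIZE 8192

typedef struct SendState
{
    char   buf[BUF_SIZE];
    size_t used;            /* bytes currently buffered */
    size_t bytes_copied;    /* total bytes memcpy'd into buf */
    size_t flushes;         /* number of simulated send calls */
} SendState;

static void
flush_buffer(SendState *s)
{
    if (s->used > 0)
    {
        s->flushes++;       /* stand-in for one secure_write() */
        s->used = 0;
    }
}

static void
put_message(SendState *s, const char *data, size_t len)
{
    if (len >= BUF_SIZE)
    {
        flush_buffer(s);    /* keep ordering: pending bytes go first */
        s->flushes++;       /* send the large payload directly */
        return;
    }
    while (len > 0)
    {
        size_t n = BUF_SIZE - s->used;

        if (n > len)
            n = len;
        memcpy(s->buf + s->used, data, n);
        s->bytes_copied += n;
        s->used += n;
        data += n;
        len -= n;
        if (s->used == BUF_SIZE)
            flush_buffer(s);
    }
}
```

In this model, a 10000-byte message previously cost a 10000-byte copy plus at least one flush; with the bypass it costs one flush and no copy, which matches the reduced memcpy/internal_flush share in the perf profiles above.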
Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans
On Mon, 20 Nov 2023 at 09:30, Jinjing Zhou wrote:
> Hi hackers,
>
> I hope this message finds you well. I am reaching out to seek guidance on a specific aspect of PostgreSQL's index scanning functionality.
>
> I am currently working on a vector search extension for postgres, where I need to generate bitmaps based on filter conditions during an index scan. The goal is to optimize the query performance by efficiently identifying the rows that meet the given criteria.
>
> The query plan looks like this
>
> Index Scan using products_feature_idx on products (cost=0.00..27.24 rows=495 width=12)
>   Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector)
>   Filter: ((price > '0.2'::double precision) AND (price <= '0.7'::double precision))
>
> We have a custom index for the order by clause on the feature column. Now we want to utilize the index on other columns like the price column. We want to access the bitmap of the price column's filter condition in the feature column index. Is there any way I can achieve this goal?

If you mean "I'd like to use bitmaps generated by combining filter results from index A, B, and C for (pre-)filtering the ordered index lookups in index D", then there is no current infrastructure to do this. Bitmap scans currently generate a data structure that is not indexable, and can thus not be used efficiently to push an index's generated bitmap into another bitmap's scans.

There are efforts to improve the data structures we use for storing TIDs during vacuum [0] which could extend to the TID bitmap structure, but even then we'd need some significant effort to rewire Postgres' internals to push down the bitmap filters; and that is even under the assumption that pushing the bitmap down into the index AM is more efficient than doing the merges above the index AM and then re-sorting the data.

So, in short, it's not currently available in community PostgreSQL.
You could probably create a planner hook + custom executor node that does this, but it won't be able to use many of the features available inside PostgreSQL.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/CANWCAZbrZ58-w1W_3pg-0tOfbx8K41_n_03_0ndGV78hJWswBA%40mail.gmail.com
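As a rough illustration of what "doing the merges above the index AM" amounts to, the sketch below intersects two sorted TID streams, modelled here as sorted 64-bit keys. This is purely illustrative and is not PostgreSQL's actual TIDBitmap code; the real structure is a per-page bitmap keyed by block number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Intersect two sorted key arrays (a stand-in for combining the TID
 * sets produced by two filter conditions).  out must have room for
 * min(na, nb) entries; returns the number of matches written.
 */
static size_t
tid_intersect(const uint64_t *a, size_t na,
              const uint64_t *b, size_t nb,
              uint64_t *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else
        {
            out[n++] = a[i];    /* TID satisfies both conditions */
            i++;
            j++;
        }
    }
    return n;
}
```

The merge itself is cheap and linear; the expensive part Matthias describes is that after such a merge the surviving TIDs would still need to be re-sorted into the distance order the vector index produces, which is why pushing the filter below the index AM is attractive in the first place.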