Re: Synchronizing slots from primary to standby
Hi,

On 11/21/23 6:16 AM, Amit Kapila wrote:
> On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand wrote:
>> As far the 'i' state here, from what I see, it is currently useful for:
>> 1. Cascading standby to not sync slots with state = 'i' from the first standby.
>> 2. Easily report Slots that did not catch up on the primary yet.
>> 3. Avoid inactive slots to block "active" ones creation.
>>
>> So not creating those slots should not be an issue for 1. (sync are not
>> needed on cascading standby as not created on the first standby yet) but
>> is an issue for 2. (unless we provide another way to keep track and
>> report such slots) and 3. (as I think we should still need to reserve WAL).
>>
>> I've a question: we'd still need to reserve WAL for those slots, no?
>> If that's the case and if we don't call ReplicationSlotCreate() then
>> ReplicationSlotReserveWal() would not work as MyReplicationSlot would be NULL.
>
> Yes, we need to reserve WAL to see if we can sync the slot. We are
> currently creating an RS_EPHEMERAL slot and if we don't explicitly
> persist it when we can't sync, then it will be dropped when we do
> ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the
> loss is probably, the next time we again try to sync the slot, we need
> to again create it and may need to wait for newer restart_lsn on standby

Yeah, and doing so we'd reduce the time window to give the slot a chance
to catch up (as opposed to create it a single time and maintain an 'i' state).

> which could be avoided if we have the slot in 'i' state from the previous run.

Right.

> I don't deny the importance of having 'i' (initialized) state but was
> just trying to say that it has additional code complexity.

Right, and I think it's worth it.

> OTOH, having it may give better visibility to even users about slots
> that are not active (say manually created slots on the primary).

Agree.

All that being said, on my side I'm +1 on keeping the 'i' state behavior
as it is implemented currently (would be happy to hear others' opinions too).
Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Use of backup_label not noted in log
On Mon, 2023-11-20 at 11:03 -0800, Andres Freund wrote:
> > If we add a message for starting with "backup_label", shouldn't
> > we also add a corresponding message for starting from a checkpoint
> > found in the control file? If you see that in a problem report,
> > you immediately know what is going on.
>
> Maybe - the reason I hesitate on that is that emitting an additional log
> message when starting from a base backup just adds something "once once the
> lifetime of a node". Whereas emitting something every start obviously doesn't
> impose any limit.

The message should only be shown if PostgreSQL replays WAL, that is,
after a crash.  That would (hopefully) make it a rare message too.

Yours,
Laurenz Albe
Re: remaining sql/json patches
I looked a bit at the parser additions, because there were some concerns
expressed that they are quite big.

It looks like the parser rules were mostly literally copied from the BNF
in the SQL standard.  That's probably a reasonable place to start, but
now at the end, there is some room for simplification.

Attached are a few patches that apply on top of the 0003 patch.  (I
haven't gotten to 0004 in detail yet.)  Some explanations:

0001-Put-keywords-in-right-order.patch

This is just an unrelated cleanup.

0002-Remove-js_quotes-union-entry.patch

We usually don't want to put every single node type into the gram.y
%union.  This one can be trivially removed.

0003-Move-some-code-from-gram.y-to-parse-analysis.patch

Code like this can be postponed to parse analysis, keeping gram.y
smaller.  The error pointer loses a bit of precision, but I think that's
ok.  (There is similar code in your 0004 patch, which could be similarly
moved.)

0004-Remove-JsonBehavior-stuff-from-union.patch

Similar to my 0002.  This adds a few casts as a result, but that is the
typical style in gram.y.

0005-Get-rid-of-JsonBehaviorClause.patch

I think this two-level wrapping of the behavior clauses is both
confusing and overkill.  I was trying to just list the on-empty and
on-error clauses separately in the top-level productions (JSON_VALUE
etc.), but that led to shift/reduce errors.  So the existing rule
structure is probably ok.  But we don't need a separate node type just
to combine two values and then unpack them again shortly thereafter.
So I just replaced all this with a list.

0006-Get-rid-of-JsonCommon.patch

This is an example where the SQL standard BNF is not sensible to apply
literally.  I moved those clauses up directly into their callers, thus
removing one intermediate level of rules and also nodes.  Also, the path
name (AS name) stuff is only for JSON_TABLE, so it's not needed in this
patch.  I removed it here, but it would have to be readded in your 0004
patch.
Another thing: In your patch, JSON_EXISTS has a RETURNING clause
(json_returning_clause_opt), but I don't see that in the standard, and
also not in the Oracle or Db2 docs.  Where did this come from?

With these changes, I think the grammar complexity in your 0003 patch is
at an acceptable level.  Similar simplification opportunities exist in
the 0004 patch, but I haven't worked on that yet.  I suggest that you
focus on getting 0001..0003 committed around this commit fest and then
deal with 0004 in the next one.  (Also split up the 0005 patch into the
pieces that apply to 0003 and 0004, respectively.)

From 90cd46c91231a29a41118861d5a6122d78f93379 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Tue, 21 Nov 2023 05:19:03 +0100
Subject: [PATCH 1/6] Put keywords in right order

---
 src/backend/parser/gram.y | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 1dc3300fde..9a7058b767 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -735,7 +735,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	JOIN JSON JSON_ARRAY JSON_ARRAYAGG JSON_EXISTS JSON_OBJECT JSON_OBJECTAGG
 	JSON_QUERY JSON_SCALAR JSON_SERIALIZE JSON_VALUE
-	KEY KEYS KEEP
+	KEEP KEY KEYS
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-- 
2.42.1

From b669a45c9603a23db240cf1566b3f2e726254ac4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Tue, 21 Nov 2023 05:28:17 +0100
Subject: [PATCH 2/6] Remove js_quotes %union entry

---
 src/backend/parser/gram.y | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9a7058b767..a8cce5b00e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -280,7 +280,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	struct KeyAction *keyaction;
 	JsonBehavior *jsbehavior;
 	JsonBehaviorClause *jsbehaviorclause;
-	JsonQuotes	js_quotes;
 }

 %type stmt toplevel_stmt schema_stmt routine_body_stmt
@@ -662,12 +661,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type json_encoding_clause_opt
 	json_predicate_type_constraint
 	json_wrapper_behavior
+	json_quotes_clause_opt
 %type json_key_uniqueness_constraint_opt
 	json_object_constructor_null_clause_opt
 	json_array_constructor_null_clause_opt
 %type json_behavior
 %type json_behavior_clause_opt
-%type json_quotes_clause_opt

 /*
-- 
2.42.1

From 7fb1906bab90e539697e6d66d3f2754eb3031603 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Tue, 21 Nov 2023 05:43:10 +0100
Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
On Mon, 2023-11-20 at 22:50 -0600, Nathan Bossart wrote:
> I'm mostly thinking out loud here, but could we just always do this?  I
> guess you might want to avoid it if your SH_EQUAL is particularly expensive
> and you know repeated lookups are rare, but maybe that's uncommon enough
> that we don't really care.

I like that simplehash is simple, so I'm not inclined to introduce an
always-on feature.

It would be interesting to know how often it's a good idea to turn it
on, though.  I could try turning it on for various other uses of
simplehash, and see where it tends to win.

The caller can also save the hash and pass it down, but that's not
always convenient to do.

Regards,
	Jeff Davis
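The optimization under discussion can be sketched in standalone C: remember the slot returned by the previous successful lookup and short-circuit when the same key is probed again. This is a simplified illustration of the idea, not PostgreSQL's actual simplehash.h code; all names here (hashset, last_slot, hits_via_cache) are invented for the example.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64            /* power of two, for cheap masking */

/* Toy open-addressing hash set over uint32 keys, with a one-entry
 * cache of the last successful lookup (the SH_OPTIMIZE_REPEAT idea). */
typedef struct {
    uint32_t keys[TABLE_SIZE];
    uint8_t  used[TABLE_SIZE];
    int      last_slot;          /* -1 means "no cached lookup" */
    int      hits_via_cache;     /* how often the cache short-circuited */
} hashset;

static uint32_t hash_u32(uint32_t k) { return k * 2654435761u; }

void hashset_init(hashset *h) {
    memset(h, 0, sizeof(*h));
    h->last_slot = -1;
}

void hashset_insert(hashset *h, uint32_t key) {
    uint32_t i = hash_u32(key) & (TABLE_SIZE - 1);
    while (h->used[i] && h->keys[i] != key)
        i = (i + 1) & (TABLE_SIZE - 1);    /* linear probing */
    h->used[i] = 1;
    h->keys[i] = key;
}

/* Returns 1 if key is present.  A repeated lookup of the same key pays
 * one equality check instead of hashing and probing; any other lookup
 * pays one extra comparison -- the trade-off discussed in the thread. */
int hashset_lookup(hashset *h, uint32_t key) {
    if (h->last_slot >= 0 && h->keys[h->last_slot] == key) {
        h->hits_via_cache++;
        return 1;
    }
    uint32_t i = hash_u32(key) & (TABLE_SIZE - 1);
    while (h->used[i]) {
        if (h->keys[i] == key) {
            h->last_slot = (int) i;        /* cache for next time */
            return 1;
        }
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

The cost structure makes the "always on?" question concrete: when SH_EQUAL is cheap the extra comparison is nearly free, but an expensive comparator with rare repeats would pay for the cache on every miss.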
Re: Why is hot_standby_feedback off by default?
On Tue, Nov 21, 2023 at 6:49 AM Andres Freund wrote:
>
> On 2023-11-20 16:34:47 +0700, John Naylor wrote:
> > Sounds like a TODO?
>
> WFM. I don't personally use or update TODO, as I have my doubts about its
> usefulness or state of maintenance. But please feel free to add this as a TODO
> from my end...

Yeah, I was hoping to change that, but it's been a long row to hoe.
Anyway, the above idea was added under "administration".
Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500
On Tue, Nov 21, 2023 at 1:46 AM Tom Lane wrote:
> * Do we really need to use make_tlist_from_pathtarget?  Why isn't
> the tlist of the cteplan good enough (indeed, more so)?

I think you are right.  The cteplan->targetlist is built for the CTE's
best path by build_path_tlist(), which is almost the same as
make_tlist_from_pathtarget() except that it also replaces nestloop
params.  So cteplan->targetlist is good enough here.

> * I don't love having this code assume that it knows how to find
> the Path the cteplan was made from.  It'd be better to make
> SS_process_ctes save that somewhere, maybe in a list paralleling
> root->cte_plan_ids.

Fair point.  I've updated the patch to v2 for the changes.

> Alternatively: maybe it's time to do what the comments in
> SS_process_ctes vaguely speculate about, and just save the Path
> at that point, with construction of the plan left for createplan()?
> That might be a lot of refactoring for not much gain, so not sure.

I'm not sure if this is worth the effort.  And it seems that we have
the same situation with SubLinks where we construct the plan in
subselect.c rather than createplan.c.

Thanks
Richard

Attachment: v2-0001-Propagate-pathkeys-from-CTEs-up-to-the-outer-query.patch
Typo with amtype = 's' in opr_sanity.sql
Hi all,

While rebasing a patch from 2016 related to sequence AMs (more about
that later), I've bumped on a mistake from 8586bf7ed888 in
opr_sanity.sql, as of:

+SELECT p1.oid, p1.amname, p2.oid, p2.proname
+FROM pg_am AS p1, pg_proc AS p2
+WHERE p2.oid = p1.amhandler AND p1.amtype = 's' AND

It seems to me that this has been copy-pasted on HEAD from the sequence
AM patch, but forgot to update amtype to 't'.  While that's maybe
cosmetic, I think that this could lead to unexpected results, so perhaps
there is a point in doing a backpatch?

Thoughts?
--
Michael

diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 7a6f36a6a9..7610b011d6 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -1920,7 +1920,7 @@ WHERE p1.oid = a1.amhandler AND a1.amtype = 'i' AND
 -- Check for table amhandler functions with the wrong signature
 SELECT a1.oid, a1.amname, p1.oid, p1.proname
 FROM pg_am AS a1, pg_proc AS p1
-WHERE p1.oid = a1.amhandler AND a1.amtype = 's' AND
+WHERE p1.oid = a1.amhandler AND a1.amtype = 't' AND
     (p1.prorettype != 'table_am_handler'::regtype
      OR p1.proretset
      OR p1.pronargs != 1
diff --git a/src/test/regress/sql/opr_sanity.sql b/src/test/regress/sql/opr_sanity.sql
index e2d2c70d70..2fe7b6dcc4 100644
--- a/src/test/regress/sql/opr_sanity.sql
+++ b/src/test/regress/sql/opr_sanity.sql
@@ -1223,7 +1223,7 @@ WHERE p1.oid = a1.amhandler AND a1.amtype = 'i' AND
 SELECT a1.oid, a1.amname, p1.oid, p1.proname
 FROM pg_am AS a1, pg_proc AS p1
-WHERE p1.oid = a1.amhandler AND a1.amtype = 's' AND
+WHERE p1.oid = a1.amhandler AND a1.amtype = 't' AND
     (p1.prorettype != 'table_am_handler'::regtype
      OR p1.proretset
      OR p1.pronargs != 1
Re: Synchronizing slots from primary to standby
On Tue, Nov 21, 2023 at 10:02 AM Zhijie Hou (Fujitsu) wrote: > > On Friday, November 17, 2023 7:39 PM Amit Kapila > wrote: > > > > On Thu, Nov 16, 2023 at 5:34 PM shveta malik > > wrote: > > > > > > PFA v35. > > > > > > > Review v35-0002* > > == > > Thanks for the comments. > > > 1. > > As quoted in the commit message, > > > > > If a logical slot is invalidated on the primary, slot on the standby is also > > invalidated. If a logical slot on the primary is valid but is invalidated > > on the > > standby due to conflict (say required rows removed on the primary), then > > that > > slot is dropped and recreated on the standby in next sync-cycle. > > It is okay to recreate such slots as long as these are not consumable on the > > standby (which is the case currently). > > > > > > > I think this won't happen normally because of the physical slot and > > hot_standby_feedback but probably can occur in cases like if the user > > temporarily switches hot_standby_feedback from on to off. Are there any > > other > > reasons? I think we can mention the cases along with it as well at least > > for now. > > Additionally, I think this should be covered in code comments as well. > > I will collect all these cases and update in next version. > > > > > 2. > > #include "postgres.h" > > - > > +#include "access/genam.h" > > > > Spurious line removal. > > Removed. > > > > > 3. > >A password needs to be provided too, if the sender demands > > password > >authentication. It can be provided in the > >primary_conninfo string, or in a separate > > - ~/.pgpass file on the standby server (use > > - replication as the database name). > > - Do not specify a database name in the > > - primary_conninfo string. > > + ~/.pgpass file on the standby server. > > + > > + > > + Specify dbname in > > + primary_conninfo string to allow > > synchronization > > + of slots from the primary server to the standby server. > > + This will only be used for slot synchronization. It is ignored > > + for streaming. 
> > > > Is there a reason to remove part of the earlier sentence "use > > replication as the database name"? > > Added it back. > > > > > 4. > > + enable_syncslot configuration > > parameter > > + > > + > > + > > + > > +It enables a physical standby to synchronize logical failover slots > > +from the primary server so that logical subscribers are not blocked > > +after failover. > > + > > + > > +It is enabled by default. This parameter can only be set in the > > +postgresql.conf file or on the server > > command line. > > + > > > > I think you forgot to update the documentation for the default value of this > > variable. > > Updated. > > > > > 5. > > + * a) start the logical replication workers for every enabled > > subscription > > + * when not in standby_mode > > + * b) start the slot-sync worker for logical failover slots > > synchronization > > + * from the primary server when in standby_mode. > > > > Either use a full stop after both lines or none of these. > > Added a full stop. > > > > > 6. > > +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker); > > > > There shouldn't be space between * and the worker. > > Removed, and added the type to typedefs.list. > > > > > 7. > > + if (!SlotSyncWorker->hdr.in_use) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker not initialized, " > > + "cannot attach"))); > > + } > > + > > + if (SlotSyncWorker->hdr.proc) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker is " > > + "already running, cannot attach"))); > > + } > > > > Using slot-sync in the error messages looks a bit odd to me. Can we use > > "replication slot sync worker ..." in both these and other similar > > messages? 
I > > think it would be better if we don't split the messages into multiple lines > > in > > these cases as messages don't appear too long to me. > > Changed as suggested. > > > > > 8. > > +/* > > + * Detach the worker from DSM and update 'proc' and 'in_use'. > > + * Logical replication launcher will come to know using these > > + * that the worker has shutdown. > > + */ > > +void > > +slotsync_worker_detach(int code, Datum arg) { > > > > I think the reference to DSM is leftover from the previous version of the > > patch. > > Can we change the above comments as per the new code? > > Changed. > > > > > 9. > > +static bool > > +slotsync_worker_launch() > > { > > ... > > + /* TODO: do we really need 'generation', analyse more here */ > > + worker->hdr.generation++; > > > > We should do something about this
Re: Simplify if/else logic of walsender CreateReplicationSlot
On Tue, Nov 21, 2023 at 3:57 PM Michael Paquier wrote:
>
> On Mon, Nov 20, 2023 at 05:07:38PM +0900, Michael Paquier wrote:
> > Good idea.  What you are suggesting here improves the readability of
> > this code, so +1.
>
> And applied this one, thanks!

Thanks for pushing.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Synchronizing slots from primary to standby
On Tue, Nov 21, 2023 at 10:01 AM Zhijie Hou (Fujitsu) wrote:
>
> On Saturday, November 18, 2023 6:46 PM Amit Kapila wrote:
> >
> > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand wrote:
> > >
> > > On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
> > > > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand wrote:
> > > >
> > > > I feel the WaitForWALToBecomeAvailable may not be the best place to
> > > > shutdown slotsync worker and drop slots. There could be other
> > > > reasons(other than promotion) as mentioned in comments in case
> > > > XLOG_FROM_STREAM to reach the code there. I thought if the intention
> > > > is to stop slotsync workers on promotion, maybe FinishWalRecovery()
> > > > is a better place to do it as it's indicating the end of recovery and
> > > > XLogShutdownWalRcv is also called in it.
> > >
> > > I can see that slotsync_drop_initiated_slots() has been moved in
> > > FinishWalRecovery() in v35. That looks ok.
> > >
> >
> > More Review for v35-0002*
>
> Thanks for the comments.
>
> > 1.
> > + ereport(WARNING,
> > + errmsg("skipping slots synchronization as primary_slot_name "
> > + "is not set."));
> >
> > There is no need to use a full stop at the end for WARNING messages and as
> > previously mentioned, let's not split message lines in such cases. There are
> > other messages in the patch with similar problems, please fix those as well.
>
> Adjusted.
>
> > 2.
> > +slotsync_checks()
> > {
> > ...
> > ...
> > + /* The hot_standby_feedback must be ON for slot-sync to work */
> > + if (!hot_standby_feedback)
> > + {
> > + ereport(WARNING,
> > + errmsg("skipping slots synchronization as hot_standby_feedback "
> > + "is off."));
> >
> > This message has the same problem as mentioned in the previous comment.
> > Additionally, I think either atop slotsync_checks or along with GUC check we
> > should write comments as to why we expect these values to be set for slot
> > sync to work.
>
> Added comments for these cases.
> > 3.
> > + /* The worker is running already */
> > + if (SlotSyncWorker &>hdr.in_use &&
> > + SlotSyncWorker->hdr.proc)
> >
> > The spacing for both the &&'s has problems. You need a space after the first
> > && and the second && should be in the prior line.
>
> Adjusted.
>
> > 4.
> > + LauncherRereadConfig(_slotsync);
> > +
> > }
> >
> > An empty line after LauncherRereadConfig() is not required.
> >
> > 5.
> > +static void
> > +LauncherRereadConfig(bool *ss_recheck)
> > +{
> > + char*conninfo = pstrdup(PrimaryConnInfo);
> > + char*slotname = pstrdup(PrimarySlotName);
> > + bool syncslot = enable_syncslot;
> > + bool feedback = hot_standby_feedback;
> >
> > Can we change the variable name 'feedback' to 'standbyfeedback' to make it
> > slightly more descriptive?
>
> Changed.
>
> > 6. The logic to recheck the slot_sync related parameters in
> > LauncherMain() is not very clear. IIUC, if after reload config any
> > parameter is changed, we just seem to be checking the validity of the
> > changed parameter but not restarting the slot sync worker, is that correct?
> > If so, what if dbname is changed, don't we need to restart the slot-sync
> > worker and re-initialize the connection; similarly slotname change also
> > needs some thoughts. Also, if all the parameters are valid we seem to be
> > re-launching the slot-sync worker without first stopping it which doesn't
> > seem correct, am I missing something in this logic?
>
> I think the slot sync worker will be stopped in LauncherRereadConfig() if
> GUC changed and new slot sync worker will be started in next loop in
> LauncherMain().

yes, LauncherRereadConfig will stop the worker on any parameter change
and will set recheck_slotsync(). On finding this flag as true,
LauncherMain will redo all the validations and restart slot-sync worker
if needed. Yes, we do need to stop and relaunch slotsync workers on
dbname change as well. This is currently missing in LauncherRereadConfig().
Regarding slot name change, we are already doing it, we are already
checking PrimarySlotName in LauncherRereadConfig()

> > 7.
> > @@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> > errmsg("replication slot \"%s\" was not created in this database",
> > NameStr(slot->data.name;
> >
> > + in_recovery = RecoveryInProgress();
> > +
> > + /*
> > + * Do not allow consumption of a "synchronized" slot until the standby
> > + * gets promoted. Also do not allow consumption of slots with sync_state
> > + * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be
> > + * used.
> > + */
> > + if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) ||
> > + slot->data.sync_state == SYNCSLOT_STATE_INITIATED)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > + errmsg("cannot use replication slot \"%s\" for logical decoding",
> > + NameStr(slot->data.name)),
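The consumability rule in the quoted CreateDecodingContext() hunk can be restated as a standalone predicate. This is an illustrative sketch only: the state names mirror the patch (SYNCSLOT_STATE_*), but the enum and function below are invented for the example and are not the patch's actual code, which operates on the slot struct and raises an ERROR instead of returning false.

```c
#include <stdbool.h>

/* Sync states as described in the thread ('n', 'i', 'r'). */
typedef enum {
    SYNCSLOT_STATE_NONE,        /* 'n': slot not created by slot sync */
    SYNCSLOT_STATE_INITIATED,   /* 'i': sync started, not yet caught up */
    SYNCSLOT_STATE_READY        /* 'r': fully synchronized */
} SyncSlotState;

/* A synced slot must not be consumed before the standby is promoted,
 * and an 'i'-state slot is never consumable because its synchronization
 * is incomplete -- the two conditions in the quoted if() above. */
bool
slot_is_consumable(SyncSlotState sync_state, bool in_recovery)
{
    if (in_recovery && sync_state != SYNCSLOT_STATE_NONE)
        return false;
    if (sync_state == SYNCSLOT_STATE_INITIATED)
        return false;
    return true;
}
```

Note that after promotion (in_recovery = false) a fully synced 'r' slot becomes consumable, while an 'i' slot stays blocked in either case.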
Re: Synchronizing slots from primary to standby
Here are some review comments for the patch v35-0001. == 0. GENERAL documentation I felt that the documentation gave details of the individual changes (e.g. GUC 'standby_slot_names' and API, CREATE SUBSCRIPTION option, and pg_replication_slots 'failover' attribute etc.) but there is nothing that seemed to bring all these parts together to give examples for user "when" and "how" to make all these parts work. I'm not sure if there is some overview missing from this patch 1 or if you are planning that extra documentation for subsequent patches. == Commit message 1. A new property 'failover' is added at the slot level which is persistent information which specifies that this logical slot is enabled to be synced to the physical standbys so that logical replication can be resumed after failover. It is always false for physical slots. ~ SUGGESTION A new property 'failover' is added at the slot level. This is persistent information to indicate that this logical slot... ~~~ 2. Users can set it during the create subscription or during pg_create_logical_replication_slot. Examples: create subscription mysub connection '..' publication mypub WITH (failover = true); --last arg SELECT * FROM pg_create_logical_replication_slot('myslot', 'pgoutput', false, true, true); ~ 2a. Add a blank line before this ~ 2b. Use uppercase for the SQL ~ 2c. SUGGESTION Users can set this flag during CREATE SUBSCRIPTION or during pg_create_logical_replication_slot API. Ex1. CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub WITH (failover = true); Ex2. (failover is the last arg) SELECT * FROM pg_create_logical_replication_slot('myslot', 'pgoutput', false, true, true); ~~~ 3. This 'failover' is displayed as part of pg_replication_slots view. ~ SUGGESTION The value of the 'failover' flag is displayed as part of pg_replication_slots view. ~~~ 4. A new GUC standby_slot_names has been added. It is the list of physical replication slots that logical replication with failover enabled waits for. 
The intent of this wait is that no logical replication subscribers (with failover=true) should go ahead of physical replication standbys (corresponding to the physical slots in standby_slot_names). ~ 4a. SUGGESTION A new GUC standby_slot_names has been added. This is a list of physical replication slots that logical replication with failover enabled will wait for. ~ 4b. /no logical replication subscribers/no logical replication subscriptions/ ~ 4c /should go ahead of physical/should get ahead of physical/ == contrib/test_decoding/sql/slot.sql 5. + +-- Test logical slots creation with 'failover'=true (last arg) +SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', false, false, true); +SELECT slot_name, slot_type, failover FROM pg_replication_slots; + +SELECT pg_drop_replication_slot('failover_slot'); How about a couple more simple tests: a) pass false arg to confirm it is false in the view. b) according to the docs this failover is optional, so try API without passing it c) create a physical slot to confirm it is false in the view. == doc/src/sgml/catalogs.sgml 6. + + + subfailoverstate char + + + State codes for failover mode: + d = disabled, + p = pending enablement, + e = enabled + + + This attribute is very similar to the 'subtwophasestate' so IMO it would be better to be adjacent to that one in the docs. (probably this means putting it in the same order in the catalog also, assuming that is allowed) == doc/src/sgml/config.sgml 7. + +List of physical replication slots that logical replication slots with +failover enabled waits for. If a logical replication connection is +meant to switch to a physical standby after the standby is promoted, +the physical replication slot for the standby should be listed here. + + +The standbys corresponding to the physical replication slots in +standby_slot_names must enable +enable_syncslot for the standbys to receive +failover logical slots changes from the primary. 
+ That sentence mentioning 'enable_syncslot' seems premature because AFAIK that GUC is not introduced until patch 0002. So this part should be moved into the 0002 patch. == doc/src/sgml/ref/alter_subscription.sgml 8. These commands also cannot be executed when the subscription has two_phase - commit enabled, unless + commit enabled or + failover + enabled, unless copy_data is false. See column subtwophasestate - of pg_subscription + and subfailoverstate of + pg_subscription to know the actual two-phase state. I think the last sentence doesn't make sense anymore because it is no longer talking about only two-phase state. BEFORE See column subtwophasestate and subfailoverstate of pg_subscription to know the actual two-phase state. SUGGESTION See column
Re: Synchronizing slots from primary to standby
On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand wrote:
>
> On 11/20/23 11:59 AM, Amit Kapila wrote:
> > On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand wrote:
> >>
> >> On 11/18/23 11:45 AM, Amit Kapila wrote:
> >>> On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand wrote:
> >>>> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
> >>>>> On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand wrote:
> >>>>>
> >>>>> I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown
> >>>>> slotsync worker and drop slots. There could be other reasons(other than
> >>>>> promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code
> >>>>> there. I thought if the intention is to stop slotsync workers on promotion,
> >>>>> maybe FinishWalRecovery() is a better place to do it as it's indicating the end
> >>>>> of recovery and XLogShutdownWalRcv is also called in it.
> >>>>
> >>>> I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery()
> >>>> in v35. That looks ok.
> >>>
> >>> I was thinking what if we just ignore creating such slots (which
> >>> require init state) in the first place? I think that can be
> >>> time-consuming in some cases but it will reduce the complexity and we
> >>> can always improve such cases later if we really encounter them in the
> >>> real world. I am not very sure that added complexity is worth
> >>> addressing this particular case, so I would like to know your and
> >>> others' opinions.
> >>
> >> I'm not sure I understand your point. Are you saying that we should not create
> >> slots on the standby that are "currently" reported in a 'i' state? (so just keep
> >> the 'r' and 'n' states?)
> >
> > Yes.
>
> As far the 'i' state here, from what I see, it is currently useful for:
> 1. Cascading standby to not sync slots with state = 'i' from the first standby.
> 2. Easily report Slots that did not catch up on the primary yet.
> 3. Avoid inactive slots to block "active" ones creation.
>
> So not creating those slots should not be an issue for 1. (sync are
> not needed on cascading standby as not created on the first standby yet)
> but is an issue for 2. (unless we provide another way to keep track and report
> such slots) and 3. (as I think we should still need to reserve WAL).
>
> I've a question: we'd still need to reserve WAL for those slots, no?
>
> If that's the case and if we don't call ReplicationSlotCreate() then
> ReplicationSlotReserveWal() would not work as MyReplicationSlot would be NULL.

Yes, we need to reserve WAL to see if we can sync the slot. We are
currently creating an RS_EPHEMERAL slot and if we don't explicitly
persist it when we can't sync, then it will be dropped when we do
ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the
loss is probably, the next time we again try to sync the slot, we need
to again create it and may need to wait for newer restart_lsn on standby
which could be avoided if we have the slot in 'i' state from the
previous run. I don't deny the importance of having 'i' (initialized)
state but was just trying to say that it has additional code complexity.
OTOH, having it may give better visibility to even users about slots
that are not active (say manually created slots on the primary).

--
With Regards,
Amit Kapila.
Re: Do away with a few backwards compatibility macros
Nathan Bossart writes:
> On Thu, Nov 16, 2023 at 09:46:22AM -0600, Nathan Bossart wrote:
>> I'm fine with this because all of these macros are no-ops for all supported
>> versions of Postgres.  Even if an extension is using them today, you'll get
>> the same behavior as before if you remove the uses and rebuild against
>> v12-v16.

> Barring objections, I'll plan on committing this in the next week or so.

No objection here, but should we try to establish some sort of project
policy around this sort of change (ie, removal of backwards-compatibility
support)?  "Once it no longer matters for any supported version" sounds
about right to me, but maybe somebody has an argument for thinking about
it differently.

			regards, tom lane
How to accurately determine when a relation should use local buffers?
Dear Hackers,

I would like to clarify what the correct way is to determine that a
given relation is using local buffers.  Local buffers, as far as I know,
are used for temporary tables in backends.  There are two
functions/macros to check this (bufmgr.c): SmgrIsTemp and
RelationUsesLocalBuffers.  The first verifies that the current process
is a regular session backend, while the other macro checks the
relation's persistence characteristic.  It seems that using either
function on its own is not correct: I think they should be applied as a
pair to check for local buffer use, but in practice they appear to be
used independently.  This works only as long as temporary tables are
allowed exclusively in session backends.

I'm wondering how to determine the use of local buffers in other,
theoretical cases.  For example, what if we decide to replicate
temporary tables?  Are there other cases in vanilla PostgreSQL where
local buffers can be used with relations?  Do we allow the use of
relations with RELPERSISTENCE_TEMP anywhere other than in session
backends?

Thank you in advance for your help!

With best regards,
Vitaly Davydov
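The pairing being proposed can be restated as a tiny standalone predicate. This follows the email's own description of the two checks and is purely illustrative: the real SmgrIsTemp and RelationUsesLocalBuffers macros live in PostgreSQL's headers and operate on SMgrRelation and Relation respectively, while the helper names below are invented for the sketch.

```c
#include <stdbool.h>

#define RELPERSISTENCE_TEMP 't'   /* as in pg_class.relpersistence */

/* RelationUsesLocalBuffers-style check: a property of the relation alone */
static bool
relation_uses_local_buffers(char relpersistence)
{
    return relpersistence == RELPERSISTENCE_TEMP;
}

/* SmgrIsTemp-style check, per the description above: a property of the
 * accessing process (today only a regular session backend owns temp rels) */
static bool
process_can_have_temp_rels(bool is_session_backend)
{
    return is_session_backend;
}

/* The point of the email: neither test alone is sufficient in general;
 * the robust test is arguably the conjunction of both. */
bool
buffers_are_local(char relpersistence, bool is_session_backend)
{
    return relation_uses_local_buffers(relpersistence)
        && process_can_have_temp_rels(is_session_backend);
}
```

Under today's rules the two factors never disagree (temp relations exist only in session backends), which is why using either check alone currently works; the hypothetical cases in the email are exactly those where they could diverge.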
Re: Do away with a few backwards compatibility macros
On Thu, Nov 16, 2023 at 09:46:22AM -0600, Nathan Bossart wrote:
> On Thu, Nov 16, 2023 at 07:11:41PM +0530, Bharath Rupireddy wrote:
>> After a recent commit 6a72c42f (a related discussion [1]) which
>> removed MemoryContextResetAndDeleteChildren(), I think there are a
>> couple of other backward compatibility macros out there that can be
>> removed. These macros are tuplestore_donestoring() which was
>> introduced by commit dd04e95 21 years ago and SPI_push() and friends
>> which were made no-op macros by commit 1833f1a 7 years ago. Debian
>> code search shows very minimal usage of these macros. Here's a patch
>> attached to remove them.
>
> I'm fine with this because all of these macros are no-ops for all supported
> versions of Postgres. Even if an extension is using them today, you'll get
> the same behavior as before if you remove the uses and rebuild against
> v12-v16.

Barring objections, I'll plan on committing this in the next week or so.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Simplify if/else logic of walsender CreateReplicationSlot
On Mon, Nov 20, 2023 at 05:07:38PM +0900, Michael Paquier wrote:
> Good idea. What you are suggesting here improves the readability of
> this code, so +1.

And applied this one, thanks!

--
Michael
Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
On Mon, Nov 20, 2023 at 06:12:47PM -0800, Jeff Davis wrote:
> The caller could do something similar, so this option is not necessary,
> but it seems like it could be generally useful. It speeds things up for
> the search_path cache (and is an alternative to another patch I have
> that implements the same thing in the caller).

I'm mostly thinking out loud here, but could we just always do this?  I
guess you might want to avoid it if your SH_EQUAL is particularly
expensive and you know repeated lookups are rare, but maybe that's
uncommon enough that we don't really care.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Hide exposed impl detail of wchar.c
On Mon, Nov 20, 2023 at 05:14:17PM -0800, Jubilee Young wrote:
> On Mon, Nov 20, 2023 at 2:52 PM Nathan Bossart wrote:
>> Does pgrx use ascii.h at all?
>
> We don't use utils/ascii.h, no.

Alright.  The next minor release isn't until February, so I'll let this
one sit a little while longer in case anyone objects to back-patching
something like this [0].

[0] https://postgr.es/m/attachment/152305/move_is_valid_ascii_v2.patch

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Annoying build warnings from latest Apple toolchain
Hi,

On 2023-11-20 16:20:20 -0500, Tom Lane wrote:
> I'm generally still using autoconf, I only run meson builds when
> somebody complains about them ;-).  But yeah, I see lots of
> "ld: warning: -undefined error is deprecated" when I do that.
> This seems to have been installed by Andres' 9a95a510a:
>
>     ldflags += ['-isysroot', pg_sysroot]
> +   # meson defaults to -Wl,-undefined,dynamic_lookup for modules, which we
> +   # don't want because a) it's different from what we do for autoconf, b) it
> +   # causes warnings starting in macOS Ventura
> +   ldflags_mod += ['-Wl,-undefined,error']

That's not the sole issue, because meson automatically adds that for
binaries (due to some linker issue that existed in the past, IIRC).

> The autoconf side seems to just be letting this option default.
> I'm not sure what the default choice is, but evidently it's not
> "-undefined error"?  Or were they stupid enough to not allow you
> to explicitly select the default behavior?

I'm somewhat confused by what Apple did. I was just upgrading my M1 Mac
mini to Sonoma, and in one state I recall man ld documenting the option
below "Obsolete Options". But then I also updated Xcode, and now there's
no mention whatsoever in the man page of the option being deprecated.
Perhaps my mind is playing tricks on me.

And yes, it sure looks like everything other than 'dynamic_lookup'
creates a warning. Which seems absurd.

> Also, I *don't* see any complaints about duplicate libraries.
> What build options are you using?

I don't understand what Apple is thinking here. Getting the same library
name twice, e.g. icu once directly and once indirectly via xml2-config
--libs or such, seems very common. To avoid that portably, you basically
need to do a topological sort of the libraries, to figure out an ordering
that deduplicates but doesn't move a library to before where it's used,
with a bunch of complexities (due to -L, which could lead to finding
different libraries for the same library name) thrown in.

WRT Robert seeing those warnings and Tom not: there's something odd going
on. I couldn't immediately reproduce it. Then I realized it reproduces
against a homebrew install but not a macports one. Robert, which are you
using?

Meson actually *tries* to avoid this warning without producing incorrect
results due to ordering. But homebrew does something odd with libxml-2.0,
zlib and a few others: unless you do something to change that, brew has
/opt/homebrew/Library/Homebrew/os/mac/pkgconfig/14/ in its search path,
but those libraries aren't from homebrew, they're referencing macOS
frameworks. Apparently meson isn't able to understand which files those
.pc files link to and gives up on the deduplication.

If I add to the pkg-config search path, e.g. via

  meson configure -Dpkg_config_path=$OTHER_STUFF:/opt/homebrew/opt/zlib/lib/pkgconfig/:/opt/homebrew/opt/libxml2/lib/pkgconfig/

the warnings about duplicate libraries vanish.

Greetings,

Andres Freund
RE: Synchronizing slots from primary to standby
On Friday, November 17, 2023 7:39 PM Amit Kapila wrote:
>
> On Thu, Nov 16, 2023 at 5:34 PM shveta malik wrote:
> >
> > PFA v35.
>
> Review v35-0002*
> ==

Thanks for the comments.

> 1.
> As quoted in the commit message,
> >
> > If a logical slot is invalidated on the primary, slot on the standby is also
> > invalidated. If a logical slot on the primary is valid but is invalidated on the
> > standby due to conflict (say required rows removed on the primary), then that
> > slot is dropped and recreated on the standby in next sync-cycle.
> > It is okay to recreate such slots as long as these are not consumable on the
> > standby (which is the case currently).
> >
>
> I think this won't happen normally because of the physical slot and
> hot_standby_feedback but probably can occur in cases like if the user
> temporarily switches hot_standby_feedback from on to off. Are there any
> other reasons? I think we can mention the cases along with it as well at
> least for now. Additionally, I think this should be covered in code
> comments as well.

I will collect all these cases and update in next version.

> 2.
>  #include "postgres.h"
> -
> +#include "access/genam.h"
>
> Spurious line removal.

Removed.

> 3.
>        A password needs to be provided too, if the sender demands password
>        authentication.  It can be provided in the
>        primary_conninfo string, or in a separate
> -      ~/.pgpass file on the standby server (use
> -      replication as the database name).
> -      Do not specify a database name in the
> -      primary_conninfo string.
> +      ~/.pgpass file on the standby server.
> +
> +      Specify dbname in
> +      primary_conninfo string to allow synchronization
> +      of slots from the primary server to the standby server.
> +      This will only be used for slot synchronization. It is ignored
> +      for streaming.
>
> Is there a reason to remove part of the earlier sentence "use
> replication as the database name"?

Added it back.

> 4.
> +      enable_syncslot configuration parameter
> +
> +      It enables a physical standby to synchronize logical failover slots
> +      from the primary server so that logical subscribers are not blocked
> +      after failover.
> +
> +      It is enabled by default. This parameter can only be set in the
> +      postgresql.conf file or on the server command line.
>
> I think you forgot to update the documentation for the default value of
> this variable.

Updated.

> 5.
> + * a) start the logical replication workers for every enabled subscription
> + *    when not in standby_mode
> + * b) start the slot-sync worker for logical failover slots synchronization
> + *    from the primary server when in standby_mode.
>
> Either use a full stop after both lines or none of these.

Added a full stop.

> 6.
> +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker);
>
> There shouldn't be space between * and the worker.

Removed, and added the type to typedefs.list.

> 7.
> + if (!SlotSyncWorker->hdr.in_use)
> + {
> +     LWLockRelease(SlotSyncWorkerLock);
> +     ereport(ERROR,
> +             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +              errmsg("replication slot-sync worker not initialized, "
> +                     "cannot attach")));
> + }
> +
> + if (SlotSyncWorker->hdr.proc)
> + {
> +     LWLockRelease(SlotSyncWorkerLock);
> +     ereport(ERROR,
> +             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +              errmsg("replication slot-sync worker is "
> +                     "already running, cannot attach")));
> + }
>
> Using slot-sync in the error messages looks a bit odd to me. Can we use
> "replication slot sync worker ..." in both these and other similar
> messages? I think it would be better if we don't split the messages into
> multiple lines in these cases as messages don't appear too long to me.

Changed as suggested.

> 8.
> +/*
> + * Detach the worker from DSM and update 'proc' and 'in_use'.
> + * Logical replication launcher will come to know using these
> + * that the worker has shutdown.
> + */
> +void
> +slotsync_worker_detach(int code, Datum arg)
> +{
>
> I think the reference to DSM is leftover from the previous version of the
> patch. Can we change the above comments as per the new code?

Changed.

> 9.
> +static bool
> +slotsync_worker_launch()
> {
> ...
> + /* TODO: do we really need 'generation', analyse more here */
> + worker->hdr.generation++;
>
> We should do something about this TODO. As per my understanding, we don't
> need a generation number for the slot sync worker as we have one such
> worker but I guess the patch requires it because we are using existing
> logical replication worker infrastructure. This brings the question of
> whether we really need a separate SlotSyncWorkerInfo or if we can use
> existing LogicalRepWorker and distinguish it with
Re: archive modules loose ends
On Mon, Nov 13, 2023 at 03:35:28PM -0800, Andres Freund wrote:
> On 2023-11-13 16:42:31 -0600, Nathan Bossart wrote:
>> There seems to be no interest in this patch, so I plan to withdraw it from
>> the commitfest system by the end of the month unless such interest
>> materializes.
>
> I think it might just have arrived too shortly before the feature freeze to be
> worth looking at at the time, and then it didn't really re-raise attention
> until now. I'm so far behind on keeping up with the list that I rarely end up
> looking far back for things I'd like to have answered... Sorry.

No worries.  I appreciate the review.

>> I see two main options for dealing with this.  One option is to simply have
>> shell_archive (and any other archive modules out there) maintain its own
>> memory context like basic_archive does.  This ends up requiring a whole lot
>> of duplicate code between the two built-in modules, though.  Another option
>> is to have the archiver manage a memory context that it resets after every
>> invocation of the archiving callback, ERROR or not.
>
> I think passing in a short-lived memory context is a lot nicer to deal with.

Cool.

>> This has the advantage of avoiding code duplication and simplifying things
>> for the built-in modules, but any external modules that rely on palloc'd
>> state being long-lived would need to be adjusted to manage their own
>> long-lived context.  (This would need to be appropriately documented.)
>
> Alternatively we could provide a longer-lived memory context in
> ArchiveModuleState, set up by the generic infrastructure. That context would
> obviously still need to be explicitly utilized by a module, but no duplicated
> setup code would be required.

Sure.  Right now, I'm not sure there's too much need for that.  A module
could just throw stuff in TopMemoryContext, and you probably wouldn't
have any leaks because the archiver just restarts on any ERROR or
archive_library change.  But that's probably not a pattern we want to
encourage long-term.  I'll jot this down for a follow-up patch idea.

> I think we should just have the AtEOXact_Files() in pgarch.c, then no
> PG_TRY/CATCH is needed here. At the moment I think just about every possible
> use of an archive module would require using files, so there doesn't seem
> much of a reason to not handle it in pgarch.c.

WFM.

> I'd probably reset a few other subsystems at the same time (there's probably
> more):
> - disable_all_timeouts()
> - LWLockReleaseAll()
> - ConditionVariableCancelSleep()
> - pgstat_report_wait_end()
> - ReleaseAuxProcessResources()

I looked around a bit and thought AtEOXact_HashTables() belonged here as
well.  I'll probably give this one another pass to see if there's
anything else obvious.

> It could be worth setting up an errcontext providing the module and file
> that's being processed. I personally find that at least as important as
> setting up a ps string detailing the log file... But I guess that could be a
> separate patch.

Indeed.  Right now we rely on the module to emit sufficiently-detailed
logs, but it'd be nice if they got that for free.

> It'd be nice to add a comment explaining why pgarch_archiveXlog() is the
> right place to handle errors.

Will do.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Nov 20, 2023 at 4:28 PM Amit Kapila wrote:
>
> 9.
> +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
> +                     bool *slot_updated)
> {
> ...
> + else
> + {
> +     TransactionId xmin_horizon = InvalidTransactionId;
> +     ReplicationSlot *slot;
> +
> +     ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
> +                           remote_slot->two_phase, false);
> +     slot = MyReplicationSlot;
> +
> +     SpinLockAcquire(&slot->mutex);
> +     slot->data.database = get_database_oid(remote_slot->database, false);
> +
> +     /* Mark it as sync initiated by slot-sync worker */
> +     slot->data.sync_state = SYNCSLOT_STATE_INITIATED;
> +     slot->data.failover = true;
> +
> +     namestrcpy(&slot->data.plugin, remote_slot->plugin);
> +     SpinLockRelease(&slot->mutex);
> +
> +     ReplicationSlotReserveWal();
> +
>
> How and when will this init state (SYNCSLOT_STATE_INITIATED) persist to
> disk?

On closer inspection, I see that it is done inside
wait_for_primary_and_sync() when it fails to sync. I think it is better
to refactor the code a bit and persist it in synchronize_one_slot() to
make the code flow easier to understand.

--
With Regards,
Amit Kapila.
Re: [PATCH] pgbench log file headers
Hello,

On Mon, Nov 13, 2023 at 6:01 PM Andres Freund wrote:
> Hi,
>
> On 2023-11-13 11:55:07 -0600, Adam Hendel wrote:
>> Currently, pgbench will log individual transactions to a logfile when the
>> `--log` parameter flag is provided. The logfile, however, does not include
>> column headers. It has become a fairly standard expectation of users to
>> have column headers present in flat files. Without the header in the
>> pgbench log files, new users must navigate to the docs and piece together
>> the column headers themselves. Most industry leading frameworks have
>> tooling built in to read column headers though, for example python/pandas
>> read_csv().
>
> The disadvantage of doing that is that a single pgbench run with --log will
> generate many files when using -j, to avoid contention. The easiest way to
> deal with that after the run is to cat all the log files to a larger one.
> If you do that with headers present, you obviously have a few bogus rows
> (the headers from the various files).
>
> I guess you could avoid the "worst" of that by documenting that the
> combined log file should be built by
>   cat {$log_prefix}.${pid} {$log_prefix}.${pid}.*
> and only outputting the header in the file generated by the main thread.
>
>> We can improve the experience for users by adding column headers to
>> pgbench logfiles with an optional command line flag, `--log-header`. This
>> will keep the API backwards compatible by making users opt in to the
>> column headers. It follows the existing pattern of having conditional
>> flags in pgbench's API; the `--log` option would have both --log-prefix
>> and --log-header if this work is accepted.
>
>> The implementation considers the column headers only when the
>> `--log-header` flag is present. The values for the columns are taken
>> directly from the "Per-Transaction Logging" section in
>> https://www.postgresql.org/docs/current/pgbench.html and takes into
>> account the conditional columns `schedule_lag` and `retries`.
>
>> Below is an example of what that logfile will look like:
>>
>> pgbench postgres://postgres:postgres@localhost:5432/postgres --log --log-header
>>
>> client_id transaction_no time script_no time_epoch time_us
>
> Independent of your patch, but we imo ought to combine time_epoch and
> time_us in the log output. There's no point in forcing consumers to
> combine those fields, and it makes logging more expensive... And if we
> touch this, we should just switch to outputting nanoseconds instead of
> microseconds.
>
> It also is quite odd that we have "time" and "time_epoch", "time_us",
> where time is the elapsed time of an individual "transaction" and
> time_epoch + time_us together are a wall-clock timestamp. Without
> differentiating between those, the column headers seem not very useful,
> because one needs to look in the documentation to understand the fields.
>
> I don't think there's all that strong a need for backward compatibility
> in pgbench. We could just change the columns as I suggest above and then
> always emit the header in the "main" log file.

I updated the patch to always log the header, and only from the main
thread.

As for the time headers, I will work on renaming/combining those in a
separate commit as:

  time -> time_elapsed
  time_epoch + time_us -> time_completion_us

> Greetings,
>
> Andres Freund

Attachment: v2-0001-Adds-colum-headers-to-the-log-file-produced-by-pg.patch
Re: Use of backup_label not noted in log
On Mon, Nov 20, 2023 at 03:31:20PM -0400, David Steele wrote:
> On 11/20/23 15:03, Andres Freund wrote:
>> Besides the phrasing and the additional log message (I have no opinion
>> about whether it should be backpatched or not), I used %u for TimelineID
>> as appropriate, and added a comma before "on timeline".

The "starting/restarting/completed recovery" line sounds better here, so
I'm OK with your suggestions.

> I still wonder if we need "base backup" in the messages?  That sort of
> implies (at least to me) you used pg_basebackup but that may not be the
> case.

Or just s/base backup/backup/?

> Other than that, looks good for HEAD.  Whether we back patch or not is
> another question, of course.

I'd rather see more information in the back-branches more quickly, so
count me in the bucket of folks in favor of a backpatch.

--
Michael
Re: remaining sql/json patches
On Nov 16, 2023, at 17:48, Amit Langote wrote:
> On Thu, Nov 16, 2023 at 2:11 AM Andres Freund wrote:
>> On 2023-11-15 22:00:41 +0900, Amit Langote wrote:
>>>> This causes a nontrivial increase in the size of the parser (~5% in an
>>>> optimized build here), I wonder if we can do better.
>>>
>>> Hmm, sorry if I sound ignorant but what do you mean by the parser here?
>>
>> gram.o, in an optimized build.
>>
>>> I can see that the byte-size of gram.o increases by 1.66% after the
>>> above additions (1.72% with previous versions).
>>
>> I'm not sure anymore how I measured it, but if you just looked at the total
>> file size, that might not show the full gain, because of debug symbols
>> etc. You can use the size command to look at just the code and data size.
>
> $ size /tmp/gram.*
>    text    data     bss     dec     hex filename
>  661808       0       0  661808   a1930 /tmp/gram.o.unpatched
>  672800       0       0  672800   a4420 /tmp/gram.o.patched
>
> That's still a 1.66% increase in the code size:
>
> (672800 - 661808) / 661808 % = 1.66
>
> As for gram.c, the increase is a bit larger:
>
> $ ll /tmp/gram.*
> -rw-rw-r--. 1 amit amit 2605925 Nov 16 16:18 /tmp/gram.c.unpatched
> -rw-rw-r--. 1 amit amit 2709422 Nov 16 16:22 /tmp/gram.c.patched
>
> (2709422 - 2605925) / 2605925 % = 3.97
>
>>> I've also checked
>>> using log_parser_stats that there isn't much slowdown in the
>>> raw-parsing speed.
>>
>> What does "isn't much slowdown" mean in numbers?
>
> Sure, the benchmark I used measured the elapsed time (using
> log_parser_stats) of parsing a simple select statement (*) averaged
> over 1 repetitions of the same query performed with `psql -c`:
>
> Unpatched: 0.61 seconds
> Patched: 0.61 seconds
>
> Here's a look at the perf:
>
> Unpatched:
>   0.59%  [.] AllocSetAlloc
>   0.51%  [.] hash_search_with_hash_value
>   0.47%  [.] base_yyparse
>
> Patched:
>   0.63%  [.] AllocSetAlloc
>   0.52%  [.] hash_search_with_hash_value
>   0.44%  [.] base_yyparse
>
> (*) select 1, 2, 3 from foo where a = 1
>
> Is there a more relevant benchmark I could use?

Thought I'd share a few more numbers I collected to analyze the parser
size increase over releases.

* gram.o text bytes is from the output of `size gram.o`.
* compiled with -O3

version      gram.o text bytes  %change  gram.c bytes  %change
9.6                     534010        -       2108984        -
10                      582554     9.09       2258313     7.08
11                      584596     0.35       2313475     2.44
12                      590957     1.08       2341564     1.21
13                      590381    -0.09       2357327     0.67
14                      600707     1.74       2428841     3.03
15                      633180     5.40       2495364     2.73
16                      653464     3.20       2575269     3.20
17-sqljson              672800     2.95       2709422     3.97

So if we put SQL/JSON (including JSON_TABLE()) into 17, we end up with a
gram.o 2.95% larger than v16, which granted is a somewhat larger bump,
though also smaller than with some recent releases.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 03:58:55PM -0800, Andres Freund wrote:
> I was thinking we'd just set it in the pg_basebackup style path, and we'd
> error out if it's set and backup_label is not present.  But we'd still use
> backup_label without the pg_control flag set.
>
> So it'd just provide a cross-check that backup_label was not removed for
> pg_basebackup style backups, but wouldn't do anything for external
> backups. But imo the proposal to just use pg_control doesn't actually do
> anything for external backups either - which is why I think my proposal
> would achieve as much, for a much lower price.

I don't see why not.  It does not increase the number of steps when doing
a backup, and backupStartPoint alone would not be able to offer this much
protection.

--
Michael
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
Bruce Momjian writes:
> On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote:
>> Bruce Momjian writes:
>>> An alternate approach would be to remove pg_attribute.attndims so we
>>> don't even try to preserve dimensionality.
>
>> I could get behind that, perhaps.  It looks like we're not using the
>> field in any meaningful way, and we could simplify TupleDescInitEntry
>> and perhaps some other APIs.
>
> So should I work on that patch or do you want to try?  I think we should
> do something.

Let's wait for some other opinions, first ...

			regards, tom lane
simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
Patch attached.

The caller could do something similar, so this option is not necessary,
but it seems like it could be generally useful. It speeds things up for
the search_path cache (and is an alternative to another patch I have that
implements the same thing in the caller).

Thoughts?

Regards,
	Jeff Davis

From b878af835da794f3384f870db57b34e236b1efba Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Mon, 20 Nov 2023 17:42:07 -0800
Subject: [PATCH] Add SH_OPTIMIZE_REPEAT option to simplehash.h.

Callers which expect to look up the same value repeatedly can specify
SH_OPTIMIZE_REPEAT. That option causes simplehash to quickly check if the
last entry returned has the same key, and return it immediately if so
(before calculating the hash).
---
 src/include/lib/simplehash.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index b1a3d7f927..9a3dbfecfa 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -49,6 +49,7 @@
  *	  - SH_EQUAL(table, a, b) - compare two table keys
  *	  - SH_HASH_KEY(table, key) - generate hash for the key
  *	  - SH_STORE_HASH - if defined the hash is stored in the elements
+ *	  - SH_OPTIMIZE_REPEAT - optimize for repeated lookups of the same key
  *	  - SH_GET_HASH(tb, a) - return the field to store the hash in
  *
  * The element type is required to contain a "status" member that can store
@@ -163,6 +164,11 @@ typedef struct SH_TYPE
 	/* hash buckets */
 	SH_ELEMENT_TYPE *data;

+#ifdef SH_OPTIMIZE_REPEAT
+	/* last element found */
+	SH_ELEMENT_TYPE *previous;
+#endif
+
 #ifndef SH_RAW_ALLOCATOR
 	/* memory context to use for allocations */
 	MemoryContext ctx;
@@ -777,7 +783,16 @@ restart:
 SH_SCOPE SH_ELEMENT_TYPE *
 SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found)
 {
-	uint32		hash = SH_HASH_KEY(tb, key);
+	uint32		hash;
+
+#ifdef SH_OPTIMIZE_REPEAT
+	if (tb->previous != NULL &&
+		tb->previous->status == SH_STATUS_IN_USE &&
+		SH_EQUAL(tb, key, tb->previous->SH_KEY))
+		return tb->previous;
+#endif
+
+	hash = SH_HASH_KEY(tb, key);

 	return SH_INSERT_HASH_INTERNAL(tb, key, hash, found);
 }
@@ -815,7 +830,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 		Assert(entry->status == SH_STATUS_IN_USE);

 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
+		{
+#ifdef SH_OPTIMIZE_REPEAT
+			tb->previous = entry;
+#endif
 			return entry;
+		}

 		/*
 		 * TODO: we could stop search based on distance. If the current
@@ -834,7 +854,16 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 SH_SCOPE SH_ELEMENT_TYPE *
 SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key)
 {
-	uint32		hash = SH_HASH_KEY(tb, key);
+	uint32		hash;
+
+#ifdef SH_OPTIMIZE_REPEAT
+	if (tb->previous != NULL &&
+		tb->previous->status == SH_STATUS_IN_USE &&
+		SH_EQUAL(tb, key, tb->previous->SH_KEY))
+		return tb->previous;
+#endif
+
+	hash = SH_HASH_KEY(tb, key);

 	return SH_LOOKUP_HASH_INTERNAL(tb, key, hash);
 }
@@ -1152,6 +1181,7 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_DEFINE
 #undef SH_GET_HASH
 #undef SH_STORE_HASH
+#undef SH_OPTIMIZE_REPEAT
 #undef SH_USE_NONDEFAULT_ALLOCATOR
 #undef SH_EQUAL
--
2.34.1
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote:
> Bruce Momjian writes:
>> I would like to apply this patch to master because I think our current
>> deficiencies in this area are unacceptable.
>
> I do not think this is a particularly good idea, because it creates
> the impression in a couple of places that we track this data, when
> we do not really do so to any meaningful extent.

Okay, I thought we could get by without tracking the CREATE TABLE AS
case, but it is inconsistent.  My patch just makes it less inconsistent.

>> An alternate approach would be to remove pg_attribute.attndims so we
>> don't even try to preserve dimensionality.
>
> I could get behind that, perhaps.  It looks like we're not using the
> field in any meaningful way, and we could simplify TupleDescInitEntry
> and perhaps some other APIs.

So should I work on that patch or do you want to try?  I think we should
do something.

--
  Bruce Momjian  https://momjian.us
  EDB            https://enterprisedb.com

  Only you can decide what is important to you.
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
Bruce Momjian writes:
> I would like to apply this patch to master because I think our current
> deficiencies in this area are unacceptable.

I do not think this is a particularly good idea, because it creates
the impression in a couple of places that we track this data, when
we do not really do so to any meaningful extent.

> An alternate approach would be to remove pg_attribute.attndims so we
> don't even try to preserve dimensionality.

I could get behind that, perhaps.  It looks like we're not using the
field in any meaningful way, and we could simplify TupleDescInitEntry
and perhaps some other APIs.

			regards, tom lane
Re:Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?
Thanks. After reconsideration, I realized that what I really want is for
other connections to remain unaffected when a process crashes. This is
something that a connection pool cannot solve.

At 2023-11-14 09:41:03, "Thomas wen" wrote:
> Hi yuansong,
>    there is a connection pool patch
> (https://commitfest.postgresql.org/34/3043/), but it has been dormant
> for a few years. You can check this patch to get what you need.
>
> From: yuansong
> Sent: November 13, 2023, 17:13
> To: Laurenz Albe
> Cc: pgsql-hackers@lists.postgresql.org
> Subject: Re: How to solve the problem of one backend process crashing
> and causing other processes to restart?
>
> Enhancing the overall fault tolerance of the entire system for this
> feature is quite important. No one can avoid bugs, and I don't believe
> that this approach is a more advanced one. It might be worth considering
> adding it to the roadmap so that interested parties can conduct relevant
> research.
>
> The current major issue is that when one process crashes, resetting all
> connections has a significant impact on other connections. Is it
> possible to only disconnect the crashed connection and have the other
> connections simply roll back the current transaction without
> reconnecting? Perhaps this problem can be mitigated through the use of a
> connection pool.
>
> If we want to implement this feature, would it be sufficient to clean up
> or restore the shared memory and disk changes caused by the crashed
> backend? Besides clearing various known locks, what else needs to be
> changed? Does anyone have any insights? Please help me. Thank you.
>
> At 2023-11-13 13:53:29, "Laurenz Albe" wrote:
> > On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
> >> yuansong writes:
> >>> In PostgreSQL, when a backend process crashes, it can cause other
> >>> backend processes to also require a restart, primarily to ensure data
> >>> consistency. I understand that the correct approach is to analyze and
> >>> identify the cause of the crash and resolve it. However, it is also
> >>> important to be able to handle a backend process crash without
> >>> affecting the operation of other processes, thus minimizing the scope
> >>> of negative impact and improving availability. To achieve this goal,
> >>> could we mimic the Oracle process by introducing a "pmon" process
> >>> dedicated to rolling back crashed process transactions and performing
> >>> resource cleanup? I wonder if anyone has attempted such a strategy or
> >>> if there have been previous discussions on this topic.
> >>
> >> The reason we force a database-wide restart is that there's no way to
> >> be certain that the crashed process didn't corrupt anything in shared
> >> memory.  (Even with the forced restart, there's a window where bad
> >> data could reach disk before we kill off the other processes that
> >> might write it.  But at least it's a short window.)  "Corruption"
> >> here doesn't just involve bad data placed into disk buffers; more
> >> often it's things like unreleased locks, which would block other
> >> processes indefinitely.
> >>
> >> I seriously doubt that anything like what you're describing
> >> could be made reliable enough to be acceptable.  "Oracle does
> >> it like this" isn't a counter-argument: they have a much different
> >> (and non-extensible) architecture, and they also have an army of
> >> programmers to deal with minutiae like undoing resource acquisition.
> >> Even with that, you'd have to wonder about the number of bugs
> >> existing in such necessarily-poorly-tested code paths.
> >
> > Yes.
> > I think that PostgreSQL's approach is superior: rather than investing
> > in code to mitigate the impact of data corruption caused by a crash,
> > invest in quality code that doesn't crash in the first place.
> >
> > Euphemistically naming a crash "ORA-600 error" seems to be part of
> > their strategy.
> >
> > Yours,
> > Laurenz Albe
Re: meson documentation build open issues
Hi, On 2023-11-20 08:27:48 +0100, Peter Eisentraut wrote: > On 17.11.23 19:53, Andres Freund wrote: > > I pushed the first two commits (the selinux stuff) and worked a bit more on > > the subsequent ones. > > Patches 0001 through 0004 look good to me. Cool, I pushed them now. > Some possible small tweaks in 0004: > > +perl, '-ne', 'next if /^#/; print', > > If you're going for super-brief mode, you could also use "perl -p" and drop > the "print". I thought this didn't add much, so I didn't go there. > Put at least two spaces between the "columns" in targets-meson.txt: > > + doc/src/sgml/postgres-A4.pdf Build documentation in PDF format, with >^^ I did adopt this. One remaining question is whether we should adjust install-doc-{html,man} to be install-{html,man}, to match the docs targets. Greetings, Andres Freund
Re: pg_upgrade and logical replication
On Mon, Nov 20, 2023 at 09:49:41AM +0530, Amit Kapila wrote: > On Tue, Nov 14, 2023 at 7:21 AM vignesh C wrote: >> There are a couple of things happening here: a) In the first part we >> take care of setting the subscription relation to SYNCDONE and dropping >> the replication slot at the publisher node. Only if the drop of the replication slot >> is successful will the relation state be set to SYNCDONE; if the drop of the >> replication slot fails, the relation state will still be >> FINISHEDCOPY. So if there is a failure in dropping the replication slot we >> will not have an issue, as the tablesync worker will be in the >> FINISHEDCOPY state and this state is not allowed for upgrade. When the >> state is SYNCDONE the tablesync slot will not be present. b) In the >> second part we drop the replication origin. Even if there is a chance >> that the drop of the replication origin fails for some reason, there will be >> no problem, as we do not copy the tablesync replication origin to the >> new cluster while upgrading. Since the tablesync replication origin >> is not copied to the new cluster there will be no replication origin >> leaks. > > And, this will work because in the SYNCDONE state, while removing the > origin, we are okay with missing origins. It seems not copying the > origin for tablesync workers in this state (SYNCDONE) relies on the > fact that currently, we don't use those origins once the system > reaches the SYNCDONE state, but I am not sure it is a good idea to have > such a dependency, and an upgrade assuming such things doesn't seem > ideal to me. Hmm, yeah, you mean the replorigin_drop_by_name() calls in tablesync.c. I did not pay much attention to that in the code, but your point sounds sensible. (I have not been able to complete an analysis of the risks behind 's' to convince myself that it is entirely safe, but leaks are scary as hell if this gets automated across a large fleet of nodes..)
> Personally, I think allowing an upgrade in 'i' > (initialize) state or 'r' (ready) state seems safe because in those > states either slots/origins don't exist or are dropped. What do you > think? I share a similar impression about 's'. From a design point of view, making the conditions harder to reach in the first implementation makes the user experience stricter, but that's safer regarding leaks, and it is still possible to relax these choices in the future depending on the improvements we are able to figure out. -- Michael
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
On Fri, Sep 8, 2023 at 05:10:51PM -0400, Bruce Momjian wrote:
> I knew we only considered the array dimension sizes to be documentation
> _in_ the query, but I thought we at least properly displayed the number
> of dimensions specified at creation when we described the table in psql,
> but it seems we don't do that either.
>
> A big question is why we even bother to record the dimensions in
> pg_attribute if it is not accurate for LIKE and not displayed to the user
> in a meaningful way by psql.
>
> I think another big question is whether the structure we are using to
> supply the column information to BuildDescForRelation is optimal. The
> typmod that has to be found for CREATE TABLE uses:
>
> typenameTypeIdAndMod(NULL, entry->typeName, , );
>
> which calls typenameTypeIdAndMod() -> typenameType() -> LookupTypeName()
> -> LookupTypeNameExtended() -> typenameTypeMod(). This seems very
> complicated because the ColumnDef, at least in the LIKE case, already
> has the value. Is there a need to revisit how we handle types in such
> cases?

(Bug report moved to hackers, previous bug reporters added as CCs.)

I looked at this some more and found more fundamental problems. We have pg_attribute.attndims, which does record the array dimensionality:

	CREATE TABLE test (data integer, data_array integer[5][5]);

	SELECT attndims FROM pg_attribute
	WHERE attrelid = 'test'::regclass AND attname = 'data_array';
	 attndims
	----------
	        2

The first new problem I found is that we don't dump the dimensionality:

	$ pg_dump test
	...
	CREATE TABLE public.test (
	    data integer,
-->	    data_array integer[]
	);

and psql doesn't display the dimensionality:

	\d test
	               Table "public.test"
	   Column   |   Type    | Collation | Nullable | Default
	------------+-----------+-----------+----------+---------
	 data       | integer   |           |          |
-->	 data_array | integer[] |           |          |

A report from 2015 says that CREATE TABLE ... LIKE and CREATE TABLE ... AS don't propagate the dimensionality:

	https://www.postgresql.org/message-id/flat/20150707072942.1186.98151%40wrigleys.postgresql.org

and this thread from 2018 supplied a fix:

	https://www.postgresql.org/message-id/flat/7862e882-8b9a-0c8e-4a38-40ad374d3634%40brandwatch.com

though in my testing it only fixes LIKE, not CREATE TABLE ... AS. This report from April of this year also complains about LIKE:

	https://www.postgresql.org/message-id/flat/ZD%2B14YZ4IUue8Rhi%40gendo.asyd.net

Here is the output from master for LIKE:

	CREATE TABLE test2 (LIKE test);

	SELECT attndims FROM pg_attribute
	WHERE attrelid = 'test2'::regclass AND attname = 'data_array';
	 attndims
	----------
-->	        0

and this is the output for CREATE TABLE ... AS:

	CREATE TABLE test3 AS SELECT * FROM test;

	SELECT attndims FROM pg_attribute
	WHERE attrelid = 'test3'::regclass AND attname = 'data_array';
	 attndims
	----------
-->	        0

The attached patch fixes pg_dump:

	$ pg_dump test
	...
	CREATE TABLE public.test2 (
	    data integer,
-->	    data_array integer[][]
	);

It uses repeat() at the SQL level rather than modifying format_type() at the SQL or C level. It seems format_type() is mostly used to get the type name, e.g. int4[], rather than the column definition, so I added the brackets at the call site. I used a similar fix for psql output:

	\d test
	                Table "public.test"
	   Column   |    Type     | Collation | Nullable | Default
	------------+-------------+-----------+----------+---------
	 data       | integer     |           |          |
-->	 data_array | integer[][] |           |          |

The 2018 patch from Alexey Bashtanov fixes the LIKE case:

	CREATE TABLE test2 (LIKE test);

	\d test2
	                Table "public.test2"
	   Column   |    Type     | Collation | Nullable | Default
	------------+-------------+-----------+----------+---------
	 data       | integer     |           |          |
-->	 data_array | integer[][] |           |          |

It does not fix CREATE TABLE ... AS because it looks like fixing that would require adding an ndims column to Var for WITH NO DATA and adding ndims to TupleDesc for WITH DATA. I am not sure if that overhead is warranted to fix this item. I have added C comments where they should be added.
I would like to apply this patch to master because I think our current deficiencies in this area are unacceptable. An alternate approach would be to remove pg_attribute.attndims so we don't even try to preserve dimensionality. -- Bruce Momjian
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Mon, Nov 20, 2023 at 04:53:45PM +0530, Ashutosh Bapat wrote: > On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier wrote: >> I have added some documentation to explain that, as well. I am not >> wedded to the name proposed in the patch, so if you feel there is >> better, feel free to propose ideas. > > Actually with Attach and Detach terminology, INJECTION_POINT becomes > the place where we "declare" the injection point. So the documentation > needs to first explain INJECTION_POINT and then explain the other > operations. Sure. >> This facility is hidden behind a specific configure/Meson switch, >> making it a no-op by default: >> --enable-injection-points >> -Dinjection_points={ true | false } > > That's useful, but we will also see demand to enable those in > production (under controlled circumstances). But let the functionality > mature under a separate flag and developer builds before used in > production. Hmm. Okay. I'd still keep that under a compile switch for now anyway to keep a cleaner separation and avoid issues where it would be possible to inject code in a production build. Note that I was planning to switch one of my buildfarm animals to use it should this stuff make it into the tree. And people would be free to use it if they want. If in production, that would be a risk, IMO. > +/* > + * Local cache of injection callbacks already loaded, stored in > + * TopMemoryContext. > + */ > +typedef struct InjectionPointArrayEntry > +{ > + char name[INJ_NAME_MAXLEN]; > + InjectionPointCallback callback; > +} InjectionPointArrayEntry; > + > +static InjectionPointArrayEntry *InjectionPointCacheArray = NULL; > > Initial size of this array is small, but given the size of code in a > given path to be tested, I fear that this may not be sufficient. I > feel it's better to use hash table whose APIs are already available. I've never seen in recent years a need for a given test to use more than 4~5 points. 
But I guess that you've seen more than that wanted in a prod environment with a fork of Postgres? I'm OK to switch that to use a hash table in the initial implementation; even for a low number of entries with points that are not in hot code paths, that should be OK. At least for my cases related to testing that's OK. Am I right to assume that you mean a HTAB in TopMemoryContext? > I find it pretty useless to expose that as a SQL API. Also adding it > in tests would make it usable as an example. In this patchset > INJECTION_POINT should be the only way to trigger an injection point. That's useful to test the cache logic while providing simple coverage for the core API, and that's cheap. So I'd rather keep this test module around with these tests. > That also brings another question. The INJECTION_POINT needs to be added in > the core code but can only be used through an extension. I think there should be > some rudimentary albeit raw test in core for this. The extension does some good > things like providing built-in actions when the injection is triggered. So, we > should keep those too. Yeah, I'd like to agree with that, but everything I've seen in recent years is that every setup finishes by having different assumptions about what they want to do, so my intention is to trim down the core of the patch to a bare acceptable minimum and then work on top of that. > I am glad that it covers the frequently needed injections: error, notice and > wait. If someone adds multiple injection points for wait and all of them are > triggered (in different backends), test_injection_points_wake() would > wake them all. When debugging cascaded functionality it's not easy to > follow the sequence add, trigger, wake for multiple injections. We should > associate a condition variable with the required injection points. Attach > should create a condition variable in the shared memory for that injection > point and detach should remove it.
I see this being mentioned in the commit > message, but I think it's something we need in the first version of "wait" to > be useful. More to the point, I actually disagree with that, because it could be possible as well that the same condition variable is used across multiple points. At the end, IMHO, the central hash table should hold zero meta-data associated to an injection point like a counter, an elog, a condition variable, a sleep time, etc. or any combination of all these ones, and should just know about how to load a callback with a library path and a routine name. I understand that this is leaving the responsibility of what a callback should do down to the individual developer implementing a callback into their own extension, where they should be free to have conditions of their own. Something that I agree would be very useful for the in-core APIs, depending on the cases, is to be able to push some information to the callback at runtime to let a callback decide what to do depending on a running state, including a condition registered when a point was attached.
Re: Faster "SET search_path"
On Thu, 2023-11-16 at 16:46 -0800, Jeff Davis wrote: > While I considered OOM during hash key initialization, I missed some > other potential out-of-memory hazards. Attached a fixup patch 0003, > which re-introduces one list copy but it simplifies things > substantially in addition to being safer around OOM conditions. Committed 0003 fixup. > > > 0004: Use the same cache to optimize check_search_path(). Committed 0004. > > > 0005: Optimize cache for repeated lookups of the same value. Will commit 0005 soon. I also attached a trivial 0006 patch that uses SH_STORE_HASH. I wasn't able to show much benefit, though, even when there's a bucket collision. Perhaps there just aren't enough elements to matter -- I suppose there would be a benefit if there are lots of unique search_path strings, but that doesn't seem very plausible to me. If someone thinks it's worth committing, then I will, but I don't see any real upside or downside. Regards, Jeff Davis From 5f41c0ecc602dd183b7f6e2f23cd28c9338b3c5b Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Sun, 19 Nov 2023 14:47:04 -0800 Subject: [PATCH v10 2/2] Use SH_STORE_HASH for search_path cache. Does not change performance in expected cases, but makes performance more resilient in case of bucket collisions. 
Discussion: https://postgr.es/m/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com Discussion: https://postgr.es/m/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.ca...@j-davis.com --- src/backend/catalog/namespace.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index 37a69e9023..0b6e0d711c 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -171,11 +171,10 @@ typedef struct SearchPathCacheEntry List *oidlist; /* namespace OIDs that pass ACL checks */ List *finalPath; /* cached final computed search path */ Oid firstNS; /* first explicitly-listed namespace */ + uint32 hash; /* needed for simplehash */ bool temp_missing; bool forceRecompute; /* force recompute of finalPath */ - - /* needed for simplehash */ - char status; + char status; /* needed for simplehash */ } SearchPathCacheEntry; /* @@ -270,6 +269,8 @@ spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b) #define SH_EQUAL(tb, a, b) spcachekey_equal(a, b) #define SH_SCOPE static inline #define SH_DECLARE +#define SH_GET_HASH(tb, a) a->hash +#define SH_STORE_HASH #define SH_DEFINE #include "lib/simplehash.h" -- 2.34.1 From 46e09b225217bef79a57cc4f8450ed19be8f21ba Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Thu, 19 Oct 2023 15:48:16 -0700 Subject: [PATCH v10 1/2] Optimize SearchPathCache by saving the last entry. Repeated lookups are common, so it's worth it to check the last entry before doing another hash lookup. 
Discussion: https://postgr.es/m/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com --- src/backend/catalog/namespace.c | 88 + 1 file changed, 57 insertions(+), 31 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index 5027efc91d..37a69e9023 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -241,7 +241,8 @@ static bool MatchNamedCall(HeapTuple proctup, int nargs, List *argnames, * * The search path cache is based on a wrapper around a simplehash hash table * (nsphash, defined below). The spcache wrapper deals with OOM while trying - * to initialize a key, and also offers a more convenient API. + * to initialize a key, optimizes repeated lookups of the same key, and also + * offers a more convenient API. */ static inline uint32 @@ -281,6 +282,7 @@ spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b) #define SPCACHE_RESET_THRESHOLD 256 static nsphash_hash * SearchPathCache = NULL; +static SearchPathCacheEntry * LastSearchPathCacheEntry = NULL; /* * Create or reset search_path cache as necessary. 
@@ -295,6 +297,7 @@ spcache_init(void) return; MemoryContextReset(SearchPathCacheContext); + LastSearchPathCacheEntry = NULL; /* arbitrary initial starting size of 16 elements */ SearchPathCache = nsphash_create(SearchPathCacheContext, 16, NULL); searchPathCacheValid = true; @@ -307,12 +310,25 @@ spcache_init(void) static SearchPathCacheEntry * spcache_lookup(const char *searchPath, Oid roleid) { - SearchPathCacheKey cachekey = { - .searchPath = searchPath, - .roleid = roleid - }; + if (LastSearchPathCacheEntry && + LastSearchPathCacheEntry->key.roleid == roleid && + strcmp(LastSearchPathCacheEntry->key.searchPath, searchPath) == 0) + { + return LastSearchPathCacheEntry; + } + else + { + SearchPathCacheEntry *entry; + SearchPathCacheKey cachekey = { + .searchPath = searchPath, + .roleid = roleid + }; + + entry = nsphash_lookup(SearchPathCache, cachekey); - return nsphash_lookup(SearchPathCache, cachekey); +
Re: POC, WIP: OR-clause support for indexes
On 20.11.2023 11:52, Andrei Lepikhov wrote: Looking into the patch, I found some trivial improvements (see attachment). Also, it is not obvious that using a string representation of the clause as a hash table key is needed here. Also, by making a copy of the node in the get_key_nconst_node(), you replace the location field, but in the case of complex expression, you don't do the same with other nodes. I propose to generate expression hash instead + prove the equality of two expressions by calling equal(). I was thinking about your last email and a possible error where the location field may not be cleared in complex expressions. Unfortunately, I didn't come up with any examples either, but I think I was able to handle this with a special function that removes location-related patterns. The alternative to this is to bypass this expression, but I think it will be more invasive. In addition, I have added changes related to the hash table: now the key is of type int. All changes are displayed in the attached v9-0001-Replace-OR-clause-to_ANY.diff.txt file. I haven't measured it yet. But what do you think about these changes? -- Regards, Alena Rybakina Postgres Professional From 8e579b059ffd415fc477e0e8930e718084e58683 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Tue, 21 Nov 2023 03:22:13 +0300 Subject: [PATCH] [PATCH] Replace OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the stage of the optimiser when we are still working with a tree expression. Firstly, we do not try to make a transformation for "non-or" expressions or inequalities and the creation of a relation with "or" expressions occurs according to the same scenario. Secondly, we do not make transformations if there are less than set or_transform_limit. Thirdly, it is worth considering that we consider "or" expressions only at the current level. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas --- src/backend/parser/parse_expr.c | 280 +- src/backend/utils/misc/guc_tables.c | 10 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 115 +++ src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/join.out| 50 src/test/regress/expected/partition_prune.out | 156 ++ src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 17 ++ src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 1 + 14 files changed, 703 insertions(+), 3 deletions(-) diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 64c582c344c..290422005a3 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -18,6 +18,7 @@ #include "catalog/pg_aggregate.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" +#include "common/hashfn.h" #include "commands/dbcommands.h" #include "miscadmin.h" #include "nodes/makefuncs.h" @@ -43,6 +44,7 @@ /* GUC parameters */ bool Transform_null_equals = false; +bool enable_or_transformation = false; static Node *transformExprRecurse(ParseState *pstate, Node *expr); @@ -98,7 +100,283 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname, Node *ltree, Node *rtree, int location); static Node *make_nulltest_from_distinct(ParseState *pstate, A_Expr *distincta, Node *arg); +typedef struct OrClauseGroupEntry +{ + int32 hash_leftvar_key; + + Node *node; + List *consts; + Oid scalar_type; + Oid opno; + Node *expr; +} OrClauseGroupEntry; + +/* + * TODO: consider different algorithm to manage complexity + * in the case of many different clauses, + * like Rabin-Karp or Boyer–Moore algorithms.
+ */ +static char * +clear_patterns(const char *str, const char *start_pattern) +{ + int i = 0; + int j = 0; + int start_pattern_len = strlen(start_pattern); + char *res = palloc0(sizeof(*res) * (strlen(str) + 1)); + + for (i = 0; str[i];) + { + if (i >= start_pattern_len && strncmp(&str[i - start_pattern_len], + start_pattern, + start_pattern_len) == 0) + { + while (str[i] && !(str[i] == '{' || str[i] == '}')) +i++; + } + if (str[i]) + res[j++] = str[i++]; + } + + return res; +} + +static int32 +get_key_nconst_node(Node *nconst_node) +{ + char *str = nodeToString(nconst_node); + + str = clear_patterns(str, " :location"); + + if (str == NULL) + return 0; + + return DatumGetInt32(hash_any((unsigned char *)str, strlen(str))); +} + +static Node * +transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig) +{ + List *or_list = NIL; +
Re: Show WAL write and fsync stats in pg_stat_io
On Mon, Nov 20, 2023 at 05:43:17PM +0300, Nazir Bilal Yavuz wrote: > Yes, the timings for the writes and the syncs should work. Another > question I have in mind is the pg_stat_reset_shared() function. When > we call it with 'io' it will reset pg_stat_wal's timings and when we > call it with 'wal' it won't reset them, right? pg_stat_reset_shared() with a target is IMO a very edge case, so I'm OK with the approach of resetting timings in pg_stat_wal even if 'io' was implied because pg_stat_wal would feed partially from pg_stat_io. I'd take that as a side-cost in favor of compatibility while making the stats gathering cheaper overall. I'm OK as well if people counter-argue on this point, though that would mean to keep entirely separate views with duplicated fields that serve the same purpose, impacting all deployments because it would make the stats gathering heavier for all. -- Michael
Re: Hide exposed impl detail of wchar.c
Hi, On 2023-11-16 17:11:03 -0800, Jubilee Young wrote: > We don't directly `#include` C into Rust, but use libclang to preprocess and > compile a wrapping C header into a list of symbols Rust will look for at link > time. Our failure is in libclang and how we steer it: > - The Clang-C API (libclang.so) cannot determine where system headers are. > - A clang executable can determine where system headers are, but our bindgen > may be asked to use a libclang.so without a matching clang executable! > - This is partly because system packagers do not agree on what clang parts > must be installed together, nor even on the clang directory's layout. > - Thus, it is currently impossible to, given a libclang.so, determine with > 100% accuracy where version-appropriate system headers are and include them, > nor does it do so implicitly. I remember battling this in the past, independent of rust :( What I don't quite get is why SIMD headers are particularly more problematic than others - there's other headers that are compiler specific? Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: > On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: > > Given that, I wonder if what we should do is to just add a new field to > > pg_control that says "error out if backup_label does not exist", that we set > > when creating a streaming base backup > > That would mean that one still needs to take an extra step to update a > control file with this byte set, which is something you had a concern > with in terms of compatibility when it comes to external backup > solutions because more steps are necessary to take a backup, no? I was thinking we'd just set it in the pg_basebackup style path, and we'd error out if it's set and backup_label is present. But we'd still use backup_label without the pg_control flag set. So it'd just provide a cross-check that backup_label was not removed for pg_basebackup style backup, but wouldn't do anything for external backups. But imo the proposal to just us pg_control doesn't actually do anything for external backups either - which is why I think my proposal would achieve as much, for a much lower price. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: > Given that, I wonder if what we should do is to just add a new field to > pg_control that says "error out if backup_label does not exist", that we set > when creating a streaming base backup That would mean that one still needs to take an extra step to update a control file with this byte set, which is something you had a concern with in terms of compatibility when it comes to external backup solutions because more steps are necessary to take a backup, no? I don't quite see why it is different than what's proposed on this thread, except that you don't need to write one file to the data folder to store the backup label fields, but two, meaning that there's a risk for more mistakes because a clean backup process would require more steps. With the current position of the fields in ControlFileData, there are three free bytes after backupEndRequired, so it is possible to add that for free. Now, would you actually need an extra field knowing that backupStartPoint is around? -- Michael
Re: Why is hot_standby_feedback off by default?
On 2023-11-20 16:34:47 +0700, John Naylor wrote: > Sounds like a TODO? WFM. I don't personally use or update TODO, as I have my doubts about its usefulness or state of maintenance. But please feel free to add this as a TODO from my end...
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-20 14:18:15 -0700, David G. Johnston wrote: > On Mon, Nov 20, 2023 at 1:37 PM Andres Freund wrote: > > > > > Given that, I wonder if what we should do is to just add a new field to > > pg_control that says "error out if backup_label does not exist", that we > > set > > when creating a streaming base backup > > > > > I thought this was DOA since we don't want to ever leave the cluster in a > state where a crash requires intervention to restart. I was trying to suggest that we'd set the field in-memory, when streaming out a pg_basebackup style backup (by just replacing pg_control with an otherwise identical file that has the flag set). So it'd not have any effect on the primary. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 11:11:13AM -0500, Robert Haas wrote: > I think we need more votes to make a change this big. I have a > concern, which I think I've expressed before, that we keep whacking > around the backup APIs, and that has a cost which is potentially > larger than the benefits. The last time we changed the API, we changed > pg_stop_backup to pg_backup_stop, but this doesn't do that, and I > wonder if that's OK. Even if it is, do we really want to change this > API around again after such a short time? Agreed. > That said, I don't have an intrinsic problem with moving this > information from the backup_label to the backup_manifest file since it > is purely informational. I do think there should perhaps be some > additions to the test cases, though. Yep, cool. Even if we decide to not go with what's discussed in this patch, I think that's useful for some users at the end to get more redundancy, as well. And that's in a format easier to parse. > I am concerned about the interaction of this proposal with incremental > backup. When you take an incremental backup, you get something that > looks a lot like a usable data directory but isn't. To prevent that > from causing avoidable disasters, the present version of the patch > adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the > backup_label. pg_combinebackup knows to look for those fields, and the > server knows that if they are present it should refuse to start. With > this change, though, I suppose those fields would end up in > pg_control. But that does not feel entirely great, because we have a > goal of keeping the amount of real data in pg_control below 512 bytes, > the traditional sector size, and this adds another 12 bytes of stuff > to that file that currently doesn't need to be there. I feel like > that's kind of a problem. I don't recall one time where the addition of new fields to the control file was easy to discuss because of its 512B hard limit. 
Anyway, putting the addition aside for a second (and I've not looked at the incremental backup patch), does the removal of the backup_label make the combine logic more complicated, or is it just moving a chunk of code to do a control file lookup instead of a backup_file parse? Making the information less readable is definitely an issue for me. A different alternative that I've mentioned upthread is to keep an equivalent of the backup_label and rename it to something like backup.debug or similar, with a name good enough to tell people that we don't care about it being removed. -- Michael
Re: PANIC serves too many masters
Hi, On 2023-11-20 17:55:32 -0500, Tom Lane wrote: > Jeff Davis writes: > > Is the error level the right way to express what we want to happen? It > > seems like what we really want is to decide on the behavior, i.e. > > restart or not, and generate core or not. That could be done a > > different way, like: > > > ereport(PANIC, > > (errmsg("could not locate a valid checkpoint record"), > >errabort(false),errrestart(false))); > > Yeah, I was wondering about that too. It feels to me that > PANIC_EXIT is an error level (even more severe than PANIC). > But maybe "no core dump please" should be conveyed separately, > since it's just a minor adjustment that doesn't fundamentally > change what happens. I guess I was thinking of an error level because that'd be easier to search for in logs. It seems reasonable to want to specificially search for errors that cause core dumps, since IMO they should all be "should never happen" kind of paths. > It's plausible that you'd want a core, > or not want one, for different cases that all seem to require > PANIC_EXIT. I can't immediately think of a case where you'd want PANIC_EXIT but also want a core dump? In my mental model to use PANIC_EXIT we'd need to have a decent understanding that the situation isn't going to change after crash-restart - in which case a core dump presumably isn't interesting? > (Need a better name than PANIC_EXIT. OMIGOD?) CRITICAL? I agree with the point made upthread that we'd want leave PANIC around, it's not realistic to annotate everything, and then there's obviously also extensions (although I hope there aren't many PANICs in extensions). If that weren't the case, something like this could make sense: PANIC: crash-restart CRITICAL: crash-shutdown BUG: crash-restart, abort() Greetings, Andres Freund
Re: PANIC serves too many masters
Jeff Davis writes: > Is the error level the right way to express what we want to happen? It > seems like what we really want is to decide on the behavior, i.e. > restart or not, and generate core or not. That could be done a > different way, like: > ereport(PANIC, > (errmsg("could not locate a valid checkpoint record"), >errabort(false),errrestart(false))); Yeah, I was wondering about that too. It feels to me that PANIC_EXIT is an error level (even more severe than PANIC). But maybe "no core dump please" should be conveyed separately, since it's just a minor adjustment that doesn't fundamentally change what happens. It's plausible that you'd want a core, or not want one, for different cases that all seem to require PANIC_EXIT. (Need a better name than PANIC_EXIT. OMIGOD?) regards, tom lane
Re: Hide exposed impl detail of wchar.c
On Mon, Nov 20, 2023 at 10:50:36AM -0800, Jubilee Young wrote: > In that case, I took a look across the codebase and saw a > utils/ascii.h that doesn't > seem to have gotten much love, but I suppose one could argue that it's > intended > to be a backend-only header file? That might work. It's not #included in very many files, so adding port/simd.h shouldn't be too bad. And ascii.h is also pretty inexpensive, so including it in wchar.c seems permissible, too. I'm not certain this doesn't cause problems with libpgcommon, but I don't see why it would, either. > So it should probably end up living somewhere near the UTF-8 support, and > the easiest way to make it not go into something pgrx currently > includes would be > to make it a new header file, though there's a fair amount of API we > don't touch. Does pgrx use ascii.h at all? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/src/common/wchar.c b/src/common/wchar.c index fb9d9f5c85..fbac11deb4 100644 --- a/src/common/wchar.c +++ b/src/common/wchar.c @@ -13,6 +13,7 @@ #include "c.h" #include "mb/pg_wchar.h" +#include "utils/ascii.h" /* diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h index 29cd5732f1..80676d9e02 100644 --- a/src/include/mb/pg_wchar.h +++ b/src/include/mb/pg_wchar.h @@ -22,8 +22,6 @@ #ifndef PG_WCHAR_H #define PG_WCHAR_H -#include "port/simd.h" - /* * The pg_wchar type */ @@ -722,71 +720,4 @@ extern int mic2latin_with_table(const unsigned char *mic, unsigned char *p, extern WCHAR *pgwin32_message_to_UTF16(const char *str, int len, int *utf16len); #endif - -/* - * Verify a chunk of bytes for valid ASCII. - * - * Returns false if the input contains any zero bytes or bytes with the - * high-bit set. Input len must be a multiple of the chunk size (8 or 16). 
- */ -static inline bool -is_valid_ascii(const unsigned char *s, int len) -{ - const unsigned char *const s_end = s + len; - Vector8 chunk; - Vector8 highbit_cum = vector8_broadcast(0); -#ifdef USE_NO_SIMD - Vector8 zero_cum = vector8_broadcast(0x80); -#endif - - Assert(len % sizeof(chunk) == 0); - - while (s < s_end) - { - vector8_load(&chunk, s); - - /* Capture any zero bytes in this chunk. */ -#ifdef USE_NO_SIMD - - /* - * First, add 0x7f to each byte. This sets the high bit in each byte, - * unless it was a zero. If any resulting high bits are zero, the - * corresponding high bits in the zero accumulator will be cleared. - * - * If none of the bytes in the chunk had the high bit set, the max - * value each byte can have after the addition is 0x7f + 0x7f = 0xfe, - * and we don't need to worry about carrying over to the next byte. If - * any input bytes did have the high bit set, it doesn't matter - * because we check for those separately. - */ - zero_cum &= (chunk + vector8_broadcast(0x7F)); -#else - - /* - * Set all bits in each lane of the highbit accumulator where input - * bytes are zero. - */ - highbit_cum = vector8_or(highbit_cum, - vector8_eq(chunk, vector8_broadcast(0))); -#endif - - /* Capture all set bits in this chunk. */ - highbit_cum = vector8_or(highbit_cum, chunk); - - s += sizeof(chunk); - } - - /* Check if any high bits in the high bit accumulator got set. */ - if (vector8_is_highbit_set(highbit_cum)) - return false; - -#ifdef USE_NO_SIMD - /* Check if any high bits in the zero accumulator got cleared. 
*/ - if (zero_cum != vector8_broadcast(0x80)) - return false; -#endif - - return true; -} - #endif /* PG_WCHAR_H */ diff --git a/src/include/utils/ascii.h b/src/include/utils/ascii.h index 630acd9bfd..7df024dad3 100644 --- a/src/include/utils/ascii.h +++ b/src/include/utils/ascii.h @@ -11,6 +11,74 @@ #ifndef _ASCII_H_ #define _ASCII_H_ +#include "port/simd.h" + extern void ascii_safe_strlcpy(char *dest, const char *src, size_t destsiz); +/* + * Verify a chunk of bytes for valid ASCII. + * + * Returns false if the input contains any zero bytes or bytes with the + * high-bit set. Input len must be a multiple of the chunk size (8 or 16). + */ +static inline bool +is_valid_ascii(const unsigned char *s, int len) +{ + const unsigned char *const s_end = s + len; + Vector8 chunk; + Vector8 highbit_cum = vector8_broadcast(0); +#ifdef USE_NO_SIMD + Vector8 zero_cum = vector8_broadcast(0x80); +#endif + + Assert(len % sizeof(chunk) == 0); + + while (s < s_end) + { + vector8_load(&chunk, s); + + /* Capture any zero bytes in this chunk. */ +#ifdef USE_NO_SIMD + + /* + * First, add 0x7f to each byte. This sets the high bit in each byte, + * unless it was a zero. If any resulting high bits are zero, the + * corresponding high bits in the zero accumulator will be cleared. + * + * If none of the bytes in the chunk had the high bit set, the max + * value each byte can have after the addition is 0x7f + 0x7f = 0xfe, + * and we don't need to worry about carrying over to the next byte. If + * any input bytes did have the high bit set, it doesn't
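The USE_NO_SIMD comments in the patch describe a scalar bit trick: adding 0x7f to each byte sets its high bit unless the byte was zero. A standalone sketch of that non-SIMD path, using a plain uint64 where the patch uses the Vector8 type (the function name here is illustrative, not the PostgreSQL API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Scalar version of the zero/high-bit detection described above.
 * len must be a multiple of the 8-byte chunk size, as in the original. */
static bool
ascii_chunk_check(const unsigned char *s, int len)
{
	uint64_t	highbit_cum = 0;
	uint64_t	zero_cum = UINT64_C(0x8080808080808080);

	assert(len % 8 == 0);

	for (int i = 0; i < len; i += 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, s + i, sizeof(chunk));

		/* 0x00 + 0x7f = 0x7f: a zero byte leaves its high bit clear,
		 * which clears the corresponding bit in the accumulator. */
		zero_cum &= chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);

		/* Accumulate set bits; any non-ASCII byte sets a high bit. */
		highbit_cum |= chunk;
	}

	if (highbit_cum & UINT64_C(0x8080808080808080))
		return false;			/* some byte had the high bit set */
	if (zero_cum != UINT64_C(0x8080808080808080))
		return false;			/* some byte was zero */
	return true;
}
```

Note that carries from the 0x7f addition can only cross byte boundaries when an input byte already has its high bit set, and those inputs are rejected by the highbit check anyway — exactly the argument made in the patch's comment.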
Re: Partial aggregates pushdown
On Mon, Nov 20, 2023 at 03:51:33PM -0500, Robert Haas wrote: > On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp > wrote: > > In postgres_fdw.sql, I have corrected the output format for floating point > > numbers > > by extra_float_digits. > > Looking at this, I find that it's not at all clear to me how the > partial aggregate function is defined. Let's look at what we have for > documentation: > > + > + Paraemter AGGPARTIALFUNC optionally defines a > + partial aggregate function used for partial aggregate pushdown; see > +for details. > + > > + Partial aggregate function (zero if none). > + See for the definition > + of partial aggregate function. > > + Partial aggregate pushdown is an optimization for queries that contains > + aggregate expressions for a partitioned table across one or more remote > + servers. If multiple conditions are met, partial aggregate function > > + When partial aggregate pushdown is used for aggregate expressions, > + remote queries replace aggregate function calls with partial > + aggregate function calls. If the data type of the state value is not > > But there's no definition of what the behavior of the function is > anywhere that I can see, not even in id="partial-aggregate-pushdown">. Everywhere it only describes how the > partial aggregate function is used, not what it is supposed to do. Yes, I had to figure that out myself, and I was wondering how much detail to have in our docs vs README files vs. C comments. I think we should put more details somewhere. 
> Looking at the changes in pg_aggregate.dat, it seems like the partial > aggregate function is a second aggregate defined in a way that mostly > matches the original, except that (1) if the original final function > would have returned a data type other than internal, then the final > function is removed; and (2) if the original final function would have > returned a value of internal type, then the final function is the > serialization function of the original aggregate. I think that's a > reasonable definition, but the documentation and code comments need to > be a lot clearer. Agreed. I wasn't sure enough about this to add it when I was reviewing the patch. > I do have a concern about this, though. It adds a lot of bloat. It > adds a whole lot of additional entries to pg_aggregate, and every new > aggregate we add in the future will require a bonus entry for this, > and it needs a bunch of new pg_proc entries as well. One idea that > I've had in the past is to instead introduce syntax that just does > this, without requiring a separate aggregate definition in each case. > For example, maybe instead of changing string_agg(whatever) to > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or > something. Then all aggregates could be treated in a generic way. I'm > not completely sure that's better, but I think it's worth considering. So use an SQL keyword to indicate a pushdown call? We could then automate the behavior rather than requiring special catalog functions? > I think that the control mechanism needs some thought. Right now, > there are two possible behaviors: either we assume that the local and > remote sides are the same unconditionally, or we assume that they're > the same if the remote side is a new enough version. I do like having > those behaviors available, but I wonder if we need to do something > better or different. 
What if somebody wants to push down a > non-built-in aggregate, for example? I realize that we don't have It does allow specification of extensions that can be pushed down. > great solutions to the problem of knowing which functions are > push-downable in general, and I don't know that partial aggregation > needs to be any better than anything else, but it's probably worth > comparing and contrasting the approach we take here with the > approaches we've taken in other, similar cases. From that point of > view, I think check_partial_aggregate_support is a novelty: we don't > do those kinds of checks in other cases, AFAIK. But on the other hand, > there is the 'extensions' argument to postgres_fdw. Right. I am not sure how to improve what the patch does. > I don't think the patch does a good job explaining why HAVING, > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING > shouldn't really be a problem, because HAVING is basically a WHERE > clause that occurs after aggregation is complete, and whether or not > the aggregation is safe shouldn't depend on what we're going to do > with the value afterward. The HAVING clause can't necessarily be > pushed to the remote side, but I don't see how or why it could make > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a > little trickier: if we pushed down DISTINCT, we'd still have to > re-DISTINCT-ify when combining locally, and if we pushed down ORDER
Re: PANIC serves too many masters
On Mon, 2023-11-20 at 17:12 -0500, Tom Lane wrote: > I'd be inclined to keep PANIC with its current meaning, and > incrementally change call sites where we decide that's not the > best behavior. I think those will be a minority, maybe a small > minority. (PANIC_EXIT had darn well better be a small minority.) Is the error level the right way to express what we want to happen? It seems like what we really want is to decide on the behavior, i.e. restart or not, and generate core or not. That could be done a different way, like: ereport(PANIC, (errmsg("could not locate a valid checkpoint record"), errabort(false),errrestart(false))); Regards, Jeff Davis
Re: PSQL error: total cell count of XXX exceeded
Alvaro Herrera writes: > I think we're bound to hit this limit at some point in the future, and > it seems easy enough to solve. I propose the attached, which is pretty > much what Hongxu last submitted, with some minor changes. This bit needs more work: - content->cells = pg_malloc0((ncolumns * nrows + 1) * sizeof(*content->cells)); + total_cells = (int64) ncolumns * nrows; + content->cells = pg_malloc0((total_cells + 1) * sizeof(*content->cells)); You've made the computation of total_cells reliable, but there's nothing stopping the subsequent computation of the malloc argument from overflowing (especially on 32-bit machines). I think we need an explicit test along the lines of if (total_cells >= SIZE_MAX / sizeof(*content->cells)) throw error; (">=" allows not needing to add +1.) Also, maybe total_cells should be uint64? We don't want negative values to pass this test. Alternatively, add a separate check that total_cells >= 0. It should be sufficient to be paranoid about this in printTableInit, since after that we know the product of ncolumns * nrows isn't too big. The rest of this passes an eyeball check. regards, tom lane
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On Mon, 2023-11-20 at 15:52 -0500, Robert Haas wrote: > I agree. Not to burden you, but do you know what the overhead is now, > and do you have any plans to further reduce it? I don't believe > that's > the only thing we ought to be doing here, necessarily, but it is one > thing that we definitely should be doing and probably the least > controversial. Running the simple test described here: https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com The baseline (no "SET search_path" clause on the function) is around 3800ms, and with the clause it shoots up to 8000ms. That's not good, but it is down from about 12000ms before. There are a few patches in the queue to bring it down further. Andres and I were discussing some GUC hashtable optimizations here: https://www.postgresql.org/message-id/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.ca...@j-davis.com which will (if committed) bring it down into the mid 7s. There are also a couple other patches I have here (and intend to commit soon): https://www.postgresql.org/message-id/e6fded24cb8a2c53d4ef069d9f69cc7baaafe9ef.camel%40j-davis.com and those I think will get it into the mid 6s. I think a bit lower combined with the GUC hash table optimizations above. So we are still looking at around 50% overhead for a simple function if all this stuff gets committed. Not great, but a lot better than before. Of course I welcome others to profile and see what they can do. There's a setjmp() call, and a couple allocations, and maybe some other stuff to look at. There are also higher-level ideas, like avoiding calling into guc.c in some cases, but that starts to get tricky as you pointed out: https://www.postgresql.org/message-id/CA%2BTgmoa8uKQgak5wH0%3D7sL-ukqbwnCPMXA2ZW7Ccdt7tdNGkzg%40mail.gmail.com It seems others are also interested in this problem, so I can put some more effort in after this round of patches goes in. 
I don't have a specific target other than "low enough overhead that we can reasonably recommend it as a best practice in multi-user environments". Regards, Jeff Davis
Re: PANIC serves too many masters
Jeff Davis writes: > On Sat, 2023-11-18 at 14:29 -0800, Andres Freund wrote: >> I don't quite know what we should do. But the current situation >> decidedly >> doesn't seem great. > Agreed. +1 > Better classification is nice, but it also requires more > discipline and it might not always be obvious which category something > fits in. What about an error loop resulting in: > ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE exceeded"))); > We'd want a core file, but I don't think we want to restart in that > case, right? Why not restart? There's no strong reason to assume this will repeat. It might be worth having some independent logic in the postmaster that causes it to give up after too many crashes in a row. But with many/most of these call sites, by definition we're not too sure what is wrong. > Also, can we do a change like this incrementally by updating a few > PANIC sites at a time? Is it fine to leave plain PANICs in place for > the foreseeable future, or do you want all of them to eventually move? I'd be inclined to keep PANIC with its current meaning, and incrementally change call sites where we decide that's not the best behavior. I think those will be a minority, maybe a small minority. (PANIC_EXIT had darn well better be a small minority.) regards, tom lane
Re: Annoying build warnings from latest Apple toolchain
I wrote: > The autoconf side seems to just be letting this option default. > I'm not sure what the default choice is, but evidently it's not > "-undefined error"? Or were they stupid enough to not allow you > to explicitly select the default behavior? Seems we are not the only project having trouble with this: https://github.com/mesonbuild/meson/issues/12450 I had not realized that Apple recently wrote themselves a whole new linker, but apparently that's why all these deprecation warnings are showing up. It's not exactly clear whether "deprecation" means they actually plan to remove the feature later, or just that some bozo decided that explicitly specifying the default behavior is bad style. regards, tom lane
Re: PANIC serves too many masters
Hi, On Sat, 2023-11-18 at 14:29 -0800, Andres Freund wrote: > I don't quite know what we should do. But the current situation > decidedly > doesn't seem great. Agreed. Better classification is nice, but it also requires more discipline and it might not always be obvious which category something fits in. What about an error loop resulting in: ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE exceeded"))); We'd want a core file, but I don't think we want to restart in that case, right? Also, can we do a change like this incrementally by updating a few PANIC sites at a time? Is it fine to leave plain PANICs in place for the foreseeable future, or do you want all of them to eventually move? Regards, Jeff Davis
Re: Annoying build warnings from latest Apple toolchain
Robert Haas writes: > On Mon, Nov 20, 2023 at 3:53 PM Tom Lane wrote: >> 13.6.2? longfin's host is on 13.6.1, and the only thing Software >> Update is showing me is an option to upgrade to Sonoma. But anyway... > Uh, I guess Apple made a special version just for me? That's > definitely what it says. Might be for M-series only; longfin's host is still Intel. >> Hmm ... I fixed these things in the autoconf build: neither my >> buildfarm animals nor manual builds show any warnings. I thought >> the problems weren't there in the meson build. Need to take another >> look I guess. > They're definitely there for me, and there are a whole lot of them. I > would have thought that if they were there for you in the meson build > you would have noticed, since ninja suppresses a lot of distracting > output that make prints. I'm generally still using autoconf, I only run meson builds when somebody complains about them ;-). But yeah, I see lots of "ld: warning: -undefined error is deprecated" when I do that. This seems to have been installed by Andres' 9a95a510a: ldflags += ['-isysroot', pg_sysroot] + # meson defaults to -Wl,-undefined,dynamic_lookup for modules, which we + # don't want because a) it's different from what we do for autoconf, b) it + # causes warnings starting in macOS Ventura + ldflags_mod += ['-Wl,-undefined,error'] The autoconf side seems to just be letting this option default. I'm not sure what the default choice is, but evidently it's not "-undefined error"? Or were they stupid enough to not allow you to explicitly select the default behavior? Also, I *don't* see any complaints about duplicate libraries. What build options are you using? regards, tom lane
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 1:37 PM Andres Freund wrote: > > Given that, I wonder if what we should do is to just add a new field to > pg_control that says "error out if backup_label does not exist", that we > set > when creating a streaming base backup > > I thought this was DOA since we don't want to ever leave the cluster in a state where a crash requires intervention to restart. But I agree that it is not possible to fool-proof against a naive backup that copies over the pg_control file as-is if breaking the crashed cluster option is not in play. I agree that this works if the pg_control generated by stop backup produces the line and we retain the label file as a separate and now mandatory component to using the backup. Or is the idea to make v17 error if it sees a backup label unless pg_control has the feature flag field? Which doesn't exist normally, does in the basebackup version, and is removed once the backup is restored? David J.
Re: Annoying build warnings from latest Apple toolchain
On Mon, Nov 20, 2023 at 3:53 PM Tom Lane wrote: > 13.6.2? longfin's host is on 13.6.1, and the only thing Software > Update is showing me is an option to upgrade to Sonoma. But anyway... Uh, I guess Apple made a special version just for me? That's definitely what it says. > > [2264/2287] Linking target src/interfaces/ecpg/test/sql/parser > > ld: warning: -undefined error is deprecated > > ld: warning: ignoring duplicate libraries: '-lz' > > Hmm ... I fixed these things in the autoconf build: neither my > buildfarm animals nor manual builds show any warnings. I thought > the problems weren't there in the meson build. Need to take another > look I guess. They're definitely there for me, and there are a whole lot of them. I would have thought that if they were there for you in the meson build you would have noticed, since ninja suppresses a lot of distracting output that make prints. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Annoying build warnings from latest Apple toolchain
Robert Haas writes: > Is there still outstanding work on this thread? Because I'm just now > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this > kind of thing in a meson build: 13.6.2? longfin's host is on 13.6.1, and the only thing Software Update is showing me is an option to upgrade to Sonoma. But anyway... > [2264/2287] Linking target src/interfaces/ecpg/test/sql/parser > ld: warning: -undefined error is deprecated > ld: warning: ignoring duplicate libraries: '-lz' Hmm ... I fixed these things in the autoconf build: neither my buildfarm animals nor manual builds show any warnings. I thought the problems weren't there in the meson build. Need to take another look I guess. regards, tom lane
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On Tue, Nov 14, 2023 at 11:21 PM Jeff Davis wrote: > After adding the search path cache (recent commit f26c2368dc) hopefully > that helps to make the above suggestion more reasonable performance- > wise. I think we can call that progress. I agree. Not to burden you, but do you know what the overhead is now, and do you have any plans to further reduce it? I don't believe that's the only thing we ought to be doing here, necessarily, but it is one thing that we definitely should be doing and probably the least controversial. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Partial aggregates pushdown
On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp wrote: > In postgres_fdw.sql, I have corrected the output format for floating point > numbers > by extra_float_digits. Looking at this, I find that it's not at all clear to me how the partial aggregate function is defined. Let's look at what we have for documentation: + + Paraemter AGGPARTIALFUNC optionally defines a + partial aggregate function used for partial aggregate pushdown; see +for details. + + Partial aggregate function (zero if none). + See for the definition + of partial aggregate function. + Partial aggregate pushdown is an optimization for queries that contains + aggregate expressions for a partitioned table across one or more remote + servers. If multiple conditions are met, partial aggregate function + When partial aggregate pushdown is used for aggregate expressions, + remote queries replace aggregate function calls with partial + aggregate function calls. If the data type of the state value is not But there's no definition of what the behavior of the function is anywhere that I can see, not even in . Everywhere it only describes how the partial aggregate function is used, not what it is supposed to do. Looking at the changes in pg_aggregate.dat, it seems like the partial aggregate function is a second aggregate defined in a way that mostly matches the original, except that (1) if the original final function would have returned a data type other than internal, then the final function is removed; and (2) if the original final function would have returned a value of internal type, then the final function is the serialization function of the original aggregate. I think that's a reasonable definition, but the documentation and code comments need to be a lot clearer. I do have a concern about this, though. It adds a lot of bloat. 
It adds a whole lot of additional entries to pg_aggregate, and every new aggregate we add in the future will require a bonus entry for this, and it needs a bunch of new pg_proc entries as well. One idea that I've had in the past is to instead introduce syntax that just does this, without requiring a separate aggregate definition in each case. For example, maybe instead of changing string_agg(whatever) to string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or something. Then all aggregates could be treated in a generic way. I'm not completely sure that's better, but I think it's worth considering. I think that the control mechanism needs some thought. Right now, there are two possible behaviors: either we assume that the local and remote sides are the same unconditionally, or we assume that they're the same if the remote side is a new enough version. I do like having those behaviors available, but I wonder if we need to do something better or different. What if somebody wants to push down a non-built-in aggregate, for example? I realize that we don't have great solutions to the problem of knowing which functions are push-downable in general, and I don't know that partial aggregation needs to be any better than anything else, but it's probably worth comparing and contrasting the approach we take here with the approaches we've taken in other, similar cases. From that point of view, I think check_partial_aggregate_support is a novelty: we don't do those kinds of checks in other cases, AFAIK. But on the other hand, there is the 'extensions' argument to postgres_fdw. I don't think the patch does a good job explaining why HAVING, DISTINCT, and ORDER BY are a problem. 
It seems to me that HAVING shouldn't really be a problem, because HAVING is basically a WHERE clause that occurs after aggregation is complete, and whether or not the aggregation is safe shouldn't depend on what we're going to do with the value afterward. The HAVING clause can't necessarily be pushed to the remote side, but I don't see how or why it could make the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a little trickier: if we pushed down DISTINCT, we'd still have to re-DISTINCT-ify when combining locally, and if we pushed down ORDER BY, we'd have to do a merge pass to combine the returned values unless we could prove that the partitions were non-overlapping ranges that would be visited in the correct order. Although that all sounds doable, I think it's probably a good thing that the current patch doesn't try to handle it -- this is complicated already. But it should explain why it's not handling it and maybe even a bit about how it could be handled in the future, rather than just saying "well, this kind of thing is not safe." The trouble with that explanation is that it does nothing to help the reader understand whether the thing in question is *fundamentally* unsafe or whether we just don't have the right code to make it work. Typo: Paraemter I'm so sorry to keep complaining about
Re: PSQL error: total cell count of XXX exceeded
On 2023-Sep-12, Tom Lane wrote: > I'm more than a bit skeptical about trying to do something about this, > simply because this range of query result sizes is far past what is > practical. The OP clearly hasn't tested his patch on actually > overflowing query results, and I don't care to either. I think we're bound to hit this limit at some point in the future, and it seems easy enough to solve. I propose the attached, which is pretty much what Hongxu last submitted, with some minor changes. Having this make a difference requires some 128GB of RAM, so it's not a piece of cake, but it's an amount that can be reasonably expected to be physically installed in real machines nowadays. (I first thought we could just use pg_mul_s32_overflow during printTableInit and raise an error if that returns true, but that just postpones the problem.) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ Subversion to GIT: the shortest path to happiness I've ever heard of (Alexey Klyukin) >From 75abe5a485532cbc03db1eade2e1a6099914f98f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 20 Nov 2023 17:35:18 +0100 Subject: [PATCH v5] Avoid overflow in fe_utils' printTable API We were using 32-bit integer arithmetic to compute the total number of cells, which can overflow when used with large resultsets. We still don't allow more than 4 billion rows, but now the total number of cells can exceed that. 
Author: Hongxu Ma Discussion: https://postgr.es/m/tybp286mb0351b057b101c90d7c1239e6b4...@tybp286mb0351.jpnp286.prod.outlook.com --- src/fe_utils/print.c | 22 ++ src/include/fe_utils/print.h | 2 +- 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/src/fe_utils/print.c b/src/fe_utils/print.c index 7af1ccb6b5..565cbc6d13 100644 --- a/src/fe_utils/print.c +++ b/src/fe_utils/print.c @@ -3172,6 +3172,8 @@ void printTableInit(printTableContent *const content, const printTableOpt *opt, const char *title, const int ncolumns, const int nrows) { + int64 total_cells; + content->opt = opt; content->title = title; content->ncolumns = ncolumns; @@ -3179,7 +3181,8 @@ printTableInit(printTableContent *const content, const printTableOpt *opt, content->headers = pg_malloc0((ncolumns + 1) * sizeof(*content->headers)); - content->cells = pg_malloc0((ncolumns * nrows + 1) * sizeof(*content->cells)); + total_cells = (int64) ncolumns * nrows; + content->cells = pg_malloc0((total_cells + 1) * sizeof(*content->cells)); content->cellmustfree = NULL; content->footers = NULL; @@ -3249,15 +3252,17 @@ void printTableAddCell(printTableContent *const content, char *cell, const bool translate, const bool mustfree) { + int64 total_cells; + #ifndef ENABLE_NLS (void) translate; /* unused parameter */ #endif - if (content->cellsadded >= content->ncolumns * content->nrows) + total_cells = (int64) content->ncolumns * content->nrows; + if (content->cellsadded >= total_cells) { - fprintf(stderr, _("Cannot add cell to table content: " - "total cell count of %d exceeded.\n"), -content->ncolumns * content->nrows); + fprintf(stderr, _("Cannot add cell to table content: total cell count of %lld exceeded.\n"), +(long long int) total_cells); exit(EXIT_FAILURE); } @@ -3273,7 +3278,7 @@ printTableAddCell(printTableContent *const content, char *cell, { if (content->cellmustfree == NULL) content->cellmustfree = -pg_malloc0((content->ncolumns * content->nrows + 1) * sizeof(bool)); +pg_malloc0((total_cells 
+ 1) * sizeof(bool)); content->cellmustfree[content->cellsadded] = true; } @@ -3341,9 +3346,10 @@ printTableCleanup(printTableContent *const content) { if (content->cellmustfree) { - int i; + int64 total_cells; - for (i = 0; i < content->nrows * content->ncolumns; i++) + total_cells = (int64) content->ncolumns * content->nrows; + for (int64 i = 0; i < total_cells; i++) { if (content->cellmustfree[i]) free(unconstify(char *, content->cells[i])); diff --git a/src/include/fe_utils/print.h b/src/include/fe_utils/print.h index 13ebb00362..a697d0ba8d 100644 --- a/src/include/fe_utils/print.h +++ b/src/include/fe_utils/print.h @@ -171,7 +171,7 @@ typedef struct printTableContent const char **cells; /* NULL-terminated array of cell content * strings */ const char **cell; /* Pointer to the last added cell */ - long cellsadded; /* Number of cells added this far */ + int64 cellsadded; /* Number of cells added this far */ bool *cellmustfree; /* true for cells that need to be free()d */ printTableFooter *footers; /* Pointer to the first footer */ printTableFooter *footer; /* Pointer to the last added footer */ -- 2.39.2
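The key detail in the patch above is that the cast happens before the multiply: "(int64) ncolumns * nrows" promotes one operand first, so the product is computed in 64 bits. A minimal standalone illustration (not PostgreSQL code):

```c
#include <stdint.h>

/* Mirrors the patch's pattern: cast one operand to a 64-bit type
 * before multiplying, so two large int counts cannot wrap.  Writing
 * "ncolumns * nrows" without the cast would multiply in 32 bits and
 * overflow, which is the bug being fixed. */
static int64_t
total_cell_count(int ncolumns, int nrows)
{
	return (int64_t) ncolumns * nrows;
}
```

With 100,000 columns and 100,000 rows the true product is 10^10, well past INT32_MAX, and the cast-first version returns it intact.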
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-20 15:56:19 -0400, David Steele wrote: > I understand this is an option -- but does it need to be? What is the > benefit of excluding the manifest? It's not free to create the manifest, particularly if checksums are enabled. Also, for external backups, there's no manifest... - Andres
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-20 11:11:13 -0500, Robert Haas wrote: > I think we need more votes to make a change this big. I have a > concern, which I think I've expressed before, that we keep whacking > around the backup APIs, and that has a cost which is potentially > larger than the benefits. +1. The amount of whacking around in this area has been substantial, and it's hard for operators to keep up. And realistically, with data sizes today, the pressure to do basebackups with disk snapshots etc is not going to shrink. Leaving that concern aside, I am still on the fence about this proposal. I think it does decrease the chance of getting things wrong in the streaming-basebackup case. But for external backups, it seems almost universally worse (with the exception of the torn pg_control issue, that we also can address otherwise): It doesn't reduce the risk of getting things wrong, you can still omit placing a file into the data directory and get silent corruption as a consequence. In addition, it's harder to see when looking at a base backup whether the process was right or not, because now the good and bad state look the same if you just look on the filesystem level! Then there's the issue of making ad-hoc debugging harder by not having a human readable file with information anymore, including when looking at the history, via backup_label.old. Given that, I wonder if what we should do is to just add a new field to pg_control that says "error out if backup_label does not exist", that we set when creating a streaming base backup Greetings, Andres Freund
Re: Implement missing join selectivity estimation for range types
On 14/11/2023 20:46, Tom Lane wrote: > I took a brief look through this very interesting work. I concur > with Tomas that it feels a little odd that range join selectivity > would become smarter than scalar inequality join selectivity, and > that we really ought to prioritize applying these methods to that > case. Still, that's a poor reason to not take the patch. Indeed, we started with ranges as this was the simpler case (no MCV) and was the topic of a course project. The idea is to later write a second patch that applies these ideas to scalar inequality while also handling MCV's correctly. > I also agree with the upthread criticism that having two identical > functions in different source files will be a maintenance nightmare. > Don't do it. When and if there's a reason for the behavior to > diverge between the range and multirange cases, it'd likely be > better to handle that by passing in a flag to say what to do. The duplication is indeed not ideal. However, there are already 8 other duplicate functions between the two files. I would thus suggest to leave the duplication in this patch and create a second one that removes all duplication from the two files, instead of just removing the duplication for our new function. What are your thoughts on this? If we do this, should the function definitions go in rangetypes.h or should we create a new rangetypes_selfuncs.h header? > But my real unhappiness with the patch as-submitted is the test cases, > which require rowcount estimates to be reproduced exactly. > We need a more forgiving test method. Usually the > approach is to set up a test case where the improved accuracy of > the estimate changes the planner's choice of plan compared to what > you got before, since that will normally not be too prone to change > from variations of a percent or two in the estimates. I have changed the test method to produce query plans for a 3-way range join. 
The plans for the different operators differ due to the computed selectivity estimation, which was not the case before this patch.

Regards,
Maxime Schoemans

diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index cefc4710fd..c670d225a0 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1335,3 +1335,558 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * This is a utility function used to estimate the join selectivity of
+ * range attributes using rangebound histogram statistics as described
+ * in this paper:
+ *
+ * Diogo Repas, Zhicheng Luo, Maxime Schoemans and Mahmoud Sakr, 2022
+ * Selectivity Estimation of Inequality Joins In Databases
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * The attributes being joined will be treated as random variables
+ * that follow a distribution modeled by a Probability Density Function (PDF).
+ * Let the two attributes be denoted X, Y.
+ * This function finds the probability P(X < Y).
+ * Note that the PDFs of the two variables can easily be obtained
+ * from their bounds histograms, respectively hist1 and hist2.
+ *
+ * Let the PDF of X, Y be denoted as f_X, f_Y.
+ * The probability P(X < Y) can be formalized as follows:
+ * P(X < Y) = integral_-inf^inf( integral_-inf^y ( f_X(x) * f_Y(y) dx dy ) )
+ *          = integral_-inf^inf( F_X(y) * f_Y(y) dy )
+ * where F_X(y) denotes the Cumulative Distribution Function of X at y.
+ * Note that F_X is the selectivity estimation (non-join),
+ * which is implemented using the function calc_hist_selectivity_scalar.
+ *
+ * Now given the histograms of the two attributes X, Y, we note the following:
+ * - The PDF of Y is a step function
+ *   (constant piece-wise, where each piece is defined in a bin of Y's histogram)
+ * - The CDF of X is linear piece-wise
+ *   (each piece is defined in a bin of X's histogram)
+ * This leads to the conclusion that their product
+ * (used to calculate the equation above) is also linear piece-wise.
+ * A new piece starts whenever either the bin of X or the bin of Y changes.
+ * By parallel scanning the two rangebound histograms of X and Y,
+ * we evaluate one piece of the result between every two consecutive rangebounds
+ * in the union of the two histograms.
+ *
+ * Given that the product F_X * f_Y is linear in the interval
+ * between every two consecutive rangebounds, let them be denoted prev, cur,
+ * it can be shown that the above formula can be discretized into the following:
+ * P(X < Y) =
+ *   0.5 * sum_0^{n+m-1} ( ( F_X(prev) + F_X(cur) ) * ( F_Y(cur) - F_Y(prev) ) )
+ * where n, m are the lengths of the two histograms.
+ *
+ * As such, it is possible to fully compute the join selectivity
+ * as a summation of CDFs, iterating over the bounds of the two histograms.
+ * This maximizes the code reuse, since the CDF is computed using
+ * the calc_hist_selectivity_scalar function, which is the function used
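The discretized summation described in the patch comment above can be exercised outside the server. Below is a self-contained approximation (helper names invented here; the actual patch works on `RangeBound` arrays with typcache comparisons and reuses `calc_hist_selectivity_scalar` for the CDF):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Linearly interpolated CDF over a sorted array of bin bounds, treating
 * the histogram as equi-depth (each of the nbounds-1 bins holds the same
 * probability mass).  This stands in for calc_hist_selectivity_scalar.
 */
static double
hist_cdf(double v, const double *bounds, int nbounds)
{
	if (v <= bounds[0])
		return 0.0;
	if (v >= bounds[nbounds - 1])
		return 1.0;
	for (int i = 0; i < nbounds - 1; i++)
	{
		if (v < bounds[i + 1])
			return (i + (v - bounds[i]) / (bounds[i + 1] - bounds[i])) /
				(nbounds - 1);
	}
	return 1.0;
}

/*
 * P(X < Y), discretized as in the comment above:
 *   0.5 * sum( (F_X(prev) + F_X(cur)) * (F_Y(cur) - F_Y(prev)) )
 * evaluated between consecutive bounds in the union of the two
 * histograms, visited with a parallel (merge-style) scan.
 */
static double
join_sel_lt(const double *hx, int nx, const double *hy, int ny)
{
	double		lo = (hx[0] < hy[0]) ? hx[0] : hy[0];
	double		prev_fx = hist_cdf(lo, hx, nx);
	double		prev_fy = hist_cdf(lo, hy, ny);
	double		sel = 0.0;
	int			ix = 0,
				iy = 0;

	while (ix < nx || iy < ny)
	{
		double		cur;
		double		fx,
					fy;

		/* take the smaller next bound, advancing that histogram's cursor */
		if (iy >= ny || (ix < nx && hx[ix] <= hy[iy]))
			cur = hx[ix++];
		else
			cur = hy[iy++];

		fx = hist_cdf(cur, hx, nx);
		fy = hist_cdf(cur, hy, ny);
		sel += 0.5 * (prev_fx + fx) * (fy - prev_fy);
		prev_fx = fx;
		prev_fy = fy;
	}
	return sel;
}
```

Sanity checks that fall out of the math: two identical distributions give P(X < Y) = 0.5, and disjoint ranges give 1.0 or 0.0 depending on order.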
Re: Annoying build warnings from latest Apple toolchain
Hi, On 2023-11-20 14:46:13 -0500, Robert Haas wrote: > On Mon, Nov 20, 2023 at 2:35 PM Andres Freund wrote: > > > Is there still outstanding work on this thread? Because I'm just now > > > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this > > > kind of thing in a meson build: > > > > Ventura? In that case I assume you installed newer developer tools? Because > > otherwise I think we were talking about issues introduced in Sonoma. > > I don't think I did anything special when installing developer tools. > xcode-select --version reports 2397, if that tells you anything. Odd then. My m1-mini running Ventura, also reporting 2397, doesn't show any of those warnings. I did a CI run with Sonoma, and that does show them. I'm updating said m1-mini it to Sonoma now, but that will take until I have to leave for an appointment. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 15:47, Robert Haas wrote: On Mon, Nov 20, 2023 at 2:41 PM David Steele wrote: I can't see why a backup would continue to be valid without a manifest -- that's not very standard for backup software. If you have the critical info in backup_label, you can't afford to lose that, so why should backup_manifest be any different? I mean, you can run pg_basebackup --no-manifest. Maybe this would be a good thing to disable for page incremental. With all the work being done by pg_combinebackup, it seems like it would be a good idea to be able to verify the final result? I understand this is an option -- but does it need to be? What is the benefit of excluding the manifest? Regards, -David
Re: Parallel CREATE INDEX for BRIN indexes
On Wed, 8 Nov 2023 at 12:03, Tomas Vondra wrote:
>
> Hi,
>
> here's an updated patch, addressing the review comments, and reworking
> how the work is divided between the workers & leader etc.
>
> 0001 is just v2, rebased to current master
>
> 0002 and 0003 address most of the issues, in particular it
>
> - removes the unnecessary spool
> - fixes bs_reltuples type to double
> - a couple comments are reworded to be clearer
> - changes loop/condition in brinbuildCallbackParallel
> - removes asserts added for debugging
> - fixes cast in comparetup_index_brin
> - 0003 then simplifies comparetup_index_brin
> - I haven't inlined the tuplesort_puttuple_common parameter
>   (didn't seem worth it)

OK, thanks

> 0004 Reworks how the work is divided between workers and combined by the
> leader. It undoes the tableam.c changes that attempted to divide the
> relation into chunks matching the BRIN ranges, and instead merges the
> results in the leader (using the BRIN "union" function).

That's OK.

> I haven't done any indentation fixes yet.
>
> I did fairly extensive testing, using pageinspect to compare indexes
> built with/without parallelism. More testing is needed, but it seems to
> work fine (with other opclasses and so on).

After code-only review, here are some comments:

> +++ b/src/backend/access/brin/brin.c
> [...]
> +/* Magic numbers for parallel state sharing */
> +#define PARALLEL_KEY_BRIN_SHARED		UINT64CONST(0xA001)
> +#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xA002)

These shm keys use the same constants also in use in access/nbtree/nbtsort.c. While this shouldn't be an issue in normal operations, I'd prefer if we didn't actively introduce conflicting identifiers when we still have significant amounts of unused values remaining.
> +#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xA003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the others being in access/nbtree/nbtsort.c (value 0xA004, one more than brin's), backend/executor/execParallel.c (0xE008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though I've not checked that their uses are exactly the same, I'd expect at least btree to match mostly, if not fully, 1:1). I think we could probably benefit from a less ad-hoc sharing of query texts. I don't think that needs to happen specifically in this patch, but I think it's something to keep in mind in future efforts.

> +_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
> [...]
> +	BrinSpool  *spool = state->bs_spool;
> [...]
> +	if (!state)
> +		return;

I think the assignment to spool should be moved to below this condition, as _brin_begin_parallel calls this with state=NULL when it can't launch parallel workers, which will cause issues here.

> +	state->bs_numtuples = brinshared->indtuples;

My IDE complains about bs_numtuples being an integer. This is a pre-existing issue, but still valid: we can hit an overflow on tables with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely, but problematic nonetheless.

> + * FIXME This probably needs some memory management fixes - we're reading
> + * tuples from the tuplesort, we're allocating an emty tuple, and so on.
> + * Probably better to release this memory.

This should probably be resolved.

I also noticed that this is likely to execute `union_tuples` many times when pages_per_range is coprime with the parallel table scan's block stride (or when we for other reasons have many tuples returned for each range); and this `union_tuples` internally allocates and deletes its own memory context for its deserialization of the 'b' tuple. I think we should just pass a scratch context instead, so that we don't have the overhead of continuously creating then deleting the same memory context.
> +++ b/src/backend/catalog/index.c
> [...]
> -		indexRelation->rd_rel->relam == BTREE_AM_OID)
> +		(indexRelation->rd_rel->relam == BTREE_AM_OID ||
> +		 indexRelation->rd_rel->relam == BRIN_AM_OID))

I think this needs some more effort. I imagine a new IndexAmRoutine->amcanbuildparallel is more appropriate than this hard-coded list - external indexes may want to utilize the parallel index creation planner facility, too.

Some notes:

As a project PostgreSQL seems to be trying to move away from hardcoding heap into everything in favour of the more AM-agnostic 'table'. I suggest replacing all mentions of "heap" in the arguments with "table", to reduce the work future maintainers need to do to fix this. Even when this AM is mostly targeted towards the heap AM, other AMs (such as those I've heard of that were developed internally at EDB) use the same block-addressing that heap does, and should thus be compatible with BRIN. Thus, "heap" is not a useful name here.

There are 2 new mentions of "tuplestore" in the patch, while the structure used is tuplesort: one on form_and_spill_tuple, and
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 2:41 PM David Steele wrote: > I can't see why a backup would continue to be valid without a manifest > -- that's not very standard for backup software. If you have the > critical info in backup_label, you can't afford to lose that, so why > should backup_manifest be any different? I mean, you can run pg_basebackup --no-manifest. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Annoying build warnings from latest Apple toolchain
On Mon, Nov 20, 2023 at 2:35 PM Andres Freund wrote: > > Is there still outstanding work on this thread? Because I'm just now > > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this > > kind of thing in a meson build: > > Ventura? In that case I assume you installed newer developer tools? Because > otherwise I think we were talking about issues introduced in Sonoma. I don't think I did anything special when installing developer tools. xcode-select --version reports 2397, if that tells you anything. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 14:44, Robert Haas wrote: On Mon, Nov 20, 2023 at 12:54 PM David Steele wrote: Another thing we could do is explicitly error if we see backup_label in PGDATA during recovery. That's just a few lines of code so would not be a big deal to maintain. This error would only be visible on restore, so it presumes that backup software is being tested. I think that if we do decide to adopt this proposal, that would be a smart precaution. I'd be OK with it -- what do you think, Michael? Would this be enough that we would not need to rename the functions, or should we just go with the rename? A little hard to add to the tests, I think, since they are purely informational, i.e. not pushed up by the parser. Maybe we could just grep for the fields? Hmm. Or should they be pushed up by the parser? We could do that. I started on that road, but it's a lot of code for fields that aren't used. I think it would be better if the parser also loaded a data structure that represented the manifest. Seems to me there's a lot of duplicated code between pg_verifybackup and pg_combinebackup the way it is now. I think these fields would be handled the same as the rest of the fields in backup_label: returned from pg_backup_stop() and also stored in backup_manifest. Third-party software can do as they like with them and pg_combinebackup can just read from backup_manifest. I think that would be a bad plan, because this is critical information, and a backup manifest is not a thing that you're required to have. It's not a natural fit at all. We don't want to create a situation where if you nuke the backup_manifest then the server forgets that what it has is an incremental backup rather than a usable data directory. I can't see why a backup would continue to be valid without a manifest -- that's not very standard for backup software. If you have the critical info in backup_label, you can't afford to lose that, so why should backup_manifest be any different? Regards, -David
Re: Annoying build warnings from latest Apple toolchain
Hi, On 2023-11-20 14:14:00 -0500, Robert Haas wrote: > On Sat, Oct 7, 2023 at 12:09 PM Tom Lane wrote: > > Done that way. > > Is there still outstanding work on this thread? Because I'm just now > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this > kind of thing in a meson build: Ventura? In that case I assume you installed newer developer tools? Because otherwise I think we were talking about issues introduced in Sonoma. Greetings, Andres Freund
Re: Use of backup_label not noted in log
On 11/20/23 15:03, Andres Freund wrote:

On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote:
If we add a message for starting with "backup_label", shouldn't we also add a corresponding message for starting from a checkpoint found in the control file? If you see that in a problem report, you immediately know what is going on.

Maybe - the reason I hesitate on that is that emitting an additional log message when starting from a base backup just adds something "once over the lifetime of a node". Whereas emitting something every start obviously doesn't impose any limit.

Hmm, yeah, that would be a bit much.

Here's the state with my updated patch, when starting up from a base backup:

LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 64-bit
LOG:  listening on IPv6 address "::1", port 5441
LOG:  listening on IPv4 address "127.0.0.1", port 5441
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5441"
LOG:  database system was interrupted; last known up at 2023-11-20 10:55:49 PST
LOG:  starting recovery from base backup with redo LSN E/AFF07F20, checkpoint LSN E/B01B17F0, on timeline ID 1
LOG:  entering standby mode
LOG:  redo starts at E/AFF07F20
LOG:  completed recovery from base backup with redo LSN E/AFF07F20
LOG:  consistent recovery state reached at E/B420FC80

Besides the phrasing and the additional log message (I have no opinion about whether it should be backpatched or not), I used %u for TimelineID as appropriate, and added a comma before "on timeline".

I still wonder if we need "base backup" in the messages? That sort of implies (at least to me) you used pg_basebackup, but that may not be the case.

FWIW, I also prefer "backup recovery" over "recovery from backup". "recovery from backup" reads fine here, but it gets more awkward when you want to say something like "recovery from backup settings". In that case, I think "backup recovery settings" reads better.
Not important for this patch, maybe, but the recovery in pg_control patch went the other way and I definitely think it makes sense to keep them consistent, whichever way we go. Other than that, looks good for HEAD. Whether we back patch or not is another question, of course. Regards, -David
Re: trying again to get incremental backup
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas wrote:
> > Is this meant to support multiple timelines each with non-overlapping
> > adjacent ranges, rather than multiple non-adjacent ranges?
>
> Correct. I don't see how non-adjacent LSN ranges could ever be a
> useful thing, but adjacent ranges on different timelines are useful.

Thinking about this a bit more, there are a couple of things we could do here in terms of syntax. One idea is to give up on having a separate MANIFEST-WAL-RANGE command for each range and instead just cram everything either into a single command:

MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}...

Or even into a single option to the BASE_BACKUP command:

BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...'

Or, since we expect adjacent, non-overlapping ranges, you could even arrange to elide the duplicated boundary LSNs, e.g.

MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn}

Or

BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn'

I'm not sure what's best here. Trying to trim out the duplicated boundary LSNs feels a bit like rearrangement for the sake of rearrangement, but maybe it isn't really.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
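None of these spellings exist yet, but if the single-option form were chosen, the server-side parsing of one element would be simple. A hypothetical sketch (names and the `tli@startlsn-endlsn` grammar are taken from the proposal above, everything else is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One WAL range of the proposed option string (struct name invented). */
typedef struct WalRange
{
	uint32_t	tli;
	uint64_t	startlsn;
	uint64_t	endlsn;
} WalRange;

/*
 * Parse one "tli@startlsn-endlsn" element, with LSNs written in the
 * usual PostgreSQL %X/%X notation.  Returns 1 on success, 0 on
 * malformed input.
 */
static int
parse_wal_range(const char *s, WalRange *r)
{
	unsigned int tli,
				s_hi,
				s_lo,
				e_hi,
				e_lo;

	if (sscanf(s, "%u@%X/%X-%X/%X", &tli, &s_hi, &s_lo, &e_hi, &e_lo) != 5)
		return 0;
	r->tli = tli;
	r->startlsn = ((uint64_t) s_hi << 32) | s_lo;
	r->endlsn = ((uint64_t) e_hi << 32) | e_lo;
	return 1;
}
```

Splitting the full option on ',' and feeding each piece to such a helper would also make the adjacency check (each range's end equals the next range's start) straightforward to enforce.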
Re: Annoying build warnings from latest Apple toolchain
On Sat, Oct 7, 2023 at 12:09 PM Tom Lane wrote: > Done that way. Is there still outstanding work on this thread? Because I'm just now using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this kind of thing in a meson build: [2264/2287] Linking target src/interfaces/ecpg/test/sql/parser ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [2266/2287] Linking target src/interfaces/ecpg/test/sql/insupd ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [2273/2287] Linking target src/interfaces/ecpg/test/sql/quote ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [2278/2287] Linking target src/interfaces/ecpg/test/sql/show ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [2280/2287] Linking target src/interfaces/ecpg/test/sql/sqlda ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' -- Robert Haas EDB: http://www.enterprisedb.com
Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs
Hi,

Here's a new version to improve the performance of FETCH_COUNT and extend the cases when it can be used.

Patch 0001 adds a new mode in libpq to allow the app to retrieve larger chunks of results than the single row of the row-by-row mode. The maximum number of rows per PGresult is set by the user.

Patch 0002 uses that mode in psql and gets rid of the cursor implementation as suggested upthread.

The performance numbers look good. For a query retrieving 50M rows of about 200 bytes:
  select repeat('abc', 200) from generate_series(1,500)

/usr/bin/time -v psql -At -c $query reports these metrics (medians of 5 runs):

  version   | fetch_count | clock_time | user_time | sys_time | max_rss_size (kB)
  ----------+-------------+------------+-----------+----------+------------------
  16-stable |           0 |       6.58 |      3.98 |     2.09 |           3446276
  16-stable |         100 |       9.25 |      4.10 |     1.90 |              8768
  16-stable |        1000 |      11.13 |      5.17 |     1.66 |              8904
  17-patch  |           0 |        6.5 |      3.94 |     2.09 |           3442696
  17-patch  |         100 |          5 |      3.56 |     0.93 |              4096
  17-patch  |        1000 |       6.48 |      4.00 |     1.55 |              4344

Interestingly, retrieving by chunks of 100 rows appears to be a bit faster than the default one big chunk. It means that independently of using less memory, FETCH_COUNT implemented that way would be a performance enhancement compared to both not using it and using it in v16 with the cursor implementation.

Best regards,

-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

From 766bbe84def2db494f646caeaf29eefeba893c1a Mon Sep 17 00:00:00 2001
From: Daniel Vérité
Date: Mon, 20 Nov 2023 17:24:55 +0100
Subject: [PATCH v4 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks of results contain up to N rows instead of a single row. It is meant to reduce the overhead of the row-by-row allocations for large result sets.

The mode is selected with PQsetChunkedRowsMode(int maxRows) and results have the new status code PGRES_TUPLES_CHUNK.
--- doc/src/sgml/libpq.sgml | 96 +++-- src/bin/pg_amcheck/pg_amcheck.c | 1 + src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-exec.c | 118 +-- src/interfaces/libpq/libpq-fe.h | 4 +- src/interfaces/libpq/libpq-int.h | 7 +- 6 files changed, 183 insertions(+), 44 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index ed88ac001a..8007bf67d8 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -3537,7 +3537,20 @@ ExecStatusType PQresultStatus(const PGresult *res); The PGresult contains a single result tuple from the current command. This status occurs only when single-row mode has been selected for the query -(see ). +(see ). + + + + + + PGRES_TUPLES_CHUNK + + +The PGresult contains several tuples +from the current command. The count of tuples cannot exceed +the maximum passed to . +This status occurs only when the chunked mode has been selected +for the query (see ). @@ -5187,8 +5200,8 @@ PGresult *PQgetResult(PGconn *conn); Another frequently-desired feature that can be obtained with and - is retrieving large query results a row at a time. This is discussed - in . + is retrieving large query results a limited number of rows at a time. This is discussed + in . @@ -5551,12 +5564,13 @@ int PQflush(PGconn *conn); - To enter single-row mode, call PQsetSingleRowMode - before retrieving results with PQgetResult. - This mode selection is effective only for the query currently - being processed. For more information on the use of - PQsetSingleRowMode, - refer to . + To enter single-row or chunked modes, call + respectively PQsetSingleRowMode + or PQsetChunkedRowsMode before retrieving results + with PQgetResult. This mode selection is effective + only for the query currently being processed. For more information on the + use of these functions refer + to . 
@@ -5895,10 +5909,10 @@ UPDATE mytable SET x = x + 1 WHERE id = 42; - - Retrieving Query Results Row-by-Row + + Retrieving Query Results by chunks - + libpq single-row mode @@ -5909,13 +5923,15 @@ UPDATE mytable SET x = x + 1 WHERE id = 42; PGresult. This can be unworkable for commands that return a large number of rows. For such cases, applications can use and in - single-row mode. In this mode, the result row(s) are - returned to the application
Re: trying again to get incremental backup
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera wrote: > That sounds good to me. Not having to parse the manifest server-side > sounds like a win, as does saving the transfer, for the cases where the > manifest is large. OK. I'll look into this next week, hopefully. > Is this meant to support multiple timelines each with non-overlapping > adjacent ranges, rather than multiple non-adjacent ranges? Correct. I don't see how non-adjacent LSN ranges could ever be a useful thing, but adjacent ranges on different timelines are useful. > Do I have it right that you want to rewrite this bit before considering > this ready to commit? For sure. I don't think this is the only thing that needs to be revised before commit, but it's definitely *a* thing that needs to be revised before commit. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use of backup_label not noted in log
On 11/20/23 14:27, Andres Freund wrote: Hi, On 2023-11-19 14:28:12 -0400, David Steele wrote: On 11/18/23 17:49, Andres Freund wrote: On 2023-11-18 10:01:42 -0800, Andres Freund wrote: Not enamored with the phrasing of the log messages, but here's a prototype: When starting up with backup_label present: LOG: starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1 I'd prefer something like: LOG: starting backup recovery with redo... When restarting before reaching the end of the backup, but after backup_label has been removed: LOG: continuing to start from base backup with redo LSN A/34100028 LOG: entering standby mode LOG: redo starts at A/3954B958 And here: LOG: restarting backup recovery with redo... I like it. Cool. I've wondered whether it's worth also adding an explicit message just after ReachedEndOfBackup(), but it seems far less urgent due to the existing "consistent recovery state reached at %X/%X" message. I think the current message is sufficient, but what do you have in mind? Well, the consistency message is emitted after every restart. Whereas a single instance only should go through backup recovery once. So it seems worthwhile to differentiate the two in log messages. Ah, right. That works for me, then. Regards, -David
Re: Use of backup_label not noted in log
Hi, On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote: > On Mon, 2023-11-20 at 17:30 +0900, Michael Paquier wrote: > > + if (ControlFile->backupStartPoint != InvalidXLogRecPtr) > > + ereport(LOG, > > + (errmsg("continuing to start from base backup with redo > > LSN %X/%X", > > + > > LSN_FORMAT_ARGS(ControlFile->backupStartPoint; > > > > "Continuing to start" sounds a bit weird to me, though, considering > > that there are a few LOGs that say "starting" when there is a signal > > file, but I don't have a better idea on top of my mind. So that > > sounds OK here. > > We can only reach that message in recovery or standby mode, right? > So why not write "continuing to recover from base backup"? It can be reached without either too, albeit much less commonly. > If we add a message for starting with "backup_label", shouldn't > we also add a corresponding message for starting from a checkpoint > found in the control file? If you see that in a problem report, > you immediately know what is going on. Maybe - the reason I hesitate on that is that emitting an additional log message when starting from a base backup just adds something "once once the lifetime of a node". Whereas emitting something every start obviously doesn't impose any limit. You also can extrapolate from the messages absence that we started up without backup_label, it's not like there would be a lot of messages inbetween "database system was interrupted; last ..." and "starting backup recovery ..." (commonly there would be no messages) We can do more on HEAD of course, but we should be wary of just spamming the log unnecessarily as well. I guess we could add this message at the same time, including in the back branches. Initially I thought that might be unwise, because replacing elog(DEBUG1, "end of backup reached"); with a different message could theoretically cause issues, even if unlikely, given that it's a DEBUG1 message. 
But I think we actually want to emit the message a bit later, just *after* we updated the control file, as that's the actually relevant piece after which we won't go back to the "backup recovery" state. I am somewhat agnostic about whether we should add that in the back branches or not. Here's the state with my updated patch, when starting up from a base backup: LOG: starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 64-bit LOG: listening on IPv6 address "::1", port 5441 LOG: listening on IPv4 address "127.0.0.1", port 5441 LOG: listening on Unix socket "/tmp/.s.PGSQL.5441" LOG: database system was interrupted; last known up at 2023-11-20 10:55:49 PST LOG: starting recovery from base backup with redo LSN E/AFF07F20, checkpoint LSN E/B01B17F0, on timeline ID 1 LOG: entering standby mode LOG: redo starts at E/AFF07F20 LOG: completed recovery from base backup with redo LSN E/AFF07F20 LOG: consistent recovery state reached at E/B420FC80 Besides the phrasing and the additional log message (I have no opinion about whether it should be backpatched or not), I used %u for TimelineID as appropriate, and added a comma before "on timeline". 
Greetings,

Andres Freund

From d80e55942a096d9f1ab10b84ab6f2a9d371cf88d Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Sat, 18 Nov 2023 13:27:17 -0800
Subject: [PATCH v2] WIP: Log when starting up from a base backup

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp...@awork3.anarazel.de
Backpatch:
---
 src/backend/access/transam/xlogrecovery.c | 32 +++
 1 file changed, 32 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156aa..9803b481118 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -603,6 +603,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		if (StandbyModeRequested)
 			EnableStandbyMode();
 
+		/*
+		 * Omitting backup_label when creating a new replica, PITR node etc.
+		 * unfortunately is a common cause of corruption. Logging that
+		 * backup_label was used makes it a bit easier to exclude that as the
+		 * cause of observed corruption.
+		 *
+		 * Do so before we try to read the checkpoint record (which can fail),
+		 * as otherwise it can be hard to understand why a checkpoint other
+		 * than ControlFile->checkPoint is used.
+		 */
+		ereport(LOG,
+				(errmsg("starting recovery from base backup with redo LSN %X/%X, checkpoint LSN %X/%X, on timeline ID %u",
+						LSN_FORMAT_ARGS(RedoStartLSN),
+						LSN_FORMAT_ARGS(CheckPointLoc),
+						CheckPointTLI)));
+
 		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
@@ -742,6 +758,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				EnableStandbyMode();
 		}
 
+		/*
+		 * For the
Re: trying again to get incremental backup
On 2023-Nov-16, Robert Haas wrote: > On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera > wrote: > > I don't understand this point. Currently, the protocol is that > > UPLOAD_MANIFEST is used to send the manifest prior to requesting the > > backup. You seem to be saying that you're thinking of removing support > > for UPLOAD_MANIFEST and instead just give the LSN as an option to the > > BASE_BACKUP command? > > I don't think I'd want to do exactly that, because then you could only > send one LSN, and I do think we want to send a set of LSN ranges with > the corresponding TLI for each. I was thinking about dumping > UPLOAD_MANIFEST and instead having a command like: > > INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698 > > The client would execute this command one or more times before > starting an incremental backup. That sounds good to me. Not having to parse the manifest server-side sounds like a win, as does saving the transfer, for the cases where the manifest is large. Is this meant to support multiple timelines each with non-overlapping adjacent ranges, rather than multiple non-adjacent ranges? Do I have it right that you want to rewrite this bit before considering this ready to commit? -- Álvaro Herrera   Breisgau, Deutschland — https://www.EnterpriseDB.com/ "We do not dare many things because they are difficult, but they are difficult because we do not dare to do them" (Seneca)
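To make the proposed wire format concrete, here is a hypothetical parser for the command shown above. The command name and the TLI-plus-LSN-range layout follow Robert's example; nothing here reflects a committed grammar, and the function name is invented:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parser for the proposed replication command
 * "INCREMENTAL_WAL_RANGE tli start_lsn end_lsn", where each LSN uses
 * the usual %X/%X high/low split. */
static int
parse_incremental_wal_range(const char *line, unsigned int *tli,
                            uint64_t *start_lsn, uint64_t *end_lsn)
{
    unsigned int shi, slo, ehi, elo;

    if (sscanf(line, "INCREMENTAL_WAL_RANGE %u %X/%X %X/%X",
               tli, &shi, &slo, &ehi, &elo) != 5)
        return 0;               /* malformed command */

    *start_lsn = ((uint64_t) shi << 32) | slo;
    *end_lsn = ((uint64_t) ehi << 32) | elo;
    return 1;
}
```

A client would issue one such command per timeline range before starting the incremental backup.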
Re: Hide exposed impl detail of wchar.c
On Fri, Nov 17, 2023 at 2:26 AM John Naylor wrote: > > On Fri, Nov 17, 2023 at 5:54 AM Nathan Bossart > wrote: > > > > It looks like is_valid_ascii() was originally added to pg_wchar.h so that > > it could easily be used elsewhere [0] [1], but that doesn't seem to have > > happened yet. > > > > Would moving this definition to a separate header file be a viable option? > > Seems fine to me. (I believe the original motivation for making it an > inline function was for use in pg_mbstrlen_with_len(), but trying that > hasn't been a priority.) In that case, I took a look across the codebase and saw a utils/ascii.h that doesn't seem to have gotten much love, but I suppose one could argue that it's intended to be a backend-only header file? As the codebase is growing some enhanced UTF-8 support, you'll want somewhere that contains the optimized US-ASCII routines, because US-ASCII is a subset of UTF-8 and often faster to handle, so it's typical for such codepaths to look like

```c
while (i < len && no_multibyte_chars)
{
    i = i + ascii_op_version(i, buffer, &no_multibyte_chars);
}
while (i < len)
{
    i = i + utf8_op_version(i, buffer);
}
```

So it should probably end up living somewhere near the UTF-8 support, and the easiest way to make it not go into something pgrx currently includes would be to make it a new header file, though there's a fair amount of API we don't touch. From the pgrx / Rust perspective, Postgres function calls are passed via callback to a "guard function" that guarantees that longjmp and setjmp don't cause trouble (and makes sure we participate in that). So we only want to call Postgres functions if we "can't replace" them, as the overhead is quite a lot. That means UTF-8-per-se functions aren't very interesting to us as the Rust language already supports it, but we do benefit from access to transcoding to/from UTF-8. —Jubilee
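To make the two-loop shape above concrete, here is a self-contained version with character counting standing in for the real per-byte operation. All names are illustrative, not actual PostgreSQL routines; the fast path here is a one-shot ASCII prefix scan that falls back to a minimal UTF-8 walk:

```c
#include <stddef.h>

/* Sketch of the two-loop pattern: handle a run of US-ASCII bytes
 * cheaply, then fall back to multibyte-aware code.  Here the "op" is
 * just counting characters; real callers would do validation,
 * case-mapping, etc. */
static size_t
count_chars(const unsigned char *buf, size_t len)
{
    size_t i = 0, nchars = 0;

    /* fast path: every byte < 0x80 is exactly one ASCII character */
    while (i < len && buf[i] < 0x80)
    {
        i++;
        nchars++;
    }
    /* slow path: count UTF-8 characters by skipping continuation
     * bytes (those of the form 10xxxxxx) */
    while (i < len)
    {
        if ((buf[i] & 0xC0) != 0x80)
            nchars++;
        i++;
    }
    return nchars;
}
```

The point of splitting the loops is that the ASCII case needs no table lookups or state, which is why a dedicated header for the US-ASCII routines near the UTF-8 support makes sense.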
Re: Add recovery to pg_control and remove backup_label
On Mon, Nov 20, 2023 at 12:54 PM David Steele wrote: > Another thing we could do is explicitly error if we see backup_label in > PGDATA during recovery. That's just a few lines of code so would not be > a big deal to maintain. This error would only be visible on restore, so > it presumes that backup software is being tested. I think that if we do decide to adopt this proposal, that would be a smart precaution. > A little hard to add to the tests, I think, since they are purely > informational, i.e. not pushed up by the parser. Maybe we could just > grep for the fields? Hmm. Or should they be pushed up by the parser? > I think these fields would be handled the same as the rest of the fields > in backup_label: returned from pg_backup_stop() and also stored in > backup_manifest. Third-party software can do as they like with them and > pg_combinebackup can just read from backup_manifest. I think that would be a bad plan, because this is critical information, and a backup manifest is not a thing that you're required to have. It's not a natural fit at all. We don't want to create a situation where if you nuke the backup_manifest then the server forgets that what it has is an incremental backup rather than a usable data directory. > We absolutely need more people to look at this and sign off. I'm glad > they have not so far because it has allowed time to whack the patch > around and get it into better shape. Cool. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use of backup_label not noted in log
Hi, On 2023-11-20 17:30:31 +0900, Michael Paquier wrote: > On Sat, Nov 18, 2023 at 01:49:15PM -0800, Andres Freund wrote: > > Note that the LSN in the "continuing" case is the one the backup started at, > > not where recovery will start. > > > > I've wondered whether it's worth also adding an explicit message just after > > ReachedEndOfBackup(), but it seems far less urgent due to the existing > > "consistent recovery state reached at %X/%X" message. > > Upgrading the surrounding DEBUG1 to a LOG is another option, but I > agree that I've seen less of that as being an actual problem in the field > compared to the famous I-removed-a-backup-label-and-I-m-still-up, > until this user sees signs of corruption after recovery was finished, > sometimes days after putting back an instance online. "end of backup reached" could scare users; it doesn't obviously indicate something "good". "completed backup recovery, started at %X/%X" or such would be better imo. > + if (ControlFile->backupStartPoint != InvalidXLogRecPtr) > + ereport(LOG, > + (errmsg("continuing to start from base backup with redo > LSN %X/%X", > + LSN_FORMAT_ARGS(ControlFile->backupStartPoint; > > "Continuing to start" sounds a bit weird to me, though, considering > that there are a few LOGs that say "starting" when there is a signal > file, but I don't have a better idea on top of my mind. So that > sounds OK here. I didn't like it much either - but I like David's proposal in his sibling reply: LOG: starting backup recovery with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1 LOG: restarting backup recovery with redo LSN A/34100028 and adding the message from above: LOG: completing backup recovery with redo LSN A/34100028 Greetings, Andres Freund
Re: Use of backup_label not noted in log
Hi, On 2023-11-20 11:24:25 -0500, Robert Haas wrote: > I do also think it is worth considering how this proposal interacts > with the proposal to remove backup_label. If that proposal goes > through, then this proposal is obsolete, I believe. I think it's the opposite, if anything. Today you can at least tell there was use of a backup_label by looking for backup_label.old and you can verify fairly easily in a restore script that backup_label is present. If we "just" use pg_control, neither of those is as easy. I.e. log messages would be more important, not less. Depending on how things work out, we might need to reformulate and/or move them a bit, but that doesn't seem like a big deal. > But if this is a good idea, does that mean that's not a good idea? Or would > we try to make the pg_control which that patch would drop in place have some > internal difference which we could use to drive a similar log message? I think we absolutely have to. If there's no way to tell whether an "external" pg_backup_start/stop() procedure actually used the proper pg_control, it'd make the situation substantially worse compared to today's, already bad, situation. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 12:11, Robert Haas wrote: On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier wrote: (I am not exactly sure how, but we've lost pgsql-hackers on the way when you sent v5. Now added back in CC with the two latest patches you've proposed attached.) Here is a short summary of what has been missed by the lists: - I've commented that the patch should not create, show up in fields returned by the SQL functions, or stream control files with a size of 512B; it should just stick to 8kB. If this is worth changing, it should be applied consistently across the board, including initdb, and discussed on its own thread. - The backup-related fields in the control file are reset at the end of recovery. I've suggested not doing that, to keep a trace of what was happening during recovery. The latest version of the patch resets the fields. - With the backup_label file gone, we lose some information in the backups themselves, which is not good. Instead, you have suggested an approach where this data is added to the backup manifest, meaning that no information would be lost, particularly useful for self-contained backups. The fields planned to be added to the backup manifest are: -- The start and end time of the backup, the end timestamp being useful to know when stop time can be used for PITR. -- The backup label. I've agreed that it may be the best thing to do at this end to not lose any data related to the removal of the backup_label file. I think we need more votes to make a change this big. I have a concern, which I think I've expressed before, that we keep whacking around the backup APIs, and that has a cost which is potentially larger than the benefits. From my perspective it's not that big a change for backup software but it does bring a lot of benefits, including fixing an outstanding bug in Postgres, i.e. reading pg_control without getting a torn copy. The last time we changed the API, we changed pg_stop_backup to pg_backup_stop, but this doesn't do that, and I wonder if that's OK. 
Even if it is, do we really want to change this API around again after such a short time? This is a good point. We could just rename again, but not sure what names to go for this time. OTOH if the backup software is selecting fields then they will get an error because the names have changed. If the software is grabbing fields by position then they'll get a valid-looking result (even if querying by position is a terrible idea). Another thing we could do is explicitly error if we see backup_label in PGDATA during recovery. That's just a few lines of code so would not be a big deal to maintain. This error would only be visible on restore, so it presumes that backup software is being tested. Maybe just a rename to something like pg_backup_begin/end would be the way to go. That said, I don't have an intrinsic problem with moving this information from the backup_label to the backup_manifest file since it is purely informational. I do think there should perhaps be some additions to the test cases, though. A little hard to add to the tests, I think, since they are purely informational, i.e. not pushed up by the parser. Maybe we could just grep for the fields? I am concerned about the interaction of this proposal with incremental backup. When you take an incremental backup, you get something that looks a lot like a usable data directory but isn't. To prevent that from causing avoidable disasters, the present version of the patch adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the backup_label. pg_combinebackup knows to look for those fields, and the server knows that if they are present it should refuse to start. With this change, though, I suppose those fields would end up in pg_control. But that does not feel entirely great, because we have a goal of keeping the amount of real data in pg_control below 512 bytes, the traditional sector size, and this adds another 12 bytes of stuff to that file that currently doesn't need to be there. 
I feel like that's kind of a problem. I think these fields would be handled the same as the rest of the fields in backup_label: returned from pg_backup_stop() and also stored in backup_manifest. Third-party software can do as they like with them and pg_combinebackup can just read from backup_manifest. As for the pg_control file -- it might be best to give it a different name for backups that are not essentially copies of PGDATA. On the other hand, pgBackRest has included pg_control in incremental backups since day one and we've never had a user mistakenly do a manual restore of one and cause a problem (though manual restores are not the norm). Still, probably can't hurt to be a bit careful. But my main point here is ... if we have a few more senior hackers weigh in and vote in favor of this change, well then that's one thing. But IMHO a discussion that's mostly between 2 people is not nearly a strong enough consensus to justify this amount of disruption.
Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500
Richard Guo writes: > On Fri, Nov 17, 2023 at 11:38 AM Tom Lane wrote: >> That line of argument also leads to the conclusion that it'd be >> okay to expose info about the ordering of the CTE result to the >> upper planner. > In the light of this conclusion, I had a go at propagating the pathkeys > from CTEs up to the outer planner and came up with the attached. Oh, nice! I remembered we had code already to do this for regular SubqueryScans, but I thought we'd need to do some refactoring to apply it to CTEs. I think you are right though that convert_subquery_pathkeys can be used as-is. Some thoughts: * Do we really need to use make_tlist_from_pathtarget? Why isn't the tlist of the cteplan good enough (indeed, more so)? * I don't love having this code assume that it knows how to find the Path the cteplan was made from. It'd be better to make SS_process_ctes save that somewhere, maybe in a list paralleling root->cte_plan_ids. Alternatively: maybe it's time to do what the comments in SS_process_ctes vaguely speculate about, and just save the Path at that point, with construction of the plan left for createplan()? That might be a lot of refactoring for not much gain, so not sure. regards, tom lane
Re: Use of backup_label not noted in log
On 11/20/23 12:24, Robert Haas wrote: On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe wrote: I can accept that adding log messages to back branches is ok. Perhaps I am too nervous about things like that, because as an extension developer I have been bitten too often by ABI breaks in minor releases in the past. I think that adding a log message to the back branches would probably make my life better not worse, because when people do strange things and then send me the log messages to figure out what the heck happened, it would be there, and I'd have a clue. However, the world doesn't revolve around me. I can imagine users getting spooked if a new message appears that they've never seen before, and I think that risk should be considered. There are good reasons for keeping the back-branches stable, and as you said before, this isn't a bug fix. Personally I think that the value of the information outweighs the weirdness of a new message appearing. I do also think it is worth considering how this proposal interacts with the proposal to remove backup_label. If that proposal goes through, then this proposal is obsolete, I believe. Not at all. I don't even think the messages will need to be reworded, or not much, since they don't mention backup_label. But if this is a good idea, does that mean that's not a good idea? Or would we try to make the pg_control which that patch would drop in place have some internal difference which we could use to drive a similar log message? The recovery in pg_control patch has all the same recovery info stored, so a similar (or the same) log message would still be appropriate. Maybe we should, because knowing whether or not the user followed the backup procedure correctly would indeed be a big help and it would be regrettable to gain that capability only to lose it again... The info is certainly valuable and we wouldn't lose it, unless there is something I'm not getting. Regards, -David
Re: On non-Windows, hard depend on uselocale(3)
Thomas Munro writes: > If we are sure that we'll *never* want locale-aware printf-family > functions (ie we *always* want "C" locale), then in the thought > experiment above where I suggested we supply replacement _l() > functions, we could just skip that for the printf family, but make > that above comment actually true. Perhaps with Ryu, but otherwise by > punting to libc _l() or uselocale() save/restore. It is pretty annoying that we've got that shiny Ryu code and can't use it here. From memory, we did look into that and concluded that Ryu wasn't amenable to providing "exactly this many digits" as is required by most variants of printf's conversion specs. But maybe somebody should go try harder. (Worst case, you could do rounding off by hand on the produced digit string, but that's ugly...) regards, tom lane
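The "rounding off by hand" fallback Tom mentions can be illustrated with a toy routine: given the digit string a Ryu-style shortest-representation printer emits, trim it to an exact digit count and propagate the carry. This is purely a sketch of the idea, not code from the tree:

```c
#include <string.h>

/* Toy illustration of rounding a decimal digit string in place to
 * ndigits digits, half-up, propagating carries leftward.  Returns 1
 * if the carry fell off the front (e.g. "9999..." rounding up), in
 * which case the caller would have to prepend a '1' and bump the
 * exponent. */
static int
round_digit_string(char *digits, int ndigits)
{
    int     len = (int) strlen(digits);
    int     carry;

    if (len <= ndigits)
        return 0;               /* already short enough */

    carry = (digits[ndigits] >= '5');
    digits[ndigits] = '\0';     /* truncate to ndigits */

    for (int i = ndigits - 1; carry && i >= 0; i--)
    {
        if (digits[i] == '9')
            digits[i] = '0';    /* carry continues leftward */
        else
        {
            digits[i]++;
            carry = 0;
        }
    }
    return carry;
}
```

A real implementation would also have to handle the overflow case by prepending the carried 1 and adjusting the exponent, which is part of what makes this path ugly.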
Re: should check collations when creating partitioned index
Peter Eisentraut writes: > On 14.11.23 17:15, Tom Lane wrote: >> I don't love the patch details though. It seems entirely wrong to check >> this before we check the opclass match. > Not sure why? The order doesn't seem to matter? The case that was bothering me was if we had a non-collated type versus a collated type. That would result in throwing an error about collation mismatch, when complaining about the opclass seems more apropos. However, if we do this: > I see. That means we shouldn't raise an error on a mismatch but just do > if (key->partcollation[i] != collationIds[j]) > continue; it might not matter much. regards, tom lane
Re: Use of backup_label not noted in log
On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe wrote: > I can accept that adding log messages to back branches is ok. > Perhaps I am too nervous about things like that, because as an extension > developer I have been bitten too often by ABI breaks in minor releases > in the past. I think that adding a log message to the back branches would probably make my life better not worse, because when people do strange things and then send me the log messages to figure out what the heck happened, it would be there, and I'd have a clue. However, the world doesn't revolve around me. I can imagine users getting spooked if a new message appears that they've never seen before, and I think that risk should be considered. There are good reasons for keeping the back-branches stable, and as you said before, this isn't a bug fix. I do also think it is worth considering how this proposal interacts with the proposal to remove backup_label. If that proposal goes through, then this proposal is obsolete, I believe. But if this is a good idea, does that mean that's not a good idea? Or would we try to make the pg_control which that patch would drop in place have some internal difference which we could use to drive a similar log message? Maybe we should, because knowing whether or not the user followed the backup procedure correctly would indeed be a big help and it would be regrettable to gain that capability only to lose it again... -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add recovery to pg_control and remove backup_label
On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier wrote: > (I am not exactly sure how, but we've lost pgsql-hackers on the way > when you sent v5. Now added back in CC with the two latest patches > you've proposed attached.) > > Here is a short summary of what has been missed by the lists: > - I've commented that the patch should not create, show up in > fields returned by the SQL functions, or stream control files with a size > of 512B; it should just stick to 8kB. If this is worth changing, it should be > applied consistently across the board, including initdb, and discussed on > its own thread. > - The backup-related fields in the control file are reset at the end > of recovery. I've suggested not doing that, to keep a trace of what > was happening during recovery. The latest version of the patch resets > the fields. > - With the backup_label file gone, we lose some information in the > backups themselves, which is not good. Instead, you have suggested an > approach where this data is added to the backup manifest, meaning that > no information would be lost, particularly useful for self-contained > backups. The fields planned to be added to the backup manifest are: > -- The start and end time of the backup, the end timestamp being > useful to know when stop time can be used for PITR. > -- The backup label. > I've agreed that it may be the best thing to do at this end to not > lose any data related to the removal of the backup_label file. I think we need more votes to make a change this big. I have a concern, which I think I've expressed before, that we keep whacking around the backup APIs, and that has a cost which is potentially larger than the benefits. The last time we changed the API, we changed pg_stop_backup to pg_backup_stop, but this doesn't do that, and I wonder if that's OK. Even if it is, do we really want to change this API around again after such a short time? 
That said, I don't have an intrinsic problem with moving this information from the backup_label to the backup_manifest file since it is purely informational. I do think there should perhaps be some additions to the test cases, though. I am concerned about the interaction of this proposal with incremental backup. When you take an incremental backup, you get something that looks a lot like a usable data directory but isn't. To prevent that from causing avoidable disasters, the present version of the patch adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the backup_label. pg_combinebackup knows to look for those fields, and the server knows that if they are present it should refuse to start. With this change, though, I suppose those fields would end up in pg_control. But that does not feel entirely great, because we have a goal of keeping the amount of real data in pg_control below 512 bytes, the traditional sector size, and this adds another 12 bytes of stuff to that file that currently doesn't need to be there. I feel like that's kind of a problem. But my main point here is ... if we have a few more senior hackers weigh in and vote in favor of this change, well then that's one thing. But IMHO a discussion that's mostly between 2 people is not nearly a strong enough consensus to justify this amount of disruption. -- Robert Haas EDB: http://www.enterprisedb.com
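The 512-byte concern can be made concrete: pg_control is kept small so that the meaningful portion fits in one traditional disk sector and a write of it is effectively atomic, and the server guards a budget of that kind with a compile-time assertion. Below is an invented mock of the idea; the field names, padding layout, and struct are illustrative only, while the real struct is ControlFileData:

```c
#include <stdint.h>

/* Invented stand-in for ControlFileData, just to show the sector-size
 * budget that 12 more bytes of incremental-backup state would eat into. */
typedef struct
{
    uint64_t checkPoint;
    uint64_t backupStartPoint;
    uint64_t backupEndPoint;
    /* the additions discussed in the thread: 12 bytes */
    uint64_t incrementalFromLSN;
    uint32_t incrementalFromTLI;
    /* remaining budget within one 512-byte sector */
    char     padding[512 - 3 * 8 - 12];
} MockControlFile;

/* Fails to compile if the "real data" ever outgrows a sector. */
_Static_assert(sizeof(MockControlFile) <= 512,
               "control data must fit in a 512-byte sector");
```

With an assertion like this, adding fields is safe only while the padding can absorb them, which is why every extra 12 bytes draws scrutiny.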
[PATCH] fix race condition in libpq (related to ssl connections)
Hi, I've found a race condition in libpq. It is about the initialization of the my_bio_methods static variable in fe-secure-openssl.c, which is not protected by any lock. The race condition may make the initialization of the connection fail, and as an additional weird consequence, it might cause OpenSSL to call close(0), so stdin of the client application gets closed. I've prepared a patch to protect the initialization of my_bio_methods from the race. This is my first patch submission to the PostgreSQL project, so I hope I didn't miss anything. Any comments and suggestions are of course very welcome. I also prepared a testcase. In the testcase tarball, there is a patch that adds sleeps at the right positions to make the close(0) problem occur practically always. It also includes comments to explain how the race can end up calling close(0). Concerning the patch, it is only tested on Linux. I'm unsure whether the simple initialization of the mutex would also work on Windows, or whether the more complicated initialization used for ssl_config_mutex in the same source file is needed. Let me know whether I should adapt that. We discovered the problem with release 11.5, but the patch and the testcase are against the master branch. Regards, Willi -- ___ Dr. 
Willi Mann | Principal Software Engineer, Tech Lead PQL Celonis SE | Theresienstrasse 6 | 80333 Munich, Germany F: +4989416159679 w.m...@celonis.com | www.celonis.com | LinkedIn | Twitter | Xing AG Munich HRB 225439 | Management: Martin Klenk, Bastian Nominacher, Alexander RinkeFrom 721be8d6ea26d78fc086a2e85413858585ca9300 Mon Sep 17 00:00:00 2001 From: Willi Mann Date: Thu, 16 Nov 2023 11:50:27 +0100 Subject: [PATCH] libpq: Fix race condition in initialization of my_bio_methods static variable --- src/interfaces/libpq/fe-secure-openssl.c | 26 ++-- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c index f1192d28f2..e8fa3eb403 100644 --- a/src/interfaces/libpq/fe-secure-openssl.c +++ b/src/interfaces/libpq/fe-secure-openssl.c @@ -81,6 +81,7 @@ static int PQssl_passwd_cb(char *buf, int size, int rwflag, void *userdata); static int my_sock_read(BIO *h, char *buf, int size); static int my_sock_write(BIO *h, const char *buf, int size); static BIO_METHOD *my_BIO_s_socket(void); +static void my_BIO_methods_init(void); static int my_SSL_set_fd(PGconn *conn, int fd); @@ -1882,8 +1883,8 @@ my_sock_write(BIO *h, const char *buf, int size) return res; } -static BIO_METHOD * -my_BIO_s_socket(void) +static void +my_BIO_methods_init(void) { if (!my_bio_methods) { @@ -1893,11 +1894,11 @@ my_BIO_s_socket(void) my_bio_index = BIO_get_new_index(); if (my_bio_index == -1) - return NULL; + return; my_bio_index |= (BIO_TYPE_DESCRIPTOR | BIO_TYPE_SOURCE_SINK); my_bio_methods = BIO_meth_new(my_bio_index, "libpq socket"); if (!my_bio_methods) - return NULL; + return; /* * As of this writing, these functions never fail. 
But check anyway, @@ -1914,17 +1915,30 @@ my_BIO_s_socket(void) { BIO_meth_free(my_bio_methods); my_bio_methods = NULL; - return NULL; + return; } #else my_bio_methods = malloc(sizeof(BIO_METHOD)); if (!my_bio_methods) - return NULL; + return; memcpy(my_bio_methods, biom, sizeof(BIO_METHOD)); my_bio_methods->bread = my_sock_read; my_bio_methods->bwrite = my_sock_write; #endif } +} + +static pthread_mutex_t my_BIO_methods_init_mutex = PTHREAD_MUTEX_INITIALIZER; + +static BIO_METHOD * +my_BIO_s_socket() +{ + if (pthread_mutex_lock(&my_BIO_methods_init_mutex) != 0) + { + return NULL; + } + my_BIO_methods_init(); + pthread_mutex_unlock(&my_BIO_methods_init_mutex); + return my_bio_methods; } -- 2.39.2 testcase.tar.gz Description: application/gzip
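An alternative to the statically-initialized mutex used in the patch is pthread_once, which also guarantees one-shot initialization under concurrent callers. A sketch with placeholder names follows (nothing here is libpq code; "resource" stands in for my_bio_methods). The trade-off is that a failed init cannot easily be retried on a later call, which may be one reason a plain mutex fits this case better:

```c
#include <pthread.h>
#include <stddef.h>

/* Placeholder one-shot initialization via pthread_once. */
static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static void *resource = NULL;
static int init_calls = 0;

static void
init_resource(void)
{
    static int dummy;

    init_calls++;               /* runs exactly once, ever */
    resource = &dummy;          /* real code would allocate here */
}

static void *
get_resource(void)
{
    if (pthread_once(&init_once, init_resource) != 0)
        return NULL;            /* pthread_once itself failed */
    return resource;            /* NULL if the init routine failed */
}
```

Either way, the key property is the same as in the patch: no two threads can observe a half-built my_bio_methods.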
Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans
Thanks. Our project is at https://github.com/tensorchord/pgvecto.rs. A custom index is implemented for the vector similarity search, which implements `amgettuples` with direction support to provide candidates for the order by clause. And we want to inject the filter condition using bitmap into the amgettuples process, instead of checking the tuples one by one to accelerate the whole process. Best Jinjing Zhou > From: "Tomas Vondra" > Date: Mon, Nov 20, 2023, 18:52 > Subject: Re: Inquiry on Generating Bitmaps from Filter Conditions in Index > Scans > To: "Jinjing Zhou", > "pgsql-hackers@lists.postgresql.org" > On 11/19/23 18:19, Jinjing Zhou wrote: > > Hi hackers, > > > > I hope this message finds you well. I am reaching out to seek guidance > > on a specific aspect of PostgreSQL's index scanning functionality. > > > > I am currently working on a vector search extension for postgres, where > > I need to generate bitmaps based on filter conditions during an index > > scan. The goal is to optimize the query performance by efficiently > > identifying the rows that meet the given criteria. > > > > The query plan looks like this > > > > Index Scan using products_feature_idx on products (cost=0.00..27.24 > > rows=495 width=12) > > Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector) > > Filter: ((price > '0.2'::double precision) AND (price <= > > '0.7'::double precision)) > > > > > > We have a custom index for the order by clause on the feature column. > > Now we want to utilize the index on other columns like price column. We > > want to access the bitmap of price column's filter condition in the > > feature column index. Is there any way I can achieve this goal? > > > > Any help or guidance is appreciated! > > > > I suppose you'll need to give more details about what exactly are you > trying to achieve, what you tried, maybe some code examples, etc. Your > question is quite vague, and it's unclear what "bitmaps generated on > filter conditions" or "custom index" means. 
> > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans
Thanks a lot! This is exactly what I'm asking. We've tried the CustomScanAPI at https://github.com/tensorchord/pgvecto.rs/pull/126, but met error with "variable not found in subplan target list". We're still investigating the root cause and thanks for your guidance! Best Jinjing Zhou > From: "Matthias van de Meent" > Date: Mon, Nov 20, 2023, 19:33 > Subject: Re: Inquiry on Generating Bitmaps from Filter Conditions in Index > Scans > To: "Jinjing Zhou" > Cc: "pgsql-hackers@lists.postgresql.org" > On Mon, 20 Nov 2023 at 09:30, Jinjing Zhou wrote: > > > > Hi hackers, > > > > I hope this message finds you well. I am reaching out to seek guidance on a > > specific aspect of PostgreSQL's index scanning functionality. > > > > I am currently working on a vector search extension for postgres, where I > > need to generate bitmaps based on filter conditions during an index scan. > > The goal is to optimize the query performance by efficiently identifying > > the rows that meet the given criteria. > > > > The query plan looks like this > > > > Index Scan using products_feature_idx on products (cost=0.00..27.24 > > rows=495 width=12) > > Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector) > > Filter: ((price > '0.2'::double precision) AND (price <= > > '0.7'::double precision)) > > > > > > We have a custom index for the order by clause on the feature column. Now > > we want to utilize the index on other columns like price column. We want to > > access the bitmap of price column's filter condition in the feature column > > index. Is there any way I can achieve this goal? > > If you mean "I'd like to use bitmaps generated by combining filter > results from index A, B, and C for (pre-)filtering the ordered index > lookups in index D", > then there is no current infrastructure to do this. Bitmap scans > currently generate a data structure that is not indexable, and can > thus not be used efficiently to push an index's generated bitmap into > another bitmap's scans. 
> > There are efforts to improve the data structures we use for storing > TIDs during vacuum [0] which could extend to the TID bitmap structure, > but even then we'd need some significant effort to rewire Postgres' > internals to push down the bitmap filters; and that is even under the > assumption that pushing the bitmap down into the index AM is more > efficient than doing the merges above the index AM and then re-sorting > the data. > > So, in short, it's not currently available in community PostgreSQL. > You could probably create a planner hook + custom executor node that > does this, but it won't be able to use much of the features available > inside PostgreSQL. > > Kind regards, > > Matthias van de Meent > > [0] > https://www.postgresql.org/message-id/flat/CANWCAZbrZ58-w1W_3pg-0tOfbx8K41_n_03_0ndGV78hJWswBA%2540mail.gmail.com
Re: Show WAL write and fsync stats in pg_stat_io
Hi, Thanks for the feedback. On Mon, 20 Nov 2023 at 10:47, Michael Paquier wrote: > > On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote: > > There are some differences between pg_stat_wal and pg_stat_io while > > collecting WAL stats. For example in the XLogWrite() function in the > > xlog.c file, pg_stat_wal counts wal_writes as write system calls. This > > is not something we want for pg_stat_io since pg_stat_io counts the > > number of blocks rather than the system calls, so instead incremented > > pg_stat_io by npages. > > > > Could that cause a problem since pg_stat_wal's behaviour will be > > changed? Of course, as an alternative we could change pg_stat_io's > > behaviour but in the end either pg_stat_wal's or pg_stat_io's > > behaviour will be changed. > > Yep, that could be confusing for existing applications that track the > information of pg_stat_wal. The number of writes is not something > that can be correctly shared between both. The timings for the writes > and the syncs could be shared at least, right? Yes, the timings for the writes and the syncs should work. Another question I have in mind is the pg_stat_reset_shared() function. When we call it with 'io' it will reset pg_stat_wal's timings and when we call it with 'wal' it won't reset them, right? > > This slightly relates to pgstat_count_io_op_n() in your latest patch, > where it feels a bit weird to see an update of > PendingWalStats.wal_sync sit in the middle of a routine dedicated to > pg_stat_io.. I am not completely sure what's the right balance here, > but I would try to implement things so as pg_stat_io paths does not > need to know about PendingWalStats. Write has block vs system calls differentiation but it is the same for sync. Because of that I put PendingWalStats.wal_sync to pg_stat_io but I agree that it looks a bit weird. -- Regards, Nazir Bilal Yavuz Microsoft
Re: Use of backup_label not noted in log
On 11/20/23 06:35, Laurenz Albe wrote: If we add a message for starting with "backup_label", shouldn't we also add a corresponding message for starting from a checkpoint found in the control file? If you see that in a problem report, you immediately know what is going on. +1. It is easier to detect the presence of a message than the absence of one. Regards, -David
Re: Use of backup_label not noted in log
[Resending since I accidentally replied off-list]

On 11/18/23 17:49, Andres Freund wrote:

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:

What about adding it to the "redo starts at" message, something like

redo starts at 12/12345678, taken from control file
or
redo starts at 12/12345678, taken from backup label

I think it'd make sense to log use of backup_label earlier than that - the locations from backup_label might end up not being available in the archive, the primary or locally, and we'll error out with "could not locate a valid checkpoint record". I'd probably just do it within the if (read_backup_label()) block in InitWalRecovery(), *before* the ReadCheckpointRecord().

Not enamored with the phrasing of the log messages, but here's a prototype. When starting up with backup_label present:

LOG: starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1

I'd prefer something like:

LOG: starting backup recovery with redo...

When restarting before reaching the end of the backup, but after backup_label has been removed:

LOG: continuing to start from base backup with redo LSN A/34100028
LOG: entering standby mode
LOG: redo starts at A/3954B958

And here:

LOG: restarting backup recovery with redo...

Note that the LSN in the "continuing" case is the one the backup started at, not where recovery will start.

I've wondered whether it's worth also adding an explicit message just after ReachedEndOfBackup(), but it seems far less urgent due to the existing "consistent recovery state reached at %X/%X" message.

I think the current message is sufficient, but what do you have in mind?

We are quite inconsistent about how we spell LSNs. Sometimes with LSN preceding, sometimes not. Sometimes with (LSN). Etc.

Well, this could be improved in HEAD for sure.

I do like the idea of expanding the "redo starts at" message though. E.g.
including minRecoveryLSN, ControlFile->backupStartPoint, and ControlFile->backupEndPoint would provide information about when the node might become consistent.

+1

Playing around with this a bit, I'm wondering if we instead should remove that message, and emit something more informative earlier on. If there's a problem, you kinda want the information before we finally get to the loop in PerformWalRecovery(). If e.g. there's no WAL you'll only get

LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record

which is delightfully lacking in details.

I've been thinking about improving this myself. It would probably also help a lot to hint that restore_command may be missing or not returning results (but also not erroring). But there are a bunch of ways to get to this message, so we'd need to be careful.

There also are some other oddities: If the primary is down when starting up, and we'd need WAL from the primary for the first record, the "redo starts at" message is delayed until that happens, because we emit the message not before we read the first record, but after. That's just plain odd.

Agreed. Moving it up would be better.

And sometimes we'll start referencing the LSN at which we are starting recovery before the "redo starts at" message. If e.g. we shut down at a restart point, we'll emit

LOG: consistent recovery state reached at ...

before

LOG: redo starts at ...

Huh, I haven't seen that one. Definitely confusing.

But that's all clearly just material for HEAD.

Absolutely. I've been thinking about some of this as well, but want to see if we can remove the backup label first so we don't have to rework a bunch of stuff. Of course, that shouldn't stop you from proceeding. I'm sure anything you are thinking of here could be adapted.

Regards,
-David
Re: Add recovery to pg_control and remove backup_label
On 11/19/23 21:15, Michael Paquier wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way when you sent v5. Now added back in CC with the two latest patches you've proposed attached.)

Ugh, I must have hit reply instead of reply-all. It's a rookie error and you hate to see it.

Here is a short summary of what has been missed by the lists:

- I've commented that the patch should not create control files with a size of 512B, show them in fields returned by the SQL functions, or stream them; just stick to 8kB. If this is worth changing, it should be applied consistently across the board including initdb, and discussed on its own thread.

- The backup-related fields in the control file are reset at the end of recovery. I've suggested not doing that, to keep a trace of what was happening during recovery. The latest version of the patch resets the fields.

- With the backup_label file gone, we lose some information in the backups themselves, which is not good. Instead, you have suggested an approach where this data is added to the backup manifest, meaning that no information would be lost; this is particularly useful for self-contained backups. The fields planned to be added to the backup manifest are:
  -- The start and end time of the backup, the end timestamp being useful to know when the stop time can be used for PITR.
  -- The backup label.

I've agreed that it may be the best thing to do in the end, so as not to lose any data related to the removal of the backup_label file.

This looks right to me.

On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote:

On 11/15/23 20:03, Michael Paquier wrote:

As the label is only an informational field, the parsing added to pg_verifybackup is not really needed because it is used nowhere in the validation process, so keeping the logic simpler would be the way to go IMO. This is contrary to the WAL range, for example, where the start and end LSNs are used for validation with a pg_waldump command.
Robert, any comments about the addition of the label in the manifest?

I'm sure Robert will comment on this when he gets the time, but for now I have backed off on passing the new info to pg_verifybackup and added start/stop time.

FWIW, I'm OK with the bits for the backup manifest as presented. So if there are no remarks and/or no objections, I'd like to apply it, but let's give some room to others to comment on that, as there's been a gap in the emails exchanged on pgsql-hackers. I hope that the summary I've posted above covers everything. So let's see about doing something around the middle of next week. With Thanksgiving in the US, a lot of folks will not have the time to monitor what's happening on this thread.

Timing sounds good to me.

+ The end time for the backup. This is when the backup was stopped in
+ PostgreSQL and represents the earliest time
+ that can be used for time-based Point-In-Time Recovery.

This one is actually a very good point. With the backup_label file gone, we'd lose this capability without the end timestamp in the control file.

Yeah, the end time is very important for recovery scenarios. We definitely need that recorded somewhere.

I've noticed on the other thread the remark about being less aggressive with the fields related to recovery in the control file, so I assume that this patch should leave the fields be after the end of recovery from the start, and only rely on backupRecoveryRequired to decide if the recovery should use the fields or not: https://www.postgresql.org/message-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@pgmasters.net

+ ControlFile->backupCheckPoint = InvalidXLogRecPtr;
  ControlFile->backupStartPoint = InvalidXLogRecPtr;
+ ControlFile->backupStartPointTLI = 0;
  ControlFile->backupEndPoint = InvalidXLogRecPtr;
+ ControlFile->backupFromStandby = false;
  ControlFile->backupEndRequired = false;

Still, I get the temptation of being consistent with the current style on HEAD to reset everything, as well.
I'd rather reset everything for now (as we do now) and think about keeping these values as a separate patch. It may be that we don't want to keep all of them, or we need a separate flag to say recovery was completed. We are accumulating a lot of booleans here, maybe we need a state var (recoveryRequired, recoveryInProgress, recoveryComplete) and then define which other vars are valid in each state. Regards, -David
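The "state var" idea above could look something like this minimal sketch; RecoveryState and recovery_state_from_flags() are invented names for illustration, not PostgreSQL code, and the real design would have to decide which control-file fields are valid in each state.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: collapse the accumulating recovery booleans
 * into a single state, as suggested in the discussion above.
 */
typedef enum RecoveryState
{
    RECOVERY_NOT_REQUIRED,  /* no backup recovery pending */
    RECOVERY_REQUIRED,      /* backupRecoveryRequired set, redo not started */
    RECOVERY_IN_PROGRESS,   /* redo running, backup fields still in use */
    RECOVERY_COMPLETE       /* end of backup reached, fields historical */
} RecoveryState;

/* Map the existing boolean flags onto the proposed single state. */
static RecoveryState
recovery_state_from_flags(bool backup_recovery_required,
                          bool redo_started,
                          bool reached_end_of_backup)
{
    if (!backup_recovery_required)
        return RECOVERY_NOT_REQUIRED;
    if (reached_end_of_backup)
        return RECOVERY_COMPLETE;
    return redo_started ? RECOVERY_IN_PROGRESS : RECOVERY_REQUIRED;
}
```

One advantage of a single enum over independent booleans is that impossible combinations (e.g. "end of backup reached" without "recovery required") simply cannot be represented.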
Re: Synchronizing slots from primary to standby
On 11/20/23 11:59 AM, Amit Kapila wrote: On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand wrote: On 11/18/23 11:45 AM, Amit Kapila wrote: On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand wrote: On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand wrote:

I feel WaitForWALToBecomeAvailable may not be the best place to shut down the slotsync worker and drop slots. There could be other reasons (other than promotion), as mentioned in the comments in case XLOG_FROM_STREAM, to reach the code there. I thought that if the intention is to stop slotsync workers on promotion, maybe FinishWalRecovery() is a better place to do it, as it indicates the end of recovery and XLogShutdownWalRcv is also called in it.

I can see that slotsync_drop_initiated_slots() has been moved to FinishWalRecovery() in v35. That looks ok.

I was thinking: what if we just ignore creating such slots (which require the init state) in the first place? I think that can be time-consuming in some cases, but it will reduce the complexity and we can always improve such cases later if we really encounter them in the real world. I am not very sure that the added complexity is worth addressing this particular case, so I would like to know your and others' opinions.

I'm not sure I understand your point. Are you saying that we should not create slots on the standby that are "currently" reported in an 'i' state? (so just keep the 'r' and 'n' states?)

Yes.

As far as the 'i' state here, from what I see, it is currently useful for:

1. Cascading standbys, to not sync slots with state = 'i' from the first standby.
2. Easily reporting slots that did not catch up on the primary yet.
3. Avoiding inactive slots blocking the creation of "active" ones.

So not creating those slots should not be an issue for 1. (syncs are not needed on a cascading standby, as the slots are not created on the first standby yet) but is an issue for 2. (unless we provide another way to keep track of and report such slots) and 3.
(as I think we should still need to reserve WAL). I've a question: we'd still need to reserve WAL for those slots, no? If that's the case and if we don't call ReplicationSlotCreate() then ReplicationSlotReserveWal() would not work as MyReplicationSlot would be NULL. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Flushing large data immediately in pqcomm
Hi hackers,

I've been looking into ways to reduce the overhead we're having in pqcomm, and I'd like to propose a small patch to modify how socket_putmessage works. Currently socket_putmessage copies any input data into the pqcomm send buffer (PqSendBuffer), whose size is 8K. When the send buffer gets full, it's flushed, and we continue to copy more data into the send buffer until we have no data left to be sent. Since the send buffer is flushed whenever it's full, I think we are safe to say that if the size of the input data is larger than the buffer size (8K), then the buffer will be flushed at least once (or more, depending on the input size) to store all the input data.

The proposed change modifies socket_putmessage to send any data larger than 8K immediately, without copying it into the send buffer. Assuming that the send buffer would be flushed anyway due to reaching its limit, the patch just gets rid of the copy part, which seems unnecessary, and sends the data without waiting. This change affects places where pq_putmessage is used, such as pg_basebackup, COPY TO, walsender, etc.

I did some experiments to see how the patch performs. Firstly, I loaded ~5GB of data into a table [1], then ran "COPY test TO STDOUT". Here are perf results of both the patch and HEAD.

HEAD:

-   94,13%  0,22%  postgres  postgres  [.] DoCopyTo
   - 93,90% DoCopyTo
      - 91,80% CopyOneRowTo
         + 47,35% CopyAttributeOutText
         - 26,49% CopySendEndOfRow
            - 25,97% socket_putmessage
               - internal_putbytes
                  - 24,38% internal_flush
                     + secure_write
                  + 1,47% memcpy (inlined)
         + 14,69% FunctionCall1Coll
         + 1,94% appendBinaryStringInfo
         + 0,75% MemoryContextResetOnly
      + 1,54% table_scan_getnextslot (inlined)

Patch:

-   94,40%  0,30%  postgres  postgres  [.] DoCopyTo
   - 94,11% DoCopyTo
      - 92,41% CopyOneRowTo
         + 51,20% CopyAttributeOutText
         - 20,87% CopySendEndOfRow
            - 20,45% socket_putmessage
               - internal_putbytes
                  - 18,50% internal_flush (inlined)
                       internal_flush_buffer
                     + secure_write
                  + 1,61% memcpy (inlined)
         + 17,36% FunctionCall1Coll
         + 1,33% appendBinaryStringInfo
         + 0,93% MemoryContextResetOnly
      + 1,36% table_scan_getnextslot (inlined)

The patch brings a ~5% gain in socket_putmessage. I also timed pg_basebackup like:

time pg_basebackup -p 5432 -U replica_user -X none -c fast --no-manifest -D test

HEAD:
real    0m10,040s
user    0m0,768s
sys     0m7,758s

Patch:
real    0m8,882s
user    0m0,699s
sys     0m6,980s

It seems ~11% faster in this specific case. I'd appreciate any feedback/thoughts.

[1] CREATE TABLE test(id int, name text, time TIMESTAMP);
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) AS name, NOW() AS time FROM generate_series(1, 1) AS i;

Thanks,
--
Melih Mutlu
Microsoft

0001-Flush-large-data-immediately-in-pqcomm.patch
Description: Binary data
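The core of the proposed behaviour can be modelled in miniature as follows. All names and the structure here are invented for illustration; the real change lives in src/backend/libpq/pqcomm.c. Messages at least as large as the buffer skip the memcpy into the staging buffer entirely and go out directly after any pending bytes are flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of an 8K send buffer.  Small messages are memcpy'd into
 * the buffer and flushed when it fills; messages of BUF_SIZE or more
 * bypass the buffer.  flushes stands in for the secure_write() calls,
 * bytes_copied tracks the copying the proposed patch avoids.
 */
#define BUF_SIZE 8192

typedef struct SendState
{
    char   buf[BUF_SIZE];
    size_t used;            /* bytes currently buffered */
    size_t bytes_copied;    /* total bytes memcpy'd into buf */
    size_t flushes;         /* number of simulated send calls */
} SendState;

static void
flush_buffer(SendState *s)
{
    if (s->used > 0)
    {
        s->flushes++;       /* stand-in for one secure_write() */
        s->used = 0;
    }
}

static void
put_message(SendState *s, const char *data, size_t len)
{
    if (len >= BUF_SIZE)
    {
        flush_buffer(s);    /* keep ordering: pending bytes go first */
        s->flushes++;       /* send the large payload directly */
        return;
    }
    while (len > 0)
    {
        size_t n = BUF_SIZE - s->used;

        if (n > len)
            n = len;
        memcpy(s->buf + s->used, data, n);
        s->bytes_copied += n;
        s->used += n;
        data += n;
        len -= n;
        if (s->used == BUF_SIZE)
            flush_buffer(s);
    }
}
```

In this model, a 10000-byte message previously cost a 10000-byte copy plus at least one flush; with the bypass it costs one flush and no copy, which matches the reduced memcpy/internal_flush share in the perf profiles above.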
Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans
On Mon, 20 Nov 2023 at 09:30, Jinjing Zhou wrote:
> Hi hackers,
>
> I hope this message finds you well. I am reaching out to seek guidance on a specific aspect of PostgreSQL's index scanning functionality.
>
> I am currently working on a vector search extension for postgres, where I need to generate bitmaps based on filter conditions during an index scan. The goal is to optimize the query performance by efficiently identifying the rows that meet the given criteria.
>
> The query plan looks like this
>
> Index Scan using products_feature_idx on products (cost=0.00..27.24 rows=495 width=12)
>   Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector)
>   Filter: ((price > '0.2'::double precision) AND (price <= '0.7'::double precision))
>
> We have a custom index for the order by clause on the feature column. Now we want to utilize the index on other columns like the price column. We want to access the bitmap of the price column's filter condition in the feature column index. Is there any way I can achieve this goal?

If you mean "I'd like to use bitmaps generated by combining filter results from index A, B, and C for (pre-)filtering the ordered index lookups in index D", then there is no current infrastructure to do this. Bitmap scans currently generate a data structure that is not indexable, and can thus not be used efficiently to push an index's generated bitmap into another bitmap's scans.

There are efforts to improve the data structures we use for storing TIDs during vacuum [0] which could extend to the TID bitmap structure, but even then we'd need some significant effort to rewire Postgres' internals to push down the bitmap filters; and that is even under the assumption that pushing the bitmap down into the index AM is more efficient than doing the merges above the index AM and then re-sorting the data.

So, in short, it's not currently available in community PostgreSQL.
You could probably create a planner hook + custom executor node that does this, but it won't be able to use many of the features available inside PostgreSQL.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/CANWCAZbrZ58-w1W_3pg-0tOfbx8K41_n_03_0ndGV78hJWswBA%40mail.gmail.com
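As a rough illustration of what "doing the merges above the index AM" amounts to, the sketch below intersects two sorted TID streams, modelled here as sorted 64-bit keys. This is purely illustrative and is not PostgreSQL's actual TIDBitmap code; the real structure is a per-page bitmap keyed by block number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Intersect two sorted key arrays (a stand-in for combining the TID
 * sets produced by two filter conditions).  out must have room for
 * min(na, nb) entries; returns the number of matches written.
 */
static size_t
tid_intersect(const uint64_t *a, size_t na,
              const uint64_t *b, size_t nb,
              uint64_t *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else
        {
            out[n++] = a[i];    /* TID satisfies both conditions */
            i++;
            j++;
        }
    }
    return n;
}
```

The merge itself is cheap and linear; the expensive part Matthias describes is that after such a merge the surviving TIDs would still need to be re-sorted into the distance order the vector index produces, which is why pushing the filter below the index AM is attractive in the first place.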