Is a clearer memory lifespan for outerTuple and innerTuple useful?
Hi,

While working on "shared detoast value"[0], where I want to avoid detoasting the same datum over and over again, I had to decide which memory context should be used to hold the detoasted value. I later found I have to use different MemoryContexts for the OuterTuple and the innerTuple, since the OuterTuple usually has a longer lifespan. I found the following code in nodeMergejoin.c, which handles a pretty similar situation, except that it uses an ExprContext rather than a MemoryContext:

MergeJoinState *
ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)

	/*
	 * we need two additional econtexts in which we can compute the join
	 * expressions from the left and right input tuples.  The node's regular
	 * econtext won't do because it gets reset too often.
	 */
	mergestate->mj_OuterEContext = CreateExprContext(estate);
	mergestate->mj_InnerEContext = CreateExprContext(estate);

IIUC, what we need here is really a MemoryContext rather than an ExprContext. In the attachment, I just use two MemoryContexts instead of the two ExprContexts, which should use less memory and have more precise semantics, and it works fine. Shall we go in this direction? I attached the two MemoryContexts to JoinState rather than MergeJoinState, which is for the "shared detoast value"[0] more or less.

[0] https://www.postgresql.org/message-id/87ttoyihgm@163.com

>From cadd94f8ff2398ca430b43494b3eb71225b017c3 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh"
Date: Fri, 15 Dec 2023 15:28:36 +0800
Subject: [PATCH v1] Use MemoryContext instead of ExprContext for
 nodeMergejoin.c

---
 src/backend/executor/nodeMergejoin.c | 27 ++-
 src/backend/executor/nodeNestloop.c  |  1 +
 src/include/nodes/execnodes.h        |  4 ++--
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 3cdab77dfc..d330e104f6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -294,7 +294,7 @@ MJExamineQuals(List *mergeclauses,
 static MJEvalResult
 MJEvalOuterValues(MergeJoinState *mergestate)
 {
-	ExprContext *econtext = mergestate->mj_OuterEContext;
+	ExprContext *econtext = mergestate->js.ps.ps_ExprContext;
 	MJEvalResult result = MJEVAL_MATCHABLE;
 	int			i;
 	MemoryContext oldContext;
@@ -303,9 +303,9 @@ MJEvalOuterValues(MergeJoinState *mergestate)
 	if (TupIsNull(mergestate->mj_OuterTupleSlot))
 		return MJEVAL_ENDOFJOIN;
 
-	ResetExprContext(econtext);
+	MemoryContextReset(mergestate->js.outer_tuple_memory);
 
-	oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
+	oldContext = MemoryContextSwitchTo(mergestate->js.outer_tuple_memory);
 
 	econtext->ecxt_outertuple = mergestate->mj_OuterTupleSlot;
@@ -341,7 +341,7 @@
 static MJEvalResult
 MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot)
 {
-	ExprContext *econtext = mergestate->mj_InnerEContext;
+	ExprContext *econtext = mergestate->js.ps.ps_ExprContext;
 	MJEvalResult result = MJEVAL_MATCHABLE;
 	int			i;
 	MemoryContext oldContext;
@@ -352,7 +352,7 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot)
 
 	ResetExprContext(econtext);
 
-	oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
+	oldContext = MemoryContextSwitchTo(mergestate->js.inner_tuple_memory);
 
 	econtext->ecxt_innertuple = innerslot;
@@ -1471,14 +1471,6 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 */
 	ExecAssignExprContext(estate, &mergestate->js.ps);
 
-	/*
-	 * we need two additional econtexts in which we can compute the join
-	 * expressions from the left and right input tuples.  The node's regular
-	 * econtext won't do because it gets reset too often.
-	 */
-	mergestate->mj_OuterEContext = CreateExprContext(estate);
-	mergestate->mj_InnerEContext = CreateExprContext(estate);
-
 	/*
 	 * initialize child nodes
 	 *
@@ -1497,6 +1489,15 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 										  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	mergestate->js.outer_tuple_memory = AllocSetContextCreate(
+		mergestate->js.ps.ps_ExprContext->ecxt_per_query_memory,
+		"OuterTupleCtx",
+		ALLOCSET_SMALL_SIZES);
+	mergestate->js.inner_tuple_memory = AllocSetContextCreate(
+		mergestate->js.ps.ps_ExprContext->ecxt_per_query_memory,
+		"InnerTupleCtx",
+		ALLOCSET_SMALL_SIZES);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index ebd1406843..5e10d01c4e 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -349,6 +349,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	NL1_printf("ExecInitNestLoop: %s\n", "node
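To tie this back to the detoast use case: a minimal sketch, assuming the patch's JoinState.outer_tuple_memory field (the helper name here is hypothetical), of caching a detoasted outer-side datum in the longer-lived context:

    /* Detoast once per outer tuple, into the context that lives exactly
     * as long as the outer tuple does. */
    static Datum
    detoast_in_outer_ctx(JoinState *js, Datum toasted)
    {
        MemoryContext oldcxt;
        Datum         result;

        oldcxt = MemoryContextSwitchTo(js->outer_tuple_memory);
        result = PointerGetDatum(PG_DETOAST_DATUM_COPY(toasted));
        MemoryContextSwitchTo(oldcxt);

        /* valid until outer_tuple_memory is reset for the next outer tuple */
        return result;
    }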
Support "Right Semi Join" plan shapes
Hi Richard Guo,

I see that the test samples are all EXISTS subqueries. I think semi join should also support IN and ANY subqueries. Would you do more testing on IN and ANY subqueries, for example the queries sketched below?

Best wishes
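For illustration (table names assumed), the IN and = ANY forms are pulled up to semi joins just like EXISTS, so they should be able to exercise the same plan shape:

    -- EXISTS form, as in the existing tests
    SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a);

    -- IN form
    SELECT * FROM t1 WHERE t1.a IN (SELECT a FROM t2);

    -- = ANY form
    SELECT * FROM t1 WHERE t1.a = ANY (SELECT a FROM t2);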
Re: Fix bug with indexes on whole-row expressions
On Wed, Dec 13, 2023 at 7:01 AM Tom Lane wrote: > ywgrit writes: > > I forbid to create indexes on whole-row expression in the following > patch. > > I'd like to hear your opinions. > > As I said in the previous thread, I don't think this can possibly > be acceptable. Surely there are people depending on the capability. > I'm not worried so much about the exact case of an index column > being a whole-row Var --- I agree that that's pretty useless --- > but an index column that is a function on a whole-row Var seems > quite useful. (Your patch fails to detect that, BTW, which means > it does not block the case presented in bug #18244.) > > I thought about extending the ALTER TABLE logic to disallow changes > in composite types that appear in index expressions. We already have > find_composite_type_dependencies(), and it turns out that this already > blocks ALTER for the case you want to forbid, but we concluded that we > didn't need to prevent it for the bug #18244 case: > > * If objsubid identifies a specific column, refer to that in error > * messages. Otherwise, search to see if there's a user column of > the > * type. (We assume system columns are never of interesting > types.) > * The search is needed because an index containing an expression > * column of the target type will just be recorded as a > whole-relation > * dependency. If we do not find a column of the type, the > dependency > * must indicate that the type is transiently referenced in an > index > * expression but not stored on disk, which we assume is OK, just > as > * we do for references in views. (It could also be that the > target > * type is embedded in some container type that is stored in an > index > * column, but the previous recursion should catch such cases.) > > Perhaps a reasonable answer would be to issue a WARNING (not error) > in the case where an index has this kind of dependency. The index > might need to be reindexed --- but it might not, too, and in any case > I doubt that flat-out forbidding the ALTER is a helpful idea. > > regards, tom lane > WARNING can be easily overlooked. Users of mobile/web apps don't see Postgres WARNINGs. Forbidding ALTER sounds more reasonable. Do you see any good use cases for whole-row indexes? And for such cases, wouldn't it be reasonable for users to specify all columns explicitly? E.g.: create index on t using btree(row(c1, c2, c3));
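To make the contested case concrete, here is a hypothetical example of an index on a function of a whole-row Var, plus the ALTER that changes the composite type underneath it (the scenario behind bug #18244):

    CREATE TABLE t (c1 int, c2 int, c3 int);

    -- an assumed IMMUTABLE function over the table's composite type
    CREATE FUNCTION f(t) RETURNS text
        AS 'SELECT $1.c1::text' LANGUAGE sql IMMUTABLE;

    CREATE INDEX t_f_idx ON t (f(t.*));   -- function on a whole-row Var

    -- currently allowed, since the composite type is only transiently
    -- referenced by the index expression; whether to ERROR, WARN, or
    -- allow this is the question above
    ALTER TABLE t ALTER COLUMN c2 TYPE bigint;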
Re: Make COPY format extendable: Extract COPY TO format implementations
On Fri, Dec 15, 2023 at 12:45 PM Sutou Kouhei wrote:
>
> Hi,
>
> In
> "Re: Make COPY format extendable: Extract COPY TO format implementations"
> on Fri, 15 Dec 2023 11:27:30 +0800,
> Junwang Zhao wrote:
>
> >> > Adding a prefix or suffix would be one option but to give extensions
> >> > more flexibility, another option would be to support format = 'custom'
> >> > and add the "handler" option to specify a copy handler function name
> >> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom',
> >> > HANDLER = 'arrow_copy_handler').
> >>
> > I like the prefix/suffix idea, easy to implement. *custom* is not a FORMAT,
> > and user has to know the name of the specific handler names, not
> > intuitive.
>
> Ah! I misunderstood this idea. "custom" is the special
> format to use "HANDLER". I thought that we can use it like
>
>    (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl1')
>
> and
>
>    (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl2')
>
> .
>
> >> Interesting. If we use this option, users can choose an COPY
> >> FORMAT implementation they like from multiple
> >> implementations. For example, a developer may implement a
> >> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON
> >> related API and another developer may implement a handler
> >> with simdjson[1] which is a fast JSON parser. Users can
> >> choose whichever they like.
> > Not sure about this, why not move Json copy handler to contrib
> > as an example for others, any extensions share the same format
> > function name and just install one? No bound would implement
> > another CSV or TEXT copy handler IMHO.
>
> I should have used a different format not JSON as an example
> for easy to understand. I just wanted to say that extension
> developers can implement another implementation without
> conflicting another implementation.

Yeah, I can see the value of the HANDLER option now. The possibility of two extensions for the same format using the same handler name should be rare, I guess ;)

>
> Thanks,
> --
> kou

--
Regards
Junwang Zhao
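For reference, the two competing shapes discussed in this thread would look roughly like this (the handler names are made up, and neither syntax exists yet):

    -- explicit handler, as proposed upthread
    COPY mytable FROM '/path/to/data.arrow'
        WITH (FORMAT 'custom', HANDLER 'arrow_copy_handler');

    -- name lookup: the format name itself resolves to a handler function,
    -- the way TABLESAMPLE methods are looked up
    COPY mytable FROM '/path/to/data.arrow' WITH (FORMAT 'arrow');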
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila wrote: > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > > > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > > > > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada > > > wrote: > > > > > > > > I've heard the report internally that replication lag became huge when > > > > decoding transactions each consisting of 500k sub transactions. Note > > > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > > > > Can't you suggest them to use streaming mode to avoid this problem or > do you see some problem with that? Yeah, that's one option. But I can suggest > > > > > Here is a test script for a many subtransactions scenario. In my > > > > environment, the logical decoding took over 2min to decode one top > > > > transaction having 100k subtransctions. > > > > > > > > - > > > > create table test (c int); > > > > create or replace function testfn (cnt int) returns void as $$ > > > > begin > > > > for i in 1..cnt loop > > > > begin > > > > insert into test values (i); > > > > exception when division_by_zero then > > > > raise notice 'caught error'; > > > > return; > > > > end; > > > > end loop; > > > > end; > > > > $$ > > > > language plpgsql; > > > > select testfn(10) > > > > set logical_decoding_work_mem to '4MB'; > > > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > > > > > > > To deal with this problem, I initially thought of the idea (a) > > > > mentioned in the comment; use a binary heap to maintain the > > > > transactions sorted by the amount of changes or the size. But it seems > > > > not a good idea to try maintaining all transactions by its size since > > > > the size of each transaction could be changed frequently. > > > > > > > > The attached patch uses a different approach that consists of three > > > > strategies; (1) maintain the list of transactions whose size is larger > > > > than 10% of logical_decoding_work_mem, and preferentially evict a > > > > transaction from this list. > > > > > > IIUC, you are giving preference to multiple list ideas as compared to > (a) because you don't need to adjust the list each time the > transaction size changes, is that right? Right. > If so, I think there is a > cost to keep that data structure up-to-date but it can help in > reducing the number of times we need to serialize. Yes, there is a trade-off. What I don't want to do is to keep all transactions ordered since it's too costly. The proposed idea uses multiple lists to keep all transactions roughly ordered. The maintenance cost would be cheap since each list is unordered. It might be a good idea to have a threshold to switch how to pick the largest transaction based on the number of transactions in the reorderbuffer. If there are many transactions, we can use the proposed algorithm to find a possibly-largest transaction, otherwise use the current way. > > If the list is empty, all transactions are > > > > small enough, (2) so we evict the oldest top-level transaction from > > > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > > > freeing memory blocks in GenerationContext. Finally, if this is also > > > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > > > note the fact that even if a transaction is evicted the > > > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > > > 0. 
In the worst case where all (quite a few) transactions are smaller > > > > than 10% of the memory limit, we might end up checking many > > > > transactions to find non-zero size transaction entries to evict. So > > > > the patch adds a new list to maintain all transactions that have at > > > > least one change in memory. > > > > > > > > Summarizing the algorithm I've implemented in the patch, > > > > > > > > 1. pick a transaction from the list of large transactions (larger than > > > > 10% of memory limit). > > > > 2. pick a transaction from the top-level transaction list in LSN order. > > > > 3. pick a transaction from the list of transactions that have at least > > > > one change in memory. > > > > > > > > With the patch, the above test case completed within 3 seconds in my > > > > environment. > > > > > > Thanks for working on this, I think it would be good to test other > > > scenarios as well where this might have some negative impact and see > > > where we stand. > > > > Agreed. > > > > > 1) A scenario where suppose you have one very large transaction that > > > is consuming ~40% of the memory and 5-6 comparatively smaller > > > transactions that are just above 10% of the memory limit. And now for > > > coming under the memory limit instead of getting 1 large transaction > > > evicted out, we are evicting out multiple times. > > > > Given the large transaction list will have up to 10 transactions, I > > think it's cheap to pick the largest transaction among them. It's O(N) > >
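To make the three-step pick concrete, here is a sketch in C (not the patch's actual code; large_txns, large_node, txns_with_changes, and mem_node are assumed names for the two new lists the patch maintains, while toplevel_by_lsn already exists in ReorderBuffer):

    static ReorderBufferTXN *
    ReorderBufferPickTXNToEvict(ReorderBuffer *rb)
    {
        dlist_iter  iter;
        ReorderBufferTXN *largest = NULL;

        /*
         * 1. Prefer a transaction larger than 10% of
         * logical_decoding_work_mem; this list holds at most ~10 entries,
         * so the linear scan for the largest one is cheap.
         */
        dlist_foreach(iter, &rb->large_txns)
        {
            ReorderBufferTXN *txn =
                dlist_container(ReorderBufferTXN, large_node, iter.cur);

            if (largest == NULL || txn->size > largest->size)
                largest = txn;
        }
        if (largest != NULL)
            return largest;

        /*
         * 2. Otherwise evict the oldest top-level transaction; evicting
         * older changes helps free whole blocks in the generation context.
         */
        if (!dlist_is_empty(&rb->toplevel_by_lsn))
            return dlist_head_element(ReorderBufferTXN, node,
                                      &rb->toplevel_by_lsn);

        /* 3. Otherwise any transaction that still has changes in memory. */
        return dlist_head_element(ReorderBufferTXN, mem_node,
                                  &rb->txns_with_changes);
    }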
Re: "pgoutput" options missing on documentation
Thanks for the update. Here are some more review comments for the v01* patches. // Patch v00-0001 v01 modified the messages more than I was expecting, although what you did looks fine to me. ~~~ 1. + /* Check required options */ + if (!protocol_version_given) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + /* translator: %s is a pgoutput option */ + errmsg("missing pgoutput option: %s", "proto_version")); + if (!publication_names_given) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + /* translator: %s is a pgoutput option */ + errmsg("missing pgoutput option: %s", "publication_names")); I saw that the original "publication_names" error was using errcode(ERRCODE_INVALID_PARAMETER_VALUE), but TBH since there is no option given at all I felt ERRCODE_SYNTAX_ERROR might be more appropriate errcode for those 2 mandatory option errors. // Patch v00-0002 2. I saw that the chapter "55.4. Streaming Replication Protocol" [1] introduces "START_REPLICATION SLOT slot_name LOGICAL ..." command and it says --- option_name The name of an option passed to the slot's logical decoding plugin. --- Perhaps that part should now include a reference to your new information: SUGGESTION option_name The name of an option passed to the slot's logical decoding plugin. Please see for details about options that are accepted by the standard (pgoutput) plugin. ~~~ 3. - The logical replication START_REPLICATION command - accepts following parameters: + Using the START_REPLICATION command, + pgoutput) accepts the following options: Oops, you copied my typo. There is a spurious ')'. ~~~ 4. + + + + origin + + + + String option to send only changes by an origin. It also gets + the option "none" to send the changes that have no origin associated, + and the option "any" to send the changes regardless of their origin. + This can be used to avoid loops (infinite replication of the same data) + among replication nodes. + + + AFAIK pgoutput only understands origin values "any" and "none" and nothing else; I felt the "It also gets..." part implies it is more flexible than it is. SUGGESTION Possible values are "none" (to only send the changes that have no origin associated), or "any" (to send the changes regardless of their origin). ~~~ 5. Rearranging option details > > SUGGESTION > > > > -proto_version > > ... > > -publication_names > > ... > > -binary > > ... > > -messages > > ... > > > > Since protocol version 2: > > > > -streaming (boolean) > > ... > > > > Since protocol version 3: > > > > -two_phase > > ... > > > > Since protocol version 4: > > > > -streaming (boolean/parallel) > > ... > > -origin > > This is not going to be correct because not all options do require a > protocol version. "origin" is added in version 16, but doesn't check > for any "proto_version". Perhaps we should fix this too. > OK, to deal with that can't you just include "origin" in the first group which has no special protocol requirements? SUGGESTION -proto_version -publication_names -binary -messages -origin Requires minimum protocol version 2: -streaming (boolean) Requires minimum protocol version 3: -two_phase Requires minimum protocol version 4: -streaming (parallel) == [1] 55.4 https://www.postgresql.org/docs/devel/protocol-replication.html Kind Regards, Peter Smith. Fujitsu Australia
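As an illustration of the grouping proposed above, a START_REPLICATION invocation exercising the options might look like this (slot and publication names are made up):

    START_REPLICATION SLOT "sub1" LOGICAL 0/0 (
        proto_version '4',
        publication_names '"pub1"',
        origin 'none',
        two_phase 'true',
        streaming 'parallel'
    )

Here origin needs no minimum protocol version, two_phase requires at least protocol version 3, and streaming 'parallel' requires version 4, per the suggested grouping.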
Re: Synchronizing slots from primary to standby
On Thu, Dec 14, 2023 at 10:15 AM Peter Smith wrote:
>
> A review comment for v47-0001
>

Thanks for reviewing. I have addressed these in v48. There is a design change around the code that checked for cascading standbys and revalidated new GUC values on conf-reload, so the code has changed entirely in the area some of these comments applied to. Please review the new version and let me know.

> ==
> src/backend/replication/slot.c
>
> 1. GetStandbySlotList
>
> +static void
> +WalSndRereadConfigAndReInitSlotList(List **standby_slots)
> +{
> +	char	   *pre_standby_slot_names;
> +
> +	ProcessConfigFile(PGC_SIGHUP);
> +
> +	/*
> +	 * If we are running on a standby, there is no need to reload
> +	 * standby_slot_names since we do not support syncing slots to cascading
> +	 * standbys.
> +	 */
> +	if (RecoveryInProgress())
> +		return;
>
> Should the RecoveryInProgress() check be first -- even before the
> ProcessConfigFile call?
>
> ==
> Kind Regards,
> Peter Smith.
> Fujitsu Australia
Re: Synchronizing slots from primary to standby
On Wed, Dec 13, 2023 at 3:53 PM Peter Smith wrote: > > Hi Shveta, here are some review comments for v45-0002. > Thanks for the feedback. Addressed these in v48. Please find my comments on some. > == > doc/src/sgml/bgworker.sgml > > 1. > + > + > + BgWorkerStart_PostmasterStart > + > + > + > BgWorkerStart_PostmasterStart > + Start as soon as postgres itself has finished its own initialization; > + processes requesting this are not eligible for database connections. > + > + > + > + > + > + BgWorkerStart_ConsistentState > + > + > + > BgWorkerStart_ConsistentState > + Start as soon as a consistent state has been reached in a hot-standby, > + allowing processes to connect to databases and run read-only queries. > + > + > + > + > + > + BgWorkerStart_RecoveryFinished > + > + > + > BgWorkerStart_RecoveryFinished > + Start as soon as the system has entered normal read-write state. Note > + that the BgWorkerStart_ConsistentState and > + BgWorkerStart_RecoveryFinished are equivalent > + in a server that's not a hot standby. > + > + > + > + > + > + BgWorkerStart_ConsistentState_HotStandby > + > + > + > BgWorkerStart_ConsistentState_HotStandby > + Same meaning as BgWorkerStart_ConsistentState but > + it is more strict in terms of the server i.e. start the worker only > + if it is hot-standby. > + > + > + > + > > Maybe reorder these slightly, because I felt it is better if the > BgWorkerStart_ConsistentState_HotStandby comes next after > BgWorkerStart_ConsistentState, which it refers to > > For example:: > 1st.BgWorkerStart_PostmasterStart > 2nd.BgWorkerStart_ConsistentState > 3rd.BgWorkerStart_ConsistentState_HotStandby > 4th.BgWorkerStart_RecoveryFinished > > == > doc/src/sgml/config.sgml > > 2. > enable_syncslot = true > > Not sure, but I thought the "= true" part should be formatted too. > > SUGGESTION > enable_syncslot = true > > == > doc/src/sgml/logicaldecoding.sgml > > 3. > + > + A logical replication slot on the primary can be synchronized to the hot > + standby by enabling the failover option during slot creation and setting > + enable_syncslot on the standby. For the > synchronization > + to work, it is mandatory to have a physical replication slot between the > + primary and the standby. It's highly recommended that the said physical > + replication slot is listed in standby_slot_names on > + the primary to prevent the subscriber from consuming changes faster than > + the hot standby. Additionally, hot_standby_feedback > + must be enabled on the standby for the slots synchronization to work. > + > > I felt those parts that describe the mandatory GUCs should be kept together. > > SUGGESTION > For the synchronization to work, it is mandatory to have a physical > replication slot between the primary and the standby, and > hot_standby_feedback must be enabled on the > standby. > > It's also highly recommended that the said physical replication slot > is named in standby_slot_names list on the primary, > to prevent the subscriber from consuming changes faster than the hot > standby. > > ~~~ > > 4. (Chapter 49) > > By enabling synchronization of slots, logical replication can be > resumed after failover depending upon the > pg_replication_slots.sync_state for the synchronized slots on the > standby at the time of failover. Only slots that were in ready > sync_state ('r') on the standby before failover can be used for > logical replication after failover. 
However, the slots which were in > initiated sync_state ('i') and not sync-ready ('r') at the time of > failover will be dropped and logical replication for such slots can > not be resumed after failover. This applies to the case where a > logical subscription is disabled before failover and is enabled after > failover. If the synchronized slot due to disabled subscription could > not be made sync-ready ('r') on standby, then the subscription can not > be resumed after failover even when enabled. If the primary is idle, > then the synchronized slots on the standby may take a noticeable time > to reach the ready ('r') sync_state. This can be sped up by calling > the pg_log_standby_snapshot function on the primary. > > ~ > > Somehow, I still felt all that was too wordy/repetitive. Below is my > attempt to make it more concise. Thoughts? > > SUGGESTION > The ability to resume logical replication after failover depends upon > the pg_replication_slots.sync_state value for the synchronized slots > on the standby at the time of failover. Only slots that have attained > a "ready" sync_state ('r') on the standby before failover can be used > for logical replication after failover. Slots that have not yet > reached 'r' state (they are still 'i') will be dropped, therefore > logical
Re: Synchronizing slots from primary to standby
On Thu, Dec 14, 2023 at 4:40 PM Amit Kapila wrote: > > On Thu, Dec 14, 2023 at 7:00 AM Peter Smith wrote: > > > > Hi, here are a few more review comments for the patch v47-0002 > > > > (plus my review comments of v45-0002 [1] are yet to be addressed) > > > > == > > 1. General > > > > For consistency and readability, try to use variables of the same > > names whenever they have the same purpose, even when they declared are > > in different functions. A few like this were already mentioned in the > > previous review but there are more I keep noticing. > > > > For example, > > 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. > > > > > > == > > src/backend/replication/logical/slotsync.c > > > > 2. > > +/* > > + * > > + * Validates the primary server info. > > + * > > + * Using the specified primary server connection, it verifies whether > > the master > > + * is a standby itself and returns true in that case to convey the caller > > that > > + * we are on the cascading standby. > > + * But if master is the primary server, it goes ahead and validates > > + * primary_slot_name. It emits error if the physical slot in > > primary_slot_name > > + * does not exist on the primary server. > > + */ > > +static bool > > +validate_primary_info(WalReceiverConn *wrconn) > > > > 2b. > > IMO it is too tricky to have a function called "validate_xxx", when > > actually you gave that return value some special unintuitive meaning > > other than just validation. IMO it is always better for the returned > > value to properly match the function name so the expectations are very > > obvious. So, In this case, I think a better function signature would > > be like this: > > > > SUGGESTION > > > > static void > > validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) > > > > or > > > > static void > > validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > > > > The terminology master_is_standby is a bit indirect for this usage, so > I would prefer the second one. Shall we name this function as > check_primary_info()? Additionally, can we rewrite the following > comment: "Using the specified primary server connection, check whether > we are cascading standby. It also validates primary_slot_info for > non-cascading-standbys.". > > -- > With Regards, > Amit Kapila. PFA v48. Changes are: 1) Addressed comments by Peter for v45-002 and v47-002 given in [1] and [2] respectively 2) Addressed comments by Amit for v47-002 given in [3], [4] 3) Addressed an old comment (#74 in [5]) of getting rid of header inclusion from tablesync.c when there was no code change in that file. Thanks Hou-san for working on this change. TODO: --Address the test comments in [1] for 050_standby_failover_slots_sync.pl --Review the feasibility of addressing one pending comment (comment 13 in [5]) of 'r'->'n' conversion. [1]: https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAHut%2BPsvxs-%3Dj3aCpPVs3e4w78HndCdO-F4bLPzAX70%2BdgWUuQ%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1L2ts%3DgfiF4aw7-DH8HWj29s08hVRq-Ff8%3DmjfdUXx8CA%40mail.gmail.com [4]: https://www.postgresql.org/message-id/CAA4eK1%2Bw9yv%2B4UZXhiDHZpGDfbeRHYDBu23FwsniS8sYUZeu1w%40mail.gmail.com [5]: https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com thanks Shveta
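For illustration, assuming the patch's pg_replication_slots.sync_state column, the state of synchronized slots could be checked like this before relying on them after failover:

    -- on the standby: sync_state is the column added by this patch series;
    -- only slots in the ready state ('r') can be used after failover
    SELECT slot_name, sync_state FROM pg_replication_slots;

    -- on the primary: speeds up an idle primary's slots reaching 'r',
    -- as the documentation text above notes
    SELECT pg_log_standby_snapshot();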
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 15 Dec 2023 11:27:30 +0800, Junwang Zhao wrote: >> > Adding a prefix or suffix would be one option but to give extensions >> > more flexibility, another option would be to support format = 'custom' >> > and add the "handler" option to specify a copy handler function name >> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom', >> > HANDLER = 'arrow_copy_handler'). >> > I like the prefix/suffix idea, easy to implement. *custom* is not a FORMAT, > and user has to know the name of the specific handler names, not > intuitive. Ah! I misunderstood this idea. "custom" is the special format to use "HANDLER". I thought that we can use it like (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl1') and (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl2') . >> Interesting. If we use this option, users can choose an COPY >> FORMAT implementation they like from multiple >> implementations. For example, a developer may implement a >> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON >> related API and another developer may implement a handler >> with simdjson[1] which is a fast JSON parser. Users can >> choose whichever they like. > Not sure about this, why not move Json copy handler to contrib > as an example for others, any extensions share the same format > function name and just install one? No bound would implement > another CSV or TEXT copy handler IMHO. I should have used a different format not JSON as an example for easy to understand. I just wanted to say that extension developers can implement another implementation without conflicting another implementation. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
On Fri, Dec 15, 2023 at 9:53 AM Sutou Kouhei wrote:
>
> Hi,
>
> In
> "Re: Make COPY format extendable: Extract COPY TO format implementations"
> on Fri, 15 Dec 2023 05:19:43 +0900,
> Masahiko Sawada wrote:
>
> > To avoid collisions, extensions can be created in a
> > different schema than public.
>
> Thanks. I didn't notice it.
>
> > And note that built-in format copy handler doesn't need to
> > declare its handler function.
>
> Right. I know it.
>
> > Adding a prefix or suffix would be one option but to give extensions
> > more flexibility, another option would be to support format = 'custom'
> > and add the "handler" option to specify a copy handler function name
> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom',
> > HANDLER = 'arrow_copy_handler').
>
> Interesting. If we use this option, users can choose an COPY
> FORMAT implementation they like from multiple
> implementations. For example, a developer may implement a
> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON
> related API and another developer may implement a handler
> with simdjson[1] which is a fast JSON parser. Users can
> choose whichever they like.
>
> But specifying HANDLER = '...' explicitly is a bit
> inconvenient. Because only one handler will be installed in
> most use cases. In the case, users don't need to choose one
> handler.
>
> If we choose this option, it may be better that we also
> provide a mechanism that can work without HANDLER. Searching
> a function by name like tablesample method does is an option.

Agreed. We can search for the function by format name by default, and the user can optionally specify the handler function name in cases where the names of installed custom copy handlers collide. Probably the handler option stuff could be a follow-up patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada > > wrote: > > > > > > I've heard the report internally that replication lag became huge when > > > decoding transactions each consisting of 500k sub transactions. Note > > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > Can't you suggest them to use streaming mode to avoid this problem or do you see some problem with that? > > > Here is a test script for a many subtransactions scenario. In my > > > environment, the logical decoding took over 2min to decode one top > > > transaction having 100k subtransctions. > > > > > > - > > > create table test (c int); > > > create or replace function testfn (cnt int) returns void as $$ > > > begin > > > for i in 1..cnt loop > > > begin > > > insert into test values (i); > > > exception when division_by_zero then > > > raise notice 'caught error'; > > > return; > > > end; > > > end loop; > > > end; > > > $$ > > > language plpgsql; > > > select testfn(10) > > > set logical_decoding_work_mem to '4MB'; > > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > > > > To deal with this problem, I initially thought of the idea (a) > > > mentioned in the comment; use a binary heap to maintain the > > > transactions sorted by the amount of changes or the size. But it seems > > > not a good idea to try maintaining all transactions by its size since > > > the size of each transaction could be changed frequently. > > > > > > The attached patch uses a different approach that consists of three > > > strategies; (1) maintain the list of transactions whose size is larger > > > than 10% of logical_decoding_work_mem, and preferentially evict a > > > transaction from this list. > > > IIUC, you are giving preference to multiple list ideas as compared to (a) because you don't need to adjust the list each time the transaction size changes, is that right? If so, I think there is a cost to keep that data structure up-to-date but it can help in reducing the number of times we need to serialize. If the list is empty, all transactions are > > > small enough, (2) so we evict the oldest top-level transaction from > > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > > freeing memory blocks in GenerationContext. Finally, if this is also > > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > > note the fact that even if a transaction is evicted the > > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > > 0. In the worst case where all (quite a few) transactions are smaller > > > than 10% of the memory limit, we might end up checking many > > > transactions to find non-zero size transaction entries to evict. So > > > the patch adds a new list to maintain all transactions that have at > > > least one change in memory. > > > > > > Summarizing the algorithm I've implemented in the patch, > > > > > > 1. pick a transaction from the list of large transactions (larger than > > > 10% of memory limit). > > > 2. pick a transaction from the top-level transaction list in LSN order. > > > 3. pick a transaction from the list of transactions that have at least > > > one change in memory. > > > > > > With the patch, the above test case completed within 3 seconds in my > > > environment. 
> > > > Thanks for working on this, I think it would be good to test other > > scenarios as well where this might have some negative impact and see > > where we stand. > > Agreed. > > > 1) A scenario where suppose you have one very large transaction that > > is consuming ~40% of the memory and 5-6 comparatively smaller > > transactions that are just above 10% of the memory limit. And now for > > coming under the memory limit instead of getting 1 large transaction > > evicted out, we are evicting out multiple times. > > Given the large transaction list will have up to 10 transactions, I > think it's cheap to pick the largest transaction among them. It's O(N) > but N won't be large. > > > 2) Another scenario where all the transactions are under 10% of the > > memory limit but let's say there are some transactions are consuming > > around 8-9% of the memory limit each but those are not very old > > transactions whereas there are certain old transactions which are > > fairly small and consuming under 1% of memory limit and there are many > > such transactions. So how it would affect if we frequently select > > many of these transactions to come under memory limit instead of > > selecting a couple of large transactions which are consuming 8-9%? > > Yeah, probably we can do something for small transactions (i.e. small > and on-memory transactions). One idea is to pick the largest > transaction among them by iterating over all of them. Given that the > more transactions are evicted, the less
Re: Make COPY format extendable: Extract COPY TO format implementations
On Fri, Dec 15, 2023 at 8:53 AM Sutou Kouhei wrote:
>
> Hi,
>
> In
> "Re: Make COPY format extendable: Extract COPY TO format implementations"
> on Fri, 15 Dec 2023 05:19:43 +0900,
> Masahiko Sawada wrote:
>
> > To avoid collisions, extensions can be created in a
> > different schema than public.
>
> Thanks. I didn't notice it.
>
> > And note that built-in format copy handler doesn't need to
> > declare its handler function.
>
> Right. I know it.
>
> > Adding a prefix or suffix would be one option but to give extensions
> > more flexibility, another option would be to support format = 'custom'
> > and add the "handler" option to specify a copy handler function name
> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom',
> > HANDLER = 'arrow_copy_handler').

I like the prefix/suffix idea; it's easy to implement. *custom* is not a FORMAT, and the user has to know the names of the specific handlers, which is not intuitive.

> Interesting. If we use this option, users can choose an COPY
> FORMAT implementation they like from multiple
> implementations. For example, a developer may implement a
> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON
> related API and another developer may implement a handler
> with simdjson[1] which is a fast JSON parser. Users can
> choose whichever they like.

Not sure about this; why not move the JSON copy handler to contrib as an example for others? Extensions could share the same format function name, and users would just install one of them. Nobody would implement another CSV or TEXT copy handler, IMHO.

>
> But specifying HANDLER = '...' explicitly is a bit
> inconvenient. Because only one handler will be installed in
> most use cases. In the case, users don't need to choose one
> handler.
>
> If we choose this option, it may be better that we also
> provide a mechanism that can work without HANDLER. Searching
> a function by name like tablesample method does is an option.
>
>
> [1]: https://github.com/simdjson/simdjson
>
>
> Thanks,
> --
> kou

--
Regards
Junwang Zhao
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "RE: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 12 Dec 2023 02:31:53 +, "Hayato Kuroda (Fujitsu)" wrote: >> Can we discuss how to proceed this improvement? >> >> There are 2 approaches for it: >> >> 1. Do the followings concurrently: >>a. Implementing small changes that got a consensus and >> merge them step-by-step >> (e.g. We got a consensus that we need to extract the >> current format related routines.) >>b. Discuss design >> >>(v1-v3 patches use this approach.) >> >> 2. Implement one (large) complete patch set with design >>discussion and merge it >> >>(v4- patches use this approach.) >> >> Which approach is preferred? (Or should we choose another >> approach?) >> >> I thought that 1. is preferred because it will reduce review >> cost. So I chose 1. > > I'm ok to use approach 1, but could you please divide a large patch? E.g., > > 0001. defines an infrastructure for copy-API > 0002. adjusts current codes to use APIs > 0003. adds a test module in src/test/modules or contrib. > ... > > This approach helps reviewers to see patches deeper. Separated patches can be > combined when they are close to committable. It seems that I should have chosen another approach based on comments so far: 3. Do the followings in order: a. Implement a workable (but maybe dirty and/or incomplete) implementation to discuss design like [1], discuss design with it and get a consensus on design b. Implement small patches based on the design [1]: https://www.postgresql.org/message-id/CAD21AoCunywHird3GaPzWe6s9JG1wzxj3Cr6vGN36DDheGjOjA%40mail.gmail.com I'll implement a custom COPY FORMAT handler with [1] and provide a feedback with the experience. (It's for a.) Thanks, -- kou
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat wrote: > > On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila wrote: > > > > It can only be cleaned if we process it but xact_decode won't allow us > > to process it and I don't think it would be a good idea to add another > > hack for sequences here. See below code: > > > > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > > { > > SnapBuild *builder = ctx->snapshot_builder; > > ReorderBuffer *reorder = ctx->reorder; > > XLogReaderState *r = buf->record; > > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; > > > > /* > > * If the snapshot isn't yet fully built, we cannot decode anything, so > > * bail out. > > */ > > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) > > return; > > That may be true for a transaction which is decoded, but I think all > the transactions which are added to ReorderBuffer should be cleaned up > once they have been processed irrespective of whether they are > decoded/sent downstream or not. In this case I see the sequence hash > being cleaned up for the sequence related transaction in Hayato's > reproducer. > It was because the test you are using was not designed to show the problem I mentioned. In this case, the rollback was after a full snapshot state was reached. -- With Regards, Amit Kapila.
Re: Add 64-bit XIDs into PostgreSQL 15
Hi Maxim Orlov,

Good news: xid64 has achieved a successful first phase. I tried to change the patch status (https://commitfest.postgresql.org/43/3594/), but it seems incorrect.

Maxim Orlov wrote on Wed, Dec 13, 2023 at 20:26:
> Hi!
>
> Just to keep this thread up to date, here's a new version after recent
> changes in SLRU.
> I'm also change order of the patches in the set, to make adding initdb MOX
> options after the
> "core 64 xid" patch, since MOX patch is unlikely to be committed and now
> for test purpose only.
>
> --
> Best regards,
> Maxim Orlov.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada wrote: > In v45, 0001 - 0006 are from earlier versions but I've merged previous > updates. So the radix tree now has RT_SET() and RT_FIND() but not > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous > versions that incorporated the above comments. 0009 patch integrates > tidstore with lazy vacuum. Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com ...and got similar results, so we still have good space-efficiency on this test: master: INFO: finished vacuuming "john.public.test": index scans: 9 system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s v45: INFO: finished vacuuming "john.public.test": index scans: 1 system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s More sparse TID distributions won't be as favorable, but we have ideas to improve that in the future. For my next steps, I will finish the node-shrinking behavior and save for a later patchset. Not needed for tid store, but needs to happen because of assumptions in the code. Also, some time ago, I think I commented out RT_FREE_RECURSE to get something working, so I'll fix it, and look at other fixmes and todos. > Note that DSA segment problem is not > resolved yet in this patch. I remember you started a separate thread about this, but I don't think it got any attention. Maybe reply with a "TLDR;" and share a patch to allow controlling max segment size. Some more comments: v45-0003: Since RT_ITERATE_NEXT_PTR works for tid store, do we even need RT_ITERATE_NEXT anymore? The former should handle fixed-length values just fine? If so, we should rename it to match the latter. + * The caller is responsible for locking/unlocking the tree in shared mode. This is not new to v45, but this will come up again below. This needs more explanation: Since we're returning a pointer (to support variable-length values), the caller needs to maintain control until it's finished with the value. v45-0005: + * Regarding the concurrency support, we use a single LWLock for the TidStore. + * The TidStore is exclusively locked when inserting encoded tids to the + * radix tree or when resetting itself. When searching on the TidStore or + * doing the iteration, it is not locked but the underlying radix tree is + * locked in shared mode. This is just stating facts without giving any reasons. Readers are going to wonder why it's inconsistent. The "why" is much more important than the "what". Even with that, this comment is also far from the relevant parts, and so will get out of date. Maybe we can just make sure each relevant function is explained individually. v45-0007: -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx); +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem); Tid store calls this max_bytes -- can we use that name here, too? "work_mem" is highly specific. - RT_PTR_ALLOC *slot; + RT_PTR_ALLOC *slot = NULL; We have a macro for invalid pointer because of DSA. v45-0008: - if (off < 1 || off > MAX_TUPLES_PER_PAGE) + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE)) elog(ERROR, "tuple offset out of range: %u", off); This is a superfluous distraction, since the error path is located way off in the cold segment of the binary. v45-0009: (just a few small things for now) - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the - * vacrel->dead_items array. 
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items. I think we can keep as "listed in the TID store". - * Allocate dead_items (either using palloc, or in dynamic shared memory). - * Sets dead_items in vacrel for caller. + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items + * in vacrel for caller. I think we want to keep "in dynamic shared memory". It's still true. I'm not sure anything needs to change here, actually. parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, - int nrequested_workers, int max_items, - int elevel, BufferAccessStrategy bstrategy) + int nrequested_workers, int vac_work_mem, + int max_offset, int elevel, + BufferAccessStrategy bstrategy) It seems very strange to me that this function has to pass the max_offset. In general, it's been simpler to assume we have a constant max_offset, but in this case that fact is not helping. Something to think about for later. - (errmsg("scanned index \"%s\" to remove %d row versions", + (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions", This should be signed int64. v45-0010: Thinking about this some more, I'm not sure we need to do anything different for the *starting* segment size. (Controlling *max* size does seem important, however.) For the corner case of m_w_m = 1MB, it's fine if vacuum quits pruning immediately after (in effect) it finds the DSA has gone to 2MB. It's not worth
Re: Change GUC hashtable to use simplehash?
I wrote:
> * v8 with chunked interface:
> latency average = 555.688 ms
>
> This starts to improve things for me.
>
> * v8 with chunked, and return lower 32 bits of full 64-bit hash:
> latency average = 556.324 ms
>
> This is within the noise level. There doesn't seem to be much downside
> of using a couple cycles for fasthash's 32-bit reduction.
>
> * revert back to master from Dec 4 and then cherry pick a86c61c9ee
> (save last entry of SearchPathCache)
> latency average = 545.747 ms
>
> So chunked incremental hashing gets within ~2% of that, which is nice.
> It seems we should use that when removing strlen, when convenient.
>
> Updated next steps:
> * Investigate whether/how to incorporate final length into the
> calculation when we don't have the length up front.
> * Add some desperately needed explanatory comments.
> * Use this in some existing cases where it makes sense.
> * Get back to GUC hash and dynahash.

For #1 here, I cloned SMHasher and was dismayed at the complete lack of documentation, but after some poking around, found how to run the tests, using the 32-bit hash to save time. It turns out that the input length is important. I've attached two files of results -- "nolen" means stop using the initial length to tweak the internal seed. As you can see, there are 8 failures. "pluslen" means I then incorporated the length within the finalizer. This *does* pass SMHasher, so that's good. (of course this way can't produce the same hash as when we know the length up front, but that's not important). The attached shows how that would work, further whacking around and testing with Jeff's prototype for the search path cache hash table. I'll work on code comments and get it polished.

---
Testing fasthash32 "fast-hash 32bit" with NO LENGTH

[[[ Sanity Tests ]]]

Verification value 0x6A202089 ... FAIL! (Expected 0xe9481afc)
Running sanity check 1 .. PASS
Running AppendedZeroesTest .. PASS

[[[ Keyset 'Sparse' Tests ]]]

Keyset 'Sparse' - 16-bit keys with up to 9 bits set - 50643 keys
Testing collisions ( 32-bit) - Expected 0.3, actual 0 (0.00x)
Testing collisions (high 19-25 bits) - Worst is 20 bits: 1227/1203 (1.02x)
Testing collisions (low  19-25 bits) - Worst is 24 bits: 81/76 (1.06x)
Testing distribution - Worst bias is the 13-bit window at bit 31 - 0.436%

Keyset 'Sparse' - 24-bit keys with up to 8 bits set - 1271626 keys
Testing collisions ( 32-bit) - Expected 188.2, actual 182 (0.97x)
Testing distribution - Worst bias is the 17-bit window at bit 11 - 0.089%

Keyset 'Sparse' - 32-bit keys with up to 7 bits set - 4514873 keys
Testing collisions ( 32-bit) - Expected 2372.2, actual 2286 (0.96x)
Testing distribution - Worst bias is the 19-bit window at bit 23 - 0.044%

Keyset 'Sparse' - 40-bit keys with up to 6 bits set - 4598479 keys
Testing collisions ( 32-bit) - Expected 2460.8, actual 2426 (0.99x) (-34)
Testing distribution - Worst bias is the 19-bit window at bit 25 - 0.041%

Keyset 'Sparse' - 48-bit keys with up to 6 bits set - 14196869 keys
Testing collisions ( 32-bit) - Expected 23437.8, actual 23601 (1.01x) (164)
Testing distribution - Worst bias is the 20-bit window at bit 31 - 0.020%

Keyset 'Sparse' - 56-bit keys with up to 5 bits set - 4216423 keys
Testing collisions ( 32-bit) - Expected 2069.0, actual 1958 (0.95x)
Testing distribution - Worst bias is the 19-bit window at bit 2 - 0.040%

Keyset 'Sparse' - 64-bit keys with up to 5 bits set - 8303633 keys
Testing collisions ( 32-bit) - Expected 8021.7, actual 7994 (1.00x) (-27)
Testing distribution - Worst bias is the 20-bit window at bit 30 - 0.040%

Keyset 'Sparse' - 72-bit keys with up to 5 bits set - 15082603 keys
Testing collisions ( 32-bit) - Expected 26451.8, actual 26503 (1.00x) (52)
Testing distribution - Worst bias is the 20-bit window at bit 11 - 0.021%

Keyset 'Sparse' - 96-bit keys with up to 4 bits set - 3469497 keys
Testing collisions ( 32-bit) - Expected 1401.0, actual 1415 (1.01x) (15)
Testing distribution - Worst bias is the 19-bit window at bit 12 - 0.058%

Keyset 'Sparse' - 160-bit keys with up to 4 bits set - 26977161 keys
Testing collisions ( 32-bit) - Expected 84546.1, actual 84165 (1.00x) (-381)
Testing distribution - Worst bias is the 20-bit window at bit 12 - 0.012%

Keyset 'Sparse' - 256-bit keys with up to 3 bits set - 2796417 keys
Testing collisions ( 32-bit) - Expected 910.2, actual 905 (0.99x) (-5)
Testing distribution - Worst bias is the 18-bit window at bit 21 - 0.056%

Keyset 'Sparse' - 512-bit keys with up to 3 bits set - 22370049 keys
Testing collisions ( 32-bit) - Expected 58155.4, actual 57846 (0.99x) (-309)
Testing distribution - Worst bias is the 20-bit window at bit 6 - 0.013%

Keyset 'Sparse' - 1024-bit keys with up to 2 bits set - 524801 keys
Testing collisions ( 32-bit) - Expected 32.1, actual 35 (1.09x) (3)
Testing distribution - Worst bias is the 16-bit window at bit 13 - 0.118%

Keyset 'Sparse' - 2048-bit keys with up to 2 bits set -
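For the archives, the "incorporate the length within the finalizer" idea can be sketched like this in C, using fasthash's published mixing constants (the function names are illustrative, not the committed API):

    #include <stdint.h>

    /* fasthash's avalanche step, from the reference implementation */
    static inline uint64_t
    fasthash_mix(uint64_t h)
    {
        h ^= h >> 23;
        h *= UINT64_C(0x2127599bf4325c37);
        h ^= h >> 47;
        return h;
    }

    /*
     * Finalizer for the incremental case: the length was not known up
     * front, so fold it in at the end rather than into the initial seed.
     */
    static inline uint64_t
    fasthash_final64(uint64_t accum, uint64_t len)
    {
        return fasthash_mix(accum ^ len);
    }

    /* 32-bit result, reduced the same way fasthash32 reduces it */
    static inline uint32_t
    fasthash_final32(uint64_t accum, uint64_t len)
    {
        uint64_t h = fasthash_final64(accum, len);

        return (uint32_t) (h - (h >> 32));
    }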
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 15 Dec 2023 05:19:43 +0900, Masahiko Sawada wrote: > To avoid collisions, extensions can be created in a > different schema than public. Thanks. I didn't notice it. > And note that built-in format copy handler doesn't need to > declare its handler function. Right. I know it. > Adding a prefix or suffix would be one option but to give extensions > more flexibility, another option would be to support format = 'custom' > and add the "handler" option to specify a copy handler function name > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom', > HANDLER = 'arrow_copy_handler'). Interesting. If we use this option, users can choose an COPY FORMAT implementation they like from multiple implementations. For example, a developer may implement a COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON related API and another developer may implement a handler with simdjson[1] which is a fast JSON parser. Users can choose whichever they like. But specifying HANDLER = '...' explicitly is a bit inconvenient. Because only one handler will be installed in most use cases. In the case, users don't need to choose one handler. If we choose this option, it may be better that we also provide a mechanism that can work without HANDLER. Searching a function by name like tablesample method does is an option. [1]: https://github.com/simdjson/simdjson Thanks, -- kou
Re: Memory consumed by paths during partitionwise join planning
Forgot to mention, On Thu, Dec 14, 2023 at 5:34 PM Ashutosh Bapat wrote: > > On Thu, Dec 7, 2023 at 6:19 PM David Rowley wrote: > > > > Maybe we can try to move forward with your refcount idea and see how > > the performance looks. If that's intolerable then that might help us > > decide on the next best alternative solution. > > > > Here are performance numbers > > setup > > create table t1 (a integer primary key, b integer); > create table t1_parted (a integer primary key, b integer) partition by > range(a); > create 1000 partitions of t1 > > query (a five way self-join) > select * from t1 a, t1 b, t1 c, t1 d, t1 e where a.a = b.a and b.a = > c.a and c.a = d.a and d.a = e.a -- unpartitioned table case > select * from t1_parted a, t1_parted b, t1_parted c, t1_parted d, > t1_parted e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = > e.a; -- partitioned table case > > The numbers with patches attached to [1] with limitations listed in > the email are thus > > Ran each query 10 times through EXPLAIN (SUMMARY) and averaged > planning time with and without patch. > unpartitioned case > without patch: 0.25 > with patch: 0.19 > this improvement is probably misleading. The planning time for this > query change a lot. > > partitioned case (without partitionwise join) > without patch: 14580.65 > with patch: 14625.12 > % degradation: 0.3% > > partitioned case (with partitionwise join) > without patch: 23273.69 > with patch: 23328.52 > % degradation: 0.2% > > That looks pretty small considering the benefits. What do you think? > > [1] > https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com If you want to experiment, please use attached patches. There's a fix for segfault during initdb in them. The patches are still raw. -- Best Wishes, Ashutosh Bapat From abcd735ca251f4832125aa38cee6d03743dcf8bb Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Mon, 17 Jul 2023 10:44:18 +0530 Subject: [PATCH 2/4] Basic infrastructure to link, unlink and free pathnodes add_path and add_partial_path free path nodes which they deem sub-optimal. But they just free the uppermost pathnode in the path. The subpaths continue to occupy memory even if they are not used anywhere. The subpath nodes are not freed because they may be referenced by other paths. This commit introduces the infrastructure to track references to paths. As the path nodes are referenced they are "linked" using link_path() to referencing objects. When the path nodes are no longer useful they are "unlinked" using unlink_path() from the referencing objects. The paths whose references reach 0 during unlinking are freed automatically using free_path(). The function unlinks the sub path nodes and can be called when freeing any path node. When the final path for join tree is chosen the paths linked to each participating relation are unlinked, thus ultimately getting freed if not used. TODO The functions free_path(), unlink_path() and link_path() have ereports to catch code paths which do not use these function correctly. They call errbacktrace() which is not used anywhere else. These ereport calls will need to be fixed for productization. 
---
 src/backend/optimizer/path/allpaths.c |  77 +++
 src/backend/optimizer/util/pathnode.c | 136 ++
 src/include/nodes/pathnodes.h         |   1 +
 src/include/optimizer/pathnode.h      |  38 +++
 4 files changed, 252 insertions(+)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 67921a0826..b7098ac39e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3388,6 +3388,81 @@ make_rel_from_joinlist(PlannerInfo *root, List *joinlist)
 	}
 }
 
+/*
+ * TODO: Need similar code to free paths in upper relations.
+ */
+static void
+free_unused_paths_from_rel(RelOptInfo *rel)
+{
+	ListCell   *lc_path;
+	int			cnt_part;
+
+	foreach(lc_path, rel->pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc_path);
+
+		/* Free the path if none references it. */
+		if (path->ref_count == 1)
+		{
+			/* TODO: use routine to unlink path from list */
+			rel->pathlist = foreach_delete_current(rel->pathlist, lc_path);
+			unlink_path();
+		}
+	}
+
+	/* Do the same for partial pathlist. */
+	foreach(lc_path, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc_path);
+
+		/* Free the path if none references it. */
+		if (path->ref_count == 1)
+		{
+			rel->partial_pathlist = foreach_delete_current(rel->partial_pathlist, lc_path);
+			unlink_path();
+		}
+	}
+
+	/*
+	 * TODO: We can perform this in generate_partitionwise_paths as well since
+	 * by that time all the paths from partitions will be linked if used.
+	 */
+
+	/* Free unused paths from the partition relations */
+	for (cnt_part = 0; cnt_part < rel->nparts; cnt_part++)
+	{
+		free_unused_paths_from_rel(rel->part_rels[cnt_part]);
+	}
+}
+
+/*
+ * Free unused paths from all the relations created
Re: Simplify newNode()
On Fri, Dec 15, 2023 at 11:44 AM Tom Lane wrote: > Heikki Linnakangas writes: > > Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. > > It's not doing any good as it is, as it gets compiled to be identical to > > MemoryContextAllocZero. > > Also not so here. Admittedly, my results don't make much of a case > for keeping the two code paths, even on compilers where it matters. FWIW here is what I figured out once about why it gets compiled the same these days: https://www.postgresql.org/message-id/ca+hukglfa6ana0vs7lf0op0xbh05he8syx8nfhdyt7k2chy...@mail.gmail.com
Re: Simplify newNode()
Heikki Linnakangas writes:
> On 14/12/2023 10:32, Peter Eisentraut wrote:
>> But if we think that compilers are now smart enough, maybe we can unwind
>> this whole stack a bit more? Maybe we don't need MemSetTest() and/or
>> palloc0fast() and/or newNode() at all?

> Good point. Looking closer, modern compilers will actually turn the
> MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset()
> anyway! Funny. That is true for recent versions of gcc, clang, and MSVC.

Not here ...

> Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned.
> It's not doing any good as it is, as it gets compiled to be identical to
> MemoryContextAllocZero.

Also not so here. Admittedly, my results don't make much of a case
for keeping the two code paths, even on compilers where it matters.

I experimented with the same planner-intensive test case that I used
in the last discussion back in 2008. I got these results:

HEAD:     tps = 144.974195 (without initial connection time)
v1 patch: tps = 146.302004 (without initial connection time)
v2 patch: tps = 144.882208 (without initial connection time)

While there's not much daylight between these numbers, the times are
quite reproducible for me. This is with RHEL8's gcc 8.5.0 on x86_64.
That's probably a bit trailing-edge in terms of what people might be
using with v17, but I don't think it's discountable.

I also looked at the backend's overall code size per size(1):

HEAD:
   text    data     bss     dec     hex filename
8613007  100192  220176 8933375  884fff testversion.stock/bin/postgres

v1 patch:
   text    data     bss     dec     hex filename
8615126  100192  220144 8935462  885826 testversion.v1/bin/postgres

v2 patch:
   text    data     bss     dec     hex filename
8595322  100192  220144 8915658  880aca testversion.v2/bin/postgres

I did check that the v1 patch successfully inlines newNode() and reduces
it to just a MemoryContextAllocZeroAligned call, so it's correct that
modern compilers do that better than whatever I tested in 2008. But I
wonder what is happening in v2 to reduce the code size so much.
MemoryContextAllocZeroAligned is not 20kB by itself.

regards, tom lane
Re: useless LIMIT_OPTION_DEFAULT
Zhang Mingli writes: > By reading the codes, I found that we process limit option as > LIMIT_OPTION_WITH_TIES when using WITH TIES > and all others are LIMIT_OPTION_COUNT by commit > 357889eb17bb9c9336c4f324ceb1651da616fe57. > And check actual limit node in limit_needed(). > There is no need to maintain a useless default limit enum. I agree, that looks pretty useless. Our normal convention for representing not having any LIMIT clause would be to not create a SelectLimit node at all. I don't see why allowing a second representation is a good idea, especially when the code is not in fact doing that. git blame shows that this came in with 357889eb1. Alvaro, Surafel, do you want to argue for keeping things as-is? regards, tom lane
Re: Teach predtest about IS [NOT] proofs
James Coleman writes: > On Wed, Dec 13, 2023 at 1:36 PM Tom Lane wrote: >> I don't have an objection in principle to adding more smarts to >> predtest.c. However, we should be wary of slowing down cases where >> no BooleanTests are present to be optimized. I wonder if it could >> help to use a switch on nodeTag rather than a series of if(IsA()) >> tests. (I'd be inclined to rewrite the inner if-then-else chains >> as switches too, really. You get some benefit from the compiler >> noticing whether you've covered all the enum values.) > I think I could take this on; would you prefer it as a patch in this > series? Or as a new patch thread? No, keep it in the same thread (and make a CF entry, if you didn't already). It might be best to make a series of 2 patches, first just refactoring what's there per this discussion, and then a second one to add BooleanTest logic. >> I note you've actively broken the function's ability to cope with >> NULL input pointers. Maybe we don't need it to, but I'm not going >> to accept a patch that just side-swipes that case without any >> justification. > [ all callers have previously used predicate_classify ] OK, fair enough. The checks for nulls are probably from ancient habit, but I agree we could remove 'em here. >> Perhaps, rather than hoping people will notice comments that are >> potentially offscreen from what they're modifying, we should relocate >> those comment paras to be adjacent to the relevant parts of the >> function? > Splitting up that block comment makes sense to me. Done, let's make it so. >> I've not gone through the patch in detail to see whether I believe >> the proposed proof rules. It would help to have more comments >> justifying them. > Most of them are sufficiently simple -- e.g., X IS TRUE implies X -- > that I don't think there's a lot to say in justification. In some > cases I've noted the cases that force only strong or weak implication. Yeah, it's the strong-vs-weak distinction that makes me cautious here. One's high-school-algebra instinct for what's obviously true tends to not think about NULL/UNKNOWN, and you do have to consider that. >>> As noted in a TODO in the patch itself, I think it may be worth refactoring >>> the test_predtest module to run the "x, y" case as well as the "y, x" case >> I think that's actively undesirable. It is not typically the case that >> a proof rule for A => B also works in the other direction, so this would >> encourage wasting cycles in the tests. I fear it might also cause >> confusion about which direction a proof rule is supposed to work in. > That makes sense in the general case. > Boolean expressions seem like a special case in that regard: (subject > to what it looks like) would you be OK with a wrapping function that > does both directions (with output that shows which direction is being > tested) used only for the cases where we do want to check both > directions? Using a wrapper where appropriate would remove the inefficiency concern, but I still worry whether it will promote confusion about which direction we're proving things in. You'll need to be very clear about the labeling. regards, tom lane
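To make the suggested refactoring concrete, here is a sketch of the dispatch shape (simplified; the equal() comparison stands in for the real recursion, so this is illustrative only):

	/* Sketch: dispatch once on the node tag instead of a chain of IsA() tests */
	switch (nodeTag(clause))
	{
		case T_BooleanTest:
			{
				BooleanTest *btest = (BooleanTest *) clause;

				/* "x IS TRUE" implies "x", both strongly and weakly */
				if (btest->booltesttype == IS_TRUE &&
					equal(btest->arg, predicate))
					return true;
				break;
			}

		case T_NullTest:
			/* existing IS [NOT] NULL proof rules go here */
			break;

		default:
			/* no special knowledge about other node types */
			break;
	}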
pg_serial bloat
Hi,

Our pg_serial truncation logic is a bit broken, as described by the comments in CheckPointPredicate() (a sort of race between xid cycles and checkpointing). We've seen a system with ~30GB of files in there (note: the full/untruncated size would be 2³² xids × sizeof(uint64_t) = 32GB). It's not just a gradual disk space leak: according to disk space monitoring, this system suddenly wrote ~half of that data, which I think must be the while loop in SerialAdd() zeroing out pages. Ouch.

I see a few questions:

1. How should we fix this fundamentally in future releases? One answer is to key SSI's xid lookup with FullTransactionId (conceptually cleaner IMHO but I'm not sure how far fxids need to 'spread' through the system to do it right). Another already mentioned in comments is to move some logic into vacuum so it can stay in sync with the xid cycle (maybe harder to think about and prove correct).
2. Could there be worse consequences than wasted disk and I/O?
3. Once a system reaches a bloated state like this, what can an administrator do?

I looked into question 3. I convinced myself that it must be safe to unlink all the files under pg_serial while the cluster is down, because:

* we don't need the data across restarts, it's just for spilling
* we don't need the 'head' file because slru.c opens with O_CREAT
* open(O_CREAT) followed by pwrite(..., offset) will create a harmless hole
* we never read files outside the tailXid/headXid range we've written
* we zero out pages as we add them in SerialAdd(), without reading

If I have that right, perhaps we should not merely advise that it is safe to do that manually, but proactively do it in SerialInit(). That is where we establish in shared memory that we don't expect there to be any files on disk, so it must be a good spot to make that true if it is not:

	if (!found)
	{
		/*
		 * Set control information to reflect empty SLRU.
		 */
		serialControl->headPage = -1;
		serialControl->headXid = InvalidTransactionId;
		serialControl->tailXid = InvalidTransactionId;
+
+		/* Also delete any files on disk. */
+		SlruScanDirectory(SerialSlruCtl, SlruScanDirCbDeleteAll, NULL);
	}

In common cases that would just readdir() an empty directory.

For testing, it is quite hard to convince predicate.c to write any files there: normally you have to overflow its transaction tracking, which requires more than (max backends + max prepared xacts) × 10 SERIALIZABLE transactions in just the right sort of overlapping pattern, so that the committed ones need to be spilled to disk. I might try to write a test for that, but it gets easier if you define TEST_SUMMARIZE_SERIAL. Then you don't need many transactions -- but you still need a slightly finicky schedule. Start with a couple of overlapping SSI transactions, then commit them, to get a non-empty FinishedSerializableTransaction list. Then create some more SSI transactions, which will call SerialAdd() due to the TEST_ macro. Then run a checkpoint, and you should see eg "" being created on demand during SLRU writeback, demonstrating that starting from an empty pg_serial directory is always OK. I wanted to try that to remind myself of how it all works, but I suppose it should be obvious that it's OK: initdb's initial state is an empty directory.

To create a bunch of junk files that are really just thin links for the above change to unlink, or to test the truncate code when it sees a 'full' directory, you can do:

	cd pg_serial
	dd if=/dev/zero of= bs=256k count=1
	awk 'BEGIN { for (i = 1; i <= 131071; i++) { printf("%04X\n", i); } }' | xargs -r -I {} ln {}
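Until something along those lines lands, the manual recovery implied by the analysis above is just this (assuming the reasoning above holds, and only with the cluster cleanly shut down):

	pg_ctl -D $PGDATA stop
	rm $PGDATA/pg_serial/*
	pg_ctl -D $PGDATA start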
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Hi, On Tue, Dec 12, 2023 at 10:04 PM jian he wrote: > > On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina > wrote: > > > > Hi! Thank you for your work. Your patch looks better! > > Yes, thank you! It works fine, and I see that the regression tests have > > been passed. > > However, when I ran 'copy from with save_error' operation with simple csv > > files (copy_test.csv, copy_test1.csv) for tables test, test1 (how I created > > it, I described below): > > > > postgres=# create table test (x int primary key, y int not null); > > postgres=# create table test1 (x int, z int, CONSTRAINT fk_x > > FOREIGN KEY(x) > > REFERENCES test(x)); > > > > I did not find a table with saved errors after operation, although I > > received a log about it: > > > > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV > > save_error > > NOTICE: 2 rows were skipped because of error. skipped row saved to table > > public.test_error > > ERROR: duplicate key value violates unique constraint "test_pkey" > > DETAIL: Key (x)=(2) already exists. > > CONTEXT: COPY test, line 3 > > > > postgres=# select * from public.test_error; > > ERROR: relation "public.test_error" does not exist > > LINE 1: select * from public.test_error; > > > > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV > > save_error > > NOTICE: 2 rows were skipped because of error. skipped row saved to table > > public.test1_error > > ERROR: insert or update on table "test1" violates foreign key constraint > > "fk_x" > > DETAIL: Key (x)=(2) is not present in table "test". > > > > postgres=# select * from public.test1_error; > > ERROR: relation "public.test1_error" does not exist > > LINE 1: select * from public.test1_error; > > > > Two lines were written correctly in the csv files, therefore they should > > have been added to the tables, but they were not added to the tables test > > and test1. > > > > If I leave only the correct rows, everything works fine and the rows are > > added to the tables. > > > > in copy_test.csv: > > > > 2,0 > > > > 1,1 > > > > in copy_test1.csv: > > > > 2,0 > > > > 2,1 > > > > 1,1 > > > > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV > > COPY 2 > > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV > > save_error > > NOTICE: No error happened.Error holding table public.test1_error will be > > droped > > COPY 3 > > > > Maybe I'm launching it the wrong way. If so, let me know about it. > > looks like the above is about constraints violation while copying. > constraints violation while copying not in the scope of this patch. > > Since COPY FROM is very like the INSERT command, > you do want all the valid constraints to check all the copied rows? > > but the notice raised by the patch is not right. > So I place the drop error saving table or raise notice logic above > `ExecResetTupleTable(estate->es_tupleTable, false)` in the function > CopyFrom. > > > > > I also notice interesting behavior if the table was previously created by > > the user. When I was creating an error_table before the 'copy from' > > operation, > > I received a message saying that it is impossible to create a table with > > the same name (it is shown below) during the 'copy from' operation. > > I think you should add information about this in the documentation, since > > this seems to be normal behavior to me. > > > > doc changed. you may check it. I've read this thread and the latest patch. 
IIUC with the SAVE_ERROR option, COPY FROM creates an error table for the target table and writes error information there. While I agree that the final shape of this feature would be something like that design, I'm concerned that some features are missing to make it useful in practice.

For instance, error logs are inserted into the error tables without bound, meaning that users who want to tolerate errors during COPY FROM will have to truncate or drop the error tables periodically, or the database will grow with error logs without limit. Ideally such maintenance work should be done by the database itself. There might also be users who want to log such conversion errors to the server log to avoid that maintenance work. I think we should provide an option for where to write, at least.

Also, since the error tables are normal user tables internally, error logs are also replicated to subscribers if there is a publication FOR ALL TABLES, unlike system catalogs. I think some users would not like such behavior.

Looking at the SAVE_ERROR feature closely, I think it consists of two separate features. That is, it enables COPY FROM to load data while (1) tolerating errors and (2) logging errors to somewhere (i.e., an error table). If we implemented only (1), it would be like COPY FROM tolerating errors infinitely and logging them to /dev/null. The user could not see the error details, but I guess it could still help some cases as Andres mentioned[1] (it might be a good idea to
Re: Make COPY format extendable: Extract COPY TO format implementations
On Thu, Dec 14, 2023 at 6:44 PM Sutou Kouhei wrote: > > Hi, > > In > "Re: Make COPY format extendable: Extract COPY TO format implementations" > on Mon, 11 Dec 2023 10:57:15 +0900, > Masahiko Sawada wrote: > > > IIUC we cannot create two same name functions with the same arguments > > but a different return value type in the first place. It seems to me > > to be an overkill to change such a design. > > Oh, sorry. I didn't notice it. > > > Another idea is to encapsulate copy_to/from_handler by a super class > > like copy_handler. The handler function is called with an argument, > > say copyto, and returns copy_handler encapsulating either > > copy_to/from_handler depending on the argument. > > It's for using "${copy_format_name}" such as "json" and > "parquet" as a function name, right? Right. > If we use the > "${copy_format_name}" approach, we can't use function names > that are already used by tablesample method handler such as > "system" and "bernoulli" for COPY FORMAT name. Because both > of tablesample method handler function and COPY FORMAT > handler function use "(internal)" as arguments. > > I think that tablesample method names and COPY FORMAT names > will not be conflicted but the limitation (using the same > namespace for tablesample method and COPY FORMAT) is > unnecessary limitation. Presumably, such function name collisions are not limited to tablesample and copy, but apply to all functions that have an "internal" argument. To avoid collisions, extensions can be created in a different schema than public. And note that built-in format copy handler doesn't need to declare its handler function. > > How about using prefix ("copy_to_${copy_format_name}" or > something) or suffix ("${copy_format_name}_copy_to" or > something) for function names? For example, > "copy_to_json"/"copy_from_json" for "json" COPY FORMAT. > > ("copy_${copy_format_name}" that returns copy_handler > encapsulating either copy_to/from_handler depending on the > argument may be an option.) While there is a way to avoid collision as I mentioned above, I can see the point that we might want to avoid using a generic function name such as "arrow" and "parquet" as custom copy handler functions. Adding a prefix or suffix would be one option but to give extensions more flexibility, another option would be to support format = 'custom' and add the "handler" option to specify a copy handler function name to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom', HANDLER = 'arrow_copy_handler'). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
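To illustrate the HANDLER idea, a rough sketch of what an extension might provide -- every name here (the copy_handler return type, the routine structs, the function itself) is an assumption for illustration, since this interface is exactly what is being designed in this thread:

	CREATE FUNCTION arrow_copy_handler(internal) RETURNS copy_handler
	    AS 'MODULE_PATHNAME' LANGUAGE C;

	/* C side: return the TO or FROM routine depending on the direction */
	PG_FUNCTION_INFO_V1(arrow_copy_handler);

	Datum
	arrow_copy_handler(PG_FUNCTION_ARGS)
	{
		bool		is_from = PG_GETARG_BOOL(0);

		if (is_from)
			PG_RETURN_POINTER(&arrow_copy_from_routine);
		PG_RETURN_POINTER(&arrow_copy_to_routine);
	}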
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023, at 12:44 PM, Ashutosh Bapat wrote: > I haven't found the code path from where the sequence cleanup gets > called. But it's being called. Am I missing something? ReorderBufferCleanupTXN. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Add --check option to pgindent
On Wed Dec 13, 2023 at 2:46 PM CST, Andrew Dunstan wrote: On 2023-12-12 Tu 10:30, Alvaro Herrera wrote: > On 2023-Dec-12, Tom Lane wrote: > >> "Euler Taveira" writes: >>> When you add exceptions, it starts to complicate the UI. >> Indeed. It seems like --silent-diff was poorly defined and poorly >> named, and we need to rethink that option along the way to adding >> this behavior. The idea that --show-diff and --silent-diff can >> be used together is just inherently confusing, because they sound >> like opposites > Maybe it's enough to rename --silent-diff to --check. You can do > "--show-diff --check" and get both the error and the diff printed; or > just "--check" and it'll throw an error without further ado; or > "--show-diff" and it will both apply the diff and print it. > That seems reasonable. These features were fairly substantially debated when we put them in, but I'm fine with some tweaking. But note: --show-diff doesn't apply the diff, it's intentionally non-destructive. Here is a new patch: - Renames --silent-diff to --check - Renames --show-diff to --diff - Allows one to use --check and --diff in the same command I am not tied to the second patch if people like --show-diff better than --diff. Weirdly enough, my email client doesn't show this as part of the original thread, but I will respond here anyway. -- Tristan Partin Neon (https://neon.tech) From f0963daa95b222b00a7ae15d6702141352d3a81d Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Mon, 11 Dec 2023 17:34:17 -0600 Subject: [PATCH v2 1/3] Rename --silent-diff to --check in pgindent --check should be a little bit more self-explanatory for people coming from other similar formatter/linting tooling. --- src/tools/pgindent/pgindent | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent index bce63d95da..fe0bd21f4a 100755 --- a/src/tools/pgindent/pgindent +++ b/src/tools/pgindent/pgindent @@ -23,7 +23,7 @@ my $devnull = File::Spec->devnull; my ($typedefs_file, $typedef_str, @excludes, $indent, $build, $show_diff, - $silent_diff, $help, @commits,); + $check, $help, @commits,); $help = 0; @@ -35,13 +35,13 @@ my %options = ( "excludes=s" => \@excludes, "indent=s" => \$indent, "show-diff" => \$show_diff, - "silent-diff" => \$silent_diff,); + "check" => \$check,); GetOptions(%options) || usage("bad command line argument"); usage() if $help; -usage("Cannot have both --silent-diff and --show-diff") - if $silent_diff && $show_diff; +usage("Cannot have both --check and --show-diff") + if $check && $show_diff; usage("Cannot use --commit with command line file list") if (@commits && @ARGV); @@ -324,7 +324,7 @@ Options: --excludes=PATH file containing list of filename patterns to ignore --indent=PATH path to pg_bsd_indent program --show-diff show the changes that would be made - --silent-diff exit with status 2 if any changes would be made + --check exit with status 2 if any changes would be made The --excludes and --commit options can be given more than once. 
EOF if ($help) @@ -417,7 +417,12 @@ foreach my $source_filename (@files) if ($source ne $orig_source) { - if ($silent_diff) + if ($check) + { + print show_diff($source, $source_filename); + exit 2; + } + elsif ($check) { exit 2; } -- Tristan Partin Neon (https://neon.tech) From ddd5cdaf8e967ddf84beb49ca64b5bed039b71ac Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Thu, 14 Dec 2023 10:36:05 -0600 Subject: [PATCH v2 2/3] Rename --show-diff to --diff in pgindent --diff should be a little bit more self-explanatory for people coming from other similar formatter/linting tooling. --- src/tools/pgindent/pgindent | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent index fe0bd21f4a..067b77be54 100755 --- a/src/tools/pgindent/pgindent +++ b/src/tools/pgindent/pgindent @@ -22,7 +22,7 @@ my $indent_opts = my $devnull = File::Spec->devnull; my ($typedefs_file, $typedef_str, @excludes, - $indent, $build, $show_diff, + $indent, $build, $diff, $check, $help, @commits,); $help = 0; @@ -34,14 +34,14 @@ my %options = ( "list-of-typedefs=s" => \$typedef_str, "excludes=s" => \@excludes, "indent=s" => \$indent, - "show-diff" => \$show_diff, + "diff" => \$diff, "check" => \$check,); GetOptions(%options) || usage("bad command line argument"); usage() if $help; -usage("Cannot have both --check and --show-diff") - if $check && $show_diff; +usage("Cannot have both --check and --diff") + if $check && $diff; usage("Cannot use --commit with command line file list") if (@commits && @ARGV); @@ -294,7 +294,7 @@ sub run_indent return $source; } -sub show_diff +sub diff { my $indented = shift; my $source_filename = shift; @@ -323,7 +323,7 @@ Options:
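For reference, with these patches applied the intended usage is:

	pgindent --check src/backend/foo.c          # exit status 2 if reindentation is needed
	pgindent --check --diff src/backend/foo.c   # same, but also print the diff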
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Fri, 1 Dec 2023 at 05:20, Nathan Bossart wrote: > Could we simplify it with something like the following? Great suggestion! Updated the patchset accordingly. This made it also easy to change the final patch to include the automatic scoped declaration logic for all of the new macros. I quite like how the calling code changes to not have to declare the variable. But it's definitely a larger divergence from the status quo than without patch 0003. So I'm not sure if it's desired. Finally, I also renamed the functions to use foreach instead of for_each, since based on this thread that seems to be the generally preferred naming. v6-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch Description: Binary data v6-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data v6-0003-Even-more-convenient-macros.patch Description: Binary data
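For anyone skimming the thread, the idea boils down to macros of roughly this shape -- a simplified sketch, not the actual definitions in the patches (which handle scoping and evaluation more carefully):

	/* sketch: iterate over an integer List without declaring a ListCell */
	#define foreach_int(var, lst) \
		for (int var##_i = 0, var = 0; \
			 var##_i < list_length(lst) && \
			 ((var) = list_nth_int((lst), var##_i), true); \
			 var##_i++)

	/* usage: the loop variable needs no separate declaration */
	foreach_int(x, int_list)
		total += x;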
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila wrote: > > On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat > wrote: > > > > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote: > > > > > > I think you forgot to attach the patch. > > > > Sorry. Here it is. > > > > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote: > > > > > > > > > > It looks like the solution works. But this is the only place where we > > > process a change before SNAPSHOT reaches FULL. But this is also the > > > only record which affects a decision to queue/not a following change. > > > So it should be ok. The sequence_hash'es as separate for each > > > transaction and they are cleaned when processing COMMIT record. > > > > > > > > > > But it is possible that even commit or abort also happens before the > > > snapshot reaches full state in which case the hash table will have > > > stale or invalid (for aborts) entries. That will probably be cleaned > > > at a later point by running_xact records. > > > > Why would cleaning wait till running_xact records? Won't txn entry > > itself be removed when processing commit/abort record? At the same the > > sequence hash will be cleaned as well. > > > > > Now, I think in theory, it > > > is possible that the same RelFileLocator can again be allocated before > > > we clean up the existing entry which can probably confuse the system. > > > > How? The transaction allocating the first time would be cleaned before > > it happens the second time. So shouldn't matter. > > > > It can only be cleaned if we process it but xact_decode won't allow us > to process it and I don't think it would be a good idea to add another > hack for sequences here. See below code: > > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > { > SnapBuild *builder = ctx->snapshot_builder; > ReorderBuffer *reorder = ctx->reorder; > XLogReaderState *r = buf->record; > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; > > /* > * If the snapshot isn't yet fully built, we cannot decode anything, so > * bail out. > */ > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) > return; That may be true for a transaction which is decoded, but I think all the transactions which are added to ReorderBuffer should be cleaned up once they have been processed irrespective of whether they are decoded/sent downstream or not. In this case I see the sequence hash being cleaned up for the sequence related transaction in Hayato's reproducer. See attached patch with a diagnostic change and the output below (notice sequence cleanup called on transaction 767). 2023-12-14 21:06:36.756 IST [386957] LOG: logical decoding found initial starting point at 0/15B2F68 2023-12-14 21:06:36.756 IST [386957] DETAIL: Waiting for transactions (approximately 1) older than 767 to end. 2023-12-14 21:06:36.756 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: seq_decode. 
snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: skipped 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.710 IST [386957] LOG: logical decoding found initial consistent point at 0/15B3388 2023-12-14 21:07:05.710 IST [386957] DETAIL: Waiting for transactions (approximately 1) older than 768 to end. 2023-12-14 21:07:05.710 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:39.292 IST [386298] LOG: checkpoint starting: time 2023-12-14 21:07:40.919 IST [386957] LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: XXX: the sequence is transactional 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: sequence cleanup called on transaction 767 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: logical decoding found consistent point at 0/15B3518 2023-12-14 21:07:40.919 IST [386957] DETAIL: There are no running transactions. 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); We see similar output when pg_logical_slot_get_changes() is called. I haven't found the code path from where the sequence cleanup gets called. But it's being called. Am I
Re: Clean up find_typedefs and add support for Mac
On Thu Dec 14, 2023 at 9:16 AM CST, Andrew Dunstan wrote: On 2023-12-13 We 15:59, Tristan Partin wrote: > On Wed Dec 13, 2023 at 2:35 PM CST, Andrew Dunstan wrote: >> >> On 2023-12-12 Tu 18:02, Tom Lane wrote: >> > "Tristan Partin" writes: >> >> The big patch here is adding support for Mac. objdump -W doesn't >> work on >> >> Mac. So, I used dsymutil and dwarfdump to achieve the same result. >> > We should probably nuke the current version of src/tools/find_typedef >> > altogether in favor of copying the current buildfarm code for that. >> > We know that the buildfarm's version works, whereas I'm not real sure >> > that src/tools/find_typedef is being used in anger by anyone. Also, >> > we're long past the point where developers can avoid having Perl >> > installed. >> > >> > Ideally, the buildfarm client would then start to use find_typedef >> > from the tree rather than have its own copy, both to reduce >> > duplication and to ensure that the in-tree copy keeps working. >> > >> > >> >> >> +1. I'd be more than happy to be rid of maintaining the code. We >> already performed a rather more complicated operation along these >> lines with the PostgreSQL::Test::AdjustUpgrade stuff. > > I'll work with you on GitHub to help make the transition. I've already > made a draft PR in the client-code repo, but I am sure I am missing > some stuff. I think we're putting the cart before the horse here. I think we need a perl module in core that can be loaded by the buildfarm code, and could also be used by a standalone find_typedef (c.f. src/test/PostgreSQL/Test/AdjustUpgrade.pm). To be useful that would have to be backported. I'll have a go at that. Here is an adaptation of the find_typedefs from run_build.pl. Maybe it will help you. -- Tristan Partin Neon (https://neon.tech) find_typedefs.pl Description: Perl program
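The Mac-specific part of that script reduces to roughly this pipeline (a sketch; the exact dwarfdump output format differs between Apple's and LLVM's tools, which is what the Perl adaptation has to cope with):

	# macOS has no objdump -W: extract DWARF with dsymutil, then scan it
	dsymutil -f -o postgres.dwarf src/backend/postgres
	dwarfdump --debug-info postgres.dwarf |
	    grep -A2 TAG_typedef |
	    perl -ne 'print "$1\n" if /AT_name\W*"(\w+)"/' |
	    sort -u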
Re: Clean up find_typedefs and add support for Mac
On 2023-12-13 We 15:59, Tristan Partin wrote: On Wed Dec 13, 2023 at 2:35 PM CST, Andrew Dunstan wrote: On 2023-12-12 Tu 18:02, Tom Lane wrote: > "Tristan Partin" writes: >> The big patch here is adding support for Mac. objdump -W doesn't work on >> Mac. So, I used dsymutil and dwarfdump to achieve the same result. > We should probably nuke the current version of src/tools/find_typedef > altogether in favor of copying the current buildfarm code for that. > We know that the buildfarm's version works, whereas I'm not real sure > that src/tools/find_typedef is being used in anger by anyone. Also, > we're long past the point where developers can avoid having Perl > installed. > > Ideally, the buildfarm client would then start to use find_typedef > from the tree rather than have its own copy, both to reduce > duplication and to ensure that the in-tree copy keeps working. > > +1. I'd be more than happy to be rid of maintaining the code. We already performed a rather more complicated operation along these lines with the PostgreSQL::Test::AdjustUpgrade stuff. I'll work with you on GitHub to help make the transition. I've already made a draft PR in the client-code repo, but I am sure I am missing some stuff. I think we're putting the cart before the horse here. I think we need a perl module in core that can be loaded by the buildfarm code, and could also be used by a standalone find_typedef (c.f. src/test/PostgreSQL/Test/AdjustUpgrade.pm). To be useful that would have to be backported. I'll have a go at that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: encoding affects ICU regex character classification
On Tue, 2023-12-12 at 14:35 -0800, Jeremy Schneider wrote: > Is someone able to test out upper & lower functions on U+A7BA ... > U+A7BF > across a few libs/versions? Those code points are unassigned in Unicode 11.0 and assigned in Unicode 12.0. In ICU 63-2 (based on Unicode 11.0), they just get mapped to themselves. In ICU 64-2 (based on Unicode 12.1) they get mapped the same way the builtin CTYPE maps them (based on Unicode 15.1). The concern over unassigned code points is misplaced. The application may be aware of newly-assigned code points, and there's no way they will be mapped correctly in Postgres if the provider is not aware of those code points. The user can either proceed in using unassigned code points and accept the risk of future changes, or wait for the provider to be upgraded. If the user doesn't have many expression indexes dependent on ctype behavior, it doesn't matter much. If they do have such indexes, the best we can offer is a controlled process, and the builtin provider allows the most visibility and control. (Aside: case mapping has very strong compatibility guarantees, but not perfect. For better compatibility guarantees, we should support case folding.) > And I have no idea if or when > glibc might have picked up the new unicode characters. That's a strong argument in favor of a builtin provider. Regards, Jeff Davis
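For anyone who wants to run the comparison, something like this in a UTF-8 database (on builds with different ICU versions, and with the builtin provider once available) shows the mappings directly:

	SELECT to_hex(c) AS codepoint,
	       upper(chr(c) COLLATE "en-x-icu") AS upper,
	       lower(chr(c) COLLATE "en-x-icu") AS lower
	FROM generate_series(42938, 42943) AS s(c);  -- U+A7BA .. U+A7BF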
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Dec 14, 2023 at 4:36 PM Dilip Kumar wrote: > > On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin > wrote: > > > > On 12 Dec 2023, at 18:28, Alvaro Herrera wrote: > > > > > > Andrey, do you have any stress tests or anything else that you used to > > > gain confidence in this code? > > I have done some more testing for the clog group update as the attached test file executes two concurrent scripts executed with pgbench, the first script is the slow script which will run 10-second long transactions and the second script is a very fast transaction with ~1 transactions per second. Along with that, I have also changed the bank size such that each bank will contain just 1 page i.e. 32k transactions per bank. I have done this way so that we do not need to keep long-running transactions running for very long in order to get the transactions from different banks committed during the same time. With this test, I have got that behavior and the below logs shows that multiple transaction range which is in different slru-bank (considering 32k transactions per bank) are doing group update at the same time. e.g. in the below logs, we can see xid range around 70600, 70548, and 70558, and xid range around 755, and 752 are getting group updates by different leaders but near the same time. It is running fine when running for a long duration, but I am not sure how to validate the sanity of this kind of test. 2023-12-14 14:43:31.813 GMT [3306] LOG: group leader procno 606 updated status of procno 606 xid 70600 2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 for xid 70548 added for group update 2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 is group leader and got the lock 2023-12-14 14:43:31.816 GMT [3326] LOG: group leader procno 586 updated status of procno 586 xid 70548 2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 for xid 70558 added for group update 2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 is group leader and got the lock 2023-12-14 14:43:31.818 GMT [3327] LOG: group leader procno 585 updated status of procno 585 xid 70558 2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 for xid 752 added for group update 2023-12-14 14:43:31.829 GMT [3207] LOG: procno 669 for xid 755 added for group update 2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 is group leader and got the lock 2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 669 xid 755 2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 687 xid 752 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com # Goal of this script to generate scenario where some old long running slow # transaction get committed with the new transactions such that they falls in # different slru banks rm -rf pgdata ./initdb -D pgdata echo "max_wal_size=20GB" >> pgdata/postgresql.conf echo "shared_buffers=20GB" >> pgdata/postgresql.conf echo "checkpoint_timeout=40min" >> pgdata/postgresql.conf echo "max_connections=700" >> pgdata/postgresql.conf echo "maintenance_work_mem=1GB" >> pgdata/postgresql.conf echo "subtrans_buffers=64" >> pgdata/postgresql.conf echo "multixact_members_buffers=128" >> pgdata/postgresql.conf #create slow_txn.sql script cat > slow_txn.sql << EOF BEGIN; INSERT INTO test VALUES(1); DELETE FROM test WHERE a=1; select pg_sleep(10); COMMIT; EOF #create fast_txn.sql script cat > fast_txn.sql << EOF BEGIN; INSERT INTO test1 VALUES(1); DELETE FROM test1 WHERE a=1; COMMIT; EOF ./pg_ctl -D pgdata -l logfile -c start ./psql -d postgres -c "create table 
test(a int)" ./psql -d postgres -c "create table test1(a int)" ./pgbench -i -s 1 postgres ./pgbench -f slow_txn.sql -c 28 -j 28 -P 1 -T 60 postgres & ./pgbench -f fast_txn.sql -c 100 -j 100 -P 1 -T 60 postgres sleep(10); ./pg_ctl -D pgdata -l logfile -c stop
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On 12.12.2023 16:04, jian he wrote: On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina wrote: Hi! Thank you for your work. Your patch looks better! Yes, thank you! It works fine, and I see that the regression tests have been passed. However, when I ran 'copy from with save_error' operation with simple csv files (copy_test.csv, copy_test1.csv) for tables test, test1 (how I created it, I described below): postgres=# create table test (x int primary key, y int not null); postgres=# create table test1 (x int, z int, CONSTRAINT fk_x FOREIGN KEY(x) REFERENCES test(x)); I did not find a table with saved errors after operation, although I received a log about it: postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error NOTICE: 2 rows were skipped because of error. skipped row saved to table public.test_error ERROR: duplicate key value violates unique constraint "test_pkey" DETAIL: Key (x)=(2) already exists. CONTEXT: COPY test, line 3 postgres=# select * from public.test_error; ERROR: relation "public.test_error" does not exist LINE 1: select * from public.test_error; postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error NOTICE: 2 rows were skipped because of error. skipped row saved to table public.test1_error ERROR: insert or update on table "test1" violates foreign key constraint "fk_x" DETAIL: Key (x)=(2) is not present in table "test". postgres=# select * from public.test1_error; ERROR: relation "public.test1_error" does not exist LINE 1: select * from public.test1_error; Two lines were written correctly in the csv files, therefore they should have been added to the tables, but they were not added to the tables test and test1. If I leave only the correct rows, everything works fine and the rows are added to the tables. in copy_test.csv: 2,0 1,1 in copy_test1.csv: 2,0 2,1 1,1 postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV COPY 2 postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error NOTICE: No error happened.Error holding table public.test1_error will be droped COPY 3 Maybe I'm launching it the wrong way. If so, let me know about it. looks like the above is about constraints violation while copying. constraints violation while copying not in the scope of this patch. Since COPY FROM is very like the INSERT command, you do want all the valid constraints to check all the copied rows? No, I think it will be too much. but the notice raised by the patch is not right. So I place the drop error saving table or raise notice logic above `ExecResetTupleTable(estate->es_tupleTable, false)` in the function CopyFrom. Yes, I see it and agree with you. I also notice interesting behavior if the table was previously created by the user. When I was creating an error_table before the 'copy from' operation, I received a message saying that it is impossible to create a table with the same name (it is shown below) during the 'copy from' operation. I think you should add information about this in the documentation, since this seems to be normal behavior to me. doc changed. you may check it. Yes, I saw it. Thank you. -- Regards, Alena Rybakina Postgres Professional:http://www.postgrespro.com The Russian Postgres Company
Re: Built-in CTYPE provider
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote: > In particular "el" (modern greek) has case mapping rules that > ICU seems to implement, but "el" is missing from the list > ("lt", "tr", and "az") you identified. I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the ctype semantics match C.UTF-8 for all code points. glibc is not doing this additional tailoring for "el". Therefore I believe the builtin CTYPE would be very useful for case mapping (both "simple" and "full") even without this additional tailoring. You are correct that ICU will still have some features that won't be supported by the builtin provider. Better word boundary semantics in INITCAP() are another advantage. Regards, Jeff Davis
Re: Simplify newNode()
On 14/12/2023 10:32, Peter Eisentraut wrote: I notice that the existing comments point out that the size argument should be a compile-time constant, but that is no longer the case for ExtensibleNode(). Also, newNode() is the only caller of palloc0fast(), which also points out that the size argument should be a compile-time constant, and palloc0fast() is the only caller of MemSetTest(). I can see how an older compiler might have gotten too confused by all that. But if we think that compilers are now smart enough, maybe we can unwind this whole stack a bit more? Maybe we don't need MemSetTest() and/or palloc0fast() and/or newNode() at all? Good point. Looking closer, modern compilers will actually turn the MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. Here's a link to a godbolt snippet to play with this: https://godbolt.org/z/9b89P3c8x (full link at [0]). Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. It's not doing any good as it is, as it gets compiled to be identical to MemoryContextAllocZero. (There are small differences depending compiler and version, but e.g. on gcc 13.2, the code generated for MemoryContextAllocZero() is actually smaller even though both call memset()) Another approach would be to add more hints to MemoryContextAllocZeroAligned to dissuade the compiler from turning the loop into a memset() call. If you add an "if (size > 1024) abort" there, then gcc 13 doesn't optimize into a memset() call, but clang still does. Some micro-benchmarks on that would be nice. But given that the compiler has been doing this optimization for a while and we haven't noticed, I think memset() should be considered the status quo, and converting it to a loop again should be considered a new optimization. Also, replacing MemoryContextAllocZeroAligned(CurrentMemoryContext, size) with palloc0(size) has one fewer argument and the assembly code of the call has one fewer instruction. That's something too. [0]
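Concretely, the unwound newNode() could be as simple as this (a sketch of the direction being discussed, not necessarily the final committed form):

	static inline Node *
	newNode(size_t size, NodeTag tag)
	{
		Node	   *result;

		Assert(size >= sizeof(Node));	/* need the tag, at least */
		result = (Node *) palloc0(size);
		result->type = tag;

		return result;
	}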
Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs
Hi, Sorry for the late reply. On Fri, 6 Oct 2023 at 17:07, Tom Lane wrote: > > As a quick cross-check, I searched our commit log to see how many > README-only commits there were so far this year. I found 11 since > January. (Several were triggered by the latest round of pgindent > code and process changes, so maybe this is more than typical.) > > Not sure what that tells us about the value of changing the CI > logic, but it does seem like it could be worth the one-liner > change needed to teach buildfarm animals to ignore READMEs. I agree that it could be worth implementing this logic on buildfarm animals. In case we want to implement the same logic on the CI, I added a new version of the patch; it skips CI completely if the changes are only in the README files. -- Regards, Nazir Bilal Yavuz Microsoft From 6c268233d13262a31965baf2dbb42913d83dab1d Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 14 Dec 2023 16:23:28 +0300 Subject: [PATCH v3] If the changes are only in the README files, skip all the tasks. If the changes are only in the README files, skip all the tasks to save CI time. --- .cirrus.tasks.yml | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml index e137769850..dfd12f8b04 100644 --- a/.cirrus.tasks.yml +++ b/.cirrus.tasks.yml @@ -4,6 +4,12 @@ # further details, see src/tools/ci/README +# If the changes are only in the README files, skip all the tasks. +# +# This skip is overriden in 'SanityCheck' task. So, the same check is +# added 'SanityCheck' task too. +skip: changesIncludeOnly('**/README') + env: # The lower depth accelerates git clone. Use a bit of depth so that # concurrent tasks and retrying older jobs have a chance of working. @@ -55,11 +61,12 @@ on_failure_meson: _failure_meson task: name: SanityCheck - # If a specific OS is requested, don't run the sanity check. This shortens - # push-wait-for-ci cycle time a bit when debugging operating system specific - # failures. Uses skip instead of only_if, as cirrus otherwise warns about - # only_if conditions not matching. - skip: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:.*' + # Don't run the sanity check if the changes are only in the README files or + # if a specific OS is requested. The latter shortens push-wait-for-ci cycle + # time a bit when debugging operating system specific failures. Uses skip + # instead of only_if, as cirrus otherwise warns about only_if conditions not + # matching. + skip: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:.*' || changesIncludeOnly('**/README') env: CPUS: 4 -- 2.43.0
Re: Improve upcasting for INT range and multi range types
Hi,

Thanks for the reply. I suspected that there were technical reasons preventing the obvious right thing from being done. Would adding overloads to the functions and operators be something that could be considered an acceptable solution?

I've tried a very naive solution and it seems to work (there are surely better options for declaring the function overloads):

	begin;

	create function elem_contained_by_range(int4, int8range) returns boolean as $$
	    select elem_contained_by_range($1::int8, $2)
	$$ LANGUAGE SQL;

	create function elem_contained_by_range(int8, int4range) returns boolean as $$
	    select elem_contained_by_range($1, $2::text::int8range)
	$$ LANGUAGE SQL;

	create operator <@ (
	    LEFTARG = int4,
	    RIGHTARG = int8range,
	    FUNCTION = elem_contained_by_range,
	    RESTRICT = rangesel,
	    JOIN = contjoinsel,
	    HASHES, MERGES
	);

	create operator <@ (
	    LEFTARG = int8,
	    RIGHTARG = int4range,
	    FUNCTION = elem_contained_by_range,
	    RESTRICT = rangesel,
	    JOIN = contjoinsel,
	    HASHES, MERGES
	);

	select 2::int4 <@ '[1,9)'::int8range;
	select 2::int8 <@ '[1,9)'::int4range;

	rollback;

The major drawback is that every operator-type combination would need its own overload, creating a large number of them. As a side note, it seems that int4range cannot be cast automatically to int8range.

Best regards,
Federico

On Wed, 13 Dec 2023 at 05:16, Tom Lane wrote:
>
> jian he writes:
> > Based on my interpretation, I don't think SELECT 2::INT4 <@ '[1,
> > 4)'::INT8RANGE is doable.
>
> Yeah, it would require a considerable expansion of the scope of
> what can be matched by a polymorphic operator. I'm afraid that
> the negative consequences (mainly, "ambiguous operator" failures
> because more than one thing can be matched) would outweigh the
> benefits. It is kind of annoying though that the system can't
> do the "obvious" right thing here.
>
> regards, tom lane
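On the casting side note: the gap can be papered over locally by routing through the text representation, the same trick the second overload above uses internally. A sketch follows; whether making it IMPLICIT is safe against ambiguous-operator surprises is a separate question:

	CREATE CAST (int4range AS int8range) WITH INOUT AS IMPLICIT;

	SELECT '[1,9)'::int4range::int8range;  -- now succeeds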
Re: Relation bulk write facility
Melanie just reminded me about an older thread about this same thing: https://www.postgresql.org/message-id/CAAKRu_ZQEpk6Q1WtNLgfXBdCmdU5xN_w0boVO6faO_Ax%2Bckjig%40mail.gmail.com. I had completely forgotten about that.

Melanie's patches in that thread implemented the same optimization of avoiding the fsync() if no checkpoint has happened during the index build. My patch here also implements batching the WAL records of multiple blocks, which was not part of those older patches. OTOH, those patches included an additional optimization of not bypassing the shared buffer cache if the index is small. That seems sensible too.

In this new patch, I subconsciously implemented an API close to what I suggested at the end of that old thread. So I'd like to continue this effort based on this new patch. We can add the bypass-buffer-cache optimization later on top of this. With the new API that this introduces, it should be an isolated change to the implementation, with no changes required to the callers.

--
Heikki Linnakangas
Neon (https://neon.tech)
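For readers who haven't opened the patch, a caller-side sketch of the API shape (function and type names as in the patch right now; treat them as provisional):

	BulkWriteState *bulkstate;
	BulkWriteBuffer buf;
	Page		page;

	bulkstate = smgr_bulk_start_rel(rel, MAIN_FORKNUM);

	buf = smgr_bulk_get_buf(bulkstate);
	page = (Page) buf;
	PageInit(page, BLCKSZ, 0);
	/* ... fill the page contents ... */
	smgr_bulk_write(bulkstate, 0, buf, true);	/* true = standard page layout */

	/* flushes everything out; fsyncs (or registers the sync) only if needed */
	smgr_bulk_finish(bulkstate);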
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
On Mon, 13 Nov 2023 at 03:39, Thomas Munro wrote:
> We follow the common en-US usage: "canceled", "canceling" but
> "cancellation". Blame Webstah et al.

I changed all the places that were not adhering to those spellings. There were also a few such places in parts of the codebase that these changes didn't touch. I included a new 0001 patch to fix those.

I do feel like this patchset is pretty much in a committable state, so it would be very much appreciated if any committer could help push it over the finish line.

v23-0002-Return-2-from-pqReadData-on-EOF.patch Description: Binary data
v23-0001-Fix-spelling-of-canceled-cancellation.patch Description: Binary data
v23-0004-Start-using-new-libpq-cancel-APIs.patch Description: Binary data
v23-0003-Add-non-blocking-version-of-PQcancel.patch Description: Binary data
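For context, the non-blocking flow in 0003 is meant to be driven like a regular connection attempt. A rough caller-side sketch (function names taken from the patch series and assumed here, so double-check against the latest version):

	PGcancelConn *cancelConn = PQcancelCreate(conn);

	if (PQcancelStart(cancelConn))
	{
		while (true)
		{
			PostgresPollingStatusType st = PQcancelPoll(cancelConn);

			if (st == PGRES_POLLING_OK || st == PGRES_POLLING_FAILED)
				break;
			/* wait on PQcancelSocket(cancelConn) for read/write readiness */
		}
	}
	if (PQcancelStatus(cancelConn) != CONNECTION_OK)
		fprintf(stderr, "cancel failed: %s\n", PQcancelErrorMessage(cancelConn));
	PQcancelFinish(cancelConn);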
Re: btree_gist into core?
Hi, On Wed, Jan 19, 2022 at 09:30:11AM +0100, Peter Eisentraut wrote: > > To use exclusion constraints in practice, you often need to install the > btree_gist extension, so that you can combine for example a range type check > and normal scalar key columns into one constraint. > > [...] > > There are also of course questions about how to smoothly arrange upgrades > from extensions to the built-in situations. I'm not sure if that's what you were thinking of, but I know at least one extension (that I'm maintaining) that explicitly relies on btree_gist extension, as in "requires = [...], btree_gist" in the .control file. Since you can't really tweak the control file on a per-major-version basis, this will require some extra thoughts to make sure that people can release extensions without having to tweak this file in some make rule or something like that.
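Concretely, the dependency lives in the extension's control file, something like this (illustrative contents):

	# my_extension.control (illustrative)
	comment = 'extension relying on btree_gist operator classes'
	default_version = '1.0'
	relocatable = true
	requires = 'btree_gist'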
Re: Memory consumed by paths during partitionwise join planning
On Thu, Dec 7, 2023 at 6:19 PM David Rowley wrote:
>
> Maybe we can try to move forward with your refcount idea and see how
> the performance looks. If that's intolerable then that might help us
> decide on the next best alternative solution.
>

Here are performance numbers.

setup
create table t1 (a integer primary key, b integer);
create table t1_parted (a integer primary key, b integer) partition by range(a);
create 1000 partitions of t1

query (a five way self-join)
select * from t1 a, t1 b, t1 c, t1 d, t1 e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = e.a; -- unpartitioned table case
select * from t1_parted a, t1_parted b, t1_parted c, t1_parted d, t1_parted e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = e.a; -- partitioned table case

The numbers with the patches attached to [1], with the limitations listed in that email, are as follows. I ran each query 10 times through EXPLAIN (SUMMARY) and averaged the planning time (in ms) with and without the patch.

unpartitioned case
without patch: 0.25
with patch: 0.19
This improvement is probably misleading; the planning time for this query changes a lot between runs.

partitioned case (without partitionwise join)
without patch: 14580.65
with patch: 14625.12
% degradation: 0.3%

partitioned case (with partitionwise join)
without patch: 23273.69
with patch: 23328.52
% degradation: 0.2%

That looks pretty small considering the benefits. What do you think?

[1] https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 16:32, tender wang wrote: > > enable -O2, only one instruction: > xor eax, eax This is not fast code. This is how friendly C compiler suggests you that mask must be 127, not 128. Best regards, Andrey Borodin.
Re: btree_gist into core?
> Thoughts?

I think it'd be really nice to do this without btree_gist. I imagine something like this:

	CREATE INDEX ON tbl USING gist (range_col, int_col USING btree);

I think this would make the index access methods interface more useful. Index access method developers wouldn't need to provide operator classes for all data types. We could extend the ACCESS METHOD definition to allow this:

	CREATE ACCESS METHOD my_hash_index TYPE INDEX
	    IMPLEMENTS hash HANDLER my_hash_index_handler;

I realise this is a difficult project.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Andrey M. Borodin wrote on Thu, 14 Dec 2023 at 17:35:
>
> > On 14 Dec 2023, at 14:28, tender wang wrote:
> >
> > Now that AND is more faster, Can we replace the '%
> > SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'
>
> unsigned int GetBankno1(unsigned int pageno) {
>     return pageno & 127;
> }
>
> unsigned int GetBankno2(unsigned int pageno) {
>     return pageno % 128;
> }
>
> Generates with -O2
>
> GetBankno1(unsigned int):
>     mov eax, edi
>     and eax, 127
>     ret
> GetBankno2(unsigned int):
>     mov eax, edi
>     and eax, 127
>     ret
>
> Compiler is smart enough with constants.

Yeah, that's true.

	int GetBankno(long pageno) {
	    unsigned short bank_mask = 128;
	    int bankno = (pageno & bank_mask) % 128;
	    return bankno;
	}

With -O2 enabled, this compiles to only one instruction:

	xor eax, eax

But if we use '%' instead, things change as below:

	int GetBankno(long pageno) {
	    unsigned short bank_mask = 128;
	    int bankno = (pageno % bank_mask) % 128;
	    return bankno;
	}

	mov rdx, rdi
	sar rdx, 63
	shr rdx, 57
	lea rax, [rdi+rdx]
	and eax, 127
	sub eax, edx

> Best regards, Andrey Borodin.
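Putting Andrey's point from upthread concretely: the single-AND form requires a power-of-two bank count with a mask of count - 1, e.g. (sketch):

	#define SLRU_MAX_BANKLOCKS 128			/* must stay a power of two */

	static inline unsigned int
	GetBankno(unsigned int pageno)
	{
		/* one AND instruction; same result as pageno % 128 for unsigned values */
		return pageno & (SLRU_MAX_BANKLOCKS - 1);
	}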
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 16:06, Dilip Kumar wrote: > > I have noticed > a very good improvement with the addition of patch 0003. Indeed, a very impressive results! It’s almost x2 of performance on high contention scenario, on top of previous improvements. Best regards, Andrey Borodin.
Re: Synchronizing slots from primary to standby
On Thu, Dec 14, 2023 at 7:00 AM Peter Smith wrote: > > Hi, here are a few more review comments for the patch v47-0002 > > (plus my review comments of v45-0002 [1] are yet to be addressed) > > == > 1. General > > For consistency and readability, try to use variables of the same > names whenever they have the same purpose, even when they declared are > in different functions. A few like this were already mentioned in the > previous review but there are more I keep noticing. > > For example, > 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. > > > == > src/backend/replication/logical/slotsync.c > > 2. > +/* > + * > + * Validates the primary server info. > + * > + * Using the specified primary server connection, it verifies whether > the master > + * is a standby itself and returns true in that case to convey the caller > that > + * we are on the cascading standby. > + * But if master is the primary server, it goes ahead and validates > + * primary_slot_name. It emits error if the physical slot in > primary_slot_name > + * does not exist on the primary server. > + */ > +static bool > +validate_primary_info(WalReceiverConn *wrconn) > > 2b. > IMO it is too tricky to have a function called "validate_xxx", when > actually you gave that return value some special unintuitive meaning > other than just validation. IMO it is always better for the returned > value to properly match the function name so the expectations are very > obvious. So, In this case, I think a better function signature would > be like this: > > SUGGESTION > > static void > validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) > > or > > static void > validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > The terminology master_is_standby is a bit indirect for this usage, so I would prefer the second one. Shall we name this function as check_primary_info()? Additionally, can we rewrite the following comment: "Using the specified primary server connection, check whether we are cascading standby. It also validates primary_slot_info for non-cascading-standbys.". -- With Regards, Amit Kapila.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin wrote: > > On 12 Dec 2023, at 18:28, Alvaro Herrera wrote: > > > > Andrey, do you have any stress tests or anything else that you used to > > gain confidence in this code? > > We are using only the first two steps of the patchset; these steps do not touch > locking stuff. > > We’ve got some confidence after Yura Sokolov’s benchmarks [0].

Thanks! I have run this test [1]. Instead of comparing against the master, I have compared the effect of (patch-1 = (0001+0002) slru buffer bank) vs (patch-2 = (0001+0002+0003) slru buffer bank + bank-wise lock), and here are the results of benchmark-1 and benchmark-2. I have noticed a very good improvement with the addition of patch 0003.

Machine information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127

configurations:
max_wal_size=20GB
shared_buffers=20GB
checkpoint_timeout=40min
max_connections=700
maintenance_work_mem=1GB
subtrans_buffers=$variable
multixact_offsets_buffers=$variable
multixact_members_buffers=$variable

benchmark-1

version | subtrans | multixact | tps
        | buffers  | offs/memb | func+ballast
--------+----------+-----------+--------------
patch-1 | 64       | 64/128    | 87 + 58
patch-2 | 64       | 64/128    | 128 + 83
patch-1 | 1024     | 512/1024  | 96 + 64
patch-2 | 1024     | 512/1024  | 163 + 108

benchmark-2

version | subtrans | multixact | tps
        | buffers  | offs/memb | func
--------+----------+-----------+------
patch-1 | 64       | 64/128    | 10
patch-2 | 64       | 64/128    | 12
patch-1 | 1024     | 512/1024  | 44
patch-2 | 1024     | 512/1024  | 72

[1] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Guiding principle for dropping LLVM versions?
On 25.10.23 07:47, Thomas Munro wrote: Ideally more distros would be present in this vacuum-horizon decision table, but I don't think it'd change the conclusion: 10 is the trailing edge. Therefore the attached patch scales back its ambition to that release. Tested on LLVM 10-18. This patch and the associated reasoning look good to me. I think this is good to go for PG17.
Re: "pgoutput" options missing on documentation
> I agree that we missed updating the parameters of the Logical > Streaming Replication Protocol documentation. I haven't reviewed all > the details yet but one minor thing that caught my attention while > looking at your patch is that we can update the exact additional > information we started to send for streaming mode parallel. We should > update the following sentence: "It accepts an additional value > "parallel" to enable sending extra information with the "Stream Abort" > messages to be used for parallelisation." I changed this in the new version. Thank you for picking this up.
Re: "pgoutput" options missing on documentation
> To reduce translation efforts, perhaps it is better to arrange for > these to share a common message. Good idea. I've done so. > Also, I am unsure whether to call these "parameters" or "options" -- I > wanted to call them parameters like in the documentation, but every > other message in this function refers to "options", so in my example, > I mimicked the nearby code. YMMV. It looks like they are called "options" in most places. I changed the documentation to be consistent too. > Since the previous line already said pgoutput is the standard decoding > plugin, maybe it's not necessary to repeat that. > > SUGGESTION > Using the START_REPLICATION command, > pgoutput accepts the following parameters: Changed. > 3. > I noticed in the protocol message formats docs [1] that those messages > are grouped according to the protocol version that supports them. > Perhaps this page should be arranged similarly for these parameters? > > For example, document the parameters in the order they were introduced. > > SUGGESTION > > -proto_version > ... > -publication_names > ... > -binary > ... > -messages > ... > > Since protocol version 2: > > -streaming (boolean) > ... > > Since protocol version 3: > > -two_phase > ... > > Since protocol version 4: > > -streaming (boolean/parallel) > ... > -origin This is not going to be correct because not all options require a protocol version. "origin" was added in version 16, but doesn't check for any "proto_version". Perhaps we should fix this too. > 4. > + Boolean parameter to use the binary transfer mode. This is faster > + than the text mode, but slightly less robust > > SUGGESTION > Boolean parameter to use binary transfer mode. Binary mode is faster > than the text mode but slightly less robust Done. Thanks for the review. The new versions are attached.
v01-0001-pgoutput-Improve-error-messages-for-options.patch Description: Binary data
v01-0002-pgoutput-Document-missing-options.patch Description: Binary data
Re: planner chooses incremental but not the best one
On Tue, Dec 12, 2023 at 4:40 PM Nicolas Lutic wrote:
> I've come across a behaviour of the planner I can't explain.
> After a migration from 11 to 15 (on RDS) we noticed a degradation in
> response time on a query, it went from a few seconds to ten minutes.
> A vacuum(analyze) has been realized to be sure that all is clean.
> The 'explain analyze' shows us a change of plan. Postgresql 15 chooses
> `incremental sort` with an index corresponding to the ORDER BY clause
> (on the created_at column). The previous v11 plan used a more efficient index.
>
> By deactivating incremental sort, response times in v15 are equal to v11 one.

I think this issue is caused by under-estimating the startup cost of incremental sort, which in turn is caused by over-estimating the number of groups with equal presorted keys by estimate_num_groups(). We can simulate the same issue with the query below.

create table t (a int, b int, c int, d int) partition by range (a);
create table tp1 partition of t for values from (0) to (1000);
create table tp2 partition of t for values from (1000) to (2000);
insert into t select i%2000, i%1000, i%300, i from generate_series(1,100)i;
create index on t(b);
create index on t(c);
analyze t;

-- by default incremental sort is chosen
explain analyze select * from t where b = 3 order by c, d limit 10;

                                   QUERY PLAN
 Limit (cost=143.03..570.89 rows=10 width=16) (actual time=375.399..375.402 rows=10 loops=1)
   -> Incremental Sort (cost=143.03..42671.85 rows=994 width=16) (actual time=375.397..375.399 rows=10 loops=1)
        Sort Key: t.c, t.d
        Presorted Key: t.c
        Full-sort Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
        Pre-sorted Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
        -> Merge Append (cost=0.85..42644.84 rows=994 width=16) (actual time=11.415..375.289 rows=335 loops=1)
             Sort Key: t.c
             -> Index Scan using tp1_c_idx on tp1 t_1 (cost=0.42..21318.39 rows=498 width=16) (actual time=6.666..189.356 rows=168 loops=1)
                  Filter: (b = 3)
                  Rows Removed by Filter: 171537
             -> Index Scan using tp2_c_idx on tp2 t_2 (cost=0.42..21316.50 rows=496 width=16) (actual time=4.745..185.870 rows=168 loops=1)
                  Filter: (b = 3)
                  Rows Removed by Filter: 171534
 Planning Time: 0.501 ms
 Execution Time: 375.477 ms
(16 rows)

-- disable incremental sort
set enable_incremental_sort to off;
SET

explain analyze select * from t where b = 3 order by c, d limit 10;

                                   QUERY PLAN
 Limit (cost=2577.51..2577.53 rows=10 width=16) (actual time=2.928..2.930 rows=10 loops=1)
   -> Sort (cost=2577.51..2579.99 rows=994 width=16) (actual time=2.925..2.927 rows=10 loops=1)
        Sort Key: t.c, t.d
        Sort Method: top-N heapsort  Memory: 25kB
        -> Append (cost=8.28..2556.03 rows=994 width=16) (actual time=0.627..2.670 rows=1000 loops=1)
             -> Bitmap Heap Scan on tp1 t_1 (cost=8.28..1276.62 rows=498 width=16) (actual time=0.625..1.688 rows=500 loops=1)
                  Recheck Cond: (b = 3)
                  Heap Blocks: exact=500
                  -> Bitmap Index Scan on tp1_b_idx (cost=0.00..8.16 rows=498 width=0) (actual time=0.330..0.330 rows=500 loops=1)
                       Index Cond: (b = 3)
             -> Bitmap Heap Scan on tp2 t_2 (cost=8.27..1274.43 rows=496 width=16) (actual time=0.178..0.876 rows=500 loops=1)
                  Recheck Cond: (b = 3)
                  Heap Blocks: exact=500
                  -> Bitmap Index Scan on tp2_b_idx (cost=0.00..8.14 rows=496 width=0) (actual time=0.093..0.093 rows=500 loops=1)
                       Index Cond: (b = 3)
 Planning Time: 0.481 ms
 Execution Time: 3.031 ms
(17 rows)

As we can see, the execution time is 375.477 ms by default and 3.031 ms if we disable incremental sort.
When we calculate the cost of incremental sort, we need to estimate the number of groups into which the relation is divided by the presorted keys, and then calculate the cost of sorting a single group. If we over-estimate the number of groups, the startup cost of incremental sort would be under-estimated. In the first plan above, the number of groups with equal 't.c' is estimated to be 300 by estimate_num_groups(), but is actually 3 after applying the restriction 'b = 3'. As a result, the startup cost of the incremental sort is estimated to be 143.03, but is actually 14222.68. That's why the
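To put rough numbers on that, here is a deliberately simplified model (this is not the actual cost_incremental_sort()/cost_tuplesort() logic; the function name and cost constants below are illustrative placeholders, not planner cost parameters):

#include <math.h>
#include <stdio.h>

/* Incremental sort's startup cost is roughly the cost of fetching and
 * sorting only the FIRST group of rows with equal presorted keys, so
 * over-estimating the group count shrinks the assumed first group and
 * under-estimates the startup cost. */
static double
startup_cost(double input_rows, double est_groups,
             double per_tuple_fetch, double per_comparison)
{
    double group_rows = input_rows / est_groups;
    double cmp_rows = group_rows > 2.0 ? group_rows : 2.0;

    return group_rows * per_tuple_fetch +
           per_comparison * group_rows * log2(cmp_rows);
}

int
main(void)
{
    /* 994 input rows as in the example plan: 300 estimated groups
     * versus the 3 groups actually present after "b = 3" */
    printf("est. 300 groups: %.2f\n", startup_cost(994, 300, 1.0, 0.1));
    printf("actual 3 groups: %.2f\n", startup_cost(994, 3, 1.0, 0.1));
    return 0;
}

With the same placeholder costs, the version that believes in 300 groups reports a startup cost about two orders of magnitude smaller than the 3-group version, which mirrors the 143.03-versus-14222.68 gap described above.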
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 11 Dec 2023 10:57:15 +0900, Masahiko Sawada wrote: > IIUC we cannot create two same name functions with the same arguments > but a different return value type in the first place. It seems to me > to be an overkill to change such a design. Oh, sorry. I didn't notice it. > Another idea is to encapsulate copy_to/from_handler by a super class > like copy_handler. The handler function is called with an argument, > say copyto, and returns copy_handler encapsulating either > copy_to/from_handler depending on the argument. It's for using "${copy_format_name}" such as "json" and "parquet" as a function name, right? If we use the "${copy_format_name}" approach, we can't use function names that are already used by tablesample method handlers, such as "system" and "bernoulli", as COPY FORMAT names, because both tablesample method handler functions and COPY FORMAT handler functions take "(internal)" as their argument. I think tablesample method names and COPY FORMAT names would not actually conflict, but the limitation (sharing one namespace between tablesample methods and COPY FORMAT) is an unnecessary one. How about using a prefix ("copy_to_${copy_format_name}" or something) or a suffix ("${copy_format_name}_copy_to" or something) for function names? For example, "copy_to_json"/"copy_from_json" for the "json" COPY FORMAT. ("copy_${copy_format_name}", which returns a copy_handler encapsulating either copy_to/from_handler depending on the argument, may be an option.) Thanks, -- kou
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 14:28, tender wang wrote:
>
> Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'?

unsigned int GetBankno1(unsigned int pageno) {
    return pageno & 127;
}

unsigned int GetBankno2(unsigned int pageno) {
    return pageno % 128;
}

Generates with -O2:

GetBankno1(unsigned int):
    mov eax, edi
    and eax, 127
    ret
GetBankno2(unsigned int):
    mov eax, edi
    and eax, 127
    ret

Compiler is smart enough with constants.

Best regards, Andrey Borodin.
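For what it's worth, the fold from '%' to AND above only happens because the divisor is a compile-time power-of-two constant; with a divisor known only at runtime the compiler must emit a real division, which is why a precomputed mask is attractive when the bank count is configurable. A minimal illustration (hypothetical helper names, not from the patch):

/* constant power-of-two divisor: gcc/clang at -O2 emit "and eax, 127" */
unsigned int
GetBanknoConst(unsigned int pageno)
{
    return pageno % 128;
}

/* runtime divisor: the compiler cannot prove it is a power of two,
 * so it emits a div instruction instead of a cheap AND */
unsigned int
GetBanknoVar(unsigned int pageno, unsigned int nbanks)
{
    return pageno % nbanks;
}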
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Andrey M. Borodin wrote on Thu, Dec 14, 2023 at 17:02:
> > On 14 Dec 2023, at 08:12, Amul Sul wrote:
> >
> > + int bankno = pageno & ctl->bank_mask;
> >
> > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number of banks (num_banks) and get the bank number through modulus op (pageno % num_banks) instead of bitwise & operation (pageno & ctl->bank_mask), which is a bit difficult to read compared to modulus op, which is quite simple, straightforward and much common practice in hashing.
> >
> > Are there any advantages of using & over % ?

Using the Compiler Explorer [1] tool: '%' produces more assembly instructions than '&'.

int GetBankno1(int pageno) {
    return pageno & 127;
}

int GetBankno2(int pageno) {
    return pageno % 127;
}

Under clang 13.0:

GetBankno1: # @GetBankno1
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov eax, dword ptr [rbp - 4]
    and eax, 127
    pop rbp
    ret
GetBankno2: # @GetBankno2
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov eax, dword ptr [rbp - 4]
    mov ecx, 127
    cdq
    idiv ecx
    mov eax, edx
    pop rbp
    ret

Under gcc 13.2:

GetBankno1:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    and eax, 127
    pop rbp
    ret
GetBankno2:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    movsx rdx, eax
    imul rdx, rdx, -2130574327
    shr rdx, 32
    add edx, eax
    mov ecx, edx
    sar ecx, 6
    cdq
    sub ecx, edx
    mov edx, ecx
    sal edx, 7
    sub edx, ecx
    sub eax, edx
    mov ecx, eax
    mov eax, ecx
    pop rbp
    ret

[1] https://godbolt.org/

> The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of a status of a transaction.

Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '&':

SimpleLruGetBankLock()
{
    int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;  /* use '&' here */
    return &(ctl->shared->bank_locks[banklockno].lock);
}

Thoughts?

> [0] https://www.agner.org/optimize/instruction_tables.pdf
Re: [meson] expose buildtype debug/optimization info to pg_config
Hi Peter, Thanks for looking into this. On Thu, Dec 14, 2023 at 4:50 PM Peter Eisentraut wrote: > > On 12.12.23 11:40, Junwang Zhao wrote: > > build system using configure set VAL_CFLAGS with debug and > > optimization flags, so pg_config will show these infos. Some > > extensions depend on the mechanism. > > > > This patch exposes these flags with a typo fixed together. > > I have committed the typo fix. > > But I would like to learn more about the requirements of extensions in > this area. This seems a bit suspicious to me. This is what I found when building citus against a meson debug-build pg installation: since CFLAGS doesn't contain the -g flag, the binary doesn't include debug information, which differs from the behavior of the configure build system. Another issue I found is that some C++ extensions (adjust/parquet_fdw, for example) don't build against the meson-generated pgxs.mk, since it doesn't set the CXX command. CXX is only set when the llvm option is enabled, which is different from the old build system. I don't insist that we make Meson behave the same as the old build system; I just think the issues I raised might stop developers from trying the fancy new build system. And the fix I posted might not be ideal; you and Andres might have better solutions. -- Regards, Junwang Zhao
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat wrote: > > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote: > > > > I think you forgot to attach the patch. > > Sorry. Here it is. > > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote: > > > > > > It looks like the solution works. But this is the only place where we > > process a change before SNAPSHOT reaches FULL. But this is also the > > only record which affects a decision to queue/not a following change. > > So it should be ok. The sequence_hash'es are separate for each > > transaction and they are cleaned when processing COMMIT record. > > > > But it is possible that even commit or abort also happens before the > > snapshot reaches full state in which case the hash table will have > > stale or invalid (for aborts) entries. That will probably be cleaned > > at a later point by running_xact records. > > Why would cleaning wait till running_xact records? Won't txn entry > itself be removed when processing commit/abort record? At the same time the > sequence hash will be cleaned as well. > > > Now, I think in theory, it > > is possible that the same RelFileLocator can again be allocated before > > we clean up the existing entry which can probably confuse the system. > > How? The transaction allocating the first time would be cleaned before > it happens the second time. So shouldn't matter. >

It can only be cleaned if we process it, but xact_decode won't allow us to process it, and I don't think it would be a good idea to add another hack for sequences here. See the code below:

xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
    SnapBuild  *builder = ctx->snapshot_builder;
    ReorderBuffer *reorder = ctx->reorder;
    XLogReaderState *r = buf->record;
    uint8       info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;

    /*
     * If the snapshot isn't yet fully built, we cannot decode anything, so
     * bail out.
     */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
        return;

-- With Regards, Amit Kapila.
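For readers following the state comparisons in this thread: the builder states form an ordered progression. Paraphrasing the SnapBuildState enum from src/include/replication/snapbuild.h (the header is the authoritative reference):

typedef enum
{
    SNAPBUILD_START = -1,             /* initial state, nothing can be decoded */
    SNAPBUILD_BUILDING_SNAPSHOT = 0,  /* gathering information about running xacts */
    SNAPBUILD_FULL_SNAPSHOT = 1,      /* xacts that start from now on can be decoded */
    SNAPBUILD_CONSISTENT = 2          /* all xacts can be decoded and shipped */
} SnapBuildState;

Because the values are ordered, a check like "state < SNAPBUILD_FULL_SNAPSHOT" bails out in the two earlier states, while the patch's "state < SNAPBUILD_BUILDING_SNAPSHOT" bails out only in SNAPBUILD_START.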
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote: > > I think you forgot to attach the patch. Sorry. Here it is. On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote: > > > > It looks like the solution works. But this is the only place where we > process a change before SNAPSHOT reaches FULL. But this is also the > only record which affects a decision to queue/not a following change. > So it should be ok. The sequence_hash'es are separate for each > transaction and they are cleaned when processing COMMIT record. > > > > But it is possible that even commit or abort also happens before the > snapshot reaches full state in which case the hash table will have > stale or invalid (for aborts) entries. That will probably be cleaned > at a later point by running_xact records. Why would cleaning wait till running_xact records? Won't the txn entry itself be removed when processing the commit/abort record? At the same time the sequence hash will be cleaned as well. > Now, I think in theory, it > is possible that the same RelFileLocator can again be allocated before > we clean up the existing entry which can probably confuse the system. How? The transaction allocating the first time would be cleaned before it happens the second time. So it shouldn't matter. -- Best Wishes, Ashutosh Bapat

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 73e38cafd8..8e2ebbd8e0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1543,10 +1543,15 @@ smgr_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	elog(LOG, "XXX: smgr_decode. snapshot is %s", SnapBuildIdentify(builder));
 
 	/*
-	 * If we don't have snapshot or we are just fast-forwarding, there is no
-	 * point in decoding relfilenode information.
+	 * If we are not building snapshot yet or we are just fast-forwarding, there
+	 * is no point in decoding relfilenode information. If the sequence
+	 * associated with relfilenode here is changed in the same transaction but
+	 * after snapshot was built, the relfilenode needs to be present in the
+	 * sequence hash table so that the change can be deemed as transactional.
+	 * Otherwise it will not be queued. Hence we process this change even before
+	 * we have built snapshot.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_BUILDING_SNAPSHOT ||
 		ctx->fast_forward)
 	{
 		elog(LOG, "XXX: skipped");
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat wrote: > > On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar wrote: > > > > > > It is correct that we can make a wrong decision about whether a change > > > is transactional or non-transactional when sequence DDL happens before > > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > > after that state. However, one thing to note here is that we won't try > > > to stream such a change because for non-transactional cases we don't > > > proceed unless the snapshot is in a consistent state. Now, if the > > > decision had been correct then we would probably have queued the > > > sequence change and discarded at commit. > > > > > > One thing that we deviate here is that for non-sequence transactional > > > cases (including logical messages), we immediately start queuing the > > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > > SnapBuildProcessChange() returns true which is quite possible) and > > > take final decision at commit/prepare/abort time. However, that won't > > > be the case for sequences because of the dependency of determining > > > transactional cases on one of the prior records. Now, I am not > > > completely sure at this stage if such a deviation can cause any > > > problem and or whether we are okay to have such a deviation for > > > sequences. > > > > Okay, so this particular scenario that I raised is somehow saved, I > > mean although we are considering transactional sequence operation as > > non-transactional we also know that if some of the changes for a > > transaction are skipped because the snapshot was not FULL that means > > that transaction can not be streamed because that transaction has to > > be committed before snapshot become CONSISTENT (based on the snapshot > > state change machinery). Ideally based on the same logic that the > > snapshot is not consistent the non-transactional sequence changes are > > also skipped. But the only thing that makes me a bit uncomfortable is > > that even though the result is not wrong we have made some wrong > > intermediate decisions i.e. considered transactional change as > > non-transactions. > > > > One solution to this issue is that, even if the snapshot state does > > not reach FULL just add the sequence relids to the hash, I mean that > > hash is only maintained for deciding whether the sequence is changed > > in that transaction or not. So no adding such relids to hash seems > > like a root cause of the issue. Honestly, I haven't analyzed this > > idea in detail about how easy it would be to add only these changes to > > the hash and what are the other dependencies, but this seems like a > > worthwhile direction IMHO. > ... > It looks like the solution works. But this is the only place where we > process a change before SNAPSHOT reaches FULL. But this is also the > only record which affects a decision to queue/not a following change. > So it should be ok. The sequence_hash'es are separate for each > transaction and they are cleaned when processing COMMIT record.
But it is possible that even commit or abort also happens before the snapshot reaches full state in which case the hash table will have stale or invalid (for aborts) entries. That will probably be cleaned at a later point by running_xact records. Now, I think in theory, it is possible that the same RelFileLocator can again be allocated before we clean up the existing entry which can probably confuse the system. It might or might not be a problem in practice but I think the more assumptions we add for sequences, the more difficult it will become to ensure its correctness. -- With Regards, Amit Kapila.
Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}
On Tue, Oct 31, 2023 at 03:11:03PM +0100, hubert depesz lubaczewski wrote: > On Tue, Oct 31, 2023 at 08:17:52AM +0900, Michael Paquier wrote: >> Thanks for the input. I was looking yesterday if this code was >> available somewhere, but couldn't find it.. Until this morning: >> https://gitlab.com/depesz/explain.depesz.com.git > > Well, the parser itself is https://gitlab.com/depesz/Pg--Explain/ :) That was close enough ;) > Well, if it's possible to deduce what is the meaning in given line, > I can add the logic to do the deduction to parser. > > Also, I want to say that I appreciate being looped in the discussion. I lost sight of this thread, so my apologies for the delay. The patch to fix the description of the EXPLAIN field has now been applied to v15 and v16. -- Michael
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 08:12, Amul Sul wrote: > > > + int bankno = pageno & ctl->bank_mask; > > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number > of banks (num_banks) and get the bank number through modulus op (pageno % > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a > bit difficult to read compared to modulus op which is quite simple, > straightforward and much common practice in hashing. > > Are there any advantages of using & over % ? The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of the status of a transaction. [0] https://www.agner.org/optimize/instruction_tables.pdf
Re: [meson] expose buildtype debug/optimization info to pg_config
On 12.12.23 11:40, Junwang Zhao wrote: build system using configure set VAL_CFLAGS with debug and optimization flags, so pg_config will show these infos. Some extensions depend on the mechanism. This patch exposes these flags with a typo fixed together. I have committed the typo fix. But I would like to learn more about the requirements of extensions in this area. This seems a bit suspicious to me.
Re: Add code indentation check to cirrus-ci (was Re: Add BF member koel-like indentation checks to SanityCheck CI)
Hi, You may want to check out the WIP patch [1] about adding meson targets to run pgindent by Andres. [1] https://www.postgresql.org/message-id/20231019044907.ph6dw637loqg3lqk%40awork3.anarazel.de -- Regards, Nazir Bilal Yavuz Microsoft
Re: GUC names in messages
On 11.12.23 00:07, Peter Smith wrote: If the rule is changed to quote those MixedCase GUCs then the docs will require minor tweaking CURRENT In messages containing configuration variable names, do not include quotes when the names are visibly not natural English words, such as when they have underscores, are all-uppercase or have mixed case. Otherwise, quotes must be added. Do include quotes in a message where an arbitrary variable name is to be expanded. "are all-uppercase or have mixed case." --> "or are all-uppercase." After these discussions, I think this rule change was not a good idea. It effectively enforces these kinds of inconsistencies. For example, if you ever refactored "DateStyle is wrong" to "%s is wrong" you'd need to adjust the quotes, and thus user-visible behavior, for entirely internal reasons. This is not good. And then came the idea to determine the quoting dynamically, which I think everyone agreed was too much. So I don't see a way to make this work well.
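To make the refactoring hazard concrete, a contrived pair of messages (these are not actual messages from the tree; the snippet assumes the usual ereport machinery available in the backend):

/* rule satisfied: a MixedCase GUC name spelled literally gets no quotes */
ereport(ERROR,
        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
         errmsg("DateStyle is wrong")));

/* after a purely internal refactor to a placeholder, the rule demands
 * quotes for an expanded name, so the user-visible text changes */
ereport(ERROR,
        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
         errmsg("\"%s\" is wrong", "DateStyle")));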
Re: Simplify newNode()
On 14.12.23 01:48, Heikki Linnakangas wrote: The newNode() macro can be turned into a static inline function, which makes it a lot simpler. See attached. This was not possible when the macro was originally written, as we didn't require compiler to have static inline support, but nowadays we do. This was last discussed in 2008, see discussion at https://www.postgresql.org/message-id/26133.1220037409%40sss.pgh.pa.us. In those tests, Tom observed that gcc refused to inline the static inline function. That was weird, the function is very small and doesn't do anything special. Whatever the problem was, I think we can dismiss it with modern compilers. It does get inlined on gcc 12 and clang 14 that I have installed. I notice that the existing comments point out that the size argument should be a compile-time constant, but that is no longer the case for ExtensibleNode(). Also, newNode() is the only caller of palloc0fast(), which also points out that the size argument should be a compile-time constant, and palloc0fast() is the only caller of MemSetTest(). I can see how an older compiler might have gotten too confused by all that. But if we think that compilers are now smart enough, maybe we can unwind this whole stack a bit more? Maybe we don't need MemSetTest() and/or palloc0fast() and/or newNode() at all?
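For reference, a minimal sketch of the static-inline direction (an assumption of roughly what the attached patch does, not the patch itself; it presumes the backend environment via postgres.h):

#include "postgres.h"

/* newNode() as a plain static inline function instead of a macro */
static inline Node *
newNode(size_t size, NodeTag tag)
{
    Node   *result;

    Assert(size >= sizeof(Node));   /* need space for the tag, at least */
    result = (Node *) palloc0(size);
    result->type = tag;

    return result;
}

With this shape there is no longer any requirement for the size argument to be a compile-time constant, which is what makes the palloc0fast()/MemSetTest() machinery look like a candidate for removal as well.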
Re: remaining sql/json patches
On Sat, Dec 9, 2023 at 2:05 PM jian he wrote: > Hi. Thanks for the review. > function JsonPathExecResult comment needs to be refactored? since it > changed a lot. I suppose you meant executeJsonPath()'s comment. I've added a description of the new callback function arguments. On Wed, Dec 13, 2023 at 6:59 PM jian he wrote: > Hi. small issues I found... > > typo: > +-- Test mutabilily od query functions Fixed. > > + default: > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("only datetime, bool, numeric, and text types can be casted > to jsonpath types"))); > > transformJsonPassingArgs's function: transformJsonValueExpr will make > the above code unreached. It's good to have the ereport to catch errors caused by any future changes. > also based on the `switch (typid)` cases, > I guess best message would be > errmsg("only datetime, bool, numeric, text, json, jsonb types can be > casted to jsonpath types"))); I've rewritten the message to mention the unsupported type. Maybe the supported types can go in a DETAIL message. I might do that later. > + case JSON_QUERY_OP: > + jsexpr->wrapper = func->wrapper; > + jsexpr->omit_quotes = (func->quotes == JS_QUOTES_OMIT); > + > + if (!OidIsValid(jsexpr->returning->typid)) > + { > + JsonReturning *ret = jsexpr->returning; > + > + ret->typid = JsonFuncExprDefaultReturnType(jsexpr); > + ret->typmod = -1; > + } > + jsexpr->result_coercion = coerceJsonFuncExprOutput(pstate, jsexpr); > > I noticed, if (!OidIsValid(jsexpr->returning->typid)) is the true > function JsonFuncExprDefaultReturnType may be called twice, not sure > if it's good or not.. If avoiding the double-calling means that we've to add more conditions in the code, I'm fine with leaving this as-is. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: remaining sql/json patches
On Sat, Dec 9, 2023 at 2:05 AM Andrew Dunstan wrote: > On 2023-12-08 Fr 11:37, Robert Haas wrote: > > On Fri, Dec 8, 2023 at 1:59 AM Amit Langote wrote: > >> Would it be messy to replace the lookahead approach by whatever's > >> suiable *in the future* when it becomes necessary to do so? > > It might be. Changing grammar rules to tends to change corner-case > > behavior if nothing else. We're best off picking the approach that we > > think is correct long term. > > All this makes me wonder if Alvaro's first suggested solution (adding > NESTED to the UNBOUNDED precedence level) wouldn't be better after all. I've done just that in the latest v32. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com