Is a clearer memory lifespan for outerTuple and innerTuple useful?
Hi,

While working on "shared detoast value"[0], where I want to avoid detoasting the same datum over and over again, I had to decide which memory context should be used to hold the detoasted value. I later found I have to use different MemoryContexts for the OuterTuple and the innerTuple, since the OuterTuple usually has a longer lifespan. I found the following code in nodeMergejoin.c, which handles a pretty similar situation, except that it uses an ExprContext rather than a MemoryContext:

MergeJoinState *
ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)

	/*
	 * we need two additional econtexts in which we can compute the join
	 * expressions from the left and right input tuples.  The node's regular
	 * econtext won't do because it gets reset too often.
	 */
	mergestate->mj_OuterEContext = CreateExprContext(estate);
	mergestate->mj_InnerEContext = CreateExprContext(estate);

IIUC, what we need here is really a MemoryContext rather than an ExprContext. In the attachment, I just use two MemoryContexts instead of the two ExprContexts, which should use less memory and have more precise semantics, and it works fine. Shall we go in this direction? I attached the two MemoryContexts to JoinState rather than MergeJoinState, which is for the "shared detoast value"[0] more or less.

[0] https://www.postgresql.org/message-id/87ttoyihgm@163.com

>From cadd94f8ff2398ca430b43494b3eb71225b017c3 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh"
Date: Fri, 15 Dec 2023 15:28:36 +0800
Subject: [PATCH v1] Use MemoryContext instead of ExprContext for
 nodeMergejoin.c

---
 src/backend/executor/nodeMergejoin.c | 27 ++-
 src/backend/executor/nodeNestloop.c  |  1 +
 src/include/nodes/execnodes.h        |  4 ++--
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 3cdab77dfc..d330e104f6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -294,7 +294,7 @@ MJExamineQuals(List *mergeclauses,
 static MJEvalResult
 MJEvalOuterValues(MergeJoinState *mergestate)
 {
-	ExprContext *econtext = mergestate->mj_OuterEContext;
+	ExprContext *econtext = mergestate->js.ps.ps_ExprContext;
 	MJEvalResult result = MJEVAL_MATCHABLE;
 	int			i;
 	MemoryContext oldContext;
@@ -303,9 +303,9 @@ MJEvalOuterValues(MergeJoinState *mergestate)
 	if (TupIsNull(mergestate->mj_OuterTupleSlot))
 		return MJEVAL_ENDOFJOIN;
 
-	ResetExprContext(econtext);
+	MemoryContextReset(mergestate->js.outer_tuple_memory);
 
-	oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
+	oldContext = MemoryContextSwitchTo(mergestate->js.outer_tuple_memory);
 
 	econtext->ecxt_outertuple = mergestate->mj_OuterTupleSlot;
@@ -341,7 +341,7 @@
 static MJEvalResult
 MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot)
 {
-	ExprContext *econtext = mergestate->mj_InnerEContext;
+	ExprContext *econtext = mergestate->js.ps.ps_ExprContext;
 	MJEvalResult result = MJEVAL_MATCHABLE;
 	int			i;
 	MemoryContext oldContext;
@@ -352,7 +352,7 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot)
 
 	ResetExprContext(econtext);
 
-	oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
+	oldContext = MemoryContextSwitchTo(mergestate->js.inner_tuple_memory);
 
 	econtext->ecxt_innertuple = innerslot;
@@ -1471,14 +1471,6 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 */
 	ExecAssignExprContext(estate, &mergestate->js.ps);
 
-	/*
-	 * we need two additional econtexts in which we can compute the join
-	 * expressions from the left and right input tuples.  The node's regular
-	 * econtext won't do because it gets reset too often.
-	 */
-	mergestate->mj_OuterEContext = CreateExprContext(estate);
-	mergestate->mj_InnerEContext = CreateExprContext(estate);
-
 	/*
 	 * initialize child nodes
 	 *
@@ -1497,6 +1489,15 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 										  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	mergestate->js.outer_tuple_memory = AllocSetContextCreate(
+		mergestate->js.ps.ps_ExprContext->ecxt_per_query_memory,
+		"OuterTupleCtx",
+		ALLOCSET_SMALL_SIZES);
+	mergestate->js.inner_tuple_memory = AllocSetContextCreate(
+		mergestate->js.ps.ps_ExprContext->ecxt_per_query_memory,
+		"InnerTupleCtx",
+		ALLOCSET_SMALL_SIZES);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index ebd1406843..5e10d01c4e 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -349,6 +349,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	NL1_printf("ExecInitNestLoop: %s\n", "node
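To tie this back to the detoast use case: a minimal sketch, assuming the patch's JoinState.outer_tuple_memory field (the helper name here is hypothetical), of caching a detoasted outer-side datum in the longer-lived context:

    /* Detoast once per outer tuple, into the context that lives exactly
     * as long as the outer tuple does. */
    static Datum
    detoast_in_outer_ctx(JoinState *js, Datum toasted)
    {
        MemoryContext oldcxt;
        Datum         result;

        oldcxt = MemoryContextSwitchTo(js->outer_tuple_memory);
        result = PointerGetDatum(PG_DETOAST_DATUM_COPY(toasted));
        MemoryContextSwitchTo(oldcxt);

        /* valid until outer_tuple_memory is reset for the next outer tuple */
        return result;
    }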
Support "Right Semi Join" plan shapes
Hi Richard Guo,

I see that the test samples are all EXISTS subqueries. I think semi join should also support IN and ANY subqueries. Would you do more testing on IN and ANY subqueries, for example the queries sketched below?

Best wishes
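For illustration (table names assumed), the IN and = ANY forms are pulled up to semi joins just like EXISTS, so they should be able to exercise the same plan shape:

    -- EXISTS form, as in the existing tests
    SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a);

    -- IN form
    SELECT * FROM t1 WHERE t1.a IN (SELECT a FROM t2);

    -- = ANY form
    SELECT * FROM t1 WHERE t1.a = ANY (SELECT a FROM t2);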
Re: Fix bug with indexes on whole-row expressions
On Wed, Dec 13, 2023 at 7:01 AM Tom Lane wrote: > ywgrit writes: > > I forbid to create indexes on whole-row expression in the following > patch. > > I'd like to hear your opinions. > > As I said in the previous thread, I don't think this can possibly > be acceptable. Surely there are people depending on the capability. > I'm not worried so much about the exact case of an index column > being a whole-row Var --- I agree that that's pretty useless --- > but an index column that is a function on a whole-row Var seems > quite useful. (Your patch fails to detect that, BTW, which means > it does not block the case presented in bug #18244.) > > I thought about extending the ALTER TABLE logic to disallow changes > in composite types that appear in index expressions. We already have > find_composite_type_dependencies(), and it turns out that this already > blocks ALTER for the case you want to forbid, but we concluded that we > didn't need to prevent it for the bug #18244 case: > > * If objsubid identifies a specific column, refer to that in error > * messages. Otherwise, search to see if there's a user column of > the > * type. (We assume system columns are never of interesting > types.) > * The search is needed because an index containing an expression > * column of the target type will just be recorded as a > whole-relation > * dependency. If we do not find a column of the type, the > dependency > * must indicate that the type is transiently referenced in an > index > * expression but not stored on disk, which we assume is OK, just > as > * we do for references in views. (It could also be that the > target > * type is embedded in some container type that is stored in an > index > * column, but the previous recursion should catch such cases.) > > Perhaps a reasonable answer would be to issue a WARNING (not error) > in the case where an index has this kind of dependency. The index > might need to be reindexed --- but it might not, too, and in any case > I doubt that flat-out forbidding the ALTER is a helpful idea. > > regards, tom lane > WARNING can be easily overlooked. Users of mobile/web apps don't see Postgres WARNINGs. Forbidding ALTER sounds more reasonable. Do you see any good use cases for whole-row indexes? And for such cases, wouldn't it be reasonable for users to specify all columns explicitly? E.g.: create index on t using btree(row(c1, c2, c3));
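To make the contested case concrete, here is a hypothetical example of an index on a function of a whole-row Var, plus the ALTER that changes the composite type underneath it (the scenario behind bug #18244):

    CREATE TABLE t (c1 int, c2 int, c3 int);

    -- an assumed IMMUTABLE function over the table's composite type
    CREATE FUNCTION f(t) RETURNS text
        AS 'SELECT $1.c1::text' LANGUAGE sql IMMUTABLE;

    CREATE INDEX t_f_idx ON t (f(t.*));   -- function on a whole-row Var

    -- currently allowed, since the composite type is only transiently
    -- referenced by the index expression; whether to ERROR, WARN, or
    -- allow this is the question above
    ALTER TABLE t ALTER COLUMN c2 TYPE bigint;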
Re: Make COPY format extendable: Extract COPY TO format implementations
On Fri, Dec 15, 2023 at 12:45 PM Sutou Kouhei wrote:
>
> Hi,
>
> In
> "Re: Make COPY format extendable: Extract COPY TO format implementations"
> on Fri, 15 Dec 2023 11:27:30 +0800,
> Junwang Zhao wrote:
>
> >> > Adding a prefix or suffix would be one option but to give extensions
> >> > more flexibility, another option would be to support format = 'custom'
> >> > and add the "handler" option to specify a copy handler function name
> >> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom',
> >> > HANDLER = 'arrow_copy_handler').
> >>
> > I like the prefix/suffix idea, easy to implement. *custom* is not a FORMAT,
> > and user has to know the name of the specific handler names, not
> > intuitive.
>
> Ah! I misunderstood this idea. "custom" is the special
> format to use "HANDLER". I thought that we can use it like
>
>    (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl1')
>
> and
>
>    (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl2')
>
> .
>
> >> Interesting. If we use this option, users can choose an COPY
> >> FORMAT implementation they like from multiple
> >> implementations. For example, a developer may implement a
> >> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON
> >> related API and another developer may implement a handler
> >> with simdjson[1] which is a fast JSON parser. Users can
> >> choose whichever they like.
> > Not sure about this, why not move Json copy handler to contrib
> > as an example for others, any extensions share the same format
> > function name and just install one? No bound would implement
> > another CSV or TEXT copy handler IMHO.
>
> I should have used a different format not JSON as an example
> for easy to understand. I just wanted to say that extension
> developers can implement another implementation without
> conflicting another implementation.

Yeah, I can see the value of the HANDLER option now. The possibility of two extensions for the same format using the same handler name should be rare, I guess ;)

>
> Thanks,
> --
> kou

--
Regards
Junwang Zhao
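For reference, the two competing shapes discussed in this thread would look roughly like this (the handler names are made up, and neither syntax exists yet):

    -- explicit handler, as proposed upthread
    COPY mytable FROM '/path/to/data.arrow'
        WITH (FORMAT 'custom', HANDLER 'arrow_copy_handler');

    -- name lookup: the format name itself resolves to a handler function,
    -- the way TABLESAMPLE methods are looked up
    COPY mytable FROM '/path/to/data.arrow' WITH (FORMAT 'arrow');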
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila wrote: > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > > > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > > > > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada > > > wrote: > > > > > > > > I've heard the report internally that replication lag became huge when > > > > decoding transactions each consisting of 500k sub transactions. Note > > > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > > > > Can't you suggest them to use streaming mode to avoid this problem or > do you see some problem with that? Yeah, that's one option. But I can suggest > > > > > Here is a test script for a many subtransactions scenario. In my > > > > environment, the logical decoding took over 2min to decode one top > > > > transaction having 100k subtransctions. > > > > > > > > - > > > > create table test (c int); > > > > create or replace function testfn (cnt int) returns void as $$ > > > > begin > > > > for i in 1..cnt loop > > > > begin > > > > insert into test values (i); > > > > exception when division_by_zero then > > > > raise notice 'caught error'; > > > > return; > > > > end; > > > > end loop; > > > > end; > > > > $$ > > > > language plpgsql; > > > > select testfn(10) > > > > set logical_decoding_work_mem to '4MB'; > > > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > > > > > > > To deal with this problem, I initially thought of the idea (a) > > > > mentioned in the comment; use a binary heap to maintain the > > > > transactions sorted by the amount of changes or the size. But it seems > > > > not a good idea to try maintaining all transactions by its size since > > > > the size of each transaction could be changed frequently. > > > > > > > > The attached patch uses a different approach that consists of three > > > > strategies; (1) maintain the list of transactions whose size is larger > > > > than 10% of logical_decoding_work_mem, and preferentially evict a > > > > transaction from this list. > > > > > > IIUC, you are giving preference to multiple list ideas as compared to > (a) because you don't need to adjust the list each time the > transaction size changes, is that right? Right. > If so, I think there is a > cost to keep that data structure up-to-date but it can help in > reducing the number of times we need to serialize. Yes, there is a trade-off. What I don't want to do is to keep all transactions ordered since it's too costly. The proposed idea uses multiple lists to keep all transactions roughly ordered. The maintenance cost would be cheap since each list is unordered. It might be a good idea to have a threshold to switch how to pick the largest transaction based on the number of transactions in the reorderbuffer. If there are many transactions, we can use the proposed algorithm to find a possibly-largest transaction, otherwise use the current way. > > If the list is empty, all transactions are > > > > small enough, (2) so we evict the oldest top-level transaction from > > > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > > > freeing memory blocks in GenerationContext. Finally, if this is also > > > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > > > note the fact that even if a transaction is evicted the > > > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > > > 0. 
In the worst case where all (quite a few) transactions are smaller > > > > than 10% of the memory limit, we might end up checking many > > > > transactions to find non-zero size transaction entries to evict. So > > > > the patch adds a new list to maintain all transactions that have at > > > > least one change in memory. > > > > > > > > Summarizing the algorithm I've implemented in the patch, > > > > > > > > 1. pick a transaction from the list of large transactions (larger than > > > > 10% of memory limit). > > > > 2. pick a transaction from the top-level transaction list in LSN order. > > > > 3. pick a transaction from the list of transactions that have at least > > > > one change in memory. > > > > > > > > With the patch, the above test case completed within 3 seconds in my > > > > environment. > > > > > > Thanks for working on this, I think it would be good to test other > > > scenarios as well where this might have some negative impact and see > > > where we stand. > > > > Agreed. > > > > > 1) A scenario where suppose you have one very large transaction that > > > is consuming ~40% of the memory and 5-6 comparatively smaller > > > transactions that are just above 10% of the memory limit. And now for > > > coming under the memory limit instead of getting 1 large transaction > > > evicted out, we are evicting out multiple times. > > > > Given the large transaction list will have up to 10 transactions, I > > think it's cheap to pick the largest transaction among them. It's O(N) > >
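To make the three-step pick concrete, here is a sketch in C (not the patch's actual code; large_txns, large_node, txns_with_changes, and mem_node are assumed names for the two new lists the patch maintains, while toplevel_by_lsn already exists in ReorderBuffer):

    static ReorderBufferTXN *
    ReorderBufferPickTXNToEvict(ReorderBuffer *rb)
    {
        dlist_iter  iter;
        ReorderBufferTXN *largest = NULL;

        /*
         * 1. Prefer a transaction larger than 10% of
         * logical_decoding_work_mem; this list holds at most ~10 entries,
         * so the linear scan for the largest one is cheap.
         */
        dlist_foreach(iter, &rb->large_txns)
        {
            ReorderBufferTXN *txn =
                dlist_container(ReorderBufferTXN, large_node, iter.cur);

            if (largest == NULL || txn->size > largest->size)
                largest = txn;
        }
        if (largest != NULL)
            return largest;

        /*
         * 2. Otherwise evict the oldest top-level transaction; evicting
         * older changes helps free whole blocks in the generation context.
         */
        if (!dlist_is_empty(&rb->toplevel_by_lsn))
            return dlist_head_element(ReorderBufferTXN, node,
                                      &rb->toplevel_by_lsn);

        /* 3. Otherwise any transaction that still has changes in memory. */
        return dlist_head_element(ReorderBufferTXN, mem_node,
                                  &rb->txns_with_changes);
    }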
Re: "pgoutput" options missing on documentation
Thanks for the update. Here are some more review comments for the v01* patches. // Patch v00-0001 v01 modified the messages more than I was expecting, although what you did looks fine to me. ~~~ 1. + /* Check required options */ + if (!protocol_version_given) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + /* translator: %s is a pgoutput option */ + errmsg("missing pgoutput option: %s", "proto_version")); + if (!publication_names_given) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + /* translator: %s is a pgoutput option */ + errmsg("missing pgoutput option: %s", "publication_names")); I saw that the original "publication_names" error was using errcode(ERRCODE_INVALID_PARAMETER_VALUE), but TBH since there is no option given at all I felt ERRCODE_SYNTAX_ERROR might be more appropriate errcode for those 2 mandatory option errors. // Patch v00-0002 2. I saw that the chapter "55.4. Streaming Replication Protocol" [1] introduces "START_REPLICATION SLOT slot_name LOGICAL ..." command and it says --- option_name The name of an option passed to the slot's logical decoding plugin. --- Perhaps that part should now include a reference to your new information: SUGGESTION option_name The name of an option passed to the slot's logical decoding plugin. Please see for details about options that are accepted by the standard (pgoutput) plugin. ~~~ 3. - The logical replication START_REPLICATION command - accepts following parameters: + Using the START_REPLICATION command, + pgoutput) accepts the following options: Oops, you copied my typo. There is a spurious ')'. ~~~ 4. + + + + origin + + + + String option to send only changes by an origin. It also gets + the option "none" to send the changes that have no origin associated, + and the option "any" to send the changes regardless of their origin. + This can be used to avoid loops (infinite replication of the same data) + among replication nodes. + + + AFAIK pgoutput only understands origin values "any" and "none" and nothing else; I felt the "It also gets..." part implies it is more flexible than it is. SUGGESTION Possible values are "none" (to only send the changes that have no origin associated), or "any" (to send the changes regardless of their origin). ~~~ 5. Rearranging option details > > SUGGESTION > > > > -proto_version > > ... > > -publication_names > > ... > > -binary > > ... > > -messages > > ... > > > > Since protocol version 2: > > > > -streaming (boolean) > > ... > > > > Since protocol version 3: > > > > -two_phase > > ... > > > > Since protocol version 4: > > > > -streaming (boolean/parallel) > > ... > > -origin > > This is not going to be correct because not all options do require a > protocol version. "origin" is added in version 16, but doesn't check > for any "proto_version". Perhaps we should fix this too. > OK, to deal with that can't you just include "origin" in the first group which has no special protocol requirements? SUGGESTION -proto_version -publication_names -binary -messages -origin Requires minimum protocol version 2: -streaming (boolean) Requires minimum protocol version 3: -two_phase Requires minimum protocol version 4: -streaming (parallel) == [1] 55.4 https://www.postgresql.org/docs/devel/protocol-replication.html Kind Regards, Peter Smith. Fujitsu Australia
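As an illustration of the grouping proposed above, a START_REPLICATION invocation exercising the options might look like this (slot and publication names are made up):

    START_REPLICATION SLOT "sub1" LOGICAL 0/0 (
        proto_version '4',
        publication_names '"pub1"',
        origin 'none',
        two_phase 'true',
        streaming 'parallel'
    )

Here origin needs no minimum protocol version, two_phase requires at least protocol version 3, and streaming 'parallel' requires version 4, per the suggested grouping.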
Re: Synchronizing slots from primary to standby
On Thu, Dec 14, 2023 at 10:15 AM Peter Smith wrote:
>
> A review comment for v47-0001
>

Thanks for reviewing. I have addressed these in v48. There is a design change around the code that checked for cascading standbys and revalidated new GUC values on conf-reload, so the code has changed entirely in the area some of these comments applied to. Please review the new version and let me know.

> ==
> src/backend/replication/slot.c
>
> 1. GetStandbySlotList
>
> +static void
> +WalSndRereadConfigAndReInitSlotList(List **standby_slots)
> +{
> +	char	   *pre_standby_slot_names;
> +
> +	ProcessConfigFile(PGC_SIGHUP);
> +
> +	/*
> +	 * If we are running on a standby, there is no need to reload
> +	 * standby_slot_names since we do not support syncing slots to cascading
> +	 * standbys.
> +	 */
> +	if (RecoveryInProgress())
> +		return;
>
> Should the RecoveryInProgress() check be first -- even before the
> ProcessConfigFile call?
>
> ==
> Kind Regards,
> Peter Smith.
> Fujitsu Australia
Re: Synchronizing slots from primary to standby
On Wed, Dec 13, 2023 at 3:53 PM Peter Smith wrote: > > Hi Shveta, here are some review comments for v45-0002. > Thanks for the feedback. Addressed these in v48. Please find my comments on some. > == > doc/src/sgml/bgworker.sgml > > 1. > + > + > + BgWorkerStart_PostmasterStart > + > + > + > BgWorkerStart_PostmasterStart > + Start as soon as postgres itself has finished its own initialization; > + processes requesting this are not eligible for database connections. > + > + > + > + > + > + BgWorkerStart_ConsistentState > + > + > + > BgWorkerStart_ConsistentState > + Start as soon as a consistent state has been reached in a hot-standby, > + allowing processes to connect to databases and run read-only queries. > + > + > + > + > + > + BgWorkerStart_RecoveryFinished > + > + > + > BgWorkerStart_RecoveryFinished > + Start as soon as the system has entered normal read-write state. Note > + that the BgWorkerStart_ConsistentState and > + BgWorkerStart_RecoveryFinished are equivalent > + in a server that's not a hot standby. > + > + > + > + > + > + BgWorkerStart_ConsistentState_HotStandby > + > + > + > BgWorkerStart_ConsistentState_HotStandby > + Same meaning as BgWorkerStart_ConsistentState but > + it is more strict in terms of the server i.e. start the worker only > + if it is hot-standby. > + > + > + > + > > Maybe reorder these slightly, because I felt it is better if the > BgWorkerStart_ConsistentState_HotStandby comes next after > BgWorkerStart_ConsistentState, which it refers to > > For example:: > 1st.BgWorkerStart_PostmasterStart > 2nd.BgWorkerStart_ConsistentState > 3rd.BgWorkerStart_ConsistentState_HotStandby > 4th.BgWorkerStart_RecoveryFinished > > == > doc/src/sgml/config.sgml > > 2. > enable_syncslot = true > > Not sure, but I thought the "= true" part should be formatted too. > > SUGGESTION > enable_syncslot = true > > == > doc/src/sgml/logicaldecoding.sgml > > 3. > + > + A logical replication slot on the primary can be synchronized to the hot > + standby by enabling the failover option during slot creation and setting > + enable_syncslot on the standby. For the > synchronization > + to work, it is mandatory to have a physical replication slot between the > + primary and the standby. It's highly recommended that the said physical > + replication slot is listed in standby_slot_names on > + the primary to prevent the subscriber from consuming changes faster than > + the hot standby. Additionally, hot_standby_feedback > + must be enabled on the standby for the slots synchronization to work. > + > > I felt those parts that describe the mandatory GUCs should be kept together. > > SUGGESTION > For the synchronization to work, it is mandatory to have a physical > replication slot between the primary and the standby, and > hot_standby_feedback must be enabled on the > standby. > > It's also highly recommended that the said physical replication slot > is named in standby_slot_names list on the primary, > to prevent the subscriber from consuming changes faster than the hot > standby. > > ~~~ > > 4. (Chapter 49) > > By enabling synchronization of slots, logical replication can be > resumed after failover depending upon the > pg_replication_slots.sync_state for the synchronized slots on the > standby at the time of failover. Only slots that were in ready > sync_state ('r') on the standby before failover can be used for > logical replication after failover. 
However, the slots which were in > initiated sync_state ('i') and not sync-ready ('r') at the time of > failover will be dropped and logical replication for such slots can > not be resumed after failover. This applies to the case where a > logical subscription is disabled before failover and is enabled after > failover. If the synchronized slot due to disabled subscription could > not be made sync-ready ('r') on standby, then the subscription can not > be resumed after failover even when enabled. If the primary is idle, > then the synchronized slots on the standby may take a noticeable time > to reach the ready ('r') sync_state. This can be sped up by calling > the pg_log_standby_snapshot function on the primary. > > ~ > > Somehow, I still felt all that was too wordy/repetitive. Below is my > attempt to make it more concise. Thoughts? > > SUGGESTION > The ability to resume logical replication after failover depends upon > the pg_replication_slots.sync_state value for the synchronized slots > on the standby at the time of failover. Only slots that have attained > a "ready" sync_state ('r') on the standby before failover can be used > for logical replication after failover. Slots that have not yet > reached 'r' state (they are still 'i') will be dropped, therefore > logical
Re: Synchronizing slots from primary to standby
On Thu, Dec 14, 2023 at 4:40 PM Amit Kapila wrote: > > On Thu, Dec 14, 2023 at 7:00 AM Peter Smith wrote: > > > > Hi, here are a few more review comments for the patch v47-0002 > > > > (plus my review comments of v45-0002 [1] are yet to be addressed) > > > > == > > 1. General > > > > For consistency and readability, try to use variables of the same > > names whenever they have the same purpose, even when they declared are > > in different functions. A few like this were already mentioned in the > > previous review but there are more I keep noticing. > > > > For example, > > 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. > > > > > > == > > src/backend/replication/logical/slotsync.c > > > > 2. > > +/* > > + * > > + * Validates the primary server info. > > + * > > + * Using the specified primary server connection, it verifies whether > > the master > > + * is a standby itself and returns true in that case to convey the caller > > that > > + * we are on the cascading standby. > > + * But if master is the primary server, it goes ahead and validates > > + * primary_slot_name. It emits error if the physical slot in > > primary_slot_name > > + * does not exist on the primary server. > > + */ > > +static bool > > +validate_primary_info(WalReceiverConn *wrconn) > > > > 2b. > > IMO it is too tricky to have a function called "validate_xxx", when > > actually you gave that return value some special unintuitive meaning > > other than just validation. IMO it is always better for the returned > > value to properly match the function name so the expectations are very > > obvious. So, In this case, I think a better function signature would > > be like this: > > > > SUGGESTION > > > > static void > > validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) > > > > or > > > > static void > > validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > > > > The terminology master_is_standby is a bit indirect for this usage, so > I would prefer the second one. Shall we name this function as > check_primary_info()? Additionally, can we rewrite the following > comment: "Using the specified primary server connection, check whether > we are cascading standby. It also validates primary_slot_info for > non-cascading-standbys.". > > -- > With Regards, > Amit Kapila. PFA v48. Changes are: 1) Addressed comments by Peter for v45-002 and v47-002 given in [1] and [2] respectively 2) Addressed comments by Amit for v47-002 given in [3], [4] 3) Addressed an old comment (#74 in [5]) of getting rid of header inclusion from tablesync.c when there was no code change in that file. Thanks Hou-san for working on this change. TODO: --Address the test comments in [1] for 050_standby_failover_slots_sync.pl --Review the feasibility of addressing one pending comment (comment 13 in [5]) of 'r'->'n' conversion. [1]: https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAHut%2BPsvxs-%3Dj3aCpPVs3e4w78HndCdO-F4bLPzAX70%2BdgWUuQ%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1L2ts%3DgfiF4aw7-DH8HWj29s08hVRq-Ff8%3DmjfdUXx8CA%40mail.gmail.com [4]: https://www.postgresql.org/message-id/CAA4eK1%2Bw9yv%2B4UZXhiDHZpGDfbeRHYDBu23FwsniS8sYUZeu1w%40mail.gmail.com [5]: https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com thanks Shveta
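For illustration, assuming the patch's pg_replication_slots.sync_state column, the state of synchronized slots could be checked like this before relying on them after failover:

    -- on the standby: sync_state is the column added by this patch series;
    -- only slots in the ready state ('r') can be used after failover
    SELECT slot_name, sync_state FROM pg_replication_slots;

    -- on the primary: speeds up an idle primary's slots reaching 'r',
    -- as the documentation text above notes
    SELECT pg_log_standby_snapshot();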
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 15 Dec 2023 11:27:30 +0800, Junwang Zhao wrote: >> > Adding a prefix or suffix would be one option but to give extensions >> > more flexibility, another option would be to support format = 'custom' >> > and add the "handler" option to specify a copy handler function name >> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom', >> > HANDLER = 'arrow_copy_handler'). >> > I like the prefix/suffix idea, easy to implement. *custom* is not a FORMAT, > and user has to know the name of the specific handler names, not > intuitive. Ah! I misunderstood this idea. "custom" is the special format to use "HANDLER". I thought that we can use it like (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl1') and (FORMAT = 'arrow', HANDLER = 'arrow_copy_handler_impl2') . >> Interesting. If we use this option, users can choose an COPY >> FORMAT implementation they like from multiple >> implementations. For example, a developer may implement a >> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON >> related API and another developer may implement a handler >> with simdjson[1] which is a fast JSON parser. Users can >> choose whichever they like. > Not sure about this, why not move Json copy handler to contrib > as an example for others, any extensions share the same format > function name and just install one? No bound would implement > another CSV or TEXT copy handler IMHO. I should have used a different format not JSON as an example for easy to understand. I just wanted to say that extension developers can implement another implementation without conflicting another implementation. Thanks, -- kou
Re: Make COPY format extendable: Extract COPY TO format implementations
On Fri, Dec 15, 2023 at 9:53 AM Sutou Kouhei wrote:
>
> Hi,
>
> In
> "Re: Make COPY format extendable: Extract COPY TO format implementations"
> on Fri, 15 Dec 2023 05:19:43 +0900,
> Masahiko Sawada wrote:
>
> > To avoid collisions, extensions can be created in a
> > different schema than public.
>
> Thanks. I didn't notice it.
>
> > And note that built-in format copy handler doesn't need to
> > declare its handler function.
>
> Right. I know it.
>
> > Adding a prefix or suffix would be one option but to give extensions
> > more flexibility, another option would be to support format = 'custom'
> > and add the "handler" option to specify a copy handler function name
> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom',
> > HANDLER = 'arrow_copy_handler').
>
> Interesting. If we use this option, users can choose an COPY
> FORMAT implementation they like from multiple
> implementations. For example, a developer may implement a
> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON
> related API and another developer may implement a handler
> with simdjson[1] which is a fast JSON parser. Users can
> choose whichever they like.
>
> But specifying HANDLER = '...' explicitly is a bit
> inconvenient. Because only one handler will be installed in
> most use cases. In the case, users don't need to choose one
> handler.
>
> If we choose this option, it may be better that we also
> provide a mechanism that can work without HANDLER. Searching
> a function by name like tablesample method does is an option.

Agreed. We can search for the function by format name by default, and the user can optionally specify the handler function name in cases where the names of installed custom copy handlers collide. Probably the handler option stuff could be a follow-up patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada > > wrote: > > > > > > I've heard the report internally that replication lag became huge when > > > decoding transactions each consisting of 500k sub transactions. Note > > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > Can't you suggest them to use streaming mode to avoid this problem or do you see some problem with that? > > > Here is a test script for a many subtransactions scenario. In my > > > environment, the logical decoding took over 2min to decode one top > > > transaction having 100k subtransctions. > > > > > > - > > > create table test (c int); > > > create or replace function testfn (cnt int) returns void as $$ > > > begin > > > for i in 1..cnt loop > > > begin > > > insert into test values (i); > > > exception when division_by_zero then > > > raise notice 'caught error'; > > > return; > > > end; > > > end loop; > > > end; > > > $$ > > > language plpgsql; > > > select testfn(10) > > > set logical_decoding_work_mem to '4MB'; > > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > > > > To deal with this problem, I initially thought of the idea (a) > > > mentioned in the comment; use a binary heap to maintain the > > > transactions sorted by the amount of changes or the size. But it seems > > > not a good idea to try maintaining all transactions by its size since > > > the size of each transaction could be changed frequently. > > > > > > The attached patch uses a different approach that consists of three > > > strategies; (1) maintain the list of transactions whose size is larger > > > than 10% of logical_decoding_work_mem, and preferentially evict a > > > transaction from this list. > > > IIUC, you are giving preference to multiple list ideas as compared to (a) because you don't need to adjust the list each time the transaction size changes, is that right? If so, I think there is a cost to keep that data structure up-to-date but it can help in reducing the number of times we need to serialize. If the list is empty, all transactions are > > > small enough, (2) so we evict the oldest top-level transaction from > > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > > freeing memory blocks in GenerationContext. Finally, if this is also > > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > > note the fact that even if a transaction is evicted the > > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > > 0. In the worst case where all (quite a few) transactions are smaller > > > than 10% of the memory limit, we might end up checking many > > > transactions to find non-zero size transaction entries to evict. So > > > the patch adds a new list to maintain all transactions that have at > > > least one change in memory. > > > > > > Summarizing the algorithm I've implemented in the patch, > > > > > > 1. pick a transaction from the list of large transactions (larger than > > > 10% of memory limit). > > > 2. pick a transaction from the top-level transaction list in LSN order. > > > 3. pick a transaction from the list of transactions that have at least > > > one change in memory. > > > > > > With the patch, the above test case completed within 3 seconds in my > > > environment. 
> > > > Thanks for working on this, I think it would be good to test other > > scenarios as well where this might have some negative impact and see > > where we stand. > > Agreed. > > > 1) A scenario where suppose you have one very large transaction that > > is consuming ~40% of the memory and 5-6 comparatively smaller > > transactions that are just above 10% of the memory limit. And now for > > coming under the memory limit instead of getting 1 large transaction > > evicted out, we are evicting out multiple times. > > Given the large transaction list will have up to 10 transactions, I > think it's cheap to pick the largest transaction among them. It's O(N) > but N won't be large. > > > 2) Another scenario where all the transactions are under 10% of the > > memory limit but let's say there are some transactions are consuming > > around 8-9% of the memory limit each but those are not very old > > transactions whereas there are certain old transactions which are > > fairly small and consuming under 1% of memory limit and there are many > > such transactions. So how it would affect if we frequently select > > many of these transactions to come under memory limit instead of > > selecting a couple of large transactions which are consuming 8-9%? > > Yeah, probably we can do something for small transactions (i.e. small > and on-memory transactions). One idea is to pick the largest > transaction among them by iterating over all of them. Given that the > more transactions are evicted, the less
Re: Make COPY format extendable: Extract COPY TO format implementations
On Fri, Dec 15, 2023 at 8:53 AM Sutou Kouhei wrote:
>
> Hi,
>
> In
> "Re: Make COPY format extendable: Extract COPY TO format implementations"
> on Fri, 15 Dec 2023 05:19:43 +0900,
> Masahiko Sawada wrote:
>
> > To avoid collisions, extensions can be created in a
> > different schema than public.
>
> Thanks. I didn't notice it.
>
> > And note that built-in format copy handler doesn't need to
> > declare its handler function.
>
> Right. I know it.
>
> > Adding a prefix or suffix would be one option but to give extensions
> > more flexibility, another option would be to support format = 'custom'
> > and add the "handler" option to specify a copy handler function name
> > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom',
> > HANDLER = 'arrow_copy_handler').

I like the prefix/suffix idea; it's easy to implement. *custom* is not a FORMAT, and the user has to know the names of the specific handlers, which is not intuitive.

> Interesting. If we use this option, users can choose an COPY
> FORMAT implementation they like from multiple
> implementations. For example, a developer may implement a
> COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON
> related API and another developer may implement a handler
> with simdjson[1] which is a fast JSON parser. Users can
> choose whichever they like.

Not sure about this; why not move the JSON copy handler to contrib as an example for others? Extensions could share the same format function name, and users would just install one of them. Nobody would implement another CSV or TEXT copy handler, IMHO.

>
> But specifying HANDLER = '...' explicitly is a bit
> inconvenient. Because only one handler will be installed in
> most use cases. In the case, users don't need to choose one
> handler.
>
> If we choose this option, it may be better that we also
> provide a mechanism that can work without HANDLER. Searching
> a function by name like tablesample method does is an option.
>
>
> [1]: https://github.com/simdjson/simdjson
>
>
> Thanks,
> --
> kou

--
Regards
Junwang Zhao
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "RE: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 12 Dec 2023 02:31:53 +, "Hayato Kuroda (Fujitsu)" wrote: >> Can we discuss how to proceed this improvement? >> >> There are 2 approaches for it: >> >> 1. Do the followings concurrently: >>a. Implementing small changes that got a consensus and >> merge them step-by-step >> (e.g. We got a consensus that we need to extract the >> current format related routines.) >>b. Discuss design >> >>(v1-v3 patches use this approach.) >> >> 2. Implement one (large) complete patch set with design >>discussion and merge it >> >>(v4- patches use this approach.) >> >> Which approach is preferred? (Or should we choose another >> approach?) >> >> I thought that 1. is preferred because it will reduce review >> cost. So I chose 1. > > I'm ok to use approach 1, but could you please divide a large patch? E.g., > > 0001. defines an infrastructure for copy-API > 0002. adjusts current codes to use APIs > 0003. adds a test module in src/test/modules or contrib. > ... > > This approach helps reviewers to see patches deeper. Separated patches can be > combined when they are close to committable. It seems that I should have chosen another approach based on comments so far: 3. Do the followings in order: a. Implement a workable (but maybe dirty and/or incomplete) implementation to discuss design like [1], discuss design with it and get a consensus on design b. Implement small patches based on the design [1]: https://www.postgresql.org/message-id/CAD21AoCunywHird3GaPzWe6s9JG1wzxj3Cr6vGN36DDheGjOjA%40mail.gmail.com I'll implement a custom COPY FORMAT handler with [1] and provide a feedback with the experience. (It's for a.) Thanks, -- kou
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat wrote: > > On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila wrote: > > > > It can only be cleaned if we process it but xact_decode won't allow us > > to process it and I don't think it would be a good idea to add another > > hack for sequences here. See below code: > > > > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > > { > > SnapBuild *builder = ctx->snapshot_builder; > > ReorderBuffer *reorder = ctx->reorder; > > XLogReaderState *r = buf->record; > > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; > > > > /* > > * If the snapshot isn't yet fully built, we cannot decode anything, so > > * bail out. > > */ > > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) > > return; > > That may be true for a transaction which is decoded, but I think all > the transactions which are added to ReorderBuffer should be cleaned up > once they have been processed irrespective of whether they are > decoded/sent downstream or not. In this case I see the sequence hash > being cleaned up for the sequence related transaction in Hayato's > reproducer. > It was because the test you are using was not designed to show the problem I mentioned. In this case, the rollback was after a full snapshot state was reached. -- With Regards, Amit Kapila.
Re: Add 64-bit XIDs into PostgreSQL 15
Hi Maxim Orlov,

Good news: xid64 has achieved a successful first phase. I tried to change the patch status (https://commitfest.postgresql.org/43/3594/), but it seems incorrect.

Maxim Orlov wrote on Wed, Dec 13, 2023 at 20:26:
> Hi!
>
> Just to keep this thread up to date, here's a new version after recent
> changes in SLRU.
> I'm also change order of the patches in the set, to make adding initdb MOX
> options after the
> "core 64 xid" patch, since MOX patch is unlikely to be committed and now
> for test purpose only.
>
> --
> Best regards,
> Maxim Orlov.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada wrote: > In v45, 0001 - 0006 are from earlier versions but I've merged previous > updates. So the radix tree now has RT_SET() and RT_FIND() but not > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous > versions that incorporated the above comments. 0009 patch integrates > tidstore with lazy vacuum. Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com ...and got similar results, so we still have good space-efficiency on this test: master: INFO: finished vacuuming "john.public.test": index scans: 9 system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s v45: INFO: finished vacuuming "john.public.test": index scans: 1 system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s More sparse TID distributions won't be as favorable, but we have ideas to improve that in the future. For my next steps, I will finish the node-shrinking behavior and save for a later patchset. Not needed for tid store, but needs to happen because of assumptions in the code. Also, some time ago, I think I commented out RT_FREE_RECURSE to get something working, so I'll fix it, and look at other fixmes and todos. > Note that DSA segment problem is not > resolved yet in this patch. I remember you started a separate thread about this, but I don't think it got any attention. Maybe reply with a "TLDR;" and share a patch to allow controlling max segment size. Some more comments: v45-0003: Since RT_ITERATE_NEXT_PTR works for tid store, do we even need RT_ITERATE_NEXT anymore? The former should handle fixed-length values just fine? If so, we should rename it to match the latter. + * The caller is responsible for locking/unlocking the tree in shared mode. This is not new to v45, but this will come up again below. This needs more explanation: Since we're returning a pointer (to support variable-length values), the caller needs to maintain control until it's finished with the value. v45-0005: + * Regarding the concurrency support, we use a single LWLock for the TidStore. + * The TidStore is exclusively locked when inserting encoded tids to the + * radix tree or when resetting itself. When searching on the TidStore or + * doing the iteration, it is not locked but the underlying radix tree is + * locked in shared mode. This is just stating facts without giving any reasons. Readers are going to wonder why it's inconsistent. The "why" is much more important than the "what". Even with that, this comment is also far from the relevant parts, and so will get out of date. Maybe we can just make sure each relevant function is explained individually. v45-0007: -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx); +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem); Tid store calls this max_bytes -- can we use that name here, too? "work_mem" is highly specific. - RT_PTR_ALLOC *slot; + RT_PTR_ALLOC *slot = NULL; We have a macro for invalid pointer because of DSA. v45-0008: - if (off < 1 || off > MAX_TUPLES_PER_PAGE) + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE)) elog(ERROR, "tuple offset out of range: %u", off); This is a superfluous distraction, since the error path is located way off in the cold segment of the binary. v45-0009: (just a few small things for now) - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the - * vacrel->dead_items array. 
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items. I think we can keep as "listed in the TID store". - * Allocate dead_items (either using palloc, or in dynamic shared memory). - * Sets dead_items in vacrel for caller. + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items + * in vacrel for caller. I think we want to keep "in dynamic shared memory". It's still true. I'm not sure anything needs to change here, actually. parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, - int nrequested_workers, int max_items, - int elevel, BufferAccessStrategy bstrategy) + int nrequested_workers, int vac_work_mem, + int max_offset, int elevel, + BufferAccessStrategy bstrategy) It seems very strange to me that this function has to pass the max_offset. In general, it's been simpler to assume we have a constant max_offset, but in this case that fact is not helping. Something to think about for later. - (errmsg("scanned index \"%s\" to remove %d row versions", + (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions", This should be signed int64. v45-0010: Thinking about this some more, I'm not sure we need to do anything different for the *starting* segment size. (Controlling *max* size does seem important, however.) For the corner case of m_w_m = 1MB, it's fine if vacuum quits pruning immediately after (in effect) it finds the DSA has gone to 2MB. It's not worth
Re: Change GUC hashtable to use simplehash?
I wrote:
> * v8 with chunked interface:
> latency average = 555.688 ms
>
> This starts to improve things for me.
>
> * v8 with chunked, and return lower 32 bits of full 64-bit hash:
> latency average = 556.324 ms
>
> This is within the noise level. There doesn't seem to be much downside
> of using a couple cycles for fasthash's 32-bit reduction.
>
> * revert back to master from Dec 4 and then cherry pick a86c61c9ee
> (save last entry of SearchPathCache)
> latency average = 545.747 ms
>
> So chunked incremental hashing gets within ~2% of that, which is nice.
> It seems we should use that when removing strlen, when convenient.
>
> Updated next steps:
> * Investigate whether/how to incorporate final length into the
> calculation when we don't have the length up front.
> * Add some desperately needed explanatory comments.
> * Use this in some existing cases where it makes sense.
> * Get back to GUC hash and dynahash.

For #1 here, I cloned SMHasher and was dismayed at the complete lack of documentation, but after some poking around, found how to run the tests, using the 32-bit hash to save time. It turns out that the input length is important. I've attached two files of results -- "nolen" means stop using the initial length to tweak the internal seed. As you can see, there are 8 failures. "pluslen" means I then incorporated the length within the finalizer. This *does* pass SMHasher, so that's good. (of course this way can't produce the same hash as when we know the length up front, but that's not important). The attached shows how that would work, further whacking around and testing with Jeff's prototype for the search path cache hash table. I'll work on code comments and get it polished.

---
Testing fasthash32 "fast-hash 32bit" with NO LENGTH

[[[ Sanity Tests ]]]

Verification value 0x6A202089 ... FAIL! (Expected 0xe9481afc)
Running sanity check 1 .. PASS
Running AppendedZeroesTest .. PASS

[[[ Keyset 'Sparse' Tests ]]]

Keyset 'Sparse' - 16-bit keys with up to 9 bits set - 50643 keys
Testing collisions ( 32-bit) - Expected 0.3, actual 0 (0.00x)
Testing collisions (high 19-25 bits) - Worst is 20 bits: 1227/1203 (1.02x)
Testing collisions (low  19-25 bits) - Worst is 24 bits: 81/76 (1.06x)
Testing distribution - Worst bias is the 13-bit window at bit 31 - 0.436%

Keyset 'Sparse' - 24-bit keys with up to 8 bits set - 1271626 keys
Testing collisions ( 32-bit) - Expected 188.2, actual 182 (0.97x)
Testing distribution - Worst bias is the 17-bit window at bit 11 - 0.089%

Keyset 'Sparse' - 32-bit keys with up to 7 bits set - 4514873 keys
Testing collisions ( 32-bit) - Expected 2372.2, actual 2286 (0.96x)
Testing distribution - Worst bias is the 19-bit window at bit 23 - 0.044%

Keyset 'Sparse' - 40-bit keys with up to 6 bits set - 4598479 keys
Testing collisions ( 32-bit) - Expected 2460.8, actual 2426 (0.99x) (-34)
Testing distribution - Worst bias is the 19-bit window at bit 25 - 0.041%

Keyset 'Sparse' - 48-bit keys with up to 6 bits set - 14196869 keys
Testing collisions ( 32-bit) - Expected 23437.8, actual 23601 (1.01x) (164)
Testing distribution - Worst bias is the 20-bit window at bit 31 - 0.020%

Keyset 'Sparse' - 56-bit keys with up to 5 bits set - 4216423 keys
Testing collisions ( 32-bit) - Expected 2069.0, actual 1958 (0.95x)
Testing distribution - Worst bias is the 19-bit window at bit 2 - 0.040%

Keyset 'Sparse' - 64-bit keys with up to 5 bits set - 8303633 keys
Testing collisions ( 32-bit) - Expected 8021.7, actual 7994 (1.00x) (-27)
Testing distribution - Worst bias is the 20-bit window at bit 30 - 0.040%

Keyset 'Sparse' - 72-bit keys with up to 5 bits set - 15082603 keys
Testing collisions ( 32-bit) - Expected 26451.8, actual 26503 (1.00x) (52)
Testing distribution - Worst bias is the 20-bit window at bit 11 - 0.021%

Keyset 'Sparse' - 96-bit keys with up to 4 bits set - 3469497 keys
Testing collisions ( 32-bit) - Expected 1401.0, actual 1415 (1.01x) (15)
Testing distribution - Worst bias is the 19-bit window at bit 12 - 0.058%

Keyset 'Sparse' - 160-bit keys with up to 4 bits set - 26977161 keys
Testing collisions ( 32-bit) - Expected 84546.1, actual 84165 (1.00x) (-381)
Testing distribution - Worst bias is the 20-bit window at bit 12 - 0.012%

Keyset 'Sparse' - 256-bit keys with up to 3 bits set - 2796417 keys
Testing collisions ( 32-bit) - Expected 910.2, actual 905 (0.99x) (-5)
Testing distribution - Worst bias is the 18-bit window at bit 21 - 0.056%

Keyset 'Sparse' - 512-bit keys with up to 3 bits set - 22370049 keys
Testing collisions ( 32-bit) - Expected 58155.4, actual 57846 (0.99x) (-309)
Testing distribution - Worst bias is the 20-bit window at bit 6 - 0.013%

Keyset 'Sparse' - 1024-bit keys with up to 2 bits set - 524801 keys
Testing collisions ( 32-bit) - Expected 32.1, actual 35 (1.09x) (3)
Testing distribution - Worst bias is the 16-bit window at bit 13 - 0.118%

Keyset 'Sparse' - 2048-bit keys with up to 2 bits set -
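For the archives, the "incorporate the length within the finalizer" idea can be sketched like this in C, using fasthash's published mixing constants (the function names are illustrative, not the committed API):

    #include <stdint.h>

    /* fasthash's avalanche step, from the reference implementation */
    static inline uint64_t
    fasthash_mix(uint64_t h)
    {
        h ^= h >> 23;
        h *= UINT64_C(0x2127599bf4325c37);
        h ^= h >> 47;
        return h;
    }

    /*
     * Finalizer for the incremental case: the length was not known up
     * front, so fold it in at the end rather than into the initial seed.
     */
    static inline uint64_t
    fasthash_final64(uint64_t accum, uint64_t len)
    {
        return fasthash_mix(accum ^ len);
    }

    /* 32-bit result, reduced the same way fasthash32 reduces it */
    static inline uint32_t
    fasthash_final32(uint64_t accum, uint64_t len)
    {
        uint64_t h = fasthash_final64(accum, len);

        return (uint32_t) (h - (h >> 32));
    }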
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 15 Dec 2023 05:19:43 +0900, Masahiko Sawada wrote: > To avoid collisions, extensions can be created in a > different schema than public. Thanks. I didn't notice it. > And note that built-in format copy handler doesn't need to > declare its handler function. Right. I know it. > Adding a prefix or suffix would be one option but to give extensions > more flexibility, another option would be to support format = 'custom' > and add the "handler" option to specify a copy handler function name > to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom', > HANDLER = 'arrow_copy_handler'). Interesting. If we use this option, users can choose an COPY FORMAT implementation they like from multiple implementations. For example, a developer may implement a COPY FROM FORMAT = 'json' handler with PostgreSQL's JSON related API and another developer may implement a handler with simdjson[1] which is a fast JSON parser. Users can choose whichever they like. But specifying HANDLER = '...' explicitly is a bit inconvenient. Because only one handler will be installed in most use cases. In the case, users don't need to choose one handler. If we choose this option, it may be better that we also provide a mechanism that can work without HANDLER. Searching a function by name like tablesample method does is an option. [1]: https://github.com/simdjson/simdjson Thanks, -- kou
Re: Memory consumed by paths during partitionwise join planning
Forgot to mention, On Thu, Dec 14, 2023 at 5:34 PM Ashutosh Bapat wrote: > > On Thu, Dec 7, 2023 at 6:19 PM David Rowley wrote: > > > > Maybe we can try to move forward with your refcount idea and see how > > the performance looks. If that's intolerable then that might help us > > decide on the next best alternative solution. > > > > Here are performance numbers > > setup > > create table t1 (a integer primary key, b integer); > create table t1_parted (a integer primary key, b integer) partition by > range(a); > create 1000 partitions of t1 > > query (a five way self-join) > select * from t1 a, t1 b, t1 c, t1 d, t1 e where a.a = b.a and b.a = > c.a and c.a = d.a and d.a = e.a -- unpartitioned table case > select * from t1_parted a, t1_parted b, t1_parted c, t1_parted d, > t1_parted e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = > e.a; -- partitioned table case > > The numbers with patches attached to [1] with limitations listed in > the email are thus > > Ran each query 10 times through EXPLAIN (SUMMARY) and averaged > planning time with and without patch. > unpartitioned case > without patch: 0.25 > with patch: 0.19 > this improvement is probably misleading. The planning time for this > query change a lot. > > partitioned case (without partitionwise join) > without patch: 14580.65 > with patch: 14625.12 > % degradation: 0.3% > > partitioned case (with partitionwise join) > without patch: 23273.69 > with patch: 23328.52 > % degradation: 0.2% > > That looks pretty small considering the benefits. What do you think? > > [1] > https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com If you want to experiment, please use attached patches. There's a fix for segfault during initdb in them. The patches are still raw. -- Best Wishes, Ashutosh Bapat From abcd735ca251f4832125aa38cee6d03743dcf8bb Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Mon, 17 Jul 2023 10:44:18 +0530 Subject: [PATCH 2/4] Basic infrastructure to link, unlink and free pathnodes add_path and add_partial_path free path nodes which they deem sub-optimal. But they just free the uppermost pathnode in the path. The subpaths continue to occupy memory even if they are not used anywhere. The subpath nodes are not freed because they may be referenced by other paths. This commit introduces the infrastructure to track references to paths. As the path nodes are referenced they are "linked" using link_path() to referencing objects. When the path nodes are no longer useful they are "unlinked" using unlink_path() from the referencing objects. The paths whose references reach 0 during unlinking are freed automatically using free_path(). The function unlinks the sub path nodes and can be called when freeing any path node. When the final path for join tree is chosen the paths linked to each participating relation are unlinked, thus ultimately getting freed if not used. TODO The functions free_path(), unlink_path() and link_path() have ereports to catch code paths which do not use these function correctly. They call errbacktrace() which is not used anywhere else. These ereport calls will need to be fixed for productization. 
---
 src/backend/optimizer/path/allpaths.c |  77 +++
 src/backend/optimizer/util/pathnode.c | 136 ++
 src/include/nodes/pathnodes.h         |   1 +
 src/include/optimizer/pathnode.h      |  38 +++
 4 files changed, 252 insertions(+)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 67921a0826..b7098ac39e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3388,6 +3388,81 @@ make_rel_from_joinlist(PlannerInfo *root, List *joinlist)
 	}
 }
 
+/*
+ * TODO: Need similar code to free paths in upper relations.
+ */
+static void
+free_unused_paths_from_rel(RelOptInfo *rel)
+{
+	ListCell   *lc_path;
+	int			cnt_part;
+
+	foreach(lc_path, rel->pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc_path);
+
+		/* Free the path if none references it. */
+		if (path->ref_count == 1)
+		{
+			/* TODO: use routine to unlink path from list */
+			rel->pathlist = foreach_delete_current(rel->pathlist, lc_path);
+			unlink_path();
+		}
+	}
+
+	/* Do the same for partial pathlist. */
+	foreach(lc_path, rel->partial_pathlist)
+	{
+		Path	   *path = (Path *) lfirst(lc_path);
+
+		/* Free the path if none references it. */
+		if (path->ref_count == 1)
+		{
+			rel->partial_pathlist = foreach_delete_current(rel->partial_pathlist, lc_path);
+			unlink_path();
+		}
+	}
+
+	/*
+	 * TODO: We can perform this in generate_partitionwise_paths as well since
+	 * by that time all the paths from partitions will be linked if used.
+	 */
+
+	/* Free unused paths from the partition relations */
+	for (cnt_part = 0; cnt_part < rel->nparts; cnt_part++)
+	{
+		free_unused_paths_from_rel(rel->part_rels[cnt_part]);
+	}
+}
+
+/*
+ * Free unused paths from all the relations created
Re: Simplify newNode()
On Fri, Dec 15, 2023 at 11:44 AM Tom Lane wrote: > Heikki Linnakangas writes: > > Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. > > It's not doing any good as it is, as it gets compiled to be identical to > > MemoryContextAllocZero. > > Also not so here. Admittedly, my results don't make much of a case > for keeping the two code paths, even on compilers where it matters. FWIW here is what I figured out once about why it gets compiled the same these days: https://www.postgresql.org/message-id/ca+hukglfa6ana0vs7lf0op0xbh05he8syx8nfhdyt7k2chy...@mail.gmail.com
Re: Simplify newNode()
Heikki Linnakangas writes:
> On 14/12/2023 10:32, Peter Eisentraut wrote:
>> But if we think that compilers are now smart enough, maybe we can unwind
>> this whole stack a bit more? Maybe we don't need MemSetTest() and/or
>> palloc0fast() and/or newNode() at all?

> Good point. Looking closer, modern compilers will actually turn the
> MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset()
> anyway! Funny. That is true for recent versions of gcc, clang, and MSVC.

Not here ...

> Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned.
> It's not doing any good as it is, as it gets compiled to be identical to
> MemoryContextAllocZero.

Also not so here. Admittedly, my results don't make much of a case
for keeping the two code paths, even on compilers where it matters.

I experimented with the same planner-intensive test case that I used
in the last discussion back in 2008. I got these results:

HEAD:     tps = 144.974195 (without initial connection time)
v1 patch: tps = 146.302004 (without initial connection time)
v2 patch: tps = 144.882208 (without initial connection time)

While there's not much daylight between these numbers, the times are
quite reproducible for me. This is with RHEL8's gcc 8.5.0 on x86_64.
That's probably a bit trailing-edge in terms of what people might be
using with v17, but I don't think it's discountable.

I also looked at the backend's overall code size per size(1):

HEAD:
   text    data     bss     dec     hex filename
8613007  100192  220176 8933375  884fff testversion.stock/bin/postgres

v1 patch:
   text    data     bss     dec     hex filename
8615126  100192  220144 8935462  885826 testversion.v1/bin/postgres

v2 patch:
   text    data     bss     dec     hex filename
8595322  100192  220144 8915658  880aca testversion.v2/bin/postgres

I did check that the v1 patch successfully inlines newNode() and reduces
it to just a MemoryContextAllocZeroAligned call, so it's correct that
modern compilers do that better than whatever I tested in 2008. But I
wonder what is happening in v2 to reduce the code size so much.
MemoryContextAllocZeroAligned is not 20kB by itself.

regards, tom lane
Re: useless LIMIT_OPTION_DEFAULT
Zhang Mingli writes: > By reading the codes, I found that we process limit option as > LIMIT_OPTION_WITH_TIES when using WITH TIES > and all others are LIMIT_OPTION_COUNT by commit > 357889eb17bb9c9336c4f324ceb1651da616fe57. > And check actual limit node in limit_needed(). > There is no need to maintain a useless default limit enum. I agree, that looks pretty useless. Our normal convention for representing not having any LIMIT clause would be to not create a SelectLimit node at all. I don't see why allowing a second representation is a good idea, especially when the code is not in fact doing that. git blame shows that this came in with 357889eb1. Alvaro, Surafel, do you want to argue for keeping things as-is? regards, tom lane
Re: Teach predtest about IS [NOT] proofs
James Coleman writes: > On Wed, Dec 13, 2023 at 1:36 PM Tom Lane wrote: >> I don't have an objection in principle to adding more smarts to >> predtest.c. However, we should be wary of slowing down cases where >> no BooleanTests are present to be optimized. I wonder if it could >> help to use a switch on nodeTag rather than a series of if(IsA()) >> tests. (I'd be inclined to rewrite the inner if-then-else chains >> as switches too, really. You get some benefit from the compiler >> noticing whether you've covered all the enum values.) > I think I could take this on; would you prefer it as a patch in this > series? Or as a new patch thread? No, keep it in the same thread (and make a CF entry, if you didn't already). It might be best to make a series of 2 patches, first just refactoring what's there per this discussion, and then a second one to add BooleanTest logic. >> I note you've actively broken the function's ability to cope with >> NULL input pointers. Maybe we don't need it to, but I'm not going >> to accept a patch that just side-swipes that case without any >> justification. > [ all callers have previously used predicate_classify ] OK, fair enough. The checks for nulls are probably from ancient habit, but I agree we could remove 'em here. >> Perhaps, rather than hoping people will notice comments that are >> potentially offscreen from what they're modifying, we should relocate >> those comment paras to be adjacent to the relevant parts of the >> function? > Splitting up that block comment makes sense to me. Done, let's make it so. >> I've not gone through the patch in detail to see whether I believe >> the proposed proof rules. It would help to have more comments >> justifying them. > Most of them are sufficiently simple -- e.g., X IS TRUE implies X -- > that I don't think there's a lot to say in justification. In some > cases I've noted the cases that force only strong or weak implication. Yeah, it's the strong-vs-weak distinction that makes me cautious here. One's high-school-algebra instinct for what's obviously true tends to not think about NULL/UNKNOWN, and you do have to consider that. >>> As noted in a TODO in the patch itself, I think it may be worth refactoring >>> the test_predtest module to run the "x, y" case as well as the "y, x" case >> I think that's actively undesirable. It is not typically the case that >> a proof rule for A => B also works in the other direction, so this would >> encourage wasting cycles in the tests. I fear it might also cause >> confusion about which direction a proof rule is supposed to work in. > That makes sense in the general case. > Boolean expressions seem like a special case in that regard: (subject > to what it looks like) would you be OK with a wrapping function that > does both directions (with output that shows which direction is being > tested) used only for the cases where we do want to check both > directions? Using a wrapper where appropriate would remove the inefficiency concern, but I still worry whether it will promote confusion about which direction we're proving things in. You'll need to be very clear about the labeling. regards, tom lane
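To make the suggested refactoring concrete, here is a sketch of the dispatch shape (simplified; the equal() comparison stands in for the real recursion, so this is illustrative only):

	/* Sketch: dispatch once on the node tag instead of a chain of IsA() tests */
	switch (nodeTag(clause))
	{
		case T_BooleanTest:
			{
				BooleanTest *btest = (BooleanTest *) clause;

				/* "x IS TRUE" implies "x", both strongly and weakly */
				if (btest->booltesttype == IS_TRUE &&
					equal(btest->arg, predicate))
					return true;
				break;
			}

		case T_NullTest:
			/* existing IS [NOT] NULL proof rules go here */
			break;

		default:
			/* no special knowledge about other node types */
			break;
	}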
pg_serial bloat
Hi,

Our pg_serial truncation logic is a bit broken, as described by the comments in CheckPointPredicate() (a sort of race between xid cycles and checkpointing). We've seen a system with ~30GB of files in there (note: the full/untruncated size would be 2³² xids × sizeof(uint64_t) = 32GB). It's not just a gradual disk space leak: according to disk space monitoring, this system suddenly wrote ~half of that data, which I think must be the while loop in SerialAdd() zeroing out pages. Ouch.

I see a few questions:

1. How should we fix this fundamentally in future releases? One answer is to key SSI's xid lookup with FullTransactionId (conceptually cleaner IMHO but I'm not sure how far fxids need to 'spread' through the system to do it right). Another already mentioned in comments is to move some logic into vacuum so it can stay in sync with the xid cycle (maybe harder to think about and prove correct).
2. Could there be worse consequences than wasted disk and I/O?
3. Once a system reaches a bloated state like this, what can an administrator do?

I looked into question 3. I convinced myself that it must be safe to unlink all the files under pg_serial while the cluster is down, because:

* we don't need the data across restarts, it's just for spilling
* we don't need the 'head' file because slru.c opens with O_CREAT
* open(O_CREAT) followed by pwrite(..., offset) will create a harmless hole
* we never read files outside the tailXid/headXid range we've written
* we zero out pages as we add them in SerialAdd(), without reading

If I have that right, perhaps we should not merely advise that it is safe to do that manually, but proactively do it in SerialInit(). That is where we establish in shared memory that we don't expect there to be any files on disk, so it must be a good spot to make that true if it is not:

	if (!found)
	{
		/*
		 * Set control information to reflect empty SLRU.
		 */
		serialControl->headPage = -1;
		serialControl->headXid = InvalidTransactionId;
		serialControl->tailXid = InvalidTransactionId;
+
+		/* Also delete any files on disk. */
+		SlruScanDirectory(SerialSlruCtl, SlruScanDirCbDeleteAll, NULL);
	}

In common cases that would just readdir() an empty directory.

For testing, it is quite hard to convince predicate.c to write any files there: normally you have to overflow its transaction tracking, which requires more than (max backends + max prepared xacts) × 10 SERIALIZABLE transactions in just the right sort of overlapping pattern, so that the committed ones need to be spilled to disk. I might try to write a test for that, but it gets easier if you define TEST_SUMMARIZE_SERIAL. Then you don't need many transactions -- but you still need a slightly finicky schedule. Start with a couple of overlapping SSI transactions, then commit them, to get a non-empty FinishedSerializableTransaction list. Then create some more SSI transactions, which will call SerialAdd() due to the TEST_ macro. Then run a checkpoint, and you should see eg "" being created on demand during SLRU writeback, demonstrating that starting from an empty pg_serial directory is always OK. I wanted to try that to remind myself of how it all works, but I suppose it should be obvious that it's OK: initdb's initial state is an empty directory.

To create a bunch of junk files that are really just thin links for the above change to unlink, or to test the truncate code when it sees a 'full' directory, you can do:

	cd pg_serial
	dd if=/dev/zero of= bs=256k count=1
	awk 'BEGIN { for (i = 1; i <= 131071; i++) { printf("%04X\n", i); } }' | xargs -r -I {} ln {}
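Until something along those lines lands, the manual recovery implied by the analysis above is just this (assuming the reasoning above holds, and only with the cluster cleanly shut down):

	pg_ctl -D $PGDATA stop
	rm $PGDATA/pg_serial/*
	pg_ctl -D $PGDATA start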
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Hi, On Tue, Dec 12, 2023 at 10:04 PM jian he wrote: > > On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina > wrote: > > > > Hi! Thank you for your work. Your patch looks better! > > Yes, thank you! It works fine, and I see that the regression tests have > > been passed. > > However, when I ran 'copy from with save_error' operation with simple csv > > files (copy_test.csv, copy_test1.csv) for tables test, test1 (how I created > > it, I described below): > > > > postgres=# create table test (x int primary key, y int not null); > > postgres=# create table test1 (x int, z int, CONSTRAINT fk_x > > FOREIGN KEY(x) > > REFERENCES test(x)); > > > > I did not find a table with saved errors after operation, although I > > received a log about it: > > > > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV > > save_error > > NOTICE: 2 rows were skipped because of error. skipped row saved to table > > public.test_error > > ERROR: duplicate key value violates unique constraint "test_pkey" > > DETAIL: Key (x)=(2) already exists. > > CONTEXT: COPY test, line 3 > > > > postgres=# select * from public.test_error; > > ERROR: relation "public.test_error" does not exist > > LINE 1: select * from public.test_error; > > > > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV > > save_error > > NOTICE: 2 rows were skipped because of error. skipped row saved to table > > public.test1_error > > ERROR: insert or update on table "test1" violates foreign key constraint > > "fk_x" > > DETAIL: Key (x)=(2) is not present in table "test". > > > > postgres=# select * from public.test1_error; > > ERROR: relation "public.test1_error" does not exist > > LINE 1: select * from public.test1_error; > > > > Two lines were written correctly in the csv files, therefore they should > > have been added to the tables, but they were not added to the tables test > > and test1. > > > > If I leave only the correct rows, everything works fine and the rows are > > added to the tables. > > > > in copy_test.csv: > > > > 2,0 > > > > 1,1 > > > > in copy_test1.csv: > > > > 2,0 > > > > 2,1 > > > > 1,1 > > > > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV > > COPY 2 > > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV > > save_error > > NOTICE: No error happened.Error holding table public.test1_error will be > > droped > > COPY 3 > > > > Maybe I'm launching it the wrong way. If so, let me know about it. > > looks like the above is about constraints violation while copying. > constraints violation while copying not in the scope of this patch. > > Since COPY FROM is very like the INSERT command, > you do want all the valid constraints to check all the copied rows? > > but the notice raised by the patch is not right. > So I place the drop error saving table or raise notice logic above > `ExecResetTupleTable(estate->es_tupleTable, false)` in the function > CopyFrom. > > > > > I also notice interesting behavior if the table was previously created by > > the user. When I was creating an error_table before the 'copy from' > > operation, > > I received a message saying that it is impossible to create a table with > > the same name (it is shown below) during the 'copy from' operation. > > I think you should add information about this in the documentation, since > > this seems to be normal behavior to me. > > > > doc changed. you may check it. I've read this thread and the latest patch. 
IIUC with the SAVE_ERROR option, COPY FROM creates an error table for the target table and writes error information there. While I agree that the final shape of this feature would be something like that design, I'm concerned that some features are missing to make it useful in practice.

For instance, error logs are inserted into the error tables without bound, meaning that users who want to tolerate errors during COPY FROM will have to truncate or drop the error tables periodically, or the database will grow with error logs without limit. Ideally such maintenance work should be done by the database itself. There might also be users who want to log such conversion errors to the server log to avoid that maintenance work. I think we should provide an option for where to write, at least.

Also, since the error tables are normal user tables internally, error logs are also replicated to subscribers if there is a publication FOR ALL TABLES, unlike system catalogs. I think some users would not like such behavior.

Looking at the SAVE_ERROR feature closely, I think it consists of two separate features. That is, it enables COPY FROM to load data while (1) tolerating errors and (2) logging errors to somewhere (i.e., an error table). If we implemented only (1), it would be like COPY FROM tolerating errors infinitely and logging them to /dev/null. The user could not see the error details, but I guess it could still help some cases as Andres mentioned[1] (it might be a good idea to
Re: Make COPY format extendable: Extract COPY TO format implementations
On Thu, Dec 14, 2023 at 6:44 PM Sutou Kouhei wrote: > > Hi, > > In > "Re: Make COPY format extendable: Extract COPY TO format implementations" > on Mon, 11 Dec 2023 10:57:15 +0900, > Masahiko Sawada wrote: > > > IIUC we cannot create two same name functions with the same arguments > > but a different return value type in the first place. It seems to me > > to be an overkill to change such a design. > > Oh, sorry. I didn't notice it. > > > Another idea is to encapsulate copy_to/from_handler by a super class > > like copy_handler. The handler function is called with an argument, > > say copyto, and returns copy_handler encapsulating either > > copy_to/from_handler depending on the argument. > > It's for using "${copy_format_name}" such as "json" and > "parquet" as a function name, right? Right. > If we use the > "${copy_format_name}" approach, we can't use function names > that are already used by tablesample method handler such as > "system" and "bernoulli" for COPY FORMAT name. Because both > of tablesample method handler function and COPY FORMAT > handler function use "(internal)" as arguments. > > I think that tablesample method names and COPY FORMAT names > will not be conflicted but the limitation (using the same > namespace for tablesample method and COPY FORMAT) is > unnecessary limitation. Presumably, such function name collisions are not limited to tablesample and copy, but apply to all functions that have an "internal" argument. To avoid collisions, extensions can be created in a different schema than public. And note that built-in format copy handler doesn't need to declare its handler function. > > How about using prefix ("copy_to_${copy_format_name}" or > something) or suffix ("${copy_format_name}_copy_to" or > something) for function names? For example, > "copy_to_json"/"copy_from_json" for "json" COPY FORMAT. > > ("copy_${copy_format_name}" that returns copy_handler > encapsulating either copy_to/from_handler depending on the > argument may be an option.) While there is a way to avoid collision as I mentioned above, I can see the point that we might want to avoid using a generic function name such as "arrow" and "parquet" as custom copy handler functions. Adding a prefix or suffix would be one option but to give extensions more flexibility, another option would be to support format = 'custom' and add the "handler" option to specify a copy handler function name to call. For example, COPY ... FROM ... WITH (FORMAT = 'custom', HANDLER = 'arrow_copy_handler'). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
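To illustrate the HANDLER idea, a rough sketch of what an extension might provide -- every name here (the copy_handler return type, the routine structs, the function itself) is an assumption for illustration, since this interface is exactly what is being designed in this thread:

	CREATE FUNCTION arrow_copy_handler(internal) RETURNS copy_handler
	    AS 'MODULE_PATHNAME' LANGUAGE C;

	/* C side: return the TO or FROM routine depending on the direction */
	PG_FUNCTION_INFO_V1(arrow_copy_handler);

	Datum
	arrow_copy_handler(PG_FUNCTION_ARGS)
	{
		bool		is_from = PG_GETARG_BOOL(0);

		if (is_from)
			PG_RETURN_POINTER(&arrow_copy_from_routine);
		PG_RETURN_POINTER(&arrow_copy_to_routine);
	}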
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023, at 12:44 PM, Ashutosh Bapat wrote: > I haven't found the code path from where the sequence cleanup gets > called. But it's being called. Am I missing something? ReorderBufferCleanupTXN. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Add --check option to pgindent
On Wed Dec 13, 2023 at 2:46 PM CST, Andrew Dunstan wrote: On 2023-12-12 Tu 10:30, Alvaro Herrera wrote: > On 2023-Dec-12, Tom Lane wrote: > >> "Euler Taveira" writes: >>> When you add exceptions, it starts to complicate the UI. >> Indeed. It seems like --silent-diff was poorly defined and poorly >> named, and we need to rethink that option along the way to adding >> this behavior. The idea that --show-diff and --silent-diff can >> be used together is just inherently confusing, because they sound >> like opposites > Maybe it's enough to rename --silent-diff to --check. You can do > "--show-diff --check" and get both the error and the diff printed; or > just "--check" and it'll throw an error without further ado; or > "--show-diff" and it will both apply the diff and print it. > That seems reasonable. These features were fairly substantially debated when we put them in, but I'm fine with some tweaking. But note: --show-diff doesn't apply the diff, it's intentionally non-destructive. Here is a new patch: - Renames --silent-diff to --check - Renames --show-diff to --diff - Allows one to use --check and --diff in the same command I am not tied to the second patch if people like --show-diff better than --diff. Weirdly enough, my email client doesn't show this as part of the original thread, but I will respond here anyway. -- Tristan Partin Neon (https://neon.tech) From f0963daa95b222b00a7ae15d6702141352d3a81d Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Mon, 11 Dec 2023 17:34:17 -0600 Subject: [PATCH v2 1/3] Rename --silent-diff to --check in pgindent --check should be a little bit more self-explanatory for people coming from other similar formatter/linting tooling. --- src/tools/pgindent/pgindent | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent index bce63d95da..fe0bd21f4a 100755 --- a/src/tools/pgindent/pgindent +++ b/src/tools/pgindent/pgindent @@ -23,7 +23,7 @@ my $devnull = File::Spec->devnull; my ($typedefs_file, $typedef_str, @excludes, $indent, $build, $show_diff, - $silent_diff, $help, @commits,); + $check, $help, @commits,); $help = 0; @@ -35,13 +35,13 @@ my %options = ( "excludes=s" => \@excludes, "indent=s" => \$indent, "show-diff" => \$show_diff, - "silent-diff" => \$silent_diff,); + "check" => \$check,); GetOptions(%options) || usage("bad command line argument"); usage() if $help; -usage("Cannot have both --silent-diff and --show-diff") - if $silent_diff && $show_diff; +usage("Cannot have both --check and --show-diff") + if $check && $show_diff; usage("Cannot use --commit with command line file list") if (@commits && @ARGV); @@ -324,7 +324,7 @@ Options: --excludes=PATH file containing list of filename patterns to ignore --indent=PATH path to pg_bsd_indent program --show-diff show the changes that would be made - --silent-diff exit with status 2 if any changes would be made + --check exit with status 2 if any changes would be made The --excludes and --commit options can be given more than once. 
EOF if ($help) @@ -417,7 +417,12 @@ foreach my $source_filename (@files) if ($source ne $orig_source) { - if ($silent_diff) + if ($check) + { + print show_diff($source, $source_filename); + exit 2; + } + elsif ($check) { exit 2; } -- Tristan Partin Neon (https://neon.tech) From ddd5cdaf8e967ddf84beb49ca64b5bed039b71ac Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Thu, 14 Dec 2023 10:36:05 -0600 Subject: [PATCH v2 2/3] Rename --show-diff to --diff in pgindent --diff should be a little bit more self-explanatory for people coming from other similar formatter/linting tooling. --- src/tools/pgindent/pgindent | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent index fe0bd21f4a..067b77be54 100755 --- a/src/tools/pgindent/pgindent +++ b/src/tools/pgindent/pgindent @@ -22,7 +22,7 @@ my $indent_opts = my $devnull = File::Spec->devnull; my ($typedefs_file, $typedef_str, @excludes, - $indent, $build, $show_diff, + $indent, $build, $diff, $check, $help, @commits,); $help = 0; @@ -34,14 +34,14 @@ my %options = ( "list-of-typedefs=s" => \$typedef_str, "excludes=s" => \@excludes, "indent=s" => \$indent, - "show-diff" => \$show_diff, + "diff" => \$diff, "check" => \$check,); GetOptions(%options) || usage("bad command line argument"); usage() if $help; -usage("Cannot have both --check and --show-diff") - if $check && $show_diff; +usage("Cannot have both --check and --diff") + if $check && $diff; usage("Cannot use --commit with command line file list") if (@commits && @ARGV); @@ -294,7 +294,7 @@ sub run_indent return $source; } -sub show_diff +sub diff { my $indented = shift; my $source_filename = shift; @@ -323,7 +323,7 @@ Options:
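For reference, with these patches applied the intended usage is:

	pgindent --check src/backend/foo.c          # exit status 2 if reindentation is needed
	pgindent --check --diff src/backend/foo.c   # same, but also print the diff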
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
On Fri, 1 Dec 2023 at 05:20, Nathan Bossart wrote: > Could we simplify it with something like the following? Great suggestion! Updated the patchset accordingly. This made it also easy to change the final patch to include the automatic scoped declaration logic for all of the new macros. I quite like how the calling code changes to not have to declare the variable. But it's definitely a larger divergence from the status quo than without patch 0003. So I'm not sure if it's desired. Finally, I also renamed the functions to use foreach instead of for_each, since based on this thread that seems to be the generally preferred naming. v6-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch Description: Binary data v6-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data v6-0003-Even-more-convenient-macros.patch Description: Binary data
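For anyone skimming the thread, the idea boils down to macros of roughly this shape -- a simplified sketch, not the actual definitions in the patches (which handle scoping and evaluation more carefully):

	/* sketch: iterate over an integer List without declaring a ListCell */
	#define foreach_int(var, lst) \
		for (int var##_i = 0, var = 0; \
			 var##_i < list_length(lst) && \
			 ((var) = list_nth_int((lst), var##_i), true); \
			 var##_i++)

	/* usage: the loop variable needs no separate declaration */
	foreach_int(x, int_list)
		total += x;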
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila wrote: > > On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat > wrote: > > > > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote: > > > > > > I think you forgot to attach the patch. > > > > Sorry. Here it is. > > > > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote: > > > > > > > > > > It looks like the solution works. But this is the only place where we > > > process a change before SNAPSHOT reaches FULL. But this is also the > > > only record which affects a decision to queue/not a following change. > > > So it should be ok. The sequence_hash'es as separate for each > > > transaction and they are cleaned when processing COMMIT record. > > > > > > > > > > But it is possible that even commit or abort also happens before the > > > snapshot reaches full state in which case the hash table will have > > > stale or invalid (for aborts) entries. That will probably be cleaned > > > at a later point by running_xact records. > > > > Why would cleaning wait till running_xact records? Won't txn entry > > itself be removed when processing commit/abort record? At the same the > > sequence hash will be cleaned as well. > > > > > Now, I think in theory, it > > > is possible that the same RelFileLocator can again be allocated before > > > we clean up the existing entry which can probably confuse the system. > > > > How? The transaction allocating the first time would be cleaned before > > it happens the second time. So shouldn't matter. > > > > It can only be cleaned if we process it but xact_decode won't allow us > to process it and I don't think it would be a good idea to add another > hack for sequences here. See below code: > > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > { > SnapBuild *builder = ctx->snapshot_builder; > ReorderBuffer *reorder = ctx->reorder; > XLogReaderState *r = buf->record; > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; > > /* > * If the snapshot isn't yet fully built, we cannot decode anything, so > * bail out. > */ > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) > return; That may be true for a transaction which is decoded, but I think all the transactions which are added to ReorderBuffer should be cleaned up once they have been processed irrespective of whether they are decoded/sent downstream or not. In this case I see the sequence hash being cleaned up for the sequence related transaction in Hayato's reproducer. See attached patch with a diagnostic change and the output below (notice sequence cleanup called on transaction 767). 2023-12-14 21:06:36.756 IST [386957] LOG: logical decoding found initial starting point at 0/15B2F68 2023-12-14 21:06:36.756 IST [386957] DETAIL: Waiting for transactions (approximately 1) older than 767 to end. 2023-12-14 21:06:36.756 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: seq_decode. 
snapshot is SNAPBUILD_BUILDING_SNAPSHOT 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.679 IST [386957] LOG: XXX: skipped 2023-12-14 21:07:05.679 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:05.710 IST [386957] LOG: logical decoding found initial consistent point at 0/15B3388 2023-12-14 21:07:05.710 IST [386957] DETAIL: Waiting for transactions (approximately 1) older than 768 to end. 2023-12-14 21:07:05.710 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:39.292 IST [386298] LOG: checkpoint starting: time 2023-12-14 21:07:40.919 IST [386957] LOG: XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: XXX: the sequence is transactional 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: sequence cleanup called on transaction 767 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); 2023-12-14 21:07:40.919 IST [386957] LOG: logical decoding found consistent point at 0/15B3518 2023-12-14 21:07:40.919 IST [386957] DETAIL: There are no running transactions. 2023-12-14 21:07:40.919 IST [386957] STATEMENT: SELECT pg_create_logical_replication_slot('slot', 'test_decoding'); We see similar output when pg_logical_slot_get_changes() is called. I haven't found the code path from where the sequence cleanup gets called. But it's being called. Am I
Re: Clean up find_typedefs and add support for Mac
On Thu Dec 14, 2023 at 9:16 AM CST, Andrew Dunstan wrote: On 2023-12-13 We 15:59, Tristan Partin wrote: > On Wed Dec 13, 2023 at 2:35 PM CST, Andrew Dunstan wrote: >> >> On 2023-12-12 Tu 18:02, Tom Lane wrote: >> > "Tristan Partin" writes: >> >> The big patch here is adding support for Mac. objdump -W doesn't >> work on >> >> Mac. So, I used dsymutil and dwarfdump to achieve the same result. >> > We should probably nuke the current version of src/tools/find_typedef >> > altogether in favor of copying the current buildfarm code for that. >> > We know that the buildfarm's version works, whereas I'm not real sure >> > that src/tools/find_typedef is being used in anger by anyone. Also, >> > we're long past the point where developers can avoid having Perl >> > installed. >> > >> > Ideally, the buildfarm client would then start to use find_typedef >> > from the tree rather than have its own copy, both to reduce >> > duplication and to ensure that the in-tree copy keeps working. >> > >> > >> >> >> +1. I'd be more than happy to be rid of maintaining the code. We >> already performed a rather more complicated operation along these >> lines with the PostgreSQL::Test::AdjustUpgrade stuff. > > I'll work with you on GitHub to help make the transition. I've already > made a draft PR in the client-code repo, but I am sure I am missing > some stuff. I think we're putting the cart before the horse here. I think we need a perl module in core that can be loaded by the buildfarm code, and could also be used by a standalone find_typedef (c.f. src/test/PostgreSQL/Test/AdjustUpgrade.pm). To be useful that would have to be backported. I'll have a go at that. Here is an adaptation of the find_typedefs from run_build.pl. Maybe it will help you. -- Tristan Partin Neon (https://neon.tech) find_typedefs.pl Description: Perl program
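The Mac-specific part of that script reduces to roughly this pipeline (a sketch; the exact dwarfdump output format differs between Apple's and LLVM's tools, which is what the Perl adaptation has to cope with):

	# macOS has no objdump -W: extract DWARF with dsymutil, then scan it
	dsymutil -f -o postgres.dwarf src/backend/postgres
	dwarfdump --debug-info postgres.dwarf |
	    grep -A2 TAG_typedef |
	    perl -ne 'print "$1\n" if /AT_name\W*"(\w+)"/' |
	    sort -u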
Re: Clean up find_typedefs and add support for Mac
On 2023-12-13 We 15:59, Tristan Partin wrote: On Wed Dec 13, 2023 at 2:35 PM CST, Andrew Dunstan wrote: On 2023-12-12 Tu 18:02, Tom Lane wrote: > "Tristan Partin" writes: >> The big patch here is adding support for Mac. objdump -W doesn't work on >> Mac. So, I used dsymutil and dwarfdump to achieve the same result. > We should probably nuke the current version of src/tools/find_typedef > altogether in favor of copying the current buildfarm code for that. > We know that the buildfarm's version works, whereas I'm not real sure > that src/tools/find_typedef is being used in anger by anyone. Also, > we're long past the point where developers can avoid having Perl > installed. > > Ideally, the buildfarm client would then start to use find_typedef > from the tree rather than have its own copy, both to reduce > duplication and to ensure that the in-tree copy keeps working. > > +1. I'd be more than happy to be rid of maintaining the code. We already performed a rather more complicated operation along these lines with the PostgreSQL::Test::AdjustUpgrade stuff. I'll work with you on GitHub to help make the transition. I've already made a draft PR in the client-code repo, but I am sure I am missing some stuff. I think we're putting the cart before the horse here. I think we need a perl module in core that can be loaded by the buildfarm code, and could also be used by a standalone find_typedef (c.f. src/test/PostgreSQL/Test/AdjustUpgrade.pm). To be useful that would have to be backported. I'll have a go at that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: encoding affects ICU regex character classification
On Tue, 2023-12-12 at 14:35 -0800, Jeremy Schneider wrote: > Is someone able to test out upper & lower functions on U+A7BA ... > U+A7BF > across a few libs/versions? Those code points are unassigned in Unicode 11.0 and assigned in Unicode 12.0. In ICU 63-2 (based on Unicode 11.0), they just get mapped to themselves. In ICU 64-2 (based on Unicode 12.1) they get mapped the same way the builtin CTYPE maps them (based on Unicode 15.1). The concern over unassigned code points is misplaced. The application may be aware of newly-assigned code points, and there's no way they will be mapped correctly in Postgres if the provider is not aware of those code points. The user can either proceed in using unassigned code points and accept the risk of future changes, or wait for the provider to be upgraded. If the user doesn't have many expression indexes dependent on ctype behavior, it doesn't matter much. If they do have such indexes, the best we can offer is a controlled process, and the builtin provider allows the most visibility and control. (Aside: case mapping has very strong compatibility guarantees, but not perfect. For better compatibility guarantees, we should support case folding.) > And I have no idea if or when > glibc might have picked up the new unicode characters. That's a strong argument in favor of a builtin provider. Regards, Jeff Davis
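For anyone who wants to run the comparison, something like this in a UTF-8 database (on builds with different ICU versions, and with the builtin provider once available) shows the mappings directly:

	SELECT to_hex(c) AS codepoint,
	       upper(chr(c) COLLATE "en-x-icu") AS upper,
	       lower(chr(c) COLLATE "en-x-icu") AS lower
	FROM generate_series(42938, 42943) AS s(c);  -- U+A7BA .. U+A7BF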
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Dec 14, 2023 at 4:36 PM Dilip Kumar wrote: > > On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin > wrote: > > > > On 12 Dec 2023, at 18:28, Alvaro Herrera wrote: > > > > > > Andrey, do you have any stress tests or anything else that you used to > > > gain confidence in this code? > > I have done some more testing for the clog group update as the attached test file executes two concurrent scripts executed with pgbench, the first script is the slow script which will run 10-second long transactions and the second script is a very fast transaction with ~1 transactions per second. Along with that, I have also changed the bank size such that each bank will contain just 1 page i.e. 32k transactions per bank. I have done this way so that we do not need to keep long-running transactions running for very long in order to get the transactions from different banks committed during the same time. With this test, I have got that behavior and the below logs shows that multiple transaction range which is in different slru-bank (considering 32k transactions per bank) are doing group update at the same time. e.g. in the below logs, we can see xid range around 70600, 70548, and 70558, and xid range around 755, and 752 are getting group updates by different leaders but near the same time. It is running fine when running for a long duration, but I am not sure how to validate the sanity of this kind of test. 2023-12-14 14:43:31.813 GMT [3306] LOG: group leader procno 606 updated status of procno 606 xid 70600 2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 for xid 70548 added for group update 2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 is group leader and got the lock 2023-12-14 14:43:31.816 GMT [3326] LOG: group leader procno 586 updated status of procno 586 xid 70548 2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 for xid 70558 added for group update 2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 is group leader and got the lock 2023-12-14 14:43:31.818 GMT [3327] LOG: group leader procno 585 updated status of procno 585 xid 70558 2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 for xid 752 added for group update 2023-12-14 14:43:31.829 GMT [3207] LOG: procno 669 for xid 755 added for group update 2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 is group leader and got the lock 2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 669 xid 755 2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 687 xid 752 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com # Goal of this script to generate scenario where some old long running slow # transaction get committed with the new transactions such that they falls in # different slru banks rm -rf pgdata ./initdb -D pgdata echo "max_wal_size=20GB" >> pgdata/postgresql.conf echo "shared_buffers=20GB" >> pgdata/postgresql.conf echo "checkpoint_timeout=40min" >> pgdata/postgresql.conf echo "max_connections=700" >> pgdata/postgresql.conf echo "maintenance_work_mem=1GB" >> pgdata/postgresql.conf echo "subtrans_buffers=64" >> pgdata/postgresql.conf echo "multixact_members_buffers=128" >> pgdata/postgresql.conf #create slow_txn.sql script cat > slow_txn.sql << EOF BEGIN; INSERT INTO test VALUES(1); DELETE FROM test WHERE a=1; select pg_sleep(10); COMMIT; EOF #create fast_txn.sql script cat > fast_txn.sql << EOF BEGIN; INSERT INTO test1 VALUES(1); DELETE FROM test1 WHERE a=1; COMMIT; EOF ./pg_ctl -D pgdata -l logfile -c start ./psql -d postgres -c "create table 
test(a int)" ./psql -d postgres -c "create table test1(a int)" ./pgbench -i -s 1 postgres ./pgbench -f slow_txn.sql -c 28 -j 28 -P 1 -T 60 postgres & ./pgbench -f fast_txn.sql -c 100 -j 100 -P 1 -T 60 postgres sleep(10); ./pg_ctl -D pgdata -l logfile -c stop
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On 12.12.2023 16:04, jian he wrote: On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina wrote: Hi! Thank you for your work. Your patch looks better! Yes, thank you! It works fine, and I see that the regression tests have been passed. However, when I ran 'copy from with save_error' operation with simple csv files (copy_test.csv, copy_test1.csv) for tables test, test1 (how I created it, I described below): postgres=# create table test (x int primary key, y int not null); postgres=# create table test1 (x int, z int, CONSTRAINT fk_x FOREIGN KEY(x) REFERENCES test(x)); I did not find a table with saved errors after operation, although I received a log about it: postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error NOTICE: 2 rows were skipped because of error. skipped row saved to table public.test_error ERROR: duplicate key value violates unique constraint "test_pkey" DETAIL: Key (x)=(2) already exists. CONTEXT: COPY test, line 3 postgres=# select * from public.test_error; ERROR: relation "public.test_error" does not exist LINE 1: select * from public.test_error; postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error NOTICE: 2 rows were skipped because of error. skipped row saved to table public.test1_error ERROR: insert or update on table "test1" violates foreign key constraint "fk_x" DETAIL: Key (x)=(2) is not present in table "test". postgres=# select * from public.test1_error; ERROR: relation "public.test1_error" does not exist LINE 1: select * from public.test1_error; Two lines were written correctly in the csv files, therefore they should have been added to the tables, but they were not added to the tables test and test1. If I leave only the correct rows, everything works fine and the rows are added to the tables. in copy_test.csv: 2,0 1,1 in copy_test1.csv: 2,0 2,1 1,1 postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV COPY 2 postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error NOTICE: No error happened.Error holding table public.test1_error will be droped COPY 3 Maybe I'm launching it the wrong way. If so, let me know about it. looks like the above is about constraints violation while copying. constraints violation while copying not in the scope of this patch. Since COPY FROM is very like the INSERT command, you do want all the valid constraints to check all the copied rows? No, I think it will be too much. but the notice raised by the patch is not right. So I place the drop error saving table or raise notice logic above `ExecResetTupleTable(estate->es_tupleTable, false)` in the function CopyFrom. Yes, I see it and agree with you. I also notice interesting behavior if the table was previously created by the user. When I was creating an error_table before the 'copy from' operation, I received a message saying that it is impossible to create a table with the same name (it is shown below) during the 'copy from' operation. I think you should add information about this in the documentation, since this seems to be normal behavior to me. doc changed. you may check it. Yes, I saw it. Thank you. -- Regards, Alena Rybakina Postgres Professional:http://www.postgrespro.com The Russian Postgres Company
Re: Built-in CTYPE provider
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote: > In particular "el" (modern greek) has case mapping rules that > ICU seems to implement, but "el" is missing from the list > ("lt", "tr", and "az") you identified. I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the ctype semantics match C.UTF-8 for all code points. glibc is not doing this additional tailoring for "el". Therefore I believe the builtin CTYPE would be very useful for case mapping (both "simple" and "full") even without this additional tailoring. You are correct that ICU will still have some features that won't be supported by the builtin provider. Better word boundary semantics in INITCAP() are another advantage. Regards, Jeff Davis
Re: Simplify newNode()
On 14/12/2023 10:32, Peter Eisentraut wrote: I notice that the existing comments point out that the size argument should be a compile-time constant, but that is no longer the case for ExtensibleNode(). Also, newNode() is the only caller of palloc0fast(), which also points out that the size argument should be a compile-time constant, and palloc0fast() is the only caller of MemSetTest(). I can see how an older compiler might have gotten too confused by all that. But if we think that compilers are now smart enough, maybe we can unwind this whole stack a bit more? Maybe we don't need MemSetTest() and/or palloc0fast() and/or newNode() at all? Good point. Looking closer, modern compilers will actually turn the MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. Here's a link to a godbolt snippet to play with this: https://godbolt.org/z/9b89P3c8x (full link at [0]). Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. It's not doing any good as it is, as it gets compiled to be identical to MemoryContextAllocZero. (There are small differences depending compiler and version, but e.g. on gcc 13.2, the code generated for MemoryContextAllocZero() is actually smaller even though both call memset()) Another approach would be to add more hints to MemoryContextAllocZeroAligned to dissuade the compiler from turning the loop into a memset() call. If you add an "if (size > 1024) abort" there, then gcc 13 doesn't optimize into a memset() call, but clang still does. Some micro-benchmarks on that would be nice. But given that the compiler has been doing this optimization for a while and we haven't noticed, I think memset() should be considered the status quo, and converting it to a loop again should be considered a new optimization. Also, replacing MemoryContextAllocZeroAligned(CurrentMemoryContext, size) with palloc0(size) has one fewer argument and the assembly code of the call has one fewer instruction. That's something too. [0]
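Concretely, the unwound newNode() could be as simple as this (a sketch of the direction being discussed, not necessarily the final committed form):

	static inline Node *
	newNode(size_t size, NodeTag tag)
	{
		Node	   *result;

		Assert(size >= sizeof(Node));	/* need the tag, at least */
		result = (Node *) palloc0(size);
		result->type = tag;

		return result;
	}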
Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs
Hi, Sorry for the late reply. On Fri, 6 Oct 2023 at 17:07, Tom Lane wrote: > > As a quick cross-check, I searched our commit log to see how many > README-only commits there were so far this year. I found 11 since > January. (Several were triggered by the latest round of pgindent > code and process changes, so maybe this is more than typical.) > > Not sure what that tells us about the value of changing the CI > logic, but it does seem like it could be worth the one-liner > change needed to teach buildfarm animals to ignore READMEs. I agree that it could be worth implementing this logic on buildfarm animals. In case we want to implement the same logic on the CI, I added a new version of the patch; it skips CI completely if the changes are only in the README files. -- Regards, Nazir Bilal Yavuz Microsoft From 6c268233d13262a31965baf2dbb42913d83dab1d Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 14 Dec 2023 16:23:28 +0300 Subject: [PATCH v3] If the changes are only in the README files, skip all the tasks. If the changes are only in the README files, skip all the tasks to save CI time. --- .cirrus.tasks.yml | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml index e137769850..dfd12f8b04 100644 --- a/.cirrus.tasks.yml +++ b/.cirrus.tasks.yml @@ -4,6 +4,12 @@ # further details, see src/tools/ci/README +# If the changes are only in the README files, skip all the tasks. +# +# This skip is overriden in 'SanityCheck' task. So, the same check is +# added 'SanityCheck' task too. +skip: changesIncludeOnly('**/README') + env: # The lower depth accelerates git clone. Use a bit of depth so that # concurrent tasks and retrying older jobs have a chance of working. @@ -55,11 +61,12 @@ on_failure_meson: _failure_meson task: name: SanityCheck - # If a specific OS is requested, don't run the sanity check. This shortens - # push-wait-for-ci cycle time a bit when debugging operating system specific - # failures. Uses skip instead of only_if, as cirrus otherwise warns about - # only_if conditions not matching. - skip: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:.*' + # Don't run the sanity check if the changes are only in the README files or + # if a specific OS is requested. The latter shortens push-wait-for-ci cycle + # time a bit when debugging operating system specific failures. Uses skip + # instead of only_if, as cirrus otherwise warns about only_if conditions not + # matching. + skip: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:.*' || changesIncludeOnly('**/README') env: CPUS: 4 -- 2.43.0
Re: Improve upcasting for INT range and multi range types
Hi,

Thanks for the reply. I suspected that there were technical reasons preventing the obvious right thing from being done. Would adding overloads to the functions and operators be something that could be considered an acceptable solution?

I've tried a very naive solution and it seems to work (there are surely better options for declaring the function overloads):

	begin;

	create function elem_contained_by_range(int4, int8range) returns boolean as $$
	    select elem_contained_by_range($1::int8, $2)
	$$ LANGUAGE SQL;

	create function elem_contained_by_range(int8, int4range) returns boolean as $$
	    select elem_contained_by_range($1, $2::text::int8range)
	$$ LANGUAGE SQL;

	create operator <@ (
	    LEFTARG = int4,
	    RIGHTARG = int8range,
	    FUNCTION = elem_contained_by_range,
	    RESTRICT = rangesel,
	    JOIN = contjoinsel,
	    HASHES, MERGES
	);

	create operator <@ (
	    LEFTARG = int8,
	    RIGHTARG = int4range,
	    FUNCTION = elem_contained_by_range,
	    RESTRICT = rangesel,
	    JOIN = contjoinsel,
	    HASHES, MERGES
	);

	select 2::int4 <@ '[1,9)'::int8range;
	select 2::int8 <@ '[1,9)'::int4range;

	rollback;

The major drawback is that every operator-type combination would need its own overload, creating a large number of them. As a side note, it seems that int4range cannot be cast automatically to int8range.

Best regards,
Federico

On Wed, 13 Dec 2023 at 05:16, Tom Lane wrote:
>
> jian he writes:
> > Based on my interpretation, I don't think SELECT 2::INT4 <@ '[1,
> > 4)'::INT8RANGE is doable.
>
> Yeah, it would require a considerable expansion of the scope of
> what can be matched by a polymorphic operator. I'm afraid that
> the negative consequences (mainly, "ambiguous operator" failures
> because more than one thing can be matched) would outweigh the
> benefits. It is kind of annoying though that the system can't
> do the "obvious" right thing here.
>
> regards, tom lane
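On the casting side note: the gap can be papered over locally by routing through the text representation, the same trick the second overload above uses internally. A sketch follows; whether making it IMPLICIT is safe against ambiguous-operator surprises is a separate question:

	CREATE CAST (int4range AS int8range) WITH INOUT AS IMPLICIT;

	SELECT '[1,9)'::int4range::int8range;  -- now succeeds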
Re: Relation bulk write facility
Melanie just reminded me about an older thread about this same thing: https://www.postgresql.org/message-id/CAAKRu_ZQEpk6Q1WtNLgfXBdCmdU5xN_w0boVO6faO_Ax%2Bckjig%40mail.gmail.com. I had completely forgotten about that.

Melanie's patches in that thread implemented the same optimization of avoiding the fsync() if no checkpoint has happened during the index build. My patch here also implements batching the WAL records of multiple blocks, which was not part of those older patches. OTOH, those patches included an additional optimization of not bypassing the shared buffer cache if the index is small. That seems sensible too.

In this new patch, I subconsciously implemented an API close to what I suggested at the end of that old thread. So I'd like to continue this effort based on this new patch. We can add the bypass-buffer-cache optimization later on top of this. With the new API that this introduces, it should be an isolated change to the implementation, with no changes required to the callers.

--
Heikki Linnakangas
Neon (https://neon.tech)
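For readers who haven't opened the patch, a caller-side sketch of the API shape (function and type names as in the patch right now; treat them as provisional):

	BulkWriteState *bulkstate;
	BulkWriteBuffer buf;
	Page		page;

	bulkstate = smgr_bulk_start_rel(rel, MAIN_FORKNUM);

	buf = smgr_bulk_get_buf(bulkstate);
	page = (Page) buf;
	PageInit(page, BLCKSZ, 0);
	/* ... fill the page contents ... */
	smgr_bulk_write(bulkstate, 0, buf, true);	/* true = standard page layout */

	/* flushes everything out; fsyncs (or registers the sync) only if needed */
	smgr_bulk_finish(bulkstate);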
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
On Mon, 13 Nov 2023 at 03:39, Thomas Munro wrote:
> We follow the common en-US usage: "canceled", "canceling" but
> "cancellation". Blame Webstah et al.

I changed all the places that were not adhering to those spellings. There were also a few such places in parts of the codebase that these changes didn't touch. I included a new 0001 patch to fix those.

I do feel like this patchset is pretty much in a committable state, so it would be very much appreciated if any committer could help push it over the finish line.

v23-0002-Return-2-from-pqReadData-on-EOF.patch Description: Binary data
v23-0001-Fix-spelling-of-canceled-cancellation.patch Description: Binary data
v23-0004-Start-using-new-libpq-cancel-APIs.patch Description: Binary data
v23-0003-Add-non-blocking-version-of-PQcancel.patch Description: Binary data
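For context, the non-blocking flow in 0003 is meant to be driven like a regular connection attempt. A rough caller-side sketch (function names taken from the patch series and assumed here, so double-check against the latest version):

	PGcancelConn *cancelConn = PQcancelCreate(conn);

	if (PQcancelStart(cancelConn))
	{
		while (true)
		{
			PostgresPollingStatusType st = PQcancelPoll(cancelConn);

			if (st == PGRES_POLLING_OK || st == PGRES_POLLING_FAILED)
				break;
			/* wait on PQcancelSocket(cancelConn) for read/write readiness */
		}
	}
	if (PQcancelStatus(cancelConn) != CONNECTION_OK)
		fprintf(stderr, "cancel failed: %s\n", PQcancelErrorMessage(cancelConn));
	PQcancelFinish(cancelConn);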
Re: btree_gist into core?
Hi, On Wed, Jan 19, 2022 at 09:30:11AM +0100, Peter Eisentraut wrote: > > To use exclusion constraints in practice, you often need to install the > btree_gist extension, so that you can combine for example a range type check > and normal scalar key columns into one constraint. > > [...] > > There are also of course questions about how to smoothly arrange upgrades > from extensions to the built-in situations. I'm not sure if that's what you were thinking of, but I know at least one extension (that I'm maintaining) that explicitly relies on btree_gist extension, as in "requires = [...], btree_gist" in the .control file. Since you can't really tweak the control file on a per-major-version basis, this will require some extra thoughts to make sure that people can release extensions without having to tweak this file in some make rule or something like that.
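Concretely, the dependency lives in the extension's control file, something like this (illustrative contents):

	# my_extension.control (illustrative)
	comment = 'extension relying on btree_gist operator classes'
	default_version = '1.0'
	relocatable = true
	requires = 'btree_gist'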
Re: Memory consumed by paths during partitionwise join planning
On Thu, Dec 7, 2023 at 6:19 PM David Rowley wrote:
>
> Maybe we can try to move forward with your refcount idea and see how
> the performance looks. If that's intolerable then that might help us
> decide on the next best alternative solution.
>

Here are performance numbers.

setup
create table t1 (a integer primary key, b integer);
create table t1_parted (a integer primary key, b integer) partition by range(a);
create 1000 partitions of t1

query (a five way self-join)
select * from t1 a, t1 b, t1 c, t1 d, t1 e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = e.a; -- unpartitioned table case
select * from t1_parted a, t1_parted b, t1_parted c, t1_parted d, t1_parted e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = e.a; -- partitioned table case

The numbers with the patches attached to [1], with the limitations listed in that email, are as follows. I ran each query 10 times through EXPLAIN (SUMMARY) and averaged the planning time (in ms) with and without the patch.

unpartitioned case
without patch: 0.25
with patch: 0.19
This improvement is probably misleading; the planning time for this query changes a lot between runs.

partitioned case (without partitionwise join)
without patch: 14580.65
with patch: 14625.12
% degradation: 0.3%

partitioned case (with partitionwise join)
without patch: 23273.69
with patch: 23328.52
% degradation: 0.2%

That looks pretty small considering the benefits. What do you think?

[1] https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 16:32, tender wang wrote: > > enable -O2, only one instruction: > xor eax, eax This is not fast code. This is how friendly C compiler suggests you that mask must be 127, not 128. Best regards, Andrey Borodin.
Re: btree_gist into core?
> Thoughts?

I think it'd be really nice to do this without btree_gist. I imagine something like this:

	CREATE INDEX ON tbl USING gist (range_col, int_col USING btree);

I think this would make the index access methods interface more useful. Index access method developers wouldn't need to provide operator classes for all data types. We could extend the ACCESS METHOD definition to allow this:

	CREATE ACCESS METHOD my_hash_index TYPE INDEX
	    IMPLEMENTS hash HANDLER my_hash_index_handler;

I realise this is a difficult project.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Andrey M. Borodin wrote on Thu, 14 Dec 2023 at 17:35:
>
> > On 14 Dec 2023, at 14:28, tender wang wrote:
> >
> > Now that AND is more faster, Can we replace the '%
> > SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'
>
> unsigned int GetBankno1(unsigned int pageno) {
>     return pageno & 127;
> }
>
> unsigned int GetBankno2(unsigned int pageno) {
>     return pageno % 128;
> }
>
> Generates with -O2
>
> GetBankno1(unsigned int):
>     mov eax, edi
>     and eax, 127
>     ret
> GetBankno2(unsigned int):
>     mov eax, edi
>     and eax, 127
>     ret
>
> Compiler is smart enough with constants.

Yeah, that's true.

	int GetBankno(long pageno) {
	    unsigned short bank_mask = 128;
	    int bankno = (pageno & bank_mask) % 128;
	    return bankno;
	}

With -O2 enabled, this compiles to only one instruction:

	xor eax, eax

But if we use '%' instead, things change as below:

	int GetBankno(long pageno) {
	    unsigned short bank_mask = 128;
	    int bankno = (pageno % bank_mask) % 128;
	    return bankno;
	}

	mov rdx, rdi
	sar rdx, 63
	shr rdx, 57
	lea rax, [rdi+rdx]
	and eax, 127
	sub eax, edx

> Best regards, Andrey Borodin.
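Putting Andrey's point from upthread concretely: the single-AND form requires a power-of-two bank count with a mask of count - 1, e.g. (sketch):

	#define SLRU_MAX_BANKLOCKS 128			/* must stay a power of two */

	static inline unsigned int
	GetBankno(unsigned int pageno)
	{
		/* one AND instruction; same result as pageno % 128 for unsigned values */
		return pageno & (SLRU_MAX_BANKLOCKS - 1);
	}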
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 16:06, Dilip Kumar wrote: > > I have noticed > a very good improvement with the addition of patch 0003. Indeed, a very impressive results! It’s almost x2 of performance on high contention scenario, on top of previous improvements. Best regards, Andrey Borodin.
Re: Synchronizing slots from primary to standby
On Thu, Dec 14, 2023 at 7:00 AM Peter Smith wrote: > > Hi, here are a few more review comments for the patch v47-0002 > > (plus my review comments of v45-0002 [1] are yet to be addressed) > > == > 1. General > > For consistency and readability, try to use variables of the same > names whenever they have the same purpose, even when they declared are > in different functions. A few like this were already mentioned in the > previous review but there are more I keep noticing. > > For example, > 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. > > > == > src/backend/replication/logical/slotsync.c > > 2. > +/* > + * > + * Validates the primary server info. > + * > + * Using the specified primary server connection, it verifies whether > the master > + * is a standby itself and returns true in that case to convey the caller > that > + * we are on the cascading standby. > + * But if master is the primary server, it goes ahead and validates > + * primary_slot_name. It emits error if the physical slot in > primary_slot_name > + * does not exist on the primary server. > + */ > +static bool > +validate_primary_info(WalReceiverConn *wrconn) > > 2b. > IMO it is too tricky to have a function called "validate_xxx", when > actually you gave that return value some special unintuitive meaning > other than just validation. IMO it is always better for the returned > value to properly match the function name so the expectations are very > obvious. So, In this case, I think a better function signature would > be like this: > > SUGGESTION > > static void > validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) > > or > > static void > validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > The terminology master_is_standby is a bit indirect for this usage, so I would prefer the second one. Shall we name this function as check_primary_info()? Additionally, can we rewrite the following comment: "Using the specified primary server connection, check whether we are cascading standby. It also validates primary_slot_info for non-cascading-standbys.". -- With Regards, Amit Kapila.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin wrote: > > On 12 Dec 2023, at 18:28, Alvaro Herrera wrote: > > > > Andrey, do you have any stress tests or anything else that you used to > > gain confidence in this code? > > We are using only the first two steps of the patchset; these steps do not touch > locking stuff. > > We’ve got some confidence after Yura Sokolov’s benchmarks [0].

Thanks! I have run this test [1]. Instead of comparing against the master, I have compared the effect of (patch-1 = (0001+0002) slru buffer bank) vs (patch-2 = (0001+0002+0003) slru buffer bank + bank-wise lock), and here are the results of benchmark-1 and benchmark-2. I have noticed a very good improvement with the addition of patch 0003.

Machine information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127

configurations:
max_wal_size=20GB
shared_buffers=20GB
checkpoint_timeout=40min
max_connections=700
maintenance_work_mem=1GB
subtrans_buffers=$variable
multixact_offsets_buffers=$variable
multixact_members_buffers=$variable

benchmark-1

version | subtrans | multixact | tps
        | buffers  | offs/memb | func+ballast
--------+----------+-----------+--------------
patch-1 | 64       | 64/128    | 87 + 58
patch-2 | 64       | 64/128    | 128 + 83
patch-1 | 1024     | 512/1024  | 96 + 64
patch-2 | 1024     | 512/1024  | 163 + 108

benchmark-2

version | subtrans | multixact | tps
        | buffers  | offs/memb | func
--------+----------+-----------+------
patch-1 | 64       | 64/128    | 10
patch-2 | 64       | 64/128    | 12
patch-1 | 1024     | 512/1024  | 44
patch-2 | 1024     | 512/1024  | 72

[1] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Guiding principle for dropping LLVM versions?
On 25.10.23 07:47, Thomas Munro wrote: Ideally more distros would be present in this vacuum-horizon decision table, but I don't think it'd change the conclusion: 10 is the trailing edge. Therefore the attached patch scales back its ambition to that release. Tested on LLVM 10-18. This patch and the associated reasoning look good to me. I think this is good to go for PG17.
Re: "pgoutput" options missing on documentation
> I agree that we missed updating the parameters of the Logical > Streaming Replication Protocol documentation. I haven't reviewed all > the details yet but one minor thing that caught my attention while > looking at your patch is that we can update the exact additional > information we started to send for streaming mode parallel. We should > update the following sentence: "It accepts an additional value > "parallel" to enable sending extra information with the "Stream Abort" > messages to be used for parallelisation." I changed this in the new version. Thank you for picking this up.
Re: "pgoutput" options missing on documentation
> To reduce translation efforts, perhaps it is better to arrange for > these to share a common message. Good idea. I've done so. > Also, I am unsure whether to call these "parameters" or "options" -- I > wanted to call them parameters like in the documentation, but every > other message in this function refers to "options", so in my example, > I mimicked the nearby code. YMMV. It looks like they are called "options" in most places. I changed the documentation to be consistent too. > Since the previous line already said pgoutput is the standard decoding > plugin, maybe it's not necessary to repeat that. > > SUGGESTION > Using the START_REPLICATION command, > pgoutput accepts the following parameters: Changed. > 3. > I noticed in the protocol message formats docs [1] that those messages > are grouped according to the protocol version that supports them. > Perhaps this page should be arranged similarly for these parameters? > > For example, document the parameters in the order they were introduced. > > SUGGESTION > > -proto_version > ... > -publication_names > ... > -binary > ... > -messages > ... > > Since protocol version 2: > > -streaming (boolean) > ... > > Since protocol version 3: > > -two_phase > ... > > Since protocol version 4: > > -streaming (boolean/parallel) > ... > -origin This is not going to be correct because not all options require a protocol version. "origin" was added in version 16, but doesn't check for any "proto_version". Perhaps we should fix this too. > 4. > + Boolean parameter to use the binary transfer mode. This is faster > + than the text mode, but slightly less robust > > SUGGESTION > Boolean parameter to use binary transfer mode. Binary mode is faster > than the text mode but slightly less robust Done. Thanks for the review. The new versions are attached.
v01-0001-pgoutput-Improve-error-messages-for-options.patch Description: Binary data
v01-0002-pgoutput-Document-missing-options.patch Description: Binary data
Re: planner chooses incremental but not the best one
On Tue, Dec 12, 2023 at 4:40 PM Nicolas Lutic wrote:
> I've come across a behaviour of the planner I can't explain.
> After a migration from 11 to 15 (on RDS) we noticed a degradation in
> response time on a query, it went from a few seconds to ten minutes.
> A vacuum(analyze) has been realized to be sure that all is clean.
> The 'explain analyze' shows us a change of plan. Postgresql 15 chooses
> `incremental sort` with an index corresponding to the ORDER BY clause
> (on the created_at column). The previous v11 plan used a more efficient index.
>
> By deactivating incremental sort, response times in v15 are equal to v11 one.

I think this issue is caused by under-estimating the startup cost of incremental sort, which in turn is caused by over-estimating the number of groups with equal presorted keys by estimate_num_groups(). We can simulate the same issue with the query below.

create table t (a int, b int, c int, d int) partition by range (a);
create table tp1 partition of t for values from (0) to (1000);
create table tp2 partition of t for values from (1000) to (2000);
insert into t select i%2000, i%1000, i%300, i from generate_series(1,100)i;
create index on t(b);
create index on t(c);
analyze t;

-- by default incremental sort is chosen
explain analyze select * from t where b = 3 order by c, d limit 10;

                                   QUERY PLAN
 Limit (cost=143.03..570.89 rows=10 width=16) (actual time=375.399..375.402 rows=10 loops=1)
   -> Incremental Sort (cost=143.03..42671.85 rows=994 width=16) (actual time=375.397..375.399 rows=10 loops=1)
        Sort Key: t.c, t.d
        Presorted Key: t.c
        Full-sort Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
        Pre-sorted Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
        -> Merge Append (cost=0.85..42644.84 rows=994 width=16) (actual time=11.415..375.289 rows=335 loops=1)
             Sort Key: t.c
             -> Index Scan using tp1_c_idx on tp1 t_1 (cost=0.42..21318.39 rows=498 width=16) (actual time=6.666..189.356 rows=168 loops=1)
                  Filter: (b = 3)
                  Rows Removed by Filter: 171537
             -> Index Scan using tp2_c_idx on tp2 t_2 (cost=0.42..21316.50 rows=496 width=16) (actual time=4.745..185.870 rows=168 loops=1)
                  Filter: (b = 3)
                  Rows Removed by Filter: 171534
 Planning Time: 0.501 ms
 Execution Time: 375.477 ms
(16 rows)

-- disable incremental sort
set enable_incremental_sort to off;
SET

explain analyze select * from t where b = 3 order by c, d limit 10;

                                   QUERY PLAN
 Limit (cost=2577.51..2577.53 rows=10 width=16) (actual time=2.928..2.930 rows=10 loops=1)
   -> Sort (cost=2577.51..2579.99 rows=994 width=16) (actual time=2.925..2.927 rows=10 loops=1)
        Sort Key: t.c, t.d
        Sort Method: top-N heapsort  Memory: 25kB
        -> Append (cost=8.28..2556.03 rows=994 width=16) (actual time=0.627..2.670 rows=1000 loops=1)
             -> Bitmap Heap Scan on tp1 t_1 (cost=8.28..1276.62 rows=498 width=16) (actual time=0.625..1.688 rows=500 loops=1)
                  Recheck Cond: (b = 3)
                  Heap Blocks: exact=500
                  -> Bitmap Index Scan on tp1_b_idx (cost=0.00..8.16 rows=498 width=0) (actual time=0.330..0.330 rows=500 loops=1)
                       Index Cond: (b = 3)
             -> Bitmap Heap Scan on tp2 t_2 (cost=8.27..1274.43 rows=496 width=16) (actual time=0.178..0.876 rows=500 loops=1)
                  Recheck Cond: (b = 3)
                  Heap Blocks: exact=500
                  -> Bitmap Index Scan on tp2_b_idx (cost=0.00..8.14 rows=496 width=0) (actual time=0.093..0.093 rows=500 loops=1)
                       Index Cond: (b = 3)
 Planning Time: 0.481 ms
 Execution Time: 3.031 ms
(17 rows)

As we can see, the execution time is 375.477 ms by default and 3.031 ms if we disable incremental sort.
When we calculate the cost of incremental sort, we need to estimate the number of groups into which the relation is divided by the presorted keys, and then calculate the cost of sorting a single group. If we over-estimate the number of groups, the startup cost of incremental sort would be under-estimated. In the first plan above, the number of groups with equal 't.c' is estimated to be 300 by estimate_num_groups(), but is actually 3 after applying the restriction 'b = 3'. As a result, the startup cost of the incremental sort is estimated to be 143.03, but is actually 14222.68. That's why the
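To put rough numbers on that, here is a deliberately simplified model (this is not the actual cost_incremental_sort()/cost_tuplesort() logic; the function name and cost constants below are illustrative placeholders, not planner cost parameters):

#include <math.h>
#include <stdio.h>

/* Incremental sort's startup cost is roughly the cost of fetching and
 * sorting only the FIRST group of rows with equal presorted keys, so
 * over-estimating the group count shrinks the assumed first group and
 * under-estimates the startup cost. */
static double
startup_cost(double input_rows, double est_groups,
             double per_tuple_fetch, double per_comparison)
{
    double group_rows = input_rows / est_groups;
    double cmp_rows = group_rows > 2.0 ? group_rows : 2.0;

    return group_rows * per_tuple_fetch +
           per_comparison * group_rows * log2(cmp_rows);
}

int
main(void)
{
    /* 994 input rows as in the example plan: 300 estimated groups
     * versus the 3 groups actually present after "b = 3" */
    printf("est. 300 groups: %.2f\n", startup_cost(994, 300, 1.0, 0.1));
    printf("actual 3 groups: %.2f\n", startup_cost(994, 3, 1.0, 0.1));
    return 0;
}

With the same placeholder costs, the version that believes in 300 groups reports a startup cost about two orders of magnitude smaller than the 3-group version, which mirrors the 143.03-versus-14222.68 gap described above.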
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 11 Dec 2023 10:57:15 +0900, Masahiko Sawada wrote: > IIUC we cannot create two same name functions with the same arguments > but a different return value type in the first place. It seems to me > to be an overkill to change such a design. Oh, sorry. I didn't notice it. > Another idea is to encapsulate copy_to/from_handler by a super class > like copy_handler. The handler function is called with an argument, > say copyto, and returns copy_handler encapsulating either > copy_to/from_handler depending on the argument. It's for using "${copy_format_name}" such as "json" and "parquet" as a function name, right? If we use the "${copy_format_name}" approach, we can't use function names that are already used by tablesample method handlers, such as "system" and "bernoulli", as COPY FORMAT names, because both tablesample method handler functions and COPY FORMAT handler functions take "(internal)" as their argument. I think tablesample method names and COPY FORMAT names would not actually conflict, but the limitation (sharing one namespace between tablesample methods and COPY FORMAT) is an unnecessary one. How about using a prefix ("copy_to_${copy_format_name}" or something) or a suffix ("${copy_format_name}_copy_to" or something) for function names? For example, "copy_to_json"/"copy_from_json" for the "json" COPY FORMAT. ("copy_${copy_format_name}", which returns a copy_handler encapsulating either copy_to/from_handler depending on the argument, may be an option.) Thanks, -- kou
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 14:28, tender wang wrote:
>
> Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'?

unsigned int GetBankno1(unsigned int pageno) {
    return pageno & 127;
}

unsigned int GetBankno2(unsigned int pageno) {
    return pageno % 128;
}

Generates with -O2:

GetBankno1(unsigned int):
    mov eax, edi
    and eax, 127
    ret
GetBankno2(unsigned int):
    mov eax, edi
    and eax, 127
    ret

Compiler is smart enough with constants.

Best regards, Andrey Borodin.
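For what it's worth, the fold from '%' to AND above only happens because the divisor is a compile-time power-of-two constant; with a divisor known only at runtime the compiler must emit a real division, which is why a precomputed mask is attractive when the bank count is configurable. A minimal illustration (hypothetical helper names, not from the patch):

/* constant power-of-two divisor: gcc/clang at -O2 emit "and eax, 127" */
unsigned int
GetBanknoConst(unsigned int pageno)
{
    return pageno % 128;
}

/* runtime divisor: the compiler cannot prove it is a power of two,
 * so it emits a div instruction instead of a cheap AND */
unsigned int
GetBanknoVar(unsigned int pageno, unsigned int nbanks)
{
    return pageno % nbanks;
}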
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Andrey M. Borodin wrote on Thu, Dec 14, 2023 at 17:02:
> > On 14 Dec 2023, at 08:12, Amul Sul wrote:
> >
> > + int bankno = pageno & ctl->bank_mask;
> >
> > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number of banks (num_banks) and get the bank number through modulus op (pageno % num_banks) instead of bitwise & operation (pageno & ctl->bank_mask), which is a bit difficult to read compared to modulus op, which is quite simple, straightforward and much common practice in hashing.
> >
> > Are there any advantages of using & over % ?

Using the Compiler Explorer [1] tool: '%' produces more assembly instructions than '&'.

int GetBankno1(int pageno) {
    return pageno & 127;
}

int GetBankno2(int pageno) {
    return pageno % 127;
}

Under clang 13.0:

GetBankno1: # @GetBankno1
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov eax, dword ptr [rbp - 4]
    and eax, 127
    pop rbp
    ret
GetBankno2: # @GetBankno2
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov eax, dword ptr [rbp - 4]
    mov ecx, 127
    cdq
    idiv ecx
    mov eax, edx
    pop rbp
    ret

Under gcc 13.2:

GetBankno1:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    and eax, 127
    pop rbp
    ret
GetBankno2:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    movsx rdx, eax
    imul rdx, rdx, -2130574327
    shr rdx, 32
    add edx, eax
    mov ecx, edx
    sar ecx, 6
    cdq
    sub ecx, edx
    mov edx, ecx
    sal edx, 7
    sub edx, ecx
    sub eax, edx
    mov ecx, eax
    mov eax, ecx
    pop rbp
    ret

[1] https://godbolt.org/

> The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of a status of a transaction.

Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '&':

SimpleLruGetBankLock()
{
    int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;  /* use '&' here */
    return &(ctl->shared->bank_locks[banklockno].lock);
}

Thoughts?

> [0] https://www.agner.org/optimize/instruction_tables.pdf
Re: [meson] expose buildtype debug/optimization info to pg_config
Hi Peter, Thanks for looking into this. On Thu, Dec 14, 2023 at 4:50 PM Peter Eisentraut wrote: > > On 12.12.23 11:40, Junwang Zhao wrote: > > build system using configure set VAL_CFLAGS with debug and > > optimization flags, so pg_config will show these infos. Some > > extensions depend on the mechanism. > > > > This patch exposes these flags with a typo fixed together. > > I have committed the typo fix. > > But I would like to learn more about the requirements of extensions in > this area. This seems a bit suspicious to me. This is what I found when building citus against a meson debug-build pg installation: since CFLAGS doesn't contain the -g flag, the binary doesn't include debug information, which differs from the behavior of the configure build system. Another issue I found is that some C++ extensions (adjust/parquet_fdw, for example) don't build against the meson-generated pgxs.mk, since it doesn't set the CXX command. CXX is only set when the llvm option is enabled, which is different from the old build system. I don't insist that we make Meson behave the same as the old build system; I just think the issues I raised might stop developers from trying the fancy new build system. And the fix I posted might not be ideal; you and Andres might have better solutions. -- Regards, Junwang Zhao
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat wrote: > > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote: > > > > I think you forgot to attach the patch. > > Sorry. Here it is. > > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote: > > > > > > It looks like the solution works. But this is the only place where we > > process a change before SNAPSHOT reaches FULL. But this is also the > > only record which affects a decision to queue/not a following change. > > So it should be ok. The sequence_hash'es are separate for each > > transaction and they are cleaned when processing COMMIT record. > > > > But it is possible that even commit or abort also happens before the > > snapshot reaches full state in which case the hash table will have > > stale or invalid (for aborts) entries. That will probably be cleaned > > at a later point by running_xact records. > > Why would cleaning wait till running_xact records? Won't txn entry > itself be removed when processing commit/abort record? At the same time the > sequence hash will be cleaned as well. > > > Now, I think in theory, it > > is possible that the same RelFileLocator can again be allocated before > > we clean up the existing entry which can probably confuse the system. > > How? The transaction allocating the first time would be cleaned before > it happens the second time. So shouldn't matter. >

It can only be cleaned if we process it, but xact_decode won't allow us to process it, and I don't think it would be a good idea to add another hack for sequences here. See the code below:

xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
    SnapBuild  *builder = ctx->snapshot_builder;
    ReorderBuffer *reorder = ctx->reorder;
    XLogReaderState *r = buf->record;
    uint8       info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;

    /*
     * If the snapshot isn't yet fully built, we cannot decode anything, so
     * bail out.
     */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
        return;

-- With Regards, Amit Kapila.
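For readers following the state comparisons in this thread: the builder states form an ordered progression. Paraphrasing the SnapBuildState enum from src/include/replication/snapbuild.h (the header is the authoritative reference):

typedef enum
{
    SNAPBUILD_START = -1,             /* initial state, nothing can be decoded */
    SNAPBUILD_BUILDING_SNAPSHOT = 0,  /* gathering information about running xacts */
    SNAPBUILD_FULL_SNAPSHOT = 1,      /* xacts that start from now on can be decoded */
    SNAPBUILD_CONSISTENT = 2          /* all xacts can be decoded and shipped */
} SnapBuildState;

Because the values are ordered, a check like "state < SNAPBUILD_FULL_SNAPSHOT" bails out in the two earlier states, while the patch's "state < SNAPBUILD_BUILDING_SNAPSHOT" bails out only in SNAPBUILD_START.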
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar wrote: > > I think you forgot to attach the patch. Sorry. Here it is. On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila wrote: > > > > It looks like the solution works. But this is the only place where we > process a change before SNAPSHOT reaches FULL. But this is also the > only record which affects a decision to queue/not a following change. > So it should be ok. The sequence_hash'es are separate for each > transaction and they are cleaned when processing COMMIT record. > > > > But it is possible that even commit or abort also happens before the > snapshot reaches full state in which case the hash table will have > stale or invalid (for aborts) entries. That will probably be cleaned > at a later point by running_xact records. Why would cleaning wait till running_xact records? Won't the txn entry itself be removed when processing the commit/abort record? At the same time the sequence hash will be cleaned as well. > Now, I think in theory, it > is possible that the same RelFileLocator can again be allocated before > we clean up the existing entry which can probably confuse the system. How? The transaction allocating the first time would be cleaned before it happens the second time. So it shouldn't matter. -- Best Wishes, Ashutosh Bapat

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 73e38cafd8..8e2ebbd8e0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1543,10 +1543,15 @@ smgr_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	elog(LOG, "XXX: smgr_decode. snapshot is %s", SnapBuildIdentify(builder));
 
 	/*
-	 * If we don't have snapshot or we are just fast-forwarding, there is no
-	 * point in decoding relfilenode information.
+	 * If we are not building snapshot yet or we are just fast-forwarding, there
+	 * is no point in decoding relfilenode information. If the sequence
+	 * associated with relfilenode here is changed in the same transaction but
+	 * after snapshot was built, the relfilenode needs to be present in the
+	 * sequence hash table so that the change can be deemed as transactional.
+	 * Otherwise it will not be queued. Hence we process this change even before
+	 * we have built snapshot.
 	 */
-	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_BUILDING_SNAPSHOT ||
 		ctx->fast_forward)
 	{
 		elog(LOG, "XXX: skipped");
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat wrote: > > On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar wrote: > > > > > > It is correct that we can make a wrong decision about whether a change > > > is transactional or non-transactional when sequence DDL happens before > > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens > > > after that state. However, one thing to note here is that we won't try > > > to stream such a change because for non-transactional cases we don't > > > proceed unless the snapshot is in a consistent state. Now, if the > > > decision had been correct then we would probably have queued the > > > sequence change and discarded at commit. > > > > > > One thing that we deviate here is that for non-sequence transactional > > > cases (including logical messages), we immediately start queuing the > > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided > > > SnapBuildProcessChange() returns true which is quite possible) and > > > take final decision at commit/prepare/abort time. However, that won't > > > be the case for sequences because of the dependency of determining > > > transactional cases on one of the prior records. Now, I am not > > > completely sure at this stage if such a deviation can cause any > > > problem and or whether we are okay to have such a deviation for > > > sequences. > > > > Okay, so this particular scenario that I raised is somehow saved, I > > mean although we are considering transactional sequence operation as > > non-transactional we also know that if some of the changes for a > > transaction are skipped because the snapshot was not FULL that means > > that transaction can not be streamed because that transaction has to > > be committed before snapshot become CONSISTENT (based on the snapshot > > state change machinery). Ideally based on the same logic that the > > snapshot is not consistent the non-transactional sequence changes are > > also skipped. But the only thing that makes me a bit uncomfortable is > > that even though the result is not wrong we have made some wrong > > intermediate decisions i.e. considered transactional change as > > non-transactions. > > > > One solution to this issue is that, even if the snapshot state does > > not reach FULL just add the sequence relids to the hash, I mean that > > hash is only maintained for deciding whether the sequence is changed > > in that transaction or not. So no adding such relids to hash seems > > like a root cause of the issue. Honestly, I haven't analyzed this > > idea in detail about how easy it would be to add only these changes to > > the hash and what are the other dependencies, but this seems like a > > worthwhile direction IMHO. > ... > It looks like the solution works. But this is the only place where we > process a change before SNAPSHOT reaches FULL. But this is also the > only record which affects a decision to queue/not a following change. > So it should be ok. The sequence_hash'es are separate for each > transaction and they are cleaned when processing COMMIT record.
But it is possible that even commit or abort also happens before the snapshot reaches full state in which case the hash table will have stale or invalid (for aborts) entries. That will probably be cleaned at a later point by running_xact records. Now, I think in theory, it is possible that the same RelFileLocator can again be allocated before we clean up the existing entry which can probably confuse the system. It might or might not be a problem in practice but I think the more assumptions we add for sequences, the more difficult it will become to ensure its correctness. -- With Regards, Amit Kapila.
Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}
On Tue, Oct 31, 2023 at 03:11:03PM +0100, hubert depesz lubaczewski wrote: > On Tue, Oct 31, 2023 at 08:17:52AM +0900, Michael Paquier wrote: >> Thanks for the input. I was looking yesterday if this code was >> available somewhere, but couldn't find it.. Until this morning: >> https://gitlab.com/depesz/explain.depesz.com.git > > Well, the parser itself is https://gitlab.com/depesz/Pg--Explain/ :) That was close enough ;) > Well, if it's possible to deduce what is the meaning in given line, > I can add the logic to do the deduction to parser. > > Also, I want to say that I appreciate being looped in the discussion. I lost sight of this thread, so my apologies for the delay. The patch to fix the description of the EXPLAIN field has now been applied to v15 and v16. -- Michael
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 08:12, Amul Sul wrote: > > > + int bankno = pageno & ctl->bank_mask; > > I am a bit uncomfortable seeing it as a mask, why can't it be simply a number > of banks (num_banks) and get the bank number through modulus op (pageno % > num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a > bit difficult to read compared to modulus op which is quite simple, > straightforward and much common practice in hashing. > > Are there any advantages of using & over % ? The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of the status of a transaction. [0] https://www.agner.org/optimize/instruction_tables.pdf
Re: [meson] expose buildtype debug/optimization info to pg_config
On 12.12.23 11:40, Junwang Zhao wrote: build system using configure set VAL_CFLAGS with debug and optimization flags, so pg_config will show these infos. Some extensions depend on the mechanism. This patch exposes these flags with a typo fixed together. I have committed the typo fix. But I would like to learn more about the requirements of extensions in this area. This seems a bit suspicious to me.
Re: Add code indentation check to cirrus-ci (was Re: Add BF member koel-like indentation checks to SanityCheck CI)
Hi, You may want to check out the WIP patch [1] about adding meson targets to run pgindent by Andres. [1] https://www.postgresql.org/message-id/20231019044907.ph6dw637loqg3lqk%40awork3.anarazel.de -- Regards, Nazir Bilal Yavuz Microsoft
Re: GUC names in messages
On 11.12.23 00:07, Peter Smith wrote: If the rule is changed to quote those MixedCase GUCs then the docs will require minor tweaking CURRENT In messages containing configuration variable names, do not include quotes when the names are visibly not natural English words, such as when they have underscores, are all-uppercase or have mixed case. Otherwise, quotes must be added. Do include quotes in a message where an arbitrary variable name is to be expanded. "are all-uppercase or have mixed case." --> "or are all-uppercase." After these discussions, I think this rule change was not a good idea. It effectively enforces these kinds of inconsistencies. For example, if you ever refactored "DateStyle is wrong" to "%s is wrong" you'd need to adjust the quotes, and thus user-visible behavior, for entirely internal reasons. This is not good. And then came the idea to determine the quoting dynamically, which I think everyone agreed was too much. So I don't see a way to make this work well.
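To make the refactoring hazard concrete, a contrived pair of messages (these are not actual messages from the tree; the snippet assumes the usual ereport machinery available in the backend):

/* rule satisfied: a MixedCase GUC name spelled literally gets no quotes */
ereport(ERROR,
        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
         errmsg("DateStyle is wrong")));

/* after a purely internal refactor to a placeholder, the rule demands
 * quotes for an expanded name, so the user-visible text changes */
ereport(ERROR,
        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
         errmsg("\"%s\" is wrong", "DateStyle")));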
Re: Simplify newNode()
On 14.12.23 01:48, Heikki Linnakangas wrote: The newNode() macro can be turned into a static inline function, which makes it a lot simpler. See attached. This was not possible when the macro was originally written, as we didn't require compiler to have static inline support, but nowadays we do. This was last discussed in 2008, see discussion at https://www.postgresql.org/message-id/26133.1220037409%40sss.pgh.pa.us. In those tests, Tom observed that gcc refused to inline the static inline function. That was weird, the function is very small and doesn't do anything special. Whatever the problem was, I think we can dismiss it with modern compilers. It does get inlined on gcc 12 and clang 14 that I have installed. I notice that the existing comments point out that the size argument should be a compile-time constant, but that is no longer the case for ExtensibleNode(). Also, newNode() is the only caller of palloc0fast(), which also points out that the size argument should be a compile-time constant, and palloc0fast() is the only caller of MemSetTest(). I can see how an older compiler might have gotten too confused by all that. But if we think that compilers are now smart enough, maybe we can unwind this whole stack a bit more? Maybe we don't need MemSetTest() and/or palloc0fast() and/or newNode() at all?
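For reference, a minimal sketch of the static-inline direction (an assumption of roughly what the attached patch does, not the patch itself; it presumes the backend environment via postgres.h):

#include "postgres.h"

/* newNode() as a plain static inline function instead of a macro */
static inline Node *
newNode(size_t size, NodeTag tag)
{
    Node   *result;

    Assert(size >= sizeof(Node));   /* need space for the tag, at least */
    result = (Node *) palloc0(size);
    result->type = tag;

    return result;
}

With this shape there is no longer any requirement for the size argument to be a compile-time constant, which is what makes the palloc0fast()/MemSetTest() machinery look like a candidate for removal as well.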
Re: remaining sql/json patches
On Sat, Dec 9, 2023 at 2:05 PM jian he wrote: > Hi. Thanks for the review. > function JsonPathExecResult comment needs to be refactored? since it > changed a lot. I suppose you meant executeJsonPath()'s comment. I've added a description of the new callback function arguments. On Wed, Dec 13, 2023 at 6:59 PM jian he wrote: > Hi. small issues I found... > > typo: > +-- Test mutabilily od query functions Fixed. > > + default: > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("only datetime, bool, numeric, and text types can be casted > to jsonpath types"))); > > transformJsonPassingArgs's function: transformJsonValueExpr will make > the above code unreached. It's good to have the ereport to catch errors caused by any future changes. > also based on the `switch (typid)` cases, > I guess best message would be > errmsg("only datetime, bool, numeric, text, json, jsonb types can be > casted to jsonpath types"))); I've rewritten the message to mention the unsupported type. Maybe the supported types can go in a DETAIL message. I might do that later. > + case JSON_QUERY_OP: > + jsexpr->wrapper = func->wrapper; > + jsexpr->omit_quotes = (func->quotes == JS_QUOTES_OMIT); > + > + if (!OidIsValid(jsexpr->returning->typid)) > + { > + JsonReturning *ret = jsexpr->returning; > + > + ret->typid = JsonFuncExprDefaultReturnType(jsexpr); > + ret->typmod = -1; > + } > + jsexpr->result_coercion = coerceJsonFuncExprOutput(pstate, jsexpr); > > I noticed, if (!OidIsValid(jsexpr->returning->typid)) is the true > function JsonFuncExprDefaultReturnType may be called twice, not sure > if it's good or not.. If avoiding the double-calling means that we've to add more conditions in the code, I'm fine with leaving this as-is. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: remaining sql/json patches
On Sat, Dec 9, 2023 at 2:05 AM Andrew Dunstan wrote: > On 2023-12-08 Fr 11:37, Robert Haas wrote: > > On Fri, Dec 8, 2023 at 1:59 AM Amit Langote wrote: > >> Would it be messy to replace the lookahead approach by whatever's > >> suiable *in the future* when it becomes necessary to do so? > > It might be. Changing grammar rules to tends to change corner-case > > behavior if nothing else. We're best off picking the approach that we > > think is correct long term. > > All this makes me wonder if Alvaro's first suggested solution (adding > NESTED to the UNBOUNDED precedence level) wouldn't be better after all. I've done just that in the latest v32. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com