Re: Inefficiency in parallel pg_restore with many tables

2023-07-22 Thread Nathan Bossart
On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> I first tried modifying
>> binaryheap to use "int" or "void *" instead, but that ended up requiring
>> some rather invasive changes in backend code, not to mention any extensions
>> that happen to be using it.

I followed through with the "void *" approach (attached), and it wasn't as
bad as I expected.

> I wonder whether we can't provide some alternate definition or "skin"
> for binaryheap that preserves the Datum API for backend code that wants
> that, while providing a void *-based API for frontend code to use.

I can give this a try next, but it might be rather #ifdef-heavy.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From b570460b19c5955c91b1fc28e8fdc791f41998e2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 22 Jul 2023 20:59:46 -0700
Subject: [PATCH v4 1/4] use "void *" instead of "Datum" in binaryheap

---
 src/backend/executor/nodeGatherMerge.c| 16 +-
 src/backend/executor/nodeMergeAppend.c| 16 +-
 src/backend/lib/binaryheap.c  | 20 ++--
 src/backend/postmaster/pgarch.c   | 18 +--
 .../replication/logical/reorderbuffer.c   | 31 +--
 src/backend/storage/buffer/bufmgr.c   | 11 +++
 src/include/lib/binaryheap.h  | 14 -
 7 files changed, 61 insertions(+), 65 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 9d5e1a46e9..021682d87c 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -52,7 +52,7 @@ typedef struct GMReaderTupleBuffer
 } GMReaderTupleBuffer;
 
 static TupleTableSlot *ExecGatherMerge(PlanState *pstate);
-static int32 heap_compare_slots(Datum a, Datum b, void *arg);
+static int32 heap_compare_slots(const void *a, const void *b, void *arg);
 static TupleTableSlot *gather_merge_getnext(GatherMergeState *gm_state);
 static MinimalTuple gm_readnext_tuple(GatherMergeState *gm_state, int nreader,
 	  bool nowait, bool *done);
@@ -489,7 +489,7 @@ reread:
 /* Don't have a tuple yet, try to get one */
 if (gather_merge_readnext(gm_state, i, nowait))
 	binaryheap_add_unordered(gm_state->gm_heap,
-			 Int32GetDatum(i));
+			 (void *) (intptr_t) i);
 			}
 			else
 			{
@@ -565,10 +565,10 @@ gather_merge_getnext(GatherMergeState *gm_state)
 		 * the heap, because it might now compare differently against the
 		 * other elements of the heap.
 		 */
-		i = DatumGetInt32(binaryheap_first(gm_state->gm_heap));
+		i = (intptr_t) binaryheap_first(gm_state->gm_heap);
 
 		if (gather_merge_readnext(gm_state, i, false))
-			binaryheap_replace_first(gm_state->gm_heap, Int32GetDatum(i));
+			binaryheap_replace_first(gm_state->gm_heap, (void *) (intptr_t) i);
 		else
 		{
 			/* reader exhausted, remove it from heap */
@@ -585,7 +585,7 @@ gather_merge_getnext(GatherMergeState *gm_state)
 	else
 	{
 		/* Return next tuple from whichever participant has the leading one */
-		i = DatumGetInt32(binaryheap_first(gm_state->gm_heap));
+		i = (intptr_t) binaryheap_first(gm_state->gm_heap);
 		return gm_state->gm_slots[i];
 	}
 }
@@ -750,11 +750,11 @@ typedef int32 SlotNumber;
  * Compare the tuples in the two given slots.
  */
 static int32
-heap_compare_slots(Datum a, Datum b, void *arg)
+heap_compare_slots(const void *a, const void *b, void *arg)
 {
 	GatherMergeState *node = (GatherMergeState *) arg;
-	SlotNumber	slot1 = DatumGetInt32(a);
-	SlotNumber	slot2 = DatumGetInt32(b);
+	SlotNumber	slot1 = (intptr_t) a;
+	SlotNumber	slot2 = (intptr_t) b;
 
 	TupleTableSlot *s1 = node->gm_slots[slot1];
 	TupleTableSlot *s2 = node->gm_slots[slot2];
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 21b5726e6e..b3e2c29401 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -52,7 +52,7 @@
 typedef int32 SlotNumber;
 
 static TupleTableSlot *ExecMergeAppend(PlanState *pstate);
-static int	heap_compare_slots(Datum a, Datum b, void *arg);
+static int	heap_compare_slots(const void *a, const void *b, void *arg);
 
 
 /* 
@@ -229,7 +229,7 @@ ExecMergeAppend(PlanState *pstate)
 		{
 			node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
 			if (!TupIsNull(node->ms_slots[i]))
-binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i));
+binaryheap_add_unordered(node->ms_heap, (void *) (intptr_t) i);
 		}
 		binaryheap_build(node->ms_heap);
 		node->ms_initialized = true;
@@ -244,10 +244,10 @@ ExecMergeAppend(PlanState *pstate)
 		 * by doing this before returning from the prior call, but it's better
 		 * to not pull tuples until necessary.)
 		 */
-		i = DatumGetInt32(binaryheap_first(node->ms_heap));
+		i = (intptr_t) binaryheap_first(node->ms_heap);
 		node->ms_slots[i] = 

Re: PG 16 draft release notes ready

2023-07-22 Thread jian he
>   https://momjian.us/pgsql_docs/release-16.html
>
>  I will adjust it to the feedback I receive;  that URL will quickly show
>  all updates.

>  Create a predefined role and grantable privilege with permission to perform 
> maintenance operations (Nathan Bossart)
> The predefined role is is called pg_maintain.

this feature was also reverted.
https://git.postgresql.org/cgit/postgresql.git/commit/?id=151c22deee66a3390ca9a1c3675e29de54ae73fc




Re: Inefficiency in parallel pg_restore with many tables

2023-07-22 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote:
>> One item that requires more thought is binaryheap's use of Datum.  AFAICT
>> the Datum definitions live in postgres.h and aren't available to frontend
>> code.  I think we'll either need to move the Datum definitions to c.h or to
>> adjust binaryheap to use "void *".

> In v3, I moved the Datum definitions to c.h.  I first tried modifying
> binaryheap to use "int" or "void *" instead, but that ended up requiring
> some rather invasive changes in backend code, not to mention any extensions
> that happen to be using it.

I'm quite uncomfortable with putting Datum in c.h.  I know that the
typedef is merely a uintptr_t, but this solution seems to me to be
blowing all kinds of holes in the abstraction, because exactly none
of the infrastructure that goes along with Datum is or is ever likely
to be in any frontend build.  At the very least, frontend code that
refers to Datum will be misleading as hell.

I wonder whether we can't provide some alternate definition or "skin"
for binaryheap that preserves the Datum API for backend code that wants
that, while providing a void *-based API for frontend code to use.

regards, tom lane




Re: Inefficiency in parallel pg_restore with many tables

2023-07-22 Thread Nathan Bossart
On Sat, Jul 22, 2023 at 04:19:41PM -0700, Nathan Bossart wrote:
> In v3, I moved the Datum definitions to c.h.  I first tried modifying
> binaryheap to use "int" or "void *" instead, but that ended up requiring
> some rather invasive changes in backend code, not to mention any extensions
> that happen to be using it.  I also looked into moving the definitions to a
> separate datumdefs.h header that postgres.h would include, but that felt
> awkward because 1) postgres.h clearly states that it is intended for things
> "that never escape the backend" and 2) the definitions seem relatively
> inexpensive.  However, I think the latter option is still viable, so I'm
> fine with switching to it if folks think that is a better approach.

BTW we might be able to replace the open-coded heap in pg_dump_sort.c
(added by 79273cc) with a binaryheap, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Inefficiency in parallel pg_restore with many tables

2023-07-22 Thread Nathan Bossart
On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote:
> Here is a work-in-progress patch set for converting ready_list to a
> priority queue.  On my machine, Tom's 100k-table example [0] takes 11.5
> minutes without these patches and 1.5 minutes with them.
> 
> One item that requires more thought is binaryheap's use of Datum.  AFAICT
> the Datum definitions live in postgres.h and aren't available to frontend
> code.  I think we'll either need to move the Datum definitions to c.h or to
> adjust binaryheap to use "void *".

In v3, I moved the Datum definitions to c.h.  I first tried modifying
binaryheap to use "int" or "void *" instead, but that ended up requiring
some rather invasive changes in backend code, not to mention any extensions
that happen to be using it.  I also looked into moving the definitions to a
separate datumdefs.h header that postgres.h would include, but that felt
awkward because 1) postgres.h clearly states that it is intended for things
"that never escape the backend" and 2) the definitions seem relatively
inexpensive.  However, I think the latter option is still viable, so I'm
fine with switching to it if folks think that is a better approach.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 56066f836f389ca0efe169c1ea8d769f70aaa817 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 22 Jul 2023 14:53:49 -0700
Subject: [PATCH v3 1/4] move datum definitions to c.h

---
 src/include/c.h| 512 
 src/include/postgres.h | 519 +
 2 files changed, 515 insertions(+), 516 deletions(-)

diff --git a/src/include/c.h b/src/include/c.h
index f69d739be5..daee2f50eb 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -35,6 +35,7 @@
  *		7)		widely useful macros
  *		8)		random stuff
  *		9)		system-specific hacks
+ *		10)		Datum type + support functions
  *
  * NOTE: since this file is included by both frontend and backend modules,
  * it's usually wrong to put an "extern" declaration here, unless it's
@@ -1374,4 +1375,515 @@ typedef intptr_t sigjmp_buf[5];
 /* /port compatibility functions */
 #include "port.h"
 
+/* 
+ *Section 10:	 Datum type + support functions
+ * 
+ */
+
+/*
+ * A Datum contains either a value of a pass-by-value type or a pointer to a
+ * value of a pass-by-reference type.  Therefore, we require:
+ *
+ * sizeof(Datum) == sizeof(void *) == 4 or 8
+ *
+ * The functions below and the analogous functions for other types should be used to
+ * convert between a Datum and the appropriate C type.
+ */
+
+typedef uintptr_t Datum;
+
+/*
+ * A NullableDatum is used in places where both a Datum and its nullness needs
+ * to be stored. This can be more efficient than storing datums and nullness
+ * in separate arrays, due to better spatial locality, even if more space may
+ * be wasted due to padding.
+ */
+typedef struct NullableDatum
+{
+#define FIELDNO_NULLABLE_DATUM_DATUM 0
+	Datum		value;
+#define FIELDNO_NULLABLE_DATUM_ISNULL 1
+	bool		isnull;
+	/* due to alignment padding this could be used for flags for free */
+} NullableDatum;
+
+#define SIZEOF_DATUM SIZEOF_VOID_P
+
+/*
+ * DatumGetBool
+ *		Returns boolean value of a datum.
+ *
+ * Note: any nonzero value will be considered true.
+ */
+static inline bool
+DatumGetBool(Datum X)
+{
+	return (X != 0);
+}
+
+/*
+ * BoolGetDatum
+ *		Returns datum representation for a boolean.
+ *
+ * Note: any nonzero value will be considered true.
+ */
+static inline Datum
+BoolGetDatum(bool X)
+{
+	return (Datum) (X ? 1 : 0);
+}
+
+/*
+ * DatumGetChar
+ *		Returns character value of a datum.
+ */
+static inline char
+DatumGetChar(Datum X)
+{
+	return (char) X;
+}
+
+/*
+ * CharGetDatum
+ *		Returns datum representation for a character.
+ */
+static inline Datum
+CharGetDatum(char X)
+{
+	return (Datum) X;
+}
+
+/*
+ * Int8GetDatum
+ *		Returns datum representation for an 8-bit integer.
+ */
+static inline Datum
+Int8GetDatum(int8 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetUInt8
+ *		Returns 8-bit unsigned integer value of a datum.
+ */
+static inline uint8
+DatumGetUInt8(Datum X)
+{
+	return (uint8) X;
+}
+
+/*
+ * UInt8GetDatum
+ *		Returns datum representation for an 8-bit unsigned integer.
+ */
+static inline Datum
+UInt8GetDatum(uint8 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetInt16
+ *		Returns 16-bit integer value of a datum.
+ */
+static inline int16
+DatumGetInt16(Datum X)
+{
+	return (int16) X;
+}
+
+/*
+ * Int16GetDatum
+ *		Returns datum representation for a 16-bit integer.
+ */
+static inline Datum
+Int16GetDatum(int16 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetUInt16
+ *		Returns 16-bit unsigned integer value of a datum.
+ */
+static inline uint16
+DatumGetUInt16(Datum X)
+{
+	return (uint16) X;
+}
+
+/*
+ * UInt16GetDatum
+ *		Returns datum representation for a 16-bit 

Re: Fix search_path for all maintenance commands

2023-07-22 Thread Jeff Davis
On Sat, 2023-07-22 at 07:04 -0700, Noah Misch wrote:
> Commit a117ceb added that, and it added some test cases that behaved
> differently without that.

Thank you. The reasoning there seems to apply to search_path
restriction as well, so I will leave it as-is.

I'll wait a few more days for comment since I made some changes (also
it's the weekend), but I plan to commit something like the latest
version soon.

I might adjust the CREATE MATERIALIZED VIEW changes to be more minimal,
but that path is not important for security (see pre-existing comment)
so it's probably fine either way.

Regards,
Jeff Davis





Re: Fix search_path for all maintenance commands

2023-07-22 Thread Noah Misch
On Fri, Jul 21, 2023 at 03:32:43PM -0700, Jeff Davis wrote:
> Why do we switch to the table owner and use
> SECURITY_RESTRICTED_OPERATION in DefineIndex(), when we will switch in
> index_build (etc.) anyway?

Commit a117ceb added that, and it added some test cases that behaved
differently without that.

> Similarly, why do we switch in vacuum_rel(),
> when it doesn't matter for lazy vacuum and we will switch in
> cluster_rel() and do_analyze_rel() anyway?

It conforms to the "as soon as possible after locking the relation" coding
rule that commit a117ceb wrote into miscinit.c.  That provides future
proofing.




Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-22 Thread Amit Kapila
On Fri, Jul 21, 2023 at 6:55 AM Masahiko Sawada  wrote:
>
> I've attached the updated patch. I'll push it early next week, barring
> any objections.
>

You have moved most of the comments related to the restriction of
which index can be picked atop IsIndexUsableForReplicaIdentityFull().
Now, the comments related to limitation atop
FindUsableIndexForReplicaIdentityFull() look slightly odd as it refers
to limitations but those limitation were not stated. The comments I am
referring to are: "Note that the limitations of index scans for
replica identity full only  might not be a good idea in some
cases". Shall we move these as well atop
IsIndexUsableForReplicaIdentityFull()?

-- 
With Regards,
Amit Kapila.




Re: WAL Insertion Lock Improvements

2023-07-22 Thread Bharath Rupireddy
On Fri, Jul 21, 2023 at 11:29 AM Michael Paquier  wrote:
>
> +   /* Reading atomically avoids getting a torn value */
> +   value = pg_atomic_read_u64(valptr);
>
> Should this specify that this is specifically important for platforms
> where reading a uint64 could lead to a torn value read, if you apply
> this term in this context?  Sounding picky, I would make that a bit
> longer, say something like that:
> "Reading this value atomically is safe even on platforms where uint64
> cannot be read without observing a torn value."
>
> Only xlogprefetcher.c uses the term "torn" for a value by the way, but
> for a write.

Done.

> 0001 looks OK-ish seen from here.  Thoughts?

Yes, it looks safe to me too. FWIW, 0001 essentially implements what
an existing TODO comment introduced by commit 008608b9d5106 says:

/*
 * Read value using the lwlock's wait list lock, as we can't generally
 * rely on atomic 64 bit reads/stores.  TODO: On platforms with a way to
 * do atomic 64 bit reads/writes the spinlock should be optimized away.
 */

I'm attaching v10 patch set here - 0001 has modified the comment as
above, no other changes in patch set.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v10-0001-Optimize-WAL-insertion-lock-acquisition-and-rele.patch
Description: Binary data


v10-0002-Improve-commets-in-and-around-LWLockWaitForVar-c.patch
Description: Binary data


v10-0003-Have-a-quick-exit-for-LWLockUpdateVar-when-there.patch
Description: Binary data


Re: Row pattern recognition

2023-07-22 Thread Tatsuo Ishii
> On 7/22/23 03:11, Tatsuo Ishii wrote:
> Maybe
> that can help clarify the design? It feels like it'll need to
> eventually
> be a "real" function that operates on the window state, even if it
> doesn't support all of the possible complexities in v1.
 Unfortunately an window function can not call other window functions.
>>> Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case?
> 
>> I am not sure at this point. Current PostgreSQL executor creates
>> WindowStatePerFuncData for each window function and aggregate
>> appearing in OVER clause. This means PREV/NEXT and other row pattern
>> navigation operators cannot have their own WindowStatePerFuncData if
>> they do not appear in OVER clauses in a query even if PREV/NEXT
>> etc. are defined as window function.
>> 
>>> Or
>>> does it make sense to split the pattern navigation "functions" into
>>> their own new concept, and maybe borrow some of the window function
>>> infrastructure for it?
> 
>> Maybe. Suppose a window function executes row pattern matching using
>> price > PREV(price). The window function already receives
>> WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to
>> PREV, we could let PREV do the real work (getting previous tuple).
>> I have not tried this yet, though.
> 
> 
> I don't understand this logic.  Window functions work over a window
> frame.

Yes.

> What we are talking about here is *defining* a window
> frame.

Well, we are defining a "reduced" window frame within a (full) window
frame. A "reduced" window frame is calculated each time when a window
function is called.

> How can a window function execute row pattern matching?

A window function is called for each row fed by an outer plan. It
fetches current, previous and next row to execute pattern matching. If
it matches, the window function moves to next row and repeat the
process, until pattern match fails.

Below is an example window function to execute pattern matching (I
will include this in the v3 patch). row_is_in_reduced_frame() is a
function to execute pattern matching. It returns the number of rows in
the reduced frame if pattern match succeeds. If succeeds, the function
returns the last row in the reduced frame instead of the last row in
the full window frame.

/*
 * last_value
 * return the value of VE evaluated on the last row of the
 * window frame, per spec.
 */
Datum
window_last_value(PG_FUNCTION_ARGS)
{
WindowObject winobj = PG_WINDOW_OBJECT();
Datum   result;
boolisnull;
int64   abspos;
int num_reduced_frame;

abspos = WinGetCurrentPosition(winobj);
num_reduced_frame = row_is_in_reduced_frame(winobj, abspos);

if (num_reduced_frame == 0)
/* no RPR is involved */
result = WinGetFuncArgInFrame(winobj, 0,
  0, 
WINDOW_SEEK_TAIL, true,
  
, NULL);
else if (num_reduced_frame > 0)
/* get last row value in the reduced frame */
result = WinGetFuncArgInFrame(winobj, 0,
  
num_reduced_frame - 1, WINDOW_SEEK_HEAD, true,
  
, NULL);
else
/* RPR is involved and current row is unmatched or skipped */
isnull = true;

if (isnull)
PG_RETURN_NULL();

PG_RETURN_DATUM(result);
}