Re: Inefficiency in parallel pg_restore with many tables
On Mon, Sep 18, 2023 at 02:22:32PM -0700, Nathan Bossart wrote:
> For now, I've committed 0001 and 0002.  I intend to commit the others soon.

I've committed the rest of the patches.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Mon, Sep 18, 2023 at 09:36:03PM -0400, Tom Lane wrote:
> But in any case, how long are we keeping src/tools/msvc/ ?

From a skim of [0], it seems like it could be removed now.  I see a couple
of work-in-progress patches from Andres [1] that would probably serve as a
good starting point.  I won't have much time for this for the next few
weeks, so if someone else wants to pick it up, please feel free.

[0] https://postgr.es/m/20230408191007.7lysd42euafwl74f%40awork3.anarazel.de
[1] https://github.com/anarazel/postgres/commits/drop-homegrown-msvc

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes:
> On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
>> bowerbird is unhappy with this.  I suppose you missed out updating
>> the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

> I saw that and have attempted to fix it with 83223f5.

Ah, right, sorry for the noise.  But in any case, how long are we keeping
src/tools/msvc/ ?

regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
> bowerbird is unhappy with this.  I suppose you missed out updating
> the src/tools/msvc/ scripts.
> (Weren't we about ready to nuke those?)

hamerkop seems to be the only buildfarm member that would complain if these
were to be gone today, on top of bowerbird, of course.

--
Michael
Re: Inefficiency in parallel pg_restore with many tables
On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
> Nathan Bossart writes:
>> For now, I've committed 0001 and 0002.  I intend to commit the others soon.
>
> bowerbird is unhappy with this.  I suppose you missed out updating
> the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

I saw that and have attempted to fix it with 83223f5.  I'm still waiting
for an MSVC animal to report back.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes:
> For now, I've committed 0001 and 0002.  I intend to commit the others soon.

bowerbird is unhappy with this.  I suppose you missed out updating
the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
For now, I've committed 0001 and 0002.  I intend to commit the others soon.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Wed, Sep 13, 2023 at 08:01:39PM -0400, Tom Lane wrote:
> Nathan Bossart writes:
>> Upon closer inspection, I found a rather nasty problem.  The qsort
>> comparator expects a TocEntry **, but the binaryheap comparator expects a
>> TocEntry *, and we simply pass the arguments through to the qsort
>> comparator.  In v9, I added the requisite ampersands.
>
> Ooops :-(
>
>> I'm surprised this worked at all.
>
> Probably it was not sorting things appropriately.  Might be worth adding
> some test scaffolding to check that bigger tasks are chosen before
> smaller ones.

Further testing revealed that the binaryheap comparator function was
actually generating a min-heap, since the qsort comparator sorts by
decreasing dataLength.  This is fixed in v10.  And I am 0 for 2 today...

Now that this appears to be functioning as expected, I see that the larger
entries are typically picked up earlier, but we do sometimes pick entries
quite a bit further down the list, as anticipated.  The case I was testing
(10k tables with the number of rows equal to the table number) was much
faster with this patch (just over a minute) than without it (over 16
minutes).

Sincerest apologies for the noise.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From 59498e233d725d5c427e58f700402908fce2a10f Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v10 1/4] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use of this
simple binary heap implementation.  This commit makes binaryheap usable in
frontend code, much like commit 26aaf97b68 did for StringInfo.  Like
StringInfo, the header file is left in lib/ to reduce the likelihood of
unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This seemed
like a better approach than switching all existing uses to void * or
making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 41 +---
 src/common/meson.build                   |  1 +
 src/include/lib/binaryheap.h             | 26 +++
 src/tools/pgindent/typedefs.list         |  1 +
 7 files changed, 52 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (90%)

[diff hunks truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes:
> Upon closer inspection, I found a rather nasty problem.  The qsort
> comparator expects a TocEntry **, but the binaryheap comparator expects a
> TocEntry *, and we simply pass the arguments through to the qsort
> comparator.  In v9, I added the requisite ampersands.

Ooops :-(

> I'm surprised this worked at all.

Probably it was not sorting things appropriately.  Might be worth adding
some test scaffolding to check that bigger tasks are chosen before
smaller ones.

regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Wed, Sep 13, 2023 at 11:34:50AM -0700, Nathan Bossart wrote:
> On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:
>> Other than those nitpicks, I like v6.  I'll mark this RfC.
>
> Great.  I've posted a v8 with your comments addressed in order to get one
> more round of cfbot coverage.  Assuming those tests pass and there is no
> additional feedback, I'll plan on committing this in the next few days.

Upon closer inspection, I found a rather nasty problem.  The qsort
comparator expects a TocEntry **, but the binaryheap comparator expects a
TocEntry *, and we simply pass the arguments through to the qsort
comparator.  In v9, I added the requisite ampersands.  I'm surprised this
worked at all.

I'm planning to run some additional tests to make sure this patch set
works as expected.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From df3d862907b29687503a8f362e0a3cc3dcbc3cc4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v9 1/4] Make binaryheap available to frontend code.

[commit message and diff identical to the v10 patch above; hunks
truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:
> Hmm ... I do not like v7 very much at all.  It requires rather ugly
> changes to all of the existing callers, and what are we actually
> buying?  If anything, it makes things slower for pass-by-value items
> like integers.  I'd stick with the Datum convention in the backend.
>
> Instead, I took a closer look through the v6 patch set.
> I think that's in pretty good shape and nearly committable,
> but I have a few thoughts:

Thanks for reviewing.  I'm fine with proceeding with the v6 approach.
Even though the alternative approach makes the API consistent for the
frontend and backend, I'm also not a huge fan of the pointer gymnastics
required in the comparators.  Granted, we still have to do some intptr_t
conversions in pg_dump_sort.c with the v6 approach, but that seems to be
an exception.

> * I'm not sure about defining bh_node_type as a macro ... rather than an
> actual typedef ... (If you do choose to make it a typedef, remember to
> add it to typedefs.list.)

I think a typedef makes more sense here.

> * As a matter of style, I'd recommend adding braces in places like this
> ... It's not wrong as you have it, but I think it's more readable and
> less easy to accidentally break with the extra braces.

Fair point.

> * In 0002, isn't the comment for binaryheap_remove_node wrong?
>
> + * Removes the nth node from the heap.  The caller must ensure that there are
> + * at least (n - 1) nodes in the heap.  O(log n) worst case.
>
> Shouldn't that be "(n + 1)"?  Also, I'd specify "n'th (zero based) node"
> for clarity.

Yeah, that's a mistake.

> * I would say that this bit in 0004:
>
> -	j = removeHeapElement(pendingHeap, heapLength--);
> +	j = (intptr_t) binaryheap_remove_first(pendingHeap);
>
> needs an explicit cast to int:
>
> +	j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);
>
> otherwise some compilers might complain about the result possibly
> not fitting in "j".

Sure.  IMO it's a tad more readable, too.

> Other than those nitpicks, I like v6.  I'll mark this RfC.

Great.  I've posted a v8 with your comments addressed in order to get one
more round of cfbot coverage.  Assuming those tests pass and there is no
additional feedback, I'll plan on committing this in the next few days.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From 7dd09c72a44e5ba293c6e9b2d2afa608ba070f03 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v8 1/4] Make binaryheap available to frontend code.

[commit message and diff identical to the v10 patch above; hunks
truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes:
> I spent some more time on this patch and made the relevant adjustments to
> the rest of the set.

Hmm ... I do not like v7 very much at all.  It requires rather ugly
changes to all of the existing callers, and what are we actually
buying?  If anything, it makes things slower for pass-by-value items
like integers.  I'd stick with the Datum convention in the backend.

Instead, I took a closer look through the v6 patch set.
I think that's in pretty good shape and nearly committable,
but I have a few thoughts:

* I'm not sure about defining bh_node_type as a macro:

+#ifdef FRONTEND
+#define bh_node_type void *
+#else
+#define bh_node_type Datum
+#endif

rather than an actual typedef:

+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif

My concern here is that bh_node_type is effectively acting as a
typedef, so that pgindent might misbehave if it's not declared as a
typedef.  On the other hand, there doesn't seem to be any indentation
problem in the patchset as it stands, and we don't expect any code
outside binaryheap.h/.c to refer to bh_node_type, so maybe it's fine.
(If you do choose to make it a typedef, remember to add it to
typedefs.list.)

* As a matter of style, I'd recommend adding braces in places
like this:

 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_nodes[heap->bh_size] = d;

It's not wrong as you have it, but I think it's more readable
and less easy to accidentally break with the extra braces.

* In 0002, isn't the comment for binaryheap_remove_node wrong?

+ * Removes the nth node from the heap.  The caller must ensure that there are
+ * at least (n - 1) nodes in the heap.  O(log n) worst case.

Shouldn't that be "(n + 1)"?  Also, I'd specify "n'th (zero based) node"
for clarity.

* I would say that this bit in 0004:

-	j = removeHeapElement(pendingHeap, heapLength--);
+	j = (intptr_t) binaryheap_remove_first(pendingHeap);

needs an explicit cast to int:

+	j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);

otherwise some compilers might complain about the result possibly
not fitting in "j".

Other than those nitpicks, I like v6.  I'll mark this RfC.

regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Sat, Sep 02, 2023 at 11:55:21AM -0700, Nathan Bossart wrote:
> I ended up hacking together a (nowhere near committable) patch to see how
> hard it would be to allow using any type with binaryheap.  It doesn't seem
> too bad.

I spent some more time on this patch and made the relevant adjustments to
the rest of the set.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From 3a51f25ae712b792bdde90b133c09b0b940f4198 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Mon, 4 Sep 2023 15:04:40 -0700
Subject: [PATCH v7 1/5] Allow binaryheap to use any type.

We would like to make binaryheap available to frontend code in a future
commit, but presently the implementation uses Datum for the node type,
which is inaccessible in the frontend.  This commit modifies the
implementation to allow using any type for the nodes.
binaryheap_allocate() now requires callers to specify the size of the
node, and several of the other functions now use void pointers.
---
 src/backend/executor/nodeGatherMerge.c          | 19 ++--
 src/backend/executor/nodeMergeAppend.c          | 22 ++---
 src/backend/lib/binaryheap.c                    | 99 +++
 src/backend/postmaster/pgarch.c                 | 36 ---
 .../replication/logical/reorderbuffer.c         | 21 ++--
 src/backend/storage/buffer/bufmgr.c             | 20 ++--
 src/include/lib/binaryheap.h                    | 21 ++--
 7 files changed, 137 insertions(+), 101 deletions(-)

[diff hunks truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Sep 03, 2023 at 12:04:00PM +0200, Alvaro Herrera wrote:
> On 2023-Sep-02, Nathan Bossart wrote:
>> I ended up hacking together a (nowhere near committable) patch to see how
>> hard it would be to allow using any type with binaryheap.  It doesn't seem
>> too bad.
>
> Yeah, using void * seems to lead to interfaces that are pretty much the
> same as bsearch() or qsort().

Right.  This is what I had in mind.

> (Why isn't your payload type const, though?)

It probably should be const.  This patch was just a proof-of-concept and
still requires a bit of work.

> I do wonder why did you change _remove_first and _first to have a
> 'result' output argument instead of a return value.  Does this change
> actually buy you anything?  simplehash.h doesn't do that either.
>
>> -extern void binaryheap_add(binaryheap *heap, Datum d);
>> -extern Datum binaryheap_first(binaryheap *heap);
>> -extern Datum binaryheap_remove_first(binaryheap *heap);
>> -extern void binaryheap_replace_first(binaryheap *heap, Datum d);
>> +extern void binaryheap_add(binaryheap *heap, void *d);
>> +extern void binaryheap_first(binaryheap *heap, void *result);
>> +extern void binaryheap_remove_first(binaryheap *heap, void *result);
>> +extern void binaryheap_replace_first(binaryheap *heap, void *d);

_first could likely just return a pointer to the data in the binary heap's
array.  However, _remove_first has to copy the data somewhere, so I think
the alternative would be to return a palloc'd value.  Is there another way
that I'm not thinking of?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On 2023-Sep-02, Nathan Bossart wrote:
> On Fri, Sep 01, 2023 at 01:52:48PM -0700, Nathan Bossart wrote:
>> Yeah, something similar to simplehash for binary heaps could be nice.
>> That being said, I don't know if there's a strong reason to specialize
>> the implementation for a given C data type in most cases.
>
> I ended up hacking together a (nowhere near committable) patch to see how
> hard it would be to allow using any type with binaryheap.  It doesn't seem
> too bad.

Yeah, using void * seems to lead to interfaces that are pretty much the
same as bsearch() or qsort().  (Why isn't your payload type const,
though?)

I do wonder why did you change _remove_first and _first to have a
'result' output argument instead of a return value.  Does this change
actually buy you anything?  simplehash.h doesn't do that either.

> -extern void binaryheap_add(binaryheap *heap, Datum d);
> -extern Datum binaryheap_first(binaryheap *heap);
> -extern Datum binaryheap_remove_first(binaryheap *heap);
> -extern void binaryheap_replace_first(binaryheap *heap, Datum d);
> +extern void binaryheap_add(binaryheap *heap, void *d);
> +extern void binaryheap_first(binaryheap *heap, void *result);
> +extern void binaryheap_remove_first(binaryheap *heap, void *result);
> +extern void binaryheap_replace_first(binaryheap *heap, void *d);

--
Álvaro Herrera               Breisgau, Deutschland
https://www.EnterpriseDB.com/
Re: Inefficiency in parallel pg_restore with many tables
On Fri, Sep 01, 2023 at 01:52:48PM -0700, Nathan Bossart wrote:
> On Fri, Sep 01, 2023 at 04:00:44PM -0400, Robert Haas wrote:
>> In hindsight, I think that making binaryheap depend on Datum was a bad
>> idea.  I think that was my idea, and I think it wasn't very smart.
>> Considering that people have coded to that decision up until now, it
>> might not be too easy to change at this point.  But in principle I
>> guess you'd want to be able to make a heap out of any C data type,
>> rather than just Datum, or just Datum in the backend and just void *
>> in the frontend.
>
> Yeah, something similar to simplehash for binary heaps could be nice.
> That being said, I don't know if there's a strong reason to specialize
> the implementation for a given C data type in most cases.

I ended up hacking together a (nowhere near committable) patch to see how
hard it would be to allow using any type with binaryheap.  It doesn't seem
too bad.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From fa51a6b48d1a7cdcffbd5ccbe4274f39dfb0c1b5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 2 Sep 2023 11:48:36 -0700
Subject: [PATCH 1/1] allow binaryheap to use any type
---
 src/backend/executor/nodeGatherMerge.c          | 20 +++---
 src/backend/executor/nodeMergeAppend.c          | 20 +++---
 src/backend/lib/binaryheap.c                    | 71 ++-
 src/backend/postmaster/pgarch.c                 | 37 +-
 .../replication/logical/reorderbuffer.c         | 22 +++---
 src/backend/storage/buffer/bufmgr.c             | 21 +++---
 src/include/lib/binaryheap.h                    | 17 ++---
 7 files changed, 112 insertions(+), 96 deletions(-)

[diff hunks truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
On Fri, Sep 01, 2023 at 04:00:44PM -0400, Robert Haas wrote: > In hindsight, I think that making binaryheap depend on Datum was a bad > idea. I think that was my idea, and I think it wasn't very smart. > Considering that people have coded to that decision up until now, it > might not be too easy to change at this point. But in principle I > guess you'd want to be able to make a heap out of any C data type, > rather than just Datum, or just Datum in the backend and just void * > in the frontend. Yeah, something similar to simplehash for binary heaps could be nice. That being said, I don't know if there's a strong reason to specialize the implementation for a given C data type in most cases. I suspect many callers are just fine with dealing with pointers (e.g., I wouldn't store an entire TocEntry in the array), and smaller types like integers are already stored directly in the array thanks to the use of Datum. However, it _would_ allow us to abandon this frontend/backend void */Datum kludge, which is something. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Tue, Jul 25, 2023 at 2:53 PM Nathan Bossart wrote: > On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote: > > Here is a sketch of this approach. It required fewer #ifdefs than I was > > expecting. At the moment, this one seems like the winner to me. > > Here is a polished patch set for this approach. I've also added a 0004 > that replaces the open-coded heap in pg_dump_sort.c with a binaryheap. > IMHO these patches are in decent shape. [ drive-by comment that hopefully doesn't cause too much pain ] In hindsight, I think that making binaryheap depend on Datum was a bad idea. I think that was my idea, and I think it wasn't very smart. Considering that people have coded to that decision up until now, it might not be too easy to change at this point. But in principle I guess you'd want to be able to make a heap out of any C data type, rather than just Datum, or just Datum in the backend and just void * in the frontend. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Inefficiency in parallel pg_restore with many tables
On Fri, Sep 01, 2023 at 01:41:41PM -0400, Tom Lane wrote: > I've not actually looked at any of these patchsets after the first one. > I have added myself as a reviewer and will hopefully get to it within > a week or so. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes: > I'm hoping to commit these patches at some point in the current commitfest. > I don't sense anything tremendously controversial, and they provide a > pretty nice speedup in some cases. Are there any remaining concerns? I've not actually looked at any of these patchsets after the first one. I have added myself as a reviewer and will hopefully get to it within a week or so. regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Tue, Jul 25, 2023 at 11:53:36AM -0700, Nathan Bossart wrote: > Here is a polished patch set for this approach. I've also added a 0004 > that replaces the open-coded heap in pg_dump_sort.c with a binaryheap. > IMHO these patches are in decent shape. I'm hoping to commit these patches at some point in the current commitfest. I don't sense anything tremendously controversial, and they provide a pretty nice speedup in some cases. Are there any remaining concerns? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote: > Here is a sketch of this approach. It required fewer #ifdefs than I was > expecting. At the moment, this one seems like the winner to me. Here is a polished patch set for this approach. I've also added a 0004 that replaces the open-coded heap in pg_dump_sort.c with a binaryheap. IMHO these patches are in decent shape. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From def4af18e68cece22008863f664e9e0fd060f917 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jul 2023 15:04:45 -0700 Subject: [PATCH v6 1/4] Make binaryheap available to frontend code. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There are a couple of places in frontend code that could make use of this simple binary heap implementation. This commit makes binaryheap usable in frontend code, much like commit 26aaf97b68 did for StringInfo. Like StringInfo, the header file is left in lib/ to reduce the likelihood of unnecessary breakage. The frontend version of binaryheap exposes a void *-based API since frontend code does not have access to the Datum definitions. This seemed like a better approach than switching all existing uses to void * or making the Datum definitions available to frontend code. Reviewed-by: Tom Lane, Álvaro Herrera Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us --- src/backend/lib/Makefile | 1 - src/backend/lib/meson.build | 1 - src/common/Makefile | 1 + src/{backend/lib => common}/binaryheap.c | 37 +--- src/common/meson.build | 1 + src/include/lib/binaryheap.h | 26 - 6 files changed, 47 insertions(+), 20 deletions(-) rename src/{backend/lib => common}/binaryheap.c (91%) diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile index 9dad31398a..b6cefd9cca 100644 --- a/src/backend/lib/Makefile +++ b/src/backend/lib/Makefile @@ -13,7 +13,6 @@ top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global OBJS = \ - binaryheap.o \ bipartite_match.o \ bloomfilter.o \ dshash.o \ diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build index 974cab8776..b4e88f54ae 100644 --- a/src/backend/lib/meson.build +++ b/src/backend/lib/meson.build @@ -1,7 +1,6 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group backend_sources += files( - 'binaryheap.c', 'bipartite_match.c', 'bloomfilter.c', 'dshash.c', diff --git a/src/common/Makefile b/src/common/Makefile index 113029bf7b..cc5c54dcee 100644 --- a/src/common/Makefile +++ b/src/common/Makefile @@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS) OBJS_COMMON = \ archive.o \ base64.o \ + binaryheap.o \ checksum_helper.o \ compression.o \ config_info.o \ diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c similarity index 91% rename from src/backend/lib/binaryheap.c rename to src/common/binaryheap.c index 1737546757..4836cd1c6d 100644 --- a/src/backend/lib/binaryheap.c +++ b/src/common/binaryheap.c @@ -6,15 +6,22 @@ * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group * * IDENTIFICATION - * src/backend/lib/binaryheap.c + * src/common/binaryheap.c * *- */ +#ifdef FRONTEND +#include "postgres_fe.h" +#else #include "postgres.h" +#endif #include +#ifdef FRONTEND +#include "common/logging.h" +#endif #include "lib/binaryheap.h" static void sift_down(binaryheap *heap, int node_off); @@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) int sz; binaryheap *heap; - sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity; + sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity; heap = (binaryheap *) palloc(sz); heap->bh_space = capacity; heap->bh_compare = compare; @@ -106,10 +113,14 @@ parent_offset(int i) * afterwards. 
*/ void -binaryheap_add_unordered(binaryheap *heap, Datum d) +binaryheap_add_unordered(binaryheap *heap, bh_node_type d) { if (heap->bh_size >= heap->bh_space) +#ifdef FRONTEND + pg_fatal("out of binary heap slots"); +#else elog(ERROR, "out of binary heap slots"); +#endif heap->bh_has_heap_property = false; heap->bh_nodes[heap->bh_size] = d; heap->bh_size++; @@ -138,10 +149,14 @@ binaryheap_build(binaryheap *heap) * the heap property. */ void -binaryheap_add(binaryheap *heap, Datum d) +binaryheap_add(binaryheap *heap, bh_node_type d) { if (heap->bh_size >= heap->bh_space) +#ifdef FRONTEND + pg_fatal("out of binary heap slots"); +#else elog(ERROR, "out of binary heap slots"); +#endif heap->bh_nodes[heap->bh_size] = d; heap->bh_size++; sift_up(heap, heap->bh_size - 1); @@ -154,7 +169,7 @@ binaryheap_add(binaryheap *heap, Datum d) * without modifying the heap. The caller must ensure
Re: Inefficiency in parallel pg_restore with many tables
On Sat, Jul 22, 2023 at 10:57:03PM -0700, Nathan Bossart wrote: > On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote: >> I wonder whether we can't provide some alternate definition or "skin" >> for binaryheap that preserves the Datum API for backend code that wants >> that, while providing a void *-based API for frontend code to use. > > I can give this a try next, but it might be rather #ifdef-heavy. Here is a sketch of this approach. It required fewer #ifdefs than I was expecting. At the moment, this one seems like the winner to me. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 13017359cc6dd35f3e465280373b4b82e87b7c5b Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jul 2023 15:04:45 -0700 Subject: [PATCH v5 1/3] make binaryheap available to frontend --- src/backend/lib/Makefile | 1 - src/backend/lib/meson.build | 1 - src/common/Makefile | 1 + src/{backend/lib => common}/binaryheap.c | 37 +--- src/common/meson.build | 1 + src/include/lib/binaryheap.h | 23 ++- 6 files changed, 44 insertions(+), 20 deletions(-) rename src/{backend/lib => common}/binaryheap.c (91%) diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile index 9dad31398a..b6cefd9cca 100644 --- a/src/backend/lib/Makefile +++ b/src/backend/lib/Makefile @@ -13,7 +13,6 @@ top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global OBJS = \ - binaryheap.o \ bipartite_match.o \ bloomfilter.o \ dshash.o \ diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build index 974cab8776..b4e88f54ae 100644 --- a/src/backend/lib/meson.build +++ b/src/backend/lib/meson.build @@ -1,7 +1,6 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group backend_sources += files( - 'binaryheap.c', 'bipartite_match.c', 'bloomfilter.c', 'dshash.c', diff --git a/src/common/Makefile b/src/common/Makefile index 113029bf7b..cc5c54dcee 100644 --- a/src/common/Makefile +++ b/src/common/Makefile @@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS) OBJS_COMMON = \ archive.o \ base64.o \ + binaryheap.o \ checksum_helper.o \ compression.o \ config_info.o \ diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c similarity index 91% rename from src/backend/lib/binaryheap.c rename to src/common/binaryheap.c index 1737546757..a6be4b42ae 100644 --- a/src/backend/lib/binaryheap.c +++ b/src/common/binaryheap.c @@ -6,15 +6,22 @@ * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group * * IDENTIFICATION - * src/backend/lib/binaryheap.c + * src/common/binaryheap.c * *- */ +#ifdef FRONTEND +#include "postgres_fe.h" +#else #include "postgres.h" +#endif #include +#ifdef FRONTEND +#include "common/logging.h" +#endif #include "lib/binaryheap.h" static void sift_down(binaryheap *heap, int node_off); @@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) int sz; binaryheap *heap; - sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity; + sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_elem_type) * capacity; heap = (binaryheap *) palloc(sz); heap->bh_space = capacity; heap->bh_compare = compare; @@ -106,10 +113,14 @@ parent_offset(int i) * afterwards. 
*/ void -binaryheap_add_unordered(binaryheap *heap, Datum d) +binaryheap_add_unordered(binaryheap *heap, bh_elem_type d) { if (heap->bh_size >= heap->bh_space) +#ifdef FRONTEND + pg_fatal("out of binary heap slots"); +#else elog(ERROR, "out of binary heap slots"); +#endif heap->bh_has_heap_property = false; heap->bh_nodes[heap->bh_size] = d; heap->bh_size++; @@ -138,10 +149,14 @@ binaryheap_build(binaryheap *heap) * the heap property. */ void -binaryheap_add(binaryheap *heap, Datum d) +binaryheap_add(binaryheap *heap, bh_elem_type d) { if (heap->bh_size >= heap->bh_space) +#ifdef FRONTEND + pg_fatal("out of binary heap slots"); +#else elog(ERROR, "out of binary heap slots"); +#endif heap->bh_nodes[heap->bh_size] = d; heap->bh_size++; sift_up(heap, heap->bh_size - 1); @@ -154,7 +169,7 @@ binaryheap_add(binaryheap *heap, Datum d) * without modifying the heap. The caller must ensure that this * routine is not used on an empty heap. Always O(1). */ -Datum +bh_elem_type binaryheap_first(binaryheap *heap) { Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); @@ -169,10 +184,10 @@ binaryheap_first(binaryheap *heap) * that this routine is not used on an empty heap. O(log n) worst * case. */ -Datum +bh_elem_type binaryheap_remove_first(binaryheap *heap) { - Datum result; + bh_elem_type result; Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); @@ -204,7 +219,7 @@ binaryheap_remove_first(binaryheap *heap) * sifting the new node down. */ void
Re: Inefficiency in parallel pg_restore with many tables
On Saturday, July 15, 2023 7:47:12 PM CEST Tom Lane wrote:
> I'm not sure how big a deal this is in practice: in most situations
> the individual jobs are larger than they are in this toy example,
> plus the initial non-parallelizable part of the restore is a bigger
> bottleneck anyway with this many tables. Still, we do have one
> real-world complaint, so maybe we should look into improving it.

Hi,

For what it's worth, at my current job it's kind of a big deal. I was about to start looking into the bad pg_restore performance I was seeing on some databases with over 50k tables (in 200 namespaces) when I found this thread. The dump weighs in at about 2.8GB; the toc.dat file is 230MB, covering 50,120 tables, 142,069 constraints, and 73,669 indexes.

HEAD pg_restore duration: 30 minutes
pg_restore with latest patch from Nathan Bossart: 23 minutes

This is indeed better, but there is still a lot of room for improvement. For use cases like this, I was able to go much faster with the patched pg_restore plus a script that parallelizes over schemas instead of relying on the choices made by pg_restore. It seems that parallelizing only the data loading loses significant speedup opportunities when there is a huge number of objects.

patched pg_restore + parallel restore of schemas: 10 minutes

Anyway, the patch works really well as is, and I will certainly keep trying future iterations.

Regards

Pierre

Re: Inefficiency in parallel pg_restore with many tables
On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote: > Nathan Bossart writes: >> I first tried modifying >> binaryheap to use "int" or "void *" instead, but that ended up requiring >> some rather invasive changes in backend code, not to mention any extensions >> that happen to be using it. I followed through with the "void *" approach (attached), and it wasn't as bad as I expected. > I wonder whether we can't provide some alternate definition or "skin" > for binaryheap that preserves the Datum API for backend code that wants > that, while providing a void *-based API for frontend code to use. I can give this a try next, but it might be rather #ifdef-heavy. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From b570460b19c5955c91b1fc28e8fdc791f41998e2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jul 2023 20:59:46 -0700 Subject: [PATCH v4 1/4] use "void *" instead of "Datum" in binaryheap --- src/backend/executor/nodeGatherMerge.c| 16 +- src/backend/executor/nodeMergeAppend.c| 16 +- src/backend/lib/binaryheap.c | 20 ++-- src/backend/postmaster/pgarch.c | 18 +-- .../replication/logical/reorderbuffer.c | 31 +-- src/backend/storage/buffer/bufmgr.c | 11 +++ src/include/lib/binaryheap.h | 14 - 7 files changed, 61 insertions(+), 65 deletions(-) diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c index 9d5e1a46e9..021682d87c 100644 --- a/src/backend/executor/nodeGatherMerge.c +++ b/src/backend/executor/nodeGatherMerge.c @@ -52,7 +52,7 @@ typedef struct GMReaderTupleBuffer } GMReaderTupleBuffer; static TupleTableSlot *ExecGatherMerge(PlanState *pstate); -static int32 heap_compare_slots(Datum a, Datum b, void *arg); +static int32 heap_compare_slots(const void *a, const void *b, void *arg); static TupleTableSlot *gather_merge_getnext(GatherMergeState *gm_state); static MinimalTuple gm_readnext_tuple(GatherMergeState *gm_state, int nreader, bool nowait, bool *done); @@ -489,7 +489,7 @@ reread: /* Don't 
have a tuple yet, try to get one */ if (gather_merge_readnext(gm_state, i, nowait)) binaryheap_add_unordered(gm_state->gm_heap, - Int32GetDatum(i)); + (void *) (intptr_t) i); } else { @@ -565,10 +565,10 @@ gather_merge_getnext(GatherMergeState *gm_state) * the heap, because it might now compare differently against the * other elements of the heap. */ - i = DatumGetInt32(binaryheap_first(gm_state->gm_heap)); + i = (intptr_t) binaryheap_first(gm_state->gm_heap); if (gather_merge_readnext(gm_state, i, false)) - binaryheap_replace_first(gm_state->gm_heap, Int32GetDatum(i)); + binaryheap_replace_first(gm_state->gm_heap, (void *) (intptr_t) i); else { /* reader exhausted, remove it from heap */ @@ -585,7 +585,7 @@ gather_merge_getnext(GatherMergeState *gm_state) else { /* Return next tuple from whichever participant has the leading one */ - i = DatumGetInt32(binaryheap_first(gm_state->gm_heap)); + i = (intptr_t) binaryheap_first(gm_state->gm_heap); return gm_state->gm_slots[i]; } } @@ -750,11 +750,11 @@ typedef int32 SlotNumber; * Compare the tuples in the two given slots. 
*/ static int32 -heap_compare_slots(Datum a, Datum b, void *arg) +heap_compare_slots(const void *a, const void *b, void *arg) { GatherMergeState *node = (GatherMergeState *) arg; - SlotNumber slot1 = DatumGetInt32(a); - SlotNumber slot2 = DatumGetInt32(b); + SlotNumber slot1 = (intptr_t) a; + SlotNumber slot2 = (intptr_t) b; TupleTableSlot *s1 = node->gm_slots[slot1]; TupleTableSlot *s2 = node->gm_slots[slot2]; diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c index 21b5726e6e..b3e2c29401 100644 --- a/src/backend/executor/nodeMergeAppend.c +++ b/src/backend/executor/nodeMergeAppend.c @@ -52,7 +52,7 @@ typedef int32 SlotNumber; static TupleTableSlot *ExecMergeAppend(PlanState *pstate); -static int heap_compare_slots(Datum a, Datum b, void *arg); +static int heap_compare_slots(const void *a, const void *b, void *arg); /* @@ -229,7 +229,7 @@ ExecMergeAppend(PlanState *pstate) { node->ms_slots[i] = ExecProcNode(node->mergeplans[i]); if (!TupIsNull(node->ms_slots[i])) -binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i)); +binaryheap_add_unordered(node->ms_heap, (void *) (intptr_t) i); } binaryheap_build(node->ms_heap); node->ms_initialized = true; @@ -244,10 +244,10 @@ ExecMergeAppend(PlanState *pstate) * by doing this before returning from the prior call, but it's better * to not pull tuples until necessary.) */ - i = DatumGetInt32(binaryheap_first(node->ms_heap)); + i = (intptr_t) binaryheap_first(node->ms_heap); node->ms_slots[i] =
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes: > On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote: >> One item that requires more thought is binaryheap's use of Datum. AFAICT >> the Datum definitions live in postgres.h and aren't available to frontend >> code. I think we'll either need to move the Datum definitions to c.h or to >> adjust binaryheap to use "void *". > In v3, I moved the Datum definitions to c.h. I first tried modifying > binaryheap to use "int" or "void *" instead, but that ended up requiring > some rather invasive changes in backend code, not to mention any extensions > that happen to be using it. I'm quite uncomfortable with putting Datum in c.h. I know that the typedef is merely a uintptr_t, but this solution seems to me to be blowing all kinds of holes in the abstraction, because exactly none of the infrastructure that goes along with Datum is or is ever likely to be in any frontend build. At the very least, frontend code that refers to Datum will be misleading as hell. I wonder whether we can't provide some alternate definition or "skin" for binaryheap that preserves the Datum API for backend code that wants that, while providing a void *-based API for frontend code to use. regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Sat, Jul 22, 2023 at 04:19:41PM -0700, Nathan Bossart wrote: > In v3, I moved the Datum definitions to c.h. I first tried modifying > binaryheap to use "int" or "void *" instead, but that ended up requiring > some rather invasive changes in backend code, not to mention any extensions > that happen to be using it. I also looked into moving the definitions to a > separate datumdefs.h header that postgres.h would include, but that felt > awkward because 1) postgres.h clearly states that it is intended for things > "that never escape the backend" and 2) the definitions seem relatively > inexpensive. However, I think the latter option is still viable, so I'm > fine with switching to it if folks think that is a better approach. BTW we might be able to replace the open-coded heap in pg_dump_sort.c (added by 79273cc) with a binaryheap, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote: > Here is a work-in-progress patch set for converting ready_list to a > priority queue. On my machine, Tom's 100k-table example [0] takes 11.5 > minutes without these patches and 1.5 minutes with them. > > One item that requires more thought is binaryheap's use of Datum. AFAICT > the Datum definitions live in postgres.h and aren't available to frontend > code. I think we'll either need to move the Datum definitions to c.h or to > adjust binaryheap to use "void *". In v3, I moved the Datum definitions to c.h. I first tried modifying binaryheap to use "int" or "void *" instead, but that ended up requiring some rather invasive changes in backend code, not to mention any extensions that happen to be using it. I also looked into moving the definitions to a separate datumdefs.h header that postgres.h would include, but that felt awkward because 1) postgres.h clearly states that it is intended for things "that never escape the backend" and 2) the definitions seem relatively inexpensive. However, I think the latter option is still viable, so I'm fine with switching to it if folks think that is a better approach. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 56066f836f389ca0efe169c1ea8d769f70aaa817 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jul 2023 14:53:49 -0700 Subject: [PATCH v3 1/4] move datum definitions to c.h --- src/include/c.h| 512 src/include/postgres.h | 519 + 2 files changed, 515 insertions(+), 516 deletions(-) diff --git a/src/include/c.h b/src/include/c.h index f69d739be5..daee2f50eb 100644 --- a/src/include/c.h +++ b/src/include/c.h @@ -35,6 +35,7 @@ * 7) widely useful macros * 8) random stuff * 9) system-specific hacks + * 10) Datum type + support functions * * NOTE: since this file is included by both frontend and backend modules, * it's usually wrong to put an "extern" declaration here, unless it's @@ -1374,4 +1375,515 @@ typedef intptr_t sigjmp_buf[5]; /* /port compatibility functions */ #include "port.h" +/* + *Section 10: Datum type + support functions + * + */ + +/* + * A Datum contains either a value of a pass-by-value type or a pointer to a + * value of a pass-by-reference type. Therefore, we require: + * + * sizeof(Datum) == sizeof(void *) == 4 or 8 + * + * The functions below and the analogous functions for other types should be used to + * convert between a Datum and the appropriate C type. + */ + +typedef uintptr_t Datum; + +/* + * A NullableDatum is used in places where both a Datum and its nullness needs + * to be stored. This can be more efficient than storing datums and nullness + * in separate arrays, due to better spatial locality, even if more space may + * be wasted due to padding. + */ +typedef struct NullableDatum +{ +#define FIELDNO_NULLABLE_DATUM_DATUM 0 + Datum value; +#define FIELDNO_NULLABLE_DATUM_ISNULL 1 + bool isnull; + /* due to alignment padding this could be used for flags for free */ +} NullableDatum; + +#define SIZEOF_DATUM SIZEOF_VOID_P + +/* + * DatumGetBool + * Returns boolean value of a datum. + * + * Note: any nonzero value will be considered true. 
+ */ +static inline bool +DatumGetBool(Datum X) +{ + return (X != 0); +} + +/* + * BoolGetDatum + * Returns datum representation for a boolean. + * + * Note: any nonzero value will be considered true. + */ +static inline Datum +BoolGetDatum(bool X) +{ + return (Datum) (X ? 1 : 0); +} + +/* + * DatumGetChar + * Returns character value of a datum. + */ +static inline char +DatumGetChar(Datum X) +{ + return (char) X; +} + +/* + * CharGetDatum + * Returns datum representation for a character. + */ +static inline Datum +CharGetDatum(char X) +{ + return (Datum) X; +} + +/* + * Int8GetDatum + * Returns datum representation for an 8-bit integer. + */ +static inline Datum +Int8GetDatum(int8 X) +{ + return (Datum) X; +} + +/* + * DatumGetUInt8 + * Returns 8-bit unsigned integer value of a datum. + */ +static inline uint8 +DatumGetUInt8(Datum X) +{ + return (uint8) X; +} + +/* + * UInt8GetDatum + * Returns datum representation for an 8-bit unsigned integer. + */ +static inline Datum +UInt8GetDatum(uint8 X) +{ + return (Datum) X; +} + +/* + * DatumGetInt16 + * Returns 16-bit integer value of a datum. + */ +static inline int16 +DatumGetInt16(Datum X) +{ + return (int16) X; +} + +/* + * Int16GetDatum + * Returns datum representation for a 16-bit integer. + */ +static inline Datum +Int16GetDatum(int16 X) +{ + return (Datum) X; +} + +/* + * DatumGetUInt16 + * Returns 16-bit unsigned integer value of a datum. + */ +static inline uint16 +DatumGetUInt16(Datum X) +{ + return (uint16) X; +} + +/* + * UInt16GetDatum + * Returns datum representation for a 16-bit
Re: Inefficiency in parallel pg_restore with many tables
Here is a work-in-progress patch set for converting ready_list to a priority queue. On my machine, Tom's 100k-table example [0] takes 11.5 minutes without these patches and 1.5 minutes with them. One item that requires more thought is binaryheap's use of Datum. AFAICT the Datum definitions live in postgres.h and aren't available to frontend code. I think we'll either need to move the Datum definitions to c.h or to adjust binaryheap to use "void *". [0] https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From bbf7b32f0abd0e9e2f7a0759cc79fcccbb5f9281 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 19 Jul 2023 15:54:00 -0700 Subject: [PATCH v2 1/4] misc binaryheap fixes --- src/backend/postmaster/pgarch.c | 12 ++-- src/backend/replication/logical/reorderbuffer.c | 2 +- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 46af349564..c28e6f2070 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -108,7 +108,7 @@ static ArchiveModuleState *archive_module_state; * archive_status. Minimizing the number of directory scans when there are * many files to archive can significantly improve archival rate. * - * arch_heap is a max-heap that is used during the directory scan to track + * arch_heap is a min-heap that is used during the directory scan to track * the highest-priority files to archive. After the directory scan * completes, the file names are stored in ascending order of priority in * arch_files. pgarch_readyXlog() returns files from arch_files until it @@ -248,7 +248,7 @@ PgArchiverMain(void) arch_files = palloc(sizeof(struct arch_files_state)); arch_files->arch_files_size = 0; - /* Initialize our max-heap for prioritizing files to archive. */ + /* Initialize our min-heap for prioritizing files to archive. 
*/ arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN, ready_file_comparator, NULL); @@ -624,7 +624,7 @@ pgarch_readyXlog(char *xlog) basename[basenamelen] = '\0'; /* - * Store the file in our max-heap if it has a high enough priority. + * Store the file in our min-heap if it has a high enough priority. */ if (arch_files->arch_heap->bh_size < NUM_FILES_PER_DIRECTORY_SCAN) { @@ -644,15 +644,15 @@ pgarch_readyXlog(char *xlog) * Remove the lowest priority file and add the current one to the * heap. */ - arch_file = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap)); + arch_file = DatumGetCString(binaryheap_first(arch_files->arch_heap)); strcpy(arch_file, basename); - binaryheap_add(arch_files->arch_heap, CStringGetDatum(arch_file)); + binaryheap_replace_first(arch_files->arch_heap, CStringGetDatum(arch_file)); } } FreeDir(rldir); /* If no files were found, simply return. */ - if (arch_files->arch_heap->bh_size == 0) + if (binaryheap_empty(arch_files->arch_heap)) return false; /* diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 26d252bd87..05bbcecd29 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1381,7 +1381,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state) int32 off; /* nothing there anymore */ - if (state->heap->bh_size == 0) + if (binaryheap_empty(state->heap)) return NULL; off = DatumGetInt32(binaryheap_first(state->heap)); -- 2.25.1 >From 8f0339263dd21bb04c1b96a0e850a9b57a2e3f8b Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 20 Jul 2023 09:45:03 -0700 Subject: [PATCH v2 2/4] make binaryheap available to frontend --- src/backend/lib/Makefile | 1 - src/backend/lib/meson.build | 1 - src/common/Makefile | 1 + src/{backend/lib => common}/binaryheap.c | 15 src/common/meson.build | 1 + src/include/lib/binaryheap.h | 29 6 files changed, 46 insertions(+), 2 
deletions(-) rename src/{backend/lib => common}/binaryheap.c (96%) diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile index 9dad31398a..b6cefd9cca 100644 --- a/src/backend/lib/Makefile +++ b/src/backend/lib/Makefile @@ -13,7 +13,6 @@ top_builddir = ../../.. include $(top_builddir)/src/Makefile.global OBJS = \ - binaryheap.o \ bipartite_match.o \ bloomfilter.o \ dshash.o \ diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build index 974cab8776..b4e88f54ae 100644 --- a/src/backend/lib/meson.build +++ b/src/backend/lib/meson.build @@ -1,7 +1,6 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group backend_sources += files( - 'binaryheap.c', 'bipartite_match.c', 'bloomfilter.c', 'dshash.c', diff --git
Re: Inefficiency in parallel pg_restore with many tables
On Tue, Jul 18, 2023 at 06:05:11PM +0200, Alvaro Herrera wrote: > On 2023-Jul-17, Nathan Bossart wrote: > >> @@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator >> compare, void *arg) >> binaryheap *heap; >> >> sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity; >> +#ifdef FRONTEND >> +heap = (binaryheap *) pg_malloc(sz); >> +#else >> heap = (binaryheap *) palloc(sz); >> +#endif > > Hmm, as I recall fe_memutils.c provides you with palloc() in the > frontend environment, so you don't actually need this one. Ah, yes it does. Thanks for the pointer. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On 2023-Jul-17, Nathan Bossart wrote: > @@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator > compare, void *arg) > binaryheap *heap; > > sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity; > +#ifdef FRONTEND > + heap = (binaryheap *) pg_malloc(sz); > +#else > heap = (binaryheap *) palloc(sz); > +#endif Hmm, as I recall fe_memutils.c provides you with palloc() in the frontend environment, so you don't actually need this one. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Jul 16, 2023 at 08:54:24PM -0700, Nathan Bossart wrote:
> This seems worth a try.  IIUC you are suggesting making binaryheap.c
> frontend-friendly and expanding its API a bit.  If no one has volunteered,
> I could probably hack something together.

I spent some time on the binaryheap changes.  I haven't had a chance to
plug it into the ready_list yet.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/lib/binaryheap.c b/src/backend/lib/binaryheap.c
index 1737546757..36b6f5c51f 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/backend/lib/binaryheap.c
@@ -11,10 +11,17 @@
  *-------------------------------------------------------------------------
  */
 
+#ifndef FRONTEND
 #include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	binaryheap *heap;
 
 	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+#ifdef FRONTEND
+	heap = (binaryheap *) pg_malloc(sz);
+#else
 	heap = (binaryheap *) palloc(sz);
+#endif
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
@@ -109,7 +120,11 @@ void
 binaryheap_add_unordered(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -141,7 +156,11 @@ void
 binaryheap_add(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -196,6 +215,65 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove
+ *
+ * Removes the given datum from the heap.  The caller must ensure that the
+ * datum exists in the heap.  Always O(n).
+ */
+void
+binaryheap_remove(binaryheap *heap, Datum d)
+{
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+
+	for (int i = 0; i < heap->bh_size; i++)
+	{
+		if (heap->bh_compare(heap->bh_nodes[i],
+							 d,
+							 heap->bh_arg) == 0)
+		{
+			binaryheap_remove_node(heap, &heap->bh_nodes[i]);
+			return;
+		}
+	}
+
+#ifdef FRONTEND
+	pg_fatal("datum not found in heap");
+#else
+	elog(ERROR, "datum not found in heap");
+#endif
+}
+
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the given node from the heap.  The caller must ensure that the node
+ * exists in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, Datum *n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= &heap->bh_nodes[0]);
+	Assert(n <= &heap->bh_nodes[heap->bh_size - 1]);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   *n,
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	*n = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n - &heap->bh_nodes[0]);
+	else if (cmp < 0)
+		sift_down(heap, n - &heap->bh_nodes[0]);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..411f7358ba 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -47,6 +47,8 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, Datum d);
 extern Datum binaryheap_first(binaryheap *heap);
 extern Datum binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove(binaryheap *heap, Datum d);
+extern void binaryheap_remove_node(binaryheap *heap, Datum *n);
 extern void binaryheap_replace_first(binaryheap *heap, Datum d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Jul 16, 2023 at 09:45:54AM -0400, Tom Lane wrote: > Actually, as long as we're talking about approximately-correct behavior: > let's make the ready_list be a priority heap, and then just make > pop_next_work_item scan forward from the array start until it finds an > item that's runnable per the lock heuristic. If the heap root is > blocked, the next things we'll examine will be its two children. > We might pick the lower-priority of those two, but it's still known to > be higher priority than at least 50% of the remaining heap entries, so > it shouldn't be too awful as a choice. The argument gets weaker the > further you go into the heap, but we're not expecting that having most > of the top entries blocked will be a common case. (Besides which, the > priorities are pretty crude to begin with.) Once selected, pulling out > an entry that is not the heap root is no problem: you just start the > sift-down process from there. > > The main advantage of this over the only-sort-sometimes idea is that > we can guarantee that the largest ready item will always be dispatched > as soon as it can be (because it will be the heap root). So cases > involving one big table (with big indexes) and a lot of little ones > should get scheduled sanely, which is the main thing we want this > algorithm to ensure. With the other approach we can't really promise > much at all. This seems worth a try. IIUC you are suggesting making binaryheap.c frontend-friendly and expanding its API a bit. If no one has volunteered, I could probably hack something together. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
Andrew Dunstan writes: > On 2023-07-15 Sa 13:47, Tom Lane wrote: >> I wonder if we could replace the sorted ready-list with a priority heap, >> although that might be complicated by the fact that pop_next_work_item >> has to be capable of popping something that's not necessarily the >> largest remaining job. Another idea could be to be a little less eager >> to sort the list every time; I think in practice scheduling wouldn't >> get much worse if we only re-sorted every so often. > Yeah, I think that last idea is reasonable. Something like if the number > added since the last sort is more than min(50, list_length/4) then sort. > That shouldn't be too invasive. Actually, as long as we're talking about approximately-correct behavior: let's make the ready_list be a priority heap, and then just make pop_next_work_item scan forward from the array start until it finds an item that's runnable per the lock heuristic. If the heap root is blocked, the next things we'll examine will be its two children. We might pick the lower-priority of those two, but it's still known to be higher priority than at least 50% of the remaining heap entries, so it shouldn't be too awful as a choice. The argument gets weaker the further you go into the heap, but we're not expecting that having most of the top entries blocked will be a common case. (Besides which, the priorities are pretty crude to begin with.) Once selected, pulling out an entry that is not the heap root is no problem: you just start the sift-down process from there. The main advantage of this over the only-sort-sometimes idea is that we can guarantee that the largest ready item will always be dispatched as soon as it can be (because it will be the heap root). So cases involving one big table (with big indexes) and a lot of little ones should get scheduled sanely, which is the main thing we want this algorithm to ensure. With the other approach we can't really promise much at all. regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On 2023-07-15 Sa 13:47, Tom Lane wrote:
> I looked into the performance gripe at [1] about pg_restore not making
> effective use of parallel workers when there are a lot of tables.  I was
> able to reproduce that by dumping and parallel restoring 100K tables made
> according to this script:
>
> do $$
> begin
> for i in 1..100000 loop
>   execute format('create table t%s (f1 int unique, f2 int unique);', i);
>   execute format('insert into t%s select x, x from generate_series(1,1000) x', i);
>   if i % 100 = 0 then commit; end if;
> end loop;
> end $$;
>
> Once pg_restore reaches the parallelizable part of the restore, what
> I see is that the parent pg_restore process goes to 100% CPU while its
> children (and the server) mostly sit idle; that is, the task dispatch
> logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
> to keep the children busy.  A quick perf check showed most of the time
> being eaten by pg_qsort and TocEntrySizeCompare.
>
> What I believe is happening is that we start the parallel restore phase
> with 100K TableData items that are ready to go (they are in the
> ready_list) and 200K AddConstraint items that are pending, because we
> make those have dependencies on the corresponding TableData so we don't
> build an index until after its table is populated.  Each time one of the
> TableData items is completed by some worker, the two AddConstraint items
> for its table are moved from the pending_list to the ready_list --- and
> that means ready_list_insert marks the ready_list as no longer sorted.
> When we go to pop the next task from the ready_list, we re-sort that
> entire list first.  So we spend something like O(N^2 * log(N)) time just
> sorting, if there are N tables.  Clearly, this code is much less bright
> than it thinks it is (and that's all my fault, if memory serves).
>
> I'm not sure how big a deal this is in practice: in most situations the
> individual jobs are larger than they are in this toy example, plus the
> initial non-parallelizable part of the restore is a bigger bottleneck
> anyway with this many tables.  Still, we do have one real-world
> complaint, so maybe we should look into improving it.
>
> I wonder if we could replace the sorted ready-list with a priority heap,
> although that might be complicated by the fact that pop_next_work_item
> has to be capable of popping something that's not necessarily the
> largest remaining job.  Another idea could be to be a little less eager
> to sort the list every time; I think in practice scheduling wouldn't
> get much worse if we only re-sorted every so often.

Yeah, I think that last idea is reasonable.  Something like if the number
added since the last sort is more than min(50, list_length/4) then sort.
That shouldn't be too invasive.

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Inefficiency in parallel pg_restore with many tables
Hi, On 2023-07-15 13:47:12 -0400, Tom Lane wrote: > I wonder if we could replace the sorted ready-list with a priority heap, > although that might be complicated by the fact that pop_next_work_item > has to be capable of popping something that's not necessarily the > largest remaining job. Another idea could be to be a little less eager > to sort the list every time; I think in practice scheduling wouldn't > get much worse if we only re-sorted every so often. Perhaps we could keep track of where the newly inserted items are, and use insertion sort or such when the number of new elements is much smaller than the size of the already sorted elements? As you say, a straight priority heap might not be easy. But we could just open code using two sorted arrays, one large, one for recent additions that needs to be newly sorted. And occasionally merge the small array into the big array, once it has gotten large enough that sorting becomes expensive. We could go for a heap of N>2 such arrays, but I doubt it would be worth much. Greetings, Andres Freund
Inefficiency in parallel pg_restore with many tables
I looked into the performance gripe at [1] about pg_restore not making
effective use of parallel workers when there are a lot of tables.  I was
able to reproduce that by dumping and parallel restoring 100K tables made
according to this script:

do $$
begin
for i in 1..100000 loop
  execute format('create table t%s (f1 int unique, f2 int unique);', i);
  execute format('insert into t%s select x, x from generate_series(1,1000) x', i);
  if i % 100 = 0 then commit; end if;
end loop;
end $$;

Once pg_restore reaches the parallelizable part of the restore, what
I see is that the parent pg_restore process goes to 100% CPU while its
children (and the server) mostly sit idle; that is, the task dispatch
logic in pg_backup_archiver.c is unable to dispatch tasks fast enough to
keep the children busy.  A quick perf check showed most of the time being
eaten by pg_qsort and TocEntrySizeCompare.

What I believe is happening is that we start the parallel restore phase
with 100K TableData items that are ready to go (they are in the
ready_list) and 200K AddConstraint items that are pending, because we make
those have dependencies on the corresponding TableData so we don't build
an index until after its table is populated.  Each time one of the
TableData items is completed by some worker, the two AddConstraint items
for its table are moved from the pending_list to the ready_list --- and
that means ready_list_insert marks the ready_list as no longer sorted.
When we go to pop the next task from the ready_list, we re-sort that
entire list first.  So we spend something like O(N^2 * log(N)) time just
sorting, if there are N tables.  Clearly, this code is much less bright
than it thinks it is (and that's all my fault, if memory serves).

I'm not sure how big a deal this is in practice: in most situations the
individual jobs are larger than they are in this toy example, plus the
initial non-parallelizable part of the restore is a bigger bottleneck
anyway with this many tables.  Still, we do have one real-world complaint,
so maybe we should look into improving it.

I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the largest
remaining job.  Another idea could be to be a little less eager to sort
the list every time; I think in practice scheduling wouldn't get much
worse if we only re-sorted every so often.

I don't have time to pursue this right now, but perhaps someone else
would like to.

			regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CAEzn%3DHSPXi6OS-5KzGMcZeKzWKOOX1me2u2eCiGtMEZDz9Fqdg%40mail.gmail.com