Re: Inefficiency in parallel pg_restore with many tables
On Mon, Sep 18, 2023 at 02:22:32PM -0700, Nathan Bossart wrote:
> For now, I've committed 0001 and 0002.  I intend to commit the others soon.

I've committed the rest of the patches.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Mon, Sep 18, 2023 at 09:36:03PM -0400, Tom Lane wrote:
> But in any case, how long are we keeping src/tools/msvc/ ?

From a skim of [0], it seems like it could be removed now.  I see a couple
of work-in-progress patches from Andres [1] that would probably serve as a
good starting point.  I won't have much time for this for the next few
weeks, so if someone else wants to pick it up, please feel free.

[0] https://postgr.es/m/20230408191007.7lysd42euafwl74f%40awork3.anarazel.de
[1] https://github.com/anarazel/postgres/commits/drop-homegrown-msvc

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes:
> On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
>> bowerbird is unhappy with this.  I suppose you missed out updating
>> the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

> I saw that and have attempted to fix it with 83223f5.

Ah, right, sorry for the noise.  But in any case, how long are we keeping
src/tools/msvc/ ?

regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
> bowerbird is unhappy with this.  I suppose you missed out updating
> the src/tools/msvc/ scripts.
> (Weren't we about ready to nuke those?)

hamerkop seems to be the only buildfarm member that would complain if these
were to be gone today, on top of bowerbird, of course.

--
Michael
Re: Inefficiency in parallel pg_restore with many tables
On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
> Nathan Bossart writes:
>> For now, I've committed 0001 and 0002.  I intend to commit the others soon.
>
> bowerbird is unhappy with this.  I suppose you missed out updating
> the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

I saw that and have attempted to fix it with 83223f5.  I'm still waiting
for an MSVC animal to report back.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes:
> For now, I've committed 0001 and 0002.  I intend to commit the others soon.

bowerbird is unhappy with this.  I suppose you missed out updating
the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
For now, I've committed 0001 and 0002.  I intend to commit the others soon.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Wed, Sep 13, 2023 at 08:01:39PM -0400, Tom Lane wrote:
> Nathan Bossart writes:
>> Upon closer inspection, I found a rather nasty problem.  The qsort
>> comparator expects a TocEntry **, but the binaryheap comparator expects a
>> TocEntry *, and we simply pass the arguments through to the qsort
>> comparator.  In v9, I added the requisite ampersands.
>
> Ooops :-(
>
>> I'm surprised this worked at all.
>
> Probably it was not sorting things appropriately.  Might be worth adding
> some test scaffolding to check that bigger tasks are chosen before
> smaller ones.

Further testing revealed that the binaryheap comparator function was
actually generating a min-heap, since the qsort comparator sorts by
decreasing dataLength.  This is fixed in v10.  And I am 0 for 2 today...

Now that this appears to be functioning as expected, I see that the larger
entries are typically picked up earlier, but we do sometimes pick entries
quite a bit further down the list, as anticipated.  The case I was testing
(10k tables with the number of rows equal to the table number) was much
faster with this patch (just over a minute) than without it (over 16
minutes).

Sincerest apologies for the noise.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From 59498e233d725d5c427e58f700402908fce2a10f Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v10 1/4] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use of this
simple binary heap implementation.  This commit makes binaryheap usable in
frontend code, much like commit 26aaf97b68 did for StringInfo.  Like
StringInfo, the header file is left in lib/ to reduce the likelihood of
unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This seemed
like a better approach than switching all existing uses to void * or
making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 41 +---
 src/common/meson.build                   |  1 +
 src/include/lib/binaryheap.h             | 26 +++
 src/tools/pgindent/typedefs.list         |  1 +
 7 files changed, 52 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (90%)

[diff hunks truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes:
> Upon closer inspection, I found a rather nasty problem.  The qsort
> comparator expects a TocEntry **, but the binaryheap comparator expects a
> TocEntry *, and we simply pass the arguments through to the qsort
> comparator.  In v9, I added the requisite ampersands.

Ooops :-(

> I'm surprised this worked at all.

Probably it was not sorting things appropriately.  Might be worth adding
some test scaffolding to check that bigger tasks are chosen before
smaller ones.

regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Wed, Sep 13, 2023 at 11:34:50AM -0700, Nathan Bossart wrote:
> On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:
>> Other than those nitpicks, I like v6.  I'll mark this RfC.
>
> Great.  I've posted a v8 with your comments addressed in order to get one
> more round of cfbot coverage.  Assuming those tests pass and there is no
> additional feedback, I'll plan on committing this in the next few days.

Upon closer inspection, I found a rather nasty problem.  The qsort
comparator expects a TocEntry **, but the binaryheap comparator expects a
TocEntry *, and we simply pass the arguments through to the qsort
comparator.  In v9, I added the requisite ampersands.  I'm surprised this
worked at all.

I'm planning to run some additional tests to make sure this patch set
works as expected.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From df3d862907b29687503a8f362e0a3cc3dcbc3cc4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v9 1/4] Make binaryheap available to frontend code.

[commit message and diff identical to the v10 patch above; hunks
truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:
> Hmm ... I do not like v7 very much at all.  It requires rather ugly
> changes to all of the existing callers, and what are we actually
> buying?  If anything, it makes things slower for pass-by-value items
> like integers.  I'd stick with the Datum convention in the backend.
>
> Instead, I took a closer look through the v6 patch set.
> I think that's in pretty good shape and nearly committable,
> but I have a few thoughts:

Thanks for reviewing.  I'm fine with proceeding with the v6 approach.
Even though the alternative approach makes the API consistent for the
frontend and backend, I'm also not a huge fan of the pointer gymnastics
required in the comparators.  Granted, we still have to do some intptr_t
conversions in pg_dump_sort.c with the v6 approach, but that seems to be
an exception.

> * I'm not sure about defining bh_node_type as a macro ... rather than an
> actual typedef ... (If you do choose to make it a typedef, remember to
> add it to typedefs.list.)

I think a typedef makes more sense here.

> * As a matter of style, I'd recommend adding braces in places like this
> ... It's not wrong as you have it, but I think it's more readable and
> less easy to accidentally break with the extra braces.

Fair point.

> * In 0002, isn't the comment for binaryheap_remove_node wrong?
>
> + * Removes the nth node from the heap.  The caller must ensure that there are
> + * at least (n - 1) nodes in the heap.  O(log n) worst case.
>
> Shouldn't that be "(n + 1)"?  Also, I'd specify "n'th (zero based) node"
> for clarity.

Yeah, that's a mistake.

> * I would say that this bit in 0004:
>
> -	j = removeHeapElement(pendingHeap, heapLength--);
> +	j = (intptr_t) binaryheap_remove_first(pendingHeap);
>
> needs an explicit cast to int:
>
> +	j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);
>
> otherwise some compilers might complain about the result possibly
> not fitting in "j".

Sure.  IMO it's a tad more readable, too.

> Other than those nitpicks, I like v6.  I'll mark this RfC.

Great.  I've posted a v8 with your comments addressed in order to get one
more round of cfbot coverage.  Assuming those tests pass and there is no
additional feedback, I'll plan on committing this in the next few days.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From 7dd09c72a44e5ba293c6e9b2d2afa608ba070f03 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v8 1/4] Make binaryheap available to frontend code.

[commit message and diff identical to the v10 patch above; hunks
truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes:
> I spent some more time on this patch and made the relevant adjustments to
> the rest of the set.

Hmm ... I do not like v7 very much at all.  It requires rather ugly
changes to all of the existing callers, and what are we actually
buying?  If anything, it makes things slower for pass-by-value items
like integers.  I'd stick with the Datum convention in the backend.

Instead, I took a closer look through the v6 patch set.
I think that's in pretty good shape and nearly committable,
but I have a few thoughts:

* I'm not sure about defining bh_node_type as a macro:

+#ifdef FRONTEND
+#define bh_node_type void *
+#else
+#define bh_node_type Datum
+#endif

rather than an actual typedef:

+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif

My concern here is that bh_node_type is effectively acting as a
typedef, so that pgindent might misbehave if it's not declared as a
typedef.  On the other hand, there doesn't seem to be any indentation
problem in the patchset as it stands, and we don't expect any code
outside binaryheap.h/.c to refer to bh_node_type, so maybe it's fine.
(If you do choose to make it a typedef, remember to add it to
typedefs.list.)

* As a matter of style, I'd recommend adding braces in places
like this:

 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_nodes[heap->bh_size] = d;

It's not wrong as you have it, but I think it's more readable
and less easy to accidentally break with the extra braces.

* In 0002, isn't the comment for binaryheap_remove_node wrong?

+ * Removes the nth node from the heap.  The caller must ensure that there are
+ * at least (n - 1) nodes in the heap.  O(log n) worst case.

Shouldn't that be "(n + 1)"?  Also, I'd specify "n'th (zero based) node"
for clarity.

* I would say that this bit in 0004:

-	j = removeHeapElement(pendingHeap, heapLength--);
+	j = (intptr_t) binaryheap_remove_first(pendingHeap);

needs an explicit cast to int:

+	j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);

otherwise some compilers might complain about the result possibly
not fitting in "j".

Other than those nitpicks, I like v6.  I'll mark this RfC.

regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Sat, Sep 02, 2023 at 11:55:21AM -0700, Nathan Bossart wrote:
> I ended up hacking together a (nowhere near committable) patch to see how
> hard it would be to allow using any type with binaryheap.  It doesn't seem
> too bad.

I spent some more time on this patch and made the relevant adjustments to
the rest of the set.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From 3a51f25ae712b792bdde90b133c09b0b940f4198 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Mon, 4 Sep 2023 15:04:40 -0700
Subject: [PATCH v7 1/5] Allow binaryheap to use any type.

We would like to make binaryheap available to frontend code in a future
commit, but presently the implementation uses Datum for the node type,
which is inaccessible in the frontend.  This commit modifies the
implementation to allow using any type for the nodes.
binaryheap_allocate() now requires callers to specify the size of the
node, and several of the other functions now use void pointers.
---
 src/backend/executor/nodeGatherMerge.c          | 19 ++--
 src/backend/executor/nodeMergeAppend.c          | 22 ++---
 src/backend/lib/binaryheap.c                    | 99 +++
 src/backend/postmaster/pgarch.c                 | 36 ---
 .../replication/logical/reorderbuffer.c         | 21 ++--
 src/backend/storage/buffer/bufmgr.c             | 20 ++--
 src/include/lib/binaryheap.h                    | 21 ++--
 7 files changed, 137 insertions(+), 101 deletions(-)

[diff hunks truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Sep 03, 2023 at 12:04:00PM +0200, Alvaro Herrera wrote:
> On 2023-Sep-02, Nathan Bossart wrote:
>> I ended up hacking together a (nowhere near committable) patch to see how
>> hard it would be to allow using any type with binaryheap.  It doesn't seem
>> too bad.
>
> Yeah, using void * seems to lead to interfaces that are pretty much the
> same as bsearch() or qsort().

Right.  This is what I had in mind.

> (Why isn't your payload type const, though?)

It probably should be const.  This patch was just a proof-of-concept and
still requires a bit of work.

> I do wonder why did you change _remove_first and _first to have a
> 'result' output argument instead of a return value.  Does this change
> actually buy you anything?  simplehash.h doesn't do that either.
>
>> -extern void binaryheap_add(binaryheap *heap, Datum d);
>> -extern Datum binaryheap_first(binaryheap *heap);
>> -extern Datum binaryheap_remove_first(binaryheap *heap);
>> -extern void binaryheap_replace_first(binaryheap *heap, Datum d);
>> +extern void binaryheap_add(binaryheap *heap, void *d);
>> +extern void binaryheap_first(binaryheap *heap, void *result);
>> +extern void binaryheap_remove_first(binaryheap *heap, void *result);
>> +extern void binaryheap_replace_first(binaryheap *heap, void *d);

_first could likely just return a pointer to the data in the binary heap's
array.  However, _remove_first has to copy the data somewhere, so I think
the alternative would be to return a palloc'd value.  Is there another way
that I'm not thinking of?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On 2023-Sep-02, Nathan Bossart wrote:
> On Fri, Sep 01, 2023 at 01:52:48PM -0700, Nathan Bossart wrote:
>> Yeah, something similar to simplehash for binary heaps could be nice.
>> That being said, I don't know if there's a strong reason to specialize
>> the implementation for a given C data type in most cases.
>
> I ended up hacking together a (nowhere near committable) patch to see how
> hard it would be to allow using any type with binaryheap.  It doesn't seem
> too bad.

Yeah, using void * seems to lead to interfaces that are pretty much the
same as bsearch() or qsort().  (Why isn't your payload type const,
though?)

I do wonder why did you change _remove_first and _first to have a
'result' output argument instead of a return value.  Does this change
actually buy you anything?  simplehash.h doesn't do that either.

> -extern void binaryheap_add(binaryheap *heap, Datum d);
> -extern Datum binaryheap_first(binaryheap *heap);
> -extern Datum binaryheap_remove_first(binaryheap *heap);
> -extern void binaryheap_replace_first(binaryheap *heap, Datum d);
> +extern void binaryheap_add(binaryheap *heap, void *d);
> +extern void binaryheap_first(binaryheap *heap, void *result);
> +extern void binaryheap_remove_first(binaryheap *heap, void *result);
> +extern void binaryheap_replace_first(binaryheap *heap, void *d);

--
Álvaro Herrera               Breisgau, Deutschland
https://www.EnterpriseDB.com/
Re: Inefficiency in parallel pg_restore with many tables
On Fri, Sep 01, 2023 at 01:52:48PM -0700, Nathan Bossart wrote:
> On Fri, Sep 01, 2023 at 04:00:44PM -0400, Robert Haas wrote:
>> In hindsight, I think that making binaryheap depend on Datum was a bad
>> idea.  I think that was my idea, and I think it wasn't very smart.
>> Considering that people have coded to that decision up until now, it
>> might not be too easy to change at this point.  But in principle I
>> guess you'd want to be able to make a heap out of any C data type,
>> rather than just Datum, or just Datum in the backend and just void *
>> in the frontend.
>
> Yeah, something similar to simplehash for binary heaps could be nice.
> That being said, I don't know if there's a strong reason to specialize
> the implementation for a given C data type in most cases.

I ended up hacking together a (nowhere near committable) patch to see how
hard it would be to allow using any type with binaryheap.  It doesn't seem
too bad.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From fa51a6b48d1a7cdcffbd5ccbe4274f39dfb0c1b5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 2 Sep 2023 11:48:36 -0700
Subject: [PATCH 1/1] allow binaryheap to use any type
---
 src/backend/executor/nodeGatherMerge.c          | 20 +++---
 src/backend/executor/nodeMergeAppend.c          | 20 +++---
 src/backend/lib/binaryheap.c                    | 71 ++-
 src/backend/postmaster/pgarch.c                 | 37 +-
 .../replication/logical/reorderbuffer.c         | 22 +++---
 src/backend/storage/buffer/bufmgr.c             | 21 +++---
 src/include/lib/binaryheap.h                    | 17 ++---
 7 files changed, 112 insertions(+), 96 deletions(-)

[diff hunks truncated in the archive]
Re: Inefficiency in parallel pg_restore with many tables
On Fri, Sep 01, 2023 at 04:00:44PM -0400, Robert Haas wrote: > In hindsight, I think that making binaryheap depend on Datum was a bad > idea. I think that was my idea, and I think it wasn't very smart. > Considering that people have coded to that decision up until now, it > might not be too easy to change at this point. But in principle I > guess you'd want to be able to make a heap out of any C data type, > rather than just Datum, or just Datum in the backend and just void * > in the frontend. Yeah, something similar to simplehash for binary heaps could be nice. That being said, I don't know if there's a strong reason to specialize the implementation for a given C data type in most cases. I suspect many callers are just fine with dealing with pointers (e.g., I wouldn't store an entire TocEntry in the array), and smaller types like integers are already stored directly in the array thanks to the use of Datum. However, it _would_ allow us to abandon this frontend/backend void */Datum kludge, which is something. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Tue, Jul 25, 2023 at 2:53 PM Nathan Bossart wrote: > On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote: > > Here is a sketch of this approach. It required fewer #ifdefs than I was > > expecting. At the moment, this one seems like the winner to me. > > Here is a polished patch set for this approach. I've also added a 0004 > that replaces the open-coded heap in pg_dump_sort.c with a binaryheap. > IMHO these patches are in decent shape. [ drive-by comment that hopefully doesn't cause too much pain ] In hindsight, I think that making binaryheap depend on Datum was a bad idea. I think that was my idea, and I think it wasn't very smart. Considering that people have coded to that decision up until now, it might not be too easy to change at this point. But in principle I guess you'd want to be able to make a heap out of any C data type, rather than just Datum, or just Datum in the backend and just void * in the frontend. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Inefficiency in parallel pg_restore with many tables
On Fri, Sep 01, 2023 at 01:41:41PM -0400, Tom Lane wrote: > I've not actually looked at any of these patchsets after the first one. > I have added myself as a reviewer and will hopefully get to it within > a week or so. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes: > I'm hoping to commit these patches at some point in the current commitfest. > I don't sense anything tremendously controversial, and they provide a > pretty nice speedup in some cases. Are there any remaining concerns? I've not actually looked at any of these patchsets after the first one. I have added myself as a reviewer and will hopefully get to it within a week or so. regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Tue, Jul 25, 2023 at 11:53:36AM -0700, Nathan Bossart wrote: > Here is a polished patch set for this approach. I've also added a 0004 > that replaces the open-coded heap in pg_dump_sort.c with a binaryheap. > IMHO these patches are in decent shape. I'm hoping to commit these patches at some point in the current commitfest. I don't sense anything tremendously controversial, and they provide a pretty nice speedup in some cases. Are there any remaining concerns? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote: > Here is a sketch of this approach. It required fewer #ifdefs than I was > expecting. At the moment, this one seems like the winner to me. Here is a polished patch set for this approach. I've also added a 0004 that replaces the open-coded heap in pg_dump_sort.c with a binaryheap. IMHO these patches are in decent shape. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From def4af18e68cece22008863f664e9e0fd060f917 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jul 2023 15:04:45 -0700 Subject: [PATCH v6 1/4] Make binaryheap available to frontend code. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There are a couple of places in frontend code that could make use of this simple binary heap implementation. This commit makes binaryheap usable in frontend code, much like commit 26aaf97b68 did for StringInfo. Like StringInfo, the header file is left in lib/ to reduce the likelihood of unnecessary breakage. The frontend version of binaryheap exposes a void *-based API since frontend code does not have access to the Datum definitions. This seemed like a better approach than switching all existing uses to void * or making the Datum definitions available to frontend code. Reviewed-by: Tom Lane, Álvaro Herrera Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us --- src/backend/lib/Makefile | 1 - src/backend/lib/meson.build | 1 - src/common/Makefile | 1 + src/{backend/lib => common}/binaryheap.c | 37 +--- src/common/meson.build | 1 + src/include/lib/binaryheap.h | 26 - 6 files changed, 47 insertions(+), 20 deletions(-) rename src/{backend/lib => common}/binaryheap.c (91%) diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile index 9dad31398a..b6cefd9cca 100644 --- a/src/backend/lib/Makefile +++ b/src/backend/lib/Makefile @@ -13,7 +13,6 @@ top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global OBJS = \ - binaryheap.o \ bipartite_match.o \ bloomfilter.o \ dshash.o \ diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build index 974cab8776..b4e88f54ae 100644 --- a/src/backend/lib/meson.build +++ b/src/backend/lib/meson.build @@ -1,7 +1,6 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group backend_sources += files( - 'binaryheap.c', 'bipartite_match.c', 'bloomfilter.c', 'dshash.c', diff --git a/src/common/Makefile b/src/common/Makefile index 113029bf7b..cc5c54dcee 100644 --- a/src/common/Makefile +++ b/src/common/Makefile @@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS) OBJS_COMMON = \ archive.o \ base64.o \ + binaryheap.o \ checksum_helper.o \ compression.o \ config_info.o \ diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c similarity index 91% rename from src/backend/lib/binaryheap.c rename to src/common/binaryheap.c index 1737546757..4836cd1c6d 100644 --- a/src/backend/lib/binaryheap.c +++ b/src/common/binaryheap.c @@ -6,15 +6,22 @@ * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group * * IDENTIFICATION - * src/backend/lib/binaryheap.c + * src/common/binaryheap.c * *- */ +#ifdef FRONTEND +#include "postgres_fe.h" +#else #include "postgres.h" +#endif #include +#ifdef FRONTEND +#include "common/logging.h" +#endif #include "lib/binaryheap.h" static void sift_down(binaryheap *heap, int node_off); @@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) int sz; binaryheap *heap; - sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity; + sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity; heap = (binaryheap *) palloc(sz); heap->bh_space = capacity; heap->bh_compare = compare; @@ -106,10 +113,14 @@ parent_offset(int i) * afterwards. 
*/ void -binaryheap_add_unordered(binaryheap *heap, Datum d) +binaryheap_add_unordered(binaryheap *heap, bh_node_type d) { if (heap->bh_size >= heap->bh_space) +#ifdef FRONTEND + pg_fatal("out of binary heap slots"); +#else elog(ERROR, "out of binary heap slots"); +#endif heap->bh_has_heap_property = false; heap->bh_nodes[heap->bh_size] = d; heap->bh_size++; @@ -138,10 +149,14 @@ binaryheap_build(binaryheap *heap) * the heap property. */ void -binaryheap_add(binaryheap *heap, Datum d) +binaryheap_add(binaryheap *heap, bh_node_type d) { if (heap->bh_size >= heap->bh_space) +#ifdef FRONTEND + pg_fatal("out of binary heap slots"); +#else elog(ERROR, "out of binary heap slots"); +#endif heap->bh_nodes[heap->bh_size] = d; heap->bh_size++; sift_up(heap, heap->bh_size - 1); @@ -154,7 +169,7 @@ binaryheap_add(binaryheap *heap, Datum d) * without modifying the heap. The caller must ensure
Re: Inefficiency in parallel pg_restore with many tables
On Sat, Jul 22, 2023 at 10:57:03PM -0700, Nathan Bossart wrote: > On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote: >> I wonder whether we can't provide some alternate definition or "skin" >> for binaryheap that preserves the Datum API for backend code that wants >> that, while providing a void *-based API for frontend code to use. > > I can give this a try next, but it might be rather #ifdef-heavy. Here is a sketch of this approach. It required fewer #ifdefs than I was expecting. At the moment, this one seems like the winner to me. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 13017359cc6dd35f3e465280373b4b82e87b7c5b Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jul 2023 15:04:45 -0700 Subject: [PATCH v5 1/3] make binaryheap available to frontend --- src/backend/lib/Makefile | 1 - src/backend/lib/meson.build | 1 - src/common/Makefile | 1 + src/{backend/lib => common}/binaryheap.c | 37 +--- src/common/meson.build | 1 + src/include/lib/binaryheap.h | 23 ++- 6 files changed, 44 insertions(+), 20 deletions(-) rename src/{backend/lib => common}/binaryheap.c (91%) diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile index 9dad31398a..b6cefd9cca 100644 --- a/src/backend/lib/Makefile +++ b/src/backend/lib/Makefile @@ -13,7 +13,6 @@ top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global OBJS = \ - binaryheap.o \ bipartite_match.o \ bloomfilter.o \ dshash.o \ diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build index 974cab8776..b4e88f54ae 100644 --- a/src/backend/lib/meson.build +++ b/src/backend/lib/meson.build @@ -1,7 +1,6 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group backend_sources += files( - 'binaryheap.c', 'bipartite_match.c', 'bloomfilter.c', 'dshash.c', diff --git a/src/common/Makefile b/src/common/Makefile index 113029bf7b..cc5c54dcee 100644 --- a/src/common/Makefile +++ b/src/common/Makefile @@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS) OBJS_COMMON = \ archive.o \ base64.o \ + binaryheap.o \ checksum_helper.o \ compression.o \ config_info.o \ diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c similarity index 91% rename from src/backend/lib/binaryheap.c rename to src/common/binaryheap.c index 1737546757..a6be4b42ae 100644 --- a/src/backend/lib/binaryheap.c +++ b/src/common/binaryheap.c @@ -6,15 +6,22 @@ * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group * * IDENTIFICATION - * src/backend/lib/binaryheap.c + * src/common/binaryheap.c * *- */ +#ifdef FRONTEND +#include "postgres_fe.h" +#else #include "postgres.h" +#endif #include +#ifdef FRONTEND +#include "common/logging.h" +#endif #include "lib/binaryheap.h" static void sift_down(binaryheap *heap, int node_off); @@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) int sz; binaryheap *heap; - sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity; + sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_elem_type) * capacity; heap = (binaryheap *) palloc(sz); heap->bh_space = capacity; heap->bh_compare = compare; @@ -106,10 +113,14 @@ parent_offset(int i) * afterwards. 
*/ void -binaryheap_add_unordered(binaryheap *heap, Datum d) +binaryheap_add_unordered(binaryheap *heap, bh_elem_type d) { if (heap->bh_size >= heap->bh_space) +#ifdef FRONTEND + pg_fatal("out of binary heap slots"); +#else elog(ERROR, "out of binary heap slots"); +#endif heap->bh_has_heap_property = false; heap->bh_nodes[heap->bh_size] = d; heap->bh_size++; @@ -138,10 +149,14 @@ binaryheap_build(binaryheap *heap) * the heap property. */ void -binaryheap_add(binaryheap *heap, Datum d) +binaryheap_add(binaryheap *heap, bh_elem_type d) { if (heap->bh_size >= heap->bh_space) +#ifdef FRONTEND + pg_fatal("out of binary heap slots"); +#else elog(ERROR, "out of binary heap slots"); +#endif heap->bh_nodes[heap->bh_size] = d; heap->bh_size++; sift_up(heap, heap->bh_size - 1); @@ -154,7 +169,7 @@ binaryheap_add(binaryheap *heap, Datum d) * without modifying the heap. The caller must ensure that this * routine is not used on an empty heap. Always O(1). */ -Datum +bh_elem_type binaryheap_first(binaryheap *heap) { Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); @@ -169,10 +184,10 @@ binaryheap_first(binaryheap *heap) * that this routine is not used on an empty heap. O(log n) worst * case. */ -Datum +bh_elem_type binaryheap_remove_first(binaryheap *heap) { - Datum result; + bh_elem_type result; Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); @@ -204,7 +219,7 @@ binaryheap_remove_first(binaryheap *heap) * sifting the new node down. */ void
Re: Inefficiency in parallel pg_restore with many tables
On Saturday, July 15, 2023 7:47:12 PM CEST Tom Lane wrote:
> I'm not sure how big a deal this is in practice: in most situations
> the individual jobs are larger than they are in this toy example,
> plus the initial non-parallelizable part of the restore is a bigger
> bottleneck anyway with this many tables. Still, we do have one
> real-world complaint, so maybe we should look into improving it.

Hi,

For what it's worth, at my current job it's kind of a big deal. I was about to start looking into the bad pg_restore performance I was seeing on some databases with over 50k tables (in 200 namespaces) when I found this thread. The dump weighs in at about 2.8GB; the toc.dat file is 230MB, covering 50,120 tables, 142,069 constraints, and 73,669 indexes.

HEAD pg_restore duration: 30 minutes
pg_restore with latest patch from Nathan Bossart: 23 minutes

This is indeed better, but there is still a lot of room for improvement. For use cases like this, I was able to go much faster with the patched pg_restore plus a script that parallelizes over schemas instead of relying on the choices made by pg_restore. It seems that parallelizing only the data loading loses significant speedup opportunities when there is a huge number of objects.

patched pg_restore + parallel restore of schemas: 10 minutes

Anyway, the patch works really well as is, and I will certainly keep trying future iterations.

Regards

Pierre

Re: Inefficiency in parallel pg_restore with many tables
On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote: > Nathan Bossart writes: >> I first tried modifying >> binaryheap to use "int" or "void *" instead, but that ended up requiring >> some rather invasive changes in backend code, not to mention any extensions >> that happen to be using it. I followed through with the "void *" approach (attached), and it wasn't as bad as I expected. > I wonder whether we can't provide some alternate definition or "skin" > for binaryheap that preserves the Datum API for backend code that wants > that, while providing a void *-based API for frontend code to use. I can give this a try next, but it might be rather #ifdef-heavy. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From b570460b19c5955c91b1fc28e8fdc791f41998e2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jul 2023 20:59:46 -0700 Subject: [PATCH v4 1/4] use "void *" instead of "Datum" in binaryheap --- src/backend/executor/nodeGatherMerge.c| 16 +- src/backend/executor/nodeMergeAppend.c| 16 +- src/backend/lib/binaryheap.c | 20 ++-- src/backend/postmaster/pgarch.c | 18 +-- .../replication/logical/reorderbuffer.c | 31 +-- src/backend/storage/buffer/bufmgr.c | 11 +++ src/include/lib/binaryheap.h | 14 - 7 files changed, 61 insertions(+), 65 deletions(-) diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c index 9d5e1a46e9..021682d87c 100644 --- a/src/backend/executor/nodeGatherMerge.c +++ b/src/backend/executor/nodeGatherMerge.c @@ -52,7 +52,7 @@ typedef struct GMReaderTupleBuffer } GMReaderTupleBuffer; static TupleTableSlot *ExecGatherMerge(PlanState *pstate); -static int32 heap_compare_slots(Datum a, Datum b, void *arg); +static int32 heap_compare_slots(const void *a, const void *b, void *arg); static TupleTableSlot *gather_merge_getnext(GatherMergeState *gm_state); static MinimalTuple gm_readnext_tuple(GatherMergeState *gm_state, int nreader, bool nowait, bool *done); @@ -489,7 +489,7 @@ reread: /* Don't 
have a tuple yet, try to get one */ if (gather_merge_readnext(gm_state, i, nowait)) binaryheap_add_unordered(gm_state->gm_heap, - Int32GetDatum(i)); + (void *) (intptr_t) i); } else { @@ -565,10 +565,10 @@ gather_merge_getnext(GatherMergeState *gm_state) * the heap, because it might now compare differently against the * other elements of the heap. */ - i = DatumGetInt32(binaryheap_first(gm_state->gm_heap)); + i = (intptr_t) binaryheap_first(gm_state->gm_heap); if (gather_merge_readnext(gm_state, i, false)) - binaryheap_replace_first(gm_state->gm_heap, Int32GetDatum(i)); + binaryheap_replace_first(gm_state->gm_heap, (void *) (intptr_t) i); else { /* reader exhausted, remove it from heap */ @@ -585,7 +585,7 @@ gather_merge_getnext(GatherMergeState *gm_state) else { /* Return next tuple from whichever participant has the leading one */ - i = DatumGetInt32(binaryheap_first(gm_state->gm_heap)); + i = (intptr_t) binaryheap_first(gm_state->gm_heap); return gm_state->gm_slots[i]; } } @@ -750,11 +750,11 @@ typedef int32 SlotNumber; * Compare the tuples in the two given slots. 
*/ static int32 -heap_compare_slots(Datum a, Datum b, void *arg) +heap_compare_slots(const void *a, const void *b, void *arg) { GatherMergeState *node = (GatherMergeState *) arg; - SlotNumber slot1 = DatumGetInt32(a); - SlotNumber slot2 = DatumGetInt32(b); + SlotNumber slot1 = (intptr_t) a; + SlotNumber slot2 = (intptr_t) b; TupleTableSlot *s1 = node->gm_slots[slot1]; TupleTableSlot *s2 = node->gm_slots[slot2]; diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c index 21b5726e6e..b3e2c29401 100644 --- a/src/backend/executor/nodeMergeAppend.c +++ b/src/backend/executor/nodeMergeAppend.c @@ -52,7 +52,7 @@ typedef int32 SlotNumber; static TupleTableSlot *ExecMergeAppend(PlanState *pstate); -static int heap_compare_slots(Datum a, Datum b, void *arg); +static int heap_compare_slots(const void *a, const void *b, void *arg); /* @@ -229,7 +229,7 @@ ExecMergeAppend(PlanState *pstate) { node->ms_slots[i] = ExecProcNode(node->mergeplans[i]); if (!TupIsNull(node->ms_slots[i])) -binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i)); +binaryheap_add_unordered(node->ms_heap, (void *) (intptr_t) i); } binaryheap_build(node->ms_heap); node->ms_initialized = true; @@ -244,10 +244,10 @@ ExecMergeAppend(PlanState *pstate) * by doing this before returning from the prior call, but it's better * to not pull tuples until necessary.) */ - i = DatumGetInt32(binaryheap_first(node->ms_heap)); + i = (intptr_t) binaryheap_first(node->ms_heap); node->ms_slots[i] =
Re: Inefficiency in parallel pg_restore with many tables
Nathan Bossart writes: > On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote: >> One item that requires more thought is binaryheap's use of Datum. AFAICT >> the Datum definitions live in postgres.h and aren't available to frontend >> code. I think we'll either need to move the Datum definitions to c.h or to >> adjust binaryheap to use "void *". > In v3, I moved the Datum definitions to c.h. I first tried modifying > binaryheap to use "int" or "void *" instead, but that ended up requiring > some rather invasive changes in backend code, not to mention any extensions > that happen to be using it. I'm quite uncomfortable with putting Datum in c.h. I know that the typedef is merely a uintptr_t, but this solution seems to me to be blowing all kinds of holes in the abstraction, because exactly none of the infrastructure that goes along with Datum is or is ever likely to be in any frontend build. At the very least, frontend code that refers to Datum will be misleading as hell. I wonder whether we can't provide some alternate definition or "skin" for binaryheap that preserves the Datum API for backend code that wants that, while providing a void *-based API for frontend code to use. regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On Sat, Jul 22, 2023 at 04:19:41PM -0700, Nathan Bossart wrote: > In v3, I moved the Datum definitions to c.h. I first tried modifying > binaryheap to use "int" or "void *" instead, but that ended up requiring > some rather invasive changes in backend code, not to mention any extensions > that happen to be using it. I also looked into moving the definitions to a > separate datumdefs.h header that postgres.h would include, but that felt > awkward because 1) postgres.h clearly states that it is intended for things > "that never escape the backend" and 2) the definitions seem relatively > inexpensive. However, I think the latter option is still viable, so I'm > fine with switching to it if folks think that is a better approach. BTW we might be able to replace the open-coded heap in pg_dump_sort.c (added by 79273cc) with a binaryheap, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote: > Here is a work-in-progress patch set for converting ready_list to a > priority queue. On my machine, Tom's 100k-table example [0] takes 11.5 > minutes without these patches and 1.5 minutes with them. > > One item that requires more thought is binaryheap's use of Datum. AFAICT > the Datum definitions live in postgres.h and aren't available to frontend > code. I think we'll either need to move the Datum definitions to c.h or to > adjust binaryheap to use "void *". In v3, I moved the Datum definitions to c.h. I first tried modifying binaryheap to use "int" or "void *" instead, but that ended up requiring some rather invasive changes in backend code, not to mention any extensions that happen to be using it. I also looked into moving the definitions to a separate datumdefs.h header that postgres.h would include, but that felt awkward because 1) postgres.h clearly states that it is intended for things "that never escape the backend" and 2) the definitions seem relatively inexpensive. However, I think the latter option is still viable, so I'm fine with switching to it if folks think that is a better approach. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 56066f836f389ca0efe169c1ea8d769f70aaa817 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jul 2023 14:53:49 -0700 Subject: [PATCH v3 1/4] move datum definitions to c.h --- src/include/c.h| 512 src/include/postgres.h | 519 + 2 files changed, 515 insertions(+), 516 deletions(-) diff --git a/src/include/c.h b/src/include/c.h index f69d739be5..daee2f50eb 100644 --- a/src/include/c.h +++ b/src/include/c.h @@ -35,6 +35,7 @@ * 7) widely useful macros * 8) random stuff * 9) system-specific hacks + * 10) Datum type + support functions * * NOTE: since this file is included by both frontend and backend modules, * it's usually wrong to put an "extern" declaration here, unless it's @@ -1374,4 +1375,515 @@ typedef intptr_t sigjmp_buf[5]; /* /port compatibility functions */ #include "port.h" +/* + *Section 10: Datum type + support functions + * + */ + +/* + * A Datum contains either a value of a pass-by-value type or a pointer to a + * value of a pass-by-reference type. Therefore, we require: + * + * sizeof(Datum) == sizeof(void *) == 4 or 8 + * + * The functions below and the analogous functions for other types should be used to + * convert between a Datum and the appropriate C type. + */ + +typedef uintptr_t Datum; + +/* + * A NullableDatum is used in places where both a Datum and its nullness needs + * to be stored. This can be more efficient than storing datums and nullness + * in separate arrays, due to better spatial locality, even if more space may + * be wasted due to padding. + */ +typedef struct NullableDatum +{ +#define FIELDNO_NULLABLE_DATUM_DATUM 0 + Datum value; +#define FIELDNO_NULLABLE_DATUM_ISNULL 1 + bool isnull; + /* due to alignment padding this could be used for flags for free */ +} NullableDatum; + +#define SIZEOF_DATUM SIZEOF_VOID_P + +/* + * DatumGetBool + * Returns boolean value of a datum. + * + * Note: any nonzero value will be considered true. 
+ */ +static inline bool +DatumGetBool(Datum X) +{ + return (X != 0); +} + +/* + * BoolGetDatum + * Returns datum representation for a boolean. + * + * Note: any nonzero value will be considered true. + */ +static inline Datum +BoolGetDatum(bool X) +{ + return (Datum) (X ? 1 : 0); +} + +/* + * DatumGetChar + * Returns character value of a datum. + */ +static inline char +DatumGetChar(Datum X) +{ + return (char) X; +} + +/* + * CharGetDatum + * Returns datum representation for a character. + */ +static inline Datum +CharGetDatum(char X) +{ + return (Datum) X; +} + +/* + * Int8GetDatum + * Returns datum representation for an 8-bit integer. + */ +static inline Datum +Int8GetDatum(int8 X) +{ + return (Datum) X; +} + +/* + * DatumGetUInt8 + * Returns 8-bit unsigned integer value of a datum. + */ +static inline uint8 +DatumGetUInt8(Datum X) +{ + return (uint8) X; +} + +/* + * UInt8GetDatum + * Returns datum representation for an 8-bit unsigned integer. + */ +static inline Datum +UInt8GetDatum(uint8 X) +{ + return (Datum) X; +} + +/* + * DatumGetInt16 + * Returns 16-bit integer value of a datum. + */ +static inline int16 +DatumGetInt16(Datum X) +{ + return (int16) X; +} + +/* + * Int16GetDatum + * Returns datum representation for a 16-bit integer. + */ +static inline Datum +Int16GetDatum(int16 X) +{ + return (Datum) X; +} + +/* + * DatumGetUInt16 + * Returns 16-bit unsigned integer value of a datum. + */ +static inline uint16 +DatumGetUInt16(Datum X) +{ + return (uint16) X; +} + +/* + * UInt16GetDatum + * Returns datum representation for a 16-bit
Re: Inefficiency in parallel pg_restore with many tables
Here is a work-in-progress patch set for converting ready_list to a priority queue. On my machine, Tom's 100k-table example [0] takes 11.5 minutes without these patches and 1.5 minutes with them. One item that requires more thought is binaryheap's use of Datum. AFAICT the Datum definitions live in postgres.h and aren't available to frontend code. I think we'll either need to move the Datum definitions to c.h or to adjust binaryheap to use "void *". [0] https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From bbf7b32f0abd0e9e2f7a0759cc79fcccbb5f9281 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 19 Jul 2023 15:54:00 -0700 Subject: [PATCH v2 1/4] misc binaryheap fixes --- src/backend/postmaster/pgarch.c | 12 ++-- src/backend/replication/logical/reorderbuffer.c | 2 +- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 46af349564..c28e6f2070 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -108,7 +108,7 @@ static ArchiveModuleState *archive_module_state; * archive_status. Minimizing the number of directory scans when there are * many files to archive can significantly improve archival rate. * - * arch_heap is a max-heap that is used during the directory scan to track + * arch_heap is a min-heap that is used during the directory scan to track * the highest-priority files to archive. After the directory scan * completes, the file names are stored in ascending order of priority in * arch_files. pgarch_readyXlog() returns files from arch_files until it @@ -248,7 +248,7 @@ PgArchiverMain(void) arch_files = palloc(sizeof(struct arch_files_state)); arch_files->arch_files_size = 0; - /* Initialize our max-heap for prioritizing files to archive. */ + /* Initialize our min-heap for prioritizing files to archive. 
*/ arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN, ready_file_comparator, NULL); @@ -624,7 +624,7 @@ pgarch_readyXlog(char *xlog) basename[basenamelen] = '\0'; /* - * Store the file in our max-heap if it has a high enough priority. + * Store the file in our min-heap if it has a high enough priority. */ if (arch_files->arch_heap->bh_size < NUM_FILES_PER_DIRECTORY_SCAN) { @@ -644,15 +644,15 @@ pgarch_readyXlog(char *xlog) * Remove the lowest priority file and add the current one to the * heap. */ - arch_file = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap)); + arch_file = DatumGetCString(binaryheap_first(arch_files->arch_heap)); strcpy(arch_file, basename); - binaryheap_add(arch_files->arch_heap, CStringGetDatum(arch_file)); + binaryheap_replace_first(arch_files->arch_heap, CStringGetDatum(arch_file)); } } FreeDir(rldir); /* If no files were found, simply return. */ - if (arch_files->arch_heap->bh_size == 0) + if (binaryheap_empty(arch_files->arch_heap)) return false; /* diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 26d252bd87..05bbcecd29 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1381,7 +1381,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state) int32 off; /* nothing there anymore */ - if (state->heap->bh_size == 0) + if (binaryheap_empty(state->heap)) return NULL; off = DatumGetInt32(binaryheap_first(state->heap)); -- 2.25.1 >From 8f0339263dd21bb04c1b96a0e850a9b57a2e3f8b Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 20 Jul 2023 09:45:03 -0700 Subject: [PATCH v2 2/4] make binaryheap available to frontend --- src/backend/lib/Makefile | 1 - src/backend/lib/meson.build | 1 - src/common/Makefile | 1 + src/{backend/lib => common}/binaryheap.c | 15 src/common/meson.build | 1 + src/include/lib/binaryheap.h | 29 6 files changed, 46 insertions(+), 2 
deletions(-) rename src/{backend/lib => common}/binaryheap.c (96%) diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile index 9dad31398a..b6cefd9cca 100644 --- a/src/backend/lib/Makefile +++ b/src/backend/lib/Makefile @@ -13,7 +13,6 @@ top_builddir = ../../.. include $(top_builddir)/src/Makefile.global OBJS = \ - binaryheap.o \ bipartite_match.o \ bloomfilter.o \ dshash.o \ diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build index 974cab8776..b4e88f54ae 100644 --- a/src/backend/lib/meson.build +++ b/src/backend/lib/meson.build @@ -1,7 +1,6 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group backend_sources += files( - 'binaryheap.c', 'bipartite_match.c', 'bloomfilter.c', 'dshash.c', diff --git
Re: Inefficiency in parallel pg_restore with many tables
On Tue, Jul 18, 2023 at 06:05:11PM +0200, Alvaro Herrera wrote: > On 2023-Jul-17, Nathan Bossart wrote: > >> @@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator >> compare, void *arg) >> binaryheap *heap; >> >> sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity; >> +#ifdef FRONTEND >> +heap = (binaryheap *) pg_malloc(sz); >> +#else >> heap = (binaryheap *) palloc(sz); >> +#endif > > Hmm, as I recall fe_memutils.c provides you with palloc() in the > frontend environment, so you don't actually need this one. Ah, yes it does. Thanks for the pointer. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
On 2023-Jul-17, Nathan Bossart wrote: > @@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator > compare, void *arg) > binaryheap *heap; > > sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity; > +#ifdef FRONTEND > + heap = (binaryheap *) pg_malloc(sz); > +#else > heap = (binaryheap *) palloc(sz); > +#endif Hmm, as I recall fe_memutils.c provides you with palloc() in the frontend environment, so you don't actually need this one. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Jul 16, 2023 at 08:54:24PM -0700, Nathan Bossart wrote:
> This seems worth a try.  IIUC you are suggesting making binaryheap.c
> frontend-friendly and expanding its API a bit.  If no one has volunteered,
> I could probably hack something together.

I spent some time on the binaryheap changes.  I haven't had a chance to
plug it into the ready_list yet.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/lib/binaryheap.c b/src/backend/lib/binaryheap.c
index 1737546757..36b6f5c51f 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/backend/lib/binaryheap.c
@@ -11,10 +11,17 @@
  *-------------------------------------------------------------------------
  */
 
+#ifndef FRONTEND
 #include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	binaryheap *heap;
 
 	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+#ifdef FRONTEND
+	heap = (binaryheap *) pg_malloc(sz);
+#else
 	heap = (binaryheap *) palloc(sz);
+#endif
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
@@ -109,7 +120,11 @@ void
 binaryheap_add_unordered(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -141,7 +156,11 @@ void
 binaryheap_add(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -196,6 +215,65 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove
+ *
+ * Removes the given datum from the heap.  The caller must ensure that the
+ * datum exists in the heap.  Always O(n).
+ */
+void
+binaryheap_remove(binaryheap *heap, Datum d)
+{
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+
+	for (int i = 0; i < heap->bh_size; i++)
+	{
+		if (heap->bh_compare(heap->bh_nodes[i],
+							 d,
+							 heap->bh_arg) == 0)
+		{
+			binaryheap_remove_node(heap, &heap->bh_nodes[i]);
+			return;
+		}
+	}
+
+#ifdef FRONTEND
+	pg_fatal("datum not found in heap");
+#else
+	elog(ERROR, "datum not found in heap");
+#endif
+}
+
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the given node from the heap.  The caller must ensure that the node
+ * exists in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, Datum *n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= &heap->bh_nodes[0]);
+	Assert(n <= &heap->bh_nodes[heap->bh_size - 1]);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   *n,
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	*n = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n - &heap->bh_nodes[0]);
+	else if (cmp < 0)
+		sift_down(heap, n - &heap->bh_nodes[0]);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..411f7358ba 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -47,6 +47,8 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, Datum d);
 extern Datum binaryheap_first(binaryheap *heap);
 extern Datum binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove(binaryheap *heap, Datum d);
+extern void binaryheap_remove_node(binaryheap *heap, Datum *n);
 extern void binaryheap_replace_first(binaryheap *heap, Datum d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
Re: Inefficiency in parallel pg_restore with many tables
On Sun, Jul 16, 2023 at 09:45:54AM -0400, Tom Lane wrote: > Actually, as long as we're talking about approximately-correct behavior: > let's make the ready_list be a priority heap, and then just make > pop_next_work_item scan forward from the array start until it finds an > item that's runnable per the lock heuristic. If the heap root is > blocked, the next things we'll examine will be its two children. > We might pick the lower-priority of those two, but it's still known to > be higher priority than at least 50% of the remaining heap entries, so > it shouldn't be too awful as a choice. The argument gets weaker the > further you go into the heap, but we're not expecting that having most > of the top entries blocked will be a common case. (Besides which, the > priorities are pretty crude to begin with.) Once selected, pulling out > an entry that is not the heap root is no problem: you just start the > sift-down process from there. > > The main advantage of this over the only-sort-sometimes idea is that > we can guarantee that the largest ready item will always be dispatched > as soon as it can be (because it will be the heap root). So cases > involving one big table (with big indexes) and a lot of little ones > should get scheduled sanely, which is the main thing we want this > algorithm to ensure. With the other approach we can't really promise > much at all. This seems worth a try. IIUC you are suggesting making binaryheap.c frontend-friendly and expanding its API a bit. If no one has volunteered, I could probably hack something together. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Inefficiency in parallel pg_restore with many tables
Andrew Dunstan writes: > On 2023-07-15 Sa 13:47, Tom Lane wrote: >> I wonder if we could replace the sorted ready-list with a priority heap, >> although that might be complicated by the fact that pop_next_work_item >> has to be capable of popping something that's not necessarily the >> largest remaining job. Another idea could be to be a little less eager >> to sort the list every time; I think in practice scheduling wouldn't >> get much worse if we only re-sorted every so often. > Yeah, I think that last idea is reasonable. Something like if the number > added since the last sort is more than min(50, list_length/4) then sort. > That shouldn't be too invasive. Actually, as long as we're talking about approximately-correct behavior: let's make the ready_list be a priority heap, and then just make pop_next_work_item scan forward from the array start until it finds an item that's runnable per the lock heuristic. If the heap root is blocked, the next things we'll examine will be its two children. We might pick the lower-priority of those two, but it's still known to be higher priority than at least 50% of the remaining heap entries, so it shouldn't be too awful as a choice. The argument gets weaker the further you go into the heap, but we're not expecting that having most of the top entries blocked will be a common case. (Besides which, the priorities are pretty crude to begin with.) Once selected, pulling out an entry that is not the heap root is no problem: you just start the sift-down process from there. The main advantage of this over the only-sort-sometimes idea is that we can guarantee that the largest ready item will always be dispatched as soon as it can be (because it will be the heap root). So cases involving one big table (with big indexes) and a lot of little ones should get scheduled sanely, which is the main thing we want this algorithm to ensure. With the other approach we can't really promise much at all. regards, tom lane
Re: Inefficiency in parallel pg_restore with many tables
On 2023-07-15 Sa 13:47, Tom Lane wrote:
> I looked into the performance gripe at [1] about pg_restore not making
> effective use of parallel workers when there are a lot of tables.  I was
> able to reproduce that by dumping and parallel restoring 100K tables made
> according to this script:
>
> do $$
> begin
> for i in 1..100000 loop
>   execute format('create table t%s (f1 int unique, f2 int unique);', i);
>   execute format('insert into t%s select x, x from generate_series(1,1000) x', i);
>   if i % 100 = 0 then commit; end if;
> end loop;
> end $$;
>
> Once pg_restore reaches the parallelizable part of the restore, what
> I see is that the parent pg_restore process goes to 100% CPU while its
> children (and the server) mostly sit idle; that is, the task dispatch
> logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
> to keep the children busy.  A quick perf check showed most of the time
> being eaten by pg_qsort and TocEntrySizeCompare.
>
> What I believe is happening is that we start the parallel restore phase
> with 100K TableData items that are ready to go (they are in the
> ready_list) and 200K AddConstraint items that are pending, because we
> make those have dependencies on the corresponding TableData so we don't
> build an index until after its table is populated.  Each time one of the
> TableData items is completed by some worker, the two AddConstraint items
> for its table are moved from the pending_list to the ready_list --- and
> that means ready_list_insert marks the ready_list as no longer sorted.
> When we go to pop the next task from the ready_list, we re-sort that
> entire list first.  So we spend something like O(N^2 * log(N)) time just
> sorting, if there are N tables.  Clearly, this code is much less bright
> than it thinks it is (and that's all my fault, if memory serves).
>
> I'm not sure how big a deal this is in practice: in most situations the
> individual jobs are larger than they are in this toy example, plus the
> initial non-parallelizable part of the restore is a bigger bottleneck
> anyway with this many tables.  Still, we do have one real-world
> complaint, so maybe we should look into improving it.
>
> I wonder if we could replace the sorted ready-list with a priority heap,
> although that might be complicated by the fact that pop_next_work_item
> has to be capable of popping something that's not necessarily the
> largest remaining job.  Another idea could be to be a little less eager
> to sort the list every time; I think in practice scheduling wouldn't
> get much worse if we only re-sorted every so often.

Yeah, I think that last idea is reasonable.  Something like if the number
added since the last sort is more than min(50, list_length/4) then sort.
That shouldn't be too invasive.

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Inefficiency in parallel pg_restore with many tables
Hi, On 2023-07-15 13:47:12 -0400, Tom Lane wrote: > I wonder if we could replace the sorted ready-list with a priority heap, > although that might be complicated by the fact that pop_next_work_item > has to be capable of popping something that's not necessarily the > largest remaining job. Another idea could be to be a little less eager > to sort the list every time; I think in practice scheduling wouldn't > get much worse if we only re-sorted every so often. Perhaps we could keep track of where the newly inserted items are, and use insertion sort or such when the number of new elements is much smaller than the size of the already sorted elements? As you say, a straight priority heap might not be easy. But we could just open code using two sorted arrays, one large, one for recent additions that needs to be newly sorted. And occasionally merge the small array into the big array, once it has gotten large enough that sorting becomes expensive. We could go for a heap of N>2 such arrays, but I doubt it would be worth much. Greetings, Andres Freund
Inefficiency in parallel pg_restore with many tables
I looked into the performance gripe at [1] about pg_restore not making
effective use of parallel workers when there are a lot of tables.  I was
able to reproduce that by dumping and parallel restoring 100K tables made
according to this script:

do $$
begin
for i in 1..100000 loop
  execute format('create table t%s (f1 int unique, f2 int unique);', i);
  execute format('insert into t%s select x, x from generate_series(1,1000) x', i);
  if i % 100 = 0 then commit; end if;
end loop;
end $$;

Once pg_restore reaches the parallelizable part of the restore, what
I see is that the parent pg_restore process goes to 100% CPU while its
children (and the server) mostly sit idle; that is, the task dispatch
logic in pg_backup_archiver.c is unable to dispatch tasks fast enough to
keep the children busy.  A quick perf check showed most of the time being
eaten by pg_qsort and TocEntrySizeCompare.

What I believe is happening is that we start the parallel restore phase
with 100K TableData items that are ready to go (they are in the
ready_list) and 200K AddConstraint items that are pending, because we make
those have dependencies on the corresponding TableData so we don't build
an index until after its table is populated.  Each time one of the
TableData items is completed by some worker, the two AddConstraint items
for its table are moved from the pending_list to the ready_list --- and
that means ready_list_insert marks the ready_list as no longer sorted.
When we go to pop the next task from the ready_list, we re-sort that
entire list first.  So we spend something like O(N^2 * log(N)) time just
sorting, if there are N tables.  Clearly, this code is much less bright
than it thinks it is (and that's all my fault, if memory serves).

I'm not sure how big a deal this is in practice: in most situations the
individual jobs are larger than they are in this toy example, plus the
initial non-parallelizable part of the restore is a bigger bottleneck
anyway with this many tables.  Still, we do have one real-world complaint,
so maybe we should look into improving it.

I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the largest
remaining job.  Another idea could be to be a little less eager to sort
the list every time; I think in practice scheduling wouldn't get much
worse if we only re-sorted every so often.

I don't have time to pursue this right now, but perhaps someone else
would like to.

			regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CAEzn%3DHSPXi6OS-5KzGMcZeKzWKOOX1me2u2eCiGtMEZDz9Fqdg%40mail.gmail.com