Re: A qsort template
I wrote: > I agree this is only useful in development. Removal sounds fine to me, > so I'll do that soon. This is done. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Fri, May 20, 2022 at 5:43 AM Tom Lane wrote: > > Peter Geoghegan writes: > > On Thu, May 19, 2022 at 1:12 PM Justin Pryzby wrote: > >> Should these debug lines be removed ? > >> > >> elog(DEBUG1, "qsort_tuple"); > > > I agree -- DEBUG1 seems too chatty for something like this. DEBUG2 > > would be more appropriate IMV. Though I don't feel very strongly about > > it. > > Given the lack of context identification, I'd put the usefulness of > these in production at close to zero. +1 for removing them > altogether, or failing that, downgrade to DEBUG5 or so. I agree this is only useful in development. Removal sounds fine to me, so I'll do that soon. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
Peter Geoghegan writes: > On Thu, May 19, 2022 at 1:12 PM Justin Pryzby wrote: >> Should these debug lines be removed ? >> >> elog(DEBUG1, "qsort_tuple"); > I agree -- DEBUG1 seems too chatty for something like this. DEBUG2 > would be more appropriate IMV. Though I don't feel very strongly about > it. Given the lack of context identification, I'd put the usefulness of these in production at close to zero. +1 for removing them altogether, or failing that, downgrade to DEBUG5 or so. regards, tom lane
Re: A qsort template
On Thu, May 19, 2022 at 1:12 PM Justin Pryzby wrote: > Should these debug lines be removed ? > > elog(DEBUG1, "qsort_tuple"); I agree -- DEBUG1 seems too chatty for something like this. DEBUG2 would be more appropriate IMV. Though I don't feel very strongly about it. -- Peter Geoghegan
Re: A qsort template
On Fri, Apr 22, 2022 at 11:37:29AM +0700, John Naylor wrote: > On Fri, Apr 22, 2022 at 11:13 AM David Rowley wrote: > > > > On Thu, 21 Apr 2022 at 19:09, John Naylor > > wrote: > > > I intend to commit David's v2 fix next week, unless there are > > > objections, or unless he beats me to it. > > > > I wasn't sure if you wanted to handle it or not, but I don't mind > > doing it, so I just pushed it after a small adjustment to a comment. > > Thank you! Should these debug lines be removed ? elog(DEBUG1, "qsort_tuple"); Perhaps if I ask for debug output, I shouldn't be surprised if it changes between major releases - but I still found this surprising. I'm sure it's useful during development and maybe during beta. It could even make sense if it were shown during regression tests (preferably at DEBUG2). But right now it's not. is that ts=# \dt DEBUG: qsort_tuple List of relations -- Justin
Re: A qsort template
On Fri, Apr 22, 2022 at 4:37 PM John Naylor wrote: > On Fri, Apr 22, 2022 at 11:13 AM David Rowley wrote: > > On Thu, 21 Apr 2022 at 19:09, John Naylor > > wrote: > > > I intend to commit David's v2 fix next week, unless there are > > > objections, or unless he beats me to it. > > > > I wasn't sure if you wanted to handle it or not, but I don't mind > > doing it, so I just pushed it after a small adjustment to a comment. > > Thank you! Thanks both for working on this. Seems like a good call to defer the choice of further specialisations.
Re: A qsort template
On Fri, Apr 22, 2022 at 11:13 AM David Rowley wrote: > > On Thu, 21 Apr 2022 at 19:09, John Naylor > wrote: > > I intend to commit David's v2 fix next week, unless there are > > objections, or unless he beats me to it. > > I wasn't sure if you wanted to handle it or not, but I don't mind > doing it, so I just pushed it after a small adjustment to a comment. Thank you! -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Thu, 21 Apr 2022 at 19:09, John Naylor wrote: > I intend to commit David's v2 fix next week, unless there are > objections, or unless he beats me to it. I wasn't sure if you wanted to handle it or not, but I don't mind doing it, so I just pushed it after a small adjustment to a comment. Before going ahead with it I did test a 2-key sort where the leading key values were all the same. I wondered if we'd still see any regression from having to re-compare the leading key all over again. I just did: create table ab (a bigint, b bigint); insert into ab select 0,x from generate_series(1,100)x; vacuum freeze ab; I then ran: select * from ab order by a,b offset 100; 697492434 (Specialize tuplesort routines for different kinds of abbreviated keys) $ pgbench -n -f bench1.sql -T 60 -M prepared postgres tps = 10.651740 (without initial connection time) tps = 10.813647 (without initial connection time) tps = 10.648960 (without initial connection time) 697492434~1 (Remove obsolete comment) $ pgbench -n -f bench1.sql -T 60 -M prepared postgres tps = 9.957163 (without initial connection time) tps = 10.191168 (without initial connection time) tps = 10.145281 (without initial connection time) So it seems there was no regression for that case, at least, not on the AMD machine that I tested on. David
Re: A qsort template
I intend to commit David's v2 fix next week, unless there are objections, or unless he beats me to it. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
> Okay, this makes logical sense and is a smaller patch to boot. I've > re-run my tests (attached) to make sure we have our bases covered. I'm > sharing the min-of-five, as before, but locally I tried . The My sentence there was supposed to read "I tried using median and it was a bit less noisy". -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Tue, Apr 19, 2022 at 12:30 PM David Rowley wrote: > > Thanks for looking at this. > > On Tue, 19 Apr 2022 at 02:11, John Naylor > wrote: > > IIUC, this function is called by tuplesort_begin_common, which in turn > > is called by tuplesort_begin_{heap, indexes, etc}. The latter callers > > set the onlyKey and now oneKeySort variables as appropriate, and > > sometimes hard-coded to false. Is it intentional to set them here > > first? > > > > Falling under the polish that you were likely thinking of above: > > I did put the patch together quickly just for the benchmark and at the > time I was subtly aware that the onlyKey field was being set using a > similar condition as I was using to set the boolean field I'd added. > On reflection today, it should be fine just to check if that field is > NULL or not in the 3 new comparison functions. Similarly to before, > this only needs to be done if the datums compare equally, so does not > add any code to the path where the datums are non-equal. It looks > like the other tuplesort_begin_* functions use a different comparison > function that will never make use of the specialization comparison > functions added by 697492434. Okay, this makes logical sense and is a smaller patch to boot. I've re-run my tests (attached) to make sure we have our bases covered. I'm sharing the min-of-five, as before, but locally I tried . The regression is fixed, and most other differences from v15 seem to be noise. It's possible the naturally fastest cases (pre-sorted ints and bigints) are slower than v15-revert than expected from noise, but it's not clear. I think this is good to go. -- John Naylor EDB: http://www.enterprisedb.com qsort-fix-regression-jcn2.ods Description: application/vnd.oasis.opendocument.spreadsheet
Re: A qsort template
Thanks for looking at this. On Tue, 19 Apr 2022 at 02:11, John Naylor wrote: > IIUC, this function is called by tuplesort_begin_common, which in turn > is called by tuplesort_begin_{heap, indexes, etc}. The latter callers > set the onlyKey and now oneKeySort variables as appropriate, and > sometimes hard-coded to false. Is it intentional to set them here > first? > > Falling under the polish that you were likely thinking of above: I did put the patch together quickly just for the benchmark and at the time I was subtly aware that the onlyKey field was being set using a similar condition as I was using to set the boolean field I'd added. On reflection today, it should be fine just to check if that field is NULL or not in the 3 new comparison functions. Similarly to before, this only needs to be done if the datums compare equally, so does not add any code to the path where the datums are non-equal. It looks like the other tuplesort_begin_* functions use a different comparison function that will never make use of the specialization comparison functions added by 697492434. I separated out the "or" condition that I'd added tot he existing "if" to make it easier to write a comment explaining why we can skip the tiebreak function call. Updated patch attached. David diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 1174e1a31c..12d1bf551b 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -436,7 +436,10 @@ struct Tuplesortstate /* * This variable is shared by the single-key MinimalTuple case and the -* Datum case (which both use qsort_ssup()). Otherwise it's NULL. +* Datum case (which both use qsort_ssup()). It is also used by various +* sort specialization functions when comparing the leading key in a +* tiebreak situation to determine if there are any subsequent keys to +* sort on. It's otherwise NULL. */ SortSupport onlyKey; @@ -698,7 +701,12 @@ qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) compare = ApplyUnsignedSortComparator(a->datum1, a->isnull1, b->datum1, b->isnull1, >sortKeys[0]); - if (compare != 0) + + /* +* No need to call the tiebreak function when the datums differ or if this +* is the only key we're sorting on. +*/ + if (compare != 0 || state->onlyKey != NULL) return compare; return state->comparetup(a, b, state); @@ -713,7 +721,12 @@ qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) compare = ApplySignedSortComparator(a->datum1, a->isnull1, b->datum1, b->isnull1, >sortKeys[0]); - if (compare != 0) + + /* +* No need to call the tiebreak function when the datums differ or if this +* is the only key we're sorting on. +*/ + if (compare != 0 || state->onlyKey != NULL) return compare; return state->comparetup(a, b, state); @@ -728,7 +741,12 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) compare = ApplyInt32SortComparator(a->datum1, a->isnull1, b->datum1, b->isnull1, >sortKeys[0]); - if (compare != 0) + + /* +* No need to call the tiebreak function when the datums differ or if this +* is the only key we're sorting on. +*/ + if (compare != 0 || state->onlyKey != NULL) return compare; return state->comparetup(a, b, state);
Re: A qsort template
On Tue, Apr 12, 2022 at 7:58 AM David Rowley wrote: > > I've attached the patch I tested. It was thrown together very quickly > just to try out the performance. If it's interesting I can polish it > up a bit. If not, I didn't waste too much time. @@ -959,6 +965,10 @@ tuplesort_begin_batch(Tuplesortstate *state) state->tapeset = NULL; + /* check if specialized sorts can skip calling the tiebreak function */ + state->oneKeySort = state->nKeys == 1 && + !state->sortKeys[0].abbrev_converter; + IIUC, this function is called by tuplesort_begin_common, which in turn is called by tuplesort_begin_{heap, indexes, etc}. The latter callers set the onlyKey and now oneKeySort variables as appropriate, and sometimes hard-coded to false. Is it intentional to set them here first? Falling under the polish that you were likely thinking of above: We might rename oneKeySort to skipTiebreaker to avoid confusion. SInce the test for these variable is the same, we could consolidate them into a block and reword this existing comment (which I find a little confusing anyway): /* * The "onlyKey" optimization cannot be used with abbreviated keys, since * tie-breaker comparisons may be required. Typically, the optimization * is only of value to pass-by-value types anyway, whereas abbreviated * keys are typically only of value to pass-by-reference types. */ I can take a stab at this, unless you had something else in mind. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Thu, Apr 14, 2022 at 1:46 PM David Rowley wrote: > > On Wed, 13 Apr 2022 at 23:19, John Naylor > wrote: > > More broadly than the regression, Thomas' is very often the fastest of > > all, at the cost of more binary size. David's is occasionally slower > > than v15 or v15 with revert, but much of that is a slight difference > > and some is probably noise. To add to my summary of results - the v15 code, with and without extra patches, seems slightly worse on B-tree index creation for very low cardinality keys, but that's not an index that's going to be useful (and therefore common) so that's a good tradeoff in my view. The regression David found is more concerning. > Just to get an opinion from some other hardware, I've run your test > script on my AMD 3990x machine. Thanks for that. I only see 4 non-Btree measurements in your results that are larger than v15-revert, versus 8 in mine (Comet Lake). And overall, most of those seem within the noise level. > My opinion here is that the best thing we can learn from both of our > results is, do the patches fix the regression? I'd say the answer is yes for both. > I don't believe it should be about if adding the additional > specializations performs better than skipping the tie break function > call. I think it's pretty obvious that the specializations will be > faster. I think if it was decided that v16 would be the version where > more work should be done to decide on what should be specialized and > what shouldn't be, then we shouldn't let this regression force our > hand to make that choice now. It'll be pretty hard to remove any > specializations once they've been in a released version of Postgres. I agree that a narrow fix is preferable. I'll take a closer look at your patch soon. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Wed, 13 Apr 2022 at 23:19, John Naylor wrote: > More broadly than the regression, Thomas' is very often the fastest of > all, at the cost of more binary size. David's is occasionally slower > than v15 or v15 with revert, but much of that is a slight difference > and some is probably noise. Just to get an opinion from some other hardware, I've run your test script on my AMD 3990x machine. My opinion here is that the best thing we can learn from both of our results is, do the patches fix the regression? I don't believe it should be about if adding the additional specializations performs better than skipping the tie break function call. I think it's pretty obvious that the specializations will be faster. I think if it was decided that v16 would be the version where more work should be done to decide on what should be specialized and what shouldn't be, then we shouldn't let this regression force our hand to make that choice now. It'll be pretty hard to remove any specializations once they've been in a released version of Postgres. David qsort-fix-regression-dgr1.ods Description: application/vnd.oasis.opendocument.spreadsheet
Re: A qsort template
As promised, I've done another round of tests (script and spreadsheet attached) with - v15 with 6974924347 and cc58eecc5d reverted - v15 with Thomas' patch - v15 with David's patch - v15 as is ("std") ...where v15 is at 7b735f8b52ad. This time I limited it to int, bigint, and text types. Since more cases now use random distributions, I also took some measures to tighten up the measurements: - Reuse the same random distribution for all tests where the input is randomized, by invoking the script with/without a second parameter - For the text case, use lpadded ints so that lexicographic order is the same as numeric order. I verified David's mod100 test case and added most test cases from the Orson Peters paper I mentioned above. I won't explain all of them here, but the low cardinality ones are randomized sets of: - mod8 - dupsq: x mod sqrt(n) , for 10 million about 3 thousand distinct values - dup8: (x**8 + n/2) mod n , for 10 million about 80 thousand distinct values, about 80% with 64 duplicates and 20% with 256 duplicates All the clear regressions I can see in v15 are in the above for one or more query types / data types, and both Thomas and David's patches restore performance for those. More broadly than the regression, Thomas' is very often the fastest of all, at the cost of more binary size. David's is occasionally slower than v15 or v15 with revert, but much of that is a slight difference and some is probably noise. -- John Naylor EDB: http://www.enterprisedb.com qsort-fix-regression-jcn1.ods Description: application/vnd.oasis.opendocument.spreadsheet sort-bench-int-text-plus-low-card.sh Description: application/shellscript
Re: A qsort template
On Mon, 11 Apr 2022 at 22:11, John Naylor wrote: > > On Mon, Apr 11, 2022 at 5:34 AM David Rowley wrote: > > > With this particular test, v15 is about 15% *slower* than v14. I > > didn't know what to blame at first, so I tried commenting out the sort > > specialisations and got the results in the red bars in the graph. This > > made it about 7.5% *faster* than v14. So looks like this patch is to > > blame. I then hacked the comparator function that's used in the > > specialisations for BIGINT to comment out the tiebreak to remove the > > indirect function call, which happens to do nothing in this 1 column > > sort case. The aim here was to get an idea what the performance would > > be if there was a specialisation for single column sorts. That's the > > yellow bars, which show about 10% *faster* than master. > > Thanks for investigating! (I assume you meant 10% faster than v14?) Yes, I did mean to say v14. (I'm too used to comparing everything to master) David
Re: A qsort template
On Mon, Apr 11, 2022 at 5:34 AM David Rowley wrote: > With this particular test, v15 is about 15% *slower* than v14. I > didn't know what to blame at first, so I tried commenting out the sort > specialisations and got the results in the red bars in the graph. This > made it about 7.5% *faster* than v14. So looks like this patch is to > blame. I then hacked the comparator function that's used in the > specialisations for BIGINT to comment out the tiebreak to remove the > indirect function call, which happens to do nothing in this 1 column > sort case. The aim here was to get an idea what the performance would > be if there was a specialisation for single column sorts. That's the > yellow bars, which show about 10% *faster* than master. Thanks for investigating! (I assume you meant 10% faster than v14?) On Mon, Apr 11, 2022 at 4:55 AM Peter Geoghegan wrote: > The B quicksort implementation that we adopted is generally > extremely fast for that case, since it uses 3 way partitioning (based > on the Dutch National Flag algorithm). This essentially makes sorting > large groups of duplicates take only linear time (not linearithmic > time). In the below thread, I wondered if it still counts as extremely fast nowadays. I hope to give an answer to that during next cycle. Relevant to the open item, the paper linked there has a variety of low-cardinality cases. I'll incorporate them in a round of tests soon. https://www.postgresql.org/message-id/cafbsxshanjtsx9dnjppxjxwg3bu+yq6pnmsfpm0uvyuafdw...@mail.gmail.com On Mon, Apr 11, 2022 at 4:44 AM Thomas Munro wrote: > Upthread we were discussing which variations it'd be worth investing > extra text segment space on to gain speedup and we put those hard > decisions off for future work, but on reflection, we probably should > tackle this particular point to avoid a regression. I think something > like the attached achieves that (draft, not tested much yet, could > perhaps find a tidier way to code the decision tree). In short: > variants qsort_tuple_{int32,signed,unsigned}() no longer fall back, > but new variants qsort_tuple_{int32,signed,unsigned}_tiebreak() do. Looks good at a glance, I will get some numbers after modifying my test scripts. > We should perhaps also reconsider the other XXX comment about finding > a way to skip the retest of column 1 in the tiebreak comparator. > Perhaps you'd just install a different comparetup function, eg > comparetup_index_btree_tail (which would sharing code), so no need to > multiply specialisations for that. If we need to add these cases to avoid regression, it makes sense to make them work as well as we reasonably can. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Mon, 11 Apr 2022 at 09:44, Thomas Munro wrote: > David Rowley privately reported a performance regression when sorting > single ints with a lot of duplicates, in a case that previously hit > qsort_ssup() but now hits qsort_tuple_int32() and then has to call the > tiebreaker comparator. Note that this comes up only for sorts in a > query, not for eg index builds which always have to tiebreak on item > ptr. I don't have data right now but that'd likely be due to: I've now added this as an open item for v15. David
Re: A qsort template
On Sun, Apr 10, 2022 at 2:44 PM Thomas Munro wrote: > David Rowley privately reported a performance regression when sorting > single ints with a lot of duplicates, in a case that previously hit > qsort_ssup() but now hits qsort_tuple_int32() and then has to call the > tiebreaker comparator. That's not good. The B quicksort implementation that we adopted is generally extremely fast for that case, since it uses 3 way partitioning (based on the Dutch National Flag algorithm). This essentially makes sorting large groups of duplicates take only linear time (not linearithmic time). -- Peter Geoghegan
Re: A qsort template
Hi, David Rowley privately reported a performance regression when sorting single ints with a lot of duplicates, in a case that previously hit qsort_ssup() but now hits qsort_tuple_int32() and then has to call the tiebreaker comparator. Note that this comes up only for sorts in a query, not for eg index builds which always have to tiebreak on item ptr. I don't have data right now but that'd likely be due to: + * XXX: For now, there is no specialization for cases where datum1 is + * authoritative and we don't even need to fall back to a callback at all (that + * would be true for types like int4/int8/timestamp/date, but not true for + * abbreviations of text or multi-key sorts. There could be! Is it worth it? Upthread we were discussing which variations it'd be worth investing extra text segment space on to gain speedup and we put those hard decisions off for future work, but on reflection, we probably should tackle this particular point to avoid a regression. I think something like the attached achieves that (draft, not tested much yet, could perhaps find a tidier way to code the decision tree). In short: variants qsort_tuple_{int32,signed,unsigned}() no longer fall back, but new variants qsort_tuple_{int32,signed,unsigned}_tiebreak() do. We should perhaps also reconsider the other XXX comment about finding a way to skip the retest of column 1 in the tiebreak comparator. Perhaps you'd just install a different comparetup function, eg comparetup_index_btree_tail (which would sharing code), so no need to multiply specialisations for that. Planning to look at this more closely after I've sorted out some other problems, but thought I'd post this draft/problem report early in case John or others have thoughts or would like to run some experiments. From 3ff95aedb1065393a0ae1496cedb463bc998a329 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 6 Apr 2022 11:38:31 +1200 Subject: [PATCH] Specialize tuplesort for keys without tiebreaker. Commit 69749243 noted a future opportunity to avoid calling the comparator function in the case of single column non-abbreviated sort, where datum1 alone is enough to decide the order. David Rowley reported a performance regression in single-column integer sorts with a lot of duplicates, where previously the qsort_ssup() specialization was reached, due to extra calls to the comparator. It seems like a good idea to fix that, so let's complete that TODO item now. XXX WIP, is the if-then-else getting too ugly, maybe switch to a table of functions to search? XXX needs testing, experiments --- src/backend/utils/sort/tuplesort.c | 110 +++-- 1 file changed, 89 insertions(+), 21 deletions(-) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 571fb95532..dc16e6c242 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -312,6 +312,14 @@ struct Tuplesortstate */ bool haveDatum1; + /* + * Whether the comparator function must always be called for tuples with + * equal datum1. This should be set to true if the sorting has extra + * 'invisible' sort order beyond the regular keys, to disable optimizations + * that would otherwise skip the function call. + */ + bool alwaysTiebreak; + /* * This array holds the tuples now in sort memory. If we are in state * INITIAL, the tuples are in no particular order; if we are in state @@ -682,11 +690,6 @@ static void tuplesort_updatemax(Tuplesortstate *state); * * XXX: For now, these fall back to comparator functions that will compare the * leading datum a second time. - * - * XXX: For now, there is no specialization for cases where datum1 is - * authoritative and we don't even need to fall back to a callback at all (that - * would be true for types like int4/int8/timestamp/date, but not true for - * abbreviations of text or multi-key sorts. There could be! Is it worth it? */ /* Used if first key's comparator is ssup_datum_unsigned_compare */ @@ -740,10 +743,12 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) * qsort_ssup() is specialized for the case where the comparetup function * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts * and Datum sorts. qsort_tuple_{unsigned,signed,int32} are specialized for - * common comparison functions on pass-by-value leading datums. + * common comparison functions on pass-by-value leading datums. The _tiebreak + * variants are for when it is necessary to call the comparator function even + * if datum1 is equal. */ -#define ST_SORT qsort_tuple_unsigned +#define ST_SORT qsort_tuple_unsigned_tiebreak #define ST_ELEMENT_TYPE SortTuple #define ST_COMPARE(a, b, state) qsort_tuple_unsigned_compare(a, b, state) #define ST_COMPARE_ARG_TYPE Tuplesortstate @@ -752,7 +757,7 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) #define ST_DEFINE #include "lib/sort_template.h"
Re: A qsort template
Here is the updated insertion sort threshold patch based on Thomas' experimental v4 0010, with adjusted regression test output. I only found a couple places where it could make sense to add sort keys to test queries, but 1) not enough to make a big difference and 2) the adjustments looked out of place, so I decided to just update all the regression tests in one go. Since the patch here is a bit more (and less) involved than Thomas' 0010, I'm going to refrain from committing until it gets review. If not in the next couple days, I will bring it up at the beginning of the v16 cycle. -- John Naylor EDB: http://www.enterprisedb.com From 74b934bc5ed8f6733c064c0ef832e1aa9949f216 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Wed, 6 Apr 2022 16:38:28 +0700 Subject: [PATCH v5] Raise qsort insertion sort threshold Our qsort algorithm is from NetBSD and is described in the 1993 paper "Engineering a Sort Function". It includes three hard coded thresholds that control the point at which we drop to insertion sort, and whether we use 1, 3 or 9 samples for choosing a pivot. The insertion sort threshold of 7 was chosen by testing on hardware available in 1993. Our testing has shown that hardware from 2010 to 2020 prefer a larger threshold. The ideal value varies with input distribution, but 20 seems to be a good compromise that works well with all of them. This is more in line with thresholds found in other software as well. Since the new insertion sort threshold is larger then the singleton range where a single sample is chosen for the pivot, get rid of the middle range. There are now two thresholds. Also use macros intead of hard-coded values. This improves readability and enables specializing the thresholds if desired. The changes in the regression tests are needed for the change in sort stability when the sort key contains duplicates. Thomas Munro and John Naylor Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJhOtjQH%2BwjCodtJzHAFCAPYyt6Qm9E1mUoJph42qo1hg%40mail.gmail.com Discussion: https://www.postgresql.org/message-id/CAFBsxsHr-C1xqjUMjeUN3-FvNzKiAt3gcfBKt8PFN2mov7z2gQ%40mail.gmail.com --- .../expected/pg_stat_statements.out | 6 +- src/include/lib/sort_template.h | 38 +- src/test/regress/expected/create_index.out| 2 +- src/test/regress/expected/geometry.out| 2 +- src/test/regress/expected/groupingsets.out| 4 +- src/test/regress/expected/inet.out| 4 +- src/test/regress/expected/join.out| 2 +- src/test/regress/expected/sqljson.out | 10 +- src/test/regress/expected/tsrf.out| 28 +- src/test/regress/expected/tuplesort.out | 10 +- src/test/regress/expected/window.out | 542 +- 11 files changed, 331 insertions(+), 317 deletions(-) diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out index e0abe34bb6..aeb8f04aea 100644 --- a/contrib/pg_stat_statements/expected/pg_stat_statements.out +++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out @@ -169,12 +169,12 @@ SELECT * SELECT * FROM test ORDER BY a; a | b ---+-- - 1 | a 1 | 111 - 2 | b + 1 | a 2 | 222 - 3 | c + 2 | b 3 | 333 + 3 | c 4 | 444 5 | 555 6 | 666 diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h index 3122a93009..7b92b2c084 100644 --- a/src/include/lib/sort_template.h +++ b/src/include/lib/sort_template.h @@ -40,6 +40,11 @@ * * - ST_COMPARE_ARG_TYPE - type of extra argument * + * Optional macros for tuning algorithm choices: + * + * - ST_THRESHOLD_INSERTION_SORT - below this size we do insertion sort + * - ST_THRESHOLD_MED9 - above this size we partition with med9, otherwise with med3 + * * The prototype of the generated sort function is: * * void ST_SORT(ST_ELEMENT_TYPE *data, size_t n, @@ -172,6 +177,14 @@ #define ST_SORT_INVOKE_ARG #endif +/* Default input size thresholds to control algorithm behavior. */ +#ifndef ST_THRESHOLD_INSERTION_SORT +#define ST_THRESHOLD_INSERTION_SORT 20 +#endif +#ifndef ST_THRESHOLD_MED9 +#define ST_THRESHOLD_MED9 40 +#endif + #ifdef ST_DECLARE #ifdef ST_COMPARE_RUNTIME_POINTER @@ -296,7 +309,7 @@ ST_SORT(ST_ELEMENT_TYPE * data, size_t n loop: DO_CHECK_FOR_INTERRUPTS(); - if (n < 7) + if (n < ST_THRESHOLD_INSERTION_SORT) { for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP; pm += ST_POINTER_STEP) @@ -318,21 +331,20 @@ loop: } if (presorted) return; + /* Partition with median of three for medium input. */ + pl = a; pm = a + (n / 2) * ST_POINTER_STEP; - if (n > 7) + pn = a + (n - 1) * ST_POINTER_STEP; + if (n >
Re: A qsort template
On Mon, Apr 4, 2022 at 4:32 AM Andres Freund wrote: > Would be good to get this committed soon, so we can see further ubsan > violations introduced in the next few days (and so I can unblock my local dev > tests :P). Pushed (with a minor tweak).
Re: A qsort template
Hi, On 2022-04-03 17:46:28 +1200, Thomas Munro wrote: > On Sun, Apr 3, 2022 at 11:11 AM Andres Freund wrote: > > On 2022-04-03 09:45:13 +1200, Thomas Munro wrote: > > > I think we just need to decide up front if we're in a situation that > > > can't provide datum1/isnull1 (in this case because it's an expression > > > index), and skip the optimised paths. Here's an experimental patch... > > > still looking into whether there are more cases like this... > > I didn't find anything else. > > Maybe it'd be better if we explicitly declared whether datum1 is used > in each tuplesort mode's 'begin' function, right next to the code that > installs the set of routines that are in control of that? Trying that > in this version. Is it clearer what's going on like this? Seems an improvement. > > I'm a bit worried that none of the !ubsan tests failed on this... > > In accordance with whoever-it-was-that-said-that's law about things > that aren't tested, this are turned out to be broken already[1]. Yea :/. Would be good to get this committed soon, so we can see further ubsan violations introduced in the next few days (and so I can unblock my local dev tests :P). Greetings, Andres Freund
Re: A qsort template
On Sun, Apr 3, 2022 at 11:11 AM Andres Freund wrote: > On 2022-04-03 09:45:13 +1200, Thomas Munro wrote: > > I think we just need to decide up front if we're in a situation that > > can't provide datum1/isnull1 (in this case because it's an expression > > index), and skip the optimised paths. Here's an experimental patch... > > still looking into whether there are more cases like this... I didn't find anything else. Maybe it'd be better if we explicitly declared whether datum1 is used in each tuplesort mode's 'begin' function, right next to the code that installs the set of routines that are in control of that? Trying that in this version. Is it clearer what's going on like this? > That's a lot of redundant checks. How about putting all the checks for > optimized paths into one if (state->sortKeys && !state->disabl1e_datum1)? OK, sure. > I'm a bit worried that none of the !ubsan tests failed on this... In accordance with whoever-it-was-that-said-that's law about things that aren't tested, this are turned out to be broken already[1]. Once we fix that we should have a new test in the three that might also eventually have failed under this UB, given enough chaos. [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BbA%2BbmwD36_oDxAoLrCwZjVtST2fqe%3Db4%3DqZcmU7u89A%40mail.gmail.com From b0a79f91edc6ce305f77373b92f31100fcad07d5 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 3 Apr 2022 09:41:04 +1200 Subject: [PATCH v2] Fix tuplesort optimizations for expression-based CLUSTER. When redirecting sorts to specialized variants, commit 69749243 failed to handle the case where CLUSTER-sort decides not to initialize datum1 and isnull1. Fix by hoisting that decision up a level and advertising whether datum1 can be used, in the Tuplesortstate object. Per reports from UBsan and Valgrind while running the cluster.sql test. Reviewed-by: Andres Freund Discussion: https://postgr.es/m/CAFBsxsF1TeK5Fic0M%2BTSJXzbKsY6aBqJGNj6ptURuB09ZF6k_w%40mail.gmail.com --- src/backend/utils/sort/tuplesort.c | 73 ++ 1 file changed, 53 insertions(+), 20 deletions(-) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 361527098f..58441ed638 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -306,6 +306,12 @@ struct Tuplesortstate void (*readtup) (Tuplesortstate *state, SortTuple *stup, LogicalTape *tape, unsigned int len); + /* + * Whether SortTuple's datum1 and isnull1 members are maintained by the + * above routines. If not, some sort specializations are disabled. + */ + bool haveDatum1; + /* * This array holds the tuples now in sort memory. If we are in state * INITIAL, the tuples are in no particular order; if we are in state @@ -1016,6 +1022,7 @@ tuplesort_begin_heap(TupleDesc tupDesc, state->copytup = copytup_heap; state->writetup = writetup_heap; state->readtup = readtup_heap; + state->haveDatum1 = true; state->tupDesc = tupDesc; /* assume we need not copy tupDesc */ state->abbrevNext = 10; @@ -1095,6 +1102,15 @@ tuplesort_begin_cluster(TupleDesc tupDesc, state->indexInfo = BuildIndexInfo(indexRel); + /* + * If we don't have a simple leading attribute, we can't easily initialize + * datum1, so disable optimizations based on datum1. + */ + if (state->indexInfo->ii_IndexAttrNumbers[0] == 0) + state->haveDatum1 = false; + else + state->haveDatum1 = true; + state->tupDesc = tupDesc; /* assume we need not copy tupDesc */ indexScanKey = _bt_mkscankey(indexRel, NULL); @@ -1188,6 +1204,7 @@ tuplesort_begin_index_btree(Relation heapRel, state->writetup = writetup_index; state->readtup = readtup_index; state->abbrevNext = 10; + state->haveDatum1 = true; state->heapRel = heapRel; state->indexRel = indexRel; @@ -1262,6 +1279,7 @@ tuplesort_begin_index_hash(Relation heapRel, state->copytup = copytup_index; state->writetup = writetup_index; state->readtup = readtup_index; + state->haveDatum1 = true; state->heapRel = heapRel; state->indexRel = indexRel; @@ -1302,6 +1320,7 @@ tuplesort_begin_index_gist(Relation heapRel, state->copytup = copytup_index; state->writetup = writetup_index; state->readtup = readtup_index; + state->haveDatum1 = true; state->heapRel = heapRel; state->indexRel = indexRel; @@ -1366,6 +1385,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, state->writetup = writetup_datum; state->readtup = readtup_datum; state->abbrevNext = 10; + state->haveDatum1 = true; state->datumType = datumType; @@ -3593,27 +3613,40 @@ tuplesort_sort_memtuples(Tuplesortstate *state) if (state->memtupcount > 1) { - /* Do we have a specialization for the leading column's comparator? */ - if (state->sortKeys && - state->sortKeys[0].comparator == ssup_datum_unsigned_cmp) - { - elog(DEBUG1, "qsort_tuple_unsigned"); - qsort_tuple_unsigned(state->memtuples, state->memtupcount,
Re: A qsort template
Hi, On 2022-04-03 09:45:13 +1200, Thomas Munro wrote: > On Sun, Apr 3, 2022 at 9:03 AM Andres Freund wrote: > > It's certainly not pretty that copytup_cluster() can use SortTuples without > > actually using SortTuples. Afaics it basically only computes isnull1/datum1 > > if > > state->indexInfo->ii_IndexAttrNumbers[0] == 0. > > I think we just need to decide up front if we're in a situation that > can't provide datum1/isnull1 (in this case because it's an expression > index), and skip the optimised paths. Here's an experimental patch... > still looking into whether there are more cases like this... That's a lot of redundant checks. How about putting all the checks for optimized paths into one if (state->sortKeys && !state->disable_datum1)? I'm a bit worried that none of the !ubsan tests failed on this... Greetings, Andres Freund
Re: A qsort template
On Sun, Apr 3, 2022 at 9:03 AM Andres Freund wrote: > It's certainly not pretty that copytup_cluster() can use SortTuples without > actually using SortTuples. Afaics it basically only computes isnull1/datum1 if > state->indexInfo->ii_IndexAttrNumbers[0] == 0. I think we just need to decide up front if we're in a situation that can't provide datum1/isnull1 (in this case because it's an expression index), and skip the optimised paths. Here's an experimental patch... still looking into whether there are more cases like this... (There's also room to recognise when you don't even need to look at isnull1 for a less branchy optimised sort, but that was already discussed and put off for later.) From a8a7c9ec4f0b96ed9d889d008731864f3d4e87d1 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 3 Apr 2022 09:41:04 +1200 Subject: [PATCH] WIP: Fix tuplesort optimizations for expression-based CLUSTER. --- src/backend/utils/sort/tuplesort.c | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 361527098f..34714d9baf 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -268,6 +268,8 @@ struct Tuplesortstate MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */ LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */ + bool disable_datum1; /* disable leading value-based optimizations */ + /* * These function pointers decouple the routines that must know what kind * of tuple we are sorting from the routines that don't need to know it. @@ -1095,6 +1097,13 @@ tuplesort_begin_cluster(TupleDesc tupDesc, state->indexInfo = BuildIndexInfo(indexRel); + /* + * If we don't have a simple attribute, disable the use of datum1/isnull1. + * copytup_cluster() doesn't know how to compute expressions. + */ + if (state->indexInfo->ii_IndexAttrNumbers[0] == 0) + state->disable_datum1 = true; + state->tupDesc = tupDesc; /* assume we need not copy tupDesc */ indexScanKey = _bt_mkscankey(indexRel, NULL); @@ -3593,20 +3602,27 @@ tuplesort_sort_memtuples(Tuplesortstate *state) if (state->memtupcount > 1) { - /* Do we have a specialization for the leading column's comparator? */ + /* + * Do we have a specialization for the leading column's comparator, + * and has the leading column's value or abbreviation been stored in + * datum1/isnull1? + */ if (state->sortKeys && + !state->disable_datum1 && state->sortKeys[0].comparator == ssup_datum_unsigned_cmp) { elog(DEBUG1, "qsort_tuple_unsigned"); qsort_tuple_unsigned(state->memtuples, state->memtupcount, state); } else if (state->sortKeys && + !state->disable_datum1 && state->sortKeys[0].comparator == ssup_datum_signed_cmp) { elog(DEBUG1, "qsort_tuple_signed"); qsort_tuple_signed(state->memtuples, state->memtupcount, state); } else if (state->sortKeys && + !state->disable_datum1 && state->sortKeys[0].comparator == ssup_datum_int32_cmp) { elog(DEBUG1, "qsort_tuple_int32"); @@ -4134,7 +4150,7 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup) * set up first-column key value, and potentially abbreviate, if it's a * simple column */ - if (state->indexInfo->ii_IndexAttrNumbers[0] == 0) + if (state->disable_datum1) return; original = heap_getattr(tuple, -- 2.35.1
Re: A qsort template
Hi, On 2022-04-02 15:20:27 -0500, Justin Pryzby wrote: > On Sat, Apr 02, 2022 at 06:41:30PM +0700, John Naylor wrote: > > On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro wrote: > > > Reproduced locally, using the same few lines from the cluster.sql > > > test. I'll try to dig more tomorrow... > > > > Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11, > > with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ... > > Like Thomas just said, I had to use: > CFLAGS="-Og -fsanitize=undefined,alignment -fno-sanitize-recover=all > > I'm a couple few steps out of my league here, but it may be an issue with: > > commit 4ea51cdfe85ceef8afabceb03c446574daa0ac23 > Author: Robert Haas > Date: Mon Jan 19 15:20:31 2015 -0500 > > Use abbreviated keys for faster sorting of text datums. > > This is enough to avoid the crash, which might be a useful hint.. > > @@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple > *stup, void *tup) > /* > * set up first-column key value, and potentially abbreviate, if it's > a > * simple column > */ > + stup->isnull1 = false; > if (state->indexInfo->ii_IndexAttrNumbers[0] == 0) > return; > > original = heap_getattr(tuple, > > state->indexInfo->ii_IndexAttrNumbers[0], > state->tupDesc, > >isnull1); I don't think that can be correct - the column can be NULL afaics. And I don't think in that patch it's needed, because it always goes through ->comparetup() when state->onlyKey isn't explicitly set. Which tuplesort_begin_cluster() as well as several others don't. And you'd just sort an uninitialized datum immediately after. It's certainly not pretty that copytup_cluster() can use SortTuples without actually using SortTuples. Afaics it basically only computes isnull1/datum1 if state->indexInfo->ii_IndexAttrNumbers[0] == 0. Greetings, Andres Freund
Re: A qsort template
On Sun, Apr 3, 2022 at 8:20 AM Justin Pryzby wrote: > @@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple > *stup, void *tup) > + stup->isnull1 = false; Looks like I might have failed to grok the scheme for encoding null into SortTuple objects. It's clearly uninitialised in some paths, with a special 0 value in datum1. Will need to look more closely with more coffee...
Re: A qsort template
Hi, On 2022-04-03 08:07:58 +1200, Thomas Munro wrote: > On Sun, Apr 3, 2022 at 12:41 AM John Naylor > wrote: > > On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro wrote: > > > Reproduced locally, using the same few lines from the cluster.sql > > > test. I'll try to dig more tomorrow... > > > > Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11, > > with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ... > > Maybe you need to add -fno-sanitize-recover=all to make it crash, > otherwise it just prints the warning and keeps going. I commented with a few more details on https://postgr.es/m/20220402201557.thanbsxcql5lk6pc%40alap3.anarazel.de and an preliminary analysis in https://www.postgresql.org/message-id/20220402203344.ahup2u5n73cdbbcv%40alap3.anarazel.de Greetings, Andres Freund
Re: A qsort template
On Sat, Apr 02, 2022 at 06:41:30PM +0700, John Naylor wrote: > On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro wrote: > > Reproduced locally, using the same few lines from the cluster.sql > > test. I'll try to dig more tomorrow... > > Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11, > with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ... Like Thomas just said, I had to use: CFLAGS="-Og -fsanitize=undefined,alignment -fno-sanitize-recover=all I'm a couple few steps out of my league here, but it may be an issue with: commit 4ea51cdfe85ceef8afabceb03c446574daa0ac23 Author: Robert Haas Date: Mon Jan 19 15:20:31 2015 -0500 Use abbreviated keys for faster sorting of text datums. This is enough to avoid the crash, which might be a useful hint.. @@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup) /* * set up first-column key value, and potentially abbreviate, if it's a * simple column */ + stup->isnull1 = false; if (state->indexInfo->ii_IndexAttrNumbers[0] == 0) return; original = heap_getattr(tuple, state->indexInfo->ii_IndexAttrNumbers[0], state->tupDesc, >isnull1);
Re: A qsort template
On Sun, Apr 3, 2022 at 12:41 AM John Naylor wrote: > On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro wrote: > > Reproduced locally, using the same few lines from the cluster.sql > > test. I'll try to dig more tomorrow... > > Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11, > with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ... Maybe you need to add -fno-sanitize-recover=all to make it crash, otherwise it just prints the warning and keeps going.
Re: A qsort template
On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro wrote: > Reproduced locally, using the same few lines from the cluster.sql > test. I'll try to dig more tomorrow... Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11, with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ... -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro wrote: > It looks like UBsan sees a problem, per BF animal kestrel: > > /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722:51: > runtime error: load of value 96, which is not a valid value for type > 'bool' Yeah, same with tamandua. Then, skink (a Valgrind animal) shows: ==1940791== VALGRINDERROR-BEGIN ==1940791== Conditional jump or move depends on uninitialised value(s) ==1940791==at 0x73D394: ApplyInt32SortComparator (sortsupport.h:311) ==1940791==by 0x73D394: qsort_tuple_int32_compare (tuplesort.c:722) ==1940791==by 0x73D394: qsort_tuple_int32 (sort_template.h:313) ==1940791==by 0x7409BC: tuplesort_sort_memtuples (tuplesort.c:3613) ==1940791==by 0x742806: tuplesort_performsort (tuplesort.c:2154) ==1940791==by 0x23C109: heapam_relation_copy_for_cluster (heapam_handler.c:955) ==1940791==by 0x35799A: table_relation_copy_for_cluster (tableam.h:1658) ==1940791==by 0x35799A: copy_table_data (cluster.c:913) ==1940791==by 0x359016: rebuild_relation (cluster.c:606) ==1940791==by 0x35914E: cluster_rel (cluster.c:427) ==1940791==by 0x3594EB: cluster (cluster.c:195) ==1940791==by 0x5C73FF: standard_ProcessUtility (utility.c:862) ==1940791==by 0x5C78D0: ProcessUtility (utility.c:530) ==1940791==by 0x5C4C7B: PortalRunUtility (pquery.c:1158) ==1940791==by 0x5C4F78: PortalRunMulti (pquery.c:1315) ==1940791== Uninitialised value was created by a stack allocation ==1940791==at 0x74224E: tuplesort_putheaptuple (tuplesort.c:1800) -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Sat, Apr 2, 2022 at 9:38 PM John Naylor wrote: > On Fri, Apr 1, 2022 at 4:43 AM Thomas Munro wrote: > > On Thu, Mar 31, 2022 at 11:09 PM John Naylor > > wrote: > > > In a couple days I'm going to commit the v3 patch "accelerate tuple > > > sorting for common types" as-is after giving it one more look, barring > > > objections. > > Pushed. It looks like UBsan sees a problem, per BF animal kestrel: /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722:51: runtime error: load of value 96, which is not a valid value for type 'bool' #5 0x00eb65d4 in qsort_tuple_int32_compare (a=0x4292ce0, b=0x4292cf8, state=0x4280130) at /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722 #6 qsort_tuple_int32 (data=, n=133, arg=arg@entry=0x4280130) at /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/include/lib/sort_template.h:313 #7 0x00eaf747 in tuplesort_sort_memtuples (state=state@entry=0x4280130) at /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:3613 #8 0x00eaedcb in tuplesort_performsort (state=state@entry=0x4280130) at /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:2154 #9 0x00573d60 in heapam_relation_copy_for_cluster (OldHeap=, NewHeap=, OldIndex=, use_sort=, OldestXmin=11681, xid_cutoff=, multi_cutoff=0x7ffecb0cfa70, num_tuples=0x7ffecb0cfa38, tups_vacuumed=0x7ffecb0cfa20, tups_recently_dead=0x7ffecb0cfa28) at /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/access/heap/heapam_handler.c:955 Reproduced locally, using the same few lines from the cluster.sql test. I'll try to dig more tomorrow...
Re: A qsort template
I wrote: > I started towards incorporating the change in insertion sort threshold > (part of 0010), but that caused regression test failures, so that will > have to wait for a bit of analysis and retesting. (My earlier tests > were done in a separate module.) The failures seem to be where sort order is partially specified. E.g. ORDER BY col_a, where there are duplicates there and other columns are different. Insertion sort is stable IIRC, so moving the threshold caused different orders in these cases. Some cases can be conveniently fixed with additional columns in the ORDER BY clause. I'll go through the failures and see how much can be cleaned up as a preparatory refactoring. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Fri, Apr 1, 2022 at 4:43 AM Thomas Munro wrote: > > On Thu, Mar 31, 2022 at 11:09 PM John Naylor > wrote: > > In a couple days I'm going to commit the v3 patch "accelerate tuple > > sorting for common types" as-is after giving it one more look, barring > > objections. Pushed. > Hi John, > > Thanks so much for all the work you've done here! I feel bad that I > lobbed so many experimental patches in here and then ran away due to > lack of cycles. That particular patch (the one cfbot has been chewing > on all this time) does indeed seem committable, despite the > deficiencies/opportunities listed in comments. It's nice to reduce > code duplication, it gives the right answers, and it goes faster. Thanks for chiming in! It gives me more confidence that there wasn't anything amiss that may have gone unnoticed. And no worries -- my own review efforts here have been sporadic. ;-) -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Thu, Mar 31, 2022 at 11:09 PM John Naylor wrote: > In a couple days I'm going to commit the v3 patch "accelerate tuple > sorting for common types" as-is after giving it one more look, barring > objections. Hi John, Thanks so much for all the work you've done here! I feel bad that I lobbed so many experimental patches in here and then ran away due to lack of cycles. That particular patch (the one cfbot has been chewing on all this time) does indeed seem committable, despite the deficiencies/opportunities listed in comments. It's nice to reduce code duplication, it gives the right answers, and it goes faster. > I started towards incorporating the change in insertion sort threshold > (part of 0010), but that caused regression test failures, so that will > have to wait for a bit of analysis and retesting. (My earlier tests > were done in a separate module.) > > The rest in this series that I looked at closely were either > refactoring or could use some minor tweaks so likely v16 material. Looking forward to it.
Re: A qsort template
In a couple days I'm going to commit the v3 patch "accelerate tuple sorting for common types" as-is after giving it one more look, barring objections. I started towards incorporating the change in insertion sort threshold (part of 0010), but that caused regression test failures, so that will have to wait for a bit of analysis and retesting. (My earlier tests were done in a separate module.) The rest in this series that I looked at closely were either refactoring or could use some minor tweaks so likely v16 material. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
I wrote: > > 0010 - Thresholds on my TODO list. > > I did some basic tests on the insertion sort thresholds, and it looks > like we could safely and profitably increase the current value from 7 > to 20 or so, in line with other more recent implementations. I've > attached an addendum on top of 0012 and the full test results on an > Intel Coffee Lake machine with gcc 11.1. I found that the object test > setup in 0012 had some kind of bug that was comparing the pointer of > the object array. Rather than fix that, I decided to use Datums, but > with the two extremes in comparator: simple branching with machine > instructions vs. a SQL-callable function. The papers I've read > indicate the results for Datum sizes would not be much different for > small structs. The largest existing sort element is SortTuple, but > that's only 24 bytes and has a bulky comparator as well. > > The first thing to note is that I rejected outright any testing of a > "middle value" where the pivot is simply the middle of the array. Even > the Bently and McIlroy paper which is the reference for our > implementation says "The range that consists of the single integer 7 > could be eliminated, but has been left adjustable because on some > machines larger ranges are a few percent better". > > I tested thresholds up to 64, which is where I guessed results to get > worse (most implementations are smaller than that). Here are the best > thresholds at a quick glance: > > - elementary comparator: > > random: 16 or greater > decreasing, rotate: get noticeably better all the way up to 64 > organ: little difference, but seems to get better all the way up to 64 > 0/1: seems to get worse above 20 > > - SQL-callable comparator: > > random: between 12 and 20, but slight differences until 32 > decreasing, rotate: get noticeably better all the way up to 64 > organ: seems best at 12, but slight differences until 32 > 0/1: slight differences > > Based on these tests and this machine, it seems 20 is a good default > value. I'll repeat this test on one older Intel and one non-Intel > platform with older compilers. The above was an Intel Comet Lake / gcc 11, and I've run the same test on a Haswell-era Xeon / gcc 8 and a Power8 machine / gcc 4.8. The results on those machines are pretty close to the above (full results attached). The noticeable exception is the Power8 on random input with a slow comparator -- those measurements there are more random than others so we can't draw conclusions from them, but the deviations are small in any case. I'm still thinking 20 or so is about right. I've put a lot out here recently, so I'll take a break now and come back in a few weeks. (no running tally here because the conclusions haven't changed since last message) -- John Naylor EDB: http://www.enterprisedb.com NOTICE: [direct] size=8MB, order=random, threshold=7, test=0, time=0.130188 NOTICE: [direct] size=8MB, order=random, threshold=7, test=1, time=0.124503 NOTICE: [direct] size=8MB, order=random, threshold=7, test=2, time=0.124557 NOTICE: [direct] size=8MB, order=random, threshold=12, test=0, time=0.119103 NOTICE: [direct] size=8MB, order=random, threshold=12, test=1, time=0.119035 NOTICE: [direct] size=8MB, order=random, threshold=12, test=2, time=0.086948 NOTICE: [direct] size=8MB, order=random, threshold=16, test=0, time=0.082710 NOTICE: [direct] size=8MB, order=random, threshold=16, test=1, time=0.082771 NOTICE: [direct] size=8MB, order=random, threshold=16, test=2, time=0.083004 NOTICE: [direct] size=8MB, order=random, threshold=20, test=0, time=0.080768 NOTICE: [direct] size=8MB, order=random, threshold=20, test=1, time=0.080764 NOTICE: [direct] size=8MB, order=random, threshold=20, test=2, time=0.080635 NOTICE: [direct] size=8MB, order=random, threshold=24, test=0, time=0.080678 NOTICE: [direct] size=8MB, order=random, threshold=24, test=1, time=0.080555 NOTICE: [direct] size=8MB, order=random, threshold=24, test=2, time=0.080604 NOTICE: [direct] size=8MB, order=random, threshold=28, test=0, time=0.079002 NOTICE: [direct] size=8MB, order=random, threshold=28, test=1, time=0.078901 NOTICE: [direct] size=8MB, order=random, threshold=28, test=2, time=0.082200 NOTICE: [direct] size=8MB, order=random, threshold=32, test=0, time=0.079317 NOTICE: [direct] size=8MB, order=random, threshold=32, test=1, time=0.079164 NOTICE: [direct] size=8MB, order=random, threshold=32, test=2, time=0.079308 NOTICE: [direct] size=8MB, order=random, threshold=64, test=0, time=0.078604 NOTICE: [direct] size=8MB, order=random, threshold=64, test=1, time=0.078718 NOTICE: [direct] size=8MB, order=random, threshold=64, test=2, time=0.078689 NOTICE: [direct] size=8MB, order=decreasing, threshold=7, test=0, time=0.025097 NOTICE: [direct] size=8MB, order=decreasing, threshold=7, test=1, time=0.025078 NOTICE: [direct] size=8MB, order=decreasing, threshold=7, test=2, time=0.025095 NOTICE: [direct] size=8MB, order=decreasing, threshold=12, test=0,
Re: A qsort template
I wrote: > 0010 - Thresholds on my TODO list. I did some basic tests on the insertion sort thresholds, and it looks like we could safely and profitably increase the current value from 7 to 20 or so, in line with other more recent implementations. I've attached an addendum on top of 0012 and the full test results on an Intel Coffee Lake machine with gcc 11.1. I found that the object test setup in 0012 had some kind of bug that was comparing the pointer of the object array. Rather than fix that, I decided to use Datums, but with the two extremes in comparator: simple branching with machine instructions vs. a SQL-callable function. The papers I've read indicate the results for Datum sizes would not be much different for small structs. The largest existing sort element is SortTuple, but that's only 24 bytes and has a bulky comparator as well. The first thing to note is that I rejected outright any testing of a "middle value" where the pivot is simply the middle of the array. Even the Bently and McIlroy paper which is the reference for our implementation says "The range that consists of the single integer 7 could be eliminated, but has been left adjustable because on some machines larger ranges are a few percent better". I tested thresholds up to 64, which is where I guessed results to get worse (most implementations are smaller than that). Here are the best thresholds at a quick glance: - elementary comparator: random: 16 or greater decreasing, rotate: get noticeably better all the way up to 64 organ: little difference, but seems to get better all the way up to 64 0/1: seems to get worse above 20 - SQL-callable comparator: random: between 12 and 20, but slight differences until 32 decreasing, rotate: get noticeably better all the way up to 64 organ: seems best at 12, but slight differences until 32 0/1: slight differences Based on these tests and this machine, it seems 20 is a good default value. I'll repeat this test on one older Intel and one non-Intel platform with older compilers. -- Running tally of patchset: 0001 - bsearch and unique is good to have, and we can keep the return type pending further tests -- if none happen this cycle, suggest committing this without the return type symbol. 0002/3 - I've yet to see a case where branchless comparators win, but other than that, these are good. Notational improvement and not performance sensitive. 0004/5 - Computing the arguments slows it down, but accessing the underlying int16s gives an improvement. [1] Haven't done an in-situ test on VACUUM. Could be worth it for pg15, since I imagine the proposals for dead tuple storage won't be ready this cycle. 0006 - I expect this to be slower too. I also wonder if this could also use the global function in 0004 once it's improved. 0007 - untested 0008 - Good performance in microbenchmarks, no in-situ testing. Inlined reversal is not worth the binary space or notational overhead. 0009 - Based on 0004, I would guess that computing the arguments is too slow. Not sure how to test in-situ to see if specializing helps. 0010 - Suggest leaving out the middle threshold and setting the insertion sort threshold to ~20. Might also name them ST_INSERTION_SORT_THRESHOLD and ST_NINTHER_THRESHOLD. (TODO: test on other platforms) 0011 - Committed. v3-0001 comparators for abbreviated keys - Clearly a win in this state already, especially for the "unsigned" case [2]. (gist untested) There are additional possible improvements mentioned, but they seem like a PG16 project(s). [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BS5SMoG8Z2PHj0bsK70CxVLgqQR1orQJq6Cjgibu26vA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFBsxsEFGAJ9eBpQVb5a86BE93WER3497zn2OT5wbjm1HHcqgA%40mail.gmail.com (TODO: refine test) -- John Naylor EDB: http://www.enterprisedb.com NOTICE: [direct] size=8MB, order=random, threshold=7, test=0, time=0.084503 NOTICE: [direct] size=8MB, order=random, threshold=7, test=1, time=0.087980 NOTICE: [direct] size=8MB, order=random, threshold=7, test=2, time=0.084299 NOTICE: [direct] size=8MB, order=random, threshold=12, test=0, time=0.081893 NOTICE: [direct] size=8MB, order=random, threshold=12, test=1, time=0.081907 NOTICE: [direct] size=8MB, order=random, threshold=12, test=2, time=0.081943 NOTICE: [direct] size=8MB, order=random, threshold=16, test=0, time=0.080810 NOTICE: [direct] size=8MB, order=random, threshold=16, test=1, time=0.080793 NOTICE: [direct] size=8MB, order=random, threshold=16, test=2, time=0.080922 NOTICE: [direct] size=8MB, order=random, threshold=20, test=0, time=0.080399 NOTICE: [direct] size=8MB, order=random, threshold=20, test=1, time=0.080433 NOTICE: [direct] size=8MB, order=random, threshold=20, test=2, time=0.080413 NOTICE: [direct] size=8MB, order=random, threshold=24, test=0, time=0.080342 NOTICE: [direct] size=8MB, order=random, threshold=24, test=1, time=0.080446 NOTICE: [direct] size=8MB, order=random, threshold=24, test=2, time=0.080339 NOTICE:
Re: A qsort template
Hi, I've run a few tests to get some feel for the effects of various comparators on Datums containing int32. I've attached the full results, as well as the (messy) patch which applies on top of 0012 to run the tests. I'll excerpt some of those results as I go through them here. For now, I only ran input orders of sorted, random, and reversed. 1) Specializing This is a win in all cases, including SQL-callable comparators (the case here is for _bt_sort_array_elements). NOTICE: [traditional qsort] size=8MB, order=random, cmp=arg, test=2, time=0.140526 NOTICE: [inlined] size=8MB, order=random, cmp=inline, test=0, time=0.085023 NOTICE: [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=2, time=0.256708 NOTICE: [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=0, time=0.192063 2) Branchless operations The int case is for how to perform the comparison, and the SQL case is referring to how to reverse the sort order.Surprisingly, they don't seem to help for direct comparisons, and in fact they seem worse. I'll have to dig a bit deeper to be sure, but it's not looking good now. NOTICE: [inlined] size=8MB, order=random, cmp=inline, test=2, time=0.084781 NOTICE: [branchless] size=8MB, order=random, cmp=branchless, test=0, time=0.091837 NOTICE: [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=2, time=0.192018 NOTICE: [SQL inlined reverse] size=8MB, order=random, cmp=SQL-inline-rev, test=0, time=0.190797 When the effect is reversing a list, the direct comparisons seem much worse, and the SQL ones aren't helped. NOTICE: [inlined] size=8MB, order=decreasing, cmp=inline, test=2, time=0.024963 NOTICE: [branchless] size=8MB, order=decreasing, cmp=branchless, test=0, time=0.036423 NOTICE: [SQL inlined] size=8MB, order=decreasing, cmp=SQL-inline, test=0, time=0.125182 NOTICE: [SQL inlined reverse] size=8MB, order=increasing, cmp=SQL-inline-rev, test=0, time=0.127051 -- Since I have a couple more planned tests, I'll keep a running tally on the current state of the patch set so that summaries are not scattered over many emails: 0001 - bsearch and unique is good to have, and we can keep the return type pending further tests 0002/3 - I've yet to see a case where branchless comparators win, but other than that, these are good. Notational improvement and not performance sensitive. 0004/5 - Computing the arguments slows it down, but accessing the underlying int16s gives an improvement. [1] Haven't done an in-situ test on VACUUM. Could be worth it for pg15, since I imagine the proposals for dead tuple storage won't be ready this cycle. 0006 - I expect this to be slower too. I also wonder if this could also use the global function in 0004 once it's improved. 0007 - untested 0008 - Good performance in microbenchmarks, no in-situ testing. Inlined reversal is not worth the binary space or notational overhead. 0009 - Based on 0004, I would guess that computing the arguments is too slow. Not sure how to test in-situ to see if specializing helps. 0010 - Thresholds on my TODO list. 0011 - A simple correction -- I'll go ahead and commit this. v3-0001 comparators for abbreviated keys - Clearly a win, especially for the "unsigned" case [2]. There are still possible improvements, but they seem like a pg16 project(s). [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BS5SMoG8Z2PHj0bsK70CxVLgqQR1orQJq6Cjgibu26vA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFBsxsEFGAJ9eBpQVb5a86BE93WER3497zn2OT5wbjm1HHcqgA%40mail.gmail.com (I just realized in that message I didn't attach the script for that, and also attached an extra draft spreadsheet. I'll improve the tests and rerun later) -- John Naylor EDB: http://www.enterprisedb.com NOTICE: [traditional qsort] size=8MB, order=random, cmp=arg, test=0, time=0.144545 NOTICE: [traditional qsort] size=8MB, order=random, cmp=arg, test=1, time=0.140534 NOTICE: [traditional qsort] size=8MB, order=random, cmp=arg, test=2, time=0.140526 NOTICE: [inlined] size=8MB, order=random, cmp=inline, test=0, time=0.085023 NOTICE: [inlined] size=8MB, order=random, cmp=inline, test=1, time=0.086370 NOTICE: [inlined] size=8MB, order=random, cmp=inline, test=2, time=0.084781 NOTICE: [branchless] size=8MB, order=random, cmp=branchless, test=0, time=0.091837 NOTICE: [branchless] size=8MB, order=random, cmp=branchless, test=1, time=0.090426 NOTICE: [branchless] size=8MB, order=random, cmp=branchless, test=2, time=0.091245 NOTICE: [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=0, time=0.257024 NOTICE: [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=1, time=0.256487 NOTICE: [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=2, time=0.256708 NOTICE: [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=0, time=0.192063 NOTICE: [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=1, time=0.191919 NOTICE: [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=2, time=0.192018 NOTICE: [SQL arg reverse] size=8MB,
Re: A qsort template
On Tue, Jan 18, 2022 at 9:58 PM Peter Geoghegan wrote: > > On Tue, Jan 18, 2022 at 6:39 PM John Naylor > wrote: > > Editorializing the null position in queries is not very common in my > > experience. Not null is interesting since it'd be trivial to pass > > constant false to the same Apply[XYZ]SortComparator() and let the > > compiler remove all those branches for us. On the other hand, those > > branches would be otherwise predicted well, so it might make little or > > no difference. > > If you were going to do this, maybe you could encode NULL directly in > an abbreviated key. I think that that could be made to work if it was > limited to opclasses with abbreviated keys encoded as unsigned > integers. Just a thought. Now that you mention that, I do remember reading about this technique in the context of b-tree access, so it does make sense. If we had that capability, it would be trivial to order the nulls how we want while building the sort tuple datums, and the not-null case would be handled automatically. And have a smaller code footprint, I think. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Tue, Jan 18, 2022 at 6:39 PM John Naylor wrote: > Editorializing the null position in queries is not very common in my > experience. Not null is interesting since it'd be trivial to pass > constant false to the same Apply[XYZ]SortComparator() and let the > compiler remove all those branches for us. On the other hand, those > branches would be otherwise predicted well, so it might make little or > no difference. If you were going to do this, maybe you could encode NULL directly in an abbreviated key. I think that that could be made to work if it was limited to opclasses with abbreviated keys encoded as unsigned integers. Just a thought. -- Peter Geoghegan
Re: A qsort template
I wrote: > That said, I think what I'll do next is test the v3-0001 standalone > patch with tuplesort specializations for more data types. I already > have a decent test script that I can build on for this. I've run a test with 10 million records using all types found in the v3 patch "accelerate tuple sorting for common types", using a variety of initial orderings, covering index build (btree only, no gist) and queries (both single value and whole record). Attached is the test script and a spreadsheet with the raw data as well as comparisons of the min runtimes in seconds from 5 runs. This is using gcc 11.1 on fairly recent Intel hardware. Overall, this shows a good improvement for these types. One exception is the "0/1" ordering, which is two values in random order. I'm guessing it's because of the cardinality detector, but some runs have apparent possible regressions. It's a bit high and sporadic to just blow off as noise, but this case might not be telling us anything useful. Other notes: - The inet type seems unnaturally fast in some places, meaning faster than int or date. That's suspicous, but I haven't yet dug deeper into that. - With the patch, the VM binary size increases by ~9kB. I have some hunches on the "future research" comments: XXX Can we avoid repeating the null-handling logic? More templating? ;-) XXX Is it worth specializing for reverse sort? I'll run a limited test on DESC to see if anything stands out, but I wonder if the use case is not common -- I seem to remember seeing DESC less often on the first sort key column. XXX Is it worth specializing for nulls first, nulls last, not null? Editorializing the null position in queries is not very common in my experience. Not null is interesting since it'd be trivial to pass constant false to the same Apply[XYZ]SortComparator() and let the compiler remove all those branches for us. On the other hand, those branches would be otherwise predicted well, so it might make little or no difference. XXX Should we have separate cases for "result is authoritative", "need XXX tiebreaker for atts 1..n (= abbrev case)", "need tie breaker for XXX atts 2..n"? The first one seems to be the only case where the SortTuple could be smaller, since the tuple pointer is null. That sounds like a good avenue to explore. Less memory usage is always good. Not sure what you mean by the third case -- there are 2+ sort keys, but the first is authoritative from the datum, so the full comparison can skip the first key? -- John Naylor EDB: http://www.enterprisedb.com qsort-specialize-types-jcn1.ods Description: application/vnd.oasis.opendocument.spreadsheet test-tuplesort-20220113.ods Description: application/vnd.oasis.opendocument.spreadsheet
Re: A qsort template
On Mon, Jun 28, 2021 at 8:16 PM Thomas Munro wrote: [v4 patchset] Hi Thomas, (Sorry for the delay -- I have some time to put into this now.) > The lower numbered patches are all things that are reused in many > places, and in my humble opinion improve the notation and type safety > and code deduplication generally when working with common types I think 0001-0003 have had enough review previously to commit them, as they are mostly notational. There's a small amount of bitrot, but not enough to change the conclusions any. Also 0011 with the missing #undef. On Thu, Aug 5, 2021 at 7:18 PM Peter Geoghegan wrote: > > If somebody wants to get a sense of what the size hit is from all of > these specializations, I can recommend the diff feature of bloaty: > > https://github.com/google/bloaty/blob/master/doc/using.md#size-diffs > > Obviously you'd approach this by building postgres without the patch, > and diffing that baseline to postgres with the patch. And possibly > variations of the patch, with less or more sort specializations. Thanks, that's a neat feature! For 0001-0003, the diff shows +700 bytes in memory, so pretty small: $ bloaty -s vm src/backend/postgres -- src/backend/postgres.orig FILE SIZEVM SIZE -- -- +0.0%+608 +0.0%+608.text +0.0% +64 +0.0% +64.eh_frame +0.0% +24 +0.0% +24.dynsym +0.0% +14 +0.0% +14.dynstr +0.0% +2 +0.0% +2.gnu.version +0.0% +58 [ = ] 0.debug_abbrev +0.1% +48 [ = ] 0.debug_aranges +0.0% +1.65Ki [ = ] 0.debug_info +0.0%+942 [ = ] 0.debug_line +0.1% +26 [ = ] 0.debug_line_str +0.0%+333 [ = ] 0.debug_loclists -0.0% -23 [ = ] 0.debug_rnglists +0.0% +73 [ = ] 0.debug_str -1.0% -4 [ = ] 0.shstrtab +0.0% +20 [ = ] 0.strtab +0.0% +24 [ = ] 0.symtab +131% +3.30Ki [ = ] 0[Unmapped] +0.0% +7.11Ki +0.0%+712TOTAL [back to Thomas] > I tried to measure a speedup in vacuum, but so far I have not. I did > learn some things though: While doing that with an uncorrelated index > and a lot of deleted tuples, I found that adding more > maintenance_work_mem doesn't help beyond a few MB, because then cache > misses dominate to the point where it's not better than doing multiple > passes (and this is familiar to me from work on hash joins). If I > turned on huge pages on Linux and set min_dynamic_shared_memory so > that the parallel DSM used by vacuum lives in huge pages, then > parallel vacuum with a large maintenance_work_mem starts to do much > better than non-parallel vacuum by improving the TLB misses (as with > hash joins). I thought that was quite interesting! Perhaps > bsearch_itemptr might help with correlated indexes with a lot of > deleted indexes (so not dominated by cache misses), though? > > (I wouldn't be suprised if someone comes up with a much better idea > than bsearch for that anyway... a few ideas have been suggested.) That's interesting about the (un)correlated index having such a large effect on cache hit rate! By now there has been some discussion and a benchmark for dead tuple storage [1]. bit there doesn't seem to be recent activity on that thread. We might consider adding the ItemPtr comparator work I did in [2] for v15 if we don't have any of the other proposals in place by feature freeze. My concern there is the speedups I observed were observed when the values were comfortably in L2 cache, IIRC. That would need wider testing. That said, I think what I'll do next is test the v3-0001 standalone patch with tuplesort specializations for more data types. I already have a decent test script that I can build on for this. (this is the one currently in CI) Then, I want to do at least preliminary testing of the qsort boundary parameters. Those two things should be doable for this commitfest. [1] https://www.postgresql.org/message-id/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFBsxsG_c24CHKA3cWrOP1HynWGLOkLb8hyZfsD9db5g-ZPagA%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Sun, Aug 1, 2021 at 5:41 PM Thomas Munro wrote: > On Fri, Jul 30, 2021 at 12:34 PM John Naylor > wrote: > > I got around to getting a benchmark together to serve as a starting point. > > I based it off something I got from the archives, but don't remember where > > (I seem to remember Tomas Vondra wrote the original, but not sure). To > > start I just used types that were there already -- int, text, numeric. The > > latter two won't be helped by this patch, but I wanted to keep something > > like that so we can see what kind of noise variation there is. I'll > > probably cut text out in the future and just keep numeric for that purpose. > > Thanks, that's very useful. If somebody wants to get a sense of what the size hit is from all of these specializations, I can recommend the diff feature of bloaty: https://github.com/google/bloaty/blob/master/doc/using.md#size-diffs Obviously you'd approach this by building postgres without the patch, and diffing that baseline to postgres with the patch. And possibly variations of the patch, with less or more sort specializations. -- Peter Geoghegan
Re: A qsort template
On Mon, Aug 2, 2021 at 12:40 PM Thomas Munro wrote: > Great! I saw similar sorts of numbers. It's really just a few > crumbs, nothing compared to the gains David just found over in the > thread "Use generation context to speed up tuplesorts", but on the > bright side, these crumbs will be magnified by that work. (Hmm, that also makes me wonder about using a smaller SortTuple when possible...)
Re: A qsort template
On Fri, Jul 30, 2021 at 12:34 PM John Naylor wrote: > I got around to getting a benchmark together to serve as a starting point. I > based it off something I got from the archives, but don't remember where (I > seem to remember Tomas Vondra wrote the original, but not sure). To start I > just used types that were there already -- int, text, numeric. The latter two > won't be helped by this patch, but I wanted to keep something like that so we > can see what kind of noise variation there is. I'll probably cut text out in > the future and just keep numeric for that purpose. Thanks, that's very useful. > I've attached both the script and a crude spreadsheet. I'll try to figure out > something nicer for future tests, and maybe some graphs. The "comparison" > sheet has the results side by side (min of five). There are 6 distributions > of values: > - random > - sorted > - "almost sorted" > - reversed > - organ pipe (first half ascending, second half descending) > - rotated (sorted but then put the smallest at the end) > - random 0s/1s > > I included both "select a" and "select *" to make sure we have the recent > datum sort optimization represented. The results look pretty good for ints -- > about the same speed up master gets going from tuple sorts to datum sorts, > and those got faster in turn also. Great! I saw similar sorts of numbers. It's really just a few crumbs, nothing compared to the gains David just found over in the thread "Use generation context to speed up tuplesorts", but on the bright side, these crumbs will be magnified by that work. > Next I think I'll run microbenchmarks on int64s with the test harness you > attached earlier, and experiment with the qsort parameters a bit. Cool. I haven't had much luck experimenting with that yet, though I consider the promotion from magic numbers to names as an improvement in any case. > I'm also attaching your tuplesort patch so others can see what exactly I'm > comparing. We've been bouncing around quite a few different ideas and patches in this thread; soon I'll try to bring it back to one patch set with the ideas that are looking good so far in a more tidied up form. For the tupesort.c part, I added some TODO notes in v3-0001-WIP-Accelerate-tuple-sorting-for-common-types.patch's commit message (see reply to Peter).
Re: A qsort template
On Fri, Jul 30, 2021 at 7:11 PM Peter Geoghegan wrote: > If you're going to specialize the sort routine for unsigned integer > style abbreviated keys then you might as well cover all relevant > opclasses/types. Almost all abbreviated key schemes produce > conditioned datums that are designed to use simple 3-way unsigned int > comparator. It's not just text. (Actually, the only abbreviated key > scheme that doesn't do it that way is numeric.) Right, that was the plan, but this was just experimenting with an idea. Looks like John's also seeing evidence that it may be worth pursuing. (Re numeric, I guess it must be possible to rearrange things so it can use ssup_datum_signed_cmp; maybe something like NaN -> INT64_MAX, +inf -> INT64_MAX - 1, -inf -> INT64_MIN, and then -1 - (whatever we're doing now for normal values).) > Offhand I know that UUID, macaddr, and inet all have abbreviated keys > that can use your new ssup_datum_binary_cmp() comparator instead of > their own duplicated comparator (which will make them use the > corresponding specialized sort routine inside tuplesort.c). Thanks, I've added these ones, and also gist_bbox_zorder_cmp_abbrev. I also renamed that function to ssup_datum_unsigned_cmp(), because "binary" was misleading. From a1ca8e9ca989de5f61f1b8f067cb9785034ce9cc Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 3 Jul 2021 19:02:10 +1200 Subject: [PATCH v3] WIP: Accelerate tuple sorting for common types. Several data types have their own fast comparator functions that can be replaced with common binary or signed comparator functions. These functions can then be recognized by tuplesort.c and used to dispatch to corresponding specialized sort functions, to accelerate sorting. XXX WIP, experiment grade only, many open questions... such as: XXX Can we avoid repeating the null-handling logic? XXX Is it worth specializing for reverse sort? XXX Is it worth specializing for nulls first, nulls last, not null? XXX Should we have separate cases for "result is authoritative", "need XXX tiebreaker for atts 1..n (= abbrev case)", "need tie breaker for XXX atts 2..n"? XXX If we did all of the above, that'd give: XXX {nulls_first, nulls_last, not_null} x XXX {asc, desc} x XXX {tiebreak_none, tiebreak_first, tiebreak_rest} x XXX {unsigned, signed, int32} XXX = 54 specializations! Seems a little over the top. XXX You could use a smaller SortTuple for some cases. XXX How should we handle numeric? --- src/backend/access/gist/gistproc.c | 18 +-- src/backend/access/nbtree/nbtcompare.c | 22 ++-- src/backend/utils/adt/date.c | 15 +-- src/backend/utils/adt/mac.c| 23 +--- src/backend/utils/adt/network.c| 17 +-- src/backend/utils/adt/timestamp.c | 11 ++ src/backend/utils/adt/uuid.c | 25 +--- src/backend/utils/adt/varlena.c| 34 +- src/backend/utils/sort/tuplesort.c | 160 - src/include/utils/sortsupport.h| 115 ++ 10 files changed, 308 insertions(+), 132 deletions(-) diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c index d474612b77..20135ea0ec 100644 --- a/src/backend/access/gist/gistproc.c +++ b/src/backend/access/gist/gistproc.c @@ -37,7 +37,6 @@ static uint64 part_bits32_by2(uint32 x); static uint32 ieee_float32_to_uint32(float f); static int gist_bbox_zorder_cmp(Datum a, Datum b, SortSupport ssup); static Datum gist_bbox_zorder_abbrev_convert(Datum original, SortSupport ssup); -static int gist_bbox_zorder_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup); static bool gist_bbox_zorder_abbrev_abort(int memtupcount, SortSupport ssup); @@ -1726,21 +1725,6 @@ gist_bbox_zorder_abbrev_convert(Datum original, SortSupport ssup) #endif } -static int -gist_bbox_zorder_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup) -{ - /* - * Compare the pre-computed Z-orders as unsigned integers. Datum is a - * typedef for 'uintptr_t', so no casting is required. - */ - if (z1 > z2) - return 1; - else if (z1 < z2) - return -1; - else - return 0; -} - /* * We never consider aborting the abbreviation. * @@ -1764,7 +1748,7 @@ gist_point_sortsupport(PG_FUNCTION_ARGS) if (ssup->abbreviate) { - ssup->comparator = gist_bbox_zorder_cmp_abbrev; + ssup->comparator = ssup_datum_unsigned_cmp; ssup->abbrev_converter = gist_bbox_zorder_abbrev_convert; ssup->abbrev_abort = gist_bbox_zorder_abbrev_abort; ssup->abbrev_full_comparator = gist_bbox_zorder_cmp; diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c index 7ac73cb8c2..204cf778fb 100644 --- a/src/backend/access/nbtree/nbtcompare.c +++ b/src/backend/access/nbtree/nbtcompare.c @@ -119,26 +119,12 @@ btint4cmp(PG_FUNCTION_ARGS) PG_RETURN_INT32(A_LESS_THAN_B); } -static int -btint4fastcmp(Datum x, Datum y, SortSupport ssup) -{ - int32 a = DatumGetInt32(x); - int32 b = DatumGetInt32(y); - - if (a > b) - return A_GREATER_THAN_B;
Re: A qsort template
On Fri, Jul 30, 2021 at 7:47 AM Ranier Vilela wrote: > The patch attached does not apply cleanly, > please can fix it? It applies just fine with "patch", for those wondering. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
Em qui., 29 de jul. de 2021 Ã s 21:34, John Naylor < john.nay...@enterprisedb.com> escreveu: > > On Sun, Jul 4, 2021 at 12:27 AM Thomas Munro > wrote: > > > > Since you are experimenting with tuplesort and likely thinking similar > > thoughts, here's a patch I've been using to explore that area. I've > > seen it get, for example, ~1.18x speedup for simple index builds in > > favourable winds (YMMV, early hacking results only). Currently, it > > kicks in when the leading column is of type int4, int8, timestamp, > > timestamptz, date or text + friends (when abbreviatable, currently > > that means "C" and ICU collations only), while increasing the > > executable by only 8.5kB (Clang, amd64, -O2, no debug). > > > > These types are handled with just three specialisations. Their custom > > "fast" comparators all boiled down to comparisons of datum bits, > > varying only in signedness and width, so I tried throwing them away > > and using 3 new common routines. Then I extended > > tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to > > recognise qualifying users of those and select 3 corresponding sort > > specialisations. > > I got around to getting a benchmark together to serve as a starting point. > I based it off something I got from the archives, but don't remember where > (I seem to remember Tomas Vondra wrote the original, but not sure). To > start I just used types that were there already -- int, text, numeric. The > latter two won't be helped by this patch, but I wanted to keep something > like that so we can see what kind of noise variation there is. I'll > probably cut text out in the future and just keep numeric for that purpose. > > I've attached both the script and a crude spreadsheet. I'll try to figure > out something nicer for future tests, and maybe some graphs. The > "comparison" sheet has the results side by side (min of five). There are 6 > distributions of values: > - random > - sorted > - "almost sorted" > - reversed > - organ pipe (first half ascending, second half descending) > - rotated (sorted but then put the smallest at the end) > - random 0s/1s > > I included both "select a" and "select *" to make sure we have the recent > datum sort optimization represented. The results look pretty good for ints > -- about the same speed up master gets going from tuple sorts to datum > sorts, and those got faster in turn also. > > Next I think I'll run microbenchmarks on int64s with the test harness you > attached earlier, and experiment with the qsort parameters a bit. > > I'm also attaching your tuplesort patch so others can see what exactly I'm > comparing. > The patch attached does not apply cleanly, please can fix it? error: patch failed: src/backend/utils/sort/tuplesort.c:4776 error: src/backend/utils/sort/tuplesort.c: patch does not apply regards, Ranier Vilela
Re: A qsort template
On Fri, Jul 30, 2021 at 3:34 AM John Naylor wrote: > I'm also attaching your tuplesort patch so others can see what exactly I'm > comparing. If you're going to specialize the sort routine for unsigned integer style abbreviated keys then you might as well cover all relevant opclasses/types. Almost all abbreviated key schemes produce conditioned datums that are designed to use simple 3-way unsigned int comparator. It's not just text. (Actually, the only abbreviated key scheme that doesn't do it that way is numeric.) Offhand I know that UUID, macaddr, and inet all have abbreviated keys that can use your new ssup_datum_binary_cmp() comparator instead of their own duplicated comparator (which will make them use the corresponding specialized sort routine inside tuplesort.c). -- Peter Geoghegan
Re: A qsort template
On Sun, Jul 4, 2021 at 12:27 AM Thomas Munro wrote: > > Since you are experimenting with tuplesort and likely thinking similar > thoughts, here's a patch I've been using to explore that area. I've > seen it get, for example, ~1.18x speedup for simple index builds in > favourable winds (YMMV, early hacking results only). Currently, it > kicks in when the leading column is of type int4, int8, timestamp, > timestamptz, date or text + friends (when abbreviatable, currently > that means "C" and ICU collations only), while increasing the > executable by only 8.5kB (Clang, amd64, -O2, no debug). > > These types are handled with just three specialisations. Their custom > "fast" comparators all boiled down to comparisons of datum bits, > varying only in signedness and width, so I tried throwing them away > and using 3 new common routines. Then I extended > tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to > recognise qualifying users of those and select 3 corresponding sort > specialisations. I got around to getting a benchmark together to serve as a starting point. I based it off something I got from the archives, but don't remember where (I seem to remember Tomas Vondra wrote the original, but not sure). To start I just used types that were there already -- int, text, numeric. The latter two won't be helped by this patch, but I wanted to keep something like that so we can see what kind of noise variation there is. I'll probably cut text out in the future and just keep numeric for that purpose. I've attached both the script and a crude spreadsheet. I'll try to figure out something nicer for future tests, and maybe some graphs. The "comparison" sheet has the results side by side (min of five). There are 6 distributions of values: - random - sorted - "almost sorted" - reversed - organ pipe (first half ascending, second half descending) - rotated (sorted but then put the smallest at the end) - random 0s/1s I included both "select a" and "select *" to make sure we have the recent datum sort optimization represented. The results look pretty good for ints -- about the same speed up master gets going from tuple sorts to datum sorts, and those got faster in turn also. Next I think I'll run microbenchmarks on int64s with the test harness you attached earlier, and experiment with the qsort parameters a bit. I'm also attaching your tuplesort patch so others can see what exactly I'm comparing. -- John Naylor EDB: http://www.enterprisedb.com set -e ROWS=$1 WORK_MEM=$2 function log { echo `date +%s` [`date +'%Y-%m-%d %H:%M:%S'`] $1 } function create_tables { ./inst/bin/psql > /dev/null < /dev/null < /dev/null < /dev/null < 16384 AND relkind = 'r'; EOF } # load test tables function load_random { truncate_tables log "loading random" ./inst/bin/psql > /dev/null < /dev/null < /dev/null < /dev/null < /dev/null < /dev/null < /dev/null < 0.5 then '' else '' end FROM data_int; INSERT INTO num_test SELECT round(a), md5(a::text) FROM data_int; EOF prewarm vacuum_analyze } # run function run_query { times="" group=$1 wmem=$2 workers=$3 query=$4 echo "--" `date` [`date +%s`] >> explains.log echo "-- $group rows=$ROWS work_mem=$wmem workers=$workers" >> explains.log ./inst/bin/psql >> explains.log 2>&1 < /dev/null <> results.csv } function run_index { times="" group=$1 wmem=$2 workers=$3 query=$4 times="" s=`date +%s` for i in `seq 1 5`; do /usr/bin/time -f '%e' -o 'query.time' ./inst/bin/psql > /dev/null <> results.csv } function run_queries { # for wm in '1MB' '8MB' '32MB' '128MB' '512MB' '1GB'; do for wm in $WORK_MEM; do log "running queries work_mem=$wm noparallel" run_query $1 $wm 0 'SELECT a FROM int_test ORDER BY 1 OFFSET '$ROWS'' run_query $1 $wm 0 'SELECT * FROM int_test ORDER BY 1 OFFSET '$ROWS'' # run_query $1 $wm 0 'SELECT a FROM int_test ORDER BY 1 DESC OFFSET '$ROWS'' # run_query $1 $wm 0 'SELECT * FROM int_test ORDER BY 1 DESC OFFSET '$ROWS'' run_index $1 $wm 0 'CREATE INDEX x ON int_test (a); DROP INDEX x' run_query $1 $wm 0 'SELECT a FROM txt_test ORDER BY 1 OFFSET '$ROWS'' run_query $1 $wm 0 'SELECT * FROM txt_test ORDER BY 1 OFFSET '$ROWS'' # run_query $1 $wm 0 'SELECT a FROM txt_test ORDER BY 1 DESC OFFSET '$ROWS'' # run_query $1 $wm 0 'SELECT * FROM txt_test ORDER BY 1 DESC OFFSET '$ROWS'' run_index $1 $wm 0 'CREATE INDEX x ON txt_test (a); DROP INDEX x' run_query $1 $wm 0 'SELECT * FROM num_test ORDER BY 1 OFFSET '$ROWS'' # run_query $1 $wm 0 'SELECT * FROM num_test ORDER BY 1 DESC OFFSET '$ROWS'' run_index $1 $wm 0 'CREATE INDEX x ON num_test (a); DROP INDEX x' done } create_tables log "sort benchmark $ROWS" load_random run_queries "random" load_sorted run_queries "sorted" load_almost_sorted run_queries "almost_sorted" load_reversed
Re: A qsort template
On Thu, Jun 17, 2021 at 1:20 PM Thomas Munro wrote: > On Thu, Jun 17, 2021 at 1:14 PM Tom Lane wrote: > > The big problem in my mind, which would not be alleviated in the > > slightest by having a separate file, is that it'd be easy to miss > > removing entries if they ever become obsolete. > > I suppose you could invent some kind of declaration syntax in a > comment near the use of the pseudo-typename in the source tree that is > mechanically extracted. What do you think about something like this? From 8252eb18c19c8a78fa6326ae5de8261d7d793dda Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 22 Jul 2021 19:05:15 +1200 Subject: [PATCH] Teach pgindent about special file-local typenames. Allow source files to declare an extra typename with a special annotation of the form: @pgindent typename XXX@ Use these to fix some whitespace problems in simplehash.h and sort_template.h. --- src/include/lib/simplehash.h| 110 +--- src/include/lib/sort_template.h | 21 +++--- src/tools/pgindent/pgindent | 11 +++- 3 files changed, 79 insertions(+), 63 deletions(-) diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h index da51781e98..3e44c7cc03 100644 --- a/src/include/lib/simplehash.h +++ b/src/include/lib/simplehash.h @@ -86,6 +86,12 @@ * presence is relevant to determine whether a lookup needs to continue * looking or is done - buckets following a deleted element are shifted * backwards, unless they're empty or already at their optimal position. + * + * Help pgindent understand our pseudo-typenames: + * + * @pgindent typename SH_ELEMENT_TYPE@ + * @pgindent typename SH_TYPE@ + * @pgindent typename SH_ITERATOR@ */ #include "port/pg_bitutils.h" @@ -164,7 +170,7 @@ typedef struct SH_TYPE /* user defined data, useful for callbacks */ void *private_data; -} SH_TYPE; +} SH_TYPE; typedef enum SH_STATUS { @@ -177,67 +183,67 @@ typedef struct SH_ITERATOR uint32 cur;/* current element */ uint32 end; booldone; /* iterator exhausted? */ -} SH_ITERATOR; +} SH_ITERATOR; /* externally visible function prototypes */ #ifdef SH_RAW_ALLOCATOR /* _hash _create(uint32 nelements, void *private_data) */ -SH_SCOPE SH_TYPE *SH_CREATE(uint32 nelements, void *private_data); +SH_SCOPE SH_TYPE *SH_CREATE(uint32 nelements, void *private_data); #else /* * _hash _create(MemoryContext ctx, uint32 nelements, * void *private_data) */ -SH_SCOPE SH_TYPE *SH_CREATE(MemoryContext ctx, uint32 nelements, - void *private_data); +SH_SCOPE SH_TYPE *SH_CREATE(MemoryContext ctx, uint32 nelements, + void *private_data); #endif /* void _destroy(_hash *tb) */ -SH_SCOPE void SH_DESTROY(SH_TYPE * tb); +SH_SCOPE void SH_DESTROY(SH_TYPE *tb); /* void _reset(_hash *tb) */ -SH_SCOPE void SH_RESET(SH_TYPE * tb); +SH_SCOPE void SH_RESET(SH_TYPE *tb); /* void _grow(_hash *tb) */ -SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize); +SH_SCOPE void SH_GROW(SH_TYPE *tb, uint32 newsize); /* *_insert(_hash *tb, key, bool *found) */ -SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found); +SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE *tb, SH_KEY_TYPE key, bool *found); /* * *_insert_hash(_hash *tb, key, uint32 hash, * bool *found) */ -SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key, - uint32 hash, bool *found); +SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT_HASH(SH_TYPE *tb, SH_KEY_TYPE key, + uint32 hash, bool *found); /* *_lookup(_hash *tb, key) */ -SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key); +SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP(SH_TYPE *tb, SH_KEY_TYPE key); /* *_lookup_hash(_hash *tb, key, uint32 hash) */ -SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP_HASH(SH_TYPE * tb, SH_KEY_TYPE key, - uint32 hash); +SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP_HASH(SH_TYPE *tb, SH_KEY_TYPE key, + uint32 hash); /* void _delete_item(_hash *tb, *entry) */ -SH_SCOPE void SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry); +SH_SCOPE void SH_DELETE_ITEM(SH_TYPE *tb, SH_ELEMENT_TYPE *entry); /* bool _delete(_hash *tb, key) */ -SH_SCOPE bool SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key); +SH_SCOPE bool SH_DELETE(SH_TYPE *tb,
Re: A qsort template
On Thu, Jul 15, 2021 at 7:50 AM vignesh C wrote: > The patch does not apply on Head anymore, could you rebase and post a > patch. I'm changing the status to "Waiting for Author". The patch set is fine. The error is my fault since I attached an experimental addendum and neglected to name it as .txt. I've set it back to "needs review" and will resume testing shortly. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Sun, Jul 4, 2021 at 9:58 AM Thomas Munro wrote: > > On Fri, Jul 2, 2021 at 2:32 PM John Naylor > wrote: > > I suspect if we experiment on two extremes of type "heaviness" (accessing > > and comparing trivial or not), such as uint32 and tuplesort, we'll have a > > pretty good idea what the parameters should be, if anything different. I'll > > do some testing along those lines. > > Cool. > > Since you are experimenting with tuplesort and likely thinking similar > thoughts, here's a patch I've been using to explore that area. I've > seen it get, for example, ~1.18x speedup for simple index builds in > favourable winds (YMMV, early hacking results only). Currently, it > kicks in when the leading column is of type int4, int8, timestamp, > timestamptz, date or text + friends (when abbreviatable, currently > that means "C" and ICU collations only), while increasing the > executable by only 8.5kB (Clang, amd64, -O2, no debug). > > These types are handled with just three specialisations. Their custom > "fast" comparators all boiled down to comparisons of datum bits, > varying only in signedness and width, so I tried throwing them away > and using 3 new common routines. Then I extended > tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to > recognise qualifying users of those and select 3 corresponding sort > specialisations. > > It might turn out to be worth burning some more executable size on > extra variants (for example, see XXX notes in the code comments for > opportunities; one could also go nuts trying smaller things like > special cases for not-null, nulls first, reverse sort, ... to kill all > those branches), or not. The patch does not apply on Head anymore, could you rebase and post a patch. I'm changing the status to "Waiting for Author". Regards, Vignesh
Re: A qsort template
On Fri, Jul 2, 2021 at 2:32 PM John Naylor wrote: > I suspect if we experiment on two extremes of type "heaviness" (accessing and > comparing trivial or not), such as uint32 and tuplesort, we'll have a pretty > good idea what the parameters should be, if anything different. I'll do some > testing along those lines. Cool. Since you are experimenting with tuplesort and likely thinking similar thoughts, here's a patch I've been using to explore that area. I've seen it get, for example, ~1.18x speedup for simple index builds in favourable winds (YMMV, early hacking results only). Currently, it kicks in when the leading column is of type int4, int8, timestamp, timestamptz, date or text + friends (when abbreviatable, currently that means "C" and ICU collations only), while increasing the executable by only 8.5kB (Clang, amd64, -O2, no debug). These types are handled with just three specialisations. Their custom "fast" comparators all boiled down to comparisons of datum bits, varying only in signedness and width, so I tried throwing them away and using 3 new common routines. Then I extended tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to recognise qualifying users of those and select 3 corresponding sort specialisations. It might turn out to be worth burning some more executable size on extra variants (for example, see XXX notes in the code comments for opportunities; one could also go nuts trying smaller things like special cases for not-null, nulls first, reverse sort, ... to kill all those branches), or not. From 62a8acbc745f3b4a90c9a14b6b61989a9d83bece Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 3 Jul 2021 19:02:10 +1200 Subject: [PATCH] WIP: Accelerate tuple sorting for common types. Several data types have their own fast comparator functions that can be replaced with common binary or signed comparator functions. These functions can then be recognized by tuplesort.c and used to dispatch to corresponding specialized sort functions, to accelerate sorting. XXX WIP, experiment grade only, many open questions... --- src/backend/access/nbtree/nbtcompare.c | 22 ++-- src/backend/utils/adt/date.c | 15 +-- src/backend/utils/adt/timestamp.c | 11 ++ src/backend/utils/adt/varlena.c| 26 +--- src/backend/utils/sort/tuplesort.c | 161 - src/include/utils/sortsupport.h| 115 ++ 6 files changed, 295 insertions(+), 55 deletions(-) diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c index 7ac73cb8c2..204cf778fb 100644 --- a/src/backend/access/nbtree/nbtcompare.c +++ b/src/backend/access/nbtree/nbtcompare.c @@ -119,26 +119,12 @@ btint4cmp(PG_FUNCTION_ARGS) PG_RETURN_INT32(A_LESS_THAN_B); } -static int -btint4fastcmp(Datum x, Datum y, SortSupport ssup) -{ - int32 a = DatumGetInt32(x); - int32 b = DatumGetInt32(y); - - if (a > b) - return A_GREATER_THAN_B; - else if (a == b) - return 0; - else - return A_LESS_THAN_B; -} - Datum btint4sortsupport(PG_FUNCTION_ARGS) { SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); - ssup->comparator = btint4fastcmp; + ssup->comparator = ssup_datum_int32_cmp; PG_RETURN_VOID(); } @@ -156,6 +142,7 @@ btint8cmp(PG_FUNCTION_ARGS) PG_RETURN_INT32(A_LESS_THAN_B); } +#ifndef USE_FLOAT8_BYVAL static int btint8fastcmp(Datum x, Datum y, SortSupport ssup) { @@ -169,13 +156,18 @@ btint8fastcmp(Datum x, Datum y, SortSupport ssup) else return A_LESS_THAN_B; } +#endif Datum btint8sortsupport(PG_FUNCTION_ARGS) { SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); +#ifdef USE_FLOAT8_BYVAL + ssup->comparator = ssup_datum_signed_cmp; +#else ssup->comparator = btint8fastcmp; +#endif PG_RETURN_VOID(); } diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c index 0bea16cb67..350c662d50 100644 --- a/src/backend/utils/adt/date.c +++ b/src/backend/utils/adt/date.c @@ -438,25 +438,12 @@ date_cmp(PG_FUNCTION_ARGS) PG_RETURN_INT32(0); } -static int -date_fastcmp(Datum x, Datum y, SortSupport ssup) -{ - DateADT a = DatumGetDateADT(x); - DateADT b = DatumGetDateADT(y); - - if (a < b) - return -1; - else if (a > b) - return 1; - return 0; -} - Datum date_sortsupport(PG_FUNCTION_ARGS) { SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); - ssup->comparator = date_fastcmp; + ssup->comparator = ssup_datum_int32_cmp; PG_RETURN_VOID(); } diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index 79761f809c..c678517db6 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -37,6 +37,7 @@ #include "utils/datetime.h"
Re: A qsort template
On Thu, Jul 1, 2021 at 6:10 PM Thomas Munro wrote: > One thing I'm wondering about is whether it's worth having stuff to > support future experimentation like ST_SORT_SMALL_THRESHOLD and > ST_COMPARE_RET_TYPE in the tree, or whether we should pare it back to > the minimal changes that definitely produce results. I think I'd like > to keep those changes: even if it may be some time, possibly an > infinite amount, before we figure out how to tune the thresholds > profitably, giving them names instead of using magic numbers seems > like progress. I suspect if we experiment on two extremes of type "heaviness" (accessing and comparing trivial or not), such as uint32 and tuplesort, we'll have a pretty good idea what the parameters should be, if anything different. I'll do some testing along those lines. (BTW, I just realized I lied and sent a .patch file after all, oops) -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Fri, Jul 2, 2021 at 4:39 AM John Naylor wrote: > For item pointers, it made sense to try doing math to reduce the number of > branches. That made things worse, so let's try the opposite: Increase the > number of branches so we do less math. In the attached patch (applies on top > of your 0012 and a .txt to avoid confusing the CF bot), I test a new > comparator with this approach, and also try a wider range of thresholds. The > thresholds don't seem to make any noticeable difference with this data type, > but the new comparator (cmp=ids below) gives a nice speedup in this test: > NOTICE: [traditional qsort] order=random, threshold=7, cmp=32, test=0, > time=4.964657 > NOTICE: order=random, threshold=7, cmp=std, test=0, time=2.810627 > NOTICE: order=random, threshold=7, cmp=ids, test=0, time=1.692711 Oooh. So, the awkwardness of the 64 maths with unaligned inputs (even though we obtain all inputs with 16 bit loads) was hurting, and you realised the same sort of thing might be happening also with the 32 bit version and went the other way. (It'd be nice to understand exactly why.) I tried your 16 bit comparison version on Intel, AMD and Apple CPUs and the results were all in the same ballpark. For random input, I see something like ~1.7x speedup over traditional qsort from specialising (cmp=std), and ~2.7x from going 16 bit (cmp=ids). For increasing and decreasing input, it's ~2x speedup from specialising and ~4x speedup from going 16 bit. Beautiful. One thing I'm wondering about is whether it's worth having stuff to support future experimentation like ST_SORT_SMALL_THRESHOLD and ST_COMPARE_RET_TYPE in the tree, or whether we should pare it back to the minimal changes that definitely produce results. I think I'd like to keep those changes: even if it may be some time, possibly an infinite amount, before we figure out how to tune the thresholds profitably, giving them names instead of using magic numbers seems like progress. The Alexandrescu talk was extremely entertaining, thanks.
Re: A qsort template
I wrote: > One of the points of the talk I linked to is "if doing the sensible thing makes things worse, try something silly instead". For item pointers, it made sense to try doing math to reduce the number of branches. That made things worse, so let's try the opposite: Increase the number of branches so we do less math. In the attached patch (applies on top of your 0012 and a .txt to avoid confusing the CF bot), I test a new comparator with this approach, and also try a wider range of thresholds. The thresholds don't seem to make any noticeable difference with this data type, but the new comparator (cmp=ids below) gives a nice speedup in this test: # SELECT test_sort_itemptr(); NOTICE: [traditional qsort] order=random, threshold=7, cmp=32, test=0, time=4.964657 NOTICE: [traditional qsort] order=random, threshold=7, cmp=32, test=1, time=5.185384 NOTICE: [traditional qsort] order=random, threshold=7, cmp=32, test=2, time=5.058179 NOTICE: order=random, threshold=7, cmp=std, test=0, time=2.810627 NOTICE: order=random, threshold=7, cmp=std, test=1, time=2.804940 NOTICE: order=random, threshold=7, cmp=std, test=2, time=2.800677 NOTICE: order=random, threshold=7, cmp=ids, test=0, time=1.692711 NOTICE: order=random, threshold=7, cmp=ids, test=1, time=1.694546 NOTICE: order=random, threshold=7, cmp=ids, test=2, time=1.692839 NOTICE: order=random, threshold=12, cmp=std, test=0, time=2.687033 NOTICE: order=random, threshold=12, cmp=std, test=1, time=2.681974 NOTICE: order=random, threshold=12, cmp=std, test=2, time=2.687833 NOTICE: order=random, threshold=12, cmp=ids, test=0, time=1.666418 NOTICE: order=random, threshold=12, cmp=ids, test=1, time=1.666188 NOTICE: order=random, threshold=12, cmp=ids, test=2, time=1.664176 NOTICE: order=random, threshold=16, cmp=std, test=0, time=2.574147 NOTICE: order=random, threshold=16, cmp=std, test=1, time=2.579981 NOTICE: order=random, threshold=16, cmp=std, test=2, time=2.572861 NOTICE: order=random, threshold=16, cmp=ids, test=0, time=1.699432 NOTICE: order=random, threshold=16, cmp=ids, test=1, time=1.703075 NOTICE: order=random, threshold=16, cmp=ids, test=2, time=1.697173 NOTICE: order=random, threshold=32, cmp=std, test=0, time=2.750040 NOTICE: order=random, threshold=32, cmp=std, test=1, time=2.744138 NOTICE: order=random, threshold=32, cmp=std, test=2, time=2.748026 NOTICE: order=random, threshold=32, cmp=ids, test=0, time=1.677414 NOTICE: order=random, threshold=32, cmp=ids, test=1, time=1.683792 NOTICE: order=random, threshold=32, cmp=ids, test=2, time=1.701309 NOTICE: [traditional qsort] order=increasing, threshold=7, cmp=32, test=0, time=2.543837 NOTICE: [traditional qsort] order=increasing, threshold=7, cmp=32, test=1, time=2.290497 NOTICE: [traditional qsort] order=increasing, threshold=7, cmp=32, test=2, time=2.262956 NOTICE: order=increasing, threshold=7, cmp=std, test=0, time=1.033052 NOTICE: order=increasing, threshold=7, cmp=std, test=1, time=1.032079 NOTICE: order=increasing, threshold=7, cmp=std, test=2, time=1.041836 NOTICE: order=increasing, threshold=7, cmp=ids, test=0, time=0.367355 NOTICE: order=increasing, threshold=7, cmp=ids, test=1, time=0.367428 NOTICE: order=increasing, threshold=7, cmp=ids, test=2, time=0.367384 NOTICE: order=increasing, threshold=12, cmp=std, test=0, time=1.004991 NOTICE: order=increasing, threshold=12, cmp=std, test=1, time=1.008045 NOTICE: order=increasing, threshold=12, cmp=std, test=2, time=1.010778 NOTICE: order=increasing, threshold=12, cmp=ids, test=0, time=0.370944 NOTICE: order=increasing, threshold=12, cmp=ids, test=1, time=0.368669 NOTICE: order=increasing, threshold=12, cmp=ids, test=2, time=0.370100 NOTICE: order=increasing, threshold=16, cmp=std, test=0, time=1.023682 NOTICE: order=increasing, threshold=16, cmp=std, test=1, time=1.025805 NOTICE: order=increasing, threshold=16, cmp=std, test=2, time=1.022005 NOTICE: order=increasing, threshold=16, cmp=ids, test=0, time=0.365398 NOTICE: order=increasing, threshold=16, cmp=ids, test=1, time=0.365586 NOTICE: order=increasing, threshold=16, cmp=ids, test=2, time=0.364807 NOTICE: order=increasing, threshold=32, cmp=std, test=0, time=0.950780 NOTICE: order=increasing, threshold=32, cmp=std, test=1, time=0.949920 NOTICE: order=increasing, threshold=32, cmp=std, test=2, time=0.953239 NOTICE: order=increasing, threshold=32, cmp=ids, test=0, time=0.367866 NOTICE: order=increasing, threshold=32, cmp=ids, test=1, time=0.372179 NOTICE: order=increasing, threshold=32, cmp=ids, test=2, time=0.371115 NOTICE: [traditional qsort] order=decreasing, threshold=7, cmp=32, test=0, time=2.317475 NOTICE: [traditional qsort] order=decreasing, threshold=7, cmp=32, test=1, time=2.323446 NOTICE: [traditional qsort] order=decreasing, threshold=7, cmp=32, test=2, time=2.326714 NOTICE: order=decreasing, threshold=7, cmp=std, test=0, time=1.022270 NOTICE: order=decreasing, threshold=7, cmp=std, test=1, time=1.015133 NOTICE:
Re: A qsort template
On Tue, Jun 29, 2021 at 2:56 AM Thomas Munro wrote: > Here's a version that includes a rather hackish test module that you > might find useful to explore various weird effects. Testing sorting > routines is really hard, of course... there's a zillion parameters and > things you could do in the data and cache effects etc etc. One of the That module is incredibly useful! Yeah, while brushing up on recent findings on sorting, it's clear there's a huge amount of options with different tradeoffs. I did see your tweet last year about the "small sort" threshold that was tested on a VAX machine, but hadn't given it any thought til now. Looking around, I've seen quite a range, always with the caveat of "it depends". A couple interesting variations: Golang uses 12, with an extra tweak: // Do ShellSort pass with gap 6 // It could be written in this simplified form cause b-a <= 12 for i := a + 6; i < b; i++ { if data.Less(i, i-6) { data.Swap(i, i-6) } } insertionSort(data, a, b) Andrei Alexandrescu gave a couple talks discussing the small-sort part of quicksort, and demonstrated a ruthlessly-optimized make-heap + unguarded-insertion-sort, using a threshold of 256. He reported a 6% speed-up sorting a million doubles, IIRC: video: https://www.youtube.com/watch?v=FJJTYQYB1JQ slides: https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf That might not be workable for us, but it's a fun talk. > main things that jumps out pretty clearly though with these simple > tests is that sorting 6 byte ItemPointerData objects is *really slow* > compared to more natural object sizes (look at the times and the > MEMORY values in the scripts). Another is that specialised sort > functions are much faster than traditional qsort (being one of the > goals of this exercise). Sadly, the 64 bit comparison technique is > not looking too good in the output of this test. One of the points of the talk I linked to is "if doing the sensible thing makes things worse, try something silly instead". Anyway, I'll play around with the scripts and see if something useful pops out. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Tue, Jun 29, 2021 at 1:11 PM Thomas Munro wrote: > I spotted a mistake in v3: I didn't rename ST_COMPARE_TYPE to > ST_COMPARE_RET_TYPE in the 0009 patch (well, I did, but forgot to > commit before I ran git format-patch). I won't send another tarball > just for that, but will correct it next time. Here's a version that includes a rather hackish test module that you might find useful to explore various weird effects. Testing sorting routines is really hard, of course... there's a zillion parameters and things you could do in the data and cache effects etc etc. One of the main things that jumps out pretty clearly though with these simple tests is that sorting 6 byte ItemPointerData objects is *really slow* compared to more natural object sizes (look at the times and the MEMORY values in the scripts). Another is that specialised sort functions are much faster than traditional qsort (being one of the goals of this exercise). Sadly, the 64 bit comparison technique is not looking too good in the output of this test. more-sort-search-specializations-v4.tgz Description: application/compressed-tar
Re: A qsort template
I spotted a mistake in v3: I didn't rename ST_COMPARE_TYPE to ST_COMPARE_RET_TYPE in the 0009 patch (well, I did, but forgot to commit before I ran git format-patch). I won't send another tarball just for that, but will correct it next time.
Re: A qsort template
Hi John, On Tue, Jun 29, 2021 at 7:13 AM John Naylor wrote: > I plan to do some performance testing with VACUUM, ANALYZE etc soon, to see > if I can detect any significant differences. Thanks! > I did a quick check of the MacOS/clang binary size (no debug symbols): > > master:8108408 > 0001-0009: 8125224 Not too bad. > Later, I'll drill down into the individual patches and see if anything stands > out. > > There were already some comments for v2 upthread about formatting and an > overflow hazard, but I did find a few more things to ask about: Right, here's an update with fixes discussed earlier with Zhihong and Tom: * COMPARE_TYPE -> COMPARE_RET_TYPE * quit fighting with pgindent (I will try to fix this problem generally later) * fix overflow hazard > - For my curiosity, there are a lot of calls to qsort/qunique in the tree -- > without having looked exhaustively, do these patches focus on cases where > there are bespoke comparator functions and/or hot code paths? Patches 0006-0009 are highly specialised for local usage by a single module, and require some kind of evidence that they're worth their bytes, and the onus is on me there of course -- but any ideas and feedback are welcome. There are other opportunities like these, maybe better ones. That reminds me: I recently had a perf report from Andres that showed the qsort in compute_scalar_stats() as quite hot. That's probably a good candidate, and is not yet done in the current patch set. The lower numbered patches are all things that are reused in many places, and in my humble opinion improve the notation and type safety and code deduplication generally when working with common types ItemPtr, BlockNumber, Oid, aside from any performance arguments. At least the ItemPtr stuff *might* also speed something useful up. I tried to measure a speedup in vacuum, but so far I have not. I did learn some things though: While doing that with an uncorrelated index and a lot of deleted tuples, I found that adding more maintenance_work_mem doesn't help beyond a few MB, because then cache misses dominate to the point where it's not better than doing multiple passes (and this is familiar to me from work on hash joins). If I turned on huge pages on Linux and set min_dynamic_shared_memory so that the parallel DSM used by vacuum lives in huge pages, then parallel vacuum with a large maintenance_work_mem starts to do much better than non-parallel vacuum by improving the TLB misses (as with hash joins). I thought that was quite interesting! Perhaps bsearch_itemptr might help with correlated indexes with a lot of deleted indexes (so not dominated by cache misses), though? (I wouldn't be suprised if someone comes up with a much better idea than bsearch for that anyway... a few ideas have been suggested.) > - Aside from the qsort{_arg} precedence, is there a practical reason for > keeping the new global functions in their own files? Better idea for layout welcome. One thing I wondered while trying to figure out where to put functions that operate on itemptr: why is itemptr_encode() in src/include/catalog/index.h?! > - 0002 / 0004 > > +/* Search and unique functions inline in header. */ > > The functions are pretty small, but is there some advantage for inlining > these? Glibc's bsearch definition is already in a header for inlining (as is our qunique), so I thought I should preserve that characteristic on principle. I don't have any evidence though. Other libcs I looked at didn't have bsearch in a header. So by doing this we make the generated code the same across platforms (all other relevant things being equal). I don't know if it really makes much difference, especially since in this case the comparator and size would still be inlined if we defined it in the .c (unlike standard bsearch)... Probably only lazy_tid_reaped() calls it enough to potentially show any difference in a non-microbenchmark workload, if anything does. > - 0003 > > #include "lib/qunique.h" is not needed anymore. Fixed. > This isn't quite relevant for the current patch perhaps, but I'm wondering > why we don't already call bsearch for RelationHasSysCache() and > RelationSupportsSysCache(). Right, I missed that. Done. Nice to delete some more code. > - 0008 > > +#define ST_COMPARE(a, b, cxt) \ > + DatumGetInt32(FunctionCall2Coll(>flinfo, cxt->collation, *a, *b)) > > This seems like a pretty heavyweight comparison, so I'm not sure inlining > buys us much, but it seems also there are fewer branches this way. I'll come > up with a test and see what happens. I will be very interested to see the results. Thanks! more-sort-search-specializations-v3.tgz Description: application/compressed-tar
Re: A qsort template
On Wed, Jun 16, 2021 at 1:55 AM Thomas Munro wrote: [v2 patch] Hi Thomas, I plan to do some performance testing with VACUUM, ANALYZE etc soon, to see if I can detect any significant differences. I did a quick check of the MacOS/clang binary size (no debug symbols): master:8108408 0001-0009: 8125224 Later, I'll drill down into the individual patches and see if anything stands out. There were already some comments for v2 upthread about formatting and an overflow hazard, but I did find a few more things to ask about: - For my curiosity, there are a lot of calls to qsort/qunique in the tree -- without having looked exhaustively, do these patches focus on cases where there are bespoke comparator functions and/or hot code paths? - Aside from the qsort{_arg} precedence, is there a practical reason for keeping the new global functions in their own files? - 0002 / 0004 +/* Search and unique functions inline in header. */ The functions are pretty small, but is there some advantage for inlining these? - 0003 #include "lib/qunique.h" is not needed anymore. This isn't quite relevant for the current patch perhaps, but I'm wondering why we don't already call bsearch for RelationHasSysCache() and RelationSupportsSysCache(). - 0008 +#define ST_COMPARE(a, b, cxt) \ + DatumGetInt32(FunctionCall2Coll(>flinfo, cxt->collation, *a, *b)) This seems like a pretty heavyweight comparison, so I'm not sure inlining buys us much, but it seems also there are fewer branches this way. I'll come up with a test and see what happens. -- John Naylor EDB: http://www.enterprisedb.com
Re: A qsort template
On Thu, Jun 17, 2021 at 1:14 PM Tom Lane wrote: > The big problem in my mind, which would not be alleviated in the > slightest by having a separate file, is that it'd be easy to miss > removing entries if they ever become obsolete. I suppose you could invent some kind of declaration syntax in a comment near the use of the pseudo-typename in the source tree that is mechanically extracted.
Re: A qsort template
Thomas Munro writes: > Perhaps one day we could add a > secondary file, not updated by that mechanism, that holds a manually > maintained list for cases like this. Yeah, the comments in pgindent already speculate about that. For now, those include and exclude lists are short enough that keeping them inside the script seems a lot easier than building tooling to get them from somewhere else. The big problem in my mind, which would not be alleviated in the slightest by having a separate file, is that it'd be easy to miss removing entries if they ever become obsolete. regards, tom lane
Re: A qsort template
On Thu, Jun 17, 2021 at 11:40 AM Tom Lane wrote: > Thomas Munro writes: > > Hmm, well it was only recently damaged by commit def5b065, and that's > > because I'd forgotten to put ST_ELEMENT_TYPE into typedefs.list, and I > > was correcting that in this patch. > > If ST_ELEMENT_TYPE isn't recognized as a typedef by the buildfarm's > typedef collectors, this sort of manual addition to typedefs.list > is not going to survive the next pgindent run. No, I will NOT > promise to manually add it back every time. > > We do already have special provision for injecting additional typedefs > in the pgindent script, so one possibility is to add it there: > > -my @additional = ("bool\n"); > +my @additional = ("bool\nST_ELEMENT_TYPE\n"); > > On the whole I'm not sure that this is a big enough formatting > issue to justify a special hack, though. Is there any more than > the one line that gets misformatted? Ohh. In that case, I won't bother with that hunk and will live with the extra space. There are several other lines like this in the tree, where people use caveman template macrology that is invisible to whatever analyser is being used for that, and I can see that that's just going to have to be OK for now. Perhaps one day we could add a secondary file, not updated by that mechanism, that holds a manually maintained list for cases like this.
Re: A qsort template
Thomas Munro writes: > Hmm, well it was only recently damaged by commit def5b065, and that's > because I'd forgotten to put ST_ELEMENT_TYPE into typedefs.list, and I > was correcting that in this patch. If ST_ELEMENT_TYPE isn't recognized as a typedef by the buildfarm's typedef collectors, this sort of manual addition to typedefs.list is not going to survive the next pgindent run. No, I will NOT promise to manually add it back every time. We do already have special provision for injecting additional typedefs in the pgindent script, so one possibility is to add it there: -my @additional = ("bool\n"); +my @additional = ("bool\nST_ELEMENT_TYPE\n"); On the whole I'm not sure that this is a big enough formatting issue to justify a special hack, though. Is there any more than the one line that gets misformatted? regards, tom lane
Re: A qsort template
On Wed, Jun 16, 2021 at 2:54 PM Thomas Munro wrote: > Hi Zhihong, > > On Thu, Jun 17, 2021 at 8:13 AM Zhihong Yu wrote: > > In 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch : > > > > - const ST_ELEMENT_TYPE * > ST_SORT_PROTO_ARG); > > + const ST_ELEMENT_TYPE > *ST_SORT_PROTO_ARG); > > > > It seems there is no real change in the line above. Better keep the > original formation. > > Hmm, well it was only recently damaged by commit def5b065, and that's > because I'd forgotten to put ST_ELEMENT_TYPE into typedefs.list, and I > was correcting that in this patch. (That file is used by > pg_bsd_indent to decide if an identifier is a type or a variable, > which affects whether '*' is formatted like a unary operator/type > syntax or a binary operator.) > > > * - ST_COMPARE_ARG_TYPE - type of extra argument > > * > > + * To say that the comparator returns a type other than int, use: > > + * > > + * - ST_COMPARE_TYPE - an integer type > > > > Since the ST_COMPARE_TYPE is meant to designate the type of the return > value, maybe ST_COMPARE_RET_TYPE would be better name. > > It also goes with ST_COMPARE_ARG_TYPE preceding this. > > Good idea, will do. > > > - ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data, > > - *pa, > > - *pb, > > - *pc, > > - *pd, > > - *pl, > > - *pm, > > - *pn; > > + ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data; > > + ST_POINTER_TYPE *pa; > > > > There doesn't seem to be material change for the above hunk. > > In master, you can't write #define ST_ELEMENT_TYPE some_type *, which > seems like it would be quite useful. You can use pointers as element > types, but only with a typedef name due to C parsing rules. some_type > **a, *pa, ... declares some_type *pa, but we want some_type **pa. I > don't want to have to introduce extra typedefs. The change fixes that > problem by not using C's squirrelly variable declaration list syntax. > > > + while (left <= right) > > + { > > + size_t mid = (left + right) / 2; > > > > The computation for midpoint should be left + (right-left)/2. > > Right, my way can overflow. Will fix. Thanks! > Hi, Thanks for giving me background on typedefs. The relevant changes look fine to me. Cheers
Re: A qsort template
Hi Zhihong, On Thu, Jun 17, 2021 at 8:13 AM Zhihong Yu wrote: > In 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch : > > - const ST_ELEMENT_TYPE * > ST_SORT_PROTO_ARG); > + const ST_ELEMENT_TYPE > *ST_SORT_PROTO_ARG); > > It seems there is no real change in the line above. Better keep the original > formation. Hmm, well it was only recently damaged by commit def5b065, and that's because I'd forgotten to put ST_ELEMENT_TYPE into typedefs.list, and I was correcting that in this patch. (That file is used by pg_bsd_indent to decide if an identifier is a type or a variable, which affects whether '*' is formatted like a unary operator/type syntax or a binary operator.) > * - ST_COMPARE_ARG_TYPE - type of extra argument > * > + * To say that the comparator returns a type other than int, use: > + * > + * - ST_COMPARE_TYPE - an integer type > > Since the ST_COMPARE_TYPE is meant to designate the type of the return value, > maybe ST_COMPARE_RET_TYPE would be better name. > It also goes with ST_COMPARE_ARG_TYPE preceding this. Good idea, will do. > - ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data, > - *pa, > - *pb, > - *pc, > - *pd, > - *pl, > - *pm, > - *pn; > + ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data; > + ST_POINTER_TYPE *pa; > > There doesn't seem to be material change for the above hunk. In master, you can't write #define ST_ELEMENT_TYPE some_type *, which seems like it would be quite useful. You can use pointers as element types, but only with a typedef name due to C parsing rules. some_type **a, *pa, ... declares some_type *pa, but we want some_type **pa. I don't want to have to introduce extra typedefs. The change fixes that problem by not using C's squirrelly variable declaration list syntax. > + while (left <= right) > + { > + size_t mid = (left + right) / 2; > > The computation for midpoint should be left + (right-left)/2. Right, my way can overflow. Will fix. Thanks!
Re: A qsort template
On Tue, Jun 15, 2021 at 10:55 PM Thomas Munro wrote: > On Mon, Mar 15, 2021 at 1:09 PM Thomas Munro > wrote: > > On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu wrote: > > > + * Remove duplicates from an array. Return the new size. > > > + */ > > > +ST_SCOPE size_t > > > +ST_UNIQUE(ST_ELEMENT_TYPE *array, > > > > > > The array is supposed to be sorted, right ? > > > The comment should mention this. > > > > Good point, will update. Thanks! > > Rebased. Also fixed some formatting problems and updated > typedefs.list so they don't come back. > Hi, In 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch : - const ST_ELEMENT_TYPE * ST_SORT_PROTO_ARG); + const ST_ELEMENT_TYPE *ST_SORT_PROTO_ARG); It seems there is no real change in the line above. Better keep the original formation. * - ST_COMPARE_ARG_TYPE - type of extra argument * + * To say that the comparator returns a type other than int, use: + * + * - ST_COMPARE_TYPE - an integer type Since the ST_COMPARE_TYPE is meant to designate the type of the return value, maybe ST_COMPARE_RET_TYPE would be better name. It also goes with ST_COMPARE_ARG_TYPE preceding this. - ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data, - *pa, - *pb, - *pc, - *pd, - *pl, - *pm, - *pn; + ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data; + ST_POINTER_TYPE *pa; There doesn't seem to be material change for the above hunk. + while (left <= right) + { + size_t mid = (left + right) / 2; The computation for midpoint should be left + (right-left)/2. Cheers
Re: A qsort template
On Mon, Mar 15, 2021 at 1:09 PM Thomas Munro wrote: > On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu wrote: > > + * Remove duplicates from an array. Return the new size. > > + */ > > +ST_SCOPE size_t > > +ST_UNIQUE(ST_ELEMENT_TYPE *array, > > > > The array is supposed to be sorted, right ? > > The comment should mention this. > > Good point, will update. Thanks! Rebased. Also fixed some formatting problems and updated typedefs.list so they don't come back. more-sort-search-specializations-v2.tgz Description: application/compressed-tar
Re: A qsort template
On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu wrote: > For 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch : > > + * Remove duplicates from an array. Return the new size. > + */ > +ST_SCOPE size_t > +ST_UNIQUE(ST_ELEMENT_TYPE *array, > > The array is supposed to be sorted, right ? > The comment should mention this. Good point, will update. Thanks!
Re: A qsort template
Hi, For 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch : + * Remove duplicates from an array. Return the new size. + */ +ST_SCOPE size_t +ST_UNIQUE(ST_ELEMENT_TYPE *array, The array is supposed to be sorted, right ? The comment should mention this. Cheers On Sat, Mar 13, 2021 at 6:36 PM Thomas Munro wrote: > On Sat, Mar 13, 2021 at 3:49 PM Thomas Munro > wrote: > > On Fri, Mar 12, 2021 at 7:58 AM Andres Freund > wrote: > > > I wish we had the same for bsearch... :) > > > > Glibc already has the definition of the traditional void-based > > function in /usr/include/bits/stdlib-bsearch.h, so the generated code > > when the compiler can see the comparator definition is already good in > > eg lazy_tid_reaped() and eg some nbtree search routines. We could > > probably expose more trivial comparators in headers to get more of > > that, and we could perhaps put our own bsearch definition in a header > > for other platforms that didn't think of that... > > > > It might be worth doing type-safe macro templates as well, though (as > > I already did in an earlier proposal[1]), just to have nice type safe > > code though, not sure, I'm thinking about that... > > I remembered a very good reason to do this: the ability to do > branch-free comparators in more places by introducing optional wider > results. That's good for TIDs (needs 49 bits), and places that want > to "reverse" a traditional comparator (just doing -result on an int > comparator that might theoretically return INT_MIN requires at least > 33 bits). So I rebased the relevant parts of my earlier version, and > went through and wrote a bunch of examples to demonstrate all this > stuff actually working. > > There are two categories of change in these patches: > > 0002-0005: Places that sort/unique/search OIDs, BlockNumbers and TIDs, > which can reuse a small set of typed functions (a few more could be > added, if useful). See sortitemptr.h and sortscalar.h. Mostly this > is just a notational improvement, and an excuse to drop a bunch of > duplicated code. In a few places this might really speed something > important up! Like VACUUM's lazy_tid_reaped(). > > 0006-0009. Places where a specialised function is generated for one > special purpose, such as ANALYZE's HeapTuple sort, tidbitmap.c's > pagetable sort, some places in nbtree code etc. These may require > some case-by-case research on whether the extra executable size is > worth the speedup, and there are surely more opportunities like that; > I just picked on these arbitrarily. >
Re: A qsort template
On Sat, Mar 13, 2021 at 3:49 PM Thomas Munro wrote: > On Fri, Mar 12, 2021 at 7:58 AM Andres Freund wrote: > > I wish we had the same for bsearch... :) > > Glibc already has the definition of the traditional void-based > function in /usr/include/bits/stdlib-bsearch.h, so the generated code > when the compiler can see the comparator definition is already good in > eg lazy_tid_reaped() and eg some nbtree search routines. We could > probably expose more trivial comparators in headers to get more of > that, and we could perhaps put our own bsearch definition in a header > for other platforms that didn't think of that... > > It might be worth doing type-safe macro templates as well, though (as > I already did in an earlier proposal[1]), just to have nice type safe > code though, not sure, I'm thinking about that... I remembered a very good reason to do this: the ability to do branch-free comparators in more places by introducing optional wider results. That's good for TIDs (needs 49 bits), and places that want to "reverse" a traditional comparator (just doing -result on an int comparator that might theoretically return INT_MIN requires at least 33 bits). So I rebased the relevant parts of my earlier version, and went through and wrote a bunch of examples to demonstrate all this stuff actually working. There are two categories of change in these patches: 0002-0005: Places that sort/unique/search OIDs, BlockNumbers and TIDs, which can reuse a small set of typed functions (a few more could be added, if useful). See sortitemptr.h and sortscalar.h. Mostly this is just a notational improvement, and an excuse to drop a bunch of duplicated code. In a few places this might really speed something important up! Like VACUUM's lazy_tid_reaped(). 0006-0009. Places where a specialised function is generated for one special purpose, such as ANALYZE's HeapTuple sort, tidbitmap.c's pagetable sort, some places in nbtree code etc. These may require some case-by-case research on whether the extra executable size is worth the speedup, and there are surely more opportunities like that; I just picked on these arbitrarily. From d0fb306d2720d14be2d46f4ae4198b25a33a95fa Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 13 Mar 2021 17:29:44 +1300 Subject: [PATCH 1/9] Add bsearch and unique templates to sort_template.h. Since binary search and uniquify are so closely related to sorting, let's optionally generate compatible functions at the same time. Also, optionally support comparators with wider return types. This will allow us to do more branchless comparators. Also, adjust the ST_SORT template to cope with pointer types. --- src/include/lib/sort_template.h | 142 1 file changed, 124 insertions(+), 18 deletions(-) diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h index 771c789ced..a6097e1ac5 100644 --- a/src/include/lib/sort_template.h +++ b/src/include/lib/sort_template.h @@ -3,7 +3,8 @@ * sort_template.h * * A template for a sort algorithm that supports varying degrees of - * specialization. + * specialization. Also related algorithms for binary search and + * unique, on sorted arrays. * * Copyright (c) 2021, PostgreSQL Global Development Group * Portions Copyright (c) 1992-1994, Regents of the University of California @@ -13,7 +14,9 @@ * To generate functions specialized for a type, the following parameter * macros should be #define'd before this file is included. * - * - ST_SORT - the name of a sort function to be generated + * - ST_SORT - if defined the name of a sort function + * - ST_UNIQUE - if defined the name of a unique function + * - ST_SEARCH - if defined the name of a search function * - ST_ELEMENT_TYPE - type of the referenced elements * - ST_DECLARE - if defined the functions and types are declared * - ST_DEFINE - if defined the functions and types are defined @@ -40,6 +43,10 @@ * * - ST_COMPARE_ARG_TYPE - type of extra argument * + * To say that the comparator returns a type other than int, use: + * + * - ST_COMPARE_TYPE - an integer type + * * The prototype of the generated sort function is: * * void ST_SORT(ST_ELEMENT_TYPE *data, size_t n, @@ -179,13 +186,35 @@ typedef int (*ST_COMPARATOR_TYPE_NAME) (const ST_ELEMENT_TYPE *, const ST_ELEMENT_TYPE *ST_SORT_PROTO_ARG); #endif +#ifdef ST_SORT /* Declare the sort function. Note optional arguments at end. */ -ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE *first, size_t n +ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE *array, + size_t n ST_SORT_PROTO_ELEMENT_SIZE ST_SORT_PROTO_COMPARE ST_SORT_PROTO_ARG); +#endif /* ST_SORT */ -#endif +#ifdef ST_SEARCH +/* Declare the search function. */ +ST_SCOPE ST_ELEMENT_TYPE *ST_SEARCH(ST_ELEMENT_TYPE *value, + ST_ELEMENT_TYPE *array, + size_t n + ST_SORT_PROTO_ELEMENT_SIZE + ST_SORT_PROTO_COMPARE +
Re: A qsort template
On Fri, Mar 12, 2021 at 7:58 AM Andres Freund wrote: > I wish we had the same for bsearch... :) Glibc already has the definition of the traditional void-based function in /usr/include/bits/stdlib-bsearch.h, so the generated code when the compiler can see the comparator definition is already good in eg lazy_tid_reaped() and eg some nbtree search routines. We could probably expose more trivial comparators in headers to get more of that, and we could perhaps put our own bsearch definition in a header for other platforms that didn't think of that... It might be worth doing type-safe macro templates as well, though (as I already did in an earlier proposal[1]), just to have nice type safe code though, not sure, I'm thinking about that... [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLY47Cvu62mFDT53Ya0P95cGggcBN6R6aLpx6%3DGm5j%2B1A%40mail.gmail.com
Re: A qsort template
Hi, I wish we had the same for bsearch... :) On 2021-03-03 17:17:13 +1300, Thomas Munro wrote: > As for which cases are actually worth specialising, I've attached the > example that Andres mentioned earlier; it seems like a reasonable > candidate to go ahead and commit too, but I realised that I'd > forgotten to attach it earlier. > From 4cec5cb9a2e0c50726b7337fb8221281e155c4cd Mon Sep 17 00:00:00 2001 > From: Thomas Munro > Date: Thu, 18 Feb 2021 14:47:28 +1300 > Subject: [PATCH] Specialize checkpointer sort functions. > > When sorting a potentially large number of dirty buffers, the > checkpointer can benefit from a faster sort routine. One reported > improvement on a large buffer pool system was 1.4s -> 0.6s. > > Discussion: > https://postgr.es/m/CA%2BhUKGJ2-eaDqAum5bxhpMNhvuJmRDZxB_Tow0n-gse%2BHG0Yig%40mail.gmail.com Looks good to me. Greetings, Andres Freund
Re: A qsort template
On Wed, Mar 3, 2021 at 10:25 AM Daniel Gustafsson wrote: > > On 18 Feb 2021, at 04:09, Thomas Munro wrote: > > In another thread[1], I proposed $SUBJECT, but then we found a better > > solution to that thread's specific problem. The general idea is still > > good though: it's possible to (1) replace several existing copies of > > our qsort algorithm with one, and (2) make new specialised versions a > > bit more easily than the existing Perl generator allows. So, I'm back > > with a rebased stack of patches. I'll leave specific cases for new > > worthwhile specialisations for separate proposals; I've heard about > > several. > > Just to play around with this while reviewing I made a qsort_strcmp, like in > the attached, and tested it using a ~9M word [0] randomly shuffled wordlist. > While being too small input to make any meaningful difference in runtime (it > shaved a hair off but it might well be within the error margin) there was no > regression either. More importantly, it was really simple and quick to make a > tailored qsort which is the intention with the patch. While still being a bit > of magic, moving from the Perl generator makes this slightly less magic IMO so > +1 on this approach. Thanks for testing and reviewing! > A tiny nitpick on the patch itself: > > + * - ST_COMPARE(a, b) - a simple comparison expression > + * - ST_COMPARE(a, b, arg) - variant that takes an extra argument > Indentation. Fixed. Also ran pgindent. > All tests pass and the documentation in the the sort_template.h is enough to > go > on, but I would prefer to see a comment in port/qsort.c referring back to > sort_template.h for documentation. I tried adding a comment along the lines "see lib/sort_template.h for details", but it felt pretty redundant, when the file contains very little other than #include "lib/sort_template.h" which should already tell you to go and look there to find out what this is about... I went ahead and pushed these. I am sure there are plenty of opportunities to experiment with this code. Here are some I recall Peter Geoghegan mentioning: 1. If you know that elements are unique, you could remove some branches that deal with equal elements (see "r == 0"). 2. Perhaps you might want to be able to disable the "presorted" check in some cases? 3. The parameters 7, 7 and 40 were probably tuned for an ancient Vax or similar[1]. We see higher insertion sort thesholds such as 27 in more recent sort algorithms[2] used in eg the JVM. You could perhaps speculate that the right answer depends in part on the element size; I dunno, but if so, here we have that at compile time while traditional qsort() does not. As for which cases are actually worth specialising, I've attached the example that Andres mentioned earlier; it seems like a reasonable candidate to go ahead and commit too, but I realised that I'd forgotten to attach it earlier. It's possible that the existing support sorting tuples could be further specialised for common sort key data types; I haven't tried that. [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8162=rep1=pdf [2] https://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf From 4cec5cb9a2e0c50726b7337fb8221281e155c4cd Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 18 Feb 2021 14:47:28 +1300 Subject: [PATCH] Specialize checkpointer sort functions. When sorting a potentially large number of dirty buffers, the checkpointer can benefit from a faster sort routine. One reported improvement on a large buffer pool system was 1.4s -> 0.6s. Discussion: https://postgr.es/m/CA%2BhUKGJ2-eaDqAum5bxhpMNhvuJmRDZxB_Tow0n-gse%2BHG0Yig%40mail.gmail.com --- src/backend/storage/buffer/bufmgr.c | 37 + 1 file changed, 22 insertions(+), 15 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 561c212092..7adec2ddc3 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -488,8 +488,8 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode, static void AtProcExit_Buffers(int code, Datum arg); static void CheckForBufferLeaks(void); static int rnode_comparator(const void *p1, const void *p2); -static int buffertag_comparator(const void *p1, const void *p2); -static int ckpt_buforder_comparator(const void *pa, const void *pb); +static inline int buffertag_comparator(const BufferTag *a, const BufferTag *b); +static inline int ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b); static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg); @@ -1831,6 +1831,13 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner) } } +#define ST_SORT sort_checkpoint_bufferids +#define ST_ELEMENT_TYPE CkptSortItem +#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b) +#define ST_SCOPE static +#define ST_DEFINE +#include + /* * BufferSync -- Write out all dirty buffers in the pool. * @@ -1931,8 +1938,7
Re: A qsort template
> On 18 Feb 2021, at 04:09, Thomas Munro wrote: > In another thread[1], I proposed $SUBJECT, but then we found a better > solution to that thread's specific problem. The general idea is still > good though: it's possible to (1) replace several existing copies of > our qsort algorithm with one, and (2) make new specialised versions a > bit more easily than the existing Perl generator allows. So, I'm back > with a rebased stack of patches. I'll leave specific cases for new > worthwhile specialisations for separate proposals; I've heard about > several. Just to play around with this while reviewing I made a qsort_strcmp, like in the attached, and tested it using a ~9M word [0] randomly shuffled wordlist. While being too small input to make any meaningful difference in runtime (it shaved a hair off but it might well be within the error margin) there was no regression either. More importantly, it was really simple and quick to make a tailored qsort which is the intention with the patch. While still being a bit of magic, moving from the Perl generator makes this slightly less magic IMO so +1 on this approach. A tiny nitpick on the patch itself: + * - ST_COMPARE(a, b) - a simple comparison expression + * - ST_COMPARE(a, b, arg) - variant that takes an extra argument Indentation. All tests pass and the documentation in the the sort_template.h is enough to go on, but I would prefer to see a comment in port/qsort.c referring back to sort_template.h for documentation. -- Daniel Gustafsson https://vmware.com/ [0] https://github.com/dwyl/english-words/ shuffled 20 times over qsort_tmpl_strcmp-patch Description: Binary data
Re: A qsort template
Hi, On 2021-02-18 16:09:49 +1300, Thomas Munro wrote: > In another thread[1], I proposed $SUBJECT, but then we found a better > solution to that thread's specific problem. The general idea is still > good though: it's possible to (1) replace several existing copies of > our qsort algorithm with one, and (2) make new specialised versions a > bit more easily than the existing Perl generator allows. So, I'm back > with a rebased stack of patches. I'll leave specific cases for new > worthwhile specialisations for separate proposals; I've heard about > several. One place that could benefit is the qsort that BufferSync() does at the start. I tried your patch for that, and it does reduce the sort time considerably. For 64GB of mostly dirty shared_buffers from ~1.4s to 0.6s. Now, obviously one can argue that that's not going to be the crucial spot, and wouldn't be entirely wrong. OTOH, in my AIO branch I see checkpointer doing ~10GB/s, leading to the sort being a measurable portion of the overall time. Greetings, Andres Freund
A qsort template
Hello, In another thread[1], I proposed $SUBJECT, but then we found a better solution to that thread's specific problem. The general idea is still good though: it's possible to (1) replace several existing copies of our qsort algorithm with one, and (2) make new specialised versions a bit more easily than the existing Perl generator allows. So, I'm back with a rebased stack of patches. I'll leave specific cases for new worthwhile specialisations for separate proposals; I've heard about several. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw%40mail.gmail.com From edfa27a45eca139d683bd0a02136e1da5a489243 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 17 Aug 2020 21:31:56 +1200 Subject: [PATCH 1/3] Add sort_template.h for making fast sort functions. Move our qsort implementation into a header that can be used to define specialized functions for better performance. --- src/include/lib/sort_template.h | 431 1 file changed, 431 insertions(+) create mode 100644 src/include/lib/sort_template.h diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h new file mode 100644 index 00..09523a8e4c --- /dev/null +++ b/src/include/lib/sort_template.h @@ -0,0 +1,431 @@ +/*- + * + * sort_template.h + * + * A template for a sort algorithm that supports varying degrees of + * specialization. + * + * Copyright (c) 2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1992-1994, Regents of the University of California + * + * Usage notes: + * + * To generate functions specialized for a type, the following parameter + * macros should be #define'd before this file is included. + * + * - ST_SORT - the name of a sort function to be generated + * - ST_ELEMENT_TYPE - type of the referenced elements + * - ST_DECLARE - if defined the functions and types are declared + * - ST_DEFINE - if defined the functions and types are defined + * - ST_SCOPE - scope (e.g. extern, static inline) for functions + * + * Instead of ST_ELEMENT_TYPE, ST_ELEMENT_TYPE_VOID can be defined. Then + * the generated functions will automatically gain an "element_size" + * parameter. This allows us to generate a traditional qsort function. + * + * One of the following macros must be defined, to show how to compare + * elements. The first two options are arbitrary expressions depending + * on whether an extra pass-through argument is desired, and the third + * option should be defined if the sort function should receive a + * function pointer at runtime. + * + * - ST_COMPARE(a, b) - a simple comparison expression + * - ST_COMPARE(a, b, arg) - variant that takes an extra argument + * - ST_COMPARE_RUNTIME_POINTER - sort function takes a function pointer + * + * To say that the comparator and therefore also sort function should + * receive an extra pass-through argument, specify the type of the + * argument. + * + * - ST_COMPARE_ARG_TYPE - type of extra argument + * + * The prototype of the generated sort function is: + * + * void ST_SORT(ST_ELEMENT_TYPE *data, size_t n, + * [size_t element_size,] + * [ST_SORT_compare_function compare,] + * [ST_COMPARE_ARG_TYPE *arg]); + * + * ST_SORT_compare_function is a function pointer of the following type: + * + * int (*)(const ST_ELEMENT_TYPE *a, const ST_ELEMENT_TYPE *b, + * [ST_COMPARE_ARG_TYPE *arg]) + * + * HISTORY + * + * Modifications from vanilla NetBSD source: + * - Add do ... while() macro fix + * - Remove __inline, _DIAGASSERTs, __P + * - Remove ill-considered "swap_cnt" switch to insertion sort, in favor + * of a simple check for presorted input. + * - Take care to recurse on the smaller partition, to bound stack usage + * - Convert into a header that can generate specialized functions + * + * IDENTIFICATION + * src/include/lib/sort_template.h + * + *- + */ + +/* $NetBSD: qsort.c,v 1.13 2003/08/07 16:43:42 agc Exp $ */ + +/*- + * Copyright (c) 1992, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written