Re: A qsort template

2022-05-23 Thread John Naylor
I wrote:
> I agree this is only useful in development. Removal sounds fine to me,
> so I'll do that soon.

This is done.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-05-20 Thread John Naylor
On Fri, May 20, 2022 at 5:43 AM Tom Lane  wrote:
>
> Peter Geoghegan  writes:
> > On Thu, May 19, 2022 at 1:12 PM Justin Pryzby  wrote:
> >> Should these debug lines be removed ?
> >>
> >> elog(DEBUG1, "qsort_tuple");
>
> > I agree -- DEBUG1 seems too chatty for something like this. DEBUG2
> > would be more appropriate IMV. Though I don't feel very strongly about
> > it.
>
> Given the lack of context identification, I'd put the usefulness of
> these in production at close to zero.  +1 for removing them
> altogether, or failing that, downgrade to DEBUG5 or so.

I agree this is only useful in development. Removal sounds fine to me,
so I'll do that soon.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-05-19 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, May 19, 2022 at 1:12 PM Justin Pryzby  wrote:
>> Should these debug lines be removed ?
>> 
>> elog(DEBUG1, "qsort_tuple");

> I agree -- DEBUG1 seems too chatty for something like this. DEBUG2
> would be more appropriate IMV. Though I don't feel very strongly about
> it.

Given the lack of context identification, I'd put the usefulness of
these in production at close to zero.  +1 for removing them
altogether, or failing that, downgrade to DEBUG5 or so.

regards, tom lane




Re: A qsort template

2022-05-19 Thread Peter Geoghegan
On Thu, May 19, 2022 at 1:12 PM Justin Pryzby  wrote:
> Should these debug lines be removed ?
>
> elog(DEBUG1, "qsort_tuple");

I agree -- DEBUG1 seems too chatty for something like this. DEBUG2
would be more appropriate IMV. Though I don't feel very strongly about
it.

--
Peter Geoghegan




Re: A qsort template

2022-05-19 Thread Justin Pryzby
On Fri, Apr 22, 2022 at 11:37:29AM +0700, John Naylor wrote:
> On Fri, Apr 22, 2022 at 11:13 AM David Rowley  wrote:
> >
> > On Thu, 21 Apr 2022 at 19:09, John Naylor  
> > wrote:
> > > I intend to commit David's v2 fix next week, unless there are
> > > objections, or unless he beats me to it.
> >
> > I wasn't sure if you wanted to handle it or not, but I don't mind
> > doing it, so I just pushed it after a small adjustment to a comment.
> 
> Thank you!

Should these debug lines be removed ?

elog(DEBUG1, "qsort_tuple");

Perhaps if I ask for debug output, I shouldn't be surprised if it changes
between major releases - but I still found this surprising.

I'm sure it's useful during development and maybe during beta.  It could even
make sense if it were shown during regression tests (preferably at DEBUG2).
But right now it's not.  is that

ts=# \dt
DEBUG:  qsort_tuple
List of relations

-- 
Justin




Re: A qsort template

2022-04-21 Thread Thomas Munro
On Fri, Apr 22, 2022 at 4:37 PM John Naylor
 wrote:
> On Fri, Apr 22, 2022 at 11:13 AM David Rowley  wrote:
> > On Thu, 21 Apr 2022 at 19:09, John Naylor  
> > wrote:
> > > I intend to commit David's v2 fix next week, unless there are
> > > objections, or unless he beats me to it.
> >
> > I wasn't sure if you wanted to handle it or not, but I don't mind
> > doing it, so I just pushed it after a small adjustment to a comment.
>
> Thank you!

Thanks both for working on this.  Seems like a good call to defer the
choice of further specialisations.




Re: A qsort template

2022-04-21 Thread John Naylor
On Fri, Apr 22, 2022 at 11:13 AM David Rowley  wrote:
>
> On Thu, 21 Apr 2022 at 19:09, John Naylor  
> wrote:
> > I intend to commit David's v2 fix next week, unless there are
> > objections, or unless he beats me to it.
>
> I wasn't sure if you wanted to handle it or not, but I don't mind
> doing it, so I just pushed it after a small adjustment to a comment.

Thank you!

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-21 Thread David Rowley
On Thu, 21 Apr 2022 at 19:09, John Naylor  wrote:
> I intend to commit David's v2 fix next week, unless there are
> objections, or unless he beats me to it.

I wasn't sure if you wanted to handle it or not, but I don't mind
doing it, so I just pushed it after a small adjustment to a comment.

Before going ahead with it I did test a 2-key sort where the leading
key values were all the same.  I wondered if we'd still see any
regression from having to re-compare the leading key all over again.

I just did:

create table ab (a bigint, b bigint);
insert into ab select 0,x from generate_series(1,100)x;
vacuum freeze ab;

I then ran:
select * from ab order by a,b offset 100;

697492434 (Specialize tuplesort routines for different kinds of
abbreviated keys)
$ pgbench -n -f bench1.sql -T 60 -M prepared postgres
tps = 10.651740 (without initial connection time)
tps = 10.813647 (without initial connection time)
tps = 10.648960 (without initial connection time)

697492434~1 (Remove obsolete comment)
$ pgbench -n -f bench1.sql -T 60 -M prepared postgres
tps = 9.957163 (without initial connection time)
tps = 10.191168 (without initial connection time)
tps = 10.145281 (without initial connection time)

So it seems there was no regression for that case, at least, not on
the AMD machine that I tested on.

David




Re: A qsort template

2022-04-21 Thread John Naylor
I intend to commit David's v2 fix next week, unless there are
objections, or unless he beats me to it.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-19 Thread John Naylor
> Okay, this makes logical sense and is a smaller patch to boot. I've
> re-run my tests (attached) to make sure we have our bases covered. I'm
> sharing the min-of-five, as before, but locally I tried . The

My sentence there was supposed to read "I tried using median and it
was a bit less noisy".

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-19 Thread John Naylor
On Tue, Apr 19, 2022 at 12:30 PM David Rowley  wrote:
>
> Thanks for looking at this.
>
> On Tue, 19 Apr 2022 at 02:11, John Naylor  
> wrote:
> > IIUC, this function is called by tuplesort_begin_common, which in turn
> > is called by tuplesort_begin_{heap, indexes, etc}. The latter callers
> > set the onlyKey and now oneKeySort variables as appropriate, and
> > sometimes hard-coded to false. Is it intentional to set them here
> > first?
> >
> > Falling under the polish that you were likely thinking of above:
>
> I did put the patch together quickly just for the benchmark and at the
> time I was subtly aware that the onlyKey field was being set using a
> similar condition as I was using to set the boolean field I'd added.
> On reflection today, it should be fine just to check if that field is
> NULL or not in the 3 new comparison functions. Similarly to before,
> this only needs to be done if the datums compare equally, so does not
> add any code to the path where the datums are non-equal.  It looks
> like the other tuplesort_begin_* functions use a different comparison
> function that will never make use of the specialization comparison
> functions added by 697492434.

Okay, this makes logical sense and is a smaller patch to boot. I've
re-run my tests (attached) to make sure we have our bases covered. I'm
sharing the min-of-five, as before, but locally I tried . The
regression is fixed, and most other differences from v15 seem to be
noise. It's possible the naturally fastest cases (pre-sorted ints and
bigints) are slower than v15-revert than expected from noise, but it's
not clear.

I think this is good to go.

-- 
John Naylor
EDB: http://www.enterprisedb.com


qsort-fix-regression-jcn2.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: A qsort template

2022-04-18 Thread David Rowley
Thanks for looking at this.

On Tue, 19 Apr 2022 at 02:11, John Naylor  wrote:
> IIUC, this function is called by tuplesort_begin_common, which in turn
> is called by tuplesort_begin_{heap, indexes, etc}. The latter callers
> set the onlyKey and now oneKeySort variables as appropriate, and
> sometimes hard-coded to false. Is it intentional to set them here
> first?
>
> Falling under the polish that you were likely thinking of above:

I did put the patch together quickly just for the benchmark and at the
time I was subtly aware that the onlyKey field was being set using a
similar condition as I was using to set the boolean field I'd added.
On reflection today, it should be fine just to check if that field is
NULL or not in the 3 new comparison functions. Similarly to before,
this only needs to be done if the datums compare equally, so does not
add any code to the path where the datums are non-equal.  It looks
like the other tuplesort_begin_* functions use a different comparison
function that will never make use of the specialization comparison
functions added by 697492434.

I separated out the "or" condition that I'd added tot he existing "if"
to make it easier to write a comment explaining why we can skip the
tiebreak function call.

Updated patch attached.

David
diff --git a/src/backend/utils/sort/tuplesort.c 
b/src/backend/utils/sort/tuplesort.c
index 1174e1a31c..12d1bf551b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -436,7 +436,10 @@ struct Tuplesortstate
 
/*
 * This variable is shared by the single-key MinimalTuple case and the
-* Datum case (which both use qsort_ssup()).  Otherwise it's NULL.
+* Datum case (which both use qsort_ssup()).  It is also used by various
+* sort specialization functions when comparing the leading key in a
+* tiebreak situation to determine if there are any subsequent keys to
+* sort on. It's otherwise NULL.
 */
SortSupport onlyKey;
 
@@ -698,7 +701,12 @@ qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, 
Tuplesortstate *state)
compare = ApplyUnsignedSortComparator(a->datum1, a->isnull1,

  b->datum1, b->isnull1,

  >sortKeys[0]);
-   if (compare != 0)
+
+   /*
+* No need to call the tiebreak function when the datums differ or if 
this
+* is the only key we're sorting on.
+*/
+   if (compare != 0 || state->onlyKey != NULL)
return compare;
 
return state->comparetup(a, b, state);
@@ -713,7 +721,12 @@ qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, 
Tuplesortstate *state)
compare = ApplySignedSortComparator(a->datum1, a->isnull1,

b->datum1, b->isnull1,

>sortKeys[0]);
-   if (compare != 0)
+
+   /*
+* No need to call the tiebreak function when the datums differ or if 
this
+* is the only key we're sorting on.
+*/
+   if (compare != 0 || state->onlyKey != NULL)
return compare;
 
return state->comparetup(a, b, state);
@@ -728,7 +741,12 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, 
Tuplesortstate *state)
compare = ApplyInt32SortComparator(a->datum1, a->isnull1,

b->datum1, b->isnull1,

>sortKeys[0]);
-   if (compare != 0)
+
+   /*
+* No need to call the tiebreak function when the datums differ or if 
this
+* is the only key we're sorting on.
+*/
+   if (compare != 0 || state->onlyKey != NULL)
return compare;
 
return state->comparetup(a, b, state);


Re: A qsort template

2022-04-18 Thread John Naylor
On Tue, Apr 12, 2022 at 7:58 AM David Rowley  wrote:
>
> I've attached the patch I tested. It was thrown together very quickly
> just to try out the performance. If it's interesting I can polish it
> up a bit. If not, I didn't waste too much time.

@@ -959,6 +965,10 @@ tuplesort_begin_batch(Tuplesortstate *state)

  state->tapeset = NULL;

+ /* check if specialized sorts can skip calling the tiebreak function */
+ state->oneKeySort = state->nKeys == 1 &&
+ !state->sortKeys[0].abbrev_converter;
+

IIUC, this function is called by tuplesort_begin_common, which in turn
is called by tuplesort_begin_{heap, indexes, etc}. The latter callers
set the onlyKey and now oneKeySort variables as appropriate, and
sometimes hard-coded to false. Is it intentional to set them here
first?

Falling under the polish that you were likely thinking of above:

We might rename oneKeySort to skipTiebreaker to avoid confusion.
SInce the test for these variable is the same, we could consolidate
them into a block and reword this existing comment  (which I find a
little confusing anyway):

/*
* The "onlyKey" optimization cannot be used with abbreviated keys, since
* tie-breaker comparisons may be required.  Typically, the optimization
* is only of value to pass-by-value types anyway, whereas abbreviated
* keys are typically only of value to pass-by-reference types.
*/

I can take a stab at this, unless you had something else in mind.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-14 Thread John Naylor
On Thu, Apr 14, 2022 at 1:46 PM David Rowley  wrote:
>
> On Wed, 13 Apr 2022 at 23:19, John Naylor  
> wrote:
> > More broadly than the regression, Thomas' is very often the fastest of
> > all, at the cost of more binary size. David's is occasionally slower
> > than v15 or v15 with revert, but much of that is a slight difference
> > and some is probably noise.

To add to my summary of results - the v15 code, with and without extra
patches, seems slightly worse on B-tree index creation for very low
cardinality keys, but that's not an index that's going to be useful
(and therefore common) so that's a good tradeoff in my view. The
regression David found is more concerning.

> Just to get an opinion from some other hardware, I've run your test
> script on my AMD 3990x machine.

Thanks for that. I only see 4 non-Btree measurements in your results
that are larger than v15-revert, versus 8 in mine (Comet Lake). And
overall, most of those seem within the noise level.

> My opinion here is that the best thing we can learn from both of our
> results is, do the patches fix the regression?

I'd say the answer is yes for both.

> I don't believe it should be about if adding the additional
> specializations performs better than skipping the tie break function
> call.  I think it's pretty obvious that the specializations will be
> faster.  I think if it was decided that v16 would be the version where
> more work should be done to decide on what should be specialized and
> what shouldn't be, then we shouldn't let this regression force our
> hand to make that choice now. It'll be pretty hard to remove any
> specializations once they've been in a released version of Postgres.

I agree that a narrow fix is preferable. I'll take a closer look at
your patch soon.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-14 Thread David Rowley
On Wed, 13 Apr 2022 at 23:19, John Naylor  wrote:
> More broadly than the regression, Thomas' is very often the fastest of
> all, at the cost of more binary size. David's is occasionally slower
> than v15 or v15 with revert, but much of that is a slight difference
> and some is probably noise.

Just to get an opinion from some other hardware, I've run your test
script on my AMD 3990x machine.

My opinion here is that the best thing we can learn from both of our
results is, do the patches fix the regression?

I don't believe it should be about if adding the additional
specializations performs better than skipping the tie break function
call.  I think it's pretty obvious that the specializations will be
faster.  I think if it was decided that v16 would be the version where
more work should be done to decide on what should be specialized and
what shouldn't be, then we shouldn't let this regression force our
hand to make that choice now. It'll be pretty hard to remove any
specializations once they've been in a released version of Postgres.

David


qsort-fix-regression-dgr1.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: A qsort template

2022-04-13 Thread John Naylor
As promised, I've done another round of tests (script and spreadsheet
attached) with

- v15 with 6974924347 and cc58eecc5d reverted
- v15 with Thomas' patch
- v15 with David's patch
- v15 as is ("std")

...where v15 is at 7b735f8b52ad. This time I limited it to int,
bigint, and text types.

Since more cases now use random distributions,  I also took some
measures to tighten up the measurements:

- Reuse the same random distribution for all tests where the input is
randomized, by invoking the script with/without a second parameter
- For the text case, use lpadded ints so that lexicographic order is
the same as numeric order.

I verified David's mod100 test case and added most test cases from the
Orson Peters paper I mentioned above. I won't explain all of them
here, but the low cardinality ones are randomized sets of:

- mod8
- dupsq: x mod sqrt(n) , for 10 million about 3 thousand distinct values
- dup8: (x**8 + n/2) mod n , for 10 million about 80 thousand distinct
values, about 80% with 64 duplicates and 20% with 256 duplicates

All the clear regressions I can see in v15 are in the above for one or
more query types / data types, and both Thomas and David's patches
restore performance for those.

More broadly than the regression, Thomas' is very often the fastest of
all, at the cost of more binary size. David's is occasionally slower
than v15 or v15 with revert, but much of that is a slight difference
and some is probably noise.

-- 
John Naylor
EDB: http://www.enterprisedb.com


qsort-fix-regression-jcn1.ods
Description: application/vnd.oasis.opendocument.spreadsheet


sort-bench-int-text-plus-low-card.sh
Description: application/shellscript


Re: A qsort template

2022-04-11 Thread David Rowley
On Mon, 11 Apr 2022 at 22:11, John Naylor  wrote:
>
> On Mon, Apr 11, 2022 at 5:34 AM David Rowley  wrote:
>
> > With this particular test, v15 is about 15% *slower* than v14.  I
> > didn't know what to blame at first, so I tried commenting out the sort
> > specialisations and got the results in the red bars in the graph. This
> > made it about 7.5% *faster* than v14. So looks like this patch is to
> > blame.  I then hacked the comparator function that's used in the
> > specialisations for BIGINT to comment out the tiebreak to remove the
> > indirect function call, which happens to do nothing in this 1 column
> > sort case.  The aim here was to get an idea what the performance would
> > be if there was a specialisation for single column sorts. That's the
> > yellow bars, which show about 10% *faster* than master.
>
> Thanks for investigating! (I assume you meant 10% faster than v14?)

Yes, I did mean to say v14.   (I'm too used to comparing everything to master)

David




Re: A qsort template

2022-04-11 Thread John Naylor
On Mon, Apr 11, 2022 at 5:34 AM David Rowley  wrote:

> With this particular test, v15 is about 15% *slower* than v14.  I
> didn't know what to blame at first, so I tried commenting out the sort
> specialisations and got the results in the red bars in the graph. This
> made it about 7.5% *faster* than v14. So looks like this patch is to
> blame.  I then hacked the comparator function that's used in the
> specialisations for BIGINT to comment out the tiebreak to remove the
> indirect function call, which happens to do nothing in this 1 column
> sort case.  The aim here was to get an idea what the performance would
> be if there was a specialisation for single column sorts. That's the
> yellow bars, which show about 10% *faster* than master.

Thanks for investigating! (I assume you meant 10% faster than v14?)

On Mon, Apr 11, 2022 at 4:55 AM Peter Geoghegan  wrote:

> The B quicksort implementation that we adopted is generally
> extremely fast for that case, since it uses 3 way partitioning (based
> on the Dutch National Flag algorithm). This essentially makes sorting
> large groups of duplicates take only linear time (not linearithmic
> time).

In the below thread, I wondered if it still counts as extremely fast
nowadays. I hope to give an answer to that during next cycle. Relevant
to the open item, the paper linked there has a variety of
low-cardinality cases. I'll incorporate them in a round of tests soon.

https://www.postgresql.org/message-id/cafbsxshanjtsx9dnjppxjxwg3bu+yq6pnmsfpm0uvyuafdw...@mail.gmail.com

On Mon, Apr 11, 2022 at 4:44 AM Thomas Munro  wrote:

> Upthread we were discussing which variations it'd be worth investing
> extra text segment space on to gain speedup and we put those hard
> decisions off for future work, but on reflection, we probably should
> tackle this particular point to avoid a regression.  I think something
> like the attached achieves that (draft, not tested much yet, could
> perhaps find a tidier way to code the decision tree).  In short:
> variants qsort_tuple_{int32,signed,unsigned}() no longer fall back,
> but new variants qsort_tuple_{int32,signed,unsigned}_tiebreak() do.

Looks good at a glance, I will get some numbers after modifying my test scripts.

> We should perhaps also reconsider the other XXX comment about finding
> a way to skip the retest of column 1 in the tiebreak comparator.
> Perhaps you'd just install a different comparetup function, eg
> comparetup_index_btree_tail (which would sharing code), so no need to
> multiply specialisations for that.

If we need to add these cases to avoid regression, it makes sense to
make them work as well as we reasonably can.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-10 Thread David Rowley
On Mon, 11 Apr 2022 at 09:44, Thomas Munro  wrote:
> David Rowley privately reported a performance regression when sorting
> single ints with a lot of duplicates, in a case that previously hit
> qsort_ssup() but now hits qsort_tuple_int32() and then has to call the
> tiebreaker comparator.  Note that this comes up only for sorts in a
> query, not for eg index builds which always have to tiebreak on item
> ptr.  I don't have data right now but that'd likely be due to:

I've now added this as an open item for v15.

David




Re: A qsort template

2022-04-10 Thread Peter Geoghegan
On Sun, Apr 10, 2022 at 2:44 PM Thomas Munro  wrote:
> David Rowley privately reported a performance regression when sorting
> single ints with a lot of duplicates, in a case that previously hit
> qsort_ssup() but now hits qsort_tuple_int32() and then has to call the
> tiebreaker comparator.

That's not good.

The B quicksort implementation that we adopted is generally
extremely fast for that case, since it uses 3 way partitioning (based
on the Dutch National Flag algorithm). This essentially makes sorting
large groups of duplicates take only linear time (not linearithmic
time).

-- 
Peter Geoghegan




Re: A qsort template

2022-04-10 Thread Thomas Munro
Hi,

David Rowley privately reported a performance regression when sorting
single ints with a lot of duplicates, in a case that previously hit
qsort_ssup() but now hits qsort_tuple_int32() and then has to call the
tiebreaker comparator.  Note that this comes up only for sorts in a
query, not for eg index builds which always have to tiebreak on item
ptr.  I don't have data right now but that'd likely be due to:

+ * XXX: For now, there is no specialization for cases where datum1 is
+ * authoritative and we don't even need to fall back to a callback at all (that
+ * would be true for types like int4/int8/timestamp/date, but not true for
+ * abbreviations of text or multi-key sorts.  There could be!  Is it worth it?

Upthread we were discussing which variations it'd be worth investing
extra text segment space on to gain speedup and we put those hard
decisions off for future work, but on reflection, we probably should
tackle this particular point to avoid a regression.  I think something
like the attached achieves that (draft, not tested much yet, could
perhaps find a tidier way to code the decision tree).  In short:
variants qsort_tuple_{int32,signed,unsigned}() no longer fall back,
but new variants qsort_tuple_{int32,signed,unsigned}_tiebreak() do.

We should perhaps also reconsider the other XXX comment about finding
a way to skip the retest of column 1 in the tiebreak comparator.
Perhaps you'd just install a different comparetup function, eg
comparetup_index_btree_tail (which would sharing code), so no need to
multiply specialisations for that.

Planning to look at this more closely after I've sorted out some other
problems, but thought I'd post this draft/problem report early in case
John or others have thoughts or would like to run some experiments.
From 3ff95aedb1065393a0ae1496cedb463bc998a329 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 6 Apr 2022 11:38:31 +1200
Subject: [PATCH] Specialize tuplesort for keys without tiebreaker.

Commit 69749243 noted a future opportunity to avoid calling the
comparator function in the case of single column non-abbreviated sort,
where datum1 alone is enough to decide the order.

David Rowley reported a performance regression in single-column integer
sorts with a lot of duplicates, where previously the qsort_ssup()
specialization was reached, due to extra calls to the comparator.  It
seems like a good idea to fix that, so let's complete that TODO item
now.

XXX WIP, is the if-then-else getting too ugly, maybe switch to a table of
functions to search?
XXX needs testing, experiments
---
 src/backend/utils/sort/tuplesort.c | 110 +++--
 1 file changed, 89 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 571fb95532..dc16e6c242 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -312,6 +312,14 @@ struct Tuplesortstate
 	 */
 	bool		haveDatum1;
 
+	/*
+	 * Whether the comparator function must always be called for tuples with
+	 * equal datum1.  This should be set to true if the sorting has extra
+	 * 'invisible' sort order beyond the regular keys, to disable optimizations
+	 * that would otherwise skip the function call.
+	 */
+	bool		alwaysTiebreak;
+
 	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
@@ -682,11 +690,6 @@ static void tuplesort_updatemax(Tuplesortstate *state);
  *
  * XXX: For now, these fall back to comparator functions that will compare the
  * leading datum a second time.
- *
- * XXX: For now, there is no specialization for cases where datum1 is
- * authoritative and we don't even need to fall back to a callback at all (that
- * would be true for types like int4/int8/timestamp/date, but not true for
- * abbreviations of text or multi-key sorts.  There could be!  Is it worth it?
  */
 
 /* Used if first key's comparator is ssup_datum_unsigned_compare */
@@ -740,10 +743,12 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
  * qsort_ssup() is specialized for the case where the comparetup function
  * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts
  * and Datum sorts.  qsort_tuple_{unsigned,signed,int32} are specialized for
- * common comparison functions on pass-by-value leading datums.
+ * common comparison functions on pass-by-value leading datums.  The _tiebreak
+ * variants are for when it is necessary to call the comparator function even
+ * if datum1 is equal.
  */
 
-#define ST_SORT qsort_tuple_unsigned
+#define ST_SORT qsort_tuple_unsigned_tiebreak
 #define ST_ELEMENT_TYPE SortTuple
 #define ST_COMPARE(a, b, state) qsort_tuple_unsigned_compare(a, b, state)
 #define ST_COMPARE_ARG_TYPE Tuplesortstate
@@ -752,7 +757,7 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 #define ST_DEFINE
 #include "lib/sort_template.h"
 

Re: A qsort template

2022-04-06 Thread John Naylor
Here is the updated insertion sort threshold patch based on Thomas'
experimental v4 0010, with adjusted regression test output. I only
found a couple places where it could make sense to add sort keys to
test queries, but 1) not enough to make a big difference and 2) the
adjustments looked out of place, so I decided to just update all the
regression tests in one go. Since the patch here is a bit more (and
less) involved than Thomas' 0010, I'm going to refrain from committing
until it gets review. If not in the next couple days, I will bring it
up at the beginning of the v16 cycle.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 74b934bc5ed8f6733c064c0ef832e1aa9949f216 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 6 Apr 2022 16:38:28 +0700
Subject: [PATCH v5] Raise qsort insertion sort threshold

Our qsort algorithm is from NetBSD and is described in the 1993 paper
"Engineering a Sort Function". It includes three hard coded thresholds that
control the point at which we drop to insertion sort, and whether we use 1,
3 or 9 samples for choosing a pivot. The insertion sort threshold of 7 was
chosen by testing on hardware available in 1993.  Our testing has shown
that hardware from 2010 to 2020 prefer a larger threshold. The ideal value
varies with input distribution, but 20 seems to be a good compromise that
works well with all of them. This is more in line with thresholds found
in other software as well.

Since the new insertion sort threshold is larger then the singleton range
where a single sample is chosen for the pivot, get rid of the middle
range. There are now two thresholds.

Also use macros intead of hard-coded values. This improves readability
and enables specializing the thresholds if desired.

The changes in the regression tests are needed for the change in sort
stability when the sort key contains duplicates.

Thomas Munro and John Naylor

Discussion:
https://www.postgresql.org/message-id/CA%2BhUKGJhOtjQH%2BwjCodtJzHAFCAPYyt6Qm9E1mUoJph42qo1hg%40mail.gmail.com
Discussion:
https://www.postgresql.org/message-id/CAFBsxsHr-C1xqjUMjeUN3-FvNzKiAt3gcfBKt8PFN2mov7z2gQ%40mail.gmail.com
---
 .../expected/pg_stat_statements.out   |   6 +-
 src/include/lib/sort_template.h   |  38 +-
 src/test/regress/expected/create_index.out|   2 +-
 src/test/regress/expected/geometry.out|   2 +-
 src/test/regress/expected/groupingsets.out|   4 +-
 src/test/regress/expected/inet.out|   4 +-
 src/test/regress/expected/join.out|   2 +-
 src/test/regress/expected/sqljson.out |  10 +-
 src/test/regress/expected/tsrf.out|  28 +-
 src/test/regress/expected/tuplesort.out   |  10 +-
 src/test/regress/expected/window.out  | 542 +-
 11 files changed, 331 insertions(+), 317 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index e0abe34bb6..aeb8f04aea 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -169,12 +169,12 @@ SELECT *
 SELECT * FROM test ORDER BY a;
  a |  b   
 ---+--
- 1 | a   
  1 | 111 
- 2 | b   
+ 1 | a   
  2 | 222 
- 3 | c   
+ 2 | b   
  3 | 333 
+ 3 | c   
  4 | 444 
  5 | 555 
  6 | 666 
diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
index 3122a93009..7b92b2c084 100644
--- a/src/include/lib/sort_template.h
+++ b/src/include/lib/sort_template.h
@@ -40,6 +40,11 @@
  *
  *	  - ST_COMPARE_ARG_TYPE - type of extra argument
  *
+ *	  Optional macros for tuning algorithm choices:
+ *
+ *	  - ST_THRESHOLD_INSERTION_SORT - below this size we do insertion sort
+ *	  - ST_THRESHOLD_MED9 - above this size we partition with med9, otherwise with med3
+ *
  *	  The prototype of the generated sort function is:
  *
  *	  void ST_SORT(ST_ELEMENT_TYPE *data, size_t n,
@@ -172,6 +177,14 @@
 #define ST_SORT_INVOKE_ARG
 #endif
 
+/* Default input size thresholds to control algorithm behavior. */
+#ifndef ST_THRESHOLD_INSERTION_SORT
+#define ST_THRESHOLD_INSERTION_SORT 20
+#endif
+#ifndef ST_THRESHOLD_MED9
+#define ST_THRESHOLD_MED9 40
+#endif
+
 #ifdef ST_DECLARE
 
 #ifdef ST_COMPARE_RUNTIME_POINTER
@@ -296,7 +309,7 @@ ST_SORT(ST_ELEMENT_TYPE * data, size_t n
 
 loop:
 	DO_CHECK_FOR_INTERRUPTS();
-	if (n < 7)
+	if (n < ST_THRESHOLD_INSERTION_SORT)
 	{
 		for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
 			 pm += ST_POINTER_STEP)
@@ -318,21 +331,20 @@ loop:
 	}
 	if (presorted)
 		return;
+	/* Partition with median of three for medium input. */
+	pl = a;
 	pm = a + (n / 2) * ST_POINTER_STEP;
-	if (n > 7)
+	pn = a + (n - 1) * ST_POINTER_STEP;
+	if (n > 

Re: A qsort template

2022-04-03 Thread Thomas Munro
On Mon, Apr 4, 2022 at 4:32 AM Andres Freund  wrote:
> Would be good to get this committed soon, so we can see further ubsan
> violations introduced in the next few days (and so I can unblock my local dev
> tests :P).

Pushed (with a minor tweak).




Re: A qsort template

2022-04-03 Thread Andres Freund
Hi,

On 2022-04-03 17:46:28 +1200, Thomas Munro wrote:
> On Sun, Apr 3, 2022 at 11:11 AM Andres Freund  wrote:
> > On 2022-04-03 09:45:13 +1200, Thomas Munro wrote:
> > > I think we just need to decide up front if we're in a situation that
> > > can't provide datum1/isnull1 (in this case because it's an expression
> > > index), and skip the optimised paths.  Here's an experimental patch...
> > > still looking into whether there are more cases like this...
> 
> I didn't find anything else.
> 
> Maybe it'd be better if we explicitly declared whether datum1 is used
> in each tuplesort mode's 'begin' function, right next to the code that
> installs the set of routines that are in control of that?  Trying that
> in this version.  Is it clearer what's going on like this?

Seems an improvement.


> > I'm a bit worried that none of the !ubsan tests failed on this...
> 
> In accordance with whoever-it-was-that-said-that's law about things
> that aren't tested, this are turned out to be broken already[1].

Yea :/.


Would be good to get this committed soon, so we can see further ubsan
violations introduced in the next few days (and so I can unblock my local dev
tests :P).

Greetings,

Andres Freund




Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 11:11 AM Andres Freund  wrote:
> On 2022-04-03 09:45:13 +1200, Thomas Munro wrote:
> > I think we just need to decide up front if we're in a situation that
> > can't provide datum1/isnull1 (in this case because it's an expression
> > index), and skip the optimised paths.  Here's an experimental patch...
> > still looking into whether there are more cases like this...

I didn't find anything else.

Maybe it'd be better if we explicitly declared whether datum1 is used
in each tuplesort mode's 'begin' function, right next to the code that
installs the set of routines that are in control of that?  Trying that
in this version.  Is it clearer what's going on like this?

> That's a lot of redundant checks. How about putting all the checks for
> optimized paths into one if (state->sortKeys && !state->disabl1e_datum1)?

OK, sure.

> I'm a bit worried that none of the !ubsan tests failed on this...

In accordance with whoever-it-was-that-said-that's law about things
that aren't tested, this are turned out to be broken already[1].  Once
we fix that we should have a new test in the three that might also
eventually have failed under this UB, given enough chaos.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BbA%2BbmwD36_oDxAoLrCwZjVtST2fqe%3Db4%3DqZcmU7u89A%40mail.gmail.com
From b0a79f91edc6ce305f77373b92f31100fcad07d5 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 3 Apr 2022 09:41:04 +1200
Subject: [PATCH v2] Fix tuplesort optimizations for expression-based CLUSTER.

When redirecting sorts to specialized variants, commit 69749243 failed
to handle the case where CLUSTER-sort decides not to initialize datum1
and isnull1.  Fix by hoisting that decision up a level and advertising
whether datum1 can be used, in the Tuplesortstate object.

Per reports from UBsan and Valgrind while running the cluster.sql test.

Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CAFBsxsF1TeK5Fic0M%2BTSJXzbKsY6aBqJGNj6ptURuB09ZF6k_w%40mail.gmail.com
---
 src/backend/utils/sort/tuplesort.c | 73 ++
 1 file changed, 53 insertions(+), 20 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 361527098f..58441ed638 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -306,6 +306,12 @@ struct Tuplesortstate
 	void		(*readtup) (Tuplesortstate *state, SortTuple *stup,
 			LogicalTape *tape, unsigned int len);
 
+	/*
+	 * Whether SortTuple's datum1 and isnull1 members are maintained by the
+	 * above routines.  If not, some sort specializations are disabled.
+	 */
+	bool		haveDatum1;
+
 	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
@@ -1016,6 +1022,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->haveDatum1 = true;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -1095,6 +1102,15 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->indexInfo = BuildIndexInfo(indexRel);
 
+	/*
+	 * If we don't have a simple leading attribute, we can't easily initialize
+	 * datum1, so disable optimizations based on datum1.
+	 */
+	if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
+		state->haveDatum1 = false;
+	else
+		state->haveDatum1 = true;
+
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -1188,6 +1204,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
 	state->abbrevNext = 10;
+	state->haveDatum1 = true;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1262,6 +1279,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
+	state->haveDatum1 = true;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1302,6 +1320,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
+	state->haveDatum1 = true;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1366,6 +1385,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
 	state->abbrevNext = 10;
+	state->haveDatum1 = true;
 
 	state->datumType = datumType;
 
@@ -3593,27 +3613,40 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 
 	if (state->memtupcount > 1)
 	{
-		/* Do we have a specialization for the leading column's comparator? */
-		if (state->sortKeys &&
-			state->sortKeys[0].comparator == ssup_datum_unsigned_cmp)
-		{
-			elog(DEBUG1, "qsort_tuple_unsigned");
-			qsort_tuple_unsigned(state->memtuples, state->memtupcount, 

Re: A qsort template

2022-04-02 Thread Andres Freund
Hi,

On 2022-04-03 09:45:13 +1200, Thomas Munro wrote:
> On Sun, Apr 3, 2022 at 9:03 AM Andres Freund  wrote:
> > It's certainly not pretty that copytup_cluster() can use SortTuples without
> > actually using SortTuples. Afaics it basically only computes isnull1/datum1 
> > if
> > state->indexInfo->ii_IndexAttrNumbers[0] == 0.
> 
> I think we just need to decide up front if we're in a situation that
> can't provide datum1/isnull1 (in this case because it's an expression
> index), and skip the optimised paths.  Here's an experimental patch...
> still looking into whether there are more cases like this...

That's a lot of redundant checks. How about putting all the checks for
optimized paths into one if (state->sortKeys && !state->disable_datum1)?

I'm a bit worried that none of the !ubsan tests failed on this...

Greetings,

Andres Freund




Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 9:03 AM Andres Freund  wrote:
> It's certainly not pretty that copytup_cluster() can use SortTuples without
> actually using SortTuples. Afaics it basically only computes isnull1/datum1 if
> state->indexInfo->ii_IndexAttrNumbers[0] == 0.

I think we just need to decide up front if we're in a situation that
can't provide datum1/isnull1 (in this case because it's an expression
index), and skip the optimised paths.  Here's an experimental patch...
still looking into whether there are more cases like this...

(There's also room to recognise when you don't even need to look at
isnull1 for a less branchy optimised sort, but that was already
discussed and put off for later.)
From a8a7c9ec4f0b96ed9d889d008731864f3d4e87d1 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 3 Apr 2022 09:41:04 +1200
Subject: [PATCH] WIP: Fix tuplesort optimizations for expression-based
 CLUSTER.

---
 src/backend/utils/sort/tuplesort.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 361527098f..34714d9baf 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -268,6 +268,8 @@ struct Tuplesortstate
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
 
+	bool		disable_datum1;	/* disable leading value-based optimizations */
+
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are sorting from the routines that don't need to know it.
@@ -1095,6 +1097,13 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->indexInfo = BuildIndexInfo(indexRel);
 
+	/*
+	 * If we don't have a simple attribute, disable the use of datum1/isnull1.
+	 * copytup_cluster() doesn't know how to compute expressions.
+	 */
+	if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
+		state->disable_datum1 = true;
+
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -3593,20 +3602,27 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 
 	if (state->memtupcount > 1)
 	{
-		/* Do we have a specialization for the leading column's comparator? */
+		/*
+		 * Do we have a specialization for the leading column's comparator,
+		 * and has the leading column's value or abbreviation been stored in
+		 * datum1/isnull1?
+		 */
 		if (state->sortKeys &&
+			!state->disable_datum1 &&
 			state->sortKeys[0].comparator == ssup_datum_unsigned_cmp)
 		{
 			elog(DEBUG1, "qsort_tuple_unsigned");
 			qsort_tuple_unsigned(state->memtuples, state->memtupcount, state);
 		}
 		else if (state->sortKeys &&
+ !state->disable_datum1 &&
  state->sortKeys[0].comparator == ssup_datum_signed_cmp)
 		{
 			elog(DEBUG1, "qsort_tuple_signed");
 			qsort_tuple_signed(state->memtuples, state->memtupcount, state);
 		}
 		else if (state->sortKeys &&
+ !state->disable_datum1 &&
  state->sortKeys[0].comparator == ssup_datum_int32_cmp)
 		{
 			elog(DEBUG1, "qsort_tuple_int32");
@@ -4134,7 +4150,7 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 	 * set up first-column key value, and potentially abbreviate, if it's a
 	 * simple column
 	 */
-	if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
+	if (state->disable_datum1)
 		return;
 
 	original = heap_getattr(tuple,
-- 
2.35.1



Re: A qsort template

2022-04-02 Thread Andres Freund
Hi,

On 2022-04-02 15:20:27 -0500, Justin Pryzby wrote:
> On Sat, Apr 02, 2022 at 06:41:30PM +0700, John Naylor wrote:
> > On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> > > Reproduced locally, using the same few lines from the cluster.sql
> > > test.  I'll try to dig more tomorrow...
> > 
> > Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
> > with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...
> 
> Like Thomas just said, I had to use:
> CFLAGS="-Og -fsanitize=undefined,alignment -fno-sanitize-recover=all
> 
> I'm a couple few steps out of my league here, but it may be an issue with:
> 
> commit 4ea51cdfe85ceef8afabceb03c446574daa0ac23
> Author: Robert Haas 
> Date:   Mon Jan 19 15:20:31 2015 -0500
> 
> Use abbreviated keys for faster sorting of text datums.
> 
> This is enough to avoid the crash, which might be a useful hint..
>
> @@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple 
> *stup, void *tup)
> /*
>  * set up first-column key value, and potentially abbreviate, if it's 
> a
>  * simple column
>  */
> +   stup->isnull1 = false;
> if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
> return;
>  
> original = heap_getattr(tuple,
> 
> state->indexInfo->ii_IndexAttrNumbers[0],
> state->tupDesc,
> >isnull1);

I don't think that can be correct - the column can be NULL afaics. And I don't
think in that patch it's needed, because it always goes through ->comparetup()
when state->onlyKey isn't explicitly set. Which tuplesort_begin_cluster() as
well as several others don't.  And you'd just sort an uninitialized datum
immediately after.

It's certainly not pretty that copytup_cluster() can use SortTuples without
actually using SortTuples. Afaics it basically only computes isnull1/datum1 if
state->indexInfo->ii_IndexAttrNumbers[0] == 0.

Greetings,

Andres Freund




Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 8:20 AM Justin Pryzby  wrote:
> @@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple 
> *stup, void *tup)

> +   stup->isnull1 = false;

Looks like I might have failed to grok the scheme for encoding null
into SortTuple objects.  It's clearly uninitialised in some paths,
with a special 0 value in datum1.  Will need to look more closely with
more coffee...




Re: A qsort template

2022-04-02 Thread Andres Freund
Hi,

On 2022-04-03 08:07:58 +1200, Thomas Munro wrote:
> On Sun, Apr 3, 2022 at 12:41 AM John Naylor
>  wrote:
> > On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> > > Reproduced locally, using the same few lines from the cluster.sql
> > > test.  I'll try to dig more tomorrow...
> >
> > Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
> > with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...
> 
> Maybe you need to add -fno-sanitize-recover=all to make it crash,
> otherwise it just prints the warning and keeps going.

I commented with a few more details on 
https://postgr.es/m/20220402201557.thanbsxcql5lk6pc%40alap3.anarazel.de
and an preliminary analysis in
https://www.postgresql.org/message-id/20220402203344.ahup2u5n73cdbbcv%40alap3.anarazel.de

Greetings,

Andres Freund




Re: A qsort template

2022-04-02 Thread Justin Pryzby
On Sat, Apr 02, 2022 at 06:41:30PM +0700, John Naylor wrote:
> On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> > Reproduced locally, using the same few lines from the cluster.sql
> > test.  I'll try to dig more tomorrow...
> 
> Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
> with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...

Like Thomas just said, I had to use:
CFLAGS="-Og -fsanitize=undefined,alignment -fno-sanitize-recover=all

I'm a couple few steps out of my league here, but it may be an issue with:

commit 4ea51cdfe85ceef8afabceb03c446574daa0ac23
Author: Robert Haas 
Date:   Mon Jan 19 15:20:31 2015 -0500

Use abbreviated keys for faster sorting of text datums.

This is enough to avoid the crash, which might be a useful hint..

@@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, 
void *tup)
/*
 * set up first-column key value, and potentially abbreviate, if it's a
 * simple column
 */
+   stup->isnull1 = false;
if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
return;
 
original = heap_getattr(tuple,

state->indexInfo->ii_IndexAttrNumbers[0],
state->tupDesc,
>isnull1);




Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 12:41 AM John Naylor
 wrote:
> On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> > Reproduced locally, using the same few lines from the cluster.sql
> > test.  I'll try to dig more tomorrow...
>
> Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
> with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...

Maybe you need to add -fno-sanitize-recover=all to make it crash,
otherwise it just prints the warning and keeps going.




Re: A qsort template

2022-04-02 Thread John Naylor
On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> Reproduced locally, using the same few lines from the cluster.sql
> test.  I'll try to dig more tomorrow...

Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-02 Thread John Naylor
On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> It looks like UBsan sees a problem, per BF animal kestrel:
>
> /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722:51:
> runtime error: load of value 96, which is not a valid value for type
> 'bool'

Yeah, same with tamandua. Then, skink (a Valgrind animal) shows:

==1940791== VALGRINDERROR-BEGIN
==1940791== Conditional jump or move depends on uninitialised value(s)
==1940791==at 0x73D394: ApplyInt32SortComparator (sortsupport.h:311)
==1940791==by 0x73D394: qsort_tuple_int32_compare (tuplesort.c:722)
==1940791==by 0x73D394: qsort_tuple_int32 (sort_template.h:313)
==1940791==by 0x7409BC: tuplesort_sort_memtuples (tuplesort.c:3613)
==1940791==by 0x742806: tuplesort_performsort (tuplesort.c:2154)
==1940791==by 0x23C109: heapam_relation_copy_for_cluster
(heapam_handler.c:955)
==1940791==by 0x35799A: table_relation_copy_for_cluster (tableam.h:1658)
==1940791==by 0x35799A: copy_table_data (cluster.c:913)
==1940791==by 0x359016: rebuild_relation (cluster.c:606)
==1940791==by 0x35914E: cluster_rel (cluster.c:427)
==1940791==by 0x3594EB: cluster (cluster.c:195)
==1940791==by 0x5C73FF: standard_ProcessUtility (utility.c:862)
==1940791==by 0x5C78D0: ProcessUtility (utility.c:530)
==1940791==by 0x5C4C7B: PortalRunUtility (pquery.c:1158)
==1940791==by 0x5C4F78: PortalRunMulti (pquery.c:1315)
==1940791==  Uninitialised value was created by a stack allocation
==1940791==at 0x74224E: tuplesort_putheaptuple (tuplesort.c:1800)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sat, Apr 2, 2022 at 9:38 PM John Naylor  wrote:
> On Fri, Apr 1, 2022 at 4:43 AM Thomas Munro  wrote:
> > On Thu, Mar 31, 2022 at 11:09 PM John Naylor
> >  wrote:
> > > In a couple days I'm going to commit the v3 patch "accelerate tuple
> > > sorting for common types" as-is after giving it one more look, barring
> > > objections.
>
> Pushed.

It looks like UBsan sees a problem, per BF animal kestrel:

/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722:51:
runtime error: load of value 96, which is not a valid value for type
'bool'

#5  0x00eb65d4 in qsort_tuple_int32_compare (a=0x4292ce0,
b=0x4292cf8, state=0x4280130) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722
#6  qsort_tuple_int32 (data=, n=133,
arg=arg@entry=0x4280130) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/include/lib/sort_template.h:313
#7  0x00eaf747 in tuplesort_sort_memtuples
(state=state@entry=0x4280130) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:3613
#8  0x00eaedcb in tuplesort_performsort
(state=state@entry=0x4280130) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:2154
#9  0x00573d60 in heapam_relation_copy_for_cluster
(OldHeap=, NewHeap=, OldIndex=, use_sort=, OldestXmin=11681,
xid_cutoff=, multi_cutoff=0x7ffecb0cfa70,
num_tuples=0x7ffecb0cfa38, tups_vacuumed=0x7ffecb0cfa20,
tups_recently_dead=0x7ffecb0cfa28) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/access/heap/heapam_handler.c:955

Reproduced locally, using the same few lines from the cluster.sql
test.  I'll try to dig more tomorrow...




Re: A qsort template

2022-04-02 Thread John Naylor
I wrote:

> I started towards incorporating the change in insertion sort threshold
> (part of 0010), but that caused regression test failures, so that will
> have to wait for a bit of analysis and retesting. (My earlier tests
> were done in a separate module.)

The failures seem to be where sort order is partially specified. E.g.
ORDER BY col_a, where there are duplicates there and other columns are
different. Insertion sort is stable IIRC, so moving the threshold
caused different orders in these cases. Some cases can be conveniently
fixed with additional columns in the ORDER BY clause. I'll go through
the failures and see how much can be cleaned up as a preparatory
refactoring.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-02 Thread John Naylor
On Fri, Apr 1, 2022 at 4:43 AM Thomas Munro  wrote:
>
> On Thu, Mar 31, 2022 at 11:09 PM John Naylor
>  wrote:
> > In a couple days I'm going to commit the v3 patch "accelerate tuple
> > sorting for common types" as-is after giving it one more look, barring
> > objections.

Pushed.

> Hi John,
>
> Thanks so much for all the work you've done here!  I feel bad that I
> lobbed so many experimental patches in here and then ran away due to
> lack of cycles.  That particular patch (the one cfbot has been chewing
> on all this time) does indeed seem committable, despite the
> deficiencies/opportunities listed in comments.  It's nice to reduce
> code duplication, it gives the right answers, and it goes faster.

Thanks for chiming in! It gives me more confidence that there wasn't
anything amiss that may have gone unnoticed. And no worries -- my own
review efforts here have been sporadic. ;-)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-03-31 Thread Thomas Munro
On Thu, Mar 31, 2022 at 11:09 PM John Naylor
 wrote:
> In a couple days I'm going to commit the v3 patch "accelerate tuple
> sorting for common types" as-is after giving it one more look, barring
> objections.

Hi John,

Thanks so much for all the work you've done here!  I feel bad that I
lobbed so many experimental patches in here and then ran away due to
lack of cycles.  That particular patch (the one cfbot has been chewing
on all this time) does indeed seem committable, despite the
deficiencies/opportunities listed in comments.  It's nice to reduce
code duplication, it gives the right answers, and it goes faster.

> I started towards incorporating the change in insertion sort threshold
> (part of 0010), but that caused regression test failures, so that will
> have to wait for a bit of analysis and retesting. (My earlier tests
> were done in a separate module.)
>
> The rest in this series that I looked at closely were either
> refactoring or could use some minor tweaks so likely v16 material.

Looking forward to it.




Re: A qsort template

2022-03-31 Thread John Naylor
In a couple days I'm going to commit the v3 patch "accelerate tuple
sorting for common types" as-is after giving it one more look, barring
objections.

I started towards incorporating the change in insertion sort threshold
(part of 0010), but that caused regression test failures, so that will
have to wait for a bit of analysis and retesting. (My earlier tests
were done in a separate module.)

The rest in this series that I looked at closely were either
refactoring or could use some minor tweaks so likely v16 material.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-02-02 Thread John Naylor
I wrote:

> > 0010 - Thresholds on my TODO list.
>
> I did some basic tests on the insertion sort thresholds, and it looks
> like we could safely and profitably increase the current value from 7
> to 20 or so, in line with other more recent implementations. I've
> attached an addendum on top of 0012 and the full test results on an
> Intel Coffee Lake machine with gcc 11.1. I found that the object test
> setup in 0012 had some kind of bug that was comparing the pointer of
> the object array. Rather than fix that, I decided to use Datums, but
> with the two extremes in comparator: simple branching with machine
> instructions vs. a SQL-callable function. The papers I've read
> indicate the results for Datum sizes would not be much different for
> small structs. The largest existing sort element is SortTuple, but
> that's only 24 bytes and has a bulky comparator as well.
>
> The first thing to note is that I rejected outright any testing of a
> "middle value" where the pivot is simply the middle of the array. Even
> the Bently and McIlroy paper which is the reference for our
> implementation says "The range that consists of the single integer 7
> could be eliminated, but has been left adjustable because on some
> machines larger ranges are a few percent better".
>
> I tested thresholds up to 64, which is where I guessed results to get
> worse (most implementations are smaller than that). Here are the best
> thresholds at a quick glance:
>
> - elementary comparator:
>
> random: 16 or greater
> decreasing, rotate: get noticeably better all the way up to 64
> organ: little difference, but seems to get better all the way up to 64
> 0/1: seems to get worse above 20
>
> - SQL-callable comparator:
>
> random: between 12 and 20, but slight differences until 32
> decreasing, rotate: get noticeably better all the way up to 64
> organ: seems best at 12, but slight differences until 32
> 0/1: slight differences
>
> Based on these tests and this machine, it seems 20 is a good default
> value. I'll repeat this test on one older Intel and one non-Intel
> platform with older compilers.

The above was an Intel Comet Lake / gcc 11, and I've run the same test
on a Haswell-era Xeon / gcc 8 and a Power8 machine / gcc 4.8. The
results on those machines are pretty close to the above (full results
attached). The noticeable exception is the Power8 on random input with
a slow comparator -- those measurements there are more random than
others so we can't draw conclusions from them, but the deviations are
small in any case. I'm still thinking 20 or so is about right.

I've put a lot out here recently, so I'll take a break now and come
back in a few weeks.

(no running tally here because the conclusions haven't changed since
last message)
-- 
John Naylor
EDB: http://www.enterprisedb.com
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=0, time=0.130188
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=1, time=0.124503
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=2, time=0.124557
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=0, time=0.119103
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=1, time=0.119035
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=2, time=0.086948
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=0, time=0.082710
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=1, time=0.082771
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=2, time=0.083004
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=0, time=0.080768
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=1, time=0.080764
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=2, time=0.080635
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=0, time=0.080678
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=1, time=0.080555
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=2, time=0.080604
NOTICE:  [direct] size=8MB, order=random, threshold=28, test=0, time=0.079002
NOTICE:  [direct] size=8MB, order=random, threshold=28, test=1, time=0.078901
NOTICE:  [direct] size=8MB, order=random, threshold=28, test=2, time=0.082200
NOTICE:  [direct] size=8MB, order=random, threshold=32, test=0, time=0.079317
NOTICE:  [direct] size=8MB, order=random, threshold=32, test=1, time=0.079164
NOTICE:  [direct] size=8MB, order=random, threshold=32, test=2, time=0.079308
NOTICE:  [direct] size=8MB, order=random, threshold=64, test=0, time=0.078604
NOTICE:  [direct] size=8MB, order=random, threshold=64, test=1, time=0.078718
NOTICE:  [direct] size=8MB, order=random, threshold=64, test=2, time=0.078689
NOTICE:  [direct] size=8MB, order=decreasing, threshold=7, test=0, time=0.025097
NOTICE:  [direct] size=8MB, order=decreasing, threshold=7, test=1, time=0.025078
NOTICE:  [direct] size=8MB, order=decreasing, threshold=7, test=2, time=0.025095
NOTICE:  [direct] size=8MB, order=decreasing, threshold=12, test=0, 

Re: A qsort template

2022-01-31 Thread John Naylor
I wrote:

> 0010 - Thresholds on my TODO list.

I did some basic tests on the insertion sort thresholds, and it looks
like we could safely and profitably increase the current value from 7
to 20 or so, in line with other more recent implementations. I've
attached an addendum on top of 0012 and the full test results on an
Intel Coffee Lake machine with gcc 11.1. I found that the object test
setup in 0012 had some kind of bug that was comparing the pointer of
the object array. Rather than fix that, I decided to use Datums, but
with the two extremes in comparator: simple branching with machine
instructions vs. a SQL-callable function. The papers I've read
indicate the results for Datum sizes would not be much different for
small structs. The largest existing sort element is SortTuple, but
that's only 24 bytes and has a bulky comparator as well.

The first thing to note is that I rejected outright any testing of a
"middle value" where the pivot is simply the middle of the array. Even
the Bently and McIlroy paper which is the reference for our
implementation says "The range that consists of the single integer 7
could be eliminated, but has been left adjustable because on some
machines larger ranges are a few percent better".

I tested thresholds up to 64, which is where I guessed results to get
worse (most implementations are smaller than that). Here are the best
thresholds at a quick glance:

- elementary comparator:

random: 16 or greater
decreasing, rotate: get noticeably better all the way up to 64
organ: little difference, but seems to get better all the way up to 64
0/1: seems to get worse above 20

- SQL-callable comparator:

random: between 12 and 20, but slight differences until 32
decreasing, rotate: get noticeably better all the way up to 64
organ: seems best at 12, but slight differences until 32
0/1: slight differences

Based on these tests and this machine, it seems 20 is a good default
value. I'll repeat this test on one older Intel and one non-Intel
platform with older compilers.

--
Running tally of patchset:

0001 - bsearch and unique is good to have, and we can keep the return
type pending further tests -- if none happen this cycle, suggest
committing this without the return type symbol.
0002/3 - I've yet to see a case where branchless comparators win, but
other than that, these are good. Notational improvement and not
performance sensitive.

0004/5 - Computing the arguments slows it down, but accessing the
underlying int16s gives an improvement. [1] Haven't done an in-situ
test on VACUUM. Could be worth it for pg15, since I imagine the
proposals for dead tuple storage won't be ready this cycle.
0006 - I expect this to be slower too. I also wonder if this could
also use the global function in 0004 once it's improved.

0007 - untested

0008 - Good performance in microbenchmarks, no in-situ testing.
Inlined reversal is not worth the binary space or notational overhead.

0009 - Based on 0004, I would guess that computing the arguments is
too slow. Not sure how to test in-situ to see if specializing helps.

0010 - Suggest leaving out the middle threshold and setting the
insertion sort threshold to ~20. Might also name them
ST_INSERTION_SORT_THRESHOLD and ST_NINTHER_THRESHOLD. (TODO: test on
other platforms)

0011 - Committed.

v3-0001 comparators for abbreviated keys - Clearly a win in this state
already, especially
for the "unsigned" case [2]. (gist untested) There are additional
possible improvements mentioned,
but they seem like a PG16 project(s).

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BS5SMoG8Z2PHj0bsK70CxVLgqQR1orQJq6Cjgibu26vA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFBsxsEFGAJ9eBpQVb5a86BE93WER3497zn2OT5wbjm1HHcqgA%40mail.gmail.com
(TODO: refine test)

-- 
John Naylor
EDB: http://www.enterprisedb.com
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=0, time=0.084503
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=1, time=0.087980
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=2, time=0.084299
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=0, time=0.081893
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=1, time=0.081907
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=2, time=0.081943
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=0, time=0.080810
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=1, time=0.080793
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=2, time=0.080922
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=0, time=0.080399
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=1, time=0.080433
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=2, time=0.080413
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=0, time=0.080342
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=1, time=0.080446
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=2, time=0.080339
NOTICE:  

Re: A qsort template

2022-01-27 Thread John Naylor
Hi,

I've run a few tests to get some feel for the effects of various
comparators on Datums containing int32. I've attached the full
results, as well as the (messy) patch which applies on top of 0012 to
run the tests. I'll excerpt some of those results as I go through them
here. For now, I only ran input orders of sorted, random, and
reversed.

1) Specializing

This is a win in all cases, including SQL-callable comparators (the
case here is for _bt_sort_array_elements).

NOTICE:  [traditional qsort] size=8MB, order=random, cmp=arg, test=2,
time=0.140526
NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=0, time=0.085023

NOTICE:  [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=2, time=0.256708
NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=0,
time=0.192063

2) Branchless operations

The int case is for how to perform the comparison, and the SQL case is
referring to how to reverse the sort order.Surprisingly, they don't
seem to help for direct comparisons, and in fact they seem worse. I'll
have to dig a bit deeper to be sure, but it's not looking good now.

NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=2, time=0.084781
NOTICE:  [branchless] size=8MB, order=random, cmp=branchless, test=0,
time=0.091837

NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=2,
time=0.192018
NOTICE:  [SQL inlined reverse] size=8MB, order=random,
cmp=SQL-inline-rev, test=0, time=0.190797

When the effect is reversing a list, the direct comparisons seem much
worse, and the SQL ones aren't helped.

NOTICE:  [inlined] size=8MB, order=decreasing, cmp=inline, test=2, time=0.024963
NOTICE:  [branchless] size=8MB, order=decreasing, cmp=branchless,
test=0, time=0.036423

NOTICE:  [SQL inlined] size=8MB, order=decreasing, cmp=SQL-inline,
test=0, time=0.125182
NOTICE:  [SQL inlined reverse] size=8MB, order=increasing,
cmp=SQL-inline-rev, test=0, time=0.127051

--
Since I have a couple more planned tests, I'll keep a running tally on
the current state of the patch set so that summaries are not scattered
over many emails:

0001 - bsearch and unique is good to have, and we can keep the return
type pending further tests
0002/3 - I've yet to see a case where branchless comparators win, but
other than that, these are good. Notational improvement and not
performance sensitive.

0004/5 - Computing the arguments slows it down, but accessing the
underlying int16s gives an improvement. [1] Haven't done an in-situ
test on VACUUM. Could be worth it for pg15, since I imagine the
proposals for dead tuple storage won't be ready this cycle.
0006 - I expect this to be slower too. I also wonder if this could
also use the global function in 0004 once it's improved.

0007 - untested

0008 - Good performance in microbenchmarks, no in-situ testing.
Inlined reversal is not worth the binary space or notational overhead.

0009 - Based on 0004, I would guess that computing the arguments is
too slow. Not sure how to test in-situ to see if specializing helps.

0010 - Thresholds on my TODO list.

0011 - A simple correction -- I'll go ahead and commit this.

v3-0001 comparators for abbreviated keys - Clearly a win, especially
for the "unsigned" case [2]. There are still possible improvements,
but they seem like a pg16 project(s).

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BS5SMoG8Z2PHj0bsK70CxVLgqQR1orQJq6Cjgibu26vA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFBsxsEFGAJ9eBpQVb5a86BE93WER3497zn2OT5wbjm1HHcqgA%40mail.gmail.com
(I just realized in that message I didn't attach the script for that,
and also attached an extra draft spreadsheet. I'll improve the tests
and rerun later)

-- 
John Naylor
EDB: http://www.enterprisedb.com
NOTICE:  [traditional qsort] size=8MB, order=random, cmp=arg, test=0, 
time=0.144545
NOTICE:  [traditional qsort] size=8MB, order=random, cmp=arg, test=1, 
time=0.140534
NOTICE:  [traditional qsort] size=8MB, order=random, cmp=arg, test=2, 
time=0.140526
NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=0, time=0.085023
NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=1, time=0.086370
NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=2, time=0.084781
NOTICE:  [branchless] size=8MB, order=random, cmp=branchless, test=0, 
time=0.091837
NOTICE:  [branchless] size=8MB, order=random, cmp=branchless, test=1, 
time=0.090426
NOTICE:  [branchless] size=8MB, order=random, cmp=branchless, test=2, 
time=0.091245
NOTICE:  [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=0, time=0.257024
NOTICE:  [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=1, time=0.256487
NOTICE:  [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=2, time=0.256708
NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=0, 
time=0.192063
NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=1, 
time=0.191919
NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=2, 
time=0.192018
NOTICE:  [SQL arg reverse] size=8MB, 

Re: A qsort template

2022-01-19 Thread John Naylor
On Tue, Jan 18, 2022 at 9:58 PM Peter Geoghegan  wrote:
>
> On Tue, Jan 18, 2022 at 6:39 PM John Naylor
>  wrote:
> > Editorializing the null position in queries is not very common in my
> > experience. Not null is interesting since it'd be trivial to pass
> > constant false to the same Apply[XYZ]SortComparator() and let the
> > compiler remove all those branches for us. On the other hand, those
> > branches would be otherwise predicted well, so it might make little or
> > no difference.
>
> If you were going to do this, maybe you could encode NULL directly in
> an abbreviated key. I think that that could be made to work if it was
> limited to opclasses with abbreviated keys encoded as unsigned
> integers. Just a thought.

Now that you mention that, I do remember reading about this technique
in the context of b-tree access, so it does make sense. If we had that
capability, it would be trivial to order the nulls how we want while
building the sort tuple datums, and the not-null case would be handled
automatically. And have a smaller code footprint, I think.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-01-18 Thread Peter Geoghegan
On Tue, Jan 18, 2022 at 6:39 PM John Naylor
 wrote:
> Editorializing the null position in queries is not very common in my
> experience. Not null is interesting since it'd be trivial to pass
> constant false to the same Apply[XYZ]SortComparator() and let the
> compiler remove all those branches for us. On the other hand, those
> branches would be otherwise predicted well, so it might make little or
> no difference.

If you were going to do this, maybe you could encode NULL directly in
an abbreviated key. I think that that could be made to work if it was
limited to opclasses with abbreviated keys encoded as unsigned
integers. Just a thought.

-- 
Peter Geoghegan




Re: A qsort template

2022-01-18 Thread John Naylor
I wrote:

> That said, I think what I'll do next is test the v3-0001 standalone
> patch with tuplesort specializations for more data types. I already
> have a decent test script that I can build on for this.

I've run a test with 10 million records using all types found in the
v3 patch "accelerate tuple sorting for common types", using a variety
of initial orderings, covering index build (btree only, no gist) and
queries (both single value and whole record). Attached is the test
script and a spreadsheet with the raw data as well as comparisons of
the min runtimes in seconds from 5 runs. This is using gcc 11.1 on
fairly recent Intel hardware.

Overall, this shows a good improvement for these types. One exception
is the "0/1" ordering, which is two values in random order. I'm
guessing it's because of the cardinality detector, but some runs have
apparent possible regressions. It's a bit high and sporadic to just
blow off as noise, but this case might not be telling us anything
useful.

Other notes:

- The inet type seems unnaturally fast in some places, meaning faster
than int or date. That's suspicous, but I haven't yet dug deeper into
that.

- With the patch, the VM binary size increases by ~9kB.

I have some hunches on the "future research" comments:

XXX Can we avoid repeating the null-handling logic?

More templating? ;-)

XXX Is it worth specializing for reverse sort?

I'll run a limited test on DESC to see if anything stands out, but I
wonder if the use case is not common -- I seem to remember seeing DESC
less often on the first sort key column.

XXX Is it worth specializing for nulls first, nulls last, not null?

Editorializing the null position in queries is not very common in my
experience. Not null is interesting since it'd be trivial to pass
constant false to the same Apply[XYZ]SortComparator() and let the
compiler remove all those branches for us. On the other hand, those
branches would be otherwise predicted well, so it might make little or
no difference.

XXX Should we have separate cases for "result is authoritative", "need
XXX tiebreaker for atts 1..n (= abbrev case)", "need tie breaker for
XXX atts 2..n"?

The first one seems to be the only case where the SortTuple could be
smaller, since the tuple pointer is null. That sounds like a good
avenue to explore. Less memory usage is always good.

Not sure what you mean by the third case -- there are 2+ sort keys,
but the first is authoritative from the datum, so the full comparison
can skip the first key?

-- 
John Naylor
EDB: http://www.enterprisedb.com


qsort-specialize-types-jcn1.ods
Description: application/vnd.oasis.opendocument.spreadsheet


test-tuplesort-20220113.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: A qsort template

2022-01-12 Thread John Naylor
On Mon, Jun 28, 2021 at 8:16 PM Thomas Munro  wrote:

[v4 patchset]

Hi Thomas,

(Sorry for the delay -- I have some time to put into this now.)

> The lower numbered patches are all things that are reused in many
> places, and in my humble opinion improve the notation and type safety
> and code deduplication generally when working with common types

I think 0001-0003 have had enough review previously to commit them, as
they are mostly notational. There's a small amount of bitrot, but not
enough to change the conclusions any. Also 0011 with the missing
#undef.

On Thu, Aug 5, 2021 at 7:18 PM Peter Geoghegan  wrote:
>
> If somebody wants to get a sense of what the size hit is from all of
> these specializations, I can recommend the diff feature of bloaty:
>
> https://github.com/google/bloaty/blob/master/doc/using.md#size-diffs
>
> Obviously you'd approach this by building postgres without the patch,
> and diffing that baseline to postgres with the patch. And possibly
> variations of the patch, with less or more sort specializations.

Thanks, that's a neat feature! For 0001-0003, the diff shows +700
bytes in memory, so pretty small:

$ bloaty -s vm src/backend/postgres -- src/backend/postgres.orig
FILE SIZEVM SIZE
 --  --
  +0.0%+608  +0.0%+608.text
  +0.0% +64  +0.0% +64.eh_frame
  +0.0% +24  +0.0% +24.dynsym
  +0.0% +14  +0.0% +14.dynstr
  +0.0%  +2  +0.0%  +2.gnu.version
  +0.0% +58  [ = ]   0.debug_abbrev
  +0.1% +48  [ = ]   0.debug_aranges
  +0.0% +1.65Ki  [ = ]   0.debug_info
  +0.0%+942  [ = ]   0.debug_line
  +0.1% +26  [ = ]   0.debug_line_str
  +0.0%+333  [ = ]   0.debug_loclists
  -0.0% -23  [ = ]   0.debug_rnglists
  +0.0% +73  [ = ]   0.debug_str
  -1.0%  -4  [ = ]   0.shstrtab
  +0.0% +20  [ = ]   0.strtab
  +0.0% +24  [ = ]   0.symtab
  +131% +3.30Ki  [ = ]   0[Unmapped]
  +0.0% +7.11Ki  +0.0%+712TOTAL

[back to Thomas]

> I tried to measure a speedup in vacuum, but so far I have not.  I did
> learn some things though:  While doing that with an uncorrelated index
> and a lot of deleted tuples, I found that adding more
> maintenance_work_mem doesn't help beyond a few MB, because then cache
> misses dominate to the point where it's not better than doing multiple
> passes (and this is familiar to me from work on hash joins).  If I
> turned on huge pages on Linux and set min_dynamic_shared_memory so
> that the parallel DSM used by vacuum lives in huge pages, then
> parallel vacuum with a large maintenance_work_mem starts to do much
> better than non-parallel vacuum by improving the TLB misses (as with
> hash joins).  I thought that was quite interesting!  Perhaps
> bsearch_itemptr might help with correlated indexes with a lot of
> deleted indexes (so not dominated by cache misses), though?
>
> (I wouldn't be suprised if someone comes up with a much better idea
> than bsearch for that anyway...  a few ideas have been suggested.)

That's interesting about the (un)correlated index having such a large
effect on cache hit rate! By now there has been some discussion and a
benchmark for dead tuple storage [1]. bit there doesn't seem to be
recent activity on that thread. We might consider adding the ItemPtr
comparator work I did in [2] for v15 if we don't have any of the other
proposals in place by feature freeze. My concern there is the speedups
I observed were observed when the values were comfortably in L2 cache,
IIRC. That would need wider testing.

That said, I think what I'll do next is test the v3-0001 standalone
patch with tuplesort specializations for more data types. I already
have a decent test script that I can build on for this. (this is the
one currently in CI)

Then, I want to do at least preliminary testing of the qsort boundary
parameters.

Those two things should be doable for this commitfest.

[1] 
https://www.postgresql.org/message-id/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFBsxsG_c24CHKA3cWrOP1HynWGLOkLb8hyZfsD9db5g-ZPagA%40mail.gmail.com

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2021-08-05 Thread Peter Geoghegan
On Sun, Aug 1, 2021 at 5:41 PM Thomas Munro  wrote:
> On Fri, Jul 30, 2021 at 12:34 PM John Naylor
>  wrote:
> > I got around to getting a benchmark together to serve as a starting point. 
> > I based it off something I got from the archives, but don't remember where 
> > (I seem to remember Tomas Vondra wrote the original, but not sure). To 
> > start I just used types that were there already -- int, text, numeric. The 
> > latter two won't be helped by this patch, but I wanted to keep something 
> > like that so we can see what kind of noise variation there is. I'll 
> > probably cut text out in the future and just keep numeric for that purpose.
>
> Thanks, that's very useful.

If somebody wants to get a sense of what the size hit is from all of
these specializations, I can recommend the diff feature of bloaty:

https://github.com/google/bloaty/blob/master/doc/using.md#size-diffs

Obviously you'd approach this by building postgres without the patch,
and diffing that baseline to postgres with the patch. And possibly
variations of the patch, with less or more sort specializations.

-- 
Peter Geoghegan




Re: A qsort template

2021-08-01 Thread Thomas Munro
On Mon, Aug 2, 2021 at 12:40 PM Thomas Munro  wrote:
> Great!  I saw similar sorts of numbers.  It's really just a few
> crumbs, nothing compared to the gains David just found over in the
> thread "Use generation context to speed up tuplesorts", but on the
> bright side, these crumbs will be magnified by that work.

(Hmm, that also makes me wonder about using a smaller SortTuple when
possible...)




Re: A qsort template

2021-08-01 Thread Thomas Munro
On Fri, Jul 30, 2021 at 12:34 PM John Naylor
 wrote:
> I got around to getting a benchmark together to serve as a starting point. I 
> based it off something I got from the archives, but don't remember where (I 
> seem to remember Tomas Vondra wrote the original, but not sure). To start I 
> just used types that were there already -- int, text, numeric. The latter two 
> won't be helped by this patch, but I wanted to keep something like that so we 
> can see what kind of noise variation there is. I'll probably cut text out in 
> the future and just keep numeric for that purpose.

Thanks, that's very useful.

> I've attached both the script and a crude spreadsheet. I'll try to figure out 
> something nicer for future tests, and maybe some graphs. The "comparison" 
> sheet has the results side by side (min of five). There are 6 distributions 
> of values:
> - random
> - sorted
> - "almost sorted"
> - reversed
> - organ pipe (first half ascending, second half descending)
> - rotated (sorted but then put the smallest at the end)
> - random 0s/1s
>
> I included both "select a" and "select *" to make sure we have the recent 
> datum sort optimization represented. The results look pretty good for ints -- 
> about the same speed up master gets going from tuple sorts to datum sorts, 
> and those got faster in turn also.

Great!  I saw similar sorts of numbers.  It's really just a few
crumbs, nothing compared to the gains David just found over in the
thread "Use generation context to speed up tuplesorts", but on the
bright side, these crumbs will be magnified by that work.

> Next I think I'll run microbenchmarks on int64s with the test harness you 
> attached earlier, and experiment with the qsort parameters a bit.

Cool.  I haven't had much luck experimenting with that yet, though I
consider the promotion from magic numbers to names as an improvement
in any case.

> I'm also attaching your tuplesort patch so others can see what exactly I'm 
> comparing.

We've been bouncing around quite a few different ideas and patches in
this thread; soon I'll try to bring it back to one patch set with the
ideas that are looking good so far in a more tidied up form.  For the
tupesort.c part, I added some TODO notes in
v3-0001-WIP-Accelerate-tuple-sorting-for-common-types.patch's commit
message (see reply to Peter).




Re: A qsort template

2021-08-01 Thread Thomas Munro
On Fri, Jul 30, 2021 at 7:11 PM Peter Geoghegan  wrote:
> If you're going to specialize the sort routine for unsigned integer
> style abbreviated keys then you might as well cover all relevant
> opclasses/types. Almost all abbreviated key schemes produce
> conditioned datums that are designed to use simple 3-way unsigned int
> comparator. It's not just text. (Actually, the only abbreviated key
> scheme that doesn't do it that way is numeric.)

Right, that was the plan, but this was just experimenting with an
idea.  Looks like John's also seeing evidence that it may be worth
pursuing.

(Re numeric, I guess it must be possible to rearrange things so it can
use ssup_datum_signed_cmp; maybe something like NaN -> INT64_MAX, +inf
-> INT64_MAX - 1, -inf -> INT64_MIN, and then -1 - (whatever we're
doing now for normal values).)

> Offhand I know that UUID, macaddr, and inet all have abbreviated keys
> that can use your new ssup_datum_binary_cmp() comparator instead of
> their own duplicated comparator (which will make them use the
> corresponding specialized sort routine inside tuplesort.c).

Thanks, I've added these ones, and also gist_bbox_zorder_cmp_abbrev.

I also renamed that function to ssup_datum_unsigned_cmp(), because
"binary" was misleading.
From a1ca8e9ca989de5f61f1b8f067cb9785034ce9cc Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 3 Jul 2021 19:02:10 +1200
Subject: [PATCH v3] WIP: Accelerate tuple sorting for common types.

Several data types have their own fast comparator functions that can be
replaced with common binary or signed comparator functions.  These
functions can then be recognized by tuplesort.c and used to dispatch to
corresponding specialized sort functions, to accelerate sorting.

XXX WIP, experiment grade only, many open questions... such as:

XXX Can we avoid repeating the null-handling logic?
XXX Is it worth specializing for reverse sort?
XXX Is it worth specializing for nulls first, nulls last, not null?
XXX Should we have separate cases for "result is authoritative", "need
XXX tiebreaker for atts 1..n (= abbrev case)", "need tie breaker for
XXX atts 2..n"?
XXX If we did all of the above, that'd give:
XXX  {nulls_first, nulls_last, not_null} x
XXX  {asc, desc} x
XXX  {tiebreak_none, tiebreak_first, tiebreak_rest} x
XXX  {unsigned, signed, int32}
XXX  = 54 specializations!  Seems a little over the top.
XXX You could use a smaller SortTuple for some cases.
XXX How should we handle numeric?
---
 src/backend/access/gist/gistproc.c |  18 +--
 src/backend/access/nbtree/nbtcompare.c |  22 ++--
 src/backend/utils/adt/date.c   |  15 +--
 src/backend/utils/adt/mac.c|  23 +---
 src/backend/utils/adt/network.c|  17 +--
 src/backend/utils/adt/timestamp.c  |  11 ++
 src/backend/utils/adt/uuid.c   |  25 +---
 src/backend/utils/adt/varlena.c|  34 +-
 src/backend/utils/sort/tuplesort.c | 160 -
 src/include/utils/sortsupport.h| 115 ++
 10 files changed, 308 insertions(+), 132 deletions(-)

diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
index d474612b77..20135ea0ec 100644
--- a/src/backend/access/gist/gistproc.c
+++ b/src/backend/access/gist/gistproc.c
@@ -37,7 +37,6 @@ static uint64 part_bits32_by2(uint32 x);
 static uint32 ieee_float32_to_uint32(float f);
 static int	gist_bbox_zorder_cmp(Datum a, Datum b, SortSupport ssup);
 static Datum gist_bbox_zorder_abbrev_convert(Datum original, SortSupport ssup);
-static int	gist_bbox_zorder_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup);
 static bool gist_bbox_zorder_abbrev_abort(int memtupcount, SortSupport ssup);
 
 
@@ -1726,21 +1725,6 @@ gist_bbox_zorder_abbrev_convert(Datum original, SortSupport ssup)
 #endif
 }
 
-static int
-gist_bbox_zorder_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup)
-{
-	/*
-	 * Compare the pre-computed Z-orders as unsigned integers. Datum is a
-	 * typedef for 'uintptr_t', so no casting is required.
-	 */
-	if (z1 > z2)
-		return 1;
-	else if (z1 < z2)
-		return -1;
-	else
-		return 0;
-}
-
 /*
  * We never consider aborting the abbreviation.
  *
@@ -1764,7 +1748,7 @@ gist_point_sortsupport(PG_FUNCTION_ARGS)
 
 	if (ssup->abbreviate)
 	{
-		ssup->comparator = gist_bbox_zorder_cmp_abbrev;
+		ssup->comparator = ssup_datum_unsigned_cmp;
 		ssup->abbrev_converter = gist_bbox_zorder_abbrev_convert;
 		ssup->abbrev_abort = gist_bbox_zorder_abbrev_abort;
 		ssup->abbrev_full_comparator = gist_bbox_zorder_cmp;
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 7ac73cb8c2..204cf778fb 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -119,26 +119,12 @@ btint4cmp(PG_FUNCTION_ARGS)
 		PG_RETURN_INT32(A_LESS_THAN_B);
 }
 
-static int
-btint4fastcmp(Datum x, Datum y, SortSupport ssup)
-{
-	int32		a = DatumGetInt32(x);
-	int32		b = DatumGetInt32(y);
-
-	if (a > b)
-		return A_GREATER_THAN_B;

Re: A qsort template

2021-07-30 Thread John Naylor
On Fri, Jul 30, 2021 at 7:47 AM Ranier Vilela  wrote:

> The patch attached does not apply cleanly,
> please can fix it?

It applies just fine with "patch", for those wondering.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: A qsort template

2021-07-30 Thread Ranier Vilela
Em qui., 29 de jul. de 2021 às 21:34, John Naylor <
john.nay...@enterprisedb.com> escreveu:

>
> On Sun, Jul 4, 2021 at 12:27 AM Thomas Munro 
> wrote:
> >
> > Since you are experimenting with tuplesort and likely thinking similar
> > thoughts, here's a patch I've been using to explore that area.  I've
> > seen it get, for example, ~1.18x speedup for simple index builds in
> > favourable winds (YMMV, early hacking results only).  Currently, it
> > kicks in when the leading column is of type int4, int8, timestamp,
> > timestamptz, date or text + friends (when abbreviatable, currently
> > that means "C" and ICU collations only), while increasing the
> > executable by only 8.5kB (Clang, amd64, -O2, no debug).
> >
> > These types are handled with just three specialisations.  Their custom
> > "fast" comparators all boiled down to comparisons of datum bits,
> > varying only in signedness and width, so I tried throwing them away
> > and using 3 new common routines.  Then I extended
> > tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to
> > recognise qualifying users of those and select 3 corresponding sort
> > specialisations.
>
> I got around to getting a benchmark together to serve as a starting point.
> I based it off something I got from the archives, but don't remember where
> (I seem to remember Tomas Vondra wrote the original, but not sure). To
> start I just used types that were there already -- int, text, numeric. The
> latter two won't be helped by this patch, but I wanted to keep something
> like that so we can see what kind of noise variation there is. I'll
> probably cut text out in the future and just keep numeric for that purpose.
>
> I've attached both the script and a crude spreadsheet. I'll try to figure
> out something nicer for future tests, and maybe some graphs. The
> "comparison" sheet has the results side by side (min of five). There are 6
> distributions of values:
> - random
> - sorted
> - "almost sorted"
> - reversed
> - organ pipe (first half ascending, second half descending)
> - rotated (sorted but then put the smallest at the end)
> - random 0s/1s
>
> I included both "select a" and "select *" to make sure we have the recent
> datum sort optimization represented. The results look pretty good for ints
> -- about the same speed up master gets going from tuple sorts to datum
> sorts, and those got faster in turn also.
>
> Next I think I'll run microbenchmarks on int64s with the test harness you
> attached earlier, and experiment with the qsort parameters a bit.
>
> I'm also attaching your tuplesort patch so others can see what exactly I'm
> comparing.
>
The patch attached does not apply cleanly,
please can fix it?

error: patch failed: src/backend/utils/sort/tuplesort.c:4776
error: src/backend/utils/sort/tuplesort.c: patch does not apply

regards,
Ranier Vilela


Re: A qsort template

2021-07-30 Thread Peter Geoghegan
On Fri, Jul 30, 2021 at 3:34 AM John Naylor
 wrote:
> I'm also attaching your tuplesort patch so others can see what exactly I'm 
> comparing.

If you're going to specialize the sort routine for unsigned integer
style abbreviated keys then you might as well cover all relevant
opclasses/types. Almost all abbreviated key schemes produce
conditioned datums that are designed to use simple 3-way unsigned int
comparator. It's not just text. (Actually, the only abbreviated key
scheme that doesn't do it that way is numeric.)

Offhand I know that UUID, macaddr, and inet all have abbreviated keys
that can use your new ssup_datum_binary_cmp() comparator instead of
their own duplicated comparator (which will make them use the
corresponding specialized sort routine inside tuplesort.c).

-- 
Peter Geoghegan




Re: A qsort template

2021-07-29 Thread John Naylor
On Sun, Jul 4, 2021 at 12:27 AM Thomas Munro  wrote:
>
> Since you are experimenting with tuplesort and likely thinking similar
> thoughts, here's a patch I've been using to explore that area.  I've
> seen it get, for example, ~1.18x speedup for simple index builds in
> favourable winds (YMMV, early hacking results only).  Currently, it
> kicks in when the leading column is of type int4, int8, timestamp,
> timestamptz, date or text + friends (when abbreviatable, currently
> that means "C" and ICU collations only), while increasing the
> executable by only 8.5kB (Clang, amd64, -O2, no debug).
>
> These types are handled with just three specialisations.  Their custom
> "fast" comparators all boiled down to comparisons of datum bits,
> varying only in signedness and width, so I tried throwing them away
> and using 3 new common routines.  Then I extended
> tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to
> recognise qualifying users of those and select 3 corresponding sort
> specialisations.

I got around to getting a benchmark together to serve as a starting point.
I based it off something I got from the archives, but don't remember where
(I seem to remember Tomas Vondra wrote the original, but not sure). To
start I just used types that were there already -- int, text, numeric. The
latter two won't be helped by this patch, but I wanted to keep something
like that so we can see what kind of noise variation there is. I'll
probably cut text out in the future and just keep numeric for that purpose.

I've attached both the script and a crude spreadsheet. I'll try to figure
out something nicer for future tests, and maybe some graphs. The
"comparison" sheet has the results side by side (min of five). There are 6
distributions of values:
- random
- sorted
- "almost sorted"
- reversed
- organ pipe (first half ascending, second half descending)
- rotated (sorted but then put the smallest at the end)
- random 0s/1s

I included both "select a" and "select *" to make sure we have the recent
datum sort optimization represented. The results look pretty good for ints
-- about the same speed up master gets going from tuple sorts to datum
sorts, and those got faster in turn also.

Next I think I'll run microbenchmarks on int64s with the test harness you
attached earlier, and experiment with the qsort parameters a bit.

I'm also attaching your tuplesort patch so others can see what exactly I'm
comparing.

--
John Naylor
EDB: http://www.enterprisedb.com
set -e

ROWS=$1
WORK_MEM=$2


function log {
	echo `date +%s` [`date +'%Y-%m-%d %H:%M:%S'`] $1
}

function create_tables {

	./inst/bin/psql > /dev/null  < /dev/null  < /dev/null  < /dev/null  < 16384 AND relkind = 'r';
EOF

}

# load test tables

function load_random {

	truncate_tables

	log "loading random"

	./inst/bin/psql > /dev/null  < /dev/null  < /dev/null  < /dev/null  < /dev/null  < /dev/null  < /dev/null  < 0.5 
	then ''
	else ''
	end
FROM data_int;
INSERT INTO num_test SELECT round(a), md5(a::text) FROM data_int;
EOF

	prewarm

	vacuum_analyze

}

# run

function run_query {

	times=""
	group=$1
	wmem=$2
	workers=$3
	query=$4

	echo "--" `date` [`date +%s`] >> explains.log
	echo "-- $group rows=$ROWS work_mem=$wmem workers=$workers" >> explains.log

	./inst/bin/psql >> explains.log 2>&1 < /dev/null <> results.csv
}

function run_index {

times=""
group=$1
wmem=$2
workers=$3
query=$4

times=""

	s=`date +%s`

for i in `seq 1 5`; do

/usr/bin/time -f '%e' -o 'query.time' ./inst/bin/psql > /dev/null <> results.csv
}

function run_queries {

#	for wm in '1MB' '8MB' '32MB' '128MB' '512MB' '1GB'; do
	for wm in $WORK_MEM; do

		log "running queries work_mem=$wm noparallel"

		run_query $1 $wm 0 'SELECT a FROM int_test ORDER BY 1 OFFSET '$ROWS''
		run_query $1 $wm 0 'SELECT * FROM int_test ORDER BY 1 OFFSET '$ROWS''
#		run_query $1 $wm 0 'SELECT a FROM int_test ORDER BY 1 DESC OFFSET '$ROWS''
#		run_query $1 $wm 0 'SELECT * FROM int_test ORDER BY 1 DESC OFFSET '$ROWS''
		run_index $1 $wm 0 'CREATE INDEX x ON int_test (a); DROP INDEX x'

		run_query $1 $wm 0 'SELECT a FROM txt_test ORDER BY 1 OFFSET '$ROWS''
		run_query $1 $wm 0 'SELECT * FROM txt_test ORDER BY 1 OFFSET '$ROWS''
#		run_query $1 $wm 0 'SELECT a FROM txt_test ORDER BY 1 DESC OFFSET '$ROWS''
#		run_query $1 $wm 0 'SELECT * FROM txt_test ORDER BY 1 DESC OFFSET '$ROWS''
		run_index $1 $wm 0 'CREATE INDEX x ON txt_test (a); DROP INDEX x'

		run_query $1 $wm 0 'SELECT * FROM num_test ORDER BY 1 OFFSET '$ROWS''
#		run_query $1 $wm 0 'SELECT * FROM num_test ORDER BY 1 DESC OFFSET '$ROWS''
		run_index $1 $wm 0 'CREATE INDEX x ON num_test (a); DROP INDEX x'

	done

}

create_tables

log "sort benchmark $ROWS"

load_random
run_queries "random"

load_sorted
run_queries "sorted"

load_almost_sorted
run_queries "almost_sorted"

load_reversed

Re: A qsort template

2021-07-22 Thread Thomas Munro
On Thu, Jun 17, 2021 at 1:20 PM Thomas Munro  wrote:
> On Thu, Jun 17, 2021 at 1:14 PM Tom Lane  wrote:
> > The big problem in my mind, which would not be alleviated in the
> > slightest by having a separate file, is that it'd be easy to miss
> > removing entries if they ever become obsolete.
>
> I suppose you could invent some kind of declaration syntax in a
> comment near the use of the pseudo-typename in the source tree that is
> mechanically extracted.

What do you think about something like this?
From 8252eb18c19c8a78fa6326ae5de8261d7d793dda Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 22 Jul 2021 19:05:15 +1200
Subject: [PATCH] Teach pgindent about special file-local typenames.

Allow source files to declare an extra typename with a special
annotation of the form:

@pgindent typename XXX@

Use these to fix some whitespace problems in simplehash.h and
sort_template.h.
---
 src/include/lib/simplehash.h| 110 +---
 src/include/lib/sort_template.h |  21 +++---
 src/tools/pgindent/pgindent |  11 +++-
 3 files changed, 79 insertions(+), 63 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index da51781e98..3e44c7cc03 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -86,6 +86,12 @@
  *   presence is relevant to determine whether a lookup needs to continue
  *   looking or is done - buckets following a deleted element are shifted
  *   backwards, unless they're empty or already at their optimal position.
+ *
+ * Help pgindent understand our pseudo-typenames:
+ *
+ *   @pgindent typename SH_ELEMENT_TYPE@
+ *   @pgindent typename SH_TYPE@
+ *   @pgindent typename SH_ITERATOR@
  */
 
 #include "port/pg_bitutils.h"
@@ -164,7 +170,7 @@ typedef struct SH_TYPE
 
/* user defined data, useful for callbacks */
void   *private_data;
-}  SH_TYPE;
+} SH_TYPE;
 
 typedef enum SH_STATUS
 {
@@ -177,67 +183,67 @@ typedef struct SH_ITERATOR
uint32  cur;/* current element */
uint32  end;
booldone;   /* iterator exhausted? */
-}  SH_ITERATOR;
+} SH_ITERATOR;
 
 /* externally visible function prototypes */
 #ifdef SH_RAW_ALLOCATOR
 /* _hash _create(uint32 nelements, void *private_data) */
-SH_SCOPE   SH_TYPE *SH_CREATE(uint32 nelements, void *private_data);
+SH_SCOPE SH_TYPE *SH_CREATE(uint32 nelements, void *private_data);
 #else
 /*
  * _hash _create(MemoryContext ctx, uint32 nelements,
  *  void 
*private_data)
  */
-SH_SCOPE   SH_TYPE *SH_CREATE(MemoryContext ctx, uint32 nelements,
-  void *private_data);
+SH_SCOPE SH_TYPE *SH_CREATE(MemoryContext ctx, uint32 nelements,
+   void *private_data);
 #endif
 
 /* void _destroy(_hash *tb) */
-SH_SCOPE void SH_DESTROY(SH_TYPE * tb);
+SH_SCOPE void SH_DESTROY(SH_TYPE *tb);
 
 /* void _reset(_hash *tb) */
-SH_SCOPE void SH_RESET(SH_TYPE * tb);
+SH_SCOPE void SH_RESET(SH_TYPE *tb);
 
 /* void _grow(_hash *tb) */
-SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
+SH_SCOPE void SH_GROW(SH_TYPE *tb, uint32 newsize);
 
 /*  *_insert(_hash *tb,  key, bool *found) */
-SH_SCOPE   SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool 
*found);
+SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE *tb, SH_KEY_TYPE key, bool *found);
 
 /*
  *  *_insert_hash(_hash *tb,  key, uint32 hash,
  *   bool *found)
  */
-SH_SCOPE   SH_ELEMENT_TYPE *SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
-   
uint32 hash, bool *found);
+SH_SCOPE SH_ELEMENT_TYPE *SH_INSERT_HASH(SH_TYPE *tb, SH_KEY_TYPE key,
+   
 uint32 hash, bool *found);
 
 /*  *_lookup(_hash *tb,  key) */
-SH_SCOPE   SH_ELEMENT_TYPE *SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key);
+SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP(SH_TYPE *tb, SH_KEY_TYPE key);
 
 /*  *_lookup_hash(_hash *tb,  key, uint32 hash) 
*/
-SH_SCOPE   SH_ELEMENT_TYPE *SH_LOOKUP_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
-   
uint32 hash);
+SH_SCOPE SH_ELEMENT_TYPE *SH_LOOKUP_HASH(SH_TYPE *tb, SH_KEY_TYPE key,
+   
 uint32 hash);
 
 /* void _delete_item(_hash *tb,  *entry) */
-SH_SCOPE void SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry);
+SH_SCOPE void SH_DELETE_ITEM(SH_TYPE *tb, SH_ELEMENT_TYPE *entry);
 
 /* bool _delete(_hash *tb,  key) */
-SH_SCOPE bool SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key);
+SH_SCOPE bool SH_DELETE(SH_TYPE *tb, 

Re: A qsort template

2021-07-15 Thread John Naylor
On Thu, Jul 15, 2021 at 7:50 AM vignesh C  wrote:
> The patch does not apply on Head anymore, could you rebase and post a
> patch. I'm changing the status to "Waiting for Author".

The patch set is fine. The error is my fault since I attached an
experimental addendum and neglected to name it as .txt. I've set it back to
"needs review" and will resume testing shortly.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: A qsort template

2021-07-15 Thread vignesh C
On Sun, Jul 4, 2021 at 9:58 AM Thomas Munro  wrote:
>
> On Fri, Jul 2, 2021 at 2:32 PM John Naylor  
> wrote:
> > I suspect if we experiment on two extremes of type "heaviness" (accessing 
> > and comparing trivial or not), such as uint32 and tuplesort, we'll have a 
> > pretty good idea what the parameters should be, if anything different. I'll 
> > do some testing along those lines.
>
> Cool.
>
> Since you are experimenting with tuplesort and likely thinking similar
> thoughts, here's a patch I've been using to explore that area.  I've
> seen it get, for example, ~1.18x speedup for simple index builds in
> favourable winds (YMMV, early hacking results only).  Currently, it
> kicks in when the leading column is of type int4, int8, timestamp,
> timestamptz, date or text + friends (when abbreviatable, currently
> that means "C" and ICU collations only), while increasing the
> executable by only 8.5kB (Clang, amd64, -O2, no debug).
>
> These types are handled with just three specialisations.  Their custom
> "fast" comparators all boiled down to comparisons of datum bits,
> varying only in signedness and width, so I tried throwing them away
> and using 3 new common routines.  Then I extended
> tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to
> recognise qualifying users of those and select 3 corresponding sort
> specialisations.
>
> It might turn out to be worth burning some more executable size on
> extra variants (for example, see XXX notes in the code comments for
> opportunities; one could also go nuts trying smaller things like
> special cases for not-null, nulls first, reverse sort, ... to kill all
> those branches), or not.

The patch does not apply on Head anymore, could you rebase and post a
patch. I'm changing the status to "Waiting for Author".

Regards,
Vignesh




Re: A qsort template

2021-07-03 Thread Thomas Munro
On Fri, Jul 2, 2021 at 2:32 PM John Naylor  wrote:
> I suspect if we experiment on two extremes of type "heaviness" (accessing and 
> comparing trivial or not), such as uint32 and tuplesort, we'll have a pretty 
> good idea what the parameters should be, if anything different. I'll do some 
> testing along those lines.

Cool.

Since you are experimenting with tuplesort and likely thinking similar
thoughts, here's a patch I've been using to explore that area.  I've
seen it get, for example, ~1.18x speedup for simple index builds in
favourable winds (YMMV, early hacking results only).  Currently, it
kicks in when the leading column is of type int4, int8, timestamp,
timestamptz, date or text + friends (when abbreviatable, currently
that means "C" and ICU collations only), while increasing the
executable by only 8.5kB (Clang, amd64, -O2, no debug).

These types are handled with just three specialisations.  Their custom
"fast" comparators all boiled down to comparisons of datum bits,
varying only in signedness and width, so I tried throwing them away
and using 3 new common routines.  Then I extended
tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to
recognise qualifying users of those and select 3 corresponding sort
specialisations.

It might turn out to be worth burning some more executable size on
extra variants (for example, see XXX notes in the code comments for
opportunities; one could also go nuts trying smaller things like
special cases for not-null, nulls first, reverse sort, ... to kill all
those branches), or not.
From 62a8acbc745f3b4a90c9a14b6b61989a9d83bece Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 3 Jul 2021 19:02:10 +1200
Subject: [PATCH] WIP: Accelerate tuple sorting for common types.

Several data types have their own fast comparator functions that can be
replaced with common binary or signed comparator functions.  These
functions can then be recognized by tuplesort.c and used to dispatch to
corresponding specialized sort functions, to accelerate sorting.

XXX WIP, experiment grade only, many open questions...
---
 src/backend/access/nbtree/nbtcompare.c |  22 ++--
 src/backend/utils/adt/date.c   |  15 +--
 src/backend/utils/adt/timestamp.c  |  11 ++
 src/backend/utils/adt/varlena.c|  26 +---
 src/backend/utils/sort/tuplesort.c | 161 -
 src/include/utils/sortsupport.h| 115 ++
 6 files changed, 295 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/nbtree/nbtcompare.c 
b/src/backend/access/nbtree/nbtcompare.c
index 7ac73cb8c2..204cf778fb 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -119,26 +119,12 @@ btint4cmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32(A_LESS_THAN_B);
 }
 
-static int
-btint4fastcmp(Datum x, Datum y, SortSupport ssup)
-{
-   int32   a = DatumGetInt32(x);
-   int32   b = DatumGetInt32(y);
-
-   if (a > b)
-   return A_GREATER_THAN_B;
-   else if (a == b)
-   return 0;
-   else
-   return A_LESS_THAN_B;
-}
-
 Datum
 btint4sortsupport(PG_FUNCTION_ARGS)
 {
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
 
-   ssup->comparator = btint4fastcmp;
+   ssup->comparator = ssup_datum_int32_cmp;
PG_RETURN_VOID();
 }
 
@@ -156,6 +142,7 @@ btint8cmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32(A_LESS_THAN_B);
 }
 
+#ifndef USE_FLOAT8_BYVAL
 static int
 btint8fastcmp(Datum x, Datum y, SortSupport ssup)
 {
@@ -169,13 +156,18 @@ btint8fastcmp(Datum x, Datum y, SortSupport ssup)
else
return A_LESS_THAN_B;
 }
+#endif
 
 Datum
 btint8sortsupport(PG_FUNCTION_ARGS)
 {
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
 
+#ifdef USE_FLOAT8_BYVAL
+   ssup->comparator = ssup_datum_signed_cmp;
+#else
ssup->comparator = btint8fastcmp;
+#endif
PG_RETURN_VOID();
 }
 
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 0bea16cb67..350c662d50 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -438,25 +438,12 @@ date_cmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32(0);
 }
 
-static int
-date_fastcmp(Datum x, Datum y, SortSupport ssup)
-{
-   DateADT a = DatumGetDateADT(x);
-   DateADT b = DatumGetDateADT(y);
-
-   if (a < b)
-   return -1;
-   else if (a > b)
-   return 1;
-   return 0;
-}
-
 Datum
 date_sortsupport(PG_FUNCTION_ARGS)
 {
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
 
-   ssup->comparator = date_fastcmp;
+   ssup->comparator = ssup_datum_int32_cmp;
PG_RETURN_VOID();
 }
 
diff --git a/src/backend/utils/adt/timestamp.c 
b/src/backend/utils/adt/timestamp.c
index 79761f809c..c678517db6 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -37,6 +37,7 @@
 #include "utils/datetime.h"
 

Re: A qsort template

2021-07-01 Thread John Naylor
On Thu, Jul 1, 2021 at 6:10 PM Thomas Munro  wrote:

> One thing I'm wondering about is whether it's worth having stuff to
> support future experimentation like ST_SORT_SMALL_THRESHOLD and
> ST_COMPARE_RET_TYPE in the tree, or whether we should pare it back to
> the minimal changes that definitely produce results.  I think I'd like
> to keep those changes: even if it may be some time, possibly an
> infinite amount, before we figure out how to tune the thresholds
> profitably, giving them names instead of using magic numbers seems
> like progress.

I suspect if we experiment on two extremes of type "heaviness" (accessing
and comparing trivial or not), such as uint32 and tuplesort, we'll have a
pretty good idea what the parameters should be, if anything different. I'll
do some testing along those lines.

(BTW, I just realized I lied and sent a .patch file after all, oops)

--
John Naylor
EDB: http://www.enterprisedb.com


Re: A qsort template

2021-07-01 Thread Thomas Munro
On Fri, Jul 2, 2021 at 4:39 AM John Naylor  wrote:
> For item pointers, it made sense to try doing math to reduce the number of 
> branches. That made things worse, so let's try the opposite: Increase the 
> number of branches so we do less math. In the attached patch (applies on top 
> of your 0012 and a .txt to avoid confusing the CF bot), I test a new 
> comparator with this approach, and also try a wider range of thresholds. The 
> thresholds don't seem to make any noticeable difference with this data type, 
> but the new comparator (cmp=ids below) gives a nice speedup in this test:

> NOTICE:  [traditional qsort] order=random, threshold=7, cmp=32, test=0, 
> time=4.964657

> NOTICE:  order=random, threshold=7, cmp=std, test=0, time=2.810627

> NOTICE:  order=random, threshold=7, cmp=ids, test=0, time=1.692711

Oooh.  So, the awkwardness of the 64 maths with unaligned inputs (even
though we obtain all inputs with 16 bit loads) was hurting, and you
realised the same sort of thing might be happening also with the 32
bit version and went the other way.  (It'd be nice to understand
exactly why.)

I tried your 16 bit comparison version on Intel, AMD and Apple CPUs
and the results were all in the same ballpark.  For random input, I
see something like ~1.7x speedup over traditional qsort from
specialising (cmp=std), and ~2.7x from going 16 bit (cmp=ids).  For
increasing and decreasing input, it's ~2x speedup from specialising
and ~4x speedup from going 16 bit.  Beautiful.

One thing I'm wondering about is whether it's worth having stuff to
support future experimentation like ST_SORT_SMALL_THRESHOLD and
ST_COMPARE_RET_TYPE in the tree, or whether we should pare it back to
the minimal changes that definitely produce results.  I think I'd like
to keep those changes: even if it may be some time, possibly an
infinite amount, before we figure out how to tune the thresholds
profitably, giving them names instead of using magic numbers seems
like progress.

The Alexandrescu talk was extremely entertaining, thanks.




Re: A qsort template

2021-07-01 Thread John Naylor
I wrote:

> One of the points of the talk I linked to is "if doing the sensible thing
makes things worse, try something silly instead".

For item pointers, it made sense to try doing math to reduce the number of
branches. That made things worse, so let's try the opposite: Increase the
number of branches so we do less math. In the attached patch (applies on
top of your 0012 and a .txt to avoid confusing the CF bot), I test a new
comparator with this approach, and also try a wider range of thresholds.
The thresholds don't seem to make any noticeable difference with this data
type, but the new comparator (cmp=ids below) gives a nice speedup in this
test:

# SELECT test_sort_itemptr();
NOTICE:  [traditional qsort] order=random, threshold=7, cmp=32, test=0,
time=4.964657
NOTICE:  [traditional qsort] order=random, threshold=7, cmp=32, test=1,
time=5.185384
NOTICE:  [traditional qsort] order=random, threshold=7, cmp=32, test=2,
time=5.058179
NOTICE:  order=random, threshold=7, cmp=std, test=0, time=2.810627
NOTICE:  order=random, threshold=7, cmp=std, test=1, time=2.804940
NOTICE:  order=random, threshold=7, cmp=std, test=2, time=2.800677
NOTICE:  order=random, threshold=7, cmp=ids, test=0, time=1.692711
NOTICE:  order=random, threshold=7, cmp=ids, test=1, time=1.694546
NOTICE:  order=random, threshold=7, cmp=ids, test=2, time=1.692839
NOTICE:  order=random, threshold=12, cmp=std, test=0, time=2.687033
NOTICE:  order=random, threshold=12, cmp=std, test=1, time=2.681974
NOTICE:  order=random, threshold=12, cmp=std, test=2, time=2.687833
NOTICE:  order=random, threshold=12, cmp=ids, test=0, time=1.666418
NOTICE:  order=random, threshold=12, cmp=ids, test=1, time=1.666188
NOTICE:  order=random, threshold=12, cmp=ids, test=2, time=1.664176
NOTICE:  order=random, threshold=16, cmp=std, test=0, time=2.574147
NOTICE:  order=random, threshold=16, cmp=std, test=1, time=2.579981
NOTICE:  order=random, threshold=16, cmp=std, test=2, time=2.572861
NOTICE:  order=random, threshold=16, cmp=ids, test=0, time=1.699432
NOTICE:  order=random, threshold=16, cmp=ids, test=1, time=1.703075
NOTICE:  order=random, threshold=16, cmp=ids, test=2, time=1.697173
NOTICE:  order=random, threshold=32, cmp=std, test=0, time=2.750040
NOTICE:  order=random, threshold=32, cmp=std, test=1, time=2.744138
NOTICE:  order=random, threshold=32, cmp=std, test=2, time=2.748026
NOTICE:  order=random, threshold=32, cmp=ids, test=0, time=1.677414
NOTICE:  order=random, threshold=32, cmp=ids, test=1, time=1.683792
NOTICE:  order=random, threshold=32, cmp=ids, test=2, time=1.701309
NOTICE:  [traditional qsort] order=increasing, threshold=7, cmp=32, test=0,
time=2.543837
NOTICE:  [traditional qsort] order=increasing, threshold=7, cmp=32, test=1,
time=2.290497
NOTICE:  [traditional qsort] order=increasing, threshold=7, cmp=32, test=2,
time=2.262956
NOTICE:  order=increasing, threshold=7, cmp=std, test=0, time=1.033052
NOTICE:  order=increasing, threshold=7, cmp=std, test=1, time=1.032079
NOTICE:  order=increasing, threshold=7, cmp=std, test=2, time=1.041836
NOTICE:  order=increasing, threshold=7, cmp=ids, test=0, time=0.367355
NOTICE:  order=increasing, threshold=7, cmp=ids, test=1, time=0.367428
NOTICE:  order=increasing, threshold=7, cmp=ids, test=2, time=0.367384
NOTICE:  order=increasing, threshold=12, cmp=std, test=0, time=1.004991
NOTICE:  order=increasing, threshold=12, cmp=std, test=1, time=1.008045
NOTICE:  order=increasing, threshold=12, cmp=std, test=2, time=1.010778
NOTICE:  order=increasing, threshold=12, cmp=ids, test=0, time=0.370944
NOTICE:  order=increasing, threshold=12, cmp=ids, test=1, time=0.368669
NOTICE:  order=increasing, threshold=12, cmp=ids, test=2, time=0.370100
NOTICE:  order=increasing, threshold=16, cmp=std, test=0, time=1.023682
NOTICE:  order=increasing, threshold=16, cmp=std, test=1, time=1.025805
NOTICE:  order=increasing, threshold=16, cmp=std, test=2, time=1.022005
NOTICE:  order=increasing, threshold=16, cmp=ids, test=0, time=0.365398
NOTICE:  order=increasing, threshold=16, cmp=ids, test=1, time=0.365586
NOTICE:  order=increasing, threshold=16, cmp=ids, test=2, time=0.364807
NOTICE:  order=increasing, threshold=32, cmp=std, test=0, time=0.950780
NOTICE:  order=increasing, threshold=32, cmp=std, test=1, time=0.949920
NOTICE:  order=increasing, threshold=32, cmp=std, test=2, time=0.953239
NOTICE:  order=increasing, threshold=32, cmp=ids, test=0, time=0.367866
NOTICE:  order=increasing, threshold=32, cmp=ids, test=1, time=0.372179
NOTICE:  order=increasing, threshold=32, cmp=ids, test=2, time=0.371115
NOTICE:  [traditional qsort] order=decreasing, threshold=7, cmp=32, test=0,
time=2.317475
NOTICE:  [traditional qsort] order=decreasing, threshold=7, cmp=32, test=1,
time=2.323446
NOTICE:  [traditional qsort] order=decreasing, threshold=7, cmp=32, test=2,
time=2.326714
NOTICE:  order=decreasing, threshold=7, cmp=std, test=0, time=1.022270
NOTICE:  order=decreasing, threshold=7, cmp=std, test=1, time=1.015133
NOTICE:  

Re: A qsort template

2021-06-29 Thread John Naylor
On Tue, Jun 29, 2021 at 2:56 AM Thomas Munro  wrote:

> Here's a version that includes a rather hackish test module that you
> might find useful to explore various weird effects.  Testing sorting
> routines is really hard, of course... there's a zillion parameters and
> things you could do in the data and cache effects etc etc.  One of the

That module is incredibly useful!

Yeah, while brushing up on recent findings on sorting, it's clear there's a
huge amount of options with different tradeoffs. I did see your tweet last
year about the "small sort" threshold that was tested on a VAX machine, but
hadn't given it any thought til now. Looking around, I've seen quite a
range, always with the caveat of "it depends". A couple interesting
variations:

Golang uses 12, with an extra tweak:

// Do ShellSort pass with gap 6
// It could be written in this simplified form cause b-a <= 12
for i := a + 6; i < b; i++ {
if data.Less(i, i-6) {
data.Swap(i, i-6)
}
}
insertionSort(data, a, b)

Andrei Alexandrescu gave a couple talks discussing the small-sort part of
quicksort, and demonstrated a ruthlessly-optimized make-heap +
unguarded-insertion-sort, using a threshold of 256. He reported a 6%
speed-up sorting a million doubles, IIRC:

video: https://www.youtube.com/watch?v=FJJTYQYB1JQ
slides:
https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf

That might not be workable for us, but it's a fun talk.

> main things that jumps out pretty clearly though with these simple
> tests is that sorting 6 byte ItemPointerData objects is *really slow*
> compared to more natural object sizes (look at the times and the
> MEMORY values in the scripts).  Another is that specialised sort
> functions are much faster than traditional qsort (being one of the
> goals of this exercise).  Sadly, the 64 bit comparison technique is
> not looking too good in the output of this test.

One of the points of the talk I linked to is "if doing the sensible thing
makes things worse, try something silly instead".

Anyway, I'll play around with the scripts and see if something useful pops
out.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: A qsort template

2021-06-29 Thread Thomas Munro
On Tue, Jun 29, 2021 at 1:11 PM Thomas Munro  wrote:
> I spotted a mistake in v3: I didn't rename ST_COMPARE_TYPE to
> ST_COMPARE_RET_TYPE in the 0009 patch (well, I did, but forgot to
> commit before I ran git format-patch).  I won't send another tarball
> just for that, but will correct it next time.

Here's a version that includes a rather hackish test module that you
might find useful to explore various weird effects.  Testing sorting
routines is really hard, of course... there's a zillion parameters and
things you could do in the data and cache effects etc etc.  One of the
main things that jumps out pretty clearly though with these simple
tests is that sorting 6 byte ItemPointerData objects is *really slow*
compared to more natural object sizes (look at the times and the
MEMORY values in the scripts).  Another is that specialised sort
functions are much faster than traditional qsort (being one of the
goals of this exercise).  Sadly, the 64 bit comparison technique is
not looking too good in the output of this test.


more-sort-search-specializations-v4.tgz
Description: application/compressed-tar


Re: A qsort template

2021-06-28 Thread Thomas Munro
I spotted a mistake in v3: I didn't rename ST_COMPARE_TYPE to
ST_COMPARE_RET_TYPE in the 0009 patch (well, I did, but forgot to
commit before I ran git format-patch).  I won't send another tarball
just for that, but will correct it next time.




Re: A qsort template

2021-06-28 Thread Thomas Munro
Hi John,

On Tue, Jun 29, 2021 at 7:13 AM John Naylor
 wrote:
> I plan to do some performance testing with VACUUM, ANALYZE etc soon, to see 
> if I can detect any significant differences.

Thanks!

> I did a quick check of the MacOS/clang binary size (no debug symbols):
>
> master:8108408
> 0001-0009: 8125224

Not too bad.

> Later, I'll drill down into the individual patches and see if anything stands 
> out.
>
> There were already some comments for v2 upthread about formatting and an 
> overflow hazard, but I did find a few more things to ask about:

Right, here's an update with fixes discussed earlier with Zhihong and Tom:

* COMPARE_TYPE -> COMPARE_RET_TYPE
* quit fighting with pgindent (I will try to fix this problem generally later)
* fix overflow hazard

> - For my curiosity, there are a lot of calls to qsort/qunique in the tree -- 
> without having looked exhaustively, do these patches focus on cases where 
> there are bespoke comparator functions and/or hot code paths?

Patches 0006-0009 are highly specialised for local usage by a single
module, and require some kind of evidence that they're worth their
bytes, and the onus is on me there of course -- but any ideas and
feedback are welcome.  There are other opportunities like these, maybe
better ones.  That reminds me: I recently had a perf report from
Andres that showed the qsort in compute_scalar_stats() as quite hot.
That's probably a good candidate, and is not yet done in the current
patch set.

The lower numbered patches are all things that are reused in many
places, and in my humble opinion improve the notation and type safety
and code deduplication generally when working with common types
ItemPtr, BlockNumber, Oid, aside from any performance arguments.  At
least the ItemPtr stuff *might* also speed something useful up.

I tried to measure a speedup in vacuum, but so far I have not.  I did
learn some things though:  While doing that with an uncorrelated index
and a lot of deleted tuples, I found that adding more
maintenance_work_mem doesn't help beyond a few MB, because then cache
misses dominate to the point where it's not better than doing multiple
passes (and this is familiar to me from work on hash joins).  If I
turned on huge pages on Linux and set min_dynamic_shared_memory so
that the parallel DSM used by vacuum lives in huge pages, then
parallel vacuum with a large maintenance_work_mem starts to do much
better than non-parallel vacuum by improving the TLB misses (as with
hash joins).  I thought that was quite interesting!  Perhaps
bsearch_itemptr might help with correlated indexes with a lot of
deleted indexes (so not dominated by cache misses), though?

(I wouldn't be suprised if someone comes up with a much better idea
than bsearch for that anyway...  a few ideas have been suggested.)

> - Aside from the qsort{_arg} precedence, is there a practical reason for 
> keeping the new global functions in their own files?

Better idea for layout welcome.  One thing I wondered while trying to
figure out where to put functions that operate on itemptr: why is
itemptr_encode() in src/include/catalog/index.h?!

> - 0002 / 0004
>
> +/* Search and unique functions inline in header. */
>
> The functions are pretty small, but is there some advantage for inlining 
> these?

Glibc's bsearch definition is already in a header for inlining (as is
our qunique), so I thought I should preserve that characteristic on
principle.  I don't have any evidence though.  Other libcs I looked at
didn't have bsearch in a header.  So by doing this we make the
generated code the same across platforms (all other relevant things
being equal).  I don't know if it really makes much difference,
especially since in this case the comparator and size would still be
inlined if we defined it in the .c (unlike standard bsearch)...
Probably only lazy_tid_reaped() calls it enough to potentially show
any difference in a non-microbenchmark workload, if anything does.

> - 0003
>
> #include "lib/qunique.h" is not needed anymore.

Fixed.

> This isn't quite relevant for the current patch perhaps, but I'm wondering 
> why we don't already call bsearch for RelationHasSysCache() and 
> RelationSupportsSysCache().

Right, I missed that.  Done.  Nice to delete some more code.

> - 0008
>
> +#define ST_COMPARE(a, b, cxt) \
> + DatumGetInt32(FunctionCall2Coll(>flinfo, cxt->collation, *a, *b))
>
> This seems like a pretty heavyweight comparison, so I'm not sure inlining 
> buys us much, but it seems also there are fewer branches this way. I'll come 
> up with a test and see what happens.

I will be very interested to see the results.  Thanks!


more-sort-search-specializations-v3.tgz
Description: application/compressed-tar


Re: A qsort template

2021-06-28 Thread John Naylor
On Wed, Jun 16, 2021 at 1:55 AM Thomas Munro  wrote:
[v2 patch]

Hi Thomas,

I plan to do some performance testing with VACUUM, ANALYZE etc soon, to see
if I can detect any significant differences.

I did a quick check of the MacOS/clang binary size (no debug symbols):

master:8108408
0001-0009: 8125224

Later, I'll drill down into the individual patches and see if anything
stands out.

There were already some comments for v2 upthread about formatting and an
overflow hazard, but I did find a few more things to ask about:

- For my curiosity, there are a lot of calls to qsort/qunique in the tree
-- without having looked exhaustively, do these patches focus on cases
where there are bespoke comparator functions and/or hot code paths?

- Aside from the qsort{_arg} precedence, is there a practical reason for
keeping the new global functions in their own files?

- 0002 / 0004

+/* Search and unique functions inline in header. */

The functions are pretty small, but is there some advantage for inlining
these?

- 0003

#include "lib/qunique.h" is not needed anymore.

This isn't quite relevant for the current patch perhaps, but I'm wondering
why we don't already call bsearch for RelationHasSysCache() and
RelationSupportsSysCache().

- 0008

+#define ST_COMPARE(a, b, cxt) \
+ DatumGetInt32(FunctionCall2Coll(>flinfo, cxt->collation, *a, *b))

This seems like a pretty heavyweight comparison, so I'm not sure inlining
buys us much, but it seems also there are fewer branches this way. I'll
come up with a test and see what happens.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: A qsort template

2021-06-16 Thread Thomas Munro
On Thu, Jun 17, 2021 at 1:14 PM Tom Lane  wrote:
> The big problem in my mind, which would not be alleviated in the
> slightest by having a separate file, is that it'd be easy to miss
> removing entries if they ever become obsolete.

I suppose you could invent some kind of declaration syntax in a
comment near the use of the pseudo-typename in the source tree that is
mechanically extracted.




Re: A qsort template

2021-06-16 Thread Tom Lane
Thomas Munro  writes:
> Perhaps one day we could add a
> secondary file, not updated by that mechanism, that holds a manually
> maintained list for cases like this.

Yeah, the comments in pgindent already speculate about that.  For
now, those include and exclude lists are short enough that keeping
them inside the script seems a lot easier than building tooling
to get them from somewhere else.

The big problem in my mind, which would not be alleviated in the
slightest by having a separate file, is that it'd be easy to miss
removing entries if they ever become obsolete.

regards, tom lane




Re: A qsort template

2021-06-16 Thread Thomas Munro
On Thu, Jun 17, 2021 at 11:40 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > Hmm, well it was only recently damaged by commit def5b065, and that's
> > because I'd forgotten to put ST_ELEMENT_TYPE into typedefs.list, and I
> > was correcting that in this patch.
>
> If ST_ELEMENT_TYPE isn't recognized as a typedef by the buildfarm's
> typedef collectors, this sort of manual addition to typedefs.list
> is not going to survive the next pgindent run.  No, I will NOT
> promise to manually add it back every time.
>
> We do already have special provision for injecting additional typedefs
> in the pgindent script, so one possibility is to add it there:
>
> -my @additional = ("bool\n");
> +my @additional = ("bool\nST_ELEMENT_TYPE\n");
>
> On the whole I'm not sure that this is a big enough formatting
> issue to justify a special hack, though.  Is there any more than
> the one line that gets misformatted?

Ohh.  In that case, I won't bother with that hunk and will live with
the extra space.  There are several other lines like this in the tree,
where people use caveman template macrology that is invisible to
whatever analyser is being used for that, and I can see that that's
just going to have to be OK for now.  Perhaps one day we could add a
secondary file, not updated by that mechanism, that holds a manually
maintained list for cases like this.




Re: A qsort template

2021-06-16 Thread Tom Lane
Thomas Munro  writes:
> Hmm, well it was only recently damaged by commit def5b065, and that's
> because I'd forgotten to put ST_ELEMENT_TYPE into typedefs.list, and I
> was correcting that in this patch.

If ST_ELEMENT_TYPE isn't recognized as a typedef by the buildfarm's
typedef collectors, this sort of manual addition to typedefs.list
is not going to survive the next pgindent run.  No, I will NOT
promise to manually add it back every time.

We do already have special provision for injecting additional typedefs
in the pgindent script, so one possibility is to add it there:

-my @additional = ("bool\n");
+my @additional = ("bool\nST_ELEMENT_TYPE\n");

On the whole I'm not sure that this is a big enough formatting
issue to justify a special hack, though.  Is there any more than
the one line that gets misformatted?

regards, tom lane




Re: A qsort template

2021-06-16 Thread Zhihong Yu
On Wed, Jun 16, 2021 at 2:54 PM Thomas Munro  wrote:

> Hi Zhihong,
>
> On Thu, Jun 17, 2021 at 8:13 AM Zhihong Yu  wrote:
> > In 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch :
> >
> > -   const ST_ELEMENT_TYPE *
> ST_SORT_PROTO_ARG);
> > +   const ST_ELEMENT_TYPE
> *ST_SORT_PROTO_ARG);
> >
> > It seems there is no real change in the line above. Better keep the
> original formation.
>
> Hmm, well it was only recently damaged by commit def5b065, and that's
> because I'd forgotten to put ST_ELEMENT_TYPE into typedefs.list, and I
> was correcting that in this patch.  (That file is used by
> pg_bsd_indent to decide if an identifier is a type or a variable,
> which affects whether '*' is formatted like a unary operator/type
> syntax or a binary operator.)
>
> >   *   - ST_COMPARE_ARG_TYPE - type of extra argument
> >   *
> > + *   To say that the comparator returns a type other than int, use:
> > + *
> > + *   - ST_COMPARE_TYPE - an integer type
> >
> > Since the ST_COMPARE_TYPE is meant to designate the type of the return
> value, maybe ST_COMPARE_RET_TYPE would be better name.
> > It also goes with ST_COMPARE_ARG_TYPE preceding this.
>
> Good idea, will do.
>
> > -   ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data,
> > -  *pa,
> > -  *pb,
> > -  *pc,
> > -  *pd,
> > -  *pl,
> > -  *pm,
> > -  *pn;
> > +   ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data;
> > +   ST_POINTER_TYPE *pa;
> >
> > There doesn't seem to be material change for the above hunk.
>
> In master, you can't write #define ST_ELEMENT_TYPE some_type *, which
> seems like it would be quite useful.  You can use pointers as element
> types, but only with a typedef name due to C parsing rules.  some_type
> **a, *pa, ... declares some_type *pa, but we want some_type **pa.  I
> don't want to have to introduce extra typedefs.  The change fixes that
> problem by not using C's squirrelly variable declaration list syntax.
>
> > +   while (left <= right)
> > +   {
> > +   size_t  mid = (left + right) / 2;
> >
> > The computation for midpoint should be left + (right-left)/2.
>
> Right, my way can overflow.  Will fix.  Thanks!
>

Hi,
Thanks for giving me background on typedefs.
The relevant changes look fine to me.

Cheers


Re: A qsort template

2021-06-16 Thread Thomas Munro
Hi Zhihong,

On Thu, Jun 17, 2021 at 8:13 AM Zhihong Yu  wrote:
> In 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch :
>
> -   const ST_ELEMENT_TYPE * 
> ST_SORT_PROTO_ARG);
> +   const ST_ELEMENT_TYPE 
> *ST_SORT_PROTO_ARG);
>
> It seems there is no real change in the line above. Better keep the original 
> formation.

Hmm, well it was only recently damaged by commit def5b065, and that's
because I'd forgotten to put ST_ELEMENT_TYPE into typedefs.list, and I
was correcting that in this patch.  (That file is used by
pg_bsd_indent to decide if an identifier is a type or a variable,
which affects whether '*' is formatted like a unary operator/type
syntax or a binary operator.)

>   *   - ST_COMPARE_ARG_TYPE - type of extra argument
>   *
> + *   To say that the comparator returns a type other than int, use:
> + *
> + *   - ST_COMPARE_TYPE - an integer type
>
> Since the ST_COMPARE_TYPE is meant to designate the type of the return value, 
> maybe ST_COMPARE_RET_TYPE would be better name.
> It also goes with ST_COMPARE_ARG_TYPE preceding this.

Good idea, will do.

> -   ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data,
> -  *pa,
> -  *pb,
> -  *pc,
> -  *pd,
> -  *pl,
> -  *pm,
> -  *pn;
> +   ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data;
> +   ST_POINTER_TYPE *pa;
>
> There doesn't seem to be material change for the above hunk.

In master, you can't write #define ST_ELEMENT_TYPE some_type *, which
seems like it would be quite useful.  You can use pointers as element
types, but only with a typedef name due to C parsing rules.  some_type
**a, *pa, ... declares some_type *pa, but we want some_type **pa.  I
don't want to have to introduce extra typedefs.  The change fixes that
problem by not using C's squirrelly variable declaration list syntax.

> +   while (left <= right)
> +   {
> +   size_t  mid = (left + right) / 2;
>
> The computation for midpoint should be left + (right-left)/2.

Right, my way can overflow.  Will fix.  Thanks!




Re: A qsort template

2021-06-16 Thread Zhihong Yu
On Tue, Jun 15, 2021 at 10:55 PM Thomas Munro 
wrote:

> On Mon, Mar 15, 2021 at 1:09 PM Thomas Munro 
> wrote:
> > On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu  wrote:
> > > + * Remove duplicates from an array.  Return the new size.
> > > + */
> > > +ST_SCOPE size_t
> > > +ST_UNIQUE(ST_ELEMENT_TYPE *array,
> > >
> > > The array is supposed to be sorted, right ?
> > > The comment should mention this.
> >
> > Good point, will update.  Thanks!
>
> Rebased.  Also fixed some formatting problems and updated
> typedefs.list so they don't come back.
>

Hi,
In 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch :

-   const ST_ELEMENT_TYPE *
ST_SORT_PROTO_ARG);
+   const ST_ELEMENT_TYPE
*ST_SORT_PROTO_ARG);

It seems there is no real change in the line above. Better keep the
original formation.

  *   - ST_COMPARE_ARG_TYPE - type of extra argument
  *
+ *   To say that the comparator returns a type other than int, use:
+ *
+ *   - ST_COMPARE_TYPE - an integer type

Since the ST_COMPARE_TYPE is meant to designate the type of the return
value, maybe ST_COMPARE_RET_TYPE would be better name.
It also goes with ST_COMPARE_ARG_TYPE preceding this.

-   ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data,
-  *pa,
-  *pb,
-  *pc,
-  *pd,
-  *pl,
-  *pm,
-  *pn;
+   ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data;
+   ST_POINTER_TYPE *pa;

There doesn't seem to be material change for the above hunk.

+   while (left <= right)
+   {
+   size_t  mid = (left + right) / 2;

The computation for midpoint should be left + (right-left)/2.

Cheers


Re: A qsort template

2021-06-15 Thread Thomas Munro
On Mon, Mar 15, 2021 at 1:09 PM Thomas Munro  wrote:
> On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu  wrote:
> > + * Remove duplicates from an array.  Return the new size.
> > + */
> > +ST_SCOPE size_t
> > +ST_UNIQUE(ST_ELEMENT_TYPE *array,
> >
> > The array is supposed to be sorted, right ?
> > The comment should mention this.
>
> Good point, will update.  Thanks!

Rebased.  Also fixed some formatting problems and updated
typedefs.list so they don't come back.


more-sort-search-specializations-v2.tgz
Description: application/compressed-tar


Re: A qsort template

2021-03-14 Thread Thomas Munro
On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu  wrote:
> For 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch :
>
> + * Remove duplicates from an array.  Return the new size.
> + */
> +ST_SCOPE size_t
> +ST_UNIQUE(ST_ELEMENT_TYPE *array,
>
> The array is supposed to be sorted, right ?
> The comment should mention this.

Good point, will update.  Thanks!




Re: A qsort template

2021-03-13 Thread Zhihong Yu
Hi,

For 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch :

+ * Remove duplicates from an array.  Return the new size.
+ */
+ST_SCOPE size_t
+ST_UNIQUE(ST_ELEMENT_TYPE *array,

The array is supposed to be sorted, right ?
The comment should mention this.

Cheers

On Sat, Mar 13, 2021 at 6:36 PM Thomas Munro  wrote:

> On Sat, Mar 13, 2021 at 3:49 PM Thomas Munro 
> wrote:
> > On Fri, Mar 12, 2021 at 7:58 AM Andres Freund 
> wrote:
> > > I wish we had the same for bsearch... :)
> >
> > Glibc already has the definition of the traditional void-based
> > function in /usr/include/bits/stdlib-bsearch.h, so the generated code
> > when the compiler can see the comparator definition is already good in
> > eg lazy_tid_reaped() and eg some nbtree search routines.  We could
> > probably expose more trivial comparators in headers to get more of
> > that, and we could perhaps put our own bsearch definition in a header
> > for other platforms that didn't think of that...
> >
> > It might be worth doing type-safe macro templates as well, though (as
> > I already did in an earlier proposal[1]), just to have nice type safe
> > code though, not sure, I'm thinking about that...
>
> I remembered a very good reason to do this: the ability to do
> branch-free comparators in more places by introducing optional wider
> results.  That's good for TIDs (needs 49 bits), and places that want
> to "reverse" a traditional comparator (just doing -result on an int
> comparator that might theoretically return INT_MIN requires at least
> 33 bits).  So I rebased the relevant parts of my earlier version, and
> went through and wrote a bunch of examples to demonstrate all this
> stuff actually working.
>
> There are two categories of change in these patches:
>
> 0002-0005: Places that sort/unique/search OIDs, BlockNumbers and TIDs,
> which can reuse a small set of typed functions (a few more could be
> added, if useful).  See sortitemptr.h and sortscalar.h.  Mostly this
> is just a notational improvement, and an excuse to drop a bunch of
> duplicated code.  In a few places this might really speed something
> important up!  Like VACUUM's lazy_tid_reaped().
>
> 0006-0009.  Places where a specialised function is generated for one
> special purpose, such as ANALYZE's HeapTuple sort, tidbitmap.c's
> pagetable sort,  some places in nbtree code etc.  These may require
> some case-by-case research on whether the extra executable size is
> worth the speedup, and there are surely more opportunities like that;
> I just picked on these arbitrarily.
>


Re: A qsort template

2021-03-13 Thread Thomas Munro
On Sat, Mar 13, 2021 at 3:49 PM Thomas Munro  wrote:
> On Fri, Mar 12, 2021 at 7:58 AM Andres Freund  wrote:
> > I wish we had the same for bsearch... :)
>
> Glibc already has the definition of the traditional void-based
> function in /usr/include/bits/stdlib-bsearch.h, so the generated code
> when the compiler can see the comparator definition is already good in
> eg lazy_tid_reaped() and eg some nbtree search routines.  We could
> probably expose more trivial comparators in headers to get more of
> that, and we could perhaps put our own bsearch definition in a header
> for other platforms that didn't think of that...
>
> It might be worth doing type-safe macro templates as well, though (as
> I already did in an earlier proposal[1]), just to have nice type safe
> code though, not sure, I'm thinking about that...

I remembered a very good reason to do this: the ability to do
branch-free comparators in more places by introducing optional wider
results.  That's good for TIDs (needs 49 bits), and places that want
to "reverse" a traditional comparator (just doing -result on an int
comparator that might theoretically return INT_MIN requires at least
33 bits).  So I rebased the relevant parts of my earlier version, and
went through and wrote a bunch of examples to demonstrate all this
stuff actually working.

There are two categories of change in these patches:

0002-0005: Places that sort/unique/search OIDs, BlockNumbers and TIDs,
which can reuse a small set of typed functions (a few more could be
added, if useful).  See sortitemptr.h and sortscalar.h.  Mostly this
is just a notational improvement, and an excuse to drop a bunch of
duplicated code.  In a few places this might really speed something
important up!  Like VACUUM's lazy_tid_reaped().

0006-0009.  Places where a specialised function is generated for one
special purpose, such as ANALYZE's HeapTuple sort, tidbitmap.c's
pagetable sort,  some places in nbtree code etc.  These may require
some case-by-case research on whether the extra executable size is
worth the speedup, and there are surely more opportunities like that;
I just picked on these arbitrarily.
From d0fb306d2720d14be2d46f4ae4198b25a33a95fa Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 13 Mar 2021 17:29:44 +1300
Subject: [PATCH 1/9] Add bsearch and unique templates to sort_template.h.

Since binary search and uniquify are so closely related to sorting,
let's optionally generate compatible functions at the same time.

Also, optionally support comparators with wider return types.  This will
allow us to do more branchless comparators.

Also, adjust the ST_SORT template to cope with pointer types.
---
 src/include/lib/sort_template.h | 142 
 1 file changed, 124 insertions(+), 18 deletions(-)

diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
index 771c789ced..a6097e1ac5 100644
--- a/src/include/lib/sort_template.h
+++ b/src/include/lib/sort_template.h
@@ -3,7 +3,8 @@
  * sort_template.h
  *
  *	  A template for a sort algorithm that supports varying degrees of
- *	  specialization.
+ *	  specialization.  Also related algorithms for binary search and
+ *	  unique, on sorted arrays.
  *
  * Copyright (c) 2021, PostgreSQL Global Development Group
  * Portions Copyright (c) 1992-1994, Regents of the University of California
@@ -13,7 +14,9 @@
  *	  To generate functions specialized for a type, the following parameter
  *	  macros should be #define'd before this file is included.
  *
- *	  - ST_SORT - the name of a sort function to be generated
+ *	  - ST_SORT - if defined the name of a sort function
+ *	  - ST_UNIQUE - if defined the name of a unique function
+ *	  - ST_SEARCH - if defined the name of a search function
  *	  - ST_ELEMENT_TYPE - type of the referenced elements
  *	  - ST_DECLARE - if defined the functions and types are declared
  *	  - ST_DEFINE - if defined the functions and types are defined
@@ -40,6 +43,10 @@
  *
  *	  - ST_COMPARE_ARG_TYPE - type of extra argument
  *
+ *	  To say that the comparator returns a type other than int, use:
+ *
+ * 	  - ST_COMPARE_TYPE - an integer type
+ *
  *	  The prototype of the generated sort function is:
  *
  *	  void ST_SORT(ST_ELEMENT_TYPE *data, size_t n,
@@ -179,13 +186,35 @@ typedef int (*ST_COMPARATOR_TYPE_NAME) (const ST_ELEMENT_TYPE *,
 		const ST_ELEMENT_TYPE *ST_SORT_PROTO_ARG);
 #endif
 
+#ifdef ST_SORT
 /* Declare the sort function.  Note optional arguments at end. */
-ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE *first, size_t n
+ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE *array,
+	  size_t n
 	  ST_SORT_PROTO_ELEMENT_SIZE
 	  ST_SORT_PROTO_COMPARE
 	  ST_SORT_PROTO_ARG);
+#endif /* ST_SORT */
 
-#endif
+#ifdef ST_SEARCH
+/* Declare the search function. */
+ST_SCOPE ST_ELEMENT_TYPE *ST_SEARCH(ST_ELEMENT_TYPE *value,
+	ST_ELEMENT_TYPE *array,
+	size_t n
+	ST_SORT_PROTO_ELEMENT_SIZE
+	ST_SORT_PROTO_COMPARE
+	

Re: A qsort template

2021-03-12 Thread Thomas Munro
On Fri, Mar 12, 2021 at 7:58 AM Andres Freund  wrote:
> I wish we had the same for bsearch... :)

Glibc already has the definition of the traditional void-based
function in /usr/include/bits/stdlib-bsearch.h, so the generated code
when the compiler can see the comparator definition is already good in
eg lazy_tid_reaped() and eg some nbtree search routines.  We could
probably expose more trivial comparators in headers to get more of
that, and we could perhaps put our own bsearch definition in a header
for other platforms that didn't think of that...

It might be worth doing type-safe macro templates as well, though (as
I already did in an earlier proposal[1]), just to have nice type safe
code though, not sure, I'm thinking about that...

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGLY47Cvu62mFDT53Ya0P95cGggcBN6R6aLpx6%3DGm5j%2B1A%40mail.gmail.com




Re: A qsort template

2021-03-11 Thread Andres Freund
Hi,

I wish we had the same for bsearch... :)


On 2021-03-03 17:17:13 +1300, Thomas Munro wrote:
> As for which cases are actually worth specialising, I've attached the
> example that Andres mentioned earlier; it seems like a reasonable
> candidate to go ahead and commit too, but I realised that I'd
> forgotten to attach it earlier.

> From 4cec5cb9a2e0c50726b7337fb8221281e155c4cd Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Thu, 18 Feb 2021 14:47:28 +1300
> Subject: [PATCH] Specialize checkpointer sort functions.
> 
> When sorting a potentially large number of dirty buffers, the
> checkpointer can benefit from a faster sort routine.  One reported
> improvement on a large buffer pool system was 1.4s -> 0.6s.
> 
> Discussion: 
> https://postgr.es/m/CA%2BhUKGJ2-eaDqAum5bxhpMNhvuJmRDZxB_Tow0n-gse%2BHG0Yig%40mail.gmail.com

Looks good to me.

Greetings,

Andres Freund




Re: A qsort template

2021-03-02 Thread Thomas Munro
On Wed, Mar 3, 2021 at 10:25 AM Daniel Gustafsson  wrote:
> > On 18 Feb 2021, at 04:09, Thomas Munro  wrote:
> > In another thread[1], I proposed $SUBJECT, but then we found a better
> > solution to that thread's specific problem.  The general idea is still
> > good though: it's possible to (1) replace several existing copies of
> > our qsort algorithm with one, and (2) make new specialised versions a
> > bit more easily than the existing Perl generator allows.  So, I'm back
> > with a rebased stack of patches.  I'll leave specific cases for new
> > worthwhile specialisations for separate proposals; I've heard about
> > several.
>
> Just to play around with this while reviewing I made a qsort_strcmp, like in
> the attached, and tested it using a ~9M word [0] randomly shuffled wordlist.
> While being too small input to make any meaningful difference in runtime (it
> shaved a hair off but it might well be within the error margin) there was no
> regression either.  More importantly, it was really simple and quick to make a
> tailored qsort which is the intention with the patch.  While still being a bit
> of magic, moving from the Perl generator makes this slightly less magic IMO so
> +1 on this approach.

Thanks for testing and reviewing!

> A tiny nitpick on the patch itself:
>
> + *   - ST_COMPARE(a, b) - a simple comparison expression
> + *   - ST_COMPARE(a, b, arg) - variant that takes an extra argument
> Indentation.

Fixed.  Also ran pgindent.

> All tests pass and the documentation in the the sort_template.h is enough to 
> go
> on, but I would prefer to see a comment in port/qsort.c referring back to
> sort_template.h for documentation.

I tried adding a comment along the lines "see lib/sort_template.h for
details", but it felt pretty redundant, when the file contains very
little other than #include "lib/sort_template.h" which should already
tell you to go and look there to find out what this is about...

I went ahead and pushed these.

I am sure there are plenty of opportunities to experiment with this
code.  Here are some I recall Peter Geoghegan mentioning:

1.  If you know that elements are unique, you could remove some
branches that deal with equal elements (see "r == 0").
2.  Perhaps you might want to be able to disable the "presorted" check
in some cases?
3.  The parameters 7, 7 and 40 were probably tuned for an ancient Vax
or similar[1].  We see higher insertion sort thesholds such as 27 in
more recent sort algorithms[2] used in eg the JVM.  You could perhaps
speculate that the right answer depends in part on the element size; I
dunno, but if so, here we have that at compile time while traditional
qsort() does not.

As for which cases are actually worth specialising, I've attached the
example that Andres mentioned earlier; it seems like a reasonable
candidate to go ahead and commit too, but I realised that I'd
forgotten to attach it earlier.

It's possible that the existing support sorting tuples could be
further specialised for common sort key data types; I haven't tried
that.

[1] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8162=rep1=pdf
[2] https://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
From 4cec5cb9a2e0c50726b7337fb8221281e155c4cd Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 18 Feb 2021 14:47:28 +1300
Subject: [PATCH] Specialize checkpointer sort functions.

When sorting a potentially large number of dirty buffers, the
checkpointer can benefit from a faster sort routine.  One reported
improvement on a large buffer pool system was 1.4s -> 0.6s.

Discussion: https://postgr.es/m/CA%2BhUKGJ2-eaDqAum5bxhpMNhvuJmRDZxB_Tow0n-gse%2BHG0Yig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 37 +
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..7adec2ddc3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -488,8 +488,8 @@ static void FindAndDropRelFileNodeBuffers(RelFileNode rnode,
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
-static int	buffertag_comparator(const void *p1, const void *p2);
-static int	ckpt_buforder_comparator(const void *pa, const void *pb);
+static inline int buffertag_comparator(const BufferTag *a, const BufferTag *b);
+static inline int ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b);
 static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 
 
@@ -1831,6 +1831,13 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
 	}
 }
 
+#define ST_SORT sort_checkpoint_bufferids
+#define ST_ELEMENT_TYPE CkptSortItem
+#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
+#define ST_SCOPE static
+#define ST_DEFINE
+#include 
+
 /*
  * BufferSync -- Write out all dirty buffers in the pool.
  *
@@ -1931,8 +1938,7 

Re: A qsort template

2021-03-02 Thread Daniel Gustafsson
> On 18 Feb 2021, at 04:09, Thomas Munro  wrote:

> In another thread[1], I proposed $SUBJECT, but then we found a better
> solution to that thread's specific problem.  The general idea is still
> good though: it's possible to (1) replace several existing copies of
> our qsort algorithm with one, and (2) make new specialised versions a
> bit more easily than the existing Perl generator allows.  So, I'm back
> with a rebased stack of patches.  I'll leave specific cases for new
> worthwhile specialisations for separate proposals; I've heard about
> several.

Just to play around with this while reviewing I made a qsort_strcmp, like in
the attached, and tested it using a ~9M word [0] randomly shuffled wordlist.
While being too small input to make any meaningful difference in runtime (it
shaved a hair off but it might well be within the error margin) there was no
regression either.  More importantly, it was really simple and quick to make a
tailored qsort which is the intention with the patch.  While still being a bit
of magic, moving from the Perl generator makes this slightly less magic IMO so
+1 on this approach.

A tiny nitpick on the patch itself:

+ *   - ST_COMPARE(a, b) - a simple comparison expression
+ *   - ST_COMPARE(a, b, arg) - variant that takes an extra argument
Indentation.

All tests pass and the documentation in the the sort_template.h is enough to go
on, but I would prefer to see a comment in port/qsort.c referring back to
sort_template.h for documentation.

--
Daniel Gustafsson   https://vmware.com/

[0] https://github.com/dwyl/english-words/ shuffled 20 times over



qsort_tmpl_strcmp-patch
Description: Binary data


Re: A qsort template

2021-02-17 Thread Andres Freund
Hi,

On 2021-02-18 16:09:49 +1300, Thomas Munro wrote:
> In another thread[1], I proposed $SUBJECT, but then we found a better
> solution to that thread's specific problem.  The general idea is still
> good though: it's possible to (1) replace several existing copies of
> our qsort algorithm with one, and (2) make new specialised versions a
> bit more easily than the existing Perl generator allows.  So, I'm back
> with a rebased stack of patches.  I'll leave specific cases for new
> worthwhile specialisations for separate proposals; I've heard about
> several.

One place that could benefit is the qsort that BufferSync() does at the
start. I tried your patch for that, and it does reduce the sort time
considerably. For 64GB of mostly dirty shared_buffers from ~1.4s to
0.6s.

Now, obviously one can argue that that's not going to be the crucial
spot, and wouldn't be entirely wrong. OTOH, in my AIO branch I see
checkpointer doing ~10GB/s, leading to the sort being a measurable
portion of the overall time.

Greetings,

Andres Freund




A qsort template

2021-02-17 Thread Thomas Munro
Hello,

In another thread[1], I proposed $SUBJECT, but then we found a better
solution to that thread's specific problem.  The general idea is still
good though: it's possible to (1) replace several existing copies of
our qsort algorithm with one, and (2) make new specialised versions a
bit more easily than the existing Perl generator allows.  So, I'm back
with a rebased stack of patches.  I'll leave specific cases for new
worthwhile specialisations for separate proposals; I've heard about
several.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw%40mail.gmail.com
From edfa27a45eca139d683bd0a02136e1da5a489243 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 17 Aug 2020 21:31:56 +1200
Subject: [PATCH 1/3] Add sort_template.h for making fast sort functions.

Move our qsort implementation into a header that can be used to
define specialized functions for better performance.
---
 src/include/lib/sort_template.h | 431 
 1 file changed, 431 insertions(+)
 create mode 100644 src/include/lib/sort_template.h

diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
new file mode 100644
index 00..09523a8e4c
--- /dev/null
+++ b/src/include/lib/sort_template.h
@@ -0,0 +1,431 @@
+/*-
+ *
+ * sort_template.h
+ *
+ *	  A template for a sort algorithm that supports varying degrees of
+ *	  specialization.
+ *
+ * Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1992-1994, Regents of the University of California
+ *
+ * Usage notes:
+ *
+ *	  To generate functions specialized for a type, the following parameter
+ *	  macros should be #define'd before this file is included.
+ *
+ *	  - ST_SORT - the name of a sort function to be generated
+ *	  - ST_ELEMENT_TYPE - type of the referenced elements
+ *	  - ST_DECLARE - if defined the functions and types are declared
+ *	  - ST_DEFINE - if defined the functions and types are defined
+ *	  - ST_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ *	  Instead of ST_ELEMENT_TYPE, ST_ELEMENT_TYPE_VOID can be defined.  Then
+ *	  the generated functions will automatically gain an "element_size"
+ *	  parameter.  This allows us to generate a traditional qsort function.
+ *
+ *	  One of the following macros must be defined, to show how to compare
+ *	  elements.  The first two options are arbitrary expressions depending
+ *	  on whether an extra pass-through argument is desired, and the third
+ *	  option should be defined if the sort function should receive a
+ *	  function pointer at runtime.
+ *
+ * 	  - ST_COMPARE(a, b) - a simple comparison expression
+ *	  - ST_COMPARE(a, b, arg) - variant that takes an extra argument
+ *	  - ST_COMPARE_RUNTIME_POINTER - sort function takes a function pointer
+ *
+ *	  To say that the comparator and therefore also sort function should
+ *	  receive an extra pass-through argument, specify the type of the
+ *	  argument.
+ *
+ *	  - ST_COMPARE_ARG_TYPE - type of extra argument
+ *
+ *	  The prototype of the generated sort function is:
+ *
+ *	  void ST_SORT(ST_ELEMENT_TYPE *data, size_t n,
+ *   [size_t element_size,]
+ *   [ST_SORT_compare_function compare,]
+ *   [ST_COMPARE_ARG_TYPE *arg]);
+ *
+ *	  ST_SORT_compare_function is a function pointer of the following type:
+ *
+ *	  int (*)(const ST_ELEMENT_TYPE *a, const ST_ELEMENT_TYPE *b,
+ *			  [ST_COMPARE_ARG_TYPE *arg])
+ *
+ * HISTORY
+ *
+ *	  Modifications from vanilla NetBSD source:
+ *	  - Add do ... while() macro fix
+ *	  - Remove __inline, _DIAGASSERTs, __P
+ *	  - Remove ill-considered "swap_cnt" switch to insertion sort, in favor
+ *		of a simple check for presorted input.
+ *	  - Take care to recurse on the smaller partition, to bound stack usage
+ *	  - Convert into a header that can generate specialized functions
+ *
+ * IDENTIFICATION
+ *		src/include/lib/sort_template.h
+ *
+ *-
+ */
+
+/*	  $NetBSD: qsort.c,v 1.13 2003/08/07 16:43:42 agc Exp $   */
+
+/*-
+ * Copyright (c) 1992, 1993
+ *	  The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *	  notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *	  notice, this list of conditions and the following disclaimer in the
+ *	  documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *	  may be used to endorse or promote products derived from this software
+ *	  without specific prior written