Re: Considering additional sort specialisation functions for PG16

2023-02-16 Thread John Naylor
I wrote:
> Have two memtuple arrays, one for first sortkey null and one for first
sortkey non-null:

Hacking on this has gotten only as far as the "compiles but segfaults"
stage, but I wanted to note an idea that occurred to me:

Internal qsort doesn't need the srctape member, and removing both that and
isnull1 would allow 16-byte "init-tuples" for qsort, which would save a bit
of work_mem space, binary space for qsort specializations, and work done
during swaps.

During heap sort, we already copy one entry into a stack variable to keep
from clobbering it, so it's not a big step to read a member from the init
array and form a regular sorttuple from it.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Considering additional sort specialisation functions for PG16

2023-01-30 Thread John Naylor
I wrote:

> On Thu, Jan 26, 2023 at 7:15 PM David Rowley  wrote:
> > I think the slower sorts I found in [2] could also be partially caused
> > by the current sort specialisation comparators re-comparing the
> > leading column during a tie-break. I've not gotten around to disabling
> > the sort specialisations to see if and how much this is a factor for
> > that test.
>
> Right, that's worth addressing independently of the window function
consideration. I'm still swapping this area back in my head, but I believe
one issue is that state->base.onlyKey signals two things: "one sortkey, not
abbreviated". We could add a separate branch for "first key unabbreviated,
nkeys>1" -- I don't think we'd need to specialize, just branch -- and
instead of state->base.comparetup, call a set of analogous functions that
only handle keys 2 and above (comparetup_tail_* ? or possibly just add a
boolean parameter compare_first). That would not pose a huge challenge, I
think, since they're already written like this:
>
> /* Compare the leading sort key */
> compare = ApplySortComparator(...);
> if (compare != 0)
> return compare;
>
> /* Compare additional sort keys */
> ...
>
> The null/non-null separation would eliminate a bunch of branches in
inlined comparators, so we could afford to add another branch for number of
keys.

I gave this a go, and it turns out we don't need any extra branches in the
inlined comparators -- the new fallbacks are naturally written to account
for the "!onlyKey" case. If the first sortkey was abbreviated, call its
full comparator, otherwise skip to the next sortkey (if there isn't one, we
shouldn't have gotten here). The existing comparetup functions try the
simple case and then call the fallback (which could be inlined for them but
I haven't looked).

Tests pass, but I'm not sure yet if we need more tests. I don't have a
purpose-built benchmark at the moment, but I'll see if any of my existing
tests exercise this code path. I can also try the window function case
unless someone beats me to it.

--
John Naylor
EDB: http://www.enterprisedb.com
From 65b94223f8bdb726b97d695f0040c12197d5e65d Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 30 Jan 2023 17:10:00 +0700
Subject: [PATCH v1] Split out fallback functionality from comparetup*
 functions

Previously, if a specialized comparator find equal datum1 keys,
the comparetup function would call the full comparator on the
datum before proceeding with either the unabbreviated first key
or the second key.

Add a comparetup_fallback field for these that call special
fallback functions. The existing comparetup functions just call
ApplySortComparator where we can, than call our new fallback.
---
 src/backend/utils/sort/tuplesort.c |   6 +-
 src/backend/utils/sort/tuplesortvariants.c | 114 -
 src/include/utils/tuplesort.h  |   6 ++
 3 files changed, 96 insertions(+), 30 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9ca9835aab..54f64d6c2d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -513,7 +513,7 @@ qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;
 
-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_fallback(a, b, state);
 }
 
 #if SIZEOF_DATUM >= 8
@@ -537,7 +537,7 @@ qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;
 
-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_fallback(a, b, state);
 }
 #endif
 
@@ -561,7 +561,7 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;
 
-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_fallback(a, b, state);
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..e65ac49d9d 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -47,18 +47,24 @@ static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 			   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
 			Tuplesortstate *state);
+static int	comparetup_heap_fallback(const SortTuple *a, const SortTuple *b,
+			Tuplesortstate *state);
 static void writetup_heap(Tuplesortstate *state, LogicalTape *tape,
 		  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 		 LogicalTape *tape, unsigned int len);
 static int	comparetup_cluster(const SortTuple *a, const SortTuple *b,
 			   Tuplesortstate *state);
+static int	comparetup_cluster_fallback(const SortTuple *a, const SortTuple *b,
+			   Tuplesortstate *state);
 static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
 			 

Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread John Naylor
On Thu, Jan 26, 2023 at 7:15 PM David Rowley  wrote:
>
> On Thu, 26 Jan 2023 at 23:29, John Naylor 
wrote:
> > Coming back to this, I wanted to sketch out this idea in a bit more
detail.
> >
> > Have two memtuple arrays, one for first sortkey null and one for first
sortkey non-null:
> > - Qsort the non-null array, including whatever specialization is
available. Existing specialized comparators could ignore nulls (and their
ordering) taking less space in the binary.
> > - Only if there is more than one sort key, qsort the null array.
Ideally at some point we would have a method of ignoring the first sortkey
(this is an existing opportunity that applies elsewhere as well).
> > - To handle two arrays, grow_memtuples() would need some adjustment, as
would any callers that read the final result of an in-memory sort -- they
would need to retrieve the tuples starting with the appropriate array
depending on NULLS FIRST/LAST behavior.
>
> Thanks for coming back to this. I've been thinking about this again
> recently due to what was discovered in [1].

That was indeed part of the motivation for bringing this up.

> Why I'm bringing this up here is that I wondered, in addition to what
> you're mentioning above, if we're making some changes to allow
> multiple in-memory arrays, would it be worth going a little further
> and allowing it so we could have N arrays which we'd try to keep under
> L3 cache size. Maybe some new GUC could be used to know what a good
> value is.  With fast enough disks, it's often faster to use smaller
> values of work_mem which don't exceed L3 to keep these batches
> smaller.  Keeping them in memory would remove the need to write out
> tapes.

That's interesting. I don't know enough to guess how complex it would be to
make "external" merges agnostic about whether the tapes are on disk or in
memory.

If in-memory sorts were designed analogously to external ones,
grow_memtuples would never have to repalloc, it could just allocate a new
tape, which could further shave some cycles.

My hunch in my last email was that having separate groups of tapes for each
null/non-null first sortkey would be complex, because it increases the
number of places that have to know about nulls and their ordering. If we
wanted to go to N arrays instead of 2, and additionally wanted separate
null/non-null treatment, it seems we would still need 2 sets of arrays, one
with N non-null-first-sortkey tapes and one with M null-first-sortkey tapes.

So using 2 arrays seems like the logical first step. I'd be curious to hear
other possible development paths.

> I think the slower sorts I found in [2] could also be partially caused
> by the current sort specialisation comparators re-comparing the
> leading column during a tie-break. I've not gotten around to disabling
> the sort specialisations to see if and how much this is a factor for
> that test.

Right, that's worth addressing independently of the window function
consideration. I'm still swapping this area back in my head, but I believe
one issue is that state->base.onlyKey signals two things: "one sortkey, not
abbreviated". We could add a separate branch for "first key unabbreviated,
nkeys>1" -- I don't think we'd need to specialize, just branch -- and
instead of state->base.comparetup, call a set of analogous functions that
only handle keys 2 and above (comparetup_tail_* ? or possibly just add a
boolean parameter compare_first). That would not pose a huge challenge, I
think, since they're already written like this:

/* Compare the leading sort key */
compare = ApplySortComparator(...);
if (compare != 0)
return compare;

/* Compare additional sort keys */
...

The null/non-null separation would eliminate a bunch of branches in inlined
comparators, so we could afford to add another branch for number of keys.

I haven't thought through either of these ideas in the gory detail, but I
don't yet see any big obstacles.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread David Rowley
On Thu, 26 Jan 2023 at 23:29, John Naylor  wrote:
> Coming back to this, I wanted to sketch out this idea in a bit more detail.
>
> Have two memtuple arrays, one for first sortkey null and one for first 
> sortkey non-null:
> - Qsort the non-null array, including whatever specialization is available. 
> Existing specialized comparators could ignore nulls (and their ordering) 
> taking less space in the binary.
> - Only if there is more than one sort key, qsort the null array. Ideally at 
> some point we would have a method of ignoring the first sortkey (this is an 
> existing opportunity that applies elsewhere as well).
> - To handle two arrays, grow_memtuples() would need some adjustment, as would 
> any callers that read the final result of an in-memory sort -- they would 
> need to retrieve the tuples starting with the appropriate array depending on 
> NULLS FIRST/LAST behavior.

Thanks for coming back to this. I've been thinking about this again
recently due to what was discovered in [1].  Basically, the patch on
that thread is trying to eliminate the need for the ORDER BY sort in a
query such as: SELECT a,b,row_number() over (order by a) from ab order
by a,b;  the idea is to just perform the full sort on a,b for the
WindowClause so save from having to do an Incremental Sort on b for
the ORDER BY after evaluating the window funcs.  Surprisingly (for
me), we found a bunch of cases where the performance is better to do a
sort on some of the keys, then do an Incremental sort on the remainder
rather than just doing a single sort on everything.

I don't really understand why this is fully yet, but one theory I have
is that it might be down to work_mem being larger than the CPU's L3
cache and causing more cache line misses.  With the more simple sort,
there's less swapping of items in the array because the comparison
function sees tuples as equal more often.  I did find that the gap
between the two is not as large with fewer tuples to sort.

I think the slower sorts I found in [2] could also be partially caused
by the current sort specialisation comparators re-comparing the
leading column during a tie-break. I've not gotten around to disabling
the sort specialisations to see if and how much this is a factor for
that test.

Why I'm bringing this up here is that I wondered, in addition to what
you're mentioning above, if we're making some changes to allow
multiple in-memory arrays, would it be worth going a little further
and allowing it so we could have N arrays which we'd try to keep under
L3 cache size. Maybe some new GUC could be used to know what a good
value is.  With fast enough disks, it's often faster to use smaller
values of work_mem which don't exceed L3 to keep these batches
smaller.  Keeping them in memory would remove the need to write out
tapes.

I don't really know exactly if such a feature could easily be tagged
on to what you propose above, but I feel like what you're talking
about is part of the way towards that at least, maybe at the least,
the additional work could be kept in mind when this is written so that
it's easier to extend in the future.

David

[1] 
https://postgr.es/m/CAApHDvpAO5H_L84kn9gCJ_hihOavtmDjimKYyftjWtF69BJ=8...@mail.gmail.com
[2] 
https://postgr.es/m/CAApHDvqh%2BqOHk4sbvvy%3DQr2NjPqAAVYf82oXY0g%3DZ2hRpC2Vmg%40mail.gmail.com




Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread John Naylor
On Thu, Jan 26, 2023 at 6:14 PM Pavel Borisov 
wrote:

> > - Only if there is more than one sort key, qsort the null array.
Ideally at some point we would have a method of ignoring the first sortkey
(this is an existing opportunity that applies elsewhere as well).

> Should we need to sort by the second sort key provided the first one
> in NULL by standard or by some part of the code relying on this? I

I'm not sure I quite understand the question.

If there is more than one sort key, and the specialized comparison on the
first key gives a definitive zero result, it falls back to comparing all
keys from the full tuple. (The sorttuple struct only contains the first
sortkey, which might actually be an abbreviated key.) A possible
optimization, relevant here and also elsewhere, is to compare only using
keys starting from key2. But note: if the first key is abbreviated, a zero
result is not definitive, and we must check the first key's full value from
the tuple.

> suppose NULL values in the first sort key mean attribute values are
> undefined and there is no preferred order between these tuples, even
> if their second sort keys are different.

There is in fact a preferred order between these tuples -- the second key
is the tie breaker in this case.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread Pavel Borisov
Hi, John!
Generally, I like the separation of non-null values before sorting and
would like to join as a reviewer when we come to patch. I have only a
small question:

> - Only if there is more than one sort key, qsort the null array. Ideally at 
> some point we would have a method of ignoring the first sortkey (this is an 
> existing opportunity that applies elsewhere as well).
Should we need to sort by the second sort key provided the first one
in NULL by standard or by some part of the code relying on this? I
suppose NULL values in the first sort key mean attribute values are
undefined and there is no preferred order between these tuples, even
if their second sort keys are different.

And maybe (unlikely IMO) we need some analog of NULLS DISCTICNT/NOT
DISTINCT in this scope?

Kind regards,
Pavel Borisov,
Supabase.




Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread John Naylor
On Tue, Aug 23, 2022 at 1:13 PM John Naylor 
wrote:
>
> On Tue, Aug 23, 2022 at 11:24 AM David Rowley 
wrote:
> >
> > On Tue, 23 Aug 2022 at 15:22, John Naylor 
wrote:
> > > Did you happen to see
> > >
> > >
https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com
> >
> > I missed that.  It looks like a much more promising idea than what I
> > came up with. I've not looked at your code yet, but I'm interested and
> > will aim to look soon.
>
> Note that I haven't actually implemented this idea yet, just tried to
> model the effects by lobotomizing the current comparators. I think
> it's worth pursuing and will try to come back to it this cycle, but if
> you or anyone else wants to try, that's fine of course.

Coming back to this, I wanted to sketch out this idea in a bit more detail.

Have two memtuple arrays, one for first sortkey null and one for first
sortkey non-null:
- Qsort the non-null array, including whatever specialization is available.
Existing specialized comparators could ignore nulls (and their ordering)
taking less space in the binary.
- Only if there is more than one sort key, qsort the null array. Ideally at
some point we would have a method of ignoring the first sortkey (this is an
existing opportunity that applies elsewhere as well).
- To handle two arrays, grow_memtuples() would need some adjustment, as
would any callers that read the final result of an in-memory sort -- they
would need to retrieve the tuples starting with the appropriate array
depending on NULLS FIRST/LAST behavior.

I believe external merges wouldn't have to do anything different, since
when writing out the tapes, we read from the arrays in the right order.

(One could extend this idea further and have two pools of tapes for null
and non-null first sortkey, that are merged separately, in the right order.
That sounds like quite a bit more complexity than is worth, however.)

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Considering additional sort specialisation functions for PG16

2022-08-23 Thread John Naylor
On Tue, Aug 23, 2022 at 11:24 AM David Rowley  wrote:
>
> On Tue, 23 Aug 2022 at 15:22, John Naylor  
> wrote:
> > Did you happen to see
> >
> > https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com
>
> I missed that.  It looks like a much more promising idea than what I
> came up with. I've not looked at your code yet, but I'm interested and
> will aim to look soon.

Note that I haven't actually implemented this idea yet, just tried to
model the effects by lobotomizing the current comparators. I think
it's worth pursuing and will try to come back to it this cycle, but if
you or anyone else wants to try, that's fine of course.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Considering additional sort specialisation functions for PG16

2022-08-22 Thread David Rowley
On Tue, 23 Aug 2022 at 15:22, John Naylor  wrote:
> Did you happen to see
>
> https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com

I missed that.  It looks like a much more promising idea than what I
came up with. I've not looked at your code yet, but I'm interested and
will aim to look soon.

David




Re: Considering additional sort specialisation functions for PG16

2022-08-22 Thread John Naylor
On Tue, Aug 23, 2022 at 9:18 AM David Rowley  wrote:
>
> I have the following dimensions in mind for consideration:
>
> 1. Specialisations to handle sorting of non-null datums (eliminates
> checking for nulls in the comparison function)
> 2. Specialisations to handle single column sorts (eliminates
> tiebreaker function call or any checks for existence of tiebreaker)
> 3. ASC sort (No need for if (ssup->ssup_reverse) 
> INVERT_COMPARE_RESULT(compare))
>
> If we did all of the above then we'd end up with 3 * 2 * 2 * 2 = 24
> specialization functions.  That seems a bit excessive. So here I'd
> like to discuss which ones we should add, if any.
>
> I've attached a very basic implementation of #1 which adds 3 new
> functions for sorting non-null datums.

Did you happen to see

https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com

where I experimented with removing all null handling? What I had in
mind was pre-partitioning nulls and non-nulls when populating the
SortTuple array, then calling qsort twice, once with the non-null
partition with comparators that assume non-null, and (only if there
are additional sort keys) once on the null partition. And the
pre-partitioning would take care of nulls first/last upfront. I
haven't looked into the feasibility of this yet, but the good thing
about the concept is that it removes null handling in the comparators
without additional sort specializations.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Considering additional sort specialisation functions for PG16

2022-08-22 Thread David Rowley
Hi,

(Background: 697492434 added 3 new sort functions to remove the
indirect function calls for the comparator function.  This sped up
sorting for various of our built-in data types.)

There was a bit of unfinished discussion around exactly how far to
take these specialisations for PG15.  We could certainly add more.

There are various other things we could do to further speed up sorting
for these datatypes.  One example is, we could add 3 more variations
of these functions that can be called when there are no NULL datums to
sort.  That effectively multiplies the number of specialisations by 2,
or adds another dimension.

I have the following dimensions in mind for consideration:

1. Specialisations to handle sorting of non-null datums (eliminates
checking for nulls in the comparison function)
2. Specialisations to handle single column sorts (eliminates
tiebreaker function call or any checks for existence of tiebreaker)
3. ASC sort (No need for if (ssup->ssup_reverse) INVERT_COMPARE_RESULT(compare))

If we did all of the above then we'd end up with 3 * 2 * 2 * 2 = 24
specialization functions.  That seems a bit excessive. So here I'd
like to discuss which ones we should add, if any.

I've attached a very basic implementation of #1 which adds 3 new
functions for sorting non-null datums.  This could be made a bit more
advanced.  For now, I just added a bool flag to track if we have any
NULL datum1s in memtuples[].  For bounded sorts, we may remove NULLs
from that array, and may end up with no nulls after having seen null.
So maybe a count would be better than a flag.

A quick performance test with 1 million random INTs shows ~6%
performance improvement when there are no nulls.

Master
$ pgbench -n -f bench.sql -T 60 postgres
latency average = 159.837 ms
latency average = 161.193 ms
latency average = 159.512 ms

master + not_null_sort_specializations.patch
$ pgbench -n -f bench.sql -T 60 postgres
latency average = 150.791 ms
latency average = 149.843 ms
latency average = 150.319 ms

I didn't test for any regression when there are NULLs and we're unable
to use the new specializations. I'm hoping the null tracking will be
almost free, but I will need to check.

It's all quite subjective to know which specializations should be
added. I think #1 is likely to have the biggest wins when it can be
used as it removes the most branching in the comparator function,
however the biggest gains are not the only thing to consider. We also
need to consider how commonly these functions will be used. I don't
have any information about that.

David
diff --git a/src/backend/utils/sort/tuplesort.c 
b/src/backend/utils/sort/tuplesort.c
index d90a1aebdf..52927f93b8 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -218,6 +218,8 @@ struct Tuplesortstate
int memtupsize; /* allocated length of 
memtuples array */
boolgrowmemtuples;  /* memtuples' growth still underway? */
 
+   boolmemtuples_hasnull;  /* does any SortTuple in 
memtuples have
+* 
isnull1 set to true? */
/*
 * Memory for tuples is sometimes allocated using a simple slab 
allocator,
 * rather than with palloc().  Currently, we switch to slab allocation
@@ -496,7 +498,10 @@ static void tuplesort_updatemax(Tuplesortstate *state);
  * abbreviations of text or multi-key sorts.  There could be!  Is it worth it?
  */
 
-/* Used if first key's comparator is ssup_datum_unsigned_compare */
+/*
+ * Used if first key's comparator is ssup_datum_unsigned_compare and
+ * state->memtuples_hasnull is true
+ */
 static pg_attribute_always_inline int
 qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 {
@@ -518,8 +523,39 @@ qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, 
Tuplesortstate *state)
return state->base.comparetup(a, b, state);
 }
 
+/*
+ * Used if first key's comparator is ssup_datum_unsigned_compare and
+ * state->memtuples_hasnull is false
+ */
+static pg_attribute_always_inline int
+qsort_tuple_unsigned_compare_notnull(SortTuple *a, SortTuple *b,
+
Tuplesortstate *state)
+{
+   int compare;
+
+   /* make sure the null tracking code didn't mess up */
+   Assert(!a->isnull1 && !b->isnull1);
+
+   compare = ApplyUnsignedSortComparatorNotNull(a->datum1, b->datum1,
+   
 >base.sortKeys[0]);
+   if (compare != 0)
+   return compare;
+
+   /*
+* No need to waste effort calling the tiebreak function when there are 
no
+* other keys to sort on.
+*/
+   if (state->base.onlyKey != NULL)
+   return 0;
+
+   return state->base.comparetup(a, b, state);
+}
+
 #if