Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-21 Thread David Rowley
On Wed, 21 Jul 2021 at 22:09, David Rowley  wrote:
>
> On Wed, 21 Jul 2021 at 13:39, James Coleman  wrote:
> > Thanks for doing the math measuring how much we could impact things.
> >
> > I'm +lots on getting this committed as is.
>
> Ok good. I plan on taking a final look at the v10 patch tomorrow
> morning NZ time (about 12 hours from now) and if all is well, I'll
> push it.

Pushed.

David




RE: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-21 Thread houzj.f...@fujitsu.com
On July 22, 2021 8:38 AM David Rowley 
> On Thu, 22 Jul 2021 at 12:27, houzj.f...@fujitsu.com 
> wrote:
> > The above seems can be shorter like the following ?
> >
> > for (;;)
> > {
> > slot = ExecProcNode(outerNode);
> > if (TupIsNull(slot))
> > break;
> > if (node->datumSort)
> > {
> > slot_getsomeattrs(slot, 1);
> > tuplesort_putdatum(tuplesortstate,
> > slot->tts_values[0],
> > slot->tts_isnull[0]);
> > }
> > else
> > tuplesort_puttupleslot(tuplesortstate, slot); }
> 
> I don't think that's a good change.  It puts the branch inside the loop the 
> pulls
> all tuples from the subplan.  Given the loop is likely to be very hot combined
> with the fact that it's so simple, I'd much rather have two separate loops to
> keep the extra branch outside the loop.  It's true the branch predictor is 
> likely
> to get the prediction correct on each iteration, but unless the compiler
> rewrites this into two loops then the comparison and jump must be done per
> loop.

Ah, you are right, I missed that. Thanks for the explanation.

Best regards,
houzj


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-21 Thread David Rowley
On Thu, 22 Jul 2021 at 12:27, houzj.f...@fujitsu.com
 wrote:
> The above seems can be shorter like the following ?
>
> for (;;)
> {
> slot = ExecProcNode(outerNode);
> if (TupIsNull(slot))
> break;
> if (node->datumSort)
> {
> slot_getsomeattrs(slot, 1);
> tuplesort_putdatum(tuplesortstate,
> slot->tts_values[0],
> slot->tts_isnull[0]);
> }
> else
> tuplesort_puttupleslot(tuplesortstate, slot);
> }

I don't think that's a good change.  It puts the branch inside the
loop the pulls all tuples from the subplan.  Given the loop is likely
to be very hot combined with the fact that it's so simple, I'd much
rather have two separate loops to keep the extra branch outside the
loop.  It's true the branch predictor is likely to get the prediction
correct on each iteration, but unless the compiler rewrites this into
two loops then the comparison and jump must be done per loop.

David




RE: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-21 Thread houzj.f...@fujitsu.com
From: David Rowley 
> On Wed, 21 Jul 2021 at 13:39, James Coleman  wrote:
> > Thanks for doing the math measuring how much we could impact things.
> >
> > I'm +lots on getting this committed as is.
> 
> Ok good. I plan on taking a final look at the v10 patch tomorrow morning NZ
> time (about 12 hours from now) and if all is well, I'll push it.
> 
> If anyone feels differently, please let me know before then.
Hi,

I noticed a minor thing about the v10 patch.

-
-   for (;;)
+   if (node->datumSort)
{
-   slot = ExecProcNode(outerNode);
-
-   if (TupIsNull(slot))
-   break;
-
-   tuplesort_puttupleslot(tuplesortstate, slot);
+   for (;;)
+   {
+   slot = ExecProcNode(outerNode);
+
+   if (TupIsNull(slot))
+   break;
+   slot_getsomeattrs(slot, 1);
+   tuplesort_putdatum(tuplesortstate,
+  
slot->tts_values[0],
+  
slot->tts_isnull[0]);
+   }
+   }
+   else
+   {
+   for (;;)
+   {
+   slot = ExecProcNode(outerNode);
+
+   if (TupIsNull(slot))
+   break;
+   tuplesort_puttupleslot(tuplesortstate, slot);
+   }

The above seems can be shorter like the following ?

for (;;)
{
slot = ExecProcNode(outerNode);
if (TupIsNull(slot))
break;
if (node->datumSort)
{
slot_getsomeattrs(slot, 1);
tuplesort_putdatum(tuplesortstate,
slot->tts_values[0],
slot->tts_isnull[0]);
}
else
tuplesort_puttupleslot(tuplesortstate, slot);
}

Best regards,
houzj



Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-21 Thread David Rowley
On Wed, 21 Jul 2021 at 13:39, James Coleman  wrote:
> Thanks for doing the math measuring how much we could impact things.
>
> I'm +lots on getting this committed as is.

Ok good. I plan on taking a final look at the v10 patch tomorrow
morning NZ time (about 12 hours from now) and if all is well, I'll
push it.

If anyone feels differently, please let me know before then.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread James Coleman
On Tue, Jul 20, 2021 at 4:35 AM David Rowley  wrote:
>
> On Tue, 20 Jul 2021 at 01:10, James Coleman  wrote:
> > To be clear up front: I'm in favor of the patch, and I don't want to
> > put unnecessary stumbling blocks up for it getting committed. So if we
> > decide to proceed as is, that's fine with me.
>
> Thanks for making that clear.
>
> > But I'm not sure that the "cost model, unfortunately, does not account
> > for that" is entirely accurate. The end of cost_tuplesort contains a
> > cost "per extracted tuple". It does, however, note that it doesn't
> > charge cpu_tuple_cost, which maybe is what you'd want to fully
> > incorporate this into the model. But given this run_cost isn't about
> > accounting for comparison cost (that's been done earlier) which is the
> > part that'd be the same between tuple and datum sort, it seems to me
> > that we could lower the cpu_operator_cost here by something like 10%
> > if it's byref and 30% if it's byval?
>
> I failed to notice that last part that adds the additional cpu_operator_cost.
>
> The default cpu_operator_cost is 0.0025, so with the 10k tuple
> benchmark, that adds an additional charge of 25 to the total cost.
>
> If we take test 1 from my results on v5 as an example:
>
> > Test1 446.1 657.3 147.32%
>
> Looking at explain for that query:
>
> regression=# explain select two from tenk1 order by two offset 100;
>   QUERY PLAN
> --
>  Limit  (cost=1133.95..1133.95 rows=1 width=4)
>->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
>  Sort Key: two
>  ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
> (4 rows)
>
> If we want the costs to reflect reality again here then we'd have
> reduce 1133.95 by something like 147.32% (the performance difference).
> That would bring the cost down to 769.72, which is way more than we
> have to play with than the 25 that the cpu_operator_cost * tuples
> gives us.
>
> If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
> cost would become 1126.45.  That's not great considering the actual
> performance indicates that 769.72 would be a better number.
>
> If we look at John's result for test 1: He saw 588 tps on master and
> 998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
> to get close to reality on that machine.
>
> My thoughts are that the small surcharge added at the end of
> cost_tuplesort() is just not enough for us to play with.  I think to
> get us closer to fixing this correctly would require a redesign of the
> tuplesort costing entirely. I think that would be about an order of
> magnitude more effort than this patch was, so I really feel like I
> don't want to do this.
>
> I kinda feel that since the comparison_cost is always just 2.0 *
> cpu_operator_cost regardless of the number of columns in the sort,
> then if we add too many new smarts to try and properly adjust for this
> new optimization, unless we do a completly new cost modal for this,
> then we might as well be putting lipstick on a pig.
>
> It sounds like James mostly just mentioned the sorting just to ensure
> it was properly considered and does not really feel strongly that it
> needs to be adjusted.  Does anyone else feel that we should be
> adjusting it?

Thanks for doing the math measuring how much we could impact things.

I'm +lots on getting this committed as is.

Thanks all for your work on the improvement!

James




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread David Rowley
On Tue, 20 Jul 2021 at 23:28, Ranier Vilela  wrote:
> I took a look at cost_tuplesort and I think that some small adjustments could 
> be made as part of the improvement process.
> It is attached.
> 1. long is a very problematic type; better int64?
> 2. 1024 can be int, not long?
> 3. 2 changed all to 2.0 (double)?
> 4. If disk-based is not needed, IMO can we avoid calling relation_byte_size?
>
> Finally, to at least document (add comments) those conclusions,
> would be nice, wouldn't it?

I don't think there's anything useful here. If you think otherwise,
please take it to another thread.  Also, I'd recommend at least
compiling any patches you send to -hackers in the future. Going by the
CF bot, this one does not.

You might also want to read up on type promotion rules in C.  Your
sort_mem calculation change does not do what you think it does. Class
it as homework to figure out what's wrong with it.  No need to report
your findings here. Just thought it would be useful for you to learn
those things.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread David Rowley
On Fri, 16 Jul 2021 at 15:44, David Rowley  wrote:
>
> On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau  wrote:
> > Please find attached a v9 just moving the flag setting to ExecInitSort, and 
> > my
> > apologies if I misunderstood your point.
>
> I took this and adjusted a few things and ended up with the attached patch.

Attaching the same v10 patch again so the CF bot picks up the correct
patch again.

David


v10-0001-Make-nodeSort.c-do-Datum-sorts-for-single-column.patch
Description: Binary data


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread Ranier Vilela
Em ter., 20 de jul. de 2021 às 05:35, David Rowley 
escreveu:

> On Tue, 20 Jul 2021 at 01:10, James Coleman  wrote:
> > To be clear up front: I'm in favor of the patch, and I don't want to
> > put unnecessary stumbling blocks up for it getting committed. So if we
> > decide to proceed as is, that's fine with me.
>
> Thanks for making that clear.
>
> > But I'm not sure that the "cost model, unfortunately, does not account
> > for that" is entirely accurate. The end of cost_tuplesort contains a
> > cost "per extracted tuple". It does, however, note that it doesn't
> > charge cpu_tuple_cost, which maybe is what you'd want to fully
> > incorporate this into the model. But given this run_cost isn't about
> > accounting for comparison cost (that's been done earlier) which is the
> > part that'd be the same between tuple and datum sort, it seems to me
> > that we could lower the cpu_operator_cost here by something like 10%
> > if it's byref and 30% if it's byval?
>
> I failed to notice that last part that adds the additional
> cpu_operator_cost.
>
> The default cpu_operator_cost is 0.0025, so with the 10k tuple
> benchmark, that adds an additional charge of 25 to the total cost.
>
> If we take test 1 from my results on v5 as an example:
>
> > Test1 446.1 657.3 147.32%
>
> Looking at explain for that query:
>
> regression=# explain select two from tenk1 order by two offset 100;
>   QUERY PLAN
> --
>  Limit  (cost=1133.95..1133.95 rows=1 width=4)
>->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
>  Sort Key: two
>  ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
> (4 rows)
>
> If we want the costs to reflect reality again here then we'd have
> reduce 1133.95 by something like 147.32% (the performance difference).
> That would bring the cost down to 769.72, which is way more than we
> have to play with than the 25 that the cpu_operator_cost * tuples
> gives us.
>
> If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
> cost would become 1126.45.  That's not great considering the actual
> performance indicates that 769.72 would be a better number.
>
> If we look at John's result for test 1: He saw 588 tps on master and
> 998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
> to get close to reality on that machine.
>
> My thoughts are that the small surcharge added at the end of
> cost_tuplesort() is just not enough for us to play with.  I think to
> get us closer to fixing this correctly would require a redesign of the
> tuplesort costing entirely. I think that would be about an order of
> magnitude more effort than this patch was, so I really feel like I
> don't want to do this.
>
I understand that redesign would require a lot of work,
but why not do it step by step?


> I kinda feel that since the comparison_cost is always just 2.0 *
> cpu_operator_cost regardless of the number of columns in the sort,
> then if we add too many new smarts to try and properly adjust for this
> new optimization, unless we do a completly new cost modal for this,
> then we might as well be putting lipstick on a pig.
>
I think one first step is naming this 2.0?
Does this magic number don't have a good name?


>
> It sounds like James mostly just mentioned the sorting just to ensure
> it was properly considered and does not really feel strongly that it
> needs to be adjusted.  Does anyone else feel that we should be
> adjusting it?
>
I took a look at cost_tuplesort and I think that some small adjustments
could be made as part of the improvement process.
It is attached.
1. long is a very problematic type; better int64?
2. 1024 can be int, not long?
3. 2 changed all to 2.0 (double)?
4. If disk-based is not needed, IMO can we avoid calling relation_byte_size?

Finally, to at least document (add comments) those conclusions,
would be nice, wouldn't it?

regards,
Ranier Vilela


costsize.patch
Description: Binary data


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread David Rowley
On Tue, 20 Jul 2021 at 01:10, James Coleman  wrote:
> To be clear up front: I'm in favor of the patch, and I don't want to
> put unnecessary stumbling blocks up for it getting committed. So if we
> decide to proceed as is, that's fine with me.

Thanks for making that clear.

> But I'm not sure that the "cost model, unfortunately, does not account
> for that" is entirely accurate. The end of cost_tuplesort contains a
> cost "per extracted tuple". It does, however, note that it doesn't
> charge cpu_tuple_cost, which maybe is what you'd want to fully
> incorporate this into the model. But given this run_cost isn't about
> accounting for comparison cost (that's been done earlier) which is the
> part that'd be the same between tuple and datum sort, it seems to me
> that we could lower the cpu_operator_cost here by something like 10%
> if it's byref and 30% if it's byval?

I failed to notice that last part that adds the additional cpu_operator_cost.

The default cpu_operator_cost is 0.0025, so with the 10k tuple
benchmark, that adds an additional charge of 25 to the total cost.

If we take test 1 from my results on v5 as an example:

> Test1 446.1 657.3 147.32%

Looking at explain for that query:

regression=# explain select two from tenk1 order by two offset 100;
  QUERY PLAN
--
 Limit  (cost=1133.95..1133.95 rows=1 width=4)
   ->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
 Sort Key: two
 ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
(4 rows)

If we want the costs to reflect reality again here then we'd have
reduce 1133.95 by something like 147.32% (the performance difference).
That would bring the cost down to 769.72, which is way more than we
have to play with than the 25 that the cpu_operator_cost * tuples
gives us.

If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
cost would become 1126.45.  That's not great considering the actual
performance indicates that 769.72 would be a better number.

If we look at John's result for test 1: He saw 588 tps on master and
998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
to get close to reality on that machine.

My thoughts are that the small surcharge added at the end of
cost_tuplesort() is just not enough for us to play with.  I think to
get us closer to fixing this correctly would require a redesign of the
tuplesort costing entirely. I think that would be about an order of
magnitude more effort than this patch was, so I really feel like I
don't want to do this.

I kinda feel that since the comparison_cost is always just 2.0 *
cpu_operator_cost regardless of the number of columns in the sort,
then if we add too many new smarts to try and properly adjust for this
new optimization, unless we do a completly new cost modal for this,
then we might as well be putting lipstick on a pig.

It sounds like James mostly just mentioned the sorting just to ensure
it was properly considered and does not really feel strongly that it
needs to be adjusted.  Does anyone else feel that we should be
adjusting it?

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-19 Thread James Coleman
On Sat, Jul 17, 2021 at 4:36 AM David Rowley  wrote:
>
>  On Sat, 17 Jul 2021 at 01:14, James Coleman  wrote:
> > The only remaining question I have is whether or not costing needs to
> > change, given the very significant speedup for datum sort.
>
> I'm looking at cost_tuplesort and the only thing that I think might
> make sense would be to adjust how the input_bytes value is calculated.
> For now, that's done with the following function that's used in quite
> a number of places.
>
> static double
> relation_byte_size(double tuples, int width)
> {
> return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
> }
>
> It seems, at least in the case of Sort, that using SizeofHeapTupleHead
> is just always wrong as it should be SizeofMinimalTupleHeader. I know
> that's also the case for Memoize too. I've not checked the other
> locations.
>
> The only thing I can really see that we might do would be not add the
> MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
> We'd need to pass down the number of attributes from
> create_sort_path() so we'd know when and when not to add that. I'm not
> saying that we should do this. I'm just saying that I don't really see
> what else we might do.
>
> I can imagine another patch might just want to do a complete overhaul
> of all locations that use relation_byte_size().  There are various
> things that function just does not account for. e.g, the fact that we
> allocate chunks in powers of 2 and that there's a chunk header added
> on.  Of course, "width" is just an estimate, so maybe trying to
> calculate something too precisely wouldn't be too wise. However,
> there's a bit of a chicken and the egg problem there as there'd be
> little incentive to improve "width" unless we started making more
> accurate use of the value.
>
> Anyway, none of the above take into account that the Datum sort is
> just a little faster, The only thing that exists in the existing cost
> modal that we could use to adjust the cost of an in memory sort is the
> comparison_cost.  The problem there is that the comparison is exactly
> the same in both Datum and Tuple sorts. The only thing that really
> changes between Datum and Tuple sort is the fact that we don't make a
> MinimalTuple when doing a Datum sort.  The cost modal, unfortunately,
> does not account for that.   That kinda makes me think that we should
> do nothing as if we start to account for making MemoryTuples then
> we'll just penalise Tuple sorts and that might cause someone to be
> upset.

To be clear up front: I'm in favor of the patch, and I don't want to
put unnecessary stumbling blocks up for it getting committed. So if we
decide to proceed as is, that's fine with me.

But I'm not sure that the "cost model, unfortunately, does not account
for that" is entirely accurate. The end of cost_tuplesort contains a
cost "per extracted tuple". It does, however, note that it doesn't
charge cpu_tuple_cost, which maybe is what you'd want to fully
incorporate this into the model. But given this run_cost isn't about
accounting for comparison cost (that's been done earlier) which is the
part that'd be the same between tuple and datum sort, it seems to me
that we could lower the cpu_operator_cost here by something like 10%
if it's byref and 30% if it's byval?

James




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-18 Thread Ronan Dunklau
Le samedi 17 juillet 2021, 10:36:09 CEST David Rowley a écrit :
>  On Sat, 17 Jul 2021 at 01:14, James Coleman  wrote:
> > The only remaining question I have is whether or not costing needs to
> > change, given the very significant speedup for datum sort.
> 
> I'm looking at cost_tuplesort and the only thing that I think might
> make sense would be to adjust how the input_bytes value is calculated.
> For now, that's done with the following function that's used in quite
> a number of places.
> 
> static double
> relation_byte_size(double tuples, int width)
> {
> return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
> }
> 
> It seems, at least in the case of Sort, that using SizeofHeapTupleHead
> is just always wrong as it should be SizeofMinimalTupleHeader. I know
> that's also the case for Memoize too. I've not checked the other
> locations.
> 
> The only thing I can really see that we might do would be not add the
> MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
> We'd need to pass down the number of attributes from
> create_sort_path() so we'd know when and when not to add that. I'm not
> saying that we should do this. I'm just saying that I don't really see
> what else we might do.
> 
> I can imagine another patch might just want to do a complete overhaul
> of all locations that use relation_byte_size().  There are various
> things that function just does not account for. e.g, the fact that we
> allocate chunks in powers of 2 and that there's a chunk header added
> on.  Of course, "width" is just an estimate, so maybe trying to
> calculate something too precisely wouldn't be too wise. However,
> there's a bit of a chicken and the egg problem there as there'd be
> little incentive to improve "width" unless we started making more
> accurate use of the value.
> 
> Anyway, none of the above take into account that the Datum sort is
> just a little faster, The only thing that exists in the existing cost
> modal that we could use to adjust the cost of an in memory sort is the
> comparison_cost.  The problem there is that the comparison is exactly
> the same in both Datum and Tuple sorts. The only thing that really
> changes between Datum and Tuple sort is the fact that we don't make a
> MinimalTuple when doing a Datum sort.  The cost modal, unfortunately,
> does not account for that.   That kinda makes me think that we should
> do nothing as if we start to account for making MemoryTuples then
> we'll just penalise Tuple sorts and that might cause someone to be
> upset.
> 
Thank you for taking the time to perform that analysis. I agree with you and 
tt looks to me that if we were to start accounting for it, we would have to 
make the change almost transparent for tuple sorts so that it stays roughly 
the same, which is impossible since we don't apply the comparison cost to all 
tuples but only to the number of tuples we actually expect to compare.

On the other hand, if we don't change the sorting cost and it just ends up 
being faster in some cases I doubt anyone would complain.


-- 
Ronan Dunklau






Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-17 Thread David Rowley
 On Sat, 17 Jul 2021 at 01:14, James Coleman  wrote:
> The only remaining question I have is whether or not costing needs to
> change, given the very significant speedup for datum sort.

I'm looking at cost_tuplesort and the only thing that I think might
make sense would be to adjust how the input_bytes value is calculated.
For now, that's done with the following function that's used in quite
a number of places.

static double
relation_byte_size(double tuples, int width)
{
return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
}

It seems, at least in the case of Sort, that using SizeofHeapTupleHead
is just always wrong as it should be SizeofMinimalTupleHeader. I know
that's also the case for Memoize too. I've not checked the other
locations.

The only thing I can really see that we might do would be not add the
MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
We'd need to pass down the number of attributes from
create_sort_path() so we'd know when and when not to add that. I'm not
saying that we should do this. I'm just saying that I don't really see
what else we might do.

I can imagine another patch might just want to do a complete overhaul
of all locations that use relation_byte_size().  There are various
things that function just does not account for. e.g, the fact that we
allocate chunks in powers of 2 and that there's a chunk header added
on.  Of course, "width" is just an estimate, so maybe trying to
calculate something too precisely wouldn't be too wise. However,
there's a bit of a chicken and the egg problem there as there'd be
little incentive to improve "width" unless we started making more
accurate use of the value.

Anyway, none of the above take into account that the Datum sort is
just a little faster, The only thing that exists in the existing cost
modal that we could use to adjust the cost of an in memory sort is the
comparison_cost.  The problem there is that the comparison is exactly
the same in both Datum and Tuple sorts. The only thing that really
changes between Datum and Tuple sort is the fact that we don't make a
MinimalTuple when doing a Datum sort.  The cost modal, unfortunately,
does not account for that.   That kinda makes me think that we should
do nothing as if we start to account for making MemoryTuples then
we'll just penalise Tuple sorts and that might cause someone to be
upset.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-16 Thread James Coleman
On Thu, Jul 15, 2021 at 11:45 PM David Rowley  wrote:
>
> On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau  wrote:
> > Please find attached a v9 just moving the flag setting to ExecInitSort, and 
> > my
> > apologies if I misunderstood your point.
>
> I took this and adjusted a few things and ended up with the attached patch.
>
> The changes are fairly minor. I made the bracing consistent between
> both tuplesort_begin calls. I rewrote the comment at the top of
> ExecSort() to make it more clear about each method used.
>
> I also adjusted the comment down at the end of ExecSort that was
> mentioning something about tuplesort_gettupleslot returning NULL.
> Your patch didn't touch this, but to me, the comment just looked wrong
> both before and after the changes. tuplesort_gettupleslot returns
> false and sets the slot to empty when it runs out of tuples.  Anyway,
> I wrote something there that I think improves that.
>
> I feel like this patch is commit-worthy now.  However, I'll leave it
> for a few days, maybe until after the weekend as there's been a fair
> bit of interest and I imagine someone will have comments to make.

The only remaining question I have is whether or not costing needs to
change, given the very significant speedup for datum sort.

James




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-16 Thread Ranier Vilela
Em sex., 16 de jul. de 2021 às 00:45, David Rowley 
escreveu:

> On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau 
> wrote:
> > Please find attached a v9 just moving the flag setting to ExecInitSort,
> and my
> > apologies if I misunderstood your point.
>
> I took this and adjusted a few things and ended up with the attached patch.
>
> The changes are fairly minor. I made the bracing consistent between
> both tuplesort_begin calls. I rewrote the comment at the top of
> ExecSort() to make it more clear about each method used.
>
With relation to the braces, it's still not clear to me which style to
follow.
I gave Ronan directions about it.
And I think maybe, it's still not clear when to use it or not.


> I also adjusted the comment down at the end of ExecSort that was
> mentioning something about tuplesort_gettupleslot returning NULL.
> Your patch didn't touch this, but to me, the comment just looked wrong
> both before and after the changes. tuplesort_gettupleslot returns
> false and sets the slot to empty when it runs out of tuples.  Anyway,
> I wrote something there that I think improves that.
>
Can help a little here, but, seems good to me.


> I feel like this patch is commit-worthy now.  However, I'll leave it
> for a few days, maybe until after the weekend as there's been a fair
> bit of interest and I imagine someone will have comments to make.
>
A little lack of time.

But I finally can understand v7b.
Really struct field is necessary and he fails with the next tuple, ok.
The only conclusion I can come to is that he is faster because he fails to
sort correctly.
It's no use being faster and getting wrong results.

So, +1 from me to commit v10.

Thanks for working together.

Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread David Rowley
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau  wrote:
> Please find attached a v9 just moving the flag setting to ExecInitSort, and my
> apologies if I misunderstood your point.

I took this and adjusted a few things and ended up with the attached patch.

The changes are fairly minor. I made the bracing consistent between
both tuplesort_begin calls. I rewrote the comment at the top of
ExecSort() to make it more clear about each method used.

I also adjusted the comment down at the end of ExecSort that was
mentioning something about tuplesort_gettupleslot returning NULL.
Your patch didn't touch this, but to me, the comment just looked wrong
both before and after the changes. tuplesort_gettupleslot returns
false and sets the slot to empty when it runs out of tuples.  Anyway,
I wrote something there that I think improves that.

I feel like this patch is commit-worthy now.  However, I'll leave it
for a few days, maybe until after the weekend as there's been a fair
bit of interest and I imagine someone will have comments to make.

David


v10-0001-Make-nodeSort.c-do-Datum-sorts-for-single-column.patch
Description: Binary data


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread David Rowley
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau  wrote:
>
> Le jeudi 15 juillet 2021, 16:19:23 CEST David Rowley a écrit :>
> > Ronan's latest results plus John's make me think there's no need to
> > separate out the node function as I did in v8.  However, I do think v6
> > could learn a little from v8. I think I'd rather see the sort method
> > determined in ExecInitSort() rather than ExecSort(). I think
> > minimising those few extra instructions in ExecSort() might help the
> > L1 instruction cache.
> >
>
> I'm not sure I understand what you expect from moving that to ExecInitSort ?

The motivation was to reduce the extra code that's being added to
ExecSort. I checked the assembly of ExecSort on v6 and v9 and v6 was
544 lines of assembly and v9 is 534 lines.

> Maybe we should also implement the tuplesort_state initialization in
> ExecInitSort ? (not the actual feeding and sorting of course).

I don't think that would be a good idea.  Setting the datumSort does
not require any new memory to be allocated. That's not the case for
the tuplesort_begin routines.  The difference here is that we can
delay the memory allocation until we pull the first tuple and if we
don't pull any tuples from the outer node then there are no needless
allocations.

> Please find attached a v9 just moving the flag setting to ExecInitSort, and my
> apologies if I misunderstood your point.

That's exactly what I meant. Thanks

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread James Coleman
On Thu, Jul 15, 2021 at 10:19 AM David Rowley  wrote:
>
> On Fri, 16 Jul 2021 at 01:44, James Coleman  wrote:
> >
> > On Wed, Jul 14, 2021 at 9:22 PM David Rowley  wrote:
> > >
> > > On Thu, 15 Jul 2021 at 12:30, Ranier Vilela  wrote:
> > > >
> > > > Em qua., 14 de jul. de 2021 às 21:21, David Rowley 
> > > >  escreveu:
> > > >> But, in v8 there is no additional branch, so no branch to mispredict.
> > > >> I don't really see how your explanation fits.
> > > >
> > > > In v8 the branch occurs at :
> > > > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
> > >
> > > You do know that branch is in a function that's only executed once
> > > during executor initialization, right?
> >
> > This is why I have a hard time believing there's a "real" change here
> > and not the result of either noise or something not really
> > controllable like executable layout changing.
>
> Yeah, I think we likely are at the level where layout changes in the
> compiled code are going to make things hard to measure.  I just want
> to make sure we're not going to end up with some regression that's
> actual and not random depending on layout changes of unrelated code.
> I think a branch that's taken consistently *should* be predicted
> correctly each time.
>
> Anyway, I think all the comparisons with v7b can safely be ignored. As
> Ronan pointed out, v7b has some issues due to it not recording the
> sort method in the executor state that leads to it forgetting which
> method it used once we start pulling tuples from it. The reproductions
> of that are it calling tuplesort_gettupleslot() from the 2nd tuple
> onwards regardless of if we've done a datum or tuple sort.
>
> Ronan's latest results plus John's make me think there's no need to
> separate out the node function as I did in v8.  However, I do think v6
> could learn a little from v8. I think I'd rather see the sort method
> determined in ExecInitSort() rather than ExecSort(). I think
> minimising those few extra instructions in ExecSort() might help the
> L1 instruction cache.

I ran master/v6/v8 tests for 90s each with David's test script on an
AWS c5n.metal instance (so should be immune to noise neighbor issues).
Here are comparative results:

Test1 Test2 Test3 Test4 Test5 Test6 Test7 Test8
v6 68.66% 0.05% 32.21% -0.83% 12.58% 10.42% -1.48% 50.98%
v8 69.78% -0.44% 32.45% -1.11% 12.01% 10.58% -1.40% 49.30%

So I see a consistent change in the data, but I don't really see a
good explanation for it not being noise. Can't prove that yet though.

James




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread Ronan Dunklau
Le jeudi 15 juillet 2021, 16:19:23 CEST David Rowley a écrit :> 
> Ronan's latest results plus John's make me think there's no need to
> separate out the node function as I did in v8.  However, I do think v6
> could learn a little from v8. I think I'd rather see the sort method
> determined in ExecInitSort() rather than ExecSort(). I think
> minimising those few extra instructions in ExecSort() might help the
> L1 instruction cache.
> 

I'm not sure I understand what you expect from moving that to ExecInitSort ? 
Maybe we should also implement the tuplesort_state initialization in 
ExecInitSort ? (not the actual feeding and sorting of course).

Please find attached a v9 just moving the flag setting to ExecInitSort, and my 
apologies if I misunderstood your point.

-- 
Ronan Dunklaucommit 2bf8db23c8ddb53f7dbcba33c60350bda9d703ea
Author: Ronan Dunklau 
Date:   Tue Jul 6 16:34:31 2021 +0200

Use optimized single-datum tuplesort in ExecSort

Previously, with optimization was only done in ExecAgg.
Doing this optimization on regular sort nodes provides for a nice
performance improvement, when the input / output tuple has only one
attribute.

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..861e4c6e9e 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
  *		which saves the results in a temporary file or memory. After the
  *		initial call, returns a tuple from the file with each call.
  *
+ *		The tuplesort can either occur on the whole tuple (this is the nominal
+ *		case) or, when the input / output tuple consists of only one attribute,
+ *		we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
  *		Conditions:
  *		  -- none.
  *
@@ -86,31 +90,63 @@ ExecSort(PlanState *pstate)
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-			  plannode->numCols,
-			  plannode->sortColIdx,
-			  plannode->sortOperators,
-			  plannode->collations,
-			  plannode->nullsFirst,
-			  work_mem,
-			  NULL,
-			  node->randomAccess);
+		/*
+		 * Switch to the tuplesort_*_datum interface when we have only one
+		 * key, as it is much more efficient especially when the type is
+		 * pass-by-value.
+		 */
+		if (node->datumSort)
+		{
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+   plannode->sortOperators[0],
+   plannode->collations[0],
+   plannode->nullsFirst[0],
+   work_mem,
+   NULL,
+   node->randomAccess);
+		}
+		else
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
+  plannode->numCols,
+  plannode->sortColIdx,
+  plannode->sortOperators,
+  plannode->collations,
+  plannode->nullsFirst,
+  work_mem,
+  NULL,
+  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
 
 		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
+		 * Scan the subplan and feed all the tuples to tuplesort, using either
+		 * the _putdatum or _puttupleslot API as appropriate.
 		 */
-
-		for (;;)
+		if (node->datumSort)
 		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+slot_getsomeattrs(slot, 1);
+tuplesort_putdatum(tuplesortstate,
+   slot->tts_values[0],
+   slot->tts_isnull[0]);
+			}
+		}
+		else
+		{
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+tuplesort_puttupleslot(tuplesortstate, slot);
+			}
 		}
 
 		/*
@@ -150,9 +186,18 @@ ExecSort(PlanState *pstate)
 	 * next fetch from the tuplesort.
 	 */
 	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
-  ScanDirectionIsForward(dir),
-  false, slot, NULL);
+	if (node->datumSort)
+	{
+		ExecClearTuple(slot);
+		if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+			   &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+			ExecStoreVirtualTuple(slot);
+	}
+	else
+		(void) tuplesort_gettupleslot(tuplesortstate,
+	  ScanDirectionIsForward(dir),
+	  false, slot, NULL);
+
 	return slot;
 }
 
@@ -191,6 +236,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate->bounded = false;
 	sortstate->sort_Done = false;
 	sortstate->tuplesortstate = NULL;
+	sortstate->datumSort = false;
 
 	/*
 	 * Miscellaneous initialization
@@ -220,6 +266,15 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(>ss.ps, );
 	

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread Ranier Vilela
Em qui., 15 de jul. de 2021 às 11:19, David Rowley 
escreveu:

> On Fri, 16 Jul 2021 at 01:44, James Coleman  wrote:
> >
> > On Wed, Jul 14, 2021 at 9:22 PM David Rowley 
> wrote:
> > >
> > > On Thu, 15 Jul 2021 at 12:30, Ranier Vilela 
> wrote:
> > > >
> > > > Em qua., 14 de jul. de 2021 às 21:21, David Rowley <
> dgrowle...@gmail.com> escreveu:
> > > >> But, in v8 there is no additional branch, so no branch to
> mispredict.
> > > >> I don't really see how your explanation fits.
> > > >
> > > > In v8 the branch occurs at :
> > > > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
> > >
> > > You do know that branch is in a function that's only executed once
> > > during executor initialization, right?
> >
> > This is why I have a hard time believing there's a "real" change here
> > and not the result of either noise or something not really
> > controllable like executable layout changing.
>
> Yeah, I think we likely are at the level where layout changes in the
> compiled code are going to make things hard to measure.  I just want
> to make sure we're not going to end up with some regression that's
> actual and not random depending on layout changes of unrelated code.
> I think a branch that's taken consistently *should* be predicted
> correctly each time.


> Anyway, I think all the comparisons with v7b can safely be ignored. As
> Ronan pointed out, v7b has some issues due to it not recording the
> sort method in the executor state that leads to it forgetting which
> method it used once we start pulling tuples from it. The reproductions
> of that are it calling tuplesort_gettupleslot() from the 2nd tuple
> onwards regardless of if we've done a datum or tuple sort.
>
Sorry for insisting on this.
Assuming v7b is doing it the wrong way, which I still don't think it is.
Why is it still faster than v6 and v8?

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread Ranier Vilela
Em qua., 14 de jul. de 2021 às 22:22, David Rowley 
escreveu:

> On Thu, 15 Jul 2021 at 12:30, Ranier Vilela  wrote:
> >
> > Em qua., 14 de jul. de 2021 às 21:21, David Rowley 
> escreveu:
> >> But, in v8 there is no additional branch, so no branch to mispredict.
> >> I don't really see how your explanation fits.
> >
> > In v8 the branch occurs at :
> > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
>
> You do know that branch is in a function that's only executed once
> during executor initialization, right?
>
There's a real difference between v8 and v6, if I understood correctly.

v6 the branches is per tuple:
+ if (tupDesc->natts == 1)

v8 the branches is per state:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)

I think that a big different way to solve the problem.
Or am I getting it wrong?

If the sortstate number of attributes is equal to 1, is it worth the same
for each tuple?
Can you explain this, please?

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread David Rowley
On Fri, 16 Jul 2021 at 01:44, James Coleman  wrote:
>
> On Wed, Jul 14, 2021 at 9:22 PM David Rowley  wrote:
> >
> > On Thu, 15 Jul 2021 at 12:30, Ranier Vilela  wrote:
> > >
> > > Em qua., 14 de jul. de 2021 às 21:21, David Rowley  
> > > escreveu:
> > >> But, in v8 there is no additional branch, so no branch to mispredict.
> > >> I don't really see how your explanation fits.
> > >
> > > In v8 the branch occurs at :
> > > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
> >
> > You do know that branch is in a function that's only executed once
> > during executor initialization, right?
>
> This is why I have a hard time believing there's a "real" change here
> and not the result of either noise or something not really
> controllable like executable layout changing.

Yeah, I think we likely are at the level where layout changes in the
compiled code are going to make things hard to measure.  I just want
to make sure we're not going to end up with some regression that's
actual and not random depending on layout changes of unrelated code.
I think a branch that's taken consistently *should* be predicted
correctly each time.

Anyway, I think all the comparisons with v7b can safely be ignored. As
Ronan pointed out, v7b has some issues due to it not recording the
sort method in the executor state that leads to it forgetting which
method it used once we start pulling tuples from it. The reproductions
of that are it calling tuplesort_gettupleslot() from the 2nd tuple
onwards regardless of if we've done a datum or tuple sort.

Ronan's latest results plus John's make me think there's no need to
separate out the node function as I did in v8.  However, I do think v6
could learn a little from v8. I think I'd rather see the sort method
determined in ExecInitSort() rather than ExecSort(). I think
minimising those few extra instructions in ExecSort() might help the
L1 instruction cache.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread James Coleman
On Wed, Jul 14, 2021 at 9:22 PM David Rowley  wrote:
>
> On Thu, 15 Jul 2021 at 12:30, Ranier Vilela  wrote:
> >
> > Em qua., 14 de jul. de 2021 às 21:21, David Rowley  
> > escreveu:
> >> But, in v8 there is no additional branch, so no branch to mispredict.
> >> I don't really see how your explanation fits.
> >
> > In v8 the branch occurs at :
> > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
>
> You do know that branch is in a function that's only executed once
> during executor initialization, right?

This is why I have a hard time believing there's a "real" change here
and not the result of either noise or something not really
controllable like executable layout changing.

James




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread Ranier Vilela
Em qui., 15 de jul. de 2021 às 09:27, Ronan Dunklau 
escreveu:

> Le jeudi 15 juillet 2021, 14:09:26 CEST Ranier Vilela a écrit :
> > Is there a special reason to not share v7b tests and results?
> >
>
> The v7b patch is wrong, as it loses the type of tuplesort being used

I don't see 'node->datumSort' being anywhere else yet.


> and as
> such always tries to fetch results using tuplesort_gettupleslot after the
> first
> tuple is fetched.

Is that why it is faster than v6?

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread Ronan Dunklau
Le jeudi 15 juillet 2021, 14:09:26 CEST Ranier Vilela a écrit :
> Is there a special reason to not share v7b tests and results?
> 

The v7b patch is wrong, as it loses the type of tuplesort being used and as 
such always tries to fetch results using tuplesort_gettupleslot after the first 
tuple is fetched. 


-- 
Ronan Dunklau






Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread Ranier Vilela
Em qui., 15 de jul. de 2021 às 07:18, Ronan Dunklau 
escreveu:

> Le jeudi 15 juillet 2021, 01:30:26 CEST John Naylor a écrit :
> > On Wed, Jul 14, 2021 at 6:14 AM David Rowley 
> wrote:
> > > It would be good to get a 2nd opinion about this idea.  Also, more
> > > benchmark results with v6 and v8 would be good too.
> >
>
> Hello,
>
> Thank you for trying this approach in v8 David !
>
> I've decided to test on more "stable" hardware, an EC-2 medium instance,
> compiling with Debian's gcc 8.3. That's still not ideal but a lot better
> than
> a laptop.
>
> To gather more meaningful results, I ran every pgbench for 30s instead of
> the
> 10 in the initial script provided by David. I ran the full script once for
> HEAD, v6, v8, then a second time for HEAD, v6, v8 to try to eliminate
> noise
> that could happen for 90 consecutive seconds, and took for each of those
> the
> median of the 6 runs.  It's much less noisy than my previous runs but
> still
> not as as stable as I'd like to.
>
> The results are attached in graph form, as well as the raw data if someone
> wants it.
>
> As a conclusion, I don't think it's worth it to introduce a separate
> execprocnode function for that case. It is likely the minor difference
> still
> observed can be explained to noise, as they fluctuate if you compare the
> min,
> max, average or median values from the results.
>
Is there a special reason to not share v7b tests and results?

IMHO he is much more branch friendly.

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-15 Thread Ronan Dunklau
Le jeudi 15 juillet 2021, 01:30:26 CEST John Naylor a écrit :
> On Wed, Jul 14, 2021 at 6:14 AM David Rowley  wrote:
> > It would be good to get a 2nd opinion about this idea.  Also, more
> > benchmark results with v6 and v8 would be good too.
> 

Hello,

Thank you for trying this approach in v8 David !

I've decided to test on more "stable" hardware, an EC-2 medium instance, 
compiling with Debian's gcc 8.3. That's still not ideal but a lot better than 
a laptop. 

To gather more meaningful results, I ran every pgbench for 30s instead of the 
10 in the initial script provided by David. I ran the full script once for 
HEAD, v6, v8, then a second time for HEAD, v6, v8 to try to eliminate noise 
that could happen for 90 consecutive seconds, and took for each of those the 
median of the 6 runs.  It's much less noisy than my previous runs but still 
not as as stable as I'd like to.

The results are attached in graph form, as well as the raw data if someone 
wants it.

As a conclusion, I don't think it's worth it to introduce a separate 
execprocnode function for that case. It is likely the minor difference still 
observed can be explained to noise, as they fluctuate if you compare the min, 
max, average or median values from the results.

Best regards,

-- 
Ronan Dunklauname,querynumber,run,value
master,1,1,448.134618
master,1,2,449.589985
master,1,3,450.619392
master,1,4,451.835713
master,1,5,453.799955
master,1,6,449.656700
master,2,1,160.304157
master,2,2,159.194826
master,2,3,160.486337
master,2,4,157.327040
master,2,5,149.773621
master,2,6,156.001160
master,3,1,300.677621
master,3,2,300.906490
master,3,3,300.948030
master,3,4,300.494371
master,3,5,300.236476
master,3,6,300.537330
master,4,1,131.364657
master,4,2,120.390513
master,4,3,125.846061
master,4,4,131.216385
master,4,5,131.036531
master,4,6,132.669908
master,5,1,202.268035
master,5,2,203.257486
master,5,3,202.559354
master,5,4,202.392410
master,5,5,201.532070
master,5,6,196.208607
master,6,1,178.584481
master,6,2,177.955467
master,6,3,178.801931
master,6,4,174.769964
master,6,5,178.523056
master,6,6,178.089879
master,7,1,113.231133
master,7,2,113.978583
master,7,3,107.244092
master,7,4,113.309029
master,7,5,114.732538
master,7,6,112.608871
master,8,1,374.876726
master,8,2,377.042935
master,8,3,377.417160
master,8,4,379.001567
master,8,5,379.139665
master,8,6,378.489882
v6,1,1,774.599303
v6,1,2,794.282738
v6,1,3,794.573274
v6,1,4,786.462616
v6,1,5,793.436419
v6,1,6,794.077087
v6,2,1,181.684654
v6,2,2,181.511128
v6,2,3,171.686299
v6,2,4,156.607679
v6,2,5,150.534859
v6,2,6,153.857928
v6,3,1,437.601327
v6,3,2,438.129653
v6,3,3,438.291123
v6,3,4,440.916755
v6,3,5,441.174564
v6,3,6,440.738299
v6,4,1,147.996631
v6,4,2,148.126633
v6,4,3,146.419228
v6,4,4,130.344895
v6,4,5,131.744894
v6,4,6,133.263237
v6,5,1,225.757004
v6,5,2,225.600496
v6,5,3,225.576930
v6,5,4,226.347284
v6,5,5,224.425254
v6,5,6,218.768964
v6,6,1,196.128685
v6,6,2,197.337412
v6,6,3,198.729160
v6,6,4,197.664255
v6,6,5,200.419565
v6,6,6,200.076977
v6,7,1,125.414408
v6,7,2,122.788568
v6,7,3,124.698258
v6,7,4,112.768749
v6,7,5,113.066527
v6,7,6,111.011427
v6,8,1,643.292463
v6,8,2,644.358034
v6,8,3,643.875166
v6,8,4,642.763715
v6,8,5,642.806728
v6,8,6,641.536225
v8,1,1,780.693054
v8,1,2,780.707289
v8,1,3,779.381265
v8,1,4,755.864425
v8,1,5,790.653100
v8,1,6,788.112981
v8,2,1,159.354465
v8,2,2,157.702581
v8,2,3,159.168693
v8,2,4,154.088551
v8,2,5,158.638991
v8,2,6,160.611918
v8,3,1,442.174582
v8,3,2,442.874461
v8,3,3,443.113160
v8,3,4,443.669698
v8,3,5,444.011233
v8,3,6,443.424492
v8,4,1,133.106784
v8,4,2,130.355708
v8,4,3,132.366164
v8,4,4,131.248541
v8,4,5,131.318346
v8,4,6,131.119015
v8,5,1,225.082939
v8,5,2,225.890904
v8,5,3,225.603399
v8,5,4,220.509114
v8,5,5,222.529699
v8,5,6,224.236719
v8,6,1,199.296811
v8,6,2,192.199622
v8,6,3,196.138832
v8,6,4,197.790249
v8,6,5,197.283257
v8,6,6,198.136669
v8,7,1,110.023948
v8,7,2,111.809473
v8,7,3,113.334062
v8,7,4,111.040694
v8,7,5,110.979502
v8,7,6,113.120494
v8,8,1,640.786958
v8,8,2,640.502172
v8,8,3,640.414708
v8,8,4,638.864723
v8,8,5,636.939979
v8,8,6,639.068674


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-14 Thread Ranier Vilela
Em qua., 14 de jul. de 2021 às 22:22, David Rowley 
escreveu:

> On Thu, 15 Jul 2021 at 12:30, Ranier Vilela  wrote:
> >
> > Em qua., 14 de jul. de 2021 às 21:21, David Rowley 
> escreveu:
> >> But, in v8 there is no additional branch, so no branch to mispredict.
> >> I don't really see how your explanation fits.
> >
> > In v8 the branch occurs at :
> > + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
>
> You do know that branch is in a function that's only executed once
> during executor initialization, right?
>
The branch prediction should work better.
I have no idea why it works worse.

I redid all tests:
notebook 8GB RAM 256GB SSD
ubuntu 64 bits (20.04)
clang-12
powerhigh (charger on)
none configuration (all defaults)


  HEAD   v6  v7b  v8  v6
vs head
v7b vs v6   v8 vs v7b
Test1 576,868013 940,947236 1090,253859 1016,0443 163,11% 115,87% 93,19%
Test2 184,748363 177,6254 177,346229 178,230258 96,14% 99,84% 100,50%
Test3 410,030055 541,889704 605,843924 534,946166 132,16% 111,80% 88,30%
Test4 153,331752 147,98418 148,010894 147,771155 96,51% 100,02% 99,84%
Test5 268,97555 301,979647 316,928492 300,94932 112,27% 104,95% 94,96%
Test6 234,910125 259,71483 269,851427 260,567637 110,56% 103,90% 96,56%
Test7 142,704153 136,09163 136,802695 136,935709 95,37% 100,52% 100,10%
Test8 498,634855 763,482151 867,350046 804,833884 153,11% 113,60% 92,79%

The values are high here, because now, the tests are made with full power
of cpu to all patchs!
I think that more testing is needed with v7b and v8.

Anyway, two functions (ExecSortTuple and ExecSortDatum) are almost equal,
maybe not a good idea.

file results attached.

regards,
Ranier Vilela
Benchmarks datumSort:

1) head

a)
Test1
tps = 569.209858 (without initial connection time)
tps = 563.220614 (without initial connection time)
tps = 576.868013 (without initial connection time)
Test2
tps = 181.738827 (without initial connection time)
tps = 181.846192 (without initial connection time)
tps = 184.748363 (without initial connection time)
Test3
tps = 403.548507 (without initial connection time)
tps = 403.089439 (without initial connection time)
tps = 402.613085 (without initial connection time)
Test4
tps = 149.144474 (without initial connection time)
tps = 149.399761 (without initial connection time)
tps = 149.154955 (without initial connection time)
Test5
tps = 267.874525 (without initial connection time)
tps = 268.411193 (without initial connection time)
tps = 268.975550 (without initial connection time)
Test6
tps = 233.527790 (without initial connection time)
tps = 233.633064 (without initial connection time)
tps = 234.910125 (without initial connection time)
Test7
tps = 141.286126 (without initial connection time)
tps = 142.704153 (without initial connection time)
tps = 141.205352 (without initial connection time)
Test8
tps = 487.544464 (without initial connection time)
tps = 491.117533 (without initial connection time)
tps = 488.632738 (without initial connection time)

b)
Test1
tps = 566.830371 (without initial connection time)
tps = 570.918955 (without initial connection time)
tps = 568.750619 (without initial connection time)
Test2
tps = 183.407740 (without initial connection time)
tps = 180.912646 (without initial connection time)
tps = 180.520571 (without initial connection time)
Test3
tps = 407.706768 (without initial connection time)
tps = 410.030055 (without initial connection time)
tps = 408.764070 (without initial connection time)
Test4
tps = 153.331752 (without initial connection time)
tps = 150.824075 (without initial connection time)
tps = 151.791484 (without initial connection time)
Test5
tps = 267.737001 (without initial connection time)
tps = 268.400295 (without initial connection time)
tps = 267.696627 (without initial connection time)
Test6
tps = 232.793249 (without initial connection time)
tps = 232.996156 (without initial connection time)
tps = 233.452329 (without initial connection time)
Test7
tps = 142.180635 (without initial connection time)
tps = 141.762886 (without initial connection time)
tps = 141.488497 (without initial connection time)
Test8
tps = 489.623012 (without initial connection time)
tps = 498.634855 (without initial connection time)
tps = 491.009624 (without initial connection time)


2) v6 Ronan

a)
Test1
tps = 937.913563 (without initial connection time)
tps = 940.947236 (without initial connection time)
tps = 934.811056 (without initial connection time)
Test2
tps = 176.990496 (without initial connection time)
tps = 177.242999 (without initial connection time)
tps = 177.625400 (without initial connection time)
Test3
tps = 539.970412 (without initial connection time)
tps = 540.415402 (without initial connection time)
tps = 541.889704 (without initial connection time)
Test4
tps = 147.006777 (without initial connection time)
tps = 147.984180 (without initial connection time)
tps = 147.605122 (without initial connection time)
Test5
tps = 301.528886 (without 

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-14 Thread David Rowley
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela  wrote:
>
> Em qua., 14 de jul. de 2021 às 21:21, David Rowley  
> escreveu:
>> But, in v8 there is no additional branch, so no branch to mispredict.
>> I don't really see how your explanation fits.
>
> In v8 the branch occurs at :
> + if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)

You do know that branch is in a function that's only executed once
during executor initialization, right?

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-14 Thread Ranier Vilela
Em qua., 14 de jul. de 2021 às 21:21, David Rowley 
escreveu:

> On Thu, 15 Jul 2021 at 12:10, Ranier Vilela  wrote:
> >
> > Em qua., 14 de jul. de 2021 às 20:43, David Rowley 
> escreveu:
> >>
> >> On Thu, 15 Jul 2021 at 05:55, Ranier Vilela 
> wrote:
> >> > I repeated (3 times) the benchmark with v8 here,
> >> > and the results were not good.
> >>
> >> Do you have any good theories on why the additional branching that's
> >> done in v7b vs v8 might cause it to run faster?
> >
> >
> > Branch Predictions works with *more* probable path,
> > otherwise a penalty occurs and the cpu must revert the results.
>
> But, in v8 there is no additional branch, so no branch to mispredict.
> I don't really see how your explanation fits.
>
In v8 the branch occurs at :
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)

datumSort is tested first.

Cpu time is a more expensive resource.
Always is executed two branches, if it is right path, win,
otherwise occurs a penalty time.


> It seems much more likely to me that the results were just noisy.  It
> would be good to see if you can recreate them consistently.
>
I do.
Can you please share results with v7b?

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-14 Thread David Rowley
On Thu, 15 Jul 2021 at 12:10, Ranier Vilela  wrote:
>
> Em qua., 14 de jul. de 2021 às 20:43, David Rowley  
> escreveu:
>>
>> On Thu, 15 Jul 2021 at 05:55, Ranier Vilela  wrote:
>> > I repeated (3 times) the benchmark with v8 here,
>> > and the results were not good.
>>
>> Do you have any good theories on why the additional branching that's
>> done in v7b vs v8 might cause it to run faster?
>
>
> Branch Predictions works with *more* probable path,
> otherwise a penalty occurs and the cpu must revert the results.

But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.

It seems much more likely to me that the results were just noisy.  It
would be good to see if you can recreate them consistently.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-14 Thread Ranier Vilela
Em qua., 14 de jul. de 2021 às 20:43, David Rowley 
escreveu:

> On Thu, 15 Jul 2021 at 05:55, Ranier Vilela  wrote:
> > I repeated (3 times) the benchmark with v8 here,
> > and the results were not good.
>
> Do you have any good theories on why the additional branching that's
> done in v7b vs v8 might cause it to run faster?


Branch Predictions works with *more* probable path,
otherwise a penalty occurs and the cpu must revert the results.

In this case it seems to me that most of the time, tuplesort is the path.
So as it is tested if it is *datumSort* and the *prediction* fails,
the cpu has more work to reverse the wrong path.

To help the branch, test a more probable case first, anywhere.
if, switch, etc.

Another gain is the local variable tupleSort, which is obviously faster
than node.

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-14 Thread David Rowley
On Thu, 15 Jul 2021 at 05:55, Ranier Vilela  wrote:
> I repeated (3 times) the benchmark with v8 here,
> and the results were not good.

Do you have any good theories on why the additional branching that's
done in v7b vs v8 might cause it to run faster?

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-14 Thread John Naylor
On Wed, Jul 14, 2021 at 6:14 AM David Rowley  wrote:

> It would be good to get a 2nd opinion about this idea.  Also, more
> benchmark results with v6 and v8 would be good too.

I tested this on an older Xeon, gcc 8.4 (here median of each test, full
results attached):

test HEAD v6 v8

Test1 588 1007 998
Test2 198 202 197
Test3 374 516 512
Test4 172 165 166
Test5 255 279 283
Test6 227 251 251
Test7 145 147 146
Test8 474 783 770

Test4 could be a regression, but 2 and 7 look fine here.

--
John Naylor
EDB: http://www.enterprisedb.com
HEAD

Test1
tps = 576.619577 (without initial connection time)
tps = 588.508521 (without initial connection time)
tps = 591.358071 (without initial connection time)
Test2
tps = 193.739697 (without initial connection time)
tps = 198.485550 (without initial connection time)
tps = 212.013611 (without initial connection time)
Test3
tps = 374.865428 (without initial connection time)
tps = 365.930271 (without initial connection time)
tps = 373.662057 (without initial connection time)
Test4
tps = 170.663828 (without initial connection time)
tps = 174.067103 (without initial connection time)
tps = 172.869990 (without initial connection time)
Test5
tps = 255.232210 (without initial connection time)
tps = 254.580880 (without initial connection time)
tps = 260.868383 (without initial connection time)
Test6
tps = 227.235812 (without initial connection time)
tps = 227.012394 (without initial connection time)
tps = 232.099234 (without initial connection time)
Test7
tps = 144.924898 (without initial connection time)
tps = 141.170317 (without initial connection time)
tps = 152.338359 (without initial connection time)
Test8
tps = 473.734920 (without initial connection time)
tps = 485.953163 (without initial connection time)
tps = 474.231567 (without initial connection time)

v6

Test1
tps = 985.531367 (without initial connection time)
tps = 1006.531956 (without initial connection time)
tps = 1006.928283 (without initial connection time)
Test2
tps = 206.325749 (without initial connection time)
tps = 199.169152 (without initial connection time)
tps = 202.264704 (without initial connection time)
Test3
tps = 515.819017 (without initial connection time)
tps = 514.739251 (without initial connection time)
tps = 526.872129 (without initial connection time)
Test4
tps = 172.372655 (without initial connection time)
tps = 161.658650 (without initial connection time)
tps = 165.141814 (without initial connection time)
Test5
tps = 278.929619 (without initial connection time)
tps = 276.071475 (without initial connection time)
tps = 282.395453 (without initial connection time)
Test6
tps = 251.143322 (without initial connection time)
tps = 248.366339 (without initial connection time)
tps = 254.129250 (without initial connection time)
Test7
tps = 144.616732 (without initial connection time)
tps = 154.306604 (without initial connection time)
tps = 146.636703 (without initial connection time)
Test8
tps = 782.767588 (without initial connection time)
tps = 778.912500 (without initial connection time)
tps = 796.911010 (without initial connection time)

v8

Test1
tps = 976.674468 (without initial connection time)
tps = 997.846584 (without initial connection time)
tps = 997.877722 (without initial connection time)
Test2
tps = 199.376800 (without initial connection time)
tps = 197.235818 (without initial connection time)
tps = 201.615343 (without initial connection time)
Test3
tps = 461.241555 (without initial connection time)
tps = 511.929037 (without initial connection time)
tps = 516.238460 (without initial connection time)
Test4
tps = 166.201945 (without initial connection time)
tps = 166.315151 (without initial connection time)
tps = 169.887666 (without initial connection time)
Test5
tps = 279.323293 (without initial connection time)
tps = 285.120557 (without initial connection time)
tps = 282.674135 (without initial connection time)
Test6
tps = 250.446057 (without initial connection time)
tps = 256.509619 (without initial connection time)
tps = 250.906259 (without initial connection time)
Test7
tps = 155.097565 (without initial connection time)
tps = 145.446077 (without initial connection time)
tps = 145.750081 (without initial connection time)
Test8
tps = 769.952946 (without initial connection time)
tps = 767.403431 (without initial connection time)
tps = 787.201077 (without initial connection time)


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-14 Thread Ranier Vilela
Em qua., 14 de jul. de 2021 às 07:14, David Rowley 
escreveu:

> On Tue, 13 Jul 2021 at 15:15, David Rowley  wrote:
> > In theory, we likely could get rid of the small regression by having
> > two versions of ExecSort() and setting the correct one during
> > ExecInitSort() by setting the function pointer to the version we want
> > to use in sortstate->ss.ps.ExecProcNode.
>
> Just to see how it would perform, I tried what I mentioned above. I've
> included what I ended up with in the attached POC patch.
>
> I got the following results on my AMD hardware.
>
> Test master v8 patch comparison
> Test1   448.0   671.7   149.9%
> Test2   316.4   317.5   100.3%
> Test3   299.5   381.6   127.4%
> Test4   219.7   229.5   104.5%
> Test5   226.3   254.6   112.5%
> Test6   197.9   217.9   110.1%
> Test7   179.2   185.3   103.4%
> Test8   389.2   544.8   140.0%
>
I'm a little surprised by your results.
Test1 and Test8 look pretty good to me.
What is compiler and environment?

I repeated (3 times) the benchmark with v8 here,
and the results were not good.


  HEADv6  v7bv8
v6 vs head  v8 vs v6 v8 vs v7b
Test1 288,149636 449,018541 550,48505 468,168165 155,83% 104,26% 85,05%
Test2 94,766955 95,451406 94,718982 94,800275 100,72% 99,32% 100,09%
Test3 190,521319 260,279802 278,115296 262,538383 136,61% 100,87% 94,40%
Test4 78,779344 78,253455 77,941482 78,471546 99,33% 100,28% 100,68%
Test5 131,362614 142,662223 149,639041 144,849303 108,60% 101,53% 96,80%
Test6 112,884298 124,181671 127,58497 124,29376 110,01% 100,09% 97,42%
Test7 69,308587 68,643067 69,087544 69,437312 99,04% 101,16% 100,51%
Test8 243,674171 364,681142 419,259703 369,239176 149,66% 101,25% 88,07%



> This time I saw no regression on tests 2, 4 and 7.
>
> I looked to see if there was anywhere else in the executor that
> conditionally uses a different exec function in this way and found
> nothing, so I'm not too sure if it's a good idea to start doing this.
>
Specialized functions can be a way to optimize. The compilers themselves do
it.
But the ExecSortTuple and ExecSortDatum are much more similar,
which can cause maintenance problems.
I don't think in this case it would be a good idea.


>
> It would be good to get a 2nd opinion about this idea.  Also, more
> benchmark results with v6 and v8 would be good too.
>
Yeah, another different machine.
I would like to see other results with v7b.

Attached the file with all results from v8.

regards,
Ranier Vilela
Benchmarks datumSort:

6) v8 David

a)
Test1
tps = 426.606950 (without initial connection time)
tps = 420.964492 (without initial connection time)
tps = 429.016435 (without initial connection time)
Test2
tps = 93.388625 (without initial connection time)
tps = 94.571572 (without initial connection time)
tps = 94.581301 (without initial connection time)
Test3
tps = 251.625641 (without initial connection time)
tps = 251.769007 (without initial connection time)
tps = 251.576880 (without initial connection time)
Test4
tps = 77.892592 (without initial connection time)
tps = 77.664981 (without initial connection time)
tps = 77.618023 (without initial connection time)
Test5
tps = 141.801858 (without initial connection time)
tps = 141.957810 (without initial connection time)
tps = 141.849105 (without initial connection time)
Test6
tps = 122.650449 (without initial connection time)
tps = 122.603506 (without initial connection time)
tps = 122.786432 (without initial connection time)
Test7
tps = 68.602538 (without initial connection time)
tps = 68.940470 (without initial connection time)
tps = 68.770827 (without initial connection time)
Test8
tps = 350.593188 (without initial connection time)
tps = 349.741689 (without initial connection time)
tps = 349.544567 (without initial connection time)

b)
Test1
tps = 430.025697 (without initial connection time)
tps = 427.884165 (without initial connection time)
tps = 428.708592 (without initial connection time)
Test2
tps = 94.207150 (without initial connection time)
tps = 93.821936 (without initial connection time)
tps = 93.647174 (without initial connection time)
Test3
tps = 251.784817 (without initial connection time)
tps = 251.336243 (without initial connection time)
tps = 251.431278 (without initial connection time)
Test4
tps = 77.884797 (without initial connection time)
tps = 77.413191 (without initial connection time)
tps = 77.569484 (without initial connection time)
Test5
tps = 141.787480 (without initial connection time)
tps = 142.344187 (without initial connection time)
tps = 141.819273 (without initial connection time)
Test6
tps = 122.848858 (without initial connection time)
tps = 122.935840 (without initial connection time)
tps = 123.559398 (without initial connection time)
Test7
tps = 68.854804 (without initial connection time)
tps = 68.929120 (without initial connection time)
tps = 68.779992 (without initial connection time)
Test8
tps = 349.630138 (without initial connection time)
tps = 

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-13 Thread Ranier Vilela
Em ter., 13 de jul. de 2021 às 14:42, Ranier Vilela 
escreveu:

> Em ter., 13 de jul. de 2021 às 09:44, Ranier Vilela 
> escreveu:
>
>> Em ter., 13 de jul. de 2021 às 09:24, David Rowley 
>> escreveu:
>>
>>> On Wed, 14 Jul 2021 at 00:06, Ranier Vilela  wrote:
>>> >
>>> > Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <
>>> ronan.dunk...@aiven.io> escreveu:
>>> >> I would be
>>> >> surprised the check adds that much to the whole execution though.
>>> >
>>> > I think this branch is a misprediction.
>>>
>>> It could be.  I wondered that myself when I saw Ronan's results were
>>> better than mine for 2,4 and 7.  However, I think Ronan had quite a
>>> bit of noise in his results as there's no reason for the speedup in
>>> tests 2,4 and 7.
>>
>>
>>> > In most cases is it not datumSort?
>>>
>>> who knows.  Maybe someone's workload always requires the datum sort.
>>>
>>> > That's why I would like to use unlikely.
>>>
>>> We really only use unlikely() in cases where we want to move code out
>>> of line to a cold area because it's really never executed under normal
>>> circumstances. We tend to do that for ERROR cases as we don't ever
>>> really want to optimise for errors. We also sometimes do it when some
>>> function has a branch to initialise something during the first call.
>>> The case in question here does not fit for either of those two cases.
>>>
>> Hum, I understand the usage cases now.
>> Thanks for the hint.
>>
>>
>>>
>>> > IMO all the tests should all be to verify past behavior first.
>>>
>>> I'm not quire sure what you mean there.
>>>
>> I'm saying we could help the branch by keeping the same testing logic as
>> before and not reversing it.
>> Attached is a version to demonstrate this, I don't pretend to be v7.
>>
>> I couldn't find a good phrase to the contrary:
>> "are we *not* using the single value optimization ?"
>>
>> I don't have time to take the tests right now.
>>
> Finally I had time to benchmark (David's benchsort.sh)
>
> ubuntu 64 bits (20.04) 8gb ram SSD 256GB.
> Table with the best results of each.
>
>
>  HEADv6v7 v7b v6 vs
> master   v7 vs v6v7b vs v6
> Test1 288,149636 449,018541 469,757169 550,48505 155,83% 104,62% 122,60%
> Test2 94,766955 95,451406 94,556249 94,718982 100,72% 99,06% 99,23%
> Test3 190,521319 260,279802 259,597067 278,115296 136,61% 99,74% 106,85%
> Test4 78,779344 78,253455 78,114068 77,941482 99,33% 99,82% 99,60%
> Test5 131,362614 142,662223 136,436347 149,639041 108,60% 95,64% 104,89%
> Test6 112,884298 124,181671 115,528328 127,58497 110,01% 93,03% 102,74%
> Test7 69,308587 68,643067 66,10195 69,087544 99,04% 96,30% 100,65%
> Test8 243,674171 364,681142 371,928453 419,259703 149,66% 101,99% 114,97%
>
> I have no idea why v7 failed with test6?
> v6 slowdown with test4 and test7.
> v7b slowdown with test2 and test4, in relation with v7.
>
 v7b slowdown with test2 and test4, in relation with *v6*.


> If field struct datumSort is not absolutely necessary, I think that v7
> will be better.
>
 *v7b* will be better.

Sorry for the noise.

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-13 Thread Ranier Vilela
Em ter., 13 de jul. de 2021 às 09:44, Ranier Vilela 
escreveu:

> Em ter., 13 de jul. de 2021 às 09:24, David Rowley 
> escreveu:
>
>> On Wed, 14 Jul 2021 at 00:06, Ranier Vilela  wrote:
>> >
>> > Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <
>> ronan.dunk...@aiven.io> escreveu:
>> >> I would be
>> >> surprised the check adds that much to the whole execution though.
>> >
>> > I think this branch is a misprediction.
>>
>> It could be.  I wondered that myself when I saw Ronan's results were
>> better than mine for 2,4 and 7.  However, I think Ronan had quite a
>> bit of noise in his results as there's no reason for the speedup in
>> tests 2,4 and 7.
>
>
>> > In most cases is it not datumSort?
>>
>> who knows.  Maybe someone's workload always requires the datum sort.
>>
>> > That's why I would like to use unlikely.
>>
>> We really only use unlikely() in cases where we want to move code out
>> of line to a cold area because it's really never executed under normal
>> circumstances. We tend to do that for ERROR cases as we don't ever
>> really want to optimise for errors. We also sometimes do it when some
>> function has a branch to initialise something during the first call.
>> The case in question here does not fit for either of those two cases.
>>
> Hum, I understand the usage cases now.
> Thanks for the hint.
>
>
>>
>> > IMO all the tests should all be to verify past behavior first.
>>
>> I'm not quire sure what you mean there.
>>
> I'm saying we could help the branch by keeping the same testing logic as
> before and not reversing it.
> Attached is a version to demonstrate this, I don't pretend to be v7.
>
> I couldn't find a good phrase to the contrary:
> "are we *not* using the single value optimization ?"
>
> I don't have time to take the tests right now.
>
Finally I had time to benchmark (David's benchsort.sh)

ubuntu 64 bits (20.04) 8gb ram SSD 256GB.
Table with the best results of each.


 HEADv6v7 v7b v6 vs
master   v7 vs v6v7b vs v6
Test1 288,149636 449,018541 469,757169 550,48505 155,83% 104,62% 122,60%
Test2 94,766955 95,451406 94,556249 94,718982 100,72% 99,06% 99,23%
Test3 190,521319 260,279802 259,597067 278,115296 136,61% 99,74% 106,85%
Test4 78,779344 78,253455 78,114068 77,941482 99,33% 99,82% 99,60%
Test5 131,362614 142,662223 136,436347 149,639041 108,60% 95,64% 104,89%
Test6 112,884298 124,181671 115,528328 127,58497 110,01% 93,03% 102,74%
Test7 69,308587 68,643067 66,10195 69,087544 99,04% 96,30% 100,65%
Test8 243,674171 364,681142 371,928453 419,259703 149,66% 101,99% 114,97%

I have no idea why v7 failed with test6?
v6 slowdown with test4 and test7.
v7b slowdown with test2 and test4, in relation with v7.

If field struct datumSort is not absolutely necessary, I think that v7 will
be better.
Attached the patchs and file results.
Benchmarks datumSort:

1) head

a)
Test1
tps = 286.571700 (without initial connection time)
tps = 285.650960 (without initial connection time)
tps = 287.069988 (without initial connection time)
Test2
tps = 94.565228 (without initial connection time)
tps = 94.690370 (without initial connection time)
tps = 94.741829 (without initial connection time)
Test3
tps = 190.581533 (without initial connection time)
tps = 190.405579 (without initial connection time)
tps = 190.222865 (without initial connection time)
Test4
tps = 78.001599 (without initial connection time)
tps = 78.163404 (without initial connection time)
tps = 78.325974 (without initial connection time)
Test5
tps = 130.276589 (without initial connection time)
tps = 128.854218 (without initial connection time)
tps = 129.062907 (without initial connection time)
Test6
tps = 112.629915 (without initial connection time)
tps = 113.179220 (without initial connection time)
tps = 112.395572 (without initial connection time)
Test7
tps = 69.109489 (without initial connection time)
tps = 68.973755 (without initial connection time)
tps = 69.204028 (without initial connection time)
Test8
tps = 242.364808 (without initial connection time)
tps = 242.448441 (without initial connection time)
tps = 243.191950 (without initial connection time)


b)
Test1
tps = 288.149636 (without initial connection time)
tps = 287.761581 (without initial connection time)
tps = 286.865262 (without initial connection time)
Test2
tps = 94.766955 (without initial connection time)
tps = 94.361680 (without initial connection time)
tps = 94.636953 (without initial connection time)
Test3
tps = 190.394402 (without initial connection time)
tps = 190.521319 (without initial connection time)
tps = 190.446202 (without initial connection time)
Test4
tps = 78.096603 (without initial connection time)
tps = 78.576600 (without initial connection time)
tps = 78.779344 (without initial connection time)
Test5
tps = 131.131458 (without initial connection time)
tps = 131.344875 (without initial connection time)
tps = 131.362614 (without initial connection time)
Test6
tps = 112.884298 

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-13 Thread Ranier Vilela
Em ter., 13 de jul. de 2021 às 09:24, David Rowley 
escreveu:

> On Wed, 14 Jul 2021 at 00:06, Ranier Vilela  wrote:
> >
> > Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <
> ronan.dunk...@aiven.io> escreveu:
> >> I would be
> >> surprised the check adds that much to the whole execution though.
> >
> > I think this branch is a misprediction.
>
> It could be.  I wondered that myself when I saw Ronan's results were
> better than mine for 2,4 and 7.  However, I think Ronan had quite a
> bit of noise in his results as there's no reason for the speedup in
> tests 2,4 and 7.


> > In most cases is it not datumSort?
>
> who knows.  Maybe someone's workload always requires the datum sort.
>
> > That's why I would like to use unlikely.
>
> We really only use unlikely() in cases where we want to move code out
> of line to a cold area because it's really never executed under normal
> circumstances. We tend to do that for ERROR cases as we don't ever
> really want to optimise for errors. We also sometimes do it when some
> function has a branch to initialise something during the first call.
> The case in question here does not fit for either of those two cases.
>
Hum, I understand the usage cases now.
Thanks for the hint.


>
> > IMO all the tests should all be to verify past behavior first.
>
> I'm not quire sure what you mean there.
>
I'm saying we could help the branch by keeping the same testing logic as
before and not reversing it.
Attached is a version to demonstrate this, I don't pretend to be v7.

I couldn't find a good phrase to the contrary:
"are we *not* using the single value optimization ?"

I don't have time to take the tests right now.

regards,
Ranier Vilela


allow_single_datum_sort.patch
Description: Binary data


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-13 Thread David Rowley
On Wed, 14 Jul 2021 at 00:06, Ranier Vilela  wrote:
>
> Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau  
> escreveu:
>> I would be
>> surprised the check adds that much to the whole execution though.
>
> I think this branch is a misprediction.

It could be.  I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7.  However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7.

> In most cases is it not datumSort?

who knows.  Maybe someone's workload always requires the datum sort.

> That's why I would like to use unlikely.

We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.

> IMO all the tests should all be to verify past behavior first.

I'm not quire sure what you mean there.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-13 Thread Ranier Vilela
Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau 
escreveu:

> > I've now pushed that bug fix so it's fine to remove the change to
>
>

> I would be
> surprised the check adds that much to the whole execution though.
>
I think this branch is a misprediction.
In most cases is it not datumSort?
That's why I would like to use unlikely.

IMO all the tests should all be to verify past behavior first.

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-13 Thread Ronan Dunklau
> I've now pushed that bug fix so it's fine to remove the change to
> tuplesort.c now.

Thanks, I've rebased the patch, please find attached the v6.

> 
> I also did a round of benchmarking on this patch using the attached
> script. Anyone wanting to run it will need to run make installcheck
> first to create the required tables.

I've run your benchmark, keeping the best of three runs each time.
This is an intel laptop, so as many things are running on it there is a lot of 
noise... 

Both standard and patched run come from a compilation with gcc -O2. No changes 
have been done to the default settings.

Query # Master  Patched Variation
1   884 1627184.05%
2   364 375 103.02%
3   568 783 137.85%
4   296 297 100.34%
5   421 484 114.96%
6   359 408 113.65%
7   237 251 105.91%
8   806 1271157.69%

Since I didn't reproduce your slowdown at all on the first run, I tried to 
rerun the benchmark several times and for the "dubious cases" (2, 4 and 7), 
the results are too jittery to conclude one way or another in my case.  I 
don't have access to proper hardware, so not sure if that would be useful in 
any way to just run the bench for thousands of xacts instead. I would be 
surprised the check adds that much to the whole execution though.

I attach a graph similar to yours for reference.


-- 
Ronan Dunklaucommit 01885709d1718710db1665bf2b3f39d83c720147
Author: Ronan Dunklau 
Date:   Tue Jul 6 16:34:31 2021 +0200

Use optimized single-datum tuplesort in ExecSort

Previously, with optimization was only done in ExecAgg.
Doing this optimization on regular sort nodes provides for a nice
performance improvement, when the input / output tuple has only one
attribute.

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..df4e79c6ba 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
  *		which saves the results in a temporary file or memory. After the
  *		initial call, returns a tuple from the file with each call.
  *
+ *		The tuplesort can either occur on the whole tuple (this is the nominal
+ *		case) or, when the input / output tuple consists of only one attribute,
+ *		we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
  *		Conditions:
  *		  -- none.
  *
@@ -86,31 +90,64 @@ ExecSort(PlanState *pstate)
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-			  plannode->numCols,
-			  plannode->sortColIdx,
-			  plannode->sortOperators,
-			  plannode->collations,
-			  plannode->nullsFirst,
-			  work_mem,
-			  NULL,
-			  node->randomAccess);
+		/*
+		 * Switch to the tuplesort_*_datum interface when we have only one
+		 * key, as it is much more efficient especially when the type is
+		 * pass-by-value.
+		 */
+		if (tupDesc->natts == 1)
+		{
+			node->datumSort = true;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+   plannode->sortOperators[0],
+   plannode->collations[0],
+   plannode->nullsFirst[0],
+   work_mem,
+   NULL,
+   node->randomAccess);
+		}
+		else
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
+  plannode->numCols,
+  plannode->sortColIdx,
+  plannode->sortOperators,
+  plannode->collations,
+  plannode->nullsFirst,
+  work_mem,
+  NULL,
+  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
 
 		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
+		 * Scan the subplan and feed all the tuples to tuplesort, using either
+		 * the _putdatum or _puttupleslot API as appropriate.
 		 */
-
-		for (;;)
+		if (node->datumSort)
 		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+slot_getsomeattrs(slot, 1);
+tuplesort_putdatum(tuplesortstate,
+   slot->tts_values[0],
+   slot->tts_isnull[0]);
+			}
+		}
+		else
+		{
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+tuplesort_puttupleslot(tuplesortstate, slot);
+			}
 		}
 
 		/*
@@ -150,9 +187,18 @@ ExecSort(PlanState *pstate)
 	 * next fetch from the tuplesort.
 	 */
 	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
-  ScanDirectionIsForward(dir),
-  false, slot, NULL);
+	if (node->datumSort)
+	{
+		ExecClearTuple(slot);
+		if (tuplesort_getdatum(tuplesortstate, 

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-12 Thread David Rowley
On Tue, 13 Jul 2021 at 01:59, Ronan Dunklau  wrote:
> > 3. This seems to be a bug fix where byval datum sorts do not properly
> > handle bounded sorts.  I think that maybe that should be fixed and
> > backpatched.  I don't see anything that says Datum sorts can't be
> > bounded and if there were some restriction on that I'd expect
> > tuplesort_set_bound() to fail when the Tuplesortstate had been set up
> > with tuplesort_begin_datum().
>
> I've kept this as-is for now, but I will remove it from my patch if it is
> deemed worthy of back-patching in your other thread.

I've now pushed that bug fix so it's fine to remove the change to
tuplesort.c now.

I also did a round of benchmarking on this patch using the attached
script. Anyone wanting to run it will need to run make installcheck
first to create the required tables.

On an AMD machine, I got the following results.

Result in transactions per second.
Test master v5 patch compare
Test1 446.1 657.3 147.32%
Test2 315.8 314.0 99.44%
Test3 302.3 392.1 129.67%
Test4 232.7 230.7 99.12%
Test5 230.0 446.1 194.00%
Test6 199.5 217.9 109.23%
Test7 188.7 185.3 98.21%
Test8 385.4 544.0 141.17%

Tests 2, 4, 7 are designed to check if there is any regression from
doing the additional run-time checks to see if we're doing datumSort.
I measured a very small penalty from this. It's most visible in test7
with a drop of about 1.8%. Each test did OFFSET 100 as I didn't
want to measure the overhead of outputting tuples.

All the other tests show a pretty good gain. Test6 is testing a byref
type, so it appears the gains are not just from byval datums.

It would be good to see the benchmark script run on a few other
machines to get an idea if the gains and losses are consistent.

In theory, we likely could get rid of the small regression by having
two versions of ExecSort() and setting the correct one during
ExecInitSort() by setting the function pointer to the version we want
to use in sortstate->ss.ps.ExecProcNode. But maybe the small
regression is not worth going to that trouble over. I'm not aware of
any other executor nodes that have logic like that so maybe it would
be a bad idea to introduce something like that.

David
#!/bin/bash

# This should improve
echo "select two from tenk1 order by two offset 100" > bench.sql
echo Test1
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps

# This should stay the same
echo "select * from tenk1 order by two offset 100" > bench.sql
echo Test2
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps

# This should improve
echo "select tenthous from tenk1 order by tenthous offset 100" > bench.sql
echo Test3
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps

# This should stay the same
echo "select * from tenk1 order by tenthous offset 100" > bench.sql
echo Test4
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps

# This should improve
echo "select stringu1 from tenk1 order by stringu1 offset 100" > bench.sql
echo Test5
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps

# This should improve
echo "select stringu2 from tenk1 order by stringu2 offset 100" > bench.sql
echo Test6
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps

# This should stay the same
echo "select * from tenk1 order by stringu2 offset 100" > bench.sql
echo Test7
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps

# This should improve
echo "select twenty from tenk1 order by twenty offset 100" > bench.sql
echo Test8
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps
pgbench -n -f bench.sql -T 10 regression | grep tps




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-12 Thread Ronan Dunklau
Le lundi 12 juillet 2021, 15:11:17 CEST David Rowley a écrit :
> On Wed, 7 Jul 2021 at 21:32, Ronan Dunklau  wrote:
> > In the meantime I fixed some formatting issues, please find attached a new
> > patch.
> 
> I started to look at this.

Thank you ! I'm attaching a new version of the patch taking your remarks into 
account.
> 
> First I wondered how often we might be able to apply this
> optimisation, so I ran make check after adding some elog(NOTICE) calls
> to output which method is going to be used just before we do the
> tuplestore_begin_* calls.  It looks like there are 614 instances of
> Datum sorts and 4223 of tuple sorts. That's about 14.5% datum sorts.
> 223 of the 614 are byval types and the other 391 are byref. Not that
> the regression tests are a good reflection of the real world, but if
> it were then that's quite a good number of cases to be able to
> optimise.

That's an interesting stat.

> 
> As for the patch, just a few things:
> 
> 1. Can you add the missing braces in this if condition and the else
> condition that belongs to it.
> 
> + if (node->is_single_val)
> + for (;;)
> + {
> 

Done.

> 2. I think it would nicer to name the new is_single_val field
> "datumSort" instead. To me it seems more clear what it is for.

Done.

> 
> 3. This seems to be a bug fix where byval datum sorts do not properly
> handle bounded sorts.  I think that maybe that should be fixed and
> backpatched.  I don't see anything that says Datum sorts can't be
> bounded and if there were some restriction on that I'd expect
> tuplesort_set_bound() to fail when the Tuplesortstate had been set up
> with tuplesort_begin_datum().

I've kept this as-is for now, but I will remove it from my patch if it is 
deemed worthy of back-patching in your other thread. 

Regards,

-- 
Ronan Dunklaucommit 91210952cf435995c9db6b539bf29f6fd744be70
Author: Ronan Dunklau 
Date:   Tue Jul 6 16:34:31 2021 +0200

Use optimized single-datum tuplesort in ExecSort

Previously, with optimization was only done in ExecAgg.
Doing this optimization on regular sort nodes provides for a nice
performance improvement, when the input / output tuple has only one
attribute.

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..df4e79c6ba 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
  *		which saves the results in a temporary file or memory. After the
  *		initial call, returns a tuple from the file with each call.
  *
+ *		The tuplesort can either occur on the whole tuple (this is the nominal
+ *		case) or, when the input / output tuple consists of only one attribute,
+ *		we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
  *		Conditions:
  *		  -- none.
  *
@@ -86,31 +90,64 @@ ExecSort(PlanState *pstate)
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-			  plannode->numCols,
-			  plannode->sortColIdx,
-			  plannode->sortOperators,
-			  plannode->collations,
-			  plannode->nullsFirst,
-			  work_mem,
-			  NULL,
-			  node->randomAccess);
+		/*
+		 * Switch to the tuplesort_*_datum interface when we have only one
+		 * key, as it is much more efficient especially when the type is
+		 * pass-by-value.
+		 */
+		if (tupDesc->natts == 1)
+		{
+			node->datumSort = true;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+   plannode->sortOperators[0],
+   plannode->collations[0],
+   plannode->nullsFirst[0],
+   work_mem,
+   NULL,
+   node->randomAccess);
+		}
+		else
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
+  plannode->numCols,
+  plannode->sortColIdx,
+  plannode->sortOperators,
+  plannode->collations,
+  plannode->nullsFirst,
+  work_mem,
+  NULL,
+  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
 
 		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
+		 * Scan the subplan and feed all the tuples to tuplesort, using either
+		 * the _putdatum or _puttupleslot API as appropriate.
 		 */
-
-		for (;;)
+		if (node->datumSort)
 		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+slot_getsomeattrs(slot, 1);
+tuplesort_putdatum(tuplesortstate,
+   slot->tts_values[0],
+   slot->tts_isnull[0]);
+			}
+		}
+		else
+		{
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-12 Thread David Rowley
On Wed, 7 Jul 2021 at 21:32, Ronan Dunklau  wrote:
> In the meantime I fixed some formatting issues, please find attached a new
> patch.

I started to look at this.

First I wondered how often we might be able to apply this
optimisation, so I ran make check after adding some elog(NOTICE) calls
to output which method is going to be used just before we do the
tuplestore_begin_* calls.  It looks like there are 614 instances of
Datum sorts and 4223 of tuple sorts. That's about 14.5% datum sorts.
223 of the 614 are byval types and the other 391 are byref. Not that
the regression tests are a good reflection of the real world, but if
it were then that's quite a good number of cases to be able to
optimise.

As for the patch, just a few things:

1. Can you add the missing braces in this if condition and the else
condition that belongs to it.

+ if (node->is_single_val)
+ for (;;)
+ {

2. I think it would nicer to name the new is_single_val field
"datumSort" instead. To me it seems more clear what it is for.

3. This seems to be a bug fix where byval datum sorts do not properly
handle bounded sorts.  I think that maybe that should be fixed and
backpatched.  I don't see anything that says Datum sorts can't be
bounded and if there were some restriction on that I'd expect
tuplesort_set_bound() to fail when the Tuplesortstate had been set up
with tuplesort_begin_datum().

 static void
 free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
 {
- FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
- pfree(stup->tuple);
+ /*
+ * If the SortTuple is actually only a single Datum, which was not copied
+ * as it is a byval type, do not try to free it nor account for it in
+ * memory used.
+ */
+ if (stup->tuple)
+ {
+ FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ pfree(stup->tuple);
+ }

I can take this to another thread.

That's all I have for now.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-07 Thread Ronan Dunklau
Le mardi 6 juillet 2021, 17:37:53 CEST James Coleman a écrit :
> Yes and no. When incremental sort has to do a full sort there will
> always be at least 2 attributes. But in prefix sort mode (see
> prefixsort_state) only non-presorted columns are sorted (i.e., if
> given a,b already sorted by a, then only b is sorted). So the
> prefixsort_state could use this optimization.

The optimization is not when we actually sort on a single key, but when we get 
a single attribute in / out of the tuplesort.  Since sorting always add 
resjunk entries for the keys being sorted on, I don't think we can ever end up 
in a situation where the optimization would kick in, since the entries for the 
already-performed-sort keys will need to be present in the output.

Maybe if instead of adding resjunk entries to the whole query's targetlist, 
sort and incrementalsort nodes were able to do a projection from the input 
(needed tle + resjunk sorting tle) to a tuple containing only the needed tle 
on output before actually sorting it, it would be possible, but that would be 
quite a big design change.

In the meantime I fixed some formatting issues, please find attached a new 
patch.


-- 
Ronan Dunklaudiff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..9d8b0a77da 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
  *		which saves the results in a temporary file or memory. After the
  *		initial call, returns a tuple from the file with each call.
  *
+ *		The tuplesort can either occur on the whole tuple (this is the nominal
+ *		case) or, when the input / output tuple consists of only one attribute,
+ *		we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
  *		Conditions:
  *		  -- none.
  *
@@ -86,32 +90,61 @@ ExecSort(PlanState *pstate)
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-			  plannode->numCols,
-			  plannode->sortColIdx,
-			  plannode->sortOperators,
-			  plannode->collations,
-			  plannode->nullsFirst,
-			  work_mem,
-			  NULL,
-			  node->randomAccess);
+		/*
+		 * Switch to the tuplesort_*_datum interface when we have only one
+		 * key, as it is much more efficient especially when the type is
+		 * pass-by-value.
+		 */
+		if (tupDesc->natts == 1)
+		{
+			node->is_single_val = true;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+   plannode->sortOperators[0],
+   plannode->collations[0],
+   plannode->nullsFirst[0],
+   work_mem,
+   NULL,
+   node->randomAccess);
+		}
+		else
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
+  plannode->numCols,
+  plannode->sortColIdx,
+  plannode->sortOperators,
+  plannode->collations,
+  plannode->nullsFirst,
+  work_mem,
+  NULL,
+  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
 
 		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
+		 * Scan the subplan and feed all the tuples to tuplesort, using either
+		 * the _putdatum or _puttupleslot API as appropriate.
 		 */
-
-		for (;;)
-		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
-		}
+		if (node->is_single_val)
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+slot_getsomeattrs(slot, 1);
+tuplesort_putdatum(tuplesortstate,
+   slot->tts_values[0],
+   slot->tts_isnull[0]);
+			}
+		else
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+tuplesort_puttupleslot(tuplesortstate, slot);
+			}
 
 		/*
 		 * Complete the sort.
@@ -150,9 +183,18 @@ ExecSort(PlanState *pstate)
 	 * next fetch from the tuplesort.
 	 */
 	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
-  ScanDirectionIsForward(dir),
-  false, slot, NULL);
+	if (node->is_single_val)
+	{
+		ExecClearTuple(slot);
+		if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+			   &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+			ExecStoreVirtualTuple(slot);
+	}
+	else
+		(void) tuplesort_gettupleslot(tuplesortstate,
+	  ScanDirectionIsForward(dir),
+	  false, slot, NULL);
+
 	return slot;
 }
 
@@ -191,6 +233,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate->bounded = false;
 	sortstate->sort_Done = false;
 	sortstate->tuplesortstate = NULL;
+	sortstate->is_single_val = false;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/utils/sort/tuplesort.c 

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-06 Thread James Coleman
On Tue, Jul 6, 2021 at 11:03 AM Ronan Dunklau  wrote:
>
> Thank you for the review, I will address those shortly, but will answer some
> questions in the meantime.
>
> > > First, the changes are lacking any explanatory comments. Probably we
> > > should follow how nodeAgg does this and add both comments to the
> > > ExecSort function header as well as specific comments above the "if"
> > > around the new tuplesort_begin_datum explaining the specific
> > > conditions that are required for the optimization to be useful and
> > > safe.
>
> Done, since I lifted the restrictions following your questions, there isn't
> much left to comment. (see below)
>
> > >
> > > That leads to a question I had: I don't follow why bounded mode (when
> > > using byval) needs to be excluded. Comments should be added if there's
> > > a good reason (as noted above), but maybe it's a case we can handle
> > > safely?
>
> I had test failures when trying to move the Datum around when performing a
> bounded sort, but did not look into it at first.
>
> Now I've looked into it, and the switch to a heapsort when using bounded mode
> just unconditionnaly tried to free a tuple that was never there to begin with.
> So if the SortTuple does not contain an actual tuple, but only a single datum,
> do not do that.
>
> I've updated the patch to fix this and enable the optimization in the case of
> bounded sort.

Awesome.

> > > A second question: at first glance it's intuitively the case we might
> > > not be able to handle byref values. But nodeAgg doesn't seem to have
> > > that restriction. What's the difference here?
> >
> > I think tuplesort_begin_datum, doesn't have any such limitation, it
> > can handle any type of Datum so I think we don't need to consider the
> > only attbyval, we can consider any type of attribute for this
> > optimization.
>
> I've restricted the optimization to byval types because of the following
> comment in nodeAgg.c:
>
> /*
>  * Note: if input type is pass-by-ref, the datums returned by the
> sort are
>  * freshly palloc'd in the per-query context, so we must be careful
> to
>  * pfree them when they are no longer needed.
>  */
>
> As I was not sure how to handle that, I prefered the safety of not enabling
> it. Since you both told me it should be safe, I've lifted that restriction
> too.

To be clear, I don't know for certain it's safe [without extra work],
but even if it involves some extra manual pfree'ing (a la nodeAgg)
it's probably worth it. Maybe someone else will weigh in on whether or
not anything special is required here to ensure we don't leak memory
(I haven't looked in detail yet).

> > A few small code observations:
> > - In my view the addition of unlikely() in ExecSort is unlikely to be
> > of benefit because it's a single call for the entire node's execution
> > (not in the tuple loop).
>
> Done.
>
> > - It seems clearer to change the "if (!node->is_single_val)" to flip
> > the true/false cases so we don't need the negation.
>
> Agreed, done.

Thanks

> > - I assume there are tests that likely already cover this case, but
> > it'd be worth verifying that.
>
> Yes many test cases cover that, but maybe it would be better to explictly
> check for it on some cases: do you think we could add a debug message that can
> be checked for ?

Mostly I think we should verify code coverage and _maybe_ add a
specific test or two that we know execises this path. I don't know
that the debug message needs to be matched in the test (probably more
pain than it's worth), but the debug message ("enabling datum sort
optimizaton" or similar) might be good anyway.

I wonder if we need to change costing of sorts for this case. I don't
like having to do so, but it's a significant change in speed, so
probably should impact what plan gets chosen. Hopefully others will
weigh on this also.

> > Finally, I believe the same optimization likely ought to be added to
> > nodeIncrementalSort. It's less likely the tests there are sufficient
> > for both this and the original case, but we'll see.
>
> I will look into it, but isn't incrementalsort used to sort tuples on several
> keys, when they are already sorted on the first ? In that case, I doubt we
> would ever have a single-valued tuple here, except if there is a projection to
> strip the tuple from extraneous attributes.

Yes and no. When incremental sort has to do a full sort there will
always be at least 2 attributes. But in prefix sort mode (see
prefixsort_state) only non-presorted columns are sorted (i.e., if
given a,b already sorted by a, then only b is sorted). So the
prefixsort_state could use this optimization.

James




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-06 Thread Ronan Dunklau
Thank you for the review, I will address those shortly, but will answer some 
questions in the meantime.

> > First, the changes are lacking any explanatory comments. Probably we
> > should follow how nodeAgg does this and add both comments to the
> > ExecSort function header as well as specific comments above the "if"
> > around the new tuplesort_begin_datum explaining the specific
> > conditions that are required for the optimization to be useful and
> > safe.

Done, since I lifted the restrictions following your questions, there isn't 
much left to comment. (see below)

> > 
> > That leads to a question I had: I don't follow why bounded mode (when
> > using byval) needs to be excluded. Comments should be added if there's
> > a good reason (as noted above), but maybe it's a case we can handle
> > safely?

I had test failures when trying to move the Datum around when performing a 
bounded sort, but did not look into it at first.

Now I've looked into it, and the switch to a heapsort when using bounded mode 
just unconditionnaly tried to free a tuple that was never there to begin with. 
So if the SortTuple does not contain an actual tuple, but only a single datum, 
do not do that. 

I've updated the patch to fix this and enable the optimization in the case of 
bounded sort.

> > 
> > A second question: at first glance it's intuitively the case we might
> > not be able to handle byref values. But nodeAgg doesn't seem to have
> > that restriction. What's the difference here?
> 
> I think tuplesort_begin_datum, doesn't have any such limitation, it
> can handle any type of Datum so I think we don't need to consider the
> only attbyval, we can consider any type of attribute for this
> optimization.

I've restricted the optimization to byval types because of the following 
comment in nodeAgg.c:

/*
 * Note: if input type is pass-by-ref, the datums returned by the 
sort are
 * freshly palloc'd in the per-query context, so we must be careful 
to
 * pfree them when they are no longer needed.
 */

As I was not sure how to handle that, I prefered the safety of not enabling 
it. Since you both told me it should be safe, I've lifted that restriction 
too.


> A few small code observations:
> - In my view the addition of unlikely() in ExecSort is unlikely to be
> of benefit because it's a single call for the entire node's execution
> (not in the tuple loop).

Done.

> - It seems clearer to change the "if (!node->is_single_val)" to flip
> the true/false cases so we don't need the negation.

Agreed, done.

> - I assume there are tests that likely already cover this case, but
> it'd be worth verifying that.

Yes many test cases cover that, but maybe it would be better to explictly 
check for it on some cases: do you think we could add a debug message that can 
be checked for ? 

> Finally, I believe the same optimization likely ought to be added to
> nodeIncrementalSort. It's less likely the tests there are sufficient
> for both this and the original case, but we'll see.

I will look into it, but isn't incrementalsort used to sort tuples on several 
keys, when they are already sorted on the first ? In that case, I doubt we 
would ever have a single-valued tuple here, except if there is a projection to 
strip the tuple from extraneous attributes.

-- 
Ronan Dunklaudiff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..ff443d15a9 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
  *		which saves the results in a temporary file or memory. After the
  *		initial call, returns a tuple from the file with each call.
  *
+ *		The tuplesort can either occur on the whole tuple (this is the nominal
+ *		case) or, when the input / output tuple consists of only one attribute,
+ *		we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
  *		Conditions:
  *		  -- none.
  *
@@ -85,33 +89,59 @@ ExecSort(PlanState *pstate)
 
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
-
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-			  plannode->numCols,
-			  plannode->sortColIdx,
-			  plannode->sortOperators,
-			  plannode->collations,
-			  plannode->nullsFirst,
-			  work_mem,
-			  NULL,
-			  node->randomAccess);
+		/*
+		 * Switch to the tuplesort_*_datum interface when we have only one key,
+		 * as it is much more efficient especially when the type is pass-by-value.
+		 */
+		if (tupDesc->natts == 1)
+		{
+			node->is_single_val = true;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+   plannode->sortOperators[0],
+   plannode->collations[0],
+   plannode->nullsFirst[0],
+   work_mem,
+   NULL,
+   node->randomAccess);
+		} else
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
+	

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-06 Thread Dilip Kumar
On Tue, Jul 6, 2021 at 6:49 PM James Coleman  wrote:
>
> Adding David since this patch is likely a precondition for [1].
>
> On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau  wrote:
> >
> > Hello,
> >
> > While testing the patch "Add proper planner support for ORDER BY / DISTINCT
> > aggregates" [0] I discovered the performance penalty from adding a sort node
> > essentially came from not using the single-datum tuplesort optimization in
> > ExecSort (contrary to the sorting done in ExecAgg).
> >
> > I originally proposed this patch as a companion in the same thread [1], but
> > following James suggestion I'm making a separate thread just for this as the
> > optimization is worthwhile independently of David's patch: it looks like we
> > can expect a 2x speedup on a "select a single ordered column" case.
> >
> > The patch aimed to be as simple as possible: we only turn this optimization 
> > on
> > when the tuple being sorted has only one attribute, it is "byval" (so as not
> > to incur copies which would be hard to track in the execution tree) and
> > unbound (again, not having to deal with copying borrowed datum anywhere).
>
> Thanks again for finding this and working up a patch.
>
> I've taken a look, and while I haven't dug into testing it yet, I have
> a few comments.
>
> First, the changes are lacking any explanatory comments. Probably we
> should follow how nodeAgg does this and add both comments to the
> ExecSort function header as well as specific comments above the "if"
> around the new tuplesort_begin_datum explaining the specific
> conditions that are required for the optimization to be useful and
> safe.
>
> That leads to a question I had: I don't follow why bounded mode (when
> using byval) needs to be excluded. Comments should be added if there's
> a good reason (as noted above), but maybe it's a case we can handle
> safely?
>
> A second question: at first glance it's intuitively the case we might
> not be able to handle byref values. But nodeAgg doesn't seem to have
> that restriction. What's the difference here?
>

I think tuplesort_begin_datum, doesn't have any such limitation, it
can handle any type of Datum so I think we don't need to consider the
only attbyval, we can consider any type of attribute for this
optimization.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-06 Thread Ranier Vilela
Em ter., 6 de jul. de 2021 às 10:19, James Coleman 
escreveu:

> Adding David since this patch is likely a precondition for [1].
>
> On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau 
> wrote:
> >
> > Hello,
> >
> > While testing the patch "Add proper planner support for ORDER BY /
> DISTINCT
> > aggregates" [0] I discovered the performance penalty from adding a sort
> node
> > essentially came from not using the single-datum tuplesort optimization
> in
> > ExecSort (contrary to the sorting done in ExecAgg).
> >
> > I originally proposed this patch as a companion in the same thread [1],
> but
> > following James suggestion I'm making a separate thread just for this as
> the
> > optimization is worthwhile independently of David's patch: it looks like
> we
> > can expect a 2x speedup on a "select a single ordered column" case.
> >
> > The patch aimed to be as simple as possible: we only turn this
> optimization on
> > when the tuple being sorted has only one attribute, it is "byval" (so as
> not
> > to incur copies which would be hard to track in the execution tree) and
> > unbound (again, not having to deal with copying borrowed datum anywhere).
>
> Thanks again for finding this and working up a patch.
>
> I've taken a look, and while I haven't dug into testing it yet, I have
> a few comments.
>
> First, the changes are lacking any explanatory comments. Probably we
> should follow how nodeAgg does this and add both comments to the
> ExecSort function header as well as specific comments above the "if"
> around the new tuplesort_begin_datum explaining the specific
> conditions that are required for the optimization to be useful and
> safe.
>
> That leads to a question I had: I don't follow why bounded mode (when
> using byval) needs to be excluded. Comments should be added if there's
> a good reason (as noted above), but maybe it's a case we can handle
> safely?
>
> A second question: at first glance it's intuitively the case we might
> not be able to handle byref values. But nodeAgg doesn't seem to have
> that restriction. What's the difference here?
>
> A few small code observations:
> - In my view the addition of unlikely() in ExecSort is unlikely to be
> of benefit because it's a single call for the entire node's execution
> (not in the tuple loop).
>
No objection. And I agree that testing is complex and needs to remain as it
is.

- It seems clearer to change the "if (!node->is_single_val)" to flip
> the true/false cases so we don't need the negation.
>
I think yes, it can be better.

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-06 Thread James Coleman
Adding David since this patch is likely a precondition for [1].

On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau  wrote:
>
> Hello,
>
> While testing the patch "Add proper planner support for ORDER BY / DISTINCT
> aggregates" [0] I discovered the performance penalty from adding a sort node
> essentially came from not using the single-datum tuplesort optimization in
> ExecSort (contrary to the sorting done in ExecAgg).
>
> I originally proposed this patch as a companion in the same thread [1], but
> following James suggestion I'm making a separate thread just for this as the
> optimization is worthwhile independently of David's patch: it looks like we
> can expect a 2x speedup on a "select a single ordered column" case.
>
> The patch aimed to be as simple as possible: we only turn this optimization on
> when the tuple being sorted has only one attribute, it is "byval" (so as not
> to incur copies which would be hard to track in the execution tree) and
> unbound (again, not having to deal with copying borrowed datum anywhere).

Thanks again for finding this and working up a patch.

I've taken a look, and while I haven't dug into testing it yet, I have
a few comments.

First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.

That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?

A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?

A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
- I assume there are tests that likely already cover this case, but
it'd be worth verifying that.

Finally, I believe the same optimization likely ought to be added to
nodeIncrementalSort. It's less likely the tests there are sufficient
for both this and the original case, but we'll see.

Thanks,
James

1: 
https://www.postgresql.org/message-id/CAApHDvpHzfo92%3DR4W0%2BxVua3BUYCKMckWAmo-2t_KiXN-wYH%3Dw%40mail.gmail.com




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-06 Thread Ranier Vilela
Em ter., 6 de jul. de 2021 às 08:25, Ranier Vilela 
escreveu:

> Em ter., 6 de jul. de 2021 às 03:15, Ronan Dunklau 
> escreveu:
>
>> Hello,
>>
>> While testing the patch "Add proper planner support for ORDER BY /
>> DISTINCT
>> aggregates" [0] I discovered the performance penalty from adding a sort
>> node
>> essentially came from not using the single-datum tuplesort optimization
>> in
>> ExecSort (contrary to the sorting done in ExecAgg).
>>
>> I originally proposed this patch as a companion in the same thread [1],
>> but
>> following James suggestion I'm making a separate thread just for this as
>> the
>> optimization is worthwhile independently of David's patch: it looks like
>> we
>> can expect a 2x speedup on a "select a single ordered column" case.
>>
>> The patch aimed to be as simple as possible: we only turn this
>> optimization on
>> when the tuple being sorted has only one attribute, it is "byval" (so as
>> not
>> to incur copies which would be hard to track in the execution tree) and
>> unbound (again, not having to deal with copying borrowed datum anywhere).
>>
>> The attached patch is originally by me, with some cleanup by Ranier
>> Vilela.
>> I'm sending Ranier's version here.
>>
> Nice Ronan.
> But I think there is some confusion here.
> The author is not you?
>
> Just to clarify, at Commitfest, it was supposed to be the other way around.
> You as an author and David as a reviewer.
> I'll put myself as a reviewer too.
>
Sorry David, my mistake.
I confused the numbers (id) of Commitfest.

regards,
Ranier Vilela


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-06 Thread Ranier Vilela
Em ter., 6 de jul. de 2021 às 03:15, Ronan Dunklau 
escreveu:

> Hello,
>
> While testing the patch "Add proper planner support for ORDER BY /
> DISTINCT
> aggregates" [0] I discovered the performance penalty from adding a sort
> node
> essentially came from not using the single-datum tuplesort optimization in
> ExecSort (contrary to the sorting done in ExecAgg).
>
> I originally proposed this patch as a companion in the same thread [1],
> but
> following James suggestion I'm making a separate thread just for this as
> the
> optimization is worthwhile independently of David's patch: it looks like
> we
> can expect a 2x speedup on a "select a single ordered column" case.
>
> The patch aimed to be as simple as possible: we only turn this
> optimization on
> when the tuple being sorted has only one attribute, it is "byval" (so as
> not
> to incur copies which would be hard to track in the execution tree) and
> unbound (again, not having to deal with copying borrowed datum anywhere).
>
> The attached patch is originally by me, with some cleanup by Ranier
> Vilela.
> I'm sending Ranier's version here.
>
Nice Ronan.
But I think there is some confusion here.
The author is not you?

Just to clarify, at Commitfest, it was supposed to be the other way around.
You as an author and David as a reviewer.
I'll put myself as a reviewer too.

regards,
Ranier Vilela


[PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-06 Thread Ronan Dunklau
Hello,

While testing the patch "Add proper planner support for ORDER BY / DISTINCT 
aggregates" [0] I discovered the performance penalty from adding a sort node 
essentially came from not using the single-datum tuplesort optimization in 
ExecSort (contrary to the sorting done in ExecAgg).

I originally proposed this patch as a companion in the same thread [1], but 
following James suggestion I'm making a separate thread just for this as the 
optimization is worthwhile independently of David's patch: it looks like we 
can expect a 2x speedup on a "select a single ordered column" case.

The patch aimed to be as simple as possible: we only turn this optimization on 
when the tuple being sorted has only one attribute, it is "byval" (so as not 
to incur copies which would be hard to track in the execution tree) and 
unbound (again, not having to deal with copying borrowed datum anywhere).

The attached patch is originally by me, with some cleanup by Ranier Vilela. 
I'm sending Ranier's version here.


[0]: https://commitfest.postgresql.org/33/3164/
[1]: https://www.postgresql.org/message-id/4480689.ObhdGn8bVM%40aivenronan

-- 
Ronan Dunklaudiff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..363f1d90f0 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -85,16 +85,28 @@ ExecSort(PlanState *pstate)
 
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
-
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-			  plannode->numCols,
-			  plannode->sortColIdx,
-			  plannode->sortOperators,
-			  plannode->collations,
-			  plannode->nullsFirst,
-			  work_mem,
-			  NULL,
-			  node->randomAccess);
+		if (unlikely(tupDesc->natts == 1 &&
+			!node->bounded &&
+			TupleDescAttr(tupDesc, 0)->attbyval))
+		{
+			node->is_single_val = true;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+   plannode->sortOperators[0],
+   plannode->collations[0],
+   plannode->nullsFirst[0],
+   work_mem,
+   NULL,
+   node->randomAccess);
+		} else
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
+  plannode->numCols,
+  plannode->sortColIdx,
+  plannode->sortOperators,
+  plannode->collations,
+  plannode->nullsFirst,
+  work_mem,
+  NULL,
+  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
@@ -103,15 +115,27 @@ ExecSort(PlanState *pstate)
 		 * Scan the subplan and feed all the tuples to tuplesort.
 		 */
 
-		for (;;)
-		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
-		}
+		if (!node->is_single_val)
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+tuplesort_puttupleslot(tuplesortstate, slot);
+			}
+		else
+			for (;;)
+			{
+slot = ExecProcNode(outerNode);
+
+if (TupIsNull(slot))
+	break;
+slot_getsomeattrs(slot, 1);
+tuplesort_putdatum(tuplesortstate,
+   slot->tts_values[0],
+   slot->tts_isnull[0]);
+			}
 
 		/*
 		 * Complete the sort.
@@ -150,9 +174,18 @@ ExecSort(PlanState *pstate)
 	 * next fetch from the tuplesort.
 	 */
 	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
-  ScanDirectionIsForward(dir),
-  false, slot, NULL);
+	if (node->is_single_val)
+	{
+		ExecClearTuple(slot);
+		if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+		   &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+			ExecStoreVirtualTuple(slot);
+	}
+	else
+		(void) tuplesort_gettupleslot(tuplesortstate,
+	  ScanDirectionIsForward(dir),
+	  false, slot, NULL);
+
 	return slot;
 }
 
@@ -191,6 +224,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate->bounded = false;
 	sortstate->sort_Done = false;
 	sortstate->tuplesortstate = NULL;
+	sortstate->is_single_val = false;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..643f416c54 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
 	int64		bound_Done;		/* value of bound we did the sort with */
 	void	   *tuplesortstate; /* private state of tuplesort.c */
 	bool		am_worker;		/* are we a worker? */
+	bool		is_single_val;  /* are we using the single value optimization ? */
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;