Re: Use virtual tuple slot for Unique node

2024-03-28 Thread Andrey M. Borodin



> On 29 Oct 2023, at 21:30, Denis Smirnov  wrote:
> 
> I have taken a look at this discussion, at the code and I am confused how we 
> choose tuple table slot (TTS) type in PG. 

After an offline discussion with Denis, we decided to withdraw this patch from 
the CF for now. If anyone is willing to revive work on this, please register a 
new entry in the next commitfest.
Thanks!


Best regards, Andrey Borodin.



Re: Use virtual tuple slot for Unique node

2023-10-29 Thread Denis Smirnov
I have taken a look at this discussion and at the code, and I am confused about 
how we choose the tuple table slot (TTS) type in PG. Maybe you can clarify this 
topic for me. 

1. Brief intro. There are four types of TTS. Plan tree «leaves»:
- buffer heap (produced by index and table scans, has system columns and keeps 
shared buffer pins)
- heap (produced by FDW: has system columns, but doesn’t keep any pins)
- minimal (produced by values and materialization nodes like sort, agg, etc.)
Plan «branches»:
- virtual (keeps datum references to the columns of the tuples in the child 
nodes)

Virtual TTS is cheaper to copy around the plan (as we copy datum references), 
but more expensive to materialize (we have to construct a tuple from pieces).

Leaves are cheaper to materialize (as we make a memcpy under the hood), but very 
expensive to copy (we copy the values, not the datum references).
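This trade-off can be sketched with a toy model (a hypothetical, heavily simplified structure; PostgreSQL's real TupleTableSlot in tuptable.h carries far more state): a "virtual" copy only duplicates datum references, while materialization must flatten by-ref data into slot-owned memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Datum is pointer-sized; by-ref attributes hold a pointer into
 * someone else's memory until the slot is materialized. */
typedef uintptr_t Datum;

typedef struct MiniSlot
{
	int		natts;
	Datum  *values;		/* per-attribute datums */
	char   *data;		/* slot-owned flat copy after materialize, else NULL */
} MiniSlot;

/* "Virtual" copy: duplicate the datum references only (cheap). */
static void
copy_virtual(MiniSlot *dst, const MiniSlot *src)
{
	dst->natts = src->natts;
	memcpy(dst->values, src->values, sizeof(Datum) * src->natts);
}

/* Materialize the single by-ref attribute of this toy slot: flatten it
 * into memory the slot owns, so it no longer depends on external
 * storage.  The real code additionally needs a sizing pass and a copy
 * per by-ref attribute, which is what makes it expensive. */
static void
materialize(MiniSlot *slot, size_t byref_len)
{
	slot->data = malloc(byref_len);
	memcpy(slot->data, (char *) slot->values[0], byref_len);
	slot->values[0] = (Datum) slot->data;
}
```

After copy_virtual() both slots reference the same underlying bytes; only after materialize() does the destination stop depending on the source's memory lifetime.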

2. If we take a look at the materialization nodes in the plan, they produce 
different result TTS types.
- Outer child's TTS type: gather, gather merge, lock rows, limit;
- Minimal: material, sort, incremental sort, memoize, unique, hash, setop (can 
be heap as well);
- Virtual: group, agg, window agg.

From my point of view, a materialization node should preserve the incoming 
TTS type. For the sort node (which materializes incoming tuples as minimal) it 
is fine to output a minimal result as well. It looks like unique should use the 
outer child's TTS type (instead of the hardcoded minimal). But can anyone 
explain to me why group, agg and window agg return virtual slots instead of the 
same TTS type as the outer child? Do we expect that the parent node exists and 
requires exactly virtual tuples (but what if the parent node is a sort and 
benefits from a minimal TTS)? So it looks like we need to look not only at 
unique, but also inspect all the materialization nodes.






Re: Use virtual tuple slot for Unique node

2023-10-29 Thread David Rowley
On Fri, 27 Oct 2023 at 22:05, Ashutosh Bapat
 wrote:
>
> On Fri, Oct 27, 2023 at 8:48 AM David Rowley  wrote:
> > I was uncertain if the old behaviour of when srcslot contains fewer
> > attributes than dstslot was intended or not.  What happens there is
> > that we'd leave the additional old dstslot tts_values in place and
> > only overwrite up to srcslot->natts but then we'd go on and
> > materialize all the dstslot attributes. I think this might not be
> > needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
> > be ok just to materialize the srcslot attributes and ignore the
> > previous additional dstslot attributes.  Changing it to that would
> > make the attached patch more simple.
>
> We seem to use both tts_nvalid and tts_tupleDescriptor->natts. I forgot
> what's the difference. If we do what you say, we might end up trying
> to access unmaterialized values beyond tts_nvalid. Better to
> investigate whether such a hazard exists.

The TupleDesc's natts is the number of attributes in the tuple
descriptor.  tts_nvalid is the greatest attribute number that's been
deformed in the tuple slot.  For slot types other than virtual slots,
we'll call slot_getsomeattrs() to deform more attributes from the
tuple.

The reason the code in question looks suspicious to me is that we do
"dstslot->tts_nvalid = srcdesc->natts;" and there's no way to deform
more attributes in a virtual slot. Note that
tts_virtual_getsomeattrs() unconditionally does elog(ERROR,
"getsomeattrs is not required to be called on a virtual tuple table
slot");.  We shouldn't ever be accessing tts_values elements above
what tts_nvalid is set to, so either we should be setting
dstslot->tts_nvalid = to the dstdesc->natts so that we can access
tts_values elements above srcdesc->natts or we're needlessly
materialising too many attributes in tts_virtual_copyslot().
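The invariant being argued here can be made concrete with a toy model (hypothetical names, not the real executor API): tts_values entries at or above tts_nvalid must never be read, and a virtual slot has no backing tuple from which further attributes could ever be deformed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for a virtual slot: natts comes from the tuple
 * descriptor, tts_nvalid counts the attributes whose tts_values entries
 * are valid.  For non-virtual slot types the executor can deform more
 * attributes on demand; for a virtual slot there is nothing to deform,
 * which is why tts_virtual_getsomeattrs() unconditionally errors out. */
typedef struct MiniVirtualSlot
{
	int		natts;		/* from the tuple descriptor */
	int		tts_nvalid;	/* attributes with valid tts_values entries */
	long   *tts_values;
} MiniVirtualSlot;

static long
slot_getattr_checked(MiniVirtualSlot *slot, int attnum)
{
	if (attnum >= slot->tts_nvalid)
	{
		/* mirrors the real "getsomeattrs is not required to be called
		 * on a virtual tuple table slot" error path */
		fprintf(stderr, "cannot deform attribute %d of a virtual slot\n",
				attnum);
		abort();
	}
	return slot->tts_values[attnum];
}
```

Under this model, setting tts_nvalid to the destination descriptor's natts while only filling srcdesc->natts entries would let reads reach garbage, which is the suspicious combination David describes.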

David







Re: Use virtual tuple slot for Unique node

2023-10-27 Thread Ashutosh Bapat
On Fri, Oct 27, 2023 at 8:48 AM David Rowley  wrote:
>
> On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat
>  wrote:
> > We may save the size of data in VirtualTupleTableSlot, thus avoiding
> > the first loop. I assume that when allocating
> > VirtualTupleTableSlot->data, we always know what size we are
> > allocating so it should be just a matter of saving it in
> > VirtualTupleTableSlot->size. This should avoid the first loop in
> > tts_virtual_materialize() and give some speed up. We will need a loop
> > to repoint non-byval Datums. I imagine that the pointers to non-byval
> > Datums can be computed as dest_slot->tts_values[natts] =
> > dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> > This would work as long as all the non-byval datums in the source slot
> > are all stored flattened in source slot's data. I am assuming that
> > that would be true in a materialized virtual slot. The byval datums
> > are copied as is. I think, this will avoid multiple memcpy calls, one
> > per non-byval attribute and hence some speedup. I may be wrong though.
>
> hmm, do you think it's common enough that we copy an already
> materialised virtual slot?
>
> I tried adding the following code to tts_virtual_copyslot and didn't
> see the NOTICE message when running each of your test queries. "make
> check" also worked without anything failing after adjusting
> nodeUnique.c to always use a TTSOpsVirtual slot.
>
> +   if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
> +   elog(NOTICE, "We copied a materialized virtual slot!");
>
> I did get a failure in postgres_fdw's tests:
>
>server loopback options (table_name 'tab_batch_sharded_p1_remote');
>  insert into tab_batch_sharded select * from tab_batch_local;
> +NOTICE:  We copied a materialized virtual slot!
> +NOTICE:  We copied a materialized virtual slot!
>
> so I think it's probably not that common that we'd gain from that 
> optimisation.

Thanks for this analysis. If we aren't copying a materialized virtual
slot often, no point in adding that optimization.

>
> What might buy us a bit more would be to get rid of the for loop
> inside tts_virtual_copyslot() and copy the guts of
> tts_virtual_materialize() into tts_virtual_copyslot() and set the
> dstslot tts_isnull and tts_values arrays in the same loop that we use
> to calculate the size.
>
> I tried that in the attached patch and then tested it alongside the
> patch that changes the slot type.
>
> master = 74604a37f
> 1 = [1]
> 2 = optimize_tts_virtual_copyslot.patch
>
> Using the script from [2] and the setup from [3] but reduced to 10k
> tuples instead of 1 million.
>
> Times are the average query time in milliseconds for a 60 second pgbench run.
>
> query   master    master+1   master+1+2   m/m+1     m/m+1+2
> Q1        2.616      2.082       1.903   125.65%   137.47%
> Q2        9.479     10.593      10.361    89.48%    91.49%
> Q3       10.329     10.357      10.627    99.73%    97.20%
> Q4        7.272      7.306       6.941    99.53%   104.77%
> Q5        7.597      7.043       6.645   107.87%   114.33%
> Q6      162.177    161.037     162.807   100.71%    99.61%
> Q7       59.288     59.430      61.294    99.76%    96.73%
>
> avg                                      103.25%   105.94%
>
> I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but
> did, so the results might not be stable enough to warrant making any
> decisions from.

I am actually surprised to see that eliminating loop is showing
improvements. There aren't hundreds of attributes involved in those
queries. So I share your stability concerns. And even with these
patches, Q2 and Q3 are still slower.

Q1 is consistently giving a performance improvement of more than 25% for both
of us. But the other queries aren't showing a whole lot of improvement. So I am
wondering whether it's worth pursuing this change; similar to the opinion you
expressed a few emails earlier.

>
> I was uncertain if the old behaviour of when srcslot contains fewer
> attributes than dstslot was intended or not.  What happens there is
> that we'd leave the additional old dstslot tts_values in place and
> only overwrite up to srcslot->natts but then we'd go on and
> materialize all the dstslot attributes. I think this might not be
> needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
> be ok just to materialize the srcslot attributes and ignore the
> previous additional dstslot attributes.  Changing it to that would
> make the attached patch more simple.

We seem to use both tts_nvalid and tts_tupleDescriptor->natts. I forgot
what's the difference. If we do what you say, we might end up trying
to access unmaterialized values beyond tts_nvalid. Better to
investigate whether such a hazard exists.

-- 
Best Wishes,
Ashutosh Bapat




Re: Use virtual tuple slot for Unique node

2023-10-26 Thread David Rowley
On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat
 wrote:
> We may save the size of data in VirtualTupleTableSlot, thus avoiding
> the first loop. I assume that when allocating
> VirtualTupleTableSlot->data, we always know what size we are
> allocating so it should be just a matter of saving it in
> VirtualTupleTableSlot->size. This should avoid the first loop in
> tts_virtual_materialize() and give some speed up. We will need a loop
> to repoint non-byval Datums. I imagine that the pointers to non-byval
> Datums can be computed as dest_slot->tts_values[natts] =
> dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> This would work as long as all the non-byval datums in the source slot
> are all stored flattened in source slot's data. I am assuming that
> that would be true in a materialized virtual slot. The byval datums
> are copied as is. I think, this will avoid multiple memcpy calls, one
> per non-byval attribute and hence some speedup. I may be wrong though.

hmm, do you think it's common enough that we copy an already
materialised virtual slot?

I tried adding the following code to tts_virtual_copyslot and didn't
see the NOTICE message when running each of your test queries. "make
check" also worked without anything failing after adjusting
nodeUnique.c to always use a TTSOpsVirtual slot.

+   if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
+   elog(NOTICE, "We copied a materialized virtual slot!");

I did get a failure in postgres_fdw's tests:

   server loopback options (table_name 'tab_batch_sharded_p1_remote');
 insert into tab_batch_sharded select * from tab_batch_local;
+NOTICE:  We copied a materialized virtual slot!
+NOTICE:  We copied a materialized virtual slot!

so I think it's probably not that common that we'd gain from that optimisation.

What might buy us a bit more would be to get rid of the for loop
inside tts_virtual_copyslot() and copy the guts of
tts_virtual_materialize() into tts_virtual_copyslot() and set the
dstslot tts_isnull and tts_values arrays in the same loop that we use
to calculate the size.

I tried that in the attached patch and then tested it alongside the
patch that changes the slot type.

master = 74604a37f
1 = [1]
2 = optimize_tts_virtual_copyslot.patch

Using the script from [2] and the setup from [3] but reduced to 10k
tuples instead of 1 million.

Times are the average query time in milliseconds for a 60 second pgbench run.

query   master    master+1   master+1+2   m/m+1     m/m+1+2
Q1        2.616      2.082       1.903   125.65%   137.47%
Q2        9.479     10.593      10.361    89.48%    91.49%
Q3       10.329     10.357      10.627    99.73%    97.20%
Q4        7.272      7.306       6.941    99.53%   104.77%
Q5        7.597      7.043       6.645   107.87%   114.33%
Q6      162.177    161.037     162.807   100.71%    99.61%
Q7       59.288     59.430      61.294    99.76%    96.73%

avg                                      103.25%   105.94%

I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but
did, so the results might not be stable enough to warrant making any
decisions from.

I was uncertain if the old behaviour of when srcslot contains fewer
attributes than dstslot was intended or not.  What happens there is
that we'd leave the additional old dstslot tts_values in place and
only overwrite up to srcslot->natts but then we'd go on and
materialize all the dstslot attributes. I think this might not be
needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
be ok just to materialize the srcslot attributes and ignore the
previous additional dstslot attributes.  Changing it to that would
make the attached patch more simple.

David

[1] 
https://www.postgresql.org/message-id/attachment/151110/use_subnode_slot_type_for_nodeunique.patch
[2] https://www.postgresql.org/message-id/attachment/151342/uniquebench.sh.txt
[3] 
https://www.postgresql.org/message-id/CAExHW5uhTMdkk26oJg9f2ZVufbi5J4Lquj79MdSO%2BipnGJ_muw%40mail.gmail.com
diff --git a/src/backend/executor/execTuples.c 
b/src/backend/executor/execTuples.c
index 2c2712ceac..4f8c6c24c0 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -251,7 +251,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 static void
 tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 {
-   TupleDesc   srcdesc = srcslot->tts_tupleDescriptor;
+   TupleDesc srcdesc = srcslot->tts_tupleDescriptor;
+   TupleDesc dstdesc = dstslot->tts_tupleDescriptor;
+   Sizesz = 0;
+   char   *data;
 
Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
 
@@ -259,17 +262,105 @@ tts_virtual_copyslot(TupleTableSlot *dstslot, 
TupleTableSlot *srcslot)
 
slot_getallattrs(srcslot);
 
-   for (int natt = 0; natt < srcdesc->natts; natt++)
+   /* make sure 

Re: Use virtual tuple slot for Unique node

2023-10-25 Thread Ashutosh Bapat
On Tue, Oct 24, 2023 at 4:30 AM David Rowley  wrote:
>
> On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat
>  wrote:
> > I ran my experiments again. It seems on my machine the execution times
> > do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> > average of execution times. I did this three times. For each run the
> > standard deviation was within 2%. Here are the numbers
> > master: 13548.33, 13878.88, 14572.52
> > master + 0001: 13734.58, 14193.83, 14574.73
> >
> > So for me, I would say, this particular query performs the same with
> > or without patch.
>
> I'm not really sure which of the 7 queries you're referring to here.
> The times you're quoting seem to align best to Q7 from your previous
> results, so I'll assume you mean Q7.
>
> I'm not really concerned with Q7 as both patched and unpatched use
> TTSOpsMinimalTuple.

It's Q7. Yes. I was responding to your statement: " However, Q7 does
seem to be above noise level faster and I'm not sure why.". Anyway, we
can set that aside.

>
> I also think you need to shrink the size of your benchmark down.  With
> 1 million tuples, you're more likely to be also measuring the time it
> takes to get cache lines from memory into the CPU. A smaller scale
> test will make this less likely. Also, you'd be better timing SELECT
> rather than the time it takes to EXPLAIN ANALYZE. They're not the same
> thing. EXPLAIN ANALYZE has additional timing going on and we may end
> up not de-toasting toasted Datums.

I ran experiments with 10K rows and measured timing using \timing in
psql. The measurements are much more flaky than a larger set of rows
and EXPLAIN ANALYZE. But I think your observations are good enough.

>
> > On Thu, Oct 19, 2023 at 4:26 PM David Rowley  wrote:
> > > We can see that Q2 and Q3 become a bit slower.  This makes sense as
> > > tts_virtual_materialize() is quite a bit more complex than
> > > heap_copy_minimal_tuple() which is a simple palloc/memcpy.
> > >
> >
> > If the source slot is a materialized virtual slot,
> > tts_virtual_copyslot() could perform a memcpy of the materialized data
> > itself rather than materialising from datums. That might be more
> > efficient.
>
> I think you're talking about just performing a memcpy() of the
> VirtualTupleTableSlot->data field.  Unfortunately, you'd not be able
> to do just that as you'd also need to repoint the non-byval Datums in
> tts_values at the newly memcpy'd memory. If you skipped that part,
> those would remain pointing to the original memory. If that memory
> goes away, then bad things will happen. I think you'd still need to do
> the 2nd loop in tts_virtual_materialize()

Yes, we will need to repoint non-byval Datums ofc.

>
> > May be we should fix the above said inefficiency in
> > tts_virtual_copyslot()?
>
> I don't have any bright ideas on how to make tts_virtual_materialize()
> itself faster.  If there were some way to remember !attbyval
> attributes for the 2nd loop, that might be good, but creating
> somewhere to store that might result in further overheads.

We may save the size of data in VirtualTupleTableSlot, thus avoiding
the first loop. I assume that when allocating
VirtualTupleTableSlot->data, we always know what size we are
allocating so it should be just a matter of saving it in
VirtualTupleTableSlot->size. This should avoid the first loop in
tts_virtual_materialize() and give some speed up. We will need a loop
to repoint non-byval Datums. I imagine that the pointers to non-byval
Datums can be computed as dest_slot->tts_values[natts] =
dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
This would work as long as all the non-byval datums in the source slot
are all stored flattened in source slot's data. I am assuming that
that would be true in a materialized virtual slot. The byval datums
are copied as is. I think, this will avoid multiple memcpy calls, one
per non-byval attribute and hence some speedup. I may be wrong though.
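The repointing arithmetic described above can be sketched as follows (a hypothetical, self-contained model; the function and parameter names are invented, and it assumes, as the paragraph does, that all by-ref datums of the source slot live contiguously in its data buffer of known size):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t Datum;

/* Copy a materialized "virtual slot": one memcpy of the whole flat data
 * buffer, then per attribute either copy the by-val value as-is or
 * repoint the by-ref datum into the new buffer by its old offset,
 * instead of re-walking attribute lengths to size and copy each one. */
static void
copy_materialized(Datum *dst_values, char **dst_data,
				  const Datum *src_values, const char *src_data,
				  size_t data_size, const bool *attbyval, int natts)
{
	char	   *data = malloc(data_size);

	memcpy(data, src_data, data_size);	/* one memcpy covers all by-ref data */
	for (int i = 0; i < natts; i++)
	{
		if (attbyval[i])
			dst_values[i] = src_values[i];	/* value itself is the datum */
		else
			dst_values[i] = (Datum)
				(data + ((const char *) src_values[i] - src_data));
	}
	*dst_data = data;
}
```

As the thread concludes, whether this is worth doing depends on how often an already-materialized virtual slot is actually copied, which David's instrumentation suggests is rare outside the FDW batch-insert path.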

-- 
Best Wishes,
Ashutosh Bapat




Re: Use virtual tuple slot for Unique node

2023-10-23 Thread David Rowley
On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat
 wrote:
> I ran my experiments again. It seems on my machine the execution times
> do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> average of execution times. I did this three times. For each run the
> standard deviation was within 2%. Here are the numbers
> master: 13548.33, 13878.88, 14572.52
> master + 0001: 13734.58, 14193.83, 14574.73
>
> So for me, I would say, this particular query performs the same with
> or without patch.

I'm not really sure which of the 7 queries you're referring to here.
The times you're quoting seem to align best to Q7 from your previous
results, so I'll assume you mean Q7.

I'm not really concerned with Q7 as both patched and unpatched use
TTSOpsMinimalTuple.

I also think you need to shrink the size of your benchmark down.  With
1 million tuples, you're more likely to be also measuring the time it
takes to get cache lines from memory into the CPU. A smaller scale
test will make this less likely. Also, you'd be better timing SELECT
rather than the time it takes to EXPLAIN ANALYZE. They're not the same
thing. EXPLAIN ANALYZE has additional timing going on and we may end
up not de-toasting toasted Datums.

> On Thu, Oct 19, 2023 at 4:26 PM David Rowley  wrote:
> > We can see that Q2 and Q3 become a bit slower.  This makes sense as
> > tts_virtual_materialize() is quite a bit more complex than
> > heap_copy_minimal_tuple() which is a simple palloc/memcpy.
> >
>
> If the source slot is a materialized virtual slot,
> tts_virtual_copyslot() could perform a memcpy of the materialized data
> itself rather than materialising from datums. That might be more
> efficient.

I think you're talking about just performing a memcpy() of the
VirtualTupleTableSlot->data field.  Unfortunately, you'd not be able
to do just that as you'd also need to repoint the non-byval Datums in
tts_values at the newly memcpy'd memory. If you skipped that part,
those would remain pointing to the original memory. If that memory
goes away, then bad things will happen. I think you'd still need to do
the 2nd loop in tts_virtual_materialize()

> > We'd likely see Q2 and Q3 do better with the patched version if there
> > were more duplicates as there'd be less tuple deforming going on
> > because of the virtual slots.
> >
> > Overall, the patched version is 5.55% faster than master.  However,
> > it's pretty hard to say if we should do this or not. Q3 has a mix of
> > varlena and byval types and that came out slower with the patched
> > version.
>
> Theoretically, using the same slot type is supposed to be faster. We
> use the same slot type for input and output in other places as well.

Which theory?

> May be we should fix the above said inefficiency in
> tts_virtual_copyslot()?

I don't have any bright ideas on how to make tts_virtual_materialize()
itself faster.  If there were some way to remember !attbyval
attributes for the 2nd loop, that might be good, but creating
somewhere to store that might result in further overheads.

tts_virtual_copyslot() perhaps could be sped up a little by doing a
memcpy of the values/isnull arrays when the src and dst descriptors
have the same number of attributes. aka, something like:

if (srcdesc->natts == dstslot->tts_tupleDescriptor->natts)
    memcpy(dstslot->tts_values, srcslot->tts_values,
           MAXALIGN(srcdesc->natts * sizeof(Datum)) +
           MAXALIGN(srcdesc->natts * sizeof(bool)));
else
{
    for (int natt = 0; natt < srcdesc->natts; natt++)
    {
        dstslot->tts_values[natt] = srcslot->tts_values[natt];
        dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
    }
}

I imagine we'd only start to notice gains by doing that for larger
natts values.  Each of the 7 queries here, I imagine, wouldn't have
enough columns for it to make much of a difference.

That seems valid enough to do based on how MakeTupleTableSlot()
allocates those arrays. ExecSetSlotDescriptor() does not seem on board
with the single allocation method, however. (Those pfree's in
ExecSetSlotDescriptor() do look a bit fragile.
https://coverage.postgresql.org/src/backend/executor/execTuples.c.gcov.html
says they're never called)

David




Re: Use virtual tuple slot for Unique node

2023-10-20 Thread Ashutosh Bapat
On Thu, Oct 19, 2023 at 4:26 PM David Rowley  wrote:
>
> On Thu, 19 Oct 2023 at 22:29, David Rowley  wrote:
> > It's hard to imagine why there would be a slowdown as this query uses
> > a TTSOpsMinimalTuple slot type in the patch and the unpatched version.
>
> I shrunk down your table sizes to 10k rows instead of 1 million rows
> to reduce the CPU cache pressure on the queries.
>
> I ran pgbench for 1 minute on each query and did pg_prewarm on each
> table. Here are the times I got in milliseconds:
>
> Query    master   Master + 0001   compare
> Q1        2.576       1.979      130.17%
> Q2        9.546       9.941       96.03%
> Q3        9.069       9.536       95.10%
> Q4        7.285       7.208      101.07%
> Q5        7.585       6.904      109.86%
> Q6      162.253     161.434     100.51%
> Q7       62.507      58.922     106.08%
>
> I also noted down the slot type that nodeUnique.c is using in each of
> the queries:
>
> Q1 TTSOpsVirtual
> Q2 TTSOpsVirtual
> Q3 TTSOpsVirtual
> Q4 TTSOpsMinimalTuple
> Q5 TTSOpsVirtual
> Q6 TTSOpsMinimalTuple
> Q7 TTSOpsMinimalTuple
>
> So, I'm not really expecting Q4, Q6 or Q7 to change much. However, Q7
> does seem to be above noise level faster and I'm not sure why.

I ran my experiments again. It seems on my machine the execution times
do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
average of execution times. I did this three times. For each run the
standard deviation was within 2%. Here are the numbers
master: 13548.33, 13878.88, 14572.52
master + 0001: 13734.58, 14193.83, 14574.73

So for me, I would say, this particular query performs the same with
or without patch.

>
> We can see that Q2 and Q3 become a bit slower.  This makes sense as
> tts_virtual_materialize() is quite a bit more complex than
> heap_copy_minimal_tuple() which is a simple palloc/memcpy.
>

If the source slot is a materialized virtual slot,
tts_virtual_copyslot() could perform a memcpy of the materialized data
itself rather than materialising from datums. That might be more
efficient.

> We'd likely see Q2 and Q3 do better with the patched version if there
> were more duplicates as there'd be less tuple deforming going on
> because of the virtual slots.
>
> Overall, the patched version is 5.55% faster than master.  However,
> it's pretty hard to say if we should do this or not. Q3 has a mix of
> varlena and byval types and that came out slower with the patched
> version.

Theoretically, using the same slot type is supposed to be faster. We
use the same slot type for input and output in other places as well.
Maybe we should fix the above said inefficiency in
tts_virtual_copyslot()?

-- 
Best Wishes,
Ashutosh Bapat




Re: Use virtual tuple slot for Unique node

2023-10-19 Thread David Rowley
On Thu, 19 Oct 2023 at 22:29, David Rowley  wrote:
> It's hard to imagine why there would be a slowdown as this query uses
> a TTSOpsMinimalTuple slot type in the patch and the unpatched version.

I shrunk down your table sizes to 10k rows instead of 1 million rows
to reduce the CPU cache pressure on the queries.

I ran pgbench for 1 minute on each query and did pg_prewarm on each
table. Here are the times I got in milliseconds:

Query    master   Master + 0001   compare
Q1        2.576       1.979      130.17%
Q2        9.546       9.941       96.03%
Q3        9.069       9.536       95.10%
Q4        7.285       7.208      101.07%
Q5        7.585       6.904      109.86%
Q6      162.253     161.434     100.51%
Q7       62.507      58.922     106.08%

I also noted down the slot type that nodeUnique.c is using in each of
the queries:

Q1 TTSOpsVirtual
Q2 TTSOpsVirtual
Q3 TTSOpsVirtual
Q4 TTSOpsMinimalTuple
Q5 TTSOpsVirtual
Q6 TTSOpsMinimalTuple
Q7 TTSOpsMinimalTuple

So, I'm not really expecting Q4, Q6 or Q7 to change much. However, Q7
does seem to be above noise level faster and I'm not sure why.

We can see that Q2 and Q3 become a bit slower.  This makes sense as
tts_virtual_materialize() is quite a bit more complex than
heap_copy_minimal_tuple() which is a simple palloc/memcpy.

We'd likely see Q2 and Q3 do better with the patched version if there
were more duplicates as there'd be less tuple deforming going on
because of the virtual slots.

Overall, the patched version is 5.55% faster than master.  However,
it's pretty hard to say if we should do this or not. Q3 has a mix of
varlena and byval types and that came out slower with the patched
version.

I've attached the script I used to get the results and the setup,
which is just your tables shrunk down to 10k rows.

David
#!/bin/bash

psql -c "select pg_prewarm('t_int'), pg_prewarm('t_text'), pg_prewarm('t_mixed');" postgres

for sql in "select distinct a,b from t_int;" \
           "select distinct a,b from t_text;" \
           "select distinct a,b from t_mixed;" \
           "select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int) q;" \
           "select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int order by a, b) q;" \
           "select distinct a,b from (select string_agg(a, ', ') over (order by a rows 2 preceding) a, b from t_text) q;" \
           "select distinct a,b from (select string_agg(left(a, 100), ', ') over (order by a rows 2 preceding) a, b from t_text) q;"
do
echo "set enable_hashagg=0;" > bench.sql
echo "set work_mem = '10GB';" >> bench.sql
echo "$sql" >> bench.sql
pgbench -n -f bench.sql -M prepared -T 60 postgres | grep latency
done




setup.sql
Description: Binary data


Re: Use virtual tuple slot for Unique node

2023-10-19 Thread David Rowley
On Thu, 12 Oct 2023 at 23:06, Ashutosh Bapat
 wrote:
> Q7 select distinct a,b from (select string_agg(left(a, 100), ', ')
> over (order by a rows 2 preceding) a, b from t_text) q
> HEAD: 16070.62 ms
> patched: 16182.16 ms

Did you time the SELECT or EXPLAIN ANALYZE?

With SELECT, I'm unable to recreate this slowdown. Using your setup:

$ cat bench.sql
set enable_hashagg=0;
set work_mem='10GB';
select distinct a,b from (select string_agg(left(a, 100), ', ') over
(order by a rows 2 preceding) a, b from t_text) q;

Master @ 13d00729d
$ pgbench -n -f bench.sql -T 300 postgres | grep latency
latency average = 7739.250 ms

Master + use_subnode_slot_type_for_nodeunique.patch
$ pgbench -n -f bench.sql -T 300 postgres | grep latency
latency average = 7718.007 ms

It's hard to imagine why there would be a slowdown as this query uses
a TTSOpsMinimalTuple slot type in the patch and the unpatched version.

David




Re: Use virtual tuple slot for Unique node

2023-10-12 Thread Ashutosh Bapat
On Tue, Oct 10, 2023 at 2:23 PM David Rowley  wrote:
>
> On Wed, 27 Sept 2023 at 20:01, David Rowley  wrote:
> >
> > On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas  wrote:
> > > So not a win in this case. Could you peek at the outer slot type, and
> > > use the same kind of slot for the Unique's result? Or some more
> > > complicated logic, like use a virtual slot if all the values are
> > > pass-by-val? I'd also like to keep this simple, though...
> > >
> > > Would this kind of optimization make sense elsewhere?
> >
> > There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin().
> >
> > If I adjust the patch to:
> >
> > -   ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> > +   ExecInitResultTupleSlotTL(&uniquestate->ps,
> > +                             ExecGetResultSlotOps(outerPlanState(uniquestate),
> > +                                                  NULL));
>
> Just to keep this from going cold, here's that in patch form for
> anyone who wants to test.

Thanks.

I don't recollect why we chose a MinimalTupleSlot here; maybe because
we expected the underlying node to always produce a minimal tuple. But
the Unique node copies the tuple returned by the underlying node. This
copy is carried out by the TupleTableSlot-specific copy function
copyslot. Every implementation of this function first converts the
source slot's tuple into the required form and then copies it. Having
both TupleTableSlots (the output slot of the underlying node and the
output slot of the Unique node) be of the same type avoids the
conversion step and just copies the slot. It makes sense that it
performs better. The code looks fine to me.

>
> I spent a bit more time running some more benchmarks and I don't see
> any workload where it slows things down.  I'd be happy if someone else
> had a go at finding a regression.

I built on your experiments and I might have found a minor regression.

Setup
=
drop table if exists t_int;
create table t_int(a int, b int);
insert into t_int select x, x from generate_series(1,1000000)x;
create index on t_int (a,b);
vacuum analyze t_int;

drop table if exists t_text;
create table t_text(a text, b text);
insert into t_text select lpad(x::text, 1000, '0'), x::text from
generate_series(1,1000000)x;
create index on t_text (a,b);
vacuum analyze t_text;

drop table if exists t_mixed; -- this one is new but it doesn't matter much
create table t_mixed(a text, b int);
insert into t_mixed select lpad(x::text, 1000, '0'), x from
generate_series(1,1000000)x;
create index on t_mixed (a,b);
vacuum analyze t_mixed;

Queries and measurements (average execution time from 3 runs - on my
Thinkpad T490)
==
Q1 select distinct a,b from t_int';
HEAD: 544.45 ms
patched: 381.55 ms

Q2 select distinct a,b from t_text
HEAD: 609.90 ms
patched: 513.42 ms

Q3 select distinct a,b from t_mixed
HEAD: 626.80 ms
patched: 468.22 ms

The more pass-by-ref data there is, the more memory is allocated, which
seems to reduce the gain from this patch.
The nodes above use buffer heap or heap TupleTableSlots.
Next, try some different nodes which output minimal or virtual TTS.

set enable_hashagg to off;
Q4 select distinct a,b from (select sum(a) over (order by a rows 2
preceding) a, b from t_int) q
HEAD: 2529.58 ms
patched: 2332.23

Q5 select distinct a,b from (select sum(a) over (order by a rows 2
preceding) a, b from t_int order by a, b) q
HEAD: 2633.69 ms
patched: 2255.99 ms


Q6 select distinct a,b from (select string_agg(a, ', ') over (order by
a rows 2 preceding) a, b from t_text) q
HEAD: 108589.85 ms
patched: 107226.82 ms

Q7 select distinct a,b from (select string_agg(left(a, 100), ', ')
over (order by a rows 2 preceding) a, b from t_text) q
HEAD: 16070.62 ms
patched: 16182.16 ms

This one is surprising though. Maybe the advantage of using the same
tuple table slot is so narrow when large data needs to be copied that
the execution times almost match. The patched and unpatched execution
times differ by the margin of error either way.

-- 
Best Wishes,
Ashutosh Bapat




Re: Use virtual tuple slot for Unique node

2023-10-10 Thread David Rowley
On Wed, 27 Sept 2023 at 20:01, David Rowley  wrote:
>
> On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas  wrote:
> > So not a win in this case. Could you peek at the outer slot type, and
> > use the same kind of slot for the Unique's result? Or some more
> > complicated logic, like use a virtual slot if all the values are
> > pass-by-val? I'd also like to keep this simple, though...
> >
> > Would this kind of optimization make sense elsewhere?
>
> There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin().
>
> If I adjust the patch to:
>
> -   ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> +   ExecInitResultTupleSlotTL(&uniquestate->ps,
> +                             ExecGetResultSlotOps(outerPlanState(uniquestate),
> +                                                  NULL));

Just to keep this from going cold, here's that in patch form for
anyone who wants to test.

I spent a bit more time running some more benchmarks and I don't see
any workload where it slows things down.  I'd be happy if someone else
had a go at finding a regression.

David
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 01f951197c..be585e284b 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -115,6 +115,7 @@ UniqueState *
 ExecInitUnique(Unique *node, EState *estate, int eflags)
 {
 	UniqueState *uniquestate;
+	const TupleTableSlotOps *ops;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -137,11 +138,14 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	 */
 	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags);
 
+	/* initialize result slot and type */
+	ops = ExecGetResultSlotOps(outerPlanState(uniquestate), NULL);
+	ExecInitResultTupleSlotTL(&uniquestate->ps, ops);
+
 	/*
-	 * Initialize result slot and type. Unique nodes do no projections, so
-	 * initialize projection info for this node appropriately.
+	 * Unique nodes do no projections, so initialize projection info for this
+	 * node appropriately.
 	 */
-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
 	uniquestate->ps.ps_ProjInfo = NULL;
 
 	/*


Re: Use virtual tuple slot for Unique node

2023-09-27 Thread David Rowley
On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas  wrote:
> So not a win in this case. Could you peek at the outer slot type, and
> use the same kind of slot for the Unique's result? Or some more
> complicated logic, like use a virtual slot if all the values are
> pass-by-val? I'd also like to keep this simple, though...
>
> Would this kind of optimization make sense elsewhere?

There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin().

If I adjust the patch to:

-   ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+   ExecInitResultTupleSlotTL(&uniquestate->ps,
+                             ExecGetResultSlotOps(outerPlanState(uniquestate),
+                                                  NULL));

Then I get the following performance on my Zen2 machine.

Test 1

drop table if exists t;
create table t(a int, b int);
insert into t select x,x from generate_series(1,100)x;
create index on t (a,b);
vacuum analyze t;

explain (analyze, timing off) select distinct a,b from t;

Master:
Execution Time: 149.669 ms
Execution Time: 149.019 ms
Execution Time: 151.240 ms

Patched:
Execution Time: 96.950 ms
Execution Time: 94.509 ms
Execution Time: 93.498 ms

Test 2

drop table if exists t;
create table t(a text, b text);
insert into t select x::text,x::text from generate_series(1,100)x;
create index on t (a,b);
vacuum analyze t;

explain (analyze, timing off) select distinct a,b from t;

Master:
Execution Time: 185.282 ms
Execution Time: 178.948 ms
Execution Time: 179.217 ms

Patched:
Execution Time: 141.031 ms
Execution Time: 141.136 ms
Execution Time: 142.163 ms

Test 3

set enable_hashagg=off;
explain (analyze, timing off) select distinct g::text, 'a', 'b',
'c','d', 'e','f','g','h' from generate_series(1, 5) g;

Master:
Execution Time: 87.599 ms
Execution Time: 87.721 ms
Execution Time: 87.635 ms

Patched:
Execution Time: 83.449 ms
Execution Time: 84.314 ms
Execution Time: 86.239 ms

David




Re: Use virtual tuple slot for Unique node

2023-09-22 Thread Heikki Linnakangas
I did a little more perf testing with this. I'm seeing the same benefit 
with the query you posted. But can we find a case where it's not 
beneficial? If I understand correctly, when the input slot is a virtual 
slot, it's cheaper to copy it to another virtual slot than to form a 
minimal tuple. Like in your test case. What if the input is a minimal
tuple?


On master:

postgres=# set enable_hashagg=off;
SET
postgres=# explain analyze select distinct g::text, 'a', 'b', 'c','d', 
'e','f','g','h' from generate_series(1, 500) g;


                                 QUERY PLAN
---------------------------------------------------------------------------
 Unique  (cost=2630852.42..2655852.42 rows=200 width=288) (actual time=4525.212..6576.992 rows=500 loops=1)
   ->  Sort  (cost=2630852.42..2643352.42 rows=500 width=288) (actual time=4525.211..5960.967 rows=500 loops=1)
         Sort Key: ((g)::text)
         Sort Method: external merge  Disk: 165296kB
         ->  Function Scan on generate_series g  (cost=0.00..75000.00 rows=500 width=288) (actual time=518.914..1194.702 rows=500 loops=1)
 Planning Time: 0.036 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.242 ms (Deform 0.035 ms), Inlining 63.457 ms, Optimization 29.764 ms, Emission 20.592 ms, Total 114.056 ms
 Execution Time: 6766.399 ms
(11 rows)


With the patch:

postgres=# set enable_hashagg=off;
SET
postgres=# explain analyze select distinct g::text, 'a', 'b', 'c','d', 
'e','f','g','h' from generate_series(1, 500) g;


                                 QUERY PLAN
---------------------------------------------------------------------------
 Unique  (cost=2630852.42..2655852.42 rows=200 width=288) (actual time=4563.639..7362.467 rows=500 loops=1)
   ->  Sort  (cost=2630852.42..2643352.42 rows=500 width=288) (actual time=4563.637..6069.000 rows=500 loops=1)
         Sort Key: ((g)::text)
         Sort Method: external merge  Disk: 165296kB
         ->  Function Scan on generate_series g  (cost=0.00..75000.00 rows=500 width=288) (actual time=528.060..1191.483 rows=500 loops=1)
 Planning Time: 0.720 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.406 ms (Deform 0.065 ms), Inlining 68.385 ms, Optimization 21.656 ms, Emission 21.033 ms, Total 111.480 ms
 Execution Time: 7585.306 ms
(11 rows)


So not a win in this case. Could you peek at the outer slot type, and 
use the same kind of slot for the Unique's result? Or some more 
complicated logic, like use a virtual slot if all the values are 
pass-by-val? I'd also like to keep this simple, though...
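[Editorial note: the "virtual slot only if all the values are pass-by-val" heuristic floated above could be sketched roughly like this. This is a hypothetical standalone toy model, not PostgreSQL's real API — `AttrInfo` and `TupleDescModel` are simplified stand-ins for the catalog attribute metadata, with `attbyval` mirroring the real pass-by-value flag.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for PostgreSQL's attribute metadata. */
typedef struct { bool attbyval; } AttrInfo;
typedef struct { int natts; AttrInfo attrs[8]; } TupleDescModel;

typedef enum { SLOT_VIRTUAL, SLOT_MINIMAL } SlotKind;

/* Use a virtual slot only when every column is pass-by-value. */
static SlotKind choose_slot_kind(const TupleDescModel *desc)
{
    for (int i = 0; i < desc->natts; i++)
        if (!desc->attrs[i].attbyval)
            return SLOT_MINIMAL;    /* at least one by-ref column, e.g. text */
    return SLOT_VIRTUAL;
}
```

Under this sketch, an (int, int) descriptor would pick a virtual slot, while a (text, int) descriptor would fall back to a minimal one; the patch that was eventually tested instead just mirrors the outer node's slot type.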


Would this kind of optimization make sense elsewhere?

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Use virtual tuple slot for Unique node

2023-08-31 Thread Денис Смирнов
Again the new patch hasn't been attached to the thread, so resend it.


v3-use-virtual-slots-for-unique-node.patch
Description: Binary data


Re: Use virtual tuple slot for Unique node

2023-08-31 Thread Denis Smirnov
I have made a small research and found out that though the patch itself is
correct (i.e. we can benefit from replacing TTSOpsMinimalTuple with
TTSOpsVirtual for the Unique node), my explanation WHY was wrong.

1. We always materialize the new unique tuple in the slot, no matter what
   type of tuple table slot we use.
2. But the virtual tuple materialization (tts_virtual_copyslot) has
   performance benefits over the minimal tuple one (tts_minimal_copyslot):
   - tts_minimal_copyslot() always allocates zeroed memory with palloc0
     (expensive according to the flame graph);
   - tts_virtual_copyslot() doesn't allocate additional memory if the tuples
     are constructed only from pass-by-value columns (but for variable-size
     columns we still need a memory allocation);
   - if tts_virtual_copyslot() does need an allocation, it doesn't need to
     zero the memory.

So as a result we seriously benefit from virtual TTS when the tuples are
constructed from fixed-size columns and there is a Unique node in the plan.

commit 148642d81f046b7d72b3a40182c165e42a8ab6d7
Author: Denis Smirnov 
Date:   Thu Aug 31 08:51:14 2023 +0700

Change tuple table slot for Unique node to "virtual"

The Unique node uses minimal TTS implementation to copy the unique
tuples from the sorted stream into the resulting tuple slot. But if
we replace the minimal TTS with the virtual TTS copy method, the
performance improves.

1. Minimal TTS always allocates zeroed memory for the materialized
   tuple.
2. Virtual TTS doesn't allocate additional memory for the tuples
   with the columns passed by value. For the columns with external
   memory we don't need to zero the bytes but can simply take the
   memory chunk from the free list "as is".

diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 45035d74fa..c859add6e0 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -141,7 +141,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	 * Initialize result slot and type. Unique nodes do no projections, so
 	 * initialize projection info for this node appropriately.
 	 */
-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsVirtual);
 	uniquestate->ps.ps_ProjInfo = NULL;
 
 	/*
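[Editorial note: the cost difference described above can be illustrated with a toy model. This is hypothetical standalone C, not the real executor code: `virtual_copyslot` merely duplicates the pointer-sized datum references, while `minimal_copyslot` models the palloc0-plus-materialize path (`calloc` standing in for palloc0, the final memcpy for forming the minimal tuple).]

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t Datum;            /* stand-in for PostgreSQL's Datum */

/* "virtual" copy: only the datum references are duplicated */
static void virtual_copyslot(Datum *dst, const Datum *src, int natts)
{
    memcpy(dst, src, (size_t) natts * sizeof(Datum));
}

/* "minimal" copy: allocate a zeroed buffer (models palloc0) and then
 * materialize every column into it (models forming a minimal tuple) */
static Datum *minimal_copyslot(const Datum *src, int natts)
{
    Datum *tuple = calloc((size_t) natts, sizeof(Datum));  /* zeroing cost */
    if (tuple != NULL)
        memcpy(tuple, src, (size_t) natts * sizeof(Datum));
    return tuple;                    /* caller frees */
}
```

Both paths end up with the same column values; the gap measured in the benchmarks comes from the extra zeroed allocation that the "minimal" path performs for every output tuple.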
> On 31 Aug 2023, at 10:28, Denis Smirnov wrote:
> It looks like my patch was not analyzed by the hackers mailing list due to
> incorrect mime type, so I duplicate it here.
>> On 31 Aug 2023, at 10:06, Denis Smirnov wrote:

Re: Use virtual tuple slot for Unique node

2023-08-30 Thread Denis Smirnov
It looks like my patch was not analyzed by the hackers mailing list due to 
incorrect mime type, so I duplicate it here.
commit 2852a3f2fab8e723f208d81c1ad1eb6a6a377b09
Author: Denis Smirnov 
Date:   Thu Aug 31 08:51:14 2023 +0700

Change tuple table slot for Unique node to "virtual"

The Unique node does a very simple thing in the plan - it processes
the incoming sorted tuple stream and adds the unique tuples to the
resulting tuple table slot. Previously the Unique node palloc'ed
the results with the "minimal" tuples. It is redundant, and now
we simply collect references to the child node tuples with the
"virtual" tuples.

diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 45035d74fa..c859add6e0 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -141,7 +141,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	 * Initialize result slot and type. Unique nodes do no projections, so
 	 * initialize projection info for this node appropriately.
 	 */
-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsVirtual);
 	uniquestate->ps.ps_ProjInfo = NULL;
 
 	/*


> On 31 Aug 2023, at 10:06, Denis Smirnov wrote:
> 
> 



Re: Use virtual tuple slot for Unique node

2023-08-30 Thread David Rowley
On Thu, 31 Aug 2023 at 05:37, Денис Смирнов  wrote:
> I have inspected the performance of the GROUP BY and DISTINCT queries for
> sorted data streams and found out that the Group node (produced by GROUP BY)
> works faster than the Unique node (produced by DISTINCT).  The flame graph
> shows the reason - the Unique node pallocs tuples for the result slot while
> the Group node doesn't.
>
> I wonder why we use minimal tuples for the Unique node instead of virtual
> ones? It looks like there is no actual reason for that, as Unique doesn't
> do any materialization.

It would be good to see example queries and a demonstration of the
performance increase. I'm not disputing your claims, but showing some
performance numbers might catch the eye of a reviewer more quickly.

You should also add this to the September commitfest at
https://commitfest.postgresql.org/44/

David




Use virtual tuple slot for Unique node

2023-08-30 Thread Денис Смирнов
Hi,

I have inspected the performance of the GROUP BY and DISTINCT queries for
sorted data streams and found out that the Group node (produced by GROUP BY)
works faster than the Unique node (produced by DISTINCT).  The flame graph
shows the reason - the Unique node pallocs tuples for the result slot while
the Group node doesn't.

I wonder why we use minimal tuples for the Unique node instead of virtual
ones? It looks like there is no actual reason for that, as Unique doesn't
do any materialization.
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 45035d74fa..c859add6e0 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -141,7 +141,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	 * Initialize result slot and type. Unique nodes do no projections, so
 	 * initialize projection info for this node appropriately.
 	 */
-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsVirtual);
 	uniquestate->ps.ps_ProjInfo = NULL;
 
 	/*