Re: Use virtual tuple slot for Unique node
> On 29 Oct 2023, at 21:30, Denis Smirnov wrote:
>
> I have taken a look at this discussion, at the code and I am confused how we
> choose tuple table slot (TTS) type in PG.

After an offline discussion with Denis, we decided to withdraw this patch from the CF for now. If anyone is willing to revive work on this, please register a new entry in the next commitfest.

Thanks!

Best regards, Andrey Borodin.
Re: Use virtual tuple slot for Unique node
I have taken a look at this discussion and at the code, and I am confused about how we choose the tuple table slot (TTS) type in PG. Maybe you can clarify this topic for me.

1. Brief intro. There are four types of TTS. Plan tree "leaves":
- buffer heap (produced by index and table scans; has system columns and keeps shared buffer pins)
- heap (produced by FDWs; has system columns, but doesn't keep any pins)
- minimal (produced by Values and materialization nodes like Sort, Agg, etc.)
Plan "branches":
- virtual (keeps datum references to the columns of the tuples in the child nodes)

A virtual TTS is cheaper to copy around the plan (as we copy datum references), but more expensive to materialize (we have to construct a tuple from pieces). Leaves are cheaper to materialize (as we make a memcpy under the hood), but very expensive to copy (we copy the value, not the datum reference).

2. If we take a look at the materializing nodes in the plan, they produce different result TTS types.
- Outer child's TTS type: Gather, Gather Merge, Lock Rows, Limit;
- Minimal: Material, Sort, Incremental Sort, Memoize, Unique, Hash, SetOp (can be heap as well);
- Virtual: Group, Agg, Window Agg.

From my point of view, a materialization node should preserve the incoming TTS type. For the Sort node (which materializes incoming tuples as minimal) it is OK to output a minimal result as well. It looks like Unique should use the outer child's TTS type (instead of the hardcoded minimal). But can anyone explain to me why Group, Agg and Window Agg return virtual slots instead of the same TTS type as the outer child? Do we expect that the parent node exists and requires exactly virtual tuples (but what if the parent node is a Sort and benefits from a minimal TTS)? So, it looks like we need to look not only at Unique, but also inspect all the materialization nodes.
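The copy-vs-materialize tradeoff described above can be sketched in miniature. This is not the real executor API (the struct and function names below are invented for illustration), but it shows why copying a virtual slot is just a Datum-array copy, while materializing one must flatten every by-ref value into a private buffer:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t Datum;

/* Invented toy type, loosely modelled on a virtual TupleTableSlot. */
typedef struct ToySlot
{
	int		natts;
	Datum  *values;			/* per-column datums; by-ref ones are pointers */
	char   *data;			/* private buffer once materialized, else NULL */
} ToySlot;

/* Cheap "virtual" copy: duplicate datum references only.
 * The destination now depends on the source's memory staying alive. */
static void
toy_virtual_copy(ToySlot *dst, const ToySlot *src)
{
	dst->natts = src->natts;
	memcpy(dst->values, src->values, sizeof(Datum) * src->natts);
}

/* Expensive materialize (single by-ref column 0 for brevity):
 * flatten the value into dst's own buffer and repoint the datum. */
static void
toy_materialize_byref(ToySlot *dst, const ToySlot *src, size_t len)
{
	dst->natts = src->natts;
	dst->data = malloc(len);
	memcpy(dst->data, (const char *) src->values[0], len);
	dst->values[0] = (Datum) dst->data;
}
```

The virtual copy never touches the column payloads, which is exactly why it is cheap; the materialize step is where the per-value copying cost lives.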
Re: Use virtual tuple slot for Unique node
On Fri, 27 Oct 2023 at 22:05, Ashutosh Bapat wrote:
>
> On Fri, Oct 27, 2023 at 8:48 AM David Rowley wrote:
> > I was uncertain if the old behaviour of when srcslot contains fewer
> > attributes than dstslot was intended or not. What happens there is
> > that we'd leave the additional old dstslot tts_values in place and
> > only overwrite up to srcslot->natts but then we'd go on and
> > materialize all the dstslot attributes. I think this might not be
> > needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
> > be ok just to materialize the srcslot attributes and ignore the
> > previous additional dstslot attributes. Changing it to that would
> > make the attached patch more simple.
>
> We seem to use both tts_nvalid and tts_tupleDescriptor->natts. I forgot
> what's the difference. If we do what you say, we might end up trying
> to access unmaterialized values beyond tts_nvalid. Better to
> investigate whether such a hazard exists.

The TupleDesc's natts is the number of attributes in the tuple descriptor. tts_nvalid is the greatest attribute number that's been deformed in the tuple slot. For slot types other than virtual slots, we'll call slot_getsomeattrs() to deform more attributes from the tuple. The reason the code in question looks suspicious to me is that we do "dstslot->tts_nvalid = srcdesc->natts;" and there's no way to deform more attributes in a virtual slot. Note that tts_virtual_getsomeattrs() unconditionally does elog(ERROR, "getsomeattrs is not required to be called on a virtual tuple table slot");.

We shouldn't ever be accessing tts_values elements above what tts_nvalid is set to, so either we should be setting dstslot->tts_nvalid to dstdesc->natts so that we can access tts_values elements above srcdesc->natts, or we're needlessly materialising too many attributes in tts_virtual_copyslot().

David
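The natts/tts_nvalid distinction explained above can be modelled in a few lines. This is a toy, not the real slot API (the names are invented): the descriptor's attribute count is fixed, while nvalid only grows as attributes are deformed on demand, and a reader must never look past it without deforming first:

```c
#include <assert.h>

#define TOY_NATTS 4

/* Invented toy type modelling the deforming state of a slot. */
typedef struct ToyDeformSlot
{
	int			natts;			/* like tupleDescriptor->natts: columns in the descriptor */
	int			nvalid;			/* like tts_nvalid: columns deformed so far */
	int			values[TOY_NATTS];	/* valid only up to nvalid */
	const int  *tuple;			/* source tuple to deform from */
} ToyDeformSlot;

/* Analogous to slot_getsomeattrs(): deform attributes up to attnum. */
static void
toy_getsomeattrs(ToyDeformSlot *slot, int attnum)
{
	for (int i = slot->nvalid; i < attnum; i++)
		slot->values[i] = slot->tuple[i];
	if (attnum > slot->nvalid)
		slot->nvalid = attnum;
}

/* Fetch column attnum (1-based), deforming on demand first. */
static int
toy_getattr(ToyDeformSlot *slot, int attnum)
{
	if (attnum > slot->nvalid)
		toy_getsomeattrs(slot, attnum);
	return slot->values[attnum - 1];
}
```

In this model a virtual slot would be one whose "deform" step is impossible, so everything up to natts must already be in values and nvalid must cover whatever callers will read, which is the hazard the discussion above is about.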
Re: Use virtual tuple slot for Unique node
On Fri, Oct 27, 2023 at 8:48 AM David Rowley wrote:
>
> On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat wrote:
> > We may save the size of data in VirtualTupleTableSlot, thus avoiding
> > the first loop. I assume that when allocating
> > VirtualTupleTableSlot->data, we always know what size we are
> > allocating so it should be just a matter of saving it in
> > VirtualTupleTableSlot->size. This should avoid the first loop in
> > tts_virtual_materialize() and give some speed up. We will need a loop
> > to repoint non-byval Datums. I imagine that the pointers to non-byval
> > Datums can be computed as dest_slot->tts_values[natts] =
> > dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> > This would work as long as all the non-byval datums in the source slot
> > are all stored flattened in source slot's data. I am assuming that
> > that would be true in a materialized virtual slot. The byval datums
> > are copied as is. I think, this will avoid multiple memcpy calls, one
> > per non-byval attribute and hence some speedup. I may be wrong though.
>
> hmm, do you think it's common enough that we copy an already
> materialised virtual slot?
>
> I tried adding the following code to tts_virtual_copyslot and didn't
> see the NOTICE message when running each of your test queries. "make
> check" also worked without anything failing after adjusting
> nodeUnique.c to always use a TTSOpsVirtual slot.
>
> + if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
> +     elog(NOTICE, "We copied a materialized virtual slot!");
>
> I did get a failure in postgres_fdw's tests:
>
>    server loopback options (table_name 'tab_batch_sharded_p1_remote');
>    insert into tab_batch_sharded select * from tab_batch_local;
>   +NOTICE:  We copied a materialized virtual slot!
>   +NOTICE:  We copied a materialized virtual slot!
>
> so I think it's probably not that common that we'd gain from that
> optimisation.

Thanks for this analysis.
If we aren't copying a materialized virtual slot often, there's no point in adding that optimization.

> What might buy us a bit more would be to get rid of the for loop
> inside tts_virtual_copyslot() and copy the guts of
> tts_virtual_materialize() into tts_virtual_copyslot() and set the
> dstslot tts_isnull and tts_values arrays in the same loop that we use
> to calculate the size.
>
> I tried that in the attached patch and then tested it alongside the
> patch that changes the slot type.
>
> master = 74604a37f
> 1 = [1]
> 2 = optimize_tts_virtual_copyslot.patch
>
> Using the script from [2] and the setup from [3] but reduced to 10k
> tuples instead of 1 million.
>
> Times are the average query time in milliseconds for a 60 second pgbench run.
>
> query  master   master+1  master+1+2  m/m+1    m/m+1+2
> Q1     2.616    2.082     1.903       125.65%  137.47%
> Q2     9.479    10.593    10.361      89.48%   91.49%
> Q3     10.329   10.357    10.627      99.73%   97.20%
> Q4     7.272    7.306     6.941       99.53%   104.77%
> Q5     7.597    7.043     6.645       107.87%  114.33%
> Q6     162.177  161.037   162.807     100.71%  99.61%
> Q7     59.288   59.43     61.294      99.76%   96.73%
>                                      103.25%  105.94%
>
> I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but
> did, so the results might not be stable enough to warrant making any
> decisions from.

I am actually surprised to see that eliminating the loop shows improvements. There aren't hundreds of attributes involved in those queries. So I share your stability concerns. And even with these patches, Q2 and Q3 are still slower. Q1 is consistently giving a performance gain > 25% for both of us, but the other queries aren't showing a whole lot of improvement. So I am wondering whether it's worth pursuing this change; similar to the opinion you expressed a few emails earlier.

> I was uncertain if the old behaviour of when srcslot contains fewer
> attributes than dstslot was intended or not.
> What happens there is
> that we'd leave the additional old dstslot tts_values in place and
> only overwrite up to srcslot->natts but then we'd go on and
> materialize all the dstslot attributes. I think this might not be
> needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
> be ok just to materialize the srcslot attributes and ignore the
> previous additional dstslot attributes. Changing it to that would
> make the attached patch more simple.

We seem to use both tts_nvalid and tts_tupleDescriptor->natts. I forgot what's the difference. If we do what you say, we might end up trying to access unmaterialized values beyond tts_nvalid. Better to investigate whether such a hazard exists.

--
Best Wishes,
Ashutosh Bapat
Re: Use virtual tuple slot for Unique node
On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat wrote:
> We may save the size of data in VirtualTupleTableSlot, thus avoiding
> the first loop. I assume that when allocating
> VirtualTupleTableSlot->data, we always know what size we are
> allocating so it should be just a matter of saving it in
> VirtualTupleTableSlot->size. This should avoid the first loop in
> tts_virtual_materialize() and give some speed up. We will need a loop
> to repoint non-byval Datums. I imagine that the pointers to non-byval
> Datums can be computed as dest_slot->tts_values[natts] =
> dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> This would work as long as all the non-byval datums in the source slot
> are all stored flattened in source slot's data. I am assuming that
> that would be true in a materialized virtual slot. The byval datums
> are copied as is. I think, this will avoid multiple memcpy calls, one
> per non-byval attribute and hence some speedup. I may be wrong though.

hmm, do you think it's common enough that we copy an already materialised virtual slot?

I tried adding the following code to tts_virtual_copyslot and didn't see the NOTICE message when running each of your test queries. "make check" also worked without anything failing after adjusting nodeUnique.c to always use a TTSOpsVirtual slot.

+ if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
+     elog(NOTICE, "We copied a materialized virtual slot!");

I did get a failure in postgres_fdw's tests:

   server loopback options (table_name 'tab_batch_sharded_p1_remote');
   insert into tab_batch_sharded select * from tab_batch_local;
  +NOTICE:  We copied a materialized virtual slot!
  +NOTICE:  We copied a materialized virtual slot!

so I think it's probably not that common that we'd gain from that optimisation.
What might buy us a bit more would be to get rid of the for loop inside tts_virtual_copyslot() and copy the guts of tts_virtual_materialize() into tts_virtual_copyslot() and set the dstslot tts_isnull and tts_values arrays in the same loop that we use to calculate the size.

I tried that in the attached patch and then tested it alongside the patch that changes the slot type.

master = 74604a37f
1 = [1]
2 = optimize_tts_virtual_copyslot.patch

Using the script from [2] and the setup from [3] but reduced to 10k tuples instead of 1 million.

Times are the average query time in milliseconds for a 60 second pgbench run.

query  master   master+1  master+1+2  m/m+1    m/m+1+2
Q1     2.616    2.082     1.903       125.65%  137.47%
Q2     9.479    10.593    10.361      89.48%   91.49%
Q3     10.329   10.357    10.627      99.73%   97.20%
Q4     7.272    7.306     6.941       99.53%   104.77%
Q5     7.597    7.043     6.645       107.87%  114.33%
Q6     162.177  161.037   162.807     100.71%  99.61%
Q7     59.288   59.43     61.294      99.76%   96.73%
                                     103.25%  105.94%

I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but did, so the results might not be stable enough to warrant making any decisions from.

I was uncertain if the old behaviour of when srcslot contains fewer attributes than dstslot was intended or not. What happens there is that we'd leave the additional old dstslot tts_values in place and only overwrite up to srcslot->natts but then we'd go on and materialize all the dstslot attributes. I think this might not be needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may be ok just to materialize the srcslot attributes and ignore the previous additional dstslot attributes. Changing it to that would make the attached patch more simple.
David

[1] https://www.postgresql.org/message-id/attachment/151110/use_subnode_slot_type_for_nodeunique.patch
[2] https://www.postgresql.org/message-id/attachment/151342/uniquebench.sh.txt
[3] https://www.postgresql.org/message-id/CAExHW5uhTMdkk26oJg9f2ZVufbi5J4Lquj79MdSO%2BipnGJ_muw%40mail.gmail.com

diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 2c2712ceac..4f8c6c24c0 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -251,7 +251,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 static void
 tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 {
-	TupleDesc	srcdesc = srcslot->tts_tupleDescriptor;
+	TupleDesc	srcdesc = srcslot->tts_tupleDescriptor;
+	TupleDesc	dstdesc = dstslot->tts_tupleDescriptor;
+	Size		sz = 0;
+	char	   *data;
 
 	Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
 
@@ -259,17 +262,105 @@ tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	slot_getallattrs(srcslot);
 
-	for (int natt = 0; natt < srcdesc->natts; natt++)
+	/* make sure
Re: Use virtual tuple slot for Unique node
On Tue, Oct 24, 2023 at 4:30 AM David Rowley wrote:
>
> On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat wrote:
> > I ran my experiments again. It seems on my machine the execution times
> > do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> > the average of execution times. I did this three times. For each run the
> > standard deviation was within 2%. Here are the numbers
> > master: 13548.33, 13878.88, 14572.52
> > master + 0001: 13734.58, 14193.83, 14574.73
> >
> > So for me, I would say, this particular query performs the same with
> > or without the patch.
>
> I'm not really sure which of the 7 queries you're referring to here.
> The times you're quoting seem to align best to Q7 from your previous
> results, so I'll assume you mean Q7.
>
> I'm not really concerned with Q7 as both patched and unpatched use
> TTSOpsMinimalTuple.

It's Q7, yes. I was responding to your statement: "However, Q7 does seem to be above noise level faster and I'm not sure why." Anyway, we can set that aside.

> I also think you need to shrink the size of your benchmark down. With
> 1 million tuples, you're more likely to be also measuring the time it
> takes to get cache lines from memory into the CPU. A smaller scale
> test will make this less likely. Also, you'd be better timing SELECT
> rather than the time it takes to EXPLAIN ANALYZE. They're not the same
> thing. EXPLAIN ANALYZE has additional timing going on and we may end
> up not de-toasting toasted Datums.

I ran experiments with 10K rows and measured timing using \timing in psql. The measurements are much more flaky than with a larger set of rows and EXPLAIN ANALYZE. But I think your observations are good enough.

> > On Thu, Oct 19, 2023 at 4:26 PM David Rowley wrote:
> > > We can see that Q2 and Q3 become a bit slower. This makes sense as
> > > tts_virtual_materialize() is quite a bit more complex than
> > > heap_copy_minimal_tuple() which is a simple palloc/memcpy.
> > If the source slot is a materialized virtual slot,
> > tts_virtual_copyslot() could perform a memcpy of the materialized data
> > itself rather than materialising from datums. That might be more
> > efficient.
>
> I think you're talking about just performing a memcpy() of the
> VirtualTupleTableSlot->data field. Unfortunately, you'd not be able
> to do just that as you'd also need to repoint the non-byval Datums in
> tts_values at the newly memcpy'd memory. If you skipped that part,
> those would remain pointing to the original memory. If that memory
> goes away, then bad things will happen. I think you'd still need to do
> the 2nd loop in tts_virtual_materialize()

Yes, we will need to repoint non-byval Datums of course.

> > May be we should fix the above said inefficiency in
> > tts_virtual_copyslot()?
>
> I don't have any bright ideas on how to make tts_virtual_materialize()
> itself faster. If there were some way to remember !attbyval
> attributes for the 2nd loop, that might be good, but creating
> somewhere to store that might result in further overheads.

We may save the size of data in VirtualTupleTableSlot, thus avoiding the first loop. I assume that when allocating VirtualTupleTableSlot->data, we always know what size we are allocating, so it should be just a matter of saving it in VirtualTupleTableSlot->size. This should avoid the first loop in tts_virtual_materialize() and give some speed up. We will need a loop to repoint non-byval Datums. I imagine that the pointers to non-byval Datums can be computed as dest_slot->tts_values[natts] = dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data). This would work as long as all the non-byval datums in the source slot are stored flattened in the source slot's data. I am assuming that that would be true in a materialized virtual slot. The byval datums are copied as is. I think this will avoid multiple memcpy calls, one per non-byval attribute, and hence give some speedup. I may be wrong though.
--
Best Wishes,
Ashutosh Bapat
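The copy-then-repoint scheme proposed above can be written out in simplified standalone form. This is only an illustration under the stated assumption that all non-byval datums are flattened contiguously into the slot's data buffer; ToyVSlot and its size field loosely mirror the proposed VirtualTupleTableSlot->size addition, but all names are invented and this is not the executor's real code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t Datum;

/* Invented toy slot: materialized form with all by-ref values in data. */
typedef struct ToyVSlot
{
	int		natts;
	size_t	size;			/* like the proposed VirtualTupleTableSlot->size */
	char   *data;			/* flattened by-ref values */
	Datum	values[8];
	bool	byval[8];
} ToyVSlot;

/* Copy a materialized slot: one memcpy of the whole buffer, then
 * recompute each by-ref datum as dst->data + (src datum - src->data). */
static void
toy_copy_materialized(ToyVSlot *dst, const ToyVSlot *src)
{
	dst->natts = src->natts;
	dst->size = src->size;
	dst->data = malloc(src->size);
	memcpy(dst->data, src->data, src->size);	/* no per-attribute memcpy */
	for (int i = 0; i < src->natts; i++)
	{
		dst->byval[i] = src->byval[i];
		if (src->byval[i])
			dst->values[i] = src->values[i];	/* by-val: copy as is */
		else
			dst->values[i] = (Datum) (dst->data +
									  ((char *) src->values[i] - src->data));
	}
}
```

The repointing loop is still needed, exactly as noted in the thread, but the data itself moves in a single memcpy instead of one copy per non-byval attribute.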
Re: Use virtual tuple slot for Unique node
On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat wrote:
> I ran my experiments again. It seems on my machine the execution times
> do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> the average of execution times. I did this three times. For each run the
> standard deviation was within 2%. Here are the numbers
> master: 13548.33, 13878.88, 14572.52
> master + 0001: 13734.58, 14193.83, 14574.73
>
> So for me, I would say, this particular query performs the same with
> or without the patch.

I'm not really sure which of the 7 queries you're referring to here. The times you're quoting seem to align best to Q7 from your previous results, so I'll assume you mean Q7.

I'm not really concerned with Q7 as both patched and unpatched use TTSOpsMinimalTuple.

I also think you need to shrink the size of your benchmark down. With 1 million tuples, you're more likely to be also measuring the time it takes to get cache lines from memory into the CPU. A smaller scale test will make this less likely. Also, you'd be better timing SELECT rather than the time it takes to EXPLAIN ANALYZE. They're not the same thing. EXPLAIN ANALYZE has additional timing going on and we may end up not de-toasting toasted Datums.

> On Thu, Oct 19, 2023 at 4:26 PM David Rowley wrote:
> > We can see that Q2 and Q3 become a bit slower. This makes sense as
> > tts_virtual_materialize() is quite a bit more complex than
> > heap_copy_minimal_tuple() which is a simple palloc/memcpy.
>
> If the source slot is a materialized virtual slot,
> tts_virtual_copyslot() could perform a memcpy of the materialized data
> itself rather than materialising from datums. That might be more
> efficient.

I think you're talking about just performing a memcpy() of the VirtualTupleTableSlot->data field. Unfortunately, you'd not be able to do just that as you'd also need to repoint the non-byval Datums in tts_values at the newly memcpy'd memory. If you skipped that part, those would remain pointing to the original memory.
If that memory goes away, then bad things will happen. I think you'd still need to do the 2nd loop in tts_virtual_materialize()

> > We'd likely see Q2 and Q3 do better with the patched version if there
> > were more duplicates as there'd be less tuple deforming going on
> > because of the virtual slots.
> >
> > Overall, the patched version is 5.55% faster than master. However,
> > it's pretty hard to say if we should do this or not. Q3 has a mix of
> > varlena and byval types and that came out slower with the patched
> > version.
>
> Theoretically using the same slot type is supposed to be faster. We
> use the same slot types for input and output in other places as well.

Which theory?

> May be we should fix the above said inefficiency in
> tts_virtual_copyslot()?

I don't have any bright ideas on how to make tts_virtual_materialize() itself faster. If there were some way to remember !attbyval attributes for the 2nd loop, that might be good, but creating somewhere to store that might result in further overheads.

tts_virtual_copyslot() perhaps could be sped up a little by doing a memcpy of the values/isnull arrays when the src and dst descriptors have the same number of attributes. aka, something like:

if (srcdesc->natts == dstslot->tts_tupleDescriptor->natts)
    memcpy(dstslot->tts_values, srcslot->tts_values,
           MAXALIGN(srcdesc->natts * sizeof(Datum)) +
           MAXALIGN(srcdesc->natts * sizeof(bool)));
else
{
    for (int natt = 0; natt < srcdesc->natts; natt++)
    {
        dstslot->tts_values[natt] = srcslot->tts_values[natt];
        dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
    }
}

I imagine we'd only start to notice gains by doing that for larger natts values. Each of the 7 queries here, I imagine, wouldn't have enough columns for it to make much of a difference.

That seems valid enough to do based on how MakeTupleTableSlot() allocates those arrays. ExecSetSlotDescriptor() does not seem on board with the single allocation method, however.
(Those pfree's in ExecSetSlotDescriptor() do look a bit fragile. https://coverage.postgresql.org/src/backend/executor/execTuples.c.gcov.html says they're never called.)

David
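The single-memcpy trick above only works because the values and isnull arrays sit in one allocation, values first, each half MAXALIGN'ed, as MakeTupleTableSlot lays them out. A simplified standalone sketch of that layout and the combined copy (invented toy names, 8-byte MAXALIGN assumed; not the real slot code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t Datum;

/* 8-byte alignment, matching MAXALIGN on common 64-bit platforms. */
#define TOY_MAXALIGN(x) (((uintptr_t) (x) + 7) & ~(uintptr_t) 7)

/* Invented toy slot holding just the two arrays in question. */
typedef struct ToyArrSlot
{
	int		natts;
	Datum  *values;
	bool   *isnull;
} ToyArrSlot;

/* Allocate values + isnull as one block, values first, like
 * MakeTupleTableSlot does for a real slot. */
static void
toy_slot_init(ToyArrSlot *slot, int natts)
{
	char   *mem = calloc(1, TOY_MAXALIGN(natts * sizeof(Datum)) +
						 TOY_MAXALIGN(natts * sizeof(bool)));

	slot->natts = natts;
	slot->values = (Datum *) mem;
	slot->isnull = (bool *) (mem + TOY_MAXALIGN(natts * sizeof(Datum)));
}

/* Copy both arrays with a single memcpy when the attribute counts match;
 * otherwise fall back to the per-attribute loop. */
static void
toy_copy_arrays(ToyArrSlot *dst, const ToyArrSlot *src)
{
	if (src->natts == dst->natts)
		memcpy(dst->values, src->values,
			   TOY_MAXALIGN(src->natts * sizeof(Datum)) +
			   TOY_MAXALIGN(src->natts * sizeof(bool)));
	else
		for (int i = 0; i < src->natts; i++)
		{
			dst->values[i] = src->values[i];
			dst->isnull[i] = src->isnull[i];
		}
}
```

The single memcpy starting at values deliberately runs past the end of the Datum array into the isnull array; that is only safe because both slots were allocated with exactly this contiguous layout, which is the point made above about ExecSetSlotDescriptor() not honouring the single-allocation scheme.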
Re: Use virtual tuple slot for Unique node
On Thu, Oct 19, 2023 at 4:26 PM David Rowley wrote:
>
> On Thu, 19 Oct 2023 at 22:29, David Rowley wrote:
> > It's hard to imagine why there would be a slowdown as this query uses
> > a TTSOpsMinimalTuple slot type in the patch and the unpatched version.
>
> I shrunk down your table sizes to 10k rows instead of 1 million rows
> to reduce the CPU cache pressure on the queries.
>
> I ran pgbench for 1 minute on each query and did pg_prewarm on each
> table. Here are the times I got in milliseconds:
>
> Query  master   Master + 0001  compare
> Q1     2.576    1.979          130.17%
> Q2     9.546    9.941          96.03%
> Q3     9.069    9.536          95.10%
> Q4     7.285    7.208          101.07%
> Q5     7.585    6.904          109.86%
> Q6     162.253  161.434        100.51%
> Q7     62.507   58.922         106.08%
>
> I also noted down the slot type that nodeUnique.c is using in each of
> the queries:
>
> Q1 TTSOpsVirtual
> Q2 TTSOpsVirtual
> Q3 TTSOpsVirtual
> Q4 TTSOpsMinimalTuple
> Q5 TTSOpsVirtual
> Q6 TTSOpsMinimalTuple
> Q7 TTSOpsMinimalTuple
>
> So, I'm not really expecting Q4, Q6 or Q7 to change much. However, Q7
> does seem to be above noise level faster and I'm not sure why.

I ran my experiments again. It seems on my machine the execution times do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took the average of execution times. I did this three times. For each run the standard deviation was within 2%. Here are the numbers

master: 13548.33, 13878.88, 14572.52
master + 0001: 13734.58, 14193.83, 14574.73

So for me, I would say, this particular query performs the same with or without the patch.

> We can see that Q2 and Q3 become a bit slower. This makes sense as
> tts_virtual_materialize() is quite a bit more complex than
> heap_copy_minimal_tuple() which is a simple palloc/memcpy.

If the source slot is a materialized virtual slot, tts_virtual_copyslot() could perform a memcpy of the materialized data itself rather than materialising from datums. That might be more efficient.
> We'd likely see Q2 and Q3 do better with the patched version if there
> were more duplicates as there'd be less tuple deforming going on
> because of the virtual slots.
>
> Overall, the patched version is 5.55% faster than master. However,
> it's pretty hard to say if we should do this or not. Q3 has a mix of
> varlena and byval types and that came out slower with the patched
> version.

Theoretically using the same slot type is supposed to be faster. We use the same slot types for input and output in other places as well. May be we should fix the above said inefficiency in tts_virtual_copyslot()?

--
Best Wishes,
Ashutosh Bapat
Re: Use virtual tuple slot for Unique node
On Thu, 19 Oct 2023 at 22:29, David Rowley wrote:
> It's hard to imagine why there would be a slowdown as this query uses
> a TTSOpsMinimalTuple slot type in the patch and the unpatched version.

I shrunk down your table sizes to 10k rows instead of 1 million rows to reduce the CPU cache pressure on the queries.

I ran pgbench for 1 minute on each query and did pg_prewarm on each table. Here are the times I got in milliseconds:

Query  master   Master + 0001  compare
Q1     2.576    1.979          130.17%
Q2     9.546    9.941          96.03%
Q3     9.069    9.536          95.10%
Q4     7.285    7.208          101.07%
Q5     7.585    6.904          109.86%
Q6     162.253  161.434        100.51%
Q7     62.507   58.922         106.08%

I also noted down the slot type that nodeUnique.c is using in each of the queries:

Q1 TTSOpsVirtual
Q2 TTSOpsVirtual
Q3 TTSOpsVirtual
Q4 TTSOpsMinimalTuple
Q5 TTSOpsVirtual
Q6 TTSOpsMinimalTuple
Q7 TTSOpsMinimalTuple

So, I'm not really expecting Q4, Q6 or Q7 to change much. However, Q7 does seem to be above noise level faster and I'm not sure why.

We can see that Q2 and Q3 become a bit slower. This makes sense as tts_virtual_materialize() is quite a bit more complex than heap_copy_minimal_tuple() which is a simple palloc/memcpy.

We'd likely see Q2 and Q3 do better with the patched version if there were more duplicates as there'd be less tuple deforming going on because of the virtual slots.

Overall, the patched version is 5.55% faster than master. However, it's pretty hard to say if we should do this or not. Q3 has a mix of varlena and byval types and that came out slower with the patched version.

I've attached the script I used to get the results and the setup, which is just your tables shrunk down to 10k rows.
David

#!/bin/bash
psql -c "select pg_prewarm('t_int'),pg_prewarm('t_text'),pg_prewarm('t_mixed');" postgres

for sql in "select distinct a,b from t_int;" \
           "select distinct a,b from t_text;" \
           "select distinct a,b from t_mixed;" \
           "select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int) q;" \
           "select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int order by a, b) q;" \
           "select distinct a,b from (select string_agg(a, ', ') over (order by a rows 2 preceding) a, b from t_text) q;" \
           "select distinct a,b from (select string_agg(left(a, 100), ', ') over (order by a rows 2 preceding) a, b from t_text) q;"
do
	echo "set enable_hashagg=0;" > bench.sql
	echo "set work_mem = '10GB';" >> bench.sql
	echo "$sql" >> bench.sql
	pgbench -n -f bench.sql -M prepared -T 60 postgres | grep latency
done

setup.sql
Description: Binary data
Re: Use virtual tuple slot for Unique node
On Thu, 12 Oct 2023 at 23:06, Ashutosh Bapat wrote:
> Q7 select distinct a,b from (select string_agg(left(a, 100), ', ')
> over (order by a rows 2 preceding) a, b from t_text) q
> HEAD: 16070.62 ms
> patched: 16182.16 ms

Did you time the SELECT or EXPLAIN ANALYZE? With SELECT, I'm unable to recreate this slowdown. Using your setup:

$ cat bench.sql
set enable_hashagg=0;
set work_mem='10GB';
select distinct a,b from (select string_agg(left(a, 100), ', ') over (order by a rows 2 preceding) a, b from t_text) q;

Master @ 13d00729d
$ pgbench -n -f bench.sql -T 300 postgres | grep latency
latency average = 7739.250 ms

Master + use_subnode_slot_type_for_nodeunique.patch
$ pgbench -n -f bench.sql -T 300 postgres | grep latency
latency average = 7718.007 ms

It's hard to imagine why there would be a slowdown as this query uses a TTSOpsMinimalTuple slot type in the patch and the unpatched version.

David
Re: Use virtual tuple slot for Unique node
On Tue, Oct 10, 2023 at 2:23 PM David Rowley wrote:
>
> On Wed, 27 Sept 2023 at 20:01, David Rowley wrote:
> >
> > On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas wrote:
> > > So not a win in this case. Could you peek at the outer slot type, and
> > > use the same kind of slot for the Unique's result? Or some more
> > > complicated logic, like use a virtual slot if all the values are
> > > pass-by-val? I'd also like to keep this simple, though...
> > >
> > > Would this kind of optimization make sense elsewhere?
> >
> > There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin().
> >
> > If I adjust the patch to:
> >
> > - ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> > + ExecInitResultTupleSlotTL(&uniquestate->ps,
> > +                           ExecGetResultSlotOps(outerPlanState(uniquestate),
> > +                                                NULL));
>
> Just to keep this from going cold, here's that in patch form for
> anyone who wants to test.

Thanks. I don't recollect why we chose a MinimalTupleSlot here - maybe because we expected the underlying node to always produce a minimal tuple. But the Unique node copies the tuple returned by the underlying node. This copy is carried out by the TupleTableSlot-specific copy function copyslot. Every implementation of this function first converts the source slot's tuple into the required form and then copies it. Having both TupleTableSlots - the output slot of the underlying node and the output slot of the Unique node - be of the same type avoids the first step and just copies the slot. It makes sense that it performs better. The code looks fine to me.

> I spent a bit more time running some more benchmarks and I don't see
> any workload where it slows things down. I'd be happy if someone else
> had a go at finding a regression.

I built on your experiments and I might have found a minor regression.
Setup
=====
drop table if exists t_int;
create table t_int(a int, b int);
insert into t_int select x, x from generate_series(1,100)x;
create index on t_int (a,b);
vacuum analyze t_int;

drop table if exists t_text;
create table t_text(a text, b text);
insert into t_text select lpad(x::text, 1000, '0'), x::text from generate_series(1,100)x;
create index on t_text (a,b);
vacuum analyze t_text;

drop table if exists t_mixed; -- this one is new but it doesn't matter much
create table t_mixed(a text, b int);
insert into t_mixed select lpad(x::text, 1000, '0'), x from generate_series(1,100)x;
create index on t_mixed (a,b);
vacuum analyze t_mixed;

Queries and measurements (average execution time from 3 runs - on my Thinkpad T490)
===================================================================================
Q1 select distinct a,b from t_int;
HEAD: 544.45 ms
patched: 381.55 ms

Q2 select distinct a,b from t_text
HEAD: 609.90 ms
patched: 513.42 ms

Q3 select distinct a,b from t_mixed
HEAD: 626.80 ms
patched: 468.22 ms

The more pass-by-ref data there is, the more memory is allocated, which seems to reduce the gain from this patch.

The above nodes use a buffer or heap TupleTableSlot. Try some different nodes which output a minimal or virtual TTS.

set enable_hashagg to off;

Q4 select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int) q
HEAD: 2529.58 ms
patched: 2332.23 ms

Q5 select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int order by a, b) q
HEAD: 2633.69 ms
patched: 2255.99 ms

Q6 select distinct a,b from (select string_agg(a, ', ') over (order by a rows 2 preceding) a, b from t_text) q
HEAD: 108589.85 ms
patched: 107226.82 ms

Q7 select distinct a,b from (select string_agg(left(a, 100), ', ') over (order by a rows 2 preceding) a, b from t_text) q
HEAD: 16070.62 ms
patched: 16182.16 ms

This one is surprising though. Maybe the advantage of using the same tuple table slot is so narrow when large data needs to be copied that the execution times almost match.
The patched and unpatched execution times differ by the margin of error either way.

--
Best Wishes,
Ashutosh Bapat
Re: Use virtual tuple slot for Unique node
On Wed, 27 Sept 2023 at 20:01, David Rowley wrote:
> On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas wrote:
> > So not a win in this case. Could you peek at the outer slot type, and
> > use the same kind of slot for the Unique's result? Or some more
> > complicated logic, like use a virtual slot if all the values are
> > pass-by-val? I'd also like to keep this simple, though...
> >
> > Would this kind of optimization make sense elsewhere?
>
> There are a few usages of ExecGetResultSlotOps(). e.g. ExecInitHashJoin().
>
> If I adjust the patch to:
>
> -	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> +	ExecInitResultTupleSlotTL(&uniquestate->ps,
> +							  ExecGetResultSlotOps(outerPlanState(uniquestate),
> +												   NULL));

Just to keep this from going cold, here's that in patch form for anyone who wants to test.

I spent a bit more time running some more benchmarks and I don't see any workload where it slows things down. I'd be happy if someone else had a go at finding a regression.

David

diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 01f951197c..be585e284b 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -115,6 +115,7 @@ UniqueState *
 ExecInitUnique(Unique *node, EState *estate, int eflags)
 {
 	UniqueState *uniquestate;
+	const TupleTableSlotOps *ops;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -137,11 +138,14 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	 */
 	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags);
 
+	/* initialize result slot and type */
+	ops = ExecGetResultSlotOps(outerPlanState(uniquestate), NULL);
+	ExecInitResultTupleSlotTL(&uniquestate->ps, ops);
+
 	/*
-	 * Initialize result slot and type.  Unique nodes do no projections, so
-	 * initialize projection info for this node appropriately.
+	 * Unique nodes do no projections, so initialize projection info for this
+	 * node appropriately.
 	 */
-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
 	uniquestate->ps.ps_ProjInfo = NULL;
 
 	/*
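The shape of David's fix - ask the outer child which slot ops it produces and initialize the result slot with the same ops, so no slot-type conversion happens in Unique - can be sketched in miniature. The Toy* names below are invented for illustration; the real API is ExecGetResultSlotOps() and ExecInitResultTupleSlotTL() in the PostgreSQL executor.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for TupleTableSlotOps and PlanState */
typedef struct ToySlotOps { const char *name; } ToySlotOps;

static const ToySlotOps ToyOpsMinimal = { "minimal" };
static const ToySlotOps ToyOpsVirtual = { "virtual" };

typedef struct ToyPlanState
{
    const ToySlotOps *resultops;    /* slot type this node produces */
    struct ToyPlanState *outer;     /* outer (child) plan state */
} ToyPlanState;

/* Analogue of ExecGetResultSlotOps: report what a node emits */
static const ToySlotOps *
toy_get_result_slot_ops(const ToyPlanState *ps)
{
    return ps->resultops;
}

/* Analogue of the patched ExecInitUnique: instead of hard-coding the
 * minimal slot ops, adopt whatever the outer child already produces,
 * so copying a tuple up through Unique never changes representation. */
static void
toy_init_unique(ToyPlanState *unique)
{
    unique->resultops = toy_get_result_slot_ops(unique->outer);
}
```

With this shape, a Unique over a Sort (minimal slots) stays minimal, and a Unique over a node producing virtual slots stays virtual - matching the "use the same kind of slot" suggestion rather than always forcing one type.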
Re: Use virtual tuple slot for Unique node
On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas wrote:
> So not a win in this case. Could you peek at the outer slot type, and
> use the same kind of slot for the Unique's result? Or some more
> complicated logic, like use a virtual slot if all the values are
> pass-by-val? I'd also like to keep this simple, though...
>
> Would this kind of optimization make sense elsewhere?

There are a few usages of ExecGetResultSlotOps(). e.g. ExecInitHashJoin().

If I adjust the patch to:

-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+	ExecInitResultTupleSlotTL(&uniquestate->ps,
+							  ExecGetResultSlotOps(outerPlanState(uniquestate),
+												   NULL));

Then I get the following performance on my Zen2 machine.

Test 1

drop table if exists t;
create table t(a int, b int);
insert into t select x,x from generate_series(1,100)x;
create index on t (a,b);
vacuum analyze t;
explain (analyze, timing off) select distinct a,b from t;

Master:
Execution Time: 149.669 ms
Execution Time: 149.019 ms
Execution Time: 151.240 ms

Patched:
Execution Time: 96.950 ms
Execution Time: 94.509 ms
Execution Time: 93.498 ms

Test 2

drop table if exists t;
create table t(a text, b text);
insert into t select x::text,x::text from generate_series(1,100)x;
create index on t (a,b);
vacuum analyze t;
explain (analyze, timing off) select distinct a,b from t;

Master:
Execution Time: 185.282 ms
Execution Time: 178.948 ms
Execution Time: 179.217 ms

Patched:
Execution Time: 141.031 ms
Execution Time: 141.136 ms
Execution Time: 142.163 ms

Test 3

set enable_hashagg=off;
explain (analyze, timing off) select distinct g::text, 'a', 'b', 'c','d', 'e','f','g','h' from generate_series(1, 5) g;

Master:
Execution Time: 87.599 ms
Execution Time: 87.721 ms
Execution Time: 87.635 ms

Patched:
Execution Time: 83.449 ms
Execution Time: 84.314 ms
Execution Time: 86.239 ms

David
Re: Use virtual tuple slot for Unique node
I did a little more perf testing with this. I'm seeing the same benefit with the query you posted. But can we find a case where it's not beneficial? If I understand correctly, when the input slot is a virtual slot, it's cheaper to copy it to another virtual slot than to form a minimal tuple. Like in your test case. What if the input is a minimal tuple?

On master:

postgres=# set enable_hashagg=off;
SET
postgres=# explain analyze select distinct g::text, 'a', 'b', 'c','d', 'e','f','g','h' from generate_series(1, 500) g;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Unique  (cost=2630852.42..2655852.42 rows=200 width=288) (actual time=4525.212..6576.992 rows=500 loops=1)
   ->  Sort  (cost=2630852.42..2643352.42 rows=500 width=288) (actual time=4525.211..5960.967 rows=500 loops=1)
         Sort Key: ((g)::text)
         Sort Method: external merge  Disk: 165296kB
         ->  Function Scan on generate_series g  (cost=0.00..75000.00 rows=500 width=288) (actual time=518.914..1194.702 rows=500 loops=1)
 Planning Time: 0.036 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.242 ms (Deform 0.035 ms), Inlining 63.457 ms, Optimization 29.764 ms, Emission 20.592 ms, Total 114.056 ms
 Execution Time: 6766.399 ms
(11 rows)

With the patch:

postgres=# set enable_hashagg=off;
SET
postgres=# explain analyze select distinct g::text, 'a', 'b', 'c','d', 'e','f','g','h' from generate_series(1, 500) g;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Unique  (cost=2630852.42..2655852.42 rows=200 width=288) (actual time=4563.639..7362.467 rows=500 loops=1)
   ->  Sort  (cost=2630852.42..2643352.42 rows=500 width=288) (actual time=4563.637..6069.000 rows=500 loops=1)
         Sort Key: ((g)::text)
         Sort Method: external merge  Disk: 165296kB
         ->  Function Scan on generate_series g  (cost=0.00..75000.00 rows=500 width=288) (actual time=528.060..1191.483 rows=500 loops=1)
 Planning Time: 0.720 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.406 ms (Deform 0.065 ms), Inlining 68.385 ms, Optimization 21.656 ms, Emission 21.033 ms, Total 111.480 ms
 Execution Time: 7585.306 ms
(11 rows)

So not a win in this case. Could you peek at the outer slot type, and use the same kind of slot for the Unique's result? Or some more complicated logic, like use a virtual slot if all the values are pass-by-val? I'd also like to keep this simple, though...

Would this kind of optimization make sense elsewhere?

--
Heikki Linnakangas
Neon (https://neon.tech)
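The asymmetry Heikki describes can be sketched with a toy model outside PostgreSQL. The Toy* names below are invented for illustration (the real implementations are tts_minimal_copyslot and tts_virtual_copyslot in execTuples.c): copying an already-formed flat tuple to another minimal slot is one allocation plus one memcpy, while presenting the same tuple as a virtual slot first requires deforming it into a per-column Datum array - the extra cost when the input is a minimal tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of a flat, already-formed ("minimal") tuple */
typedef struct ToyMinimalTuple
{
    size_t len;
    unsigned char data[];   /* packed column values */
} ToyMinimalTuple;

/* minimal -> minimal copy: one allocation plus one flat memcpy,
 * no per-column work at all */
static ToyMinimalTuple *
toy_minimal_copy(const ToyMinimalTuple *src)
{
    ToyMinimalTuple *dst = malloc(offsetof(ToyMinimalTuple, data) + src->len);

    dst->len = src->len;
    memcpy(dst->data, src->data, src->len);
    return dst;
}

/* minimal -> virtual: the flat tuple must be deformed into a
 * per-column value array first (here: natts fixed-width longs),
 * which is the per-column overhead the regression showed */
static void
toy_deform(const ToyMinimalTuple *src, long *values, int natts)
{
    for (int i = 0; i < natts; i++)
        memcpy(&values[i], src->data + (size_t) i * sizeof(long), sizeof(long));
}
```

Conversely, when the input is already virtual (an array of values), a virtual-to-virtual copy only moves the value array, which is why the patch helps in the pass-by-value benchmark but not here.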
Re: Use virtual tuple slot for Unique node
Again the new patch hasn't been attached to the thread, so resending it.

v3-use-virtual-slots-for-unique-node.patch
Description: Binary data
Re: Use virtual tuple slot for Unique node
I have done some research and found out that though the patch itself is correct (i.e. we can benefit from replacing TTSOpsMinimalTuple with TTSOpsVirtual for the Unique node), my explanation WHY was wrong.

1. We always materialize the new unique tuple in the slot, no matter what type of tuple table slot we use.
2. But the virtual tuple materialization (tts_virtual_copyslot) has performance benefits over the minimal tuple one (tts_minimal_copyslot):
   - tts_minimal_copyslot() always allocates zeroed memory with palloc0 (expensive according to the flame graph);
   - tts_virtual_copyslot() doesn't allocate any additional memory if the tuple is constructed only from pass-by-value columns (for variable-size columns we still need an allocation);
   - when tts_virtual_copyslot() does need an allocation, it doesn't need to zero the memory.

So as a result we seriously benefit from a virtual TTS for tuples constructed from fixed-size columns when we get a Unique node in the plan.

commit 148642d81f046b7d72b3a40182c165e42a8ab6d7
Author: Denis Smirnov
Date:   Thu Aug 31 08:51:14 2023 +0700

    Change tuple table slot for Unique node to "virtual"

    The Unique node uses the minimal TTS implementation to copy the
    unique tuples from the sorted stream into the resulting tuple slot.
    But if we replace the minimal TTS with the virtual TTS copy method,
    the performance improves.

    1. Minimal TTS always allocates zeroed memory for the materialized
       tuple.
    2. Virtual TTS doesn't allocate additional memory for tuples with
       columns passed by value. For the columns with external memory we
       don't need to zero the bytes but can simply take the memory chunk
       from the free list "as is".

diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 45035d74fa..c859add6e0 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -141,7 +141,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	 * Initialize result slot and type.  Unique nodes do no projections, so
 	 * initialize projection info for this node appropriately.
 	 */
-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsVirtual);
 	uniquestate->ps.ps_ProjInfo = NULL;
 
 	/*

> On 31 Aug 2023, at 10:28, Denis Smirnov wrote:
>
> It looks like my patch was not analyzed by the hackers mailing list due to incorrect mime type, so I duplicate it here.
>
>> On 31 Aug 2023, at 10:06, Denis Smirnov wrote:
Re: Use virtual tuple slot for Unique node
It looks like my patch was not analyzed by the hackers mailing list due to incorrect mime type, so I duplicate it here.

commit 2852a3f2fab8e723f208d81c1ad1eb6a6a377b09
Author: Denis Smirnov
Date:   Thu Aug 31 08:51:14 2023 +0700

    Change tuple table slot for Unique node to "virtual"

    The Unique node does a very simple thing in the plan - it processes
    the incoming sorted tuple stream and adds the unique tuples to the
    resulting tuple table slot. Previously the Unique node palloc'ed the
    results as "minimal" tuples. That is redundant, and now we simply
    collect references to the child node's tuples with "virtual" tuples.

diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 45035d74fa..c859add6e0 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -141,7 +141,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	 * Initialize result slot and type.  Unique nodes do no projections, so
 	 * initialize projection info for this node appropriately.
 	 */
-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsVirtual);
 	uniquestate->ps.ps_ProjInfo = NULL;
 
 	/*

> On 31 Aug 2023, at 10:06, Denis Smirnov wrote:
>
Re: Use virtual tuple slot for Unique node
On Thu, 31 Aug 2023 at 05:37, Денис Смирнов wrote:
> I have inspected the performance of the GROUP BY and DISTINCT queries for
> sorted data streams and found out that the Group node (produced by GROUP BY)
> works faster than the Unique node (produced by DISTINCT). The flame graph
> shows the reason - Unique palloc's tuples for the result slot while the
> Group node doesn't.
>
> I wonder, why do we use minimal tuples for the Unique node instead of
> virtual ones? It looks like there is no actual reason for that as Unique
> doesn't make any materialization.

It would be good to see example queries and a demonstration of the performance increase. I'm not disputing your claims, but showing some performance numbers might catch the eye of a reviewer more quickly.

You should also add this to the September commitfest at https://commitfest.postgresql.org/44/

David
Use virtual tuple slot for Unique node
Hi,

I have inspected the performance of the GROUP BY and DISTINCT queries for sorted data streams and found out that the Group node (produced by GROUP BY) works faster than the Unique node (produced by DISTINCT). The flame graph shows the reason - Unique palloc's tuples for the result slot while the Group node doesn't.

I wonder, why do we use minimal tuples for the Unique node instead of virtual ones? It looks like there is no actual reason for that as Unique doesn't make any materialization.

diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 45035d74fa..c859add6e0 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -141,7 +141,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	 * Initialize result slot and type.  Unique nodes do no projections, so
 	 * initialize projection info for this node appropriately.
 	 */
-	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+	ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsVirtual);
 	uniquestate->ps.ps_ProjInfo = NULL;
 
 	/*