Re: Redundant tuple copy in tqueueReceiveSlot()

2020-09-18 Thread Amit Khandekar
On Thu, 17 Sep 2020 at 08:55, Andres Freund  wrote:
>
> Hi,
>
> On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> > I wonder if there is a way we could make "Minimal Tuples but with the
> > length travelling separately (and perhaps chopped off?)" into a
> > first-class concept...  It's also a shame to be schlepping a bunch of
> > padding bytes around.

Yeah, I think we could pass the length separately, but since the
receiving end already assumes that the received data is a minimal
tuple, I thought why not skip passing this redundant component. But
anyway, if you and Andres are suggesting that being able to skip the
copy is important for virtual tuples as well, then I think the approach
you suggested (supplying allocated memory to the tuple API for the
conversion) would be one of the better options for us, if not the only
good one. Maybe I will look into how shm_mq works to see if we can come
up with a good solution.

>
> There really is no justification for having MinimalTuples, as we have
> them today at least, anymore. We used to rely on being able to construct
> pointers to MinimalTuples that are mostly compatible with HeapTuple. But
> I think there's none of those left since PG 12.

Ah ok.

>
> I think it'd make a bit more sense to take some steps towards a more
> suitable "minimal" tuple representation, rather than doing these local,
> pretty ugly hacks. A good way would be to just start removing the
> padding, unnecessary fields etc. from MinimalTuple.

So there are two things we wish to do:
1. Avoid the extra tuple-forming step before sending the minimal tuple
data. Possibly devise an shm_mq API that hands out memory to write a
tuple of a given length into, and devise something like
FormMinimalTupleDataInHere(memory_allocated_by_shm_mq) which writes the
minimal tuple data directly there (a rough sketch follows below).
2. Shrink the MinimalTupleData structure, because it no longer needs
the current padding etc., and substitute this new MinimalTuple
structure for the current one everywhere it is used in the code.
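
To make 1. a bit more concrete, below is the rough shape I have in mind.
Both functions are made up (nothing like shm_mq_reserve() exists today),
so please treat this purely as a sketch:

#include "storage/shm_mq.h"
#include "executor/tuptable.h"

/* Hypothetical: hand out nbytes of message space inside the queue, so
 * the tuple can be written there directly instead of being copied in
 * by shm_mq_send(). */
extern void *shm_mq_reserve(shm_mq_handle *mqh, Size nbytes);

/* Hypothetical: form the minimal tuple for 'slot' straight into the
 * caller-supplied memory 'dst' of size 'len', instead of palloc'ing a
 * separate copy first. */
extern void FormMinimalTupleDataInHere(TupleTableSlot *slot,
                                       void *dst, Size len);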

If we merely stripped the unnecessary fields from the tuple data being
sent to the Gather node, we would have to form a MinimalTuple again at
the receiving end, which adds back an extra tuple-forming step. So I
understand that's why you are saying we should shrink the
MinimalTupleData structure itself; in that case the receiver can keep
using the received (new-format) MinimalTupleData as an already-formed
tuple, the way we do now.

Now, the above two things (1 and 2) look independent to me. Suppose we
first do 1, i.e. come up with a good way to form an in-place
MinimalTuple at the sender's end, without any change to
MinimalTupleData. Doing 2 later, i.e. shrinking MinimalTupleData, should
then not require any change to the in-place tuple-forming API we wrote
for 1; only the existing underlying function heap_form_minimal_tuple(),
or something similar, might need to change. At least that's how it
looks to me right now.

>
> I also think that it'd be good to look at a few of the other places that
> are heavily bottlenecked by MinimalTuple overhead before designing new
> API around this. IIRC it's e.g. very easy to see hash joins spending a
> lot of time doing MinimalTuple copies & conversions.

Yeah, makes sense. The FormMinimalTupleDataInHere() mentioned above
should be usable in those other places as well. Will keep that in mind.

>
> > > Thomas, I guess you had a different approach in mind when you said
> > > about "returning either success or
> > > hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
> >
> > Yeah I tried some things like that, but I wasn't satisfied with any of
> > them; basically the extra work involved in negotiating the size was a
> > bit too high.

Hmm, ok. Let me see if there is any way around this.

> > On the other hand, because my interface was "please
> > write a MinimalTuple here!", it had the option to *form* a
> > MinimalTuple directly in place, whereas your approach can only avoid
> > creating and destroying a temporary tuple when the source is a heap
> > tuple.
True.
>
> There's a lot of cases where the source is a virtual slot (since we'll
> often project stuff below Gather). So it'd be quite advantageous to
> avoid building an unnecessary HeapTuple in those cases.

Yeah right.




Re: Redundant tuple copy in tqueueReceiveSlot()

2020-09-16 Thread Andres Freund
Hi,

On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> I wonder if there is a way we could make "Minimal Tuples but with the
> length travelling separately (and perhaps chopped off?)" into a
> first-class concept...  It's also a shame to be schlepping a bunch of
> padding bytes around.

There really is no justification for having MinimalTuples, as we have
them today at least, anymore. We used to rely on being able to construct
pointers to MinimalTuples that are mostly compatible with HeapTuple. But
I think there's none of those left since PG 12.

I think it'd make a bit more sense to take some steps towards a more
suitable "minimal" tuple representation, rather than doing these local,
pretty ugly hacks. A good way would be to just start removing the
padding, unnecessary fields etc. from MinimalTuple.
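
For reference, the struct as it stands (src/include/access/htup_details.h)
still carries the padding and the HeapTupleHeader-compatible tail; it
looks roughly like this:

struct MinimalTupleData
{
    uint32      t_len;          /* actual length of minimal tuple */

    char        mt_padding[MINIMAL_TUPLE_PADDING];

    /* Fields below here must match HeapTupleHeaderData! */

    uint16      t_infomask2;    /* number of attributes + various flags */
    uint16      t_infomask;     /* various flag bits */
    uint8       t_hoff;         /* sizeof header incl. bitmap, padding */

    bits8       t_bits[FLEXIBLE_ARRAY_MEMBER];  /* bitmap of NULLs */

    /* MORE DATA FOLLOWS AT END OF STRUCT */
};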

I also think that it'd be good to look at a few of the other places that
are heavily bottlenecked by MinimalTuple overhead before designing new
API around this. IIRC it's e.g. very easy to see hash joins spending a
lot of time doing MinimalTuple copies & conversions.

> 
>  tuple = (MinimalTuple) data;
> -Assert(tuple->t_len == nbytes);
> +tuple->t_len = nbytes;
> 
> Hmm, so you have to scribble on shared memory on the receiving side.

Ick, I would really like to avoid this.


> > Thomas, I guess you had a different approach in mind when you said
> > about "returning either success or
> > hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
> 
> Yeah I tried some things like that, but I wasn't satisfied with any of
> them; basically the extra work involved in negotiating the size was a
> bit too high.  On the other hand, because my interface was "please
> write a MinimalTuple here!", it had the option to *form* a
> MinimalTuple directly in place, whereas your approach can only avoid
> creating and destroying a temporary tuple when the source is a heap
> tuple.

There's a lot of cases where the source is a virtual slot (since we'll
often project stuff below Gather). So it'd be quite advantageous to
avoid building an unnecessary HeapTuple in those cases.

I wonder if it would be sensible to build minimal tuples using
tts_values/isnull in some cases.  This might even be advantageous in
case of heap / minimal tuples, because IIRC right now the code will
materialize the slot unnecessarily. Not sure how common that is.
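
Just to illustrate the idea, a rough sketch using only existing
functions (the wrapper name is made up; this still pallocs the result,
so the interesting follow-up would be a variant that forms the tuple
into caller-supplied memory):

#include "access/htup_details.h"
#include "executor/tuptable.h"

static MinimalTuple
minimal_tuple_from_slot_values(TupleTableSlot *slot)
{
    /* make sure tts_values/tts_isnull are populated */
    slot_getallattrs(slot);

    /* build the minimal tuple straight from the column arrays,
     * without materializing an intermediate HeapTuple */
    return heap_form_minimal_tuple(slot->tts_tupleDescriptor,
                                   slot->tts_values,
                                   slot->tts_isnull);
}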

Greetings,

Andres Freund




Re: Redundant tuple copy in tqueueReceiveSlot()

2020-09-16 Thread Thomas Munro
On Wed, Sep 9, 2020 at 5:23 PM Amit Khandekar  wrote:
> I went ahead and tried doing this. I chose an approach where we can
> return the pointer to the in-place minimal tuple data if it's a
> heap/buffer/minimal tuple slot. A new function
> ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
> If the slot is neither a heap, buffer, nor minimal tuple slot, it
> returns a copy as usual. The receiver should not assume the data is
> taken directly from MinimalTupleData, so it should set its t_len to the
> number of bytes
> returned. Patch attached
> (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

+char *
+ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len,
+  bool *shouldFree)

Interesting approach.  It's a bit of a weird interface, returning a
pointer to a non-valid MinimalTuple that requires extra tweaking after
you copy it to make it a valid one and that you're not allowed to
tweak in-place.  I'd probably make that return "const void *" just for
a bit of extra documentation.  I wonder if there is a way we could
make "Minimal Tuples but with the length travelling separately (and
perhaps chopped off?)" into a first-class concept...  It's also a
shame to be schlepping a bunch of padding bytes around.
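
Concretely, that would just mean changing the prototype in the patch to
something like:

const void *
ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len,
                              bool *shouldFree);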

 tuple = (MinimalTuple) data;
-Assert(tuple->t_len == nbytes);
+tuple->t_len = nbytes;

Hmm, so you have to scribble on shared memory on the receiving side.
I wondered about a couple of different ways to share the length field
with the shm_mq envelope, but that all seems a bit too weird...

> Thomas, I guess you had a different approach in mind when you said
> about "returning either success or
> hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to

Yeah I tried some things like that, but I wasn't satisfied with any of
them; basically the extra work involved in negotiating the size was a
bit too high.  On the other hand, because my interface was "please
write a MinimalTuple here!", it had the option to *form* a
MinimalTuple directly in place, whereas your approach can only avoid
creating and destroying a temporary tuple when the source is a heap
tuple.

> me is that avoiding the copy shows consistent improvement of 4 to 10%
> for simple parallel table scans. I tried my patch on both x86_64 and
> arm64, and observed this speedup on both.

I think that's a great validation of the goal but I hope we can figure
out a way that avoids the temporary tuple for more cases.  FWIW I saw
hash self joins running a couple of percent faster with one of my
abandoned patches.




Redundant tuple copy in tqueueReceiveSlot()

2020-09-08 Thread Amit Khandekar
Hi,

I am starting a new thread that continues with the following point
that was discussed in [1] 

On Fri, 17 Jul 2020 at 09:05, Thomas Munro  wrote:
>
> On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty
>  wrote:
> > Do you mean that we should have an implementation for
> > get_minimal_tuple() for the heap AM and have it return a pointer to the
> > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> > tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> > to extract the minimal tuple was allocated in the tuple queue in the
> > first place? If we consider that the node directly below a gather is a
> > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
> > ss_ScanTupleSlot to point to memory in the shared tuple queue?
> > Similarly, other ExecInit*() methods can do the same for other executor
> > nodes that involve parallelism? Of course, things would be slightly
> > different for
> > the other use cases you mentioned (such as hash table population)
>
> What I mean is that where ExecHashTableInsert() and
> tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a
> freshly allocated copy, and then you copy that again, and free it.
> There may be something similar going on in tuplestore and sort code.
> Perhaps we could have something like
> ExecFetchSlotMinimalTupleInPlace(slot, output_buffer,
> output_buffer_size) that returns a value that indicates either success
> or hey-that-buffer's-too-small-I-need-N-bytes, or something like that.
> That silly extra copy is something Andres pointed out to me in some
> perf results involving TPCH hash joins, a couple of years ago.
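
(Spelling the quoted idea out in code, purely for illustration -- no such
function exists in the tree, and the retry loop is only there to show the
success-or-I-need-N-bytes contract:)

#include "executor/tuptable.h"

/* Form the minimal tuple for 'slot' into 'dst' if dst_size is large
 * enough and return true; otherwise return false and set *needed to
 * the number of bytes required, writing nothing. */
extern bool ExecFetchSlotMinimalTupleInPlace(TupleTableSlot *slot,
                                             void *dst, Size dst_size,
                                             Size *needed);

static void
fetch_into_buffer(TupleTableSlot *slot, void **buf, Size *buf_size)
{
    Size        needed;

    if (!ExecFetchSlotMinimalTupleInPlace(slot, *buf, *buf_size, &needed))
    {
        /* too small: grow the (palloc'd) buffer and retry */
        *buf = repalloc(*buf, needed);
        *buf_size = needed;
        (void) ExecFetchSlotMinimalTupleInPlace(slot, *buf, *buf_size,
                                                &needed);
    }
}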

I went ahead and tried doing this. I chose an approach where we can
return the pointer to the in-place minimal tuple data if it's a
heap/buffer/minimal tuple slot. A new function
ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
If the slot is neither a heap, buffer, nor minimal tuple slot, it
returns a copy as usual. The receiver should not assume the data is
taken directly from MinimalTupleData, so it should set its t_len to the
number of bytes
returned. Patch attached
(0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

Thomas, I guess you had a different approach in mind when you said
about "returning either success or
hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
me is that avoiding the copy shows consistent improvement of 4 to 10%
for simple parallel table scans. I tried my patch on both x86_64 and
arm64, and observed this speedup on both.

create table tab as select generate_series(1, 2000) id,
'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
select pg_prewarm('tab'::regclass);
explain analyze select * from tab where id %2 = 0;
Times in milliseconds:
HEAD    : 1833.119  1816.522  1875.648  1872.153  1834.383
Patched : 1763.786  1721.160  1714.665  1719.738  1704.478
This was with the default 2 parallel workers. With 3 or 4 workers, I
didn't see a noticeable difference for the above test case. I think if I
double the number of rows, the difference will become noticeable. In any
case, the gain diminishes as the number of workers increases, because
the tuple copying also gets parallelized. In some scenarios,
parallel_leader_participation=off amplifies the difference.

Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch, guc_for_testing.patch, that I used for
measuring the gain; it is only for testing. Without it, comparing the
performance figures requires a server restart, and the figures anyway
shift back and forth by 5-15 percent after each restart, which creates a
lot of noise when comparing figures with and without the fix. Use the
GUC enable_fix to enable/disable the fix.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
Huawei Technologies
From 5d19626e35e50a5630e5f1a042f7ecee6acb7c70 Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Wed, 9 Sep 2020 12:03:01 +0800
Subject: [PATCH 2/2] Add guc for ease of testing speedup.

This is only for testing the performance gain. Otherwise, comparing the
performance figures requires a server restart, and the figures anyway
shift back and forth by 5-15 percent after each restart, which creates a
lot of noise when comparing figures with and without the fix.

With this, we can easily see at least 4-10% difference in execution
times by setting/unsetting the GUC enable_fix.
---
 src/backend/executor/tqueue.c | 12 
 src/backend/utils/misc/guc.c  | 12 
 2 files changed, 24 insertions(+)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index c70ee0f39a..7dd60b5c7e 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -45,6 +45,7 @@ struct TupleQueueReader
 	shm_mq_handle *queue;		/* shm_mq to receive from */
 };
 
+extern bool		enable_fix;
 /*
  *