Re: Redundant tuple copy in tqueueReceiveSlot()
On Thu, 17 Sep 2020 at 08:55, Andres Freund wrote:
> Hi,
>
> On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> > I wonder if there is a way we could make "Minimal Tuples but with the
> > length travelling separately (and perhaps chopped off?)" into a
> > first-class concept... It's also a shame to be schlepping a bunch of
> > padding bytes around.

Yeah, I think we can pass the length separately, but since the receiving end already assumes that the received data is a minimal tuple, I thought why not skip passing this redundant component. But anyway, if you and Andres are suggesting that being able to skip the copy is important for virtual tuples as well, then I think the approach you suggested (supplying allocated memory to the tuple API for the conversion) would be one of the better options available, if not the only good one. Maybe I will look into the shm_mq internals to see if we can come up with a good solution.

> There really is no justification for having MinimalTuples, as we have
> them today at least, anymore. We used to rely on being able to construct
> pointers to MinimalTuples that are mostly compatible with HeapTuple. But
> I think there's none of those left since PG 12.

Ah, OK.

> I think it'd make a bit more sense to do some steps towards having a
> more suitable "minimal" tuple representation, rather than doing these
> local, pretty ugly hacks. A good way would be to just start removing
> the padding, unnecessary fields etc. from MinimalTuple.

So there are two things we wish to do:

1. Avoid the extra tuple-forming step before sending minimal tuple data.
   Possibly devise an shm_mq API that hands out memory to write a tuple of
   a given length into, plus something like
   FormMinimalTupleDataInHere(memory_allocated_by_shm_mq) that writes the
   minimal tuple data directly there.

2. Shrink the MinimalTupleData structure, since it no longer needs the
   current padding etc., and substitute this new MinimalTuple structure for
   the current one everywhere in the code it is used.

If we remove the unnecessary fields from the tuple data being sent to the Gather node, then we need to form a MinimalTuple again at the receiving end, which re-introduces an extra tuple-forming step. So I understand that's the reason you are saying we should shrink the MinimalTupleData structure itself, in which case we will continue to use the received new MinimalTupleData as an already-formed tuple, as we do now.

Now, the above two things (1 and 2) look independent to me. Suppose we first do 1, i.e. we come up with a good way to form an in-place MinimalTuple at the sender's end, without any change to MinimalTupleData. Then when we do 2, i.e. shrink MinimalTupleData, we won't require any change in the in-place tuple-forming API we wrote in 1; just the existing underlying function heap_form_minimal_tuple() or something similar might need to change. At least that's what I feel right now.

> I also think that it'd be good to look at a few of the other places that
> are heavily bottlenecked by MinimalTuple overhead before designing new
> API around this. IIRC it's e.g. very easy to see hash joins spending a
> lot of time doing MinimalTuple copies & conversions.

Yeah, makes sense. The above FormMinimalTupleDataInHere() should be usable in these other places as well. Will keep that in mind.

> > > Thomas, I guess you had a different approach in mind when you said
> > > about "returning either success or
> > > hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to
>
> > Yeah I tried some things like that, but I wasn't satisfied with any of
> > them; basically the extra work involved in negotiating the size was a
> > bit too high.

Hmm, OK. Let me see if there is any way around this.
> > On the other hand, because my interface was "please
> > write a MinimalTuple here!", it had the option to *form* a
> > MinimalTuple directly in place, whereas your approach can only avoid
> > creating and destroying a temporary tuple when the source is a heap
> > tuple.

True.

> There's a lot of cases where the source is a virtual slot (since we'll
> often project stuff below Gather). So it'd be quite advantageous to
> avoid building an unnecessary HeapTuple in those cases.

Yeah, right.
Re: Redundant tuple copy in tqueueReceiveSlot()
Hi,

On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> I wonder if there is a way we could make "Minimal Tuples but with the
> length travelling separately (and perhaps chopped off?)" into a
> first-class concept... It's also a shame to be schlepping a bunch of
> padding bytes around.

There really is no justification for having MinimalTuples, as we have them today at least, anymore. We used to rely on being able to construct pointers to MinimalTuples that are mostly compatible with HeapTuple. But I think there's none of those left since PG 12.

I think it'd make a bit more sense to do some steps towards having a more suitable "minimal" tuple representation, rather than doing these local, pretty ugly hacks. A good way would be to just start removing the padding, unnecessary fields etc. from MinimalTuple.

I also think that it'd be good to look at a few of the other places that are heavily bottlenecked by MinimalTuple overhead before designing new API around this. IIRC it's e.g. very easy to see hash joins spending a lot of time doing MinimalTuple copies & conversions.

>     tuple = (MinimalTuple) data;
> -   Assert(tuple->t_len == nbytes);
> +   tuple->t_len = nbytes;
>
> Hmm, so you have to scribble on shared memory on the receiving side.

Ick, I would really like to avoid this.

> > Thomas, I guess you had a different approach in mind when you said
> > about "returning either success or
> > hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to
>
> Yeah I tried some things like that, but I wasn't satisfied with any of
> them; basically the extra work involved in negotiating the size was a
> bit too high. On the other hand, because my interface was "please
> write a MinimalTuple here!", it had the option to *form* a
> MinimalTuple directly in place, whereas your approach can only avoid
> creating and destroying a temporary tuple when the source is a heap
> tuple.
There's a lot of cases where the source is a virtual slot (since we'll often project stuff below Gather). So it'd be quite advantageous to avoid building an unnecessary HeapTuple in those cases.

I wonder if it would be sensible to build minimal tuples using tts_values/isnull in some cases. This might even be advantageous in the case of heap / minimal tuples, because IIRC right now the code will materialize the slot unnecessarily. Not sure how common that is.

Greetings,

Andres Freund
Re: Redundant tuple copy in tqueueReceiveSlot()
On Wed, Sep 9, 2020 at 5:23 PM Amit Khandekar wrote:
> I went ahead and tried doing this. I chose an approach where we can
> return the pointer to the in-place minimal tuple data if it's a
> heap/buffer/minimal tuple slot. A new function
> ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
> If it's neither heap, buffer or minimal tuple, it returns a copy as
> usual. The receiver should not assume the data is directly taken from
> MinimalTupleData, so it should set its t_len to the number of bytes
> returned. Patch attached
> (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

+char *
+ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len,
+                              bool *shouldFree)

Interesting approach. It's a bit of a weird interface, returning a pointer to a non-valid MinimalTuple that requires extra tweaking after you copy it to make it a valid one, and that you're not allowed to tweak in place. I'd probably make that return "const void *" just for a bit of extra documentation.

I wonder if there is a way we could make "Minimal Tuples but with the length travelling separately (and perhaps chopped off?)" into a first-class concept... It's also a shame to be schlepping a bunch of padding bytes around.

    tuple = (MinimalTuple) data;
-   Assert(tuple->t_len == nbytes);
+   tuple->t_len = nbytes;

Hmm, so you have to scribble on shared memory on the receiving side. I wondered about a couple of different ways to share the length field with the shm_mq envelope, but that all seems a bit too weird...

> Thomas, I guess you had a different approach in mind when you said
> about "returning either success or
> hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to

Yeah, I tried some things like that, but I wasn't satisfied with any of them; basically the extra work involved in negotiating the size was a bit too high.
On the other hand, because my interface was "please write a MinimalTuple here!", it had the option to *form* a MinimalTuple directly in place, whereas your approach can only avoid creating and destroying a temporary tuple when the source is a heap tuple.

> me is that avoiding the copy shows consistent improvement of 4 to 10%
> for simple parallel table scans. I tried my patch on both x86_64 and
> arm64, and observed this speedup on both.

I think that's a great validation of the goal, but I hope we can figure out a way that avoids the temporary tuple for more cases. FWIW I saw hash self-joins running a couple of percent faster with one of my abandoned patches.
Redundant tuple copy in tqueueReceiveSlot()
Hi,

I am starting a new thread that continues with the following point that was discussed in [1].

On Fri, 17 Jul 2020 at 09:05, Thomas Munro wrote:
> On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty wrote:
> > Do you mean that we should have an implementation for
> > get_minimal_tuple() for the heap AM and have it return a pointer to the
> > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> > tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> > to extract the minimal tuple was allocated in the tuple queue in the
> > first place? If we consider that the node directly below a gather is a
> > SeqScan, then we could possibly, in ExecInitSeqScan(), set up the
> > ss_ScanTupleSlot to point to memory in the shared tuple queue?
> > Similarly, other ExecInit*() methods can do the same for other executor
> > nodes that involve parallelism? Of course, things would be slightly
> > different for the other use cases you mentioned (such as hash table
> > population).
>
> What I mean is that where ExecHashTableInsert() and
> tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a
> freshly allocated copy, and then you copy that again, and free it.
> There may be something similar going on in tuplestore and sort code.
> Perhaps we could have something like
> ExecFetchSlotMinimalTupleInPlace(slot, output_buffer,
> output_buffer_size) that returns a value that indicates either success
> or hey-that-buffer's-too-small-I-need-N-bytes, or something like that.
> That silly extra copy is something Andres pointed out to me in some
> perf results involving TPCH hash joins, a couple of years ago.

I went ahead and tried doing this. I chose an approach where we can return the pointer to the in-place minimal tuple data if it's a heap/buffer/minimal tuple slot. A new function ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data. If the slot is neither a heap, buffer, nor minimal tuple slot, it returns a copy as usual.
The receiver should not assume the data is directly taken from a MinimalTupleData, so it should set its t_len to the number of bytes returned. Patch attached (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch).

Thomas, I guess you had a different approach in mind when you said about "returning either success or hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to me is that avoiding the copy shows a consistent improvement of 4 to 10% for simple parallel table scans. I tried my patch on both x86_64 and arm64, and observed this speedup on both.

    create table tab as select generate_series(1, 2000) id,
        'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
    select pg_prewarm('tab'::regclass);
    explain analyze select * from tab where id % 2 = 0;

Times in milliseconds:

    HEAD    : 1833.119 1816.522 1875.648 1872.153 1834.383
    Patched : 1763.786 1721.160 1714.665 1719.738 1704.478

This was with the default 2 parallel workers. With 3 or 4 workers, I didn't see a noticeable difference for the above test case. I think if I double the number of rows, the difference will become noticeable. In any case, the gain would keep shrinking as the number of workers grows, because the tuple copying also gets parallelized. In some scenarios, parallel_leader_participation=off causes the difference to amplify. Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch, guc_for_testing.patch, that I used for measuring the gain. This patch is only for testing. Without it, comparing the performance figures requires a server restart, and the figures anyway shift back and forth by 5-15 percent after each restart, which creates a lot of noise when comparing figures with and without the fix. Use the GUC enable_fix to enable/disable the fix.
[1] https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com

--
Thanks,
-Amit Khandekar
Huawei Technologies

From 5d19626e35e50a5630e5f1a042f7ecee6acb7c70 Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Wed, 9 Sep 2020 12:03:01 +0800
Subject: [PATCH 2/2] Add guc for ease of testing speedup.

This is only for testing the performance gain. Otherwise, to compare the
performance figures, a server restart is required, and the figures anyway
shift back and forth by 5-15 percent after each restart, which creates a
lot of noise when comparing figures with and without the fix. With this,
we can easily see at least a 4-10% difference in execution times by
setting/unsetting the GUC enable_fix.
---
 src/backend/executor/tqueue.c | 12 ++++++++++++
 src/backend/utils/misc/guc.c  | 12 ++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index c70ee0f39a..7dd60b5c7e 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -45,6 +45,7 @@ struct TupleQueueReader
 	shm_mq_handle *queue;		/* shm_mq to receive from */
 };

+extern bool enable_fix;

 /*
  *