Re: [HACKERS] [POC] Faster processing at Gather node

2018-03-29 Thread Bruce Momjian
On Fri, Mar  2, 2018 at 05:21:28PM -0500, Tels wrote:
> Hello Robert,
> 
> On Fri, March 2, 2018 12:22 pm, Robert Haas wrote:
> > On Wed, Feb 28, 2018 at 10:06 AM, Robert Haas 
> > wrote:
> >> [ latest patches ]
> >
> > Committed.  Thanks for the review.
> 
> Cool :)
> 
> There is a typo, tho:
> 
> + /*
> +  * If the counterpary is known to have attached, we can read mq_receiver
> +  * without acquiring the spinlock and assume it isn't NULL.  Otherwise,
> +  * more caution is needed.
> +  */
> 
> s/counterpary/counterparty/;
> 
> Sorry, only noticed while re-reading the thread.
> 
> Also, either a double space is missing, or one is too many:
> 
> + /*
> +  * Separate prior reads of mq_ring from the increment of mq_bytes_read
> +  * which follows.  Pairs with the full barrier in shm_mq_send_bytes(). 
> We
> +  * only need a read barrier here because the increment of mq_bytes_read 
> is
> +  * actually a read followed by a dependent write.
> +  */
> 
> ("  Pairs ..." vs. ". We only ...")
> 
> Best regards,

Change applied with the attached patch.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
new file mode 100644
index 3faace2..c80cb6e
*** a/src/backend/storage/ipc/shm_mq.c
--- b/src/backend/storage/ipc/shm_mq.c
*** shm_mq_sendv(shm_mq_handle *mqh, shm_mq_
*** 493,499 
  		return SHM_MQ_DETACHED;
  
  	/*
! 	 * If the counterpary is known to have attached, we can read mq_receiver
  	 * without acquiring the spinlock and assume it isn't NULL.  Otherwise,
  	 * more caution is needed.
  	 */
--- 493,499 
  		return SHM_MQ_DETACHED;
  
  	/*
! 	 * If the counterparty is known to have attached, we can read mq_receiver
  	 * without acquiring the spinlock and assume it isn't NULL.  Otherwise,
  	 * more caution is needed.
  	 */
*** shm_mq_inc_bytes_read(shm_mq *mq, Size n
*** 1203,1211 
  
  	/*
  	 * Separate prior reads of mq_ring from the increment of mq_bytes_read
! 	 * which follows.  Pairs with the full barrier in shm_mq_send_bytes(). We
! 	 * only need a read barrier here because the increment of mq_bytes_read is
! 	 * actually a read followed by a dependent write.
  	 */
  	pg_read_barrier();
  
--- 1203,1211 
  
  	/*
  	 * Separate prior reads of mq_ring from the increment of mq_bytes_read
! 	 * which follows.  This pairs with the full barrier in shm_mq_send_bytes().
! 	 * We only need a read barrier here because the increment of mq_bytes_read
! 	 * is actually a read followed by a dependent write.
  	 */
  	pg_read_barrier();
  


Re: [HACKERS] [POC] Faster processing at Gather node

2018-03-02 Thread Tels
Hello Robert,

On Fri, March 2, 2018 12:22 pm, Robert Haas wrote:
> On Wed, Feb 28, 2018 at 10:06 AM, Robert Haas 
> wrote:
>> [ latest patches ]
>
> Committed.  Thanks for the review.

Cool :)

There is a typo, tho:

+   /*
+* If the counterpary is known to have attached, we can read mq_receiver
+* without acquiring the spinlock and assume it isn't NULL.  Otherwise,
+* more caution is needed.
+*/

s/counterpary/counterparty/;

Sorry, only noticed while re-reading the thread.

Also, either a double space is missing, or one is too many:

+   /*
+* Separate prior reads of mq_ring from the increment of mq_bytes_read
+* which follows.  Pairs with the full barrier in shm_mq_send_bytes(). 
We
+* only need a read barrier here because the increment of mq_bytes_read 
is
+* actually a read followed by a dependent write.
+*/

("  Pairs ..." vs. ". We only ...")

Best regards,

Tels



Re: [HACKERS] [POC] Faster processing at Gather node

2018-03-02 Thread Robert Haas
On Wed, Feb 28, 2018 at 10:06 AM, Robert Haas  wrote:
> [ latest patches ]

Committed.  Thanks for the review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] Faster processing at Gather node

2018-02-28 Thread Robert Haas
On Tue, Feb 27, 2018 at 4:06 PM, Andres Freund  wrote:
>> OK, I'll try to check how feasible that would be.
>
> cool.

It's not too hard, but it doesn't really seem to help, so I'm inclined
to leave it alone.  To make it work, you need to keep two separate
counters in the shm_mq_handle, one for the number of bytes since we
did an increment and the other for the number of bytes since we sent a
signal.  I don't really want to introduce that complexity unless there
is a benefit.
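
[Purely as an illustration of that bookkeeping -- hypothetical field names
and thresholds, not the contents of two-separate-counters.patch -- the idea
is roughly:]

	/* in shm_mq_handle: */
	Size		mqh_consume_pending;	/* bytes read but not yet advertised */
	Size		mqh_signal_pending;	/* bytes advertised, sender not yet woken */

	/* after consuming nbytes from the ring: */
	mqh->mqh_consume_pending += nbytes;
	mqh->mqh_signal_pending += nbytes;
	if (mqh->mqh_consume_pending > ringsize / 4)
	{
		pg_read_barrier();
		pg_atomic_add_fetch_u64(&mq->mq_bytes_read, mqh->mqh_consume_pending);
		mqh->mqh_consume_pending = 0;
	}
	if (mqh->mqh_signal_pending > ringsize / 2)
	{
		SetLatch(&mq->mq_sender->procLatch);	/* amortized wakeup */
		mqh->mqh_signal_pending = 0;
	}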

With just 0001 and 0002: 3968.899 ms, 4043.428 ms, 4042.472 ms, 4142.226 ms
With two-separate-counters.patch added: 4123.841 ms, 4101.917 ms,
4063.368 ms, 3985.148 ms

If you take the total of the 4 times, that's a 0.4% slowdown with the
patch applied, but I think that's just noise.  It seems possible that
with a larger queue -- and maybe a different query shape -- it would
help, but I really just want to get the optimizations that I've got
committed, provided that you find them acceptable, rather than spend a
lot of time looking for new optimizations, because:

1. I've got other things to get done.

2. I think that the patches I've got here capture most of the available benefit.

3. This case isn't super-common in the first place -- we generally
want to avoid feeding tons of tuples through the Gather.

4. We might abandon the shm_mq approach entirely and switch to
something like sticking tuples in DSA using the flexible tuple slot
stuff you've proposed elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0002-shm-mq-reduce-receiver-latch-set-v3.patch
Description: Binary data


0001-shm-mq-less-spinlocks-v4.patch
Description: Binary data


two-separate-counters.patch
Description: Binary data


Re: [HACKERS] [POC] Faster processing at Gather node

2018-02-27 Thread Andres Freund
Hi,

On 2018-02-27 16:03:17 -0500, Robert Haas wrote:
> On Wed, Feb 7, 2018 at 1:41 PM, Andres Freund  wrote:
> > Well, it's more than just systems like that - for 64bit atomics we
> > sometimes do fall back to spinlock based atomics on 32bit systems, even
> > if they support 32 bit atomics.
> 
> I built with -m32 on my laptop and tried "select aid, count(*) from
> pgbench_accounts group by 1 having count(*) > 1" on pgbench at scale
> factor 100 with pgbench_accounts_pkey dropped and
> max_parallel_workers_per_gather set to 10 on (a) commit
> 0b5e33f667a2042d7022da8bef31a8be5937aad1 (I know this is a little old,
> but I think it doesn't matter), (b) same plus
> shm-mq-less-spinlocks-v3, and (c) same plus shm-mq-less-spinlocks-v3
> and shm-mq-reduce-receiver-latch-set-v2.
> 
> (a) 16563.790 ms, 16625.257 ms, 16496.062 ms
> (b) 17217.051 ms, 17157.745 ms, 17225.755 ms [median to median +3.9% vs. (a)]
> (c) 15491.947 ms, 15455.840 ms, 15452.649 ms [median to median -7.0%
> vs. (a), -10.2% vs (b)]
> 
> Do you think that's a problem?  If it is, what do you think we should
> do about it?  It seems to me that it's probably OK because (1) with
> both patches we still come out ahead and (2) 32-bit systems will
> presumably continue to become rarer as time goes on, but you might
> disagree.

No, I think this is fairly reasonable. A fairly extreme use case on a
32-bit machine regressing a bit, while gaining performance in other cases?
That works for me.


> OK, I'll try to check how feasible that would be.

cool.

Greetings,

Andres Freund



Re: [HACKERS] [POC] Faster processing at Gather node

2018-02-27 Thread Robert Haas
On Wed, Feb 7, 2018 at 1:41 PM, Andres Freund  wrote:
> Well, it's more than just systems like that - for 64bit atomics we
> sometimes do fall back to spinlock based atomics on 32bit systems, even
> if they support 32 bit atomics.

I built with -m32 on my laptop and tried "select aid, count(*) from
pgbench_accounts group by 1 having count(*) > 1" on pgbench at scale
factor 100 with pgbench_accounts_pkey dropped and
max_parallel_workers_per_gather set to 10 on (a) commit
0b5e33f667a2042d7022da8bef31a8be5937aad1 (I know this is a little old,
but I think it doesn't matter), (b) same plus
shm-mq-less-spinlocks-v3, and (c) same plus shm-mq-less-spinlocks-v3
and shm-mq-reduce-receiver-latch-set-v2.

(a) 16563.790 ms, 16625.257 ms, 16496.062 ms
(b) 17217.051 ms, 17157.745 ms, 17225.755 ms [median to median +3.9% vs. (a)]
(c) 15491.947 ms, 15455.840 ms, 15452.649 ms [median to median -7.0%
vs. (a), -10.2% vs (b)]

Do you think that's a problem?  If it is, what do you think we should
do about it?  It seems to me that it's probably OK because (1) with
both patches we still come out ahead and (2) 32-bit systems will
presumably continue to become rarer as time goes on, but you might
disagree.

>> > Hm, do all paths here guarantee that mq->mq_detached won't be stored on
>> > the stack / register in the first iteration, and then not reread any
>> > further? I think it's fine because every branch of the if below ends up
>> > in a syscall / barrier, but it might be worth noting on that here.
>>
>> Aargh.  I hate compilers.  I added a comment.  Should I think about
>> changing mq_detached to pg_atomic_uint32 instead?
>
> I think a pg_compiler_barrier() would suffice to alleviate my concern,
> right? If you wanted to go for an atomic, using pg_atomic_flag would
> probably be more appropriate than pg_atomic_uint32.

Hmm, all right, I'll add pg_compiler_barrier().
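
[Illustrative sketch of the concern and the fix, not the committed code:
without something acting as a compiler barrier inside the loop, the
compiler could legally load mq_detached once and keep it in a register.]

	while (sent < nbytes)
	{
		/* Bail out if the queue has been detached. */
		if (mq->mq_detached)
			return SHM_MQ_DETACHED;

		...

		/*
		 * The WaitLatch()/SetLatch() calls below already imply barriers,
		 * but an explicit pg_compiler_barrier() makes it obvious that
		 * mq_detached must be re-read from memory on every iteration.
		 */
		pg_compiler_barrier();
	}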

>> >> - /* Write as much data as we can via a single 
>> >> memcpy(). */
>> >> + /*
>> >> +  * Write as much data as we can via a single 
>> >> memcpy(). Make sure
>> >> +  * these writes happen after the read of 
>> >> mq_bytes_read, above.
>> >> +  * This barrier pairs with the one in 
>> >> shm_mq_inc_bytes_read.
>> >> +  */
>> >
>> > s/above/above. Otherwise a newer mq_bytes_read could become visible
>> > before the corresponding reads have fully finished./?
>>
>> I don't find that very clear.  A newer mq_bytes_read could become
>> visible whenever, and a barrier doesn't prevent that from happening.
>
> Well, my point was that the barrier prevents the write to
> mq_bytes_read becoming visible before the corresponding reads have
> finished. Which then would mean the memcpy() could overwrite them. And a
> barrier *does* prevent that from happening.

I think we're talking about the same thing, but not finding each
others' explanations very clear.

>> Hmm, I'm not sure I understand what you're suggesting, here.  In
>> general, we return with the data for the current message unconsumed,
>> and then consume it the next time the function is called, so that
>> (except when the message wraps the end of the buffer) we can return a
>> pointer directly into the buffer rather than having to memcpy().  What
>> this patch does is postpone consuming the data further, either until
>> we can free up at least a quarter of the ring buffer or until we need
>> to wait for more data. It seemed worthwhile to free up space in the
>> ring buffer occasionally even if we weren't to the point of waiting
>> yet, so that the sender has an opportunity to write new data into that
>> space if it happens to still be running.
>
> What I'm trying to suggest is that instead of postponing an update of
> mq_bytes_read (by storing amount of already performed reads in
> mqh_consume_pending), we continue to update mq_bytes_read but only set
> the latch if your above thresholds are crossed. That way a burst of
> writes can fully fill the ringbuffer, but the cost of doing a SetLatch()
> is amortized. In my testing SetLatch() was the expensive part, not the
> necessary barriers in shm_mq_inc_bytes_read().

OK, I'll try to check how feasible that would be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] Faster processing at Gather node

2018-02-07 Thread Andres Freund
Hi,

On 2018-01-25 12:09:23 -0500, Robert Haas wrote:
> > Perhaps a short benchmark for 32bit systems using shm_mq wouldn't hurt?
> > I suspect there won't be much of a performance impact, but it's probably
> > worth checking.
>
> I don't think I understand your concern here.  If this is used on a
> system where we're emulating atomics and barriers in painful ways, it
> might hurt performance, but I think we have a policy of not caring.

Well, it's more than just systems like that - for 64bit atomics we
sometimes do fall back to spinlock based atomics on 32bit systems, even
if they support 32 bit atomics.


> Also, I don't know where I'd find a 32-bit system to test.

You can compile with -m32 on reasonable systems ;)


> >>   Assert(used <= ringsize);
> >>   available = Min(ringsize - used, nbytes - sent);
> >>
> >>   /* Bail out if the queue has been detached. */
> >> - if (detached)
> >> + if (mq->mq_detached)
> >
> > Hm, do all paths here guarantee that mq->mq_detached won't be stored on
> > the stack / register in the first iteration, and then not reread any
> > further? I think it's fine because every branch of the if below ends up
> > in a syscall / barrier, but it might be worth noting on that here.
>
> Aargh.  I hate compilers.  I added a comment.  Should I think about
> changing mq_detached to pg_atomic_uint32 instead?

I think a pg_compiler_barrier() would suffice to alleviate my concern,
right? If you wanted to go for an atomic, using pg_atomic_flag would
probably be more appropriate than pg_atomic_uint32.


> >> - /* Write as much data as we can via a single 
> >> memcpy(). */
> >> + /*
> >> +  * Write as much data as we can via a single 
> >> memcpy(). Make sure
> >> +  * these writes happen after the read of 
> >> mq_bytes_read, above.
> >> +  * This barrier pairs with the one in 
> >> shm_mq_inc_bytes_read.
> >> +  */
> >
> > s/above/above. Otherwise a newer mq_bytes_read could become visible
> > before the corresponding reads have fully finished./?
>
> I don't find that very clear.  A newer mq_bytes_read could become
> visible whenever, and a barrier doesn't prevent that from happening.

Well, my point was that the barrier prevents the write to
mq_bytes_read becoming visible before the corresponding reads have
finished. Which then would mean the memcpy() could overwrite them. And a
barrier *does* prevent that from happening.

I don't think this is the same as:

> What it does is ensure (together with the one in
> shm_mq_inc_bytes_read) that we don't try to read bytes that aren't
> fully *written* yet.

which seems much more about the barrier in shm_mq_inc_bytes_written()?


> Generally, my mental model is that barriers make things happen in
> program order rather than some other order that the CPU thinks would
> be fun.  Imagine a single-core server doing all of this stuff the "old
> school" way.  If the reader puts data into the queue before
> advertising its presence and the writer finishes using the data from
> the queue before advertising its consumption, then everything works.
> If you do anything else, it's flat busted, even on that single-core
> system, because a context switch could happen at any time, and then
> you might read data that isn't written yet.  The barrier just ensures
> that we get that order of execution even on fancy modern hardware, but
> I'm not sure how much of that we really need to explain here.

IDK, I find it nontrivial to understand individual uses of
barriers. There are often multiple non-isometric ways to use barriers, and
the logic for why a specific one is correct isn't always obvious.
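
[As a neutral illustration of the pairing being debated -- a sketch of the
ordering protocol only, not the exact committed code:]

	/* Sender: write the payload first, then advertise it. */
	memcpy(&mq->mq_ring[offset], data, sendnow);
	pg_write_barrier();		/* keep ring writes ahead of the counter update */
	pg_atomic_write_u64(&mq->mq_bytes_written, wb + sendnow);
	SetLatch(&mq->mq_receiver->procLatch);

	/* Receiver: read the counter first, then the payload is safe to read. */
	wb = pg_atomic_read_u64(&mq->mq_bytes_written);
	pg_read_barrier();		/* pairs with the write barrier above */
	memcpy(dest, &mq->mq_ring[offset], available);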


> >> + pg_memory_barrier();
> >>   memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
> >>  (char *) data + sent, sendnow);
> >>   sent += sendnow;
> >
> > Btw, this mq_ring_offset stuff seems a bit silly, why don't we use
> > proper padding/union in the struct to make it unnecessary to do that bit
> > of offset calculation every time? I think it currently prevents
> > efficient address calculation instructions from being used.
>
> Well, the root cause -- aside from me being a fallible human being
> with only limited programming skills -- is that I wanted the parallel
> query code to be able to request whatever queue size it preferred
> without having to worry about how many bytes of that space was going
> to get consumed by overhead.

What I meant is that instead of doing
struct shm_mq
{
...
	bool		mq_detached;
	uint8		mq_ring_offset;
	char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
};

it'd be possible to do something like

{
...
	bool		mq_detached;
	union {
		char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
		double		forcea

Re: [HACKERS] [POC] Faster processing at Gather node

2018-01-25 Thread Robert Haas
On Tue, Jan 9, 2018 at 7:09 PM, Andres Freund  wrote:
>> + * mq_sender and mq_bytes_written can only be changed by the sender.
>> + * mq_receiver and mq_sender are protected by mq_mutex, although, 
>> importantly,
>> + * they cannot change once set, and thus may be read without a lock once 
>> this
>> + * is known to be the case.
>
> I don't recall our conversation around this anymore, and haven't read
> down far enough to see the relevant code. Lest I forget: Such construct
> often need careful use of barriers.

I think the only thing the code assumes here is that if we previously
read the value with the spinlock and didn't get NULL, we can later
read the value without the spinlock and count on seeing the same value
we saw previously.  I think that's safe enough.
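
[In code form, the pattern being relied on is roughly the following;
an illustrative sketch, not a copy of the committed function.]

	if (!mqh->mqh_counterparty_attached)
	{
		/* Slow path: use the spinlock until a non-NULL receiver is seen. */
		SpinLockAcquire(&mq->mq_mutex);
		receiver = mq->mq_receiver;
		SpinLockRelease(&mq->mq_mutex);
		if (receiver == NULL)
			return SHM_MQ_WOULD_BLOCK;	/* or wait, depending on nowait */
		mqh->mqh_counterparty_attached = true;
	}
	else
	{
		/*
		 * Fast path: mq_receiver was seen non-NULL once and can never
		 * change again, so an unlocked read must return the same value.
		 */
		receiver = mq->mq_receiver;
	}
	SetLatch(&receiver->procLatch);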

> s/should/is/ or similar?

I prefer it the way that I have it.

> Perhaps a short benchmark for 32bit systems using shm_mq wouldn't hurt?
> I suspect there won't be much of a performance impact, but it's probably
> worth checking.

I don't think I understand your concern here.  If this is used on a
system where we're emulating atomics and barriers in painful ways, it
might hurt performance, but I think we have a policy of not caring.

Also, I don't know where I'd find a 32-bit system to test.

>> - * Importantly, mq_ring can be safely read and written without a lock.  Were
>> - * this not the case, we'd have to hold the spinlock for much longer
>> - * intervals, and performance might suffer.  Fortunately, that's not
>> - * necessary.  At any given time, the difference between mq_bytes_read and
>> + * At any given time, the difference between mq_bytes_read and
>
> Hm, why did you remove the first part about mq_ring itself?

Bad editing.  Restored.

>> @@ -848,18 +868,19 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, 
>> const void *data,
>>
>>   while (sent < nbytes)
>>   {
>> - booldetached;
>>   uint64  rb;
>> + uint64  wb;
>>
>>   /* Compute number of ring buffer bytes used and available. */
>> - rb = shm_mq_get_bytes_read(mq, &detached);
>> - Assert(mq->mq_bytes_written >= rb);
>> - used = mq->mq_bytes_written - rb;
>> + rb = pg_atomic_read_u64(&mq->mq_bytes_read);
>> + wb = pg_atomic_read_u64(&mq->mq_bytes_written);
>> + Assert(wb >= rb);
>> + used = wb - rb;
>
> Just to make sure my understanding is correct: No barriers needed here
> because "bytes_written" is only written by the sending backend, and
> "bytes_read" cannot lap it. Correct?

We can't possibly read a stale value of mq_bytes_written because we
are the only process that can write it.  It's possible that the
receiver has just increased mq_bytes_read and that the change isn't
visible to us yet, but if so, the receiver's also going to set our
latch, or has done so already.  So the worst thing that happens is
that we decide to sleep because it looks like no space is available
and almost immediately get woken up because there really is space.

>>   Assert(used <= ringsize);
>>   available = Min(ringsize - used, nbytes - sent);
>>
>>   /* Bail out if the queue has been detached. */
>> - if (detached)
>> + if (mq->mq_detached)
>
> Hm, do all paths here guarantee that mq->mq_detached won't be stored on
> the stack / register in the first iteration, and then not reread any
> further? I think it's fine because every branch of the if below ends up
> in a syscall / barrier, but it might be worth noting on that here.

Aargh.  I hate compilers.  I added a comment.  Should I think about
changing mq_detached to pg_atomic_uint32 instead?

> Perhaps mention that this could lead to spuriously signalling the wrong
> backend in case of detach, but that that is fine?

I think that's a general risk of latches that doesn't need to be
specifically recapitulated here.

> I know the scheme isn't new, but I do find it not immediately obvious
> that 'wb' is short for 'bytes_written'.

Sorry.

>> - /* Write as much data as we can via a single memcpy(). 
>> */
>> + /*
>> +  * Write as much data as we can via a single memcpy(). 
>> Make sure
>> +  * these writes happen after the read of 
>> mq_bytes_read, above.
>> +  * This barrier pairs with the one in 
>> shm_mq_inc_bytes_read.
>> +  */
>
> s/above/above. Otherwise a newer mq_bytes_read could become visible
> before the corresponding reads have fully finished./?

I don't find that very clear.  A newer mq_bytes_read could become
visible whenever, and a barrier doesn't prevent that from happening.
What it does is ensure (together with the one in
shm_mq_inc_bytes_read) that we don't try to read bytes that aren't
fully *written* yet.

Generally, my mental model is that barriers make things happen in
program order rather than some other order that the CPU thinks would
be fun.  Imagine a single-core server doing all of this stuff the "old
school" way.  If the reader puts data into the queue before
advertising its presence and the writer finishes using the data from
the queue before advertising its consumption, then everything works.
If you do anything else, it's flat busted, even on that single-core
system, because a context switch could happen at any time, and then
you might read data that isn't written yet.  The barrier just ensures
that we get that order of execution even on fancy modern hardware, but
I'm not sure how much of that we really need to explain here.

Re: [HACKERS] [POC] Faster processing at Gather node

2018-01-09 Thread Andres Freund
Hi,

On 2017-12-04 10:50:53 -0500, Robert Haas wrote:
> Subject: [PATCH 1/2] shm-mq-less-spinlocks-v2


> + * mq_sender and mq_bytes_written can only be changed by the sender.
> + * mq_receiver and mq_sender are protected by mq_mutex, although, 
> importantly,
> + * they cannot change once set, and thus may be read without a lock once this
> + * is known to be the case.

I don't recall our conversation around this anymore, and haven't read
down far enough to see the relevant code. Lest I forget: Such constructs
often need careful use of barriers.


> - * mq_detached can be set by either the sender or the receiver, so the mutex
> - * must be held to read or write it.  Memory barriers could be used here as
> - * well, if needed.
> + * mq_bytes_read and mq_bytes_written are not protected by the mutex.  
> Instead,
> + * they are written atomically using 8 byte loads and stores.  Memory 
> barriers
> + * must be carefully used to synchronize reads and writes of these values 
> with
> + * reads and writes of the actual data in mq_ring.
> + *
> + * mq_detached needs no locking.  It can be set by either the sender or the
> + * receiver, but only ever from false to true, so redundant writes don't
> + * matter.  It is important that if we set mq_detached and then set the
> + * counterparty's latch, the counterparty must be certain to see the change
> + * after waking up.  Since SetLatch begins with a memory barrier and 
> ResetLatch
> + * ends with one, this should be OK.

s/should/is/ or similar?


Perhaps a short benchmark for 32bit systems using shm_mq wouldn't hurt?
I suspect there won't be much of a performance impact, but it's probably
worth checking.


>   * mq_ring_size and mq_ring_offset never change after initialization, and
>   * can therefore be read without the lock.
>   *
> - * Importantly, mq_ring can be safely read and written without a lock.  Were
> - * this not the case, we'd have to hold the spinlock for much longer
> - * intervals, and performance might suffer.  Fortunately, that's not
> - * necessary.  At any given time, the difference between mq_bytes_read and
> + * At any given time, the difference between mq_bytes_read and

Hm, why did you remove the first part about mq_ring itself?


> @@ -848,18 +868,19 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, 
> const void *data,
>  
>   while (sent < nbytes)
>   {
> - booldetached;
>   uint64  rb;
> + uint64  wb;
>  
>   /* Compute number of ring buffer bytes used and available. */
> - rb = shm_mq_get_bytes_read(mq, &detached);
> - Assert(mq->mq_bytes_written >= rb);
> - used = mq->mq_bytes_written - rb;
> + rb = pg_atomic_read_u64(&mq->mq_bytes_read);
> + wb = pg_atomic_read_u64(&mq->mq_bytes_written);
> + Assert(wb >= rb);
> + used = wb - rb;

Just to make sure my understanding is correct: No barriers needed here
because "bytes_written" is only written by the sending backend, and
"bytes_read" cannot lap it. Correct?


>   Assert(used <= ringsize);
>   available = Min(ringsize - used, nbytes - sent);
>  
>   /* Bail out if the queue has been detached. */
> - if (detached)
> + if (mq->mq_detached)

Hm, do all paths here guarantee that mq->mq_detached won't be stored on
the stack / register in the first iteration, and then not reread any
further? I think it's fine because every branch of the if below ends up
in a syscall / barrier, but it might be worth noting on that here.


> + /*
> +  * Since mq->mqh_counterparty_attached is known to be 
> true at this
> +  * point, mq_receiver has been set, and it can't change 
> once set.
> +  * Therefore, we can read it without acquiring the 
> spinlock.
> +  */
> + Assert(mqh->mqh_counterparty_attached);
> + SetLatch(&mq->mq_receiver->procLatch);

Perhaps mention that this could lead to spuriously signalling the wrong
backend in case of detach, but that that is fine?

>   /* Skip manipulation of our latch if nowait = true. */
>   if (nowait)
> @@ -934,10 +953,18 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, 
> const void *data,
>   }
>   else
>   {
> - Sizeoffset = mq->mq_bytes_written % 
> (uint64) ringsize;
> - Sizesendnow = Min(available, ringsize - 
> offset);
> + Sizeoffset;
> + Sizesendnow;
> +
> + offset = wb % (uint64) ringsize;
> + sendnow = Min(available, ringsize - offset);

I know the scheme isn't new, but I do find it not immediately obvious
that 'wb' is short for 'bytes_written'.


Re: [HACKERS] [POC] Faster processing at Gather node

2017-12-05 Thread Rafia Sabih
On Mon, Dec 4, 2017 at 9:20 PM, Robert Haas  wrote:

> On Sun, Dec 3, 2017 at 10:30 PM, Amit Kapila 
> wrote:
> > I thought there are some cases (though less) where we want to Shutdown
> > the nodes (ExecShutdownNode) earlier and release the resources sooner.
> > However, if you are not completely sure about this change, then we can
> > leave it as it.  Thanks for sharing your thoughts.
>
> OK, thanks.  I committed that patch, after first running 100 million
> tuples through a Gather over and over again to test for leaks.
> Hopefully I haven't missed anything here, but it looks like it's
> solid.  Here once again are the remaining patches.  While the
> already-committed patches are nice, these two are the ones that
>

Hi,
I spent some time verifying this memory-leak patch for the
gather-merge case and I too found it good. In the query I tried, around 10
million tuples were passed through gather-merge. Analysing the output of
top, the memory usage looks acceptable, and it gets freed once the query is
completed. Since I was testing on my local system only, I tested with up to 8
workers and didn't find any memory leaks for the queries I tried.
The test case is in the attached file.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


ml_test_query.sql
Description: Binary data


Re: [HACKERS] [POC] Faster processing at Gather node

2017-12-04 Thread Robert Haas
On Sun, Dec 3, 2017 at 10:30 PM, Amit Kapila  wrote:
> I thought there are some cases (though less) where we want to Shutdown
> the nodes (ExecShutdownNode) earlier and release the resources sooner.
> However, if you are not completely sure about this change, then we can
> leave it as it.  Thanks for sharing your thoughts.

OK, thanks.  I committed that patch, after first running 100 million
tuples through a Gather over and over again to test for leaks.
Hopefully I haven't missed anything here, but it looks like it's
solid.  Here once again are the remaining patches.  While the
already-committed patches are nice, these two are the ones that
actually produced big improvements in my testing, so it would be good
to move them along.  Any reviews appreciated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-shm-mq-less-spinlocks-v2.patch
Description: Binary data


0002-shm-mq-reduce-receiver-latch-set-v1.patch
Description: Binary data


Re: [HACKERS] [POC] Faster processing at Gather node

2017-12-03 Thread Amit Kapila
On Fri, Dec 1, 2017 at 8:04 PM, Robert Haas  wrote:
> On Sun, Nov 26, 2017 at 3:15 AM, Amit Kapila  wrote:
>> Yeah and I think something like that can happen after your patch
>> because now the memory for tuples returned via TupleQueueReaderNext
>> will be allocated in ExecutorState and that can last for long.   I
>> think it is better to free memory, but we can leave it as well if you
>> don't feel it important.  In any case, I have written a patch, see if
>> you think it makes sense.
>
> Well, I don't really know.  My intuition is that in most cases after
> ExecShutdownGatherMergeWorkers() we will very shortly thereafter call
> ExecutorEnd() and everything will go away.
>

I thought there are some cases (though few) where we want to shut down
the nodes (ExecShutdownNode) earlier and release the resources sooner.
However, if you are not completely sure about this change, then we can
leave it as it is.  Thanks for sharing your thoughts.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] [POC] Faster processing at Gather node

2017-12-01 Thread Robert Haas
On Sun, Nov 26, 2017 at 3:15 AM, Amit Kapila  wrote:
> Yeah and I think something like that can happen after your patch
> because now the memory for tuples returned via TupleQueueReaderNext
> will be allocated in ExecutorState and that can last for long.   I
> think it is better to free memory, but we can leave it as well if you
> don't feel it important.  In any case, I have written a patch, see if
> you think it makes sense.

Well, I don't really know.  My intuition is that in most cases after
ExecShutdownGatherMergeWorkers() we will very shortly thereafter call
ExecutorEnd() and everything will go away.  Maybe that's wrong, but
Tom put that call where it is in
2d44c58c79aeef2d376be0141057afbb9ec6b5bc, and he could have put it
inside ExecShutdownGatherMergeWorkers() instead.  Now maybe he didn't
consider that approach, but Tom is usually smart about stuff like
that...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-29 Thread Michael Paquier
On Sun, Nov 26, 2017 at 5:15 PM, Amit Kapila  wrote:
> Yeah and I think something like that can happen after your patch
> because now the memory for tuples returned via TupleQueueReaderNext
> will be allocated in ExecutorState and that can last for long.   I
> think it is better to free memory, but we can leave it as well if you
> don't feel it important.  In any case, I have written a patch, see if
> you think it makes sense.

OK. I can see some fresh and unreviewed patches so moved to next CF.
-- 
Michael



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-26 Thread Amit Kapila
On Sat, Nov 25, 2017 at 9:13 PM, Robert Haas  wrote:
> On Wed, Nov 22, 2017 at 8:36 AM, Amit Kapila  wrote:
>>> remove-memory-leak-protection-v1.patch removes the memory leak
>>> protection that Tom installed upon discovering that the original
>>> version of tqueue.c leaked memory like crazy.  I think that it
>>> shouldn't do that any more, courtesy of
>>> 6b65a7fe62e129d5c2b85cd74d6a91d8f7564608.  Assuming that's correct, we
>>> can avoid a whole lot of tuple copying in Gather Merge and a much more
>>> modest amount of overhead in Gather.  Since my test case exercised
>>> Gather Merge, this bought ~400 ms or so.
>>
>> I think Tom didn't installed memory protection in nodeGatherMerge.c
>> related to an additional copy of tuple.  I could see it is present in
>> the original commit of Gather Merge
>> (355d3993c53ed62c5b53d020648e4fbcfbf5f155).  Tom just avoided applying
>> heap_copytuple to a null tuple in his commit
>> 04e9678614ec64ad9043174ac99a25b1dc45233a.  I am not sure whether you
>> are just referring to the protection Tom added in nodeGather.c,  If
>> so, it is not clear from your mail.
>
> Yes, that's what I mean.  What got done for Gather Merge was motivated
> by what Tom did for Gather.  Sorry for not expressing the idea more
> precisely.
>
>> I think we don't need the additional copy of tuple in
>> nodeGatherMerge.c and your patch seem to be doing the right thing.
>> However, after your changes, it looks slightly odd that
>> gather_merge_clear_tuples() is explicitly calling heap_freetuple for
>> the tuples allocated by tqueue.c, maybe we can add some comment.  It
>> was much clear before this patch as nodeGatherMerge.c was explicitly
>> copying the tuples that it is freeing.
>
> OK, I can add a comment about that.
>

Sure, I think apart from that the patch looks good to me.  I think a
good test of this patch could be to try to pass many tuples via gather
merge and see if there is any leak in memory.  Do you have any other
ideas?

>> Isn't it better to explicitly call gather_merge_clear_tuples in
>> ExecEndGatherMerge so that we can free the memory for tuples allocated
>> in this node rather than relying on reset/free of MemoryContext in
>> which those tuples are allocated?
>
> Generally relying on reset/free of a MemoryContext is cheaper.
> Typically you only want to free manually if the freeing would
> otherwise not happen soon enough.
>

Yeah and I think something like that can happen after your patch
because now the memory for tuples returned via TupleQueueReaderNext
will be allocated in ExecutorState and that can last for a long time.  I
think it is better to free the memory, but we can leave it as is if you
don't feel it's important.  In any case, I have written a patch; see if
you think it makes sense.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


release_memory_at_gather_merge_shutdown_v1.patch
Description: Binary data


Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-25 Thread Robert Haas
On Wed, Nov 22, 2017 at 8:36 AM, Amit Kapila  wrote:
>> remove-memory-leak-protection-v1.patch removes the memory leak
>> protection that Tom installed upon discovering that the original
>> version of tqueue.c leaked memory like crazy.  I think that it
>> shouldn't do that any more, courtesy of
>> 6b65a7fe62e129d5c2b85cd74d6a91d8f7564608.  Assuming that's correct, we
>> can avoid a whole lot of tuple copying in Gather Merge and a much more
>> modest amount of overhead in Gather.  Since my test case exercised
>> Gather Merge, this bought ~400 ms or so.
>
> I think Tom didn't installed memory protection in nodeGatherMerge.c
> related to an additional copy of tuple.  I could see it is present in
> the original commit of Gather Merge
> (355d3993c53ed62c5b53d020648e4fbcfbf5f155).  Tom just avoided applying
> heap_copytuple to a null tuple in his commit
> 04e9678614ec64ad9043174ac99a25b1dc45233a.  I am not sure whether you
> are just referring to the protection Tom added in nodeGather.c,  If
> so, it is not clear from your mail.

Yes, that's what I mean.  What got done for Gather Merge was motivated
by what Tom did for Gather.  Sorry for not expressing the idea more
precisely.

> I think we don't need the additional copy of tuple in
> nodeGatherMerge.c and your patch seem to be doing the right thing.
> However, after your changes, it looks slightly odd that
> gather_merge_clear_tuples() is explicitly calling heap_freetuple for
> the tuples allocated by tqueue.c, maybe we can add some comment.  It
> was much clear before this patch as nodeGatherMerge.c was explicitly
> copying the tuples that it is freeing.

OK, I can add a comment about that.

> Isn't it better to explicitly call gather_merge_clear_tuples in
> ExecEndGatherMerge so that we can free the memory for tuples allocated
> in this node rather than relying on reset/free of MemoryContext in
> which those tuples are allocated?

Generally relying on reset/free of a MemoryContext is cheaper.
Typically you only want to free manually if the freeing would
otherwise not happen soon enough.
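
[Schematically, the trade-off is just this -- a generic illustration, not
code from the patch; tupleContext here is a made-up per-node context:]

	/*
	 * Usually cheaper: let the tuples die with their memory context, which
	 * frees everything in one bulk operation.
	 */
	MemoryContextReset(node->tupleContext);

	/*
	 * Only worth it when that context lives too long: free each tuple
	 * explicitly as soon as it is no longer needed.
	 */
	heap_freetuple(tuple);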

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-23 Thread Rafia Sabih
On Thu, Nov 16, 2017 at 12:24 AM, Andres Freund  wrote:
> Hi,
>
> On 2017-11-15 13:48:18 -0500, Robert Haas wrote:
>> I think that we need a little bit deeper analysis here to draw any
>> firm conclusions.
>
> Indeed.
>
>
>> I suspect that one factor is that many of the queries actually send
>> very few rows through the Gather.
>
> Yep. I kinda wonder if the same result would present if the benchmarks
> were run with parallel_leader_participation. The theory being what were
> seing is just that the leader doesn't accept any tuples, and the large
> queue size just helps because workers can run for longer.
>
I ran Q12 with parallel_leader_participation = off and could not get
any performance improvement with the patches given by Robert.  The
result was the same for head as well.  The query plan also remains
unaffected by the value of this parameter.

Here are the details of the experiment,
TPC-H scale factor = 20,
work_mem = 1GB
random_page_cost = seq_page_cost = 0.1
max_parallel_workers_per_gather = 4

PG commit: 745948422c799c1b9f976ee30f21a7aac050e0f3

Please find the attached file for the explain analyse output for
either values of parallel_leader_participation and patches.
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
with parallel_leader_participation = 1;

                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1001.19..504469.79 rows=1 width=27) (actual time=21833.206..21833.207 rows=1 loops=1)
   ->  GroupAggregate  (cost=1001.19..3525281.42 rows=7 width=27) (actual time=21833.203..21833.203 rows=1 loops=1)
         Group Key: lineitem.l_shipmode
         ->  Gather Merge  (cost=1001.19..3515185.81 rows=576888 width=27) (actual time=4.388..21590.757 rows=311095 loops=1)
               Workers Planned: 4
               Workers Launched: 4
               ->  Nested Loop  (cost=1.13..3445472.82 rows=144222 width=27) (actual time=0.150..4399.384 rows=62247 loops=5)
                     ->  Parallel Index Scan using l_shipmode on lineitem  (cost=0.57..3337659.69 rows=144222 width=19) (actual time=0.111..3772.865 rows=62247 loops=5)
                           Index Cond: (l_shipmode = ANY ('{"REG AIR",RAIL}'::bpchar[]))
                           Filter: ((l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND (l_receiptdate >= '1995-01-01'::date) AND (l_receiptdate < '1996-01-01 00:00:00'::timestamp without time zone))
                           Rows Removed by Filter: 3367603
                     ->  Index Scan using orders_pkey on orders  (cost=0.56..0.75 rows=1 width=20) (actual time=0.009..0.009 rows=1 loops=311236)
                           Index Cond: (o_orderkey = lineitem.l_orderkey)
 Planning time: 0.526 ms
 Execution time: 21835.922 ms
(15 rows)

postgres=# set parallel_leader_participation =0;
SET
postgres=# \i /data/rafia.sabih/dss/queries/12.sql 

                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1001.19..504469.79 rows=1 width=27) (actual time=21179.065..21179.066 rows=1 loops=1)
   ->  GroupAggregate  (cost=1001.19..3525281.42 rows=7 width=27) (actual time=21179.064..21179.064 rows=1 loops=1)
         Group Key: lineitem.l_shipmode
         ->  Gather Merge  (cost=1001.19..3515185.81 rows=576888 width=27) (actual time=4.201..20941.385 rows=311095 loops=1)
               Workers Planned: 4
               Workers Launched: 4
               ->  Nested Loop  (cost=1.13..3445472.82 rows=144222 width=27) (actual time=0.187..5105.780 rows=77797 loops=4)
                     ->  Parallel Index Scan using l_shipmode on lineitem  (cost=0.57..3337659.69 rows=144222 width=19) (actual time=0.149..4362.235 rows=77797 loops=4)
                           Index Cond: (l_shipmode = ANY ('{"REG AIR",RAIL}'::bpchar[]))
                           Filter: ((l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND (l_receiptdate >= '1995-01-01'::date) AND (l_receiptdate < '1996-01-01 00:00:00'::timestamp without time zone))
                           Rows Removed by Filter: 4208802
                     ->  Index Scan using orders_pkey on orders  (cost=0.56..0.75 rows=1 width=20) (actual time=0.008..0.008 rows=1 loops=311187)
                           Index Cond: (o_orderkey = l

Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-22 Thread Amit Kapila
On Sat, Nov 18, 2017 at 7:23 PM, Amit Kapila  wrote:
> On Fri, Nov 10, 2017 at 8:39 PM, Robert Haas  wrote:
>> On Fri, Nov 10, 2017 at 5:44 AM, Amit Kapila  wrote:
>>> I am seeing the assertion failure as below on executing the above
>>> mentioned Create statement:
>>
>
> - if (!ExecContextForcesOids(&gatherstate->ps, &hasoid))
> + if (!ExecContextForcesOids(outerPlanState(gatherstate), &hasoid))
>   hasoid = false;
>
> Don't we need a similar change in nodeGatherMerge.c (in function
> ExecInitGatherMerge)?
>
>> And here are the other patches again, too.
>>
>
> The 0001* patch doesn't apply, please find the attached rebased
> version which I have used to verify the patch.
>
> Now, along with 0001* and 0002*, 0003-skip-gather-project-v2 looks
> good to me.  I think we can proceed with the commit of 0001*~0003*
> patches unless somebody else has any comments.
>

I see that you have committed 0001* and 0002* patches, so continuing my review.

Review of 0006-remove-memory-leak-protection-v1

> remove-memory-leak-protection-v1.patch removes the memory leak
> protection that Tom installed upon discovering that the original
> version of tqueue.c leaked memory like crazy.  I think that it
> shouldn't do that any more, courtesy of
> 6b65a7fe62e129d5c2b85cd74d6a91d8f7564608.  Assuming that's correct, we
> can avoid a whole lot of tuple copying in Gather Merge and a much more
> modest amount of overhead in Gather.  Since my test case exercised
> Gather Merge, this bought ~400 ms or so.

I think Tom didn't install the memory protection in nodeGatherMerge.c
related to the additional copy of the tuple.  I can see it is present in
the original commit of Gather Merge
(355d3993c53ed62c5b53d020648e4fbcfbf5f155).  Tom just avoided applying
heap_copytuple to a null tuple in his commit
04e9678614ec64ad9043174ac99a25b1dc45233a.  I am not sure whether you
are just referring to the protection Tom added in nodeGather.c.  If
so, it is not clear from your mail.

I think we don't need the additional copy of the tuple in
nodeGatherMerge.c, and your patch seems to be doing the right thing.
However, after your changes, it looks slightly odd that
gather_merge_clear_tuples() is explicitly calling heap_freetuple for
the tuples allocated by tqueue.c; maybe we can add a comment.  It
was much clearer before this patch, as nodeGatherMerge.c was explicitly
copying the tuples that it is freeing.

Isn't it better to explicitly call gather_merge_clear_tuples in
ExecEndGatherMerge so that we can free the memory for tuples allocated
in this node rather than relying on reset/free of MemoryContext in
which those tuples are allocated?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-18 Thread Amit Kapila
On Fri, Nov 10, 2017 at 8:39 PM, Robert Haas  wrote:
> On Fri, Nov 10, 2017 at 5:44 AM, Amit Kapila  wrote:
>> I am seeing the assertion failure as below on executing the above
>> mentioned Create statement:
>>
>> TRAP: FailedAssertion("!(!(tup->t_data->t_infomask & 0x0008))", File:
>> "heapam.c", Line: 2634)
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>
> OK, I see it now.  Not sure why I couldn't reproduce this before.
>
> I think the problem is not actually with the code that I just wrote.
> What I'm seeing is that the slot descriptor's tdhasoid value is false
> for both the funnel slot and the result slot; therefore, we conclude
> that no projection is needed to remove the OIDs.  That seems to make
> sense: if the funnel slot doesn't have OIDs and the result slot
> doesn't have OIDs either, then we don't need to remove them.
> Unfortunately, even though the funnel slot descriptor is marked
> tdhasoid = false, the tuples being stored there actually do have
> OIDs.  And that is because they are coming from the underlying
> sequential scan, which *also* has OIDs despite the fact that tdhasoid
> for its slot is false.
>
> This had me really confused until I realized that there are two
> processes involved.  The problem is that we don't pass eflags down to
> the child process -- so in the user backend, everybody agrees that
> there shouldn't be OIDs anywhere, because EXEC_FLAG_WITHOUT_OIDS is
> set.  In the parallel worker, however, it's not set, so the worker
> feels free to do whatever comes naturally, and in this test case that
> happens to be returning tuples with OIDs.  Patch for this attached.
>
> I also noticed that the code that initializes the funnel slot is using
> its own PlanState rather than the outer plan's PlanState to call
> ExecContextForcesOids.  I think that's formally incorrect, because the
> goal is to end up with a slot that is the same as the outer plan's
> slot.  It doesn't matter because ExecContextForcesOids doesn't care
> which PlanState it gets passed, but the comments in
> ExecContextForcesOids imply that somebody it might, so perhaps it's
> best to clean that up.  Patch for this attached, too.
>

- if (!ExecContextForcesOids(&gatherstate->ps, &hasoid))
+ if (!ExecContextForcesOids(outerPlanState(gatherstate), &hasoid))
  hasoid = false;

Don't we need a similar change in nodeGatherMerge.c (in function
ExecInitGatherMerge)?

> And here are the other patches again, too.
>

The 0001* patch doesn't apply, please find the attached rebased
version which I have used to verify the patch.

Now, along with 0001* and 0002*, 0003-skip-gather-project-v2 looks
good to me.  I think we can proceed with the commit of 0001*~0003*
patches unless somebody else has any comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-pass-eflags-to-worker-v2.patch
Description: Binary data


Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-16 Thread Robert Haas
On Thu, Nov 16, 2017 at 10:23 AM, Ants Aasma  wrote:
> For the Gather Merge driven by Parallel Index Scan case it seems to me
> that the correct queue size is one that can store two index pages
> worth of tuples. Additional space will always help buffer any
> performance variations, but there should be a step function somewhere
> around 1+1/n_workers pages. I wonder if the queue could be dynamically
> sized based on the driving scan. With some limits of course as parent
> nodes to the parallel index scan can increase the row count by
> arbitrary amounts.

Currently, Gather Merge can store 100 tuples + as much more stuff as
fits in a 64kB queue.  That should already be more than 2 index pages,
I would think, although admittedly I haven't tested.
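
[A rough, untested back-of-the-envelope, assuming ~20 bytes per index entry
and ~50 bytes per queued tuple message -- both numbers are assumptions, not
measurements:

	two index pages:  2 * (8192 / ~20)   =  ~800 entries
	queue capacity:   100 + 65536 / ~50  =  ~1400 tuples

so for narrow tuples the current sizing does come out comfortably above two
pages, though wide tuples would change the picture.]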

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-16 Thread Ants Aasma
On Thu, Nov 16, 2017 at 6:42 AM, Robert Haas  wrote:
> The problem here is that we have no idea how big the queue needs to
> be.  The workers will always be happy to generate tuples faster than
> the leader can read them, if that's possible, but it will only
> sometimes help performance to let them do so.   I think in most cases
> we'll end up allocating the local queue - because the workers can
> generate faster than the leader can read - but only occasionally will
> it make anything faster.

For the Gather Merge driven by Parallel Index Scan case it seems to me
that the correct queue size is one that can store two index pages
worth of tuples. Additional space will always help buffer any
performance variations, but there should be a step function somewhere
around 1+1/n_workers pages. I wonder if the queue could be dynamically
sized based on the driving scan. With some limits of course as parent
nodes to the parallel index scan can increase the row count by
arbitrary amounts.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-15 Thread Robert Haas
On Wed, Nov 15, 2017 at 9:34 PM, Amit Kapila  wrote:
> The main advantage of local queue idea is that it won't consume any
> memory by default for running parallel queries.  It would consume
> memory when required and accordingly help in speeding up those cases.
> However, increasing the size of shared queues by default will increase
> memory usage for cases where it is even not required.   Even, if we
> provide a GUC to tune the amount of shared memory, I am not sure how
> convenient it will be for the user to use it as it needs different
> values for different workloads and it is not easy to make a general
> recommendation.  I am not telling we can't work-around this with the
> help of GUC, but it seems like it will be better if we have some
> autotune mechanism and I think Rafia's patch is one way to achieve it.

It's true this might save memory in some cases.  If we never generate
very many tuples, then we won't allocate the local queue and we'll
save memory.  That's mildly nice.

On the other hand, the local queue may also use a bunch of memory
without improving performance, as in the case of Rafia's test where
she raised the queue size 10x and it didn't help.
Alternatively, it may improve performance by a lot, but use more
memory than necessary to do so.  In Rafia's test results, a 100x
larger queue got it down to 7s; if she'd done 200x instead, I don't
think it would have helped further, but it would have been necessary
to go 200x to get the full benefit if the data had been twice as big.

The problem here is that we have no idea how big the queue needs to
be.  The workers will always be happy to generate tuples faster than
the leader can read them, if that's possible, but it will only
sometimes help performance to let them do so.   I think in most cases
we'll end up allocating the local queue - because the workers can
generate faster than the leader can read - but only occasionally will
it make anything faster.

If what we really want to do is allow the workers to get arbitrarily
far ahead of the leader, we could ditch shm_mq altogether here and use
Thomas's shared tuplestore stuff.  Then you never run out of memory
because you spill to disk.  I'm not sure that's the way to go, though.
It still has the problem that you may let the workers get very far
ahead not just when it helps, but also when it's possible but not
helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-15 Thread Amit Kapila
On Thu, Nov 16, 2017 at 12:18 AM, Robert Haas  wrote:
> On Tue, Nov 14, 2017 at 7:31 AM, Rafia Sabih
>  wrote:
> Similarly, I think that faster_gather_v3.patch is effectively here
> because it lets all the workers run at the same time, not because
> Gather gets any faster.  The local queue is 100x bigger than the
> shared queue, and that's big enough that the workers never have to
> block, so they all run at the same time and things are great.  I don't
> see much advantage in pursuing this route.  For the local queue to
> make sense it needs to have some advantage that we can't get by just
> making the shared queue bigger, which is easier and less code.
>

The main advantage of the local queue idea is that it won't consume any
memory by default for running parallel queries.  It would consume
memory only when required and accordingly help in speeding up those cases.
However, increasing the size of the shared queues by default will increase
memory usage for cases where it is not even required.  Even if we
provide a GUC to tune the amount of shared memory, I am not sure how
convenient it will be for the user to use, as it needs different
values for different workloads and it is not easy to make a general
recommendation.  I am not saying we can't work around this with the
help of a GUC, but it seems like it would be better if we had some
autotune mechanism, and I think Rafia's patch is one way to achieve it.

>  The
> original idea was that we'd reduce latch traffic and spinlock
> contention by moving data from the local queue to the shared queue in
> bulk, but the patches I posted attack those problems more directly.
>

I think the idea was to solve both the problems (shm_mq communication
overhead and Gather Merge related pipeline stalls) with local queue
stuff [1].


[1] - 
https://www.postgresql.org/message-id/CAA4eK1Jk465W2TTWT4J-RP3RXK2bJWEtYY0xhYpnSc1mcEXfkA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-15 Thread Andres Freund
Hi,

On 2017-11-15 13:48:18 -0500, Robert Haas wrote:
> I think that we need a little bit deeper analysis here to draw any
> firm conclusions.

Indeed.


> I suspect that one factor is that many of the queries actually send
> very few rows through the Gather.

Yep. I kinda wonder if the same result would be present if the benchmarks
were run with parallel_leader_participation. The theory being that what we're
seeing is just that the leader doesn't accept any tuples, and the large
queue size just helps because workers can run for longer.


Greetings,

Andres Freund



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-15 Thread Robert Haas
On Tue, Nov 14, 2017 at 7:31 AM, Rafia Sabih
 wrote:
> Case 2: patches applied as in case 1 +
>a) increased PARALLEL_TUPLE_QUEUE_SIZE to 655360
>   No significant change in performance in any query
>b) increased PARALLEL_TUPLE_QUEUE_SIZE to 65536 * 50
>   Performance improved from 20s to 11s for Q12
>c) increased PARALLEL_TUPLE_QUEUE_SIZE to 6553600
>  Q12 shows improvement in performance from 20s to 7s
>
> Case 3: patch applied = faster_gather_v3 as posted at [1]
> Q12 shows improvement in performance from 20s to 8s

I think that we need a little bit deeper analysis here to draw any
firm conclusions.  My own testing showed about a 2x performance
improvement with all 4 patches applied on a query that did a Gather
Merge with many rows.  Now, your testing shows the patches aren't
helping at all.  But what accounts for the difference between your
results and mine?  Without some analysis of that question, this is just a data
point that probably doesn't get us very far.

I suspect that one factor is that many of the queries actually send
very few rows through the Gather.  You didn't send EXPLAIN ANALYZE
outputs for these runs, but I went back and looked at some old tests I
did on a small scale factor and found that, on those tests, Q2, Q6,
Q13, Q14, and Q15 didn't use parallelism at all, while Q1, Q4, Q5, Q7,
Q8, Q9, Q11, Q12, Q19, and Q22 used parallelism, but sent less than
100 rows through Gather.  Obviously, speeding up Gather isn't going to
help at all when only a tiny number of rows are being sent through it.
The remaining seven queries sent the following numbers of rows through
Gather:

3:  ->  Gather Merge  (cost=708490.45..1110533.81 rows=3175456 width=0) (actual time=21932.675..22150.587 rows=118733 loops=1)

10:  ->  Gather Merge  (cost=441168.55..513519.51 rows=574284 width=0) (actual time=15281.370..16319.895 rows=485230 loops=1)

16:  ->  Gather  (cost=1000.00..47363.41 rows=297653 width=40) (actual time=0.414..272.808 rows=297261 loops=1)

17:  ->  Gather  (cost=1000.00..12815.71 rows=2176 width=4) (actual time=2.105..152.966 rows=1943 loops=1)
17:  ->  Gather Merge  (cost=2089193.30..3111740.98 rows=7445304 width=0) (actual time=14071.064..33135.996 rows=9973440 loops=1)

18:  ->  Gather Merge  (cost=3271973.63..7013135.71 rows=29992968 width=0) (actual time=81581.450..81581.594 rows=112 loops=1)

20:  ->  Gather  (cost=1000.00..13368.31 rows=20202 width=4) (actual time=0.361..19.035 rows=21761 loops=1)

21:  ->  Gather  (cost=1024178.86..1024179.27 rows=4 width=34) (actual time=12367.266..12377.991 rows=17176 loops=1)

Of those, Q18 is probably uninteresting because it only sends 112
rows, and Q20 and Q16 are probably uninteresting because the Gather
only executed for 19 ms and 272 ms respectively.  Q21 doesn't look
interesting because its Gather ran for 12377.991 ms and only sent 17176
rows - so the bottleneck is probably generating the tuples, not
sending them.  The places where you'd expect the patch set to help are
where a lot of rows are being sent through the Gather or Gather Merge
node very quickly - so with these plans, Q17 probably has the best
chance of going faster with these patches, and maybe Q3 might benefit
a bit.

Now obviously your plans are different -- otherwise you couldn't be
seeing a speedup on Q12.  So you have to look at the plans and try to
understand what the big picture is here.  Spending a lot of time
running queries where the time taken by Gather is not the bottleneck
is not a good way to figure out whether we've successfully sped up
Gather.  What would be more useful?  How about:

- Once you've identified the queries where Gather seems like it might
be a bottleneck, run perf without the patch set and see whether Gather
or shm_mq related functions show up high in the profile.  If they do,
run perf with the patch set and see if they become less prominent.

- Try running the test cases that Andres and I tried with and without
the patch set.  See if it helps on those queries.  That will help
verify that your testing procedure is correct, and might also reveal
differences in the effectiveness of that patch set on different
hardware.  You could try this experiment on both PPC and x64, or on
both Linux and MacOS, to see whether CPU architecture and/or operating
system plays a role in the effectiveness of the patch.

I think it's a valid finding that increasing the size of the tuple
queue makes Q12 run faster, but I think that's not because it makes
Gather itself any faster.  Rather, it's because there are fewer
pipeline stalls.  With Gather Merge, whenever a tuple queue becomes
empty, the leader becomes unable to return any more tuples until the
process whose queue is empty generates at least one new tuple.  If
there are multiple workers with non-full queues at