Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-10 Thread Robert Haas
On Fri, Nov 10, 2017 at 5:44 AM, Amit Kapila  wrote:
> I am seeing the assertion failure shown below on executing the
> above-mentioned Create statement:
>
> TRAP: FailedAssertion("!(!(tup->t_data->t_infomask & 0x0008))", File:
> "heapam.c", Line: 2634)
> server closed the connection unexpectedly
> This probably means the server terminated abnormally

OK, I see it now.  Not sure why I couldn't reproduce this before.

I think the problem is not actually with the code that I just wrote.
What I'm seeing is that the slot descriptor's tdhasoid value is false
for both the funnel slot and the result slot; therefore, we conclude
that no projection is needed to remove the OIDs.  That seems to make
sense: if the funnel slot doesn't have OIDs and the result slot
doesn't have OIDs either, then we don't need to remove them.
Unfortunately, even though the funnel slot descriptor is marked
tdhasoid = false, the tuples being stored there actually do have
OIDs.  And that is because they are coming from the underlying
sequential scan, which *also* has OIDs despite the fact that tdhasoid
for its slot is false.

This had me really confused until I realized that there are two
processes involved.  The problem is that we don't pass eflags down to
the child process -- so in the user backend, everybody agrees that
there shouldn't be OIDs anywhere, because EXEC_FLAG_WITHOUT_OIDS is
set.  In the parallel worker, however, it's not set, so the worker
feels free to do whatever comes naturally, and in this test case that
happens to be returning tuples with OIDs.  Patch for this attached.
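Roughly, the idea is to stash the leader's es_top_eflags in a fixed-size
per-query struct in the parallel DSM and have the worker start its
executor with that value.  A minimal sketch of that shape (illustrative
only, not the attached patch; the struct, field, and key names here are
assumptions):

/* Sketch only: a fixed-size per-query struct kept in the parallel DSM. */
typedef struct FixedParallelExecutorState
{
    int         eflags;     /* leader's flags, e.g. EXEC_FLAG_WITHOUT_OIDS */
    /* ... other per-query fields ... */
} FixedParallelExecutorState;

/* Leader side, while building the parallel DSM: */
fpes = shm_toc_allocate(pcxt->toc, sizeof(FixedParallelExecutorState));
fpes->eflags = estate->es_top_eflags;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_EXECUTOR_FIXED, fpes);

/* Worker side, in ParallelQueryMain(), instead of hard-coded flags: */
fpes = shm_toc_lookup(toc, PARALLEL_KEY_EXECUTOR_FIXED, false);
ExecutorStart(queryDesc, fpes->eflags);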

I also noticed that the code that initializes the funnel slot is using
its own PlanState rather than the outer plan's PlanState to call
ExecContextForcesOids.  I think that's formally incorrect, because the
goal is to end up with a slot that is the same as the outer plan's
slot.  It doesn't matter because ExecContextForcesOids doesn't care
which PlanState it gets passed, but the comments in
ExecContextForcesOids imply that someday it might, so perhaps it's
best to clean that up.  Patch for this attached, too.
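Concretely, the tidy-up amounts to something like this in
ExecInitGather (a sketch of the idea rather than the attached patch,
assuming the outer plan has already been initialized at that point):

/* Ask the outer plan's context, since the goal is a funnel slot that
 * matches the outer plan's slot (illustrative sketch only). */
if (!ExecContextForcesOids(outerPlanState(gatherstate), &hasoid))
    hasoid = false;
tupDesc = ExecTypeFromTL(outerNode->targetlist, hasoid);
ExecSetSlotDescriptor(gatherstate->funnel_slot, tupDesc);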

And here are the other patches again, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-pass-eflags-to-worker-v1.patch
Description: Binary data


0002-forces-oids-neatnikism-v1.patch
Description: Binary data


0003-skip-gather-project-v2.patch
Description: Binary data


0004-shm-mq-less-spinlocks-v2.patch
Description: Binary data


0005-shm-mq-reduce-receiver-latch-set-v1.patch
Description: Binary data


0006-remove-memory-leak-protection-v1.patch
Description: Binary data



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-10 Thread Amit Kapila
On Fri, Nov 10, 2017 at 9:48 AM, Amit Kapila  wrote:
> On Fri, Nov 10, 2017 at 8:36 AM, Robert Haas  wrote:
>> On Thu, Nov 9, 2017 at 9:31 PM, Amit Kapila  wrote:
>>> Have you set force_parallel_mode=regress; before running the
>>> statement?
>>
>> Yes, I tried that first.
>>
>>> If so, then why do you need to tune other parallel query
>>> related parameters?
>>
>> Because I couldn't get it to break the other way, I then tried this.
>>
>> Instead of asking me what I did, can you tell me what I need to do?
>> Maybe a self-contained reproducible test case including exactly what
>> goes wrong on your end?
>>
>
> I think we are missing something very basic because you should see the
> failure by executing that statement in force_parallel_mode=regress
> even in a freshly created database.
>

I am seeing the assertion failure shown below on executing the
above-mentioned Create statement:

TRAP: FailedAssertion("!(!(tup->t_data->t_infomask & 0x0008))", File:
"heapam.c", Line: 2634)
server closed the connection unexpectedly
This probably means the server terminated abnormally


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-09 Thread Amit Kapila
On Fri, Nov 10, 2017 at 8:36 AM, Robert Haas  wrote:
> On Thu, Nov 9, 2017 at 9:31 PM, Amit Kapila  wrote:
>> Have you set force_parallel_mode=regress; before running the
>> statement?
>
> Yes, I tried that first.
>
>> If so, then why do you need to tune other parallel query
>> related parameters?
>
> Because I couldn't get it to break the other way, I then tried this.
>
> Instead of asking me what I did, can you tell me what I need to do?
> Maybe a self-contained reproducible test case including exactly what
> goes wrong on your end?
>

I think we are missing something very basic because you should see the
failure by executing that statement in force_parallel_mode=regress
even in a freshly created database.  I guess the missing point is that
I am using an assert-enabled build and probably you are not (if this
is the reason, then it should have struck me the first time).  Anyway,
here are the steps to reproduce the issue.

1. initdb
2. start server
3. connect using psql
4. set force_parallel_mode=regress;
5. Create Table as_select1 AS SELECT * FROM pg_class WHERE relkind = 'r';


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-09 Thread Robert Haas
On Thu, Nov 9, 2017 at 9:31 PM, Amit Kapila  wrote:
> Have you set force_parallel_mode=regress; before running the
> statement?

Yes, I tried that first.

> If so, then why do you need to tune other parallel query
> related parameters?

Because I couldn't get it to break the other way, I then tried this.

Instead of asking me what I did, can you tell me what I need to do?
Maybe a self-contained reproducible test case including exactly what
goes wrong on your end?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-09 Thread Amit Kapila
On Fri, Nov 10, 2017 at 12:05 AM, Robert Haas  wrote:
> On Thu, Nov 9, 2017 at 12:08 AM, Amit Kapila  wrote:
>> This change looks suspicious to me.  I think we can't use the tupDesc
>> constructed from the targetlist here.  One problem I can see is that
>> the check for the hasOid setting in tlist_matches_tupdesc won't give
>> the correct answer.  In the case of a scan, we use the tuple descriptor
>> stored in the relation descriptor, which allows us to make the right
>> decision in tlist_matches_tupdesc.  If you try the statement CREATE
>> TABLE as_select1 AS SELECT * FROM pg_class WHERE relkind = 'r'; in
>> force_parallel_mode=regress, then you can reproduce the problem I am
>> trying to highlight.
>
> I tried this, but nothing seemed to be obviously broken.  Then I
> realized that the CREATE TABLE command wasn't using parallelism, so I
> retried with parallel_setup_cost = 0, parallel_tuple_cost = 0, and
> min_parallel_table_scan_size = 0.  That got it to use parallel query,
> but I still don't see anything broken.  Can you clarify further?
>

Have you set force_parallel_mode=regress; before running the
statement?  If so, then why do you need to tune other parallel query
related parameters?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-09 Thread Robert Haas
On Thu, Nov 9, 2017 at 12:08 AM, Amit Kapila  wrote:
> This change looks suspicious to me.  I think we can't use the tupDesc
> constructed from the targetlist here.  One problem I can see is that
> the check for the hasOid setting in tlist_matches_tupdesc won't give
> the correct answer.  In the case of a scan, we use the tuple descriptor
> stored in the relation descriptor, which allows us to make the right
> decision in tlist_matches_tupdesc.  If you try the statement CREATE
> TABLE as_select1 AS SELECT * FROM pg_class WHERE relkind = 'r'; in
> force_parallel_mode=regress, then you can reproduce the problem I am
> trying to highlight.

I tried this, but nothing seemed to be obviously broken.  Then I
realized that the CREATE TABLE command wasn't using parallelism, so I
retried with parallel_setup_cost = 0, parallel_tuple_cost = 0, and
min_parallel_table_scan_size = 0.  That got it to use parallel query,
but I still don't see anything broken.  Can you clarify further?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-08 Thread Amit Kapila
On Wed, Nov 8, 2017 at 1:02 AM, Andres Freund  wrote:
> Hi,
>
> On 2017-11-06 10:56:43 +0530, Amit Kapila wrote:
>> On Sun, Nov 5, 2017 at 6:54 AM, Andres Freund  wrote
>> > On 2017-11-05 01:05:59 +0100, Robert Haas wrote:
>> >> skip-gather-project-v1.patch does what it says on the tin.  I still
>> >> don't have a test case for this, and I didn't find that it helped very
>> >> much,
>>
>> I am also wondering in which case it can help, and I can't think of
>> one.
>
> I'm confused?  Isn't it fairly obvious that unnecessarily projecting
> at the gather node is wasteful? Obviously depending on the query you'll
> see smaller / bigger gains, but that there are benefits should be fairly
> obvious?
>
>

I agree that there could be benefits depending on the statement.  I
initially thought that we were re-evaluating the expressions in the
target list as part of projection even though the worker backend had
already done that, but that was not the case; instead, we are deforming
the tuples sent by workers.  Now, I think that as a general principle
it is a good idea to delay the deforming as much as possible.

About the patch,


  /*
- * Initialize result tuple type and projection info.
- */
- ExecAssignResultTypeFromTL(&gatherstate->ps);
- ExecAssignProjectionInfo(&gatherstate->ps, NULL);
-

- /*
  * Initialize funnel slot to same tuple descriptor as outer plan.
  */
  if (!ExecContextForcesOids(&gatherstate->ps, &hasoid))
@@ -115,6 +109,12 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
  tupDesc = ExecTypeFromTL(outerNode->targetlist, hasoid);
  ExecSetSlotDescriptor(gatherstate->funnel_slot, tupDesc);

+ /*
+ * Initialize result tuple type and projection info.
+ */
+ ExecAssignResultTypeFromTL(&gatherstate->ps);
+ ExecConditionalAssignProjectionInfo(&gatherstate->ps, tupDesc, OUTER_VAR);
+

This change looks suspicious to me.  I think we can't use the tupDesc
constructed from the targetlist here.  One problem I can see is that
the check for the hasOid setting in tlist_matches_tupdesc won't give
the correct answer.  In the case of a scan, we use the tuple descriptor
stored in the relation descriptor, which allows us to make the right
decision in tlist_matches_tupdesc.  If you try the statement CREATE
TABLE as_select1 AS SELECT * FROM pg_class WHERE relkind = 'r'; in
force_parallel_mode=regress, then you can reproduce the problem I am
trying to highlight.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-07 Thread Andres Freund
Hi,

On 2017-11-06 10:56:43 +0530, Amit Kapila wrote:
> On Sun, Nov 5, 2017 at 6:54 AM, Andres Freund  wrote
> > On 2017-11-05 01:05:59 +0100, Robert Haas wrote:
> >> skip-gather-project-v1.patch does what it says on the tin.  I still
> >> don't have a test case for this, and I didn't find that it helped very
> >> much,
> 
> I am also wondering in which case it can help, and I can't think of
> one.

I'm confused?  Isn't it fairly obvious that unnecessarily projecting
at the gather node is wasteful? Obviously depending on the query you'll
see smaller / bigger gains, but that there are benefits should be fairly obvious?


> Basically, as part of projection in the gather, I think we are just
> deforming the tuple, which we anyway need to do before sending the
> tuple to the client (printtup) or probably at the upper level of the
> node.

But in most cases you're not going to print millions of tuples; instead
you're going to apply some further operators on top (e.g. the
OFFSET/LIMIT in my example).

> >> and you said this looked like a big bottleneck in your
> >> testing, so here you go.

> Is it possible that it shows the bottleneck only for the 'explain analyze'
> statement, as we don't deform the tuple at a later stage for that?

Doesn't matter, there's an OFFSET/LIMIT on top of the query. Could just as
well be a sort node or something.


> > The query where that showed a big benefit was
> >
> > SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;
> >
> > (i.e. a not very selective filter, and then just throwing the results away)
> >
> > still shows quite massive benefits:
> 
> Do you see the benefit if the query is executed without using Explain Analyze?

Yes.

Before:
tpch_5[11878][1]=# SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 
10 LIMIT 1;
...
Time: 7590.196 ms (00:07.590)

After:
Time: 3862.955 ms (00:03.863)


Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Jim Van Fleet
Hi --

pgsql-hackers-ow...@postgresql.org wrote on 11/06/2017 09:47:22 AM:

> From: Andres Freund 

> 
> Hi,
> 
> Please don't top-quote on postgresql lists.
Sorry 
> 
> On 2017-11-06 09:44:24 -0600, Jim Van Fleet wrote:
> > > >hammerdb, in this configuration, runs a variant of tpcc
> > > 
> > > Hard to believe that any of the changes here are relevant in that 
> > > case - this is parallelism specific stuff. Whereas tpcc is oltp, right?
> 
> > correct
> 
> In that case, could you provide before/after profiles of the performance
> changing runs?
Sure -- happy to share -- gzipped files (which include trace, perf,
netstat, system data) are large (9G and 13G).
Should I post them on the list or somewhere else (or trim them -- if so,
what would you like to have?)
> 
Jim




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Andres Freund
Hi,

Please don't top-quote on postgresql lists.

On 2017-11-06 09:44:24 -0600, Jim Van Fleet wrote:
> > >hammerdb, in this configuration, runs a variant of tpcc
> > 
> > Hard to believe that any of the changes here are relevant in that 
> > case - this is parallelism specific stuff. Whereas tpcc is oltp, right?

> correct

In that case, could you provide before/after profiles of the performance
changing runs?

Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Jim Van Fleet
correct

> >hammerdb, in this configuration, runs a variant of tpcc
> 
> Hard to believe that any of the changes here are relevant in that 
> case - this is parallelism specific stuff. Whereas tpcc is oltp, right?
> 
> Andres
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
> 




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Andres Freund


On November 6, 2017 7:30:49 AM PST, Jim Van Fleet  wrote:
>Andres Freund  wrote on 11/05/2017 03:40:15 PM:
>
>hammerdb, in this configuration, runs a variant of tpcc

Hard to believe that any of the changes here are relevant in that case - this 
is parallelism specific stuff. Whereas tpcc is oltp, right?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Jim Van Fleet
Andres Freund  wrote on 11/05/2017 03:40:15 PM:

hammerdb, in this configuration, runs a variant of tpcc
> 
> What query(s) did you measure?
> 
> Andres
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
> 




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-05 Thread Amit Kapila
On Sun, Nov 5, 2017 at 6:54 AM, Andres Freund  wrote
> On 2017-11-05 01:05:59 +0100, Robert Haas wrote:
>> skip-gather-project-v1.patch does what it says on the tin.  I still
>> don't have a test case for this, and I didn't find that it helped very
>> much,

I am also wondering in which case it can help, and I can't think of
one.  Basically, as part of projection in the gather, I think we are
just deforming the tuple, which we anyway need to do before sending
the tuple to the client (printtup) or probably at the upper level of
the node.

>> and you said this looked like a big bottleneck in your
>> testing, so here you go.
>

Is it possible that it shows the bottleneck only for the 'explain analyze'
statement, as we don't deform the tuple at a later stage for that?

> The query where that showed a big benefit was
>
> SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;
>
> (i.e. a not very selective filter, and then just throwing the results away)
>
> still shows quite massive benefits:
>
> before:
> set parallel_setup_cost=0;set parallel_tuple_cost=0;set 
> min_parallel_table_scan_size=0;set max_parallel_workers_per_gather=8;
> tpch_5[17938][1]=# explain analyze SELECT * FROM lineitem WHERE l_suppkey > 
> '5012' OFFSET 10 LIMIT 1;
> ┌
> │ QUERY PLAN
> ├
> │ Limit  (cost=635802.67..635802.69 rows=1 width=127) (actual 
> time=8675.097..8675.097 rows=0 loops=1)
> │   ->  Gather  (cost=0.00..635802.67 rows=27003243 width=127) (actual 
> time=0.289..7904.849 rows=26989780 loops=1)
> │ Workers Planned: 8
> │ Workers Launched: 7
> │ ->  Parallel Seq Scan on lineitem  (cost=0.00..635802.67 
> rows=3375405 width=127) (actual time=0.124..528.667 rows=3373722 loops=8)
> │   Filter: (l_suppkey > 5012)
> │   Rows Removed by Filter: 376252
> │ Planning time: 0.098 ms
> │ Execution time: 8676.125 ms
> └
> (9 rows)
> after:
> tpch_5[19754][1]=# EXPLAIN ANALYZE SELECT * FROM lineitem WHERE l_suppkey > 
> '5012' OFFSET 10 LIMIT 1;
> ┌
> │ QUERY PLAN
> ├
> │ Limit  (cost=635802.67..635802.69 rows=1 width=127) (actual 
> time=5984.916..5984.916 rows=0 loops=1)
> │   ->  Gather  (cost=0.00..635802.67 rows=27003243 width=127) (actual 
> time=0.214..5123.238 rows=26989780 loops=1)
> │ Workers Planned: 8
> │ Workers Launched: 7
> │ ->  Parallel Seq Scan on lineitem  (cost=0.00..635802.67 
> rows=3375405 width=127) (actual time=0.025..649.887 rows=3373722 loops=8)
> │   Filter: (l_suppkey > 5012)
> │   Rows Removed by Filter: 376252
> │ Planning time: 0.076 ms
> │ Execution time: 5986.171 ms
> └
> (9 rows)
>
> so there clearly is still benefit (this is scale 100, but that shouldn't
> make much of a difference).
>

Do you see the benefit if the query is executed without using Explain Analyze?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-05 Thread Andres Freund
Hi,

On November 5, 2017 1:33:24 PM PST, Jim Van Fleet  wrote:
>Ran this change with hammerdb  on a power 8 firestone
>
>with 2 socket, 20 core
>9.6 base--  451991 NOPM
>0926_master -- 464385 NOPM
>11_04master -- 449177 NOPM
>11_04_patch -- 431423 NOPM
>-- two socket patch is a little down from previous base runs
>
>With one socket
>9.6 base  -- 393727 NOPM 
>v10rc1_base -- 350958 NOPM
>11_04master -- 306506 NOPM
>11_04_patch -- 313179 NOPM
>--  one socket 11_04 master is quite a bit down from 9.6 and
>v10rc1_base 
>-- the patch is up a bit over the base
>
>Net -- the patch is about the same as current base on two socket, and
>on 
>one socket  -- consistent with your pgbench (?) findings
>
>As an aside, it is perhaps a worry that one socket is down over 20%
>from 
>9.6 and over 10% from v10rc1

What query(s) did you measure?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-05 Thread Jim Van Fleet
Ran this change with hammerdb  on a power 8 firestone

with 2 socket, 20 core
9.6 base--  451991 NOPM
0926_master -- 464385 NOPM
11_04master -- 449177 NOPM
11_04_patch -- 431423 NOPM
-- two socket patch is a little down from previous base runs

With one socket
9.6 base  -- 393727 NOPM 
v10rc1_base -- 350958 NOPM
11_04master -- 306506 NOPM
11_04_patch -- 313179 NOPM
--  one socket 11_04 master is quite a bit down from 9.6 and v10rc1_base 
-- the patch is up a bit over the base

Net -- the patch is about the same as current base on two socket, and on 
one socket  -- consistent with your pgbench (?) findings

As an aside, it is perhaps a worry that one socket is down over 20% from 
9.6 and over 10% from v10rc1

Jim

pgsql-hackers-ow...@postgresql.org wrote on 11/04/2017 06:08:31 AM:

> On hydra (PPC), these changes didn't help much.  Timings:
> 
> master: 29605.582, 29753.417, 30160.485
> patch: 28218.396, 27986.951, 26465.584
> 
> That's about a 5-6% improvement.  On my MacBook, though, the
> improvement was quite a bit more:
> 
> master: 21436.745, 20978.355, 19918.617
> patch: 15896.573, 15880.652, 15967.176
> 
> Median-to-median, that's about a 24% improvement.
> 
> Any reviews appreciated.
> 
> Thanks,
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> [attachment "shm-mq-less-spinlocks-v1.2.patch" deleted by Jim Van 
> Fleet/Austin/Contr/IBM] 




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-05 Thread Robert Haas
On Sun, Nov 5, 2017 at 2:24 AM, Andres Freund  wrote:
>> shm-mq-reduce-receiver-latch-set-v1.patch causes the receiver to only
>> consume input from the shared queue when the amount of unconsumed
>> input exceeds 1/4 of the queue size.  This caused a large performance
>> improvement in my testing because it causes the number of times the
>> latch gets set to drop dramatically. I experimented a bit with
>> thresholds of 1/8 and 1/2 before settling on 1/4; 1/4 seems to be
>> enough to capture most of the benefit.
>
> Hm. Is consuming the relevant part, or notifying the sender about it?  I
> suspect most of the benefit can be captured by updating bytes read (and
> similarly on the other side w/ bytes written), but not setting the latch
> unless thresholds are reached.  The advantage of updating the value,
> even without notifying the other side, is that in the common case that
> the other side gets around to checking the queue without having blocked,
> it'll see the updated value.  If that works, that'd address the issue
> that we might wait unnecessarily in a number of common cases.

I think it's mostly notifying the sender.  Sending SIGUSR1 over and
over again isn't free, and it shows up in profiling.  I thought about
what you're proposing here, but it seemed more complicated to
implement, and I'm not sure that there would be any benefit.  The
reason is because, with these patches applied, even a radical
expansion of the queue size doesn't produce much incremental
performance benefit at least in the test case I was using.  I can
increase the size of the tuple queues 10x or 100x and it really
doesn't help very much.  And consuming sooner (but sometimes without
notifying) seems very similar to making the queue slightly bigger.

Also, what I see in general is that the CPU usage on the leader goes
to 100% but the workers are only maybe 20% saturated.  Making the
leader work any harder than absolutely necessarily therefore seems
like it's probably counterproductive.  I may be wrong, but it looks to
me like most of the remaining overhead seems to come from (1) the
synchronization overhead associated with memory barriers and (2)
backend-private work that isn't as cheap as would be ideal - e.g.
palloc overhead.

> Interesting.  Here it's
> +8.79%  postgres  postgres[.] ExecAgg
> +6.52%  postgres  postgres[.] slot_deform_tuple
> +5.65%  postgres  postgres[.] slot_getattr
> +4.59%  postgres  postgres[.] shm_mq_send_bytes
> +3.66%  postgres  postgres[.] ExecInterpExpr
> +3.44%  postgres  postgres[.] AllocSetAlloc
> +3.08%  postgres  postgres[.] heap_fill_tuple
> +2.34%  postgres  postgres[.] heap_getnext
> +2.25%  postgres  postgres[.] finalize_aggregates
> +2.08%  postgres  libc-2.24.so[.] __memmove_avx_unaligned_erms
> +2.05%  postgres  postgres[.] heap_compare_slots
> +1.99%  postgres  postgres[.] execTuplesMatch
> +1.83%  postgres  postgres[.] ExecStoreTuple
> +1.83%  postgres  postgres[.] shm_mq_receive
> +1.81%  postgres  postgres[.] ExecScan

More or less the same functions, somewhat different order.

>> I'm probably not super-excited about spending too much more time
>> trying to make the _platform_memmove time (only 20% or so of which
>> seems to be due to the shm_mq stuff) or the shm_mq_receive_bytes time
>> go down until, say, somebody JIT's slot_getattr and slot_deform_tuple.
>> :-)
>
> Hm, let's say somebody were working on something like that. In that case
> the benefits for this precise plan wouldn't yet be that big because a
> good chunk of slot_getattr calls come from execTuplesMatch() which
> doesn't really provide enough context to do JITing (when used for
> hashaggs, there is more so it's JITed). Similarly gather merge's
> heap_compare_slots() doesn't provide such context.
>
> It's about ~9% currently, largely due to the faster aggregate
> invocation. But the big benefit here would be all the deforming and the
> comparisons...

I'm not sure I follow you here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-04 Thread Andres Freund
On 2017-11-05 01:05:59 +0100, Robert Haas wrote:
> skip-gather-project-v1.patch does what it says on the tin.  I still
> don't have a test case for this, and I didn't find that it helped very
> much, but it would probably help more in a test case with more
> columns, and you said this looked like a big bottleneck in your
> testing, so here you go.

The query where that showed a big benefit was

SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;

(i.e. a not very selective filter, and then just throwing the results away)

still shows quite massive benefits:

before:
set parallel_setup_cost=0;set parallel_tuple_cost=0;set 
min_parallel_table_scan_size=0;set max_parallel_workers_per_gather=8;
tpch_5[17938][1]=# explain analyze SELECT * FROM lineitem WHERE l_suppkey > 
'5012' OFFSET 10 LIMIT 1;
┌
│ QUERY PLAN
├
│ Limit  (cost=635802.67..635802.69 rows=1 width=127) (actual 
time=8675.097..8675.097 rows=0 loops=1)
│   ->  Gather  (cost=0.00..635802.67 rows=27003243 width=127) (actual 
time=0.289..7904.849 rows=26989780 loops=1)
│ Workers Planned: 8
│ Workers Launched: 7
│ ->  Parallel Seq Scan on lineitem  (cost=0.00..635802.67 rows=3375405 
width=127) (actual time=0.124..528.667 rows=3373722 loops=8)
│   Filter: (l_suppkey > 5012)
│   Rows Removed by Filter: 376252
│ Planning time: 0.098 ms
│ Execution time: 8676.125 ms
└
(9 rows)
after:
tpch_5[19754][1]=# EXPLAIN ANALYZE SELECT * FROM lineitem WHERE l_suppkey > 
'5012' OFFSET 10 LIMIT 1;
┌
│ QUERY PLAN
├
│ Limit  (cost=635802.67..635802.69 rows=1 width=127) (actual 
time=5984.916..5984.916 rows=0 loops=1)
│   ->  Gather  (cost=0.00..635802.67 rows=27003243 width=127) (actual 
time=0.214..5123.238 rows=26989780 loops=1)
│ Workers Planned: 8
│ Workers Launched: 7
│ ->  Parallel Seq Scan on lineitem  (cost=0.00..635802.67 rows=3375405 
width=127) (actual time=0.025..649.887 rows=3373722 loops=8)
│   Filter: (l_suppkey > 5012)
│   Rows Removed by Filter: 376252
│ Planning time: 0.076 ms
│ Execution time: 5986.171 ms
└
(9 rows)

so there clearly is still benefit (this is scale 100, but that shouldn't
make much of a difference).

Did not review the code.

> shm-mq-reduce-receiver-latch-set-v1.patch causes the receiver to only
> consume input from the shared queue when the amount of unconsumed
> input exceeds 1/4 of the queue size.  This caused a large performance
> improvement in my testing because it causes the number of times the
> latch gets set to drop dramatically. I experimented a bit with
> thresholds of 1/8 and 1/2 before settling on 1/4; 1/4 seems to be
> enough to capture most of the benefit.

Hm. Is consuming the relevant part, or notifying the sender about it?  I
suspect most of the benefit can be captured by updating bytes read (and
similarly on the other side w/ bytes written), but not setting the latch
unless thresholds are reached.  The advantage of updating the value,
even without notifying the other side, is that in the common case that
the other side gets around to checking the queue without having blocked,
it'll see the updated value.  If that works, that'd address the issue
that we might wait unnecessarily in a number of common cases.

Did not review the code.

> remove-memory-leak-protection-v1.patch removes the memory leak
> protection that Tom installed upon discovering that the original
> version of tqueue.c leaked memory like crazy.  I think that it
> shouldn't do that any more, courtesy of
> 6b65a7fe62e129d5c2b85cd74d6a91d8f7564608.  Assuming that's correct, we
> can avoid a whole lot of tuple copying in Gather Merge and a much more
> modest amount of overhead in Gather.

Yup, that conceptually makes sense.

Did not review the code.


> Even with all of these patches applied, there's clearly still room for
> more optimization, but MacOS's "sample" profiler seems to show that
> the bottlenecks are largely shifting elsewhere:
>
> Sort by top of stack, same 

Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-04 Thread Robert Haas
On Sat, Nov 4, 2017 at 5:55 PM, Andres Freund  wrote:
>> master: 21436.745, 20978.355, 19918.617
>> patch: 15896.573, 15880.652, 15967.176
>>
>> Median-to-median, that's about a 24% improvement.
>
> Neat!

With the attached stack of 4 patches, I get: 10811.768 ms, 10743.424
ms, 10632.006 ms, about a 49% improvement median-to-median.  Haven't
tried it on hydra or any other test cases yet.

skip-gather-project-v1.patch does what it says on the tin.  I still
don't have a test case for this, and I didn't find that it helped very
much, but it would probably help more in a test case with more
columns, and you said this looked like a big bottleneck in your
testing, so here you go.

shm-mq-less-spinlocks-v2.patch is updated from the version I posted
before based on your review comments.  I don't think it's really
necessary to mention that the 8-byte atomics have fallbacks here;
whatever needs to be said about that should be said in some central
place that talks about atomics, not in each user individually.  I
agree that there might be some further speedups possible by caching
some things in local memory, but I haven't experimented with that.

shm-mq-reduce-receiver-latch-set-v1.patch causes the receiver to only
consume input from the shared queue when the amount of unconsumed
input exceeds 1/4 of the queue size.  This caused a large performance
improvement in my testing because it causes the number of times the
latch gets set to drop dramatically. I experimented a bit with
thresholds of 1/8 and 1/2 before settling on 1/4; 1/4 seems to be
enough to capture most of the benefit.
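In rough terms, the receiver-side logic can be pictured like this (an
illustrative sketch, not the patch itself; the field and helper names
are assumptions): bytes are consumed locally but only folded back into
the shared read counter, and hence only trigger a sender wakeup, once
they exceed a quarter of the ring size.

/* Sketch of the idea; mqh_consume_pending is an assumed bookkeeping field. */
mqh->mqh_consume_pending += nbytes;     /* consumed but not yet acknowledged */
if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
{
    /* Publish the new read position and set the sender's latch only now. */
    shm_mq_inc_bytes_read(mq, mqh->mqh_consume_pending);
    mqh->mqh_consume_pending = 0;
}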

remove-memory-leak-protection-v1.patch removes the memory leak
protection that Tom installed upon discovering that the original
version of tqueue.c leaked memory like crazy.  I think that it
shouldn't do that any more, courtesy of
6b65a7fe62e129d5c2b85cd74d6a91d8f7564608.  Assuming that's correct, we
can avoid a whole lot of tuple copying in Gather Merge and a much more
modest amount of overhead in Gather.  Since my test case exercised
Gather Merge, this bought ~400 ms or so.

Even with all of these patches applied, there's clearly still room for
more optimization, but MacOS's "sample" profiler seems to show that
the bottlenecks are largely shifting elsewhere:

Sort by top of stack, same collapsed (when >= 5):
slot_getattr  (in postgres)                                       706
slot_deform_tuple  (in postgres)                                  560
ExecAgg  (in postgres)                                            378
ExecInterpExpr  (in postgres)                                     372
AllocSetAlloc  (in postgres)                                      319
_platform_memmove$VARIANT$Haswell  (in libsystem_platform.dylib)  314
read  (in libsystem_kernel.dylib)                                 303
heap_compare_slots  (in postgres)                                 296
combine_aggregates  (in postgres)                                 273
shm_mq_receive_bytes  (in postgres)                               272

I'm probably not super-excited about spending too much more time
trying to make the _platform_memmove time (only 20% or so of which
seems to be due to the shm_mq stuff) or the shm_mq_receive_bytes time
go down until, say, somebody JIT's slot_getattr and slot_deform_tuple.
:-)

One thing that might be worth doing is hammering on the AllocSetAlloc
time.  I think that's largely caused by allocating space for heap
tuples and then freeing them and allocating space for new heap tuples.
Gather/Gather Merge are guilty of that, but I think there may be other
places in the executor with the same issue. Maybe we could have
fixed-size buffers for small tuples that just get reused and only
palloc for large tuples (cf. SLAB_SLOT_SIZE).
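As a rough illustration of that reuse idea (a sketch under assumed names
and a made-up size threshold, not an actual patch): keep one
preallocated buffer for ordinary-sized tuples and fall back to palloc
only for oversized ones.

#define TUPLE_BUFFER_SIZE 1024          /* assumed threshold, cf. SLAB_SLOT_SIZE */

static char tuple_buffer[TUPLE_BUFFER_SIZE];    /* reused for small tuples */

static HeapTuple
copy_received_tuple(HeapTuple tup)
{
    Size        len = HEAPTUPLESIZE + tup->t_len;

    if (len <= TUPLE_BUFFER_SIZE)
    {
        /* Reuse the fixed buffer: no palloc/pfree per tuple. */
        HeapTuple   result = (HeapTuple) tuple_buffer;

        memcpy(result, tup, HEAPTUPLESIZE);
        result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
        memcpy(result->t_data, tup->t_data, tup->t_len);
        return result;
    }

    /* Large tuples still pay the allocation cost. */
    return heap_copytuple(tup);
}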

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


skip-gather-project-v1.patch
Description: Binary data


shm-mq-less-spinlocks-v2.patch
Description: Binary data


shm-mq-reduce-receiver-latch-set-v1.patch
Description: Binary data


remove-memory-leak-protection-v1.patch
Description: Binary data



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-04 Thread Andres Freund
Hi,

On 2017-11-04 16:38:31 +0530, Robert Haas wrote:
> On hydra (PPC), these changes didn't help much.  Timings:
>
> master: 29605.582, 29753.417, 30160.485
> patch: 28218.396, 27986.951, 26465.584
>
> That's about a 5-6% improvement.  On my MacBook, though, the
> improvement was quite a bit more:

Hm. I wonder why that is. Random unverified theories (this plane doesn't
have power supplies for us mere mortals in coach, therefore I'm not
going to run benchmarks):

- Due to the lower per-core performance the leader backend is so
  bottlenecked that there's just not a whole lot of
  contention. Therefore removing the lock doesn't help much. That might
  be a bit different if the redundant projection is removed.
- IO performance on hydra is notoriously bad so there might just not be
  enough data available for workers to process rows fast enough to cause
  contention.

> master: 21436.745, 20978.355, 19918.617
> patch: 15896.573, 15880.652, 15967.176
>
> Median-to-median, that's about a 24% improvement.

Neat!


> - * mq_detached can be set by either the sender or the receiver, so the mutex
> - * must be held to read or write it.  Memory barriers could be used here as
> - * well, if needed.
> + * mq_bytes_read and mq_bytes_written are not protected by the mutex.  
> Instead,
> + * they are written atomically using 8 byte loads and stores.  Memory 
> barriers
> + * must be carefully used to synchronize reads and writes of these values 
> with
> + * reads and writes of the actual data in mq_ring.

Maybe mention that there's a fallback for ancient platforms?


> @@ -900,15 +921,12 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, 
> const void *data,
>   }
>   else if (available == 0)
>   {
> - shm_mq_result res;
> -
> - /* Let the receiver know that we need them to read some 
> data. */
> - res = shm_mq_notify_receiver(mq);
> - if (res != SHM_MQ_SUCCESS)
> - {
> - *bytes_written = sent;
> - return res;
> - }
> + /*
> +  * Since mq->mqh_counterparty_attached is known to be 
> true at this
> +  * point, mq_receiver has been set, and it can't change 
> once set.
> +  * Therefore, we can read it without acquiring the 
> spinlock.
> +  */
> + SetLatch(&mq->mq_receiver->procLatch);

Might not hurt to assert mqh_counterparty_attached, just for slightly
easier debugging.

> @@ -983,19 +1009,27 @@ shm_mq_receive_bytes(shm_mq *mq, Size bytes_needed, 
> bool nowait,
>   for (;;)
>   {
>   Sizeoffset;
> - booldetached;
> + uint64  read;
>
>   /* Get bytes written, so we can compute what's available to 
> read. */
> - written = shm_mq_get_bytes_written(mq, &detached);
> - used = written - mq->mq_bytes_read;
> + written = pg_atomic_read_u64(&mq->mq_bytes_written);
> + read = pg_atomic_read_u64(&mq->mq_bytes_read);

Theoretically we don't actually need to re-read this from shared memory,
we could just have the information in the local memory too. Right?
Doubtful however that it's important enough to bother given that we've
to move the cacheline for `mq_bytes_written` anyway, and will later also
dirty it to *update* `mq_bytes_read`.  Similarly on the write side.


> -/*
>   * Increment the number of bytes read.
>   */
>  static void
> @@ -1157,63 +1164,51 @@ shm_mq_inc_bytes_read(volatile shm_mq *mq, Size n)
>  {
>   PGPROC *sender;
>
> - SpinLockAcquire(&mq->mq_mutex);
> - mq->mq_bytes_read += n;
> + /*
> +  * Separate prior reads of mq_ring from the increment of mq_bytes_read
> +  * which follows.  Pairs with the full barrier in shm_mq_send_bytes().
> +  * We only need a read barrier here because the increment of 
> mq_bytes_read
> +  * is actually a read followed by a dependent write.
> +  */
> + pg_read_barrier();
> +
> + /*
> +  * There's no need to use pg_atomic_fetch_add_u64 here, because nobody
> +  * else can be changing this value.  This method avoids taking the bus
> +  * lock unnecessarily.
> +  */

s/the bus lock/a bus lock/?  Might also be worth rephrasing away from
bus locks - there's a lot of different ways atomics are implemented.

>  /*
> - * Get the number of bytes written.  The sender need not use this to access
> - * the count of bytes written, but the receiver must.
> - */
> -static uint64
> -shm_mq_get_bytes_written(volatile shm_mq *mq, bool *detached)
> -{
> - uint64  v;
> -
> - SpinLockAcquire(&mq->mq_mutex);
> - v = mq->mq_bytes_written;
> - *detached = mq->mq_detached;
> - SpinLockRelease(&mq->mq_mutex);
> -
> - return v;
> -}

You reference this function in a comment 

Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-04 Thread Robert Haas
On Wed, Oct 18, 2017 at 3:09 AM, Andres Freund  wrote:
> 2) The spinlocks on both the sending and receiving side are quite hot:
>
>rafia query leader:
> +   36.16%  postgres  postgres[.] shm_mq_receive
> +   19.49%  postgres  postgres[.] s_lock
> +   13.24%  postgres  postgres[.] SetLatch

Here's a patch which, as per an off-list discussion between Andres,
Amit, and myself, removes the use of the spinlock for most
send/receive operations in favor of memory barriers and the atomics
support for 8-byte reads and writes.  I tested with a pgbench -i -s
300 database with pgbench_accounts_pkey dropped and
max_parallel_workers_per_gather boosted to 10.  I used this query:

select aid, count(*) from pgbench_accounts group by 1 having count(*) > 1;

which produces this plan:

 Finalize GroupAggregate  (cost=1235865.51..5569468.75 rows=1000 width=12)
   Group Key: aid
   Filter: (count(*) > 1)
   ->  Gather Merge  (cost=1235865.51..4969468.75 rows=3000 width=12)
 Workers Planned: 6
 ->  Partial GroupAggregate  (cost=1234865.42..1322365.42
rows=500 width=12)
   Group Key: aid
   ->  Sort  (cost=1234865.42..1247365.42 rows=500 width=4)
 Sort Key: aid
 ->  Parallel Seq Scan on pgbench_accounts
(cost=0.00..541804.00 rows=500 width=4)
(10 rows)

On hydra (PPC), these changes didn't help much.  Timings:

master: 29605.582, 29753.417, 30160.485
patch: 28218.396, 27986.951, 26465.584

That's about a 5-6% improvement.  On my MacBook, though, the
improvement was quite a bit more:

master: 21436.745, 20978.355, 19918.617
patch: 15896.573, 15880.652, 15967.176

Median-to-median, that's about a 24% improvement.

Any reviews appreciated.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


shm-mq-less-spinlocks-v1.2.patch
Description: Binary data



Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-23 Thread Amit Kapila
On Thu, Oct 19, 2017 at 1:16 AM, Robert Haas  wrote:
> On Tue, Oct 17, 2017 at 5:39 PM, Andres Freund  wrote:
>
>>b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
>>   queue whenever it's not empty when a tuple is ready.
>
> Batching them with what?  I hate to postpone sending tuples we've got;
> that sounds nice in the case where we're sending tons of tuples at
> high speed, but there might be other cases where it makes the leader
> wait.
>

I think Rafia's latest patch on the thread [1] has done something
similar, where tuples are sent as long as there is space in the shared
memory queue, and it then turns to batching the tuples using local
queues.


[1] - 
https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-23 Thread Amit Kapila
On Wed, Oct 18, 2017 at 3:09 AM, Andres Freund  wrote:
> Hi Everyone,
>
> On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
>> While analysing the performance of TPC-H queries for the newly developed
>> parallel-operators, viz, parallel index, bitmap heap scan, etc. we noticed
>> that the time taken by gather node is significant. On investigation, as per
>> the current method it copies each tuple to the shared queue and notifies
>> the receiver. Since, this copying is done in shared queue, a lot of locking
>> and latching overhead is there.
>>
>> So, in this POC patch I tried to copy all the tuples in a local queue thus
>> avoiding all the locks and latches. Once, the local queue is filled as per
>> it's capacity, tuples are transferred to the shared queue. Once, all the
>> tuples are transferred the receiver is sent the notification about the same.
>>
>> With this patch I could see significant improvement in performance for
>> simple queries,
>
> I've spent some time looking into this, and I'm not quite convinced this
> is the right approach.
>

As per my understanding, the patch in this thread is dead (not
required) after the patch posted by Rafia in the thread "Effect of
changing the value for PARALLEL_TUPLE_QUEUE_SIZE" [1].  There seem to
be two related problems in this area: the first is shm queue
communication overhead, and the second is that workers start to stall
when the shm queue gets full.  We can observe the first problem in
simple queries that use gather, and the second in gather-merge kind of
scenarios.  We are trying to resolve both problems with the patch
posted in the other thread.  I think there is some similarity with the
patch posted on this thread, but it is different.  I have mentioned
something similar upthread as well.


>  Let me start by describing where I see the
> current performance problems around gather stemming from.
>
> The observations here are made using
> select * from t where i < 3000 offset 2999 - 1;
> with Rafia's data. That avoids slowdowns on the leader due to too many
> rows printed out. Sometimes I'll also use
> SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;
> on tpch data to show the effects on wider tables.
>
> The precise query doesn't really matter, the observations here are more
> general, I hope.
>
> 1) nodeGather.c re-projects every row from workers. As far as I can tell
>that's now always exactly the same targetlist as it's coming from the
>worker. Projection capability was added in 8538a6307049590 (without
>checking whether it's needed afaict), but I think it in turn often
>obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7.  My
>measurement shows that removing the projection yields quite massive
>speedups in queries like yours, which is not too surprising.
>
>I suspect this just needs a tlist_matches_tupdesc check + an if
>around ExecProject(). And a test, right now tests pass unless
>force_parallel_mode is used even if just commenting out the
>projection unconditionally.
>

+1.  I think we should do something to avoid this.

>
>Commenting out the memory barrier - which is NOT CORRECT - improves
>timing:
>before: 4675.626
>after: 4125.587
>
>The correct fix obviously is not to break latch correctness. I think
>the big problem here is that we perform a SetLatch() for every read
>from the latch.
>
>I think we should
>a) introduce a batch variant for receiving like:
>
> extern shm_mq_result shm_mq_receivev(shm_mq_handle *mqh,
>  shm_mq_iovec *iov, int *iovcnt,
>  bool nowait)
>
>   which then only does the SetLatch() at the end. And maybe if it
>   wrapped around.
>
>b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
>   queue whenever it's not empty when a tuple is ready.
>

I think the patch posted in another thread has tried to achieve such
batching by using local queues; maybe we should try some other way.
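To make the proposed interface concrete, a hypothetical caller-side
sketch (shm_mq_receivev does not exist today; shm_mq_iovec is the
struct already used by shm_mq_sendv, and process_message is just a
placeholder):

shm_mq_iovec    iov[16];
int             iovcnt = lengthof(iov);
shm_mq_result   res;
int             i;

res = shm_mq_receivev(mqh, iov, &iovcnt, true /* nowait */ );
if (res == SHM_MQ_SUCCESS)
{
    /* Up to iovcnt messages arrive in one call; the latch is set once. */
    for (i = 0; i < iovcnt; i++)
        process_message(iov[i].data, iov[i].len);
}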

>
> Unfortunately the patch's "local worker queue" concept seems, to me,
> like it's not quite addressing the structural issues, instead opting to
> address them by adding another layer of queuing.
>

That is done to batch the tuples while sending them.  Sure, we can do
some of the other things as well, but I think the main advantage comes
from batching the tuples in a smart way while sending them, and once
that is done, we might not need many of the other optimizations.


[1] - 
https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-20 Thread Robert Haas
On Wed, Oct 18, 2017 at 4:30 PM, Andres Freund  wrote:
> Hm. I'm a bit confused/surprised by that. What'd be the worst that can
> happen if we don't immediately detect that the other side is detached?
> At least if we only do so in the non-blocking paths, the worst that
> it seems could happen is that we read/write at most one superfluous
> queue entry, rather than reporting an error? Or is the concern that
> detaching might be detected *too early*, before reading the last entry
> from a queue?

Detaching too early is definitely a problem.  A worker is allowed to
start up, write all of its results into the queue, and then detach
without waiting for the leader to read those results.  (Reading
messages that weren't really written would be a problem too, of
course.)

> I'm not convinced by this. Imo the multiplying largely comes from
> superfluous actions, like the per-entry SetLatch calls here, rather than
> per batch.
>
> After all I'd benchmarked this *after* an experimental conversion of
> shm_mq to not use spinlocks - so there's actually no external barrier
> providing these guarantees that could be combined with SetLatch()'s
> barrier.

OK.

>> All that having been said, a batch variant for reading tuples in bulk
>> might make sense.  I think when I originally wrote this code I was
>> thinking that one process might be filling the queue while another
>> process was draining it, and that it might therefore be important to
>> free up space as early as possible.  But maybe that's not a very good
>> intuition.
>
> I think that's a sensible goal, but I don't think that has to mean that
> the draining has to happen after every entry. If you'd e.g. have a
> shm_mq_receivev() with 16 iovecs, that'd commonly be only part of a
> single tqueue mq (unless your tuples are > 4k).  I'll note that afaict
> shm_mq_sendv() already batches its SetLatch() calls.

But that's a little different -- shm_mq_sendv() sends one message
collected from multiple places.  There's no more reason for it to wake
up the receiver before the whole message is written than there would
be for shm_mq_send(); it's the same problem.

> I think one important thing a batched drain can avoid is that a worker
> awakes to just put one new tuple into the queue and then sleep
> again. That's kinda expensive.

Yes.  Or - part of a tuple, which is worse.

>> >b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
>> >   queue whenever it's not empty when a tuple is ready.
>>
>> Batching them with what?  I hate to postpone sending tuples we've got;
>> that sounds nice in the case where we're sending tons of tuples at
>> high speed, but there might be other cases where it makes the leader
>> wait.
>
> Yea, that'd need some smarts. How about doing something like batching up
> locally as long as the queue contains less than one average sized batch?

I don't follow.

> No, I don't think that's necessarily true. If that worker's queue is full
> every-time, then yes. But I think a common scenario is that the
> "current" worker only has a small portion of the queue filled. Draining
> that immediately just leads to increased cacheline bouncing.

Hmm, OK.

> I'd not previously thought about this, but won't staying sticky to the
> current worker potentially cause pins on individual tuples to be held for a
> potentially long time by workers not making any progress?

Yes.

>> >It might also be a good idea to use a more directed form of wakeups,
>> >e.g. using a per-worker latch + a wait event set, to avoid iterating
>> >over all workers.
>>
>> I don't follow.
>
> Well, right now we're busily checking each worker's queue. That's fine
> with a handful of workers, but starts to become not that cheap pretty
> soon afterwards. In the surely common case where the workers are the
> bottleneck (because that's when parallelism is worthwhile), we'll check
> each worker's queue once one of them returned a single tuple. It'd not
> be a stupid idea to have a per-worker latch and wait for the latches of
> all workers at once. Then drain just the queues of the workers
> that WaitEventSetWait(nevents = nworkers) signalled as ready.

Hmm, interesting.  But we can't completely ignore the process latch
either, since among other things a worker erroring out and propagating
the error to the leader relies on that.
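For reference, a rough sketch of what such a targeted wakeup could look
like with the existing WaitEventSet API (illustrative only; the
per-worker latch, WorkerInfo, and drain_worker_queue are assumptions,
and the process latch stays in the set so error propagation still
works):

WaitEventSet   *set = CreateWaitEventSet(CurrentMemoryContext, nworkers + 1);
WaitEvent       events[64];     /* large enough for nworkers + 1 in this sketch */
int             i, nevents;

for (i = 0; i < nworkers; i++)
    AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
                      workers[i].latch, &workers[i]);
/* Keep the process latch so a worker error still wakes the leader. */
AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);

nevents = WaitEventSetWait(set, -1, events, lengthof(events),
                           WAIT_EVENT_EXECUTE_GATHER);
for (i = 0; i < nevents; i++)
{
    if (events[i].user_data != NULL)
        drain_worker_queue((WorkerInfo *) events[i].user_data);
    else
    {
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();     /* handles worker errors, etc. */
    }
}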

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-18 Thread Andres Freund
Hi,

On 2017-10-18 15:46:39 -0400, Robert Haas wrote:
> > 2) The spinlocks on both the sending and receiving side are quite hot:
> >
> >rafia query leader:
> > +   36.16%  postgres  postgres[.] shm_mq_receive
> > +   19.49%  postgres  postgres[.] s_lock
> > +   13.24%  postgres  postgres[.] SetLatch
> >
> >To test that theory, here are the timings for
> >1) spinlocks present
> >   time: 6593.045
> >2) spinlock acquisition replaced by *full* memory barriers, which on x86 
> > is a lock; addl $0,0(%%rsp)
> >   time: 5159.306
> >3) spinlocks replaced by read/write barriers as appropriate.
> >   time: 4687.610
> >
> >By my understanding of shm_mq.c's logic, something like 3) ought to
> >be doable in a correct manner. There should be, in normal
> >circumstances, only be one end modifying each of the protected
> >variables. Doing so instead of using full block atomics with locked
> >instructions is very likely to yield considerably better performance.
> 
> I think the sticking point here will be the handling of the
> mq_detached flag.  I feel like I fixed a bug at some point where this
> had to be checked or set under the lock at the same time we were
> checking or setting mq_bytes_read and/or mq_bytes_written, but I don't
> remember the details.  I like the idea, though.

Hm. I'm a bit confused/surprised by that. What'd be the worst that can
happen if we don't immediately detect that the other side is detached?
At least if we only do so in the non-blocking paths, the worst that
it seems could happen is that we read/write at most one superfluous
queue entry, rather than reporting an error? Or is the concern that
detaching might be detected *too early*, before reading the last entry
from a queue?


> Not sure what happened to #3 on your list... you went from #2 to #4.

Threes are boring.


> > 4) Modulo computations in shm_mq are expensive:
> >
> >that we end up with a full blown div makes sense - the compiler can't
> >know anything about ringsize, therefore it can't optimize to a mask.
> >I think we should force the size of the ringbuffer to be a power of
> >two, and use a mask instead of a size for the buffer.
> 
> This seems like it would require some redesign.  Right now we let the
> caller choose any size they want and subtract off our header size to
> find the actual queue size.  We can waste up to MAXALIGN-1 bytes, but
> that's not much.  This would waste up to half the bytes provided,
> which is probably not cool.

Yea, I think it'd require a shm_mq_estimate_size(Size queuesize), that
returns the next power-of-two queuesize + overhead.
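A sketch of that direction (illustrative only; shm_mq_estimate_size does
not exist, the member names assume shm_mq.c's internals, and the
MAXALIGN details of the header are hand-waved):

/* Hypothetical: round the requested queue size up to a power of two so the
 * ring offset can be computed with a mask instead of a division. */
Size
shm_mq_estimate_size(Size queue_size)
{
    Size        ring_size = 1;

    while (ring_size < queue_size)
        ring_size <<= 1;
    return ring_size + offsetof(shm_mq, mq_ring);   /* ring plus header */
}

/* In the send/receive paths, the modulo ... */
offset = mq->mq_bytes_written % (uint64) ringsize;
/* ... becomes a mask, valid because ringsize is a power of two: */
offset = mq->mq_bytes_written & (uint64) (ringsize - 1);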


> > 5) There's a *lot* of latch interactions. The biggest issue actually is
> >the memory barrier implied by a SetLatch, waiting for the latch
> >barely shows up.
> >
> >Commenting out the memory barrier - which is NOT CORRECT - improves
> >timing:
> >before: 4675.626
> >after: 4125.587
> >
> >The correct fix obviously is not to break latch correctness. I think
> >the big problem here is that we perform a SetLatch() for every read
> >from the latch.
> 
> I think it's a little bit of an overstatement to say that commenting
> out the memory barrier is not correct.  When we added that code, we
> removed this comment:
> 
> - * Presently, when using a shared latch for interprocess signalling, the
> - * flag variable(s) set by senders and inspected by the wait loop must
> - * be protected by spinlocks or LWLocks, else it is possible to miss events
> - * on machines with weak memory ordering (such as PPC).  This restriction
> - * will be lifted in future by inserting suitable memory barriers into
> - * SetLatch and ResetLatch.
> 
> It seems to me that in any case where the data is protected by an
> LWLock, the barriers we've added to SetLatch and ResetLatch are
> redundant. I'm not sure if that's entirely true in the spinlock case,
> because S_UNLOCK() is only documented to have release semantics, so
> maybe the load of latch->is_set could be speculated before the lock is
> dropped.  But I do wonder if we're just multiplying barriers endlessly
> instead of trying to think of ways to minimize them (e.g. have a
> variant of SpinLockRelease that acts as a full barrier instead of a
> release barrier, and then avoid a second barrier when checking the
> latch state).

I'm not convinced by this. IMO the multiplying largely comes from
superfluous actions, like the per-entry SetLatch calls here, rather than
one call per batch.

After all I'd benchmarked this *after* an experimental conversion of
shm_mq to not use spinlocks - so there's actually no external barrier
providing these guarantees that could be combined with SetLatch()'s
barrier.

Presumably part of the cost here comes from the fact that the barriers
actually do have an influence over the ordering.
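
To make the per-batch idea concrete, here's a rough sketch of the
receive side (all names are hypothetical; SetLatch() and the PGPROC
latch are the only real APIs used): instead of publishing every
consumed message immediately, the receiver accumulates consumed bytes
locally and wakes the sender only once a sizeable chunk of the ring
has been freed.

#include "postgres.h"
#include "storage/latch.h"
#include "storage/proc.h"

/* hypothetical receiver-side state */
typedef struct sketch_mq_receiver
{
    PGPROC     *sender;             /* process to wake when ring space frees up */
    Size        ring_size;
    Size        consumed_locally;   /* bytes read but not yet published */
} sketch_mq_receiver;

/* hypothetical: publish 'nbytes' of freed ring space to the shared counter */
extern void sketch_mq_publish_bytes_read(sketch_mq_receiver *r, Size nbytes);

static void
sketch_mq_after_consume(sketch_mq_receiver *r, Size nbytes)
{
    r->consumed_locally += nbytes;

    /* publish, and wake the sender, only once per sizeable batch */
    if (r->consumed_locally >= r->ring_size / 4)
    {
        sketch_mq_publish_bytes_read(r, r->consumed_locally);
        SetLatch(&r->sender->procLatch);    /* one SetLatch per batch */
        r->consumed_locally = 0;
    }
}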


> All that having been said, a batch variant for reading tuples in bulk
> might make sense.  I 

Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-18 Thread Robert Haas
On Tue, Oct 17, 2017 at 5:39 PM, Andres Freund  wrote:
> The precise query doesn't really matter, the observations here are more
> general, I hope.
>
> 1) nodeGather.c re-projects every row from workers. As far as I can tell
>that's now always exactly the same targetlist as it's coming from the
>worker. Projection capability was added in 8538a6307049590 (without
>checking whether it's needed afaict), but I think it was in turn often
>obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7.  My
>measurement shows that removing the projection yields quite massive
>speedups in queries like yours, which is not too surprising.

That seems like an easy and worthwhile optimization.

>I suspect this just needs a tlist_matches_tupdesc check + an if
>around ExecProject(). And a test, right now tests pass unless
>force_parallel_mode is used even if just commenting out the
>projection unconditionally.

So, for this to fail, we'd need a query that uses parallelism but
where the target list contains a parallel-restricted function.  Also,
the function should really be such that we'll reliably get a failure
rather than only with some small probability.  I'm not quite sure how
to put together such a test case, but there's probably some way.
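
For reference, the change being discussed would look roughly like the
sketch below; it is not an actual patch, and it assumes a flag such as
need_projection computed once at node initialisation time, e.g. via a
tlist_matches_tupdesc-style check (today that helper is static in
execScan.c).

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/execnodes.h"

/* as declared in nodeGather.c, where this sketch would live */
static TupleTableSlot *gather_getnext(GatherState *gatherstate);

/*
 * Sketch of ExecGather()'s tail end with the projection made conditional.
 * need_projection is a hypothetical field that ExecInitGather() would set
 * based on comparing the node's targetlist with the worker tuple descriptor.
 */
static TupleTableSlot *
ExecGather_sketch(GatherState *node)
{
    ExprContext *econtext = node->ps.ps_ExprContext;
    TupleTableSlot *slot;

    slot = gather_getnext(node);
    if (TupIsNull(slot))
        return NULL;

    if (!node->need_projection)     /* hypothetical flag */
        return slot;                /* worker tuple already has the right shape */

    ResetExprContext(econtext);
    econtext->ecxt_outertuple = slot;
    return ExecProject(node->ps.ps_ProjInfo);
}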

> 2) The spinlocks on both the sending and receiving sides are quite hot:
>
>rafia query leader:
> +   36.16%  postgres  postgres[.] shm_mq_receive
> +   19.49%  postgres  postgres[.] s_lock
> +   13.24%  postgres  postgres[.] SetLatch
>
>To test that theory, here are the timings for
>1) spinlocks present
>   time: 6593.045
>2) spinlock acquisition replaced by *full* memory barriers, which on x86
> is a lock; addl $0,0(%%rsp)
>   time: 5159.306
>3) spinlocks replaced by read/write barriers as appropriate.
>   time: 4687.610
>
>By my understanding of shm_mq.c's logic, something like 3) ought to
>be doable in a correct manner. There should, in normal
>circumstances, be only one end modifying each of the protected
>variables. Doing so instead of using full-blown atomics with locked
>instructions is very likely to yield considerably better performance.

I think the sticking point here will be the handling of the
mq_detached flag.  I feel like I fixed a bug at some point where this
had to be checked or set under the lock at the same time we were
checking or setting mq_bytes_read and/or mq_bytes_written, but I don't
remember the details.  I like the idea, though.

Not sure what happened to #3 on your list... you went from #2 to #4.

> 4) Modulo computations in shm_mq are expensive:
>
>that we end up with a full blown div makes sense - the compiler can't
>know anything about ringsize, therefore it can't optimize to a mask.
>I think we should force the size of the ringbuffer to be a power of
>two, and use a mask instead of a size for the buffer.

This seems like it would require some redesign.  Right now we let the
caller choose any size they want and subtract off our header size to
find the actual queue size.  We can waste up to MAXALIGN-1 bytes, but
that's not much.  This would waste up to half the bytes provided,
which is probably not cool.

> 5) There's a *lot* of latch interactions. The biggest issue actually is
>the memory barrier implied by a SetLatch, waiting for the latch
>barely shows up.
>
>Commenting out the memory barrier - which is NOT CORRECT - improves
>timing:
>before: 4675.626
>after: 4125.587
>
>The correct fix obviously is not to break latch correctness. I think
>the big problem here is that we perform a SetLatch() for every read
>from the queue.

I think it's a little bit of an overstatement to say that commenting
out the memory barrier is not correct.  When we added that code, we
removed this comment:

- * Presently, when using a shared latch for interprocess signalling, the
- * flag variable(s) set by senders and inspected by the wait loop must
- * be protected by spinlocks or LWLocks, else it is possible to miss events
- * on machines with weak memory ordering (such as PPC).  This restriction
- * will be lifted in future by inserting suitable memory barriers into
- * SetLatch and ResetLatch.

It seems to me that in any case where the data is protected by an
LWLock, the barriers we've added to SetLatch and ResetLatch are
redundant.  I'm not sure if that's entirely true in the spinlock case,
because S_UNLOCK() is only documented to have release semantics, so
maybe the load of latch->is_set could be speculated before the lock is
dropped.  But I do wonder if we're just multiplying barriers endlessly
instead of trying to think of ways to minimize them (e.g. have a
variant of SpinLockRelease that acts as a full barrier instead of a
release barrier, and then avoid a second barrier when checking the
latch state).
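
To illustrate, such a variant might look roughly like this; it is only
a sketch, not an existing primitive, and it assumes S_UNLOCK() itself
provides no more than release semantics:

#include "postgres.h"
#include "storage/spin.h"
#include "port/atomics.h"

/*
 * Hypothetical full-barrier variant of SpinLockRelease(): the trailing
 * fence keeps later loads (such as latch->is_set) from being reordered
 * back across the unlock, so callers don't need a second explicit
 * barrier before checking the latch.
 */
static inline void
SpinLockReleaseFullBarrier(volatile slock_t *lock)
{
    SpinLockRelease(lock);
    pg_memory_barrier();
}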

All that having been said, a batch variant for reading tuples in 

Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-17 Thread Andres Freund
Hi,

On 2017-10-17 14:39:57 -0700, Andres Freund wrote:
> I've spent some time looking into this, and I'm not quite convinced this
> is the right approach.  Let me start by describing where I see the
> current performance problems around gather stemming from.

One further approach to several of these issues could also be to change
things a bit more radically:

Instead of the current shm_mq + tqueue.c, have a drastically simpler
queue that just stores fixed-width dsa_pointers. Dealing with that
queue will be quite a bit faster. In that queue one would store
dsa.c-managed pointers to tuples.

One thing that makes that attractive is that it would move a bunch of
copying out of the leader process and into the worker processes, because
the leader could just convert the dsa_pointer into a local pointer and
hand that upwards in the execution tree.

We'd possibly need some halfway clever way to reuse dsa allocations, but
that doesn't seem impossible.
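
To sketch the shape of such a queue (everything below is hypothetical
except dsa_pointer, InvalidDsaPointer and the pg_atomic_*/barrier
primitives): a single-producer/single-consumer ring of fixed-width
dsa_pointer values.

#include "postgres.h"
#include "port/atomics.h"
#include "utils/dsa.h"

/* hypothetical fixed-width SPSC queue living in shared memory */
typedef struct dp_queue
{
    uint64      mask;                   /* nentries - 1, nentries a power of two */
    pg_atomic_uint64 head;              /* next slot the worker (producer) fills */
    pg_atomic_uint64 tail;              /* next slot the leader (consumer) reads */
    dsa_pointer ring[FLEXIBLE_ARRAY_MEMBER];
} dp_queue;

/* worker side: returns false if the ring is currently full */
static bool
dp_queue_push(dp_queue *q, dsa_pointer tup)
{
    uint64      head = pg_atomic_read_u64(&q->head);
    uint64      tail = pg_atomic_read_u64(&q->tail);

    if (head - tail > q->mask)
        return false;                   /* full */
    q->ring[head & q->mask] = tup;
    pg_write_barrier();                 /* publish the payload before the head */
    pg_atomic_write_u64(&q->head, head + 1);
    return true;
}

/* leader side: returns InvalidDsaPointer if the ring is empty */
static dsa_pointer
dp_queue_pop(dp_queue *q)
{
    uint64      tail = pg_atomic_read_u64(&q->tail);
    dsa_pointer tup;

    if (pg_atomic_read_u64(&q->head) == tail)
        return InvalidDsaPointer;       /* empty */
    pg_read_barrier();                  /* don't read the slot before the head */
    tup = q->ring[tail & q->mask];
    pg_atomic_write_u64(&q->tail, tail + 1);
    return tup;
}

The leader would then call dsa_get_address() on each popped dsa_pointer
to obtain a backend-local pointer and hand the tuple up the execution
tree without another copy.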

Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-17 Thread Andres Freund
Hi Everyone,

On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> While analysing the performance of TPC-H queries for the newly developed
> parallel operators, viz. parallel index, bitmap heap scan, etc., we noticed
> that the time taken by the Gather node is significant. On investigation, we
> found that the current method copies each tuple to the shared queue and
> notifies the receiver. Since this copying is done in the shared queue, there
> is a lot of locking and latching overhead.
>
> So, in this POC patch I tried to copy all the tuples into a local queue, thus
> avoiding all the locks and latches. Once the local queue is filled to
> capacity, the tuples are transferred to the shared queue. Once all the
> tuples are transferred, the receiver is notified.
>
> With this patch I could see significant improvement in performance for
> simple queries,

I've spent some time looking into this, and I'm not quite convinced this
is the right approach.  Let me start by describing where I see the
current performance problems around gather stemming from.

The observations here are made using
select * from t where i < 3000 offset 2999 - 1;
with Rafia's data. That avoids slowdowns on the leader due to too many
rows printed out. Sometimes I'll also use
SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;
on tpch data to show the effects on wider tables.

The precise query doesn't really matter, the observations here are more
general, I hope.

1) nodeGather.c re-projects every row from workers. As far as I can tell
   that's now always exactly the same targetlist as it's coming from the
   worker. Projection capability was added in 8538a6307049590 (without
   checking whether it's needed afaict), but I think it was in turn often
   obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7.  My
   measurement shows that removing the projection yields quite massive
   speedups in queries like yours, which is not too surprising.

   I suspect this just needs a tlist_matches_tupdesc check + an if
   around ExecProject(). And a test, right now tests pass unless
   force_parallel_mode is used even if just commenting out the
   projection unconditionally.

   before commenting out nodeGather projection:

   rafia time: 8283.583
   rafia profile:
+   30.62%  postgres  postgres [.] shm_mq_receive
+   18.49%  postgres  postgres [.] s_lock
+   10.08%  postgres  postgres [.] SetLatch
-    7.02%  postgres  postgres [.] slot_deform_tuple
   - slot_deform_tuple
  - 88.01% slot_getsomeattrs
   ExecInterpExpr
   ExecGather
   ExecLimit

   lineitem time: 8448.468
   lineitem profile:
+   24.63%  postgres  postgres [.] slot_deform_tuple
+   24.43%  postgres  postgres [.] shm_mq_receive
+   17.36%  postgres  postgres [.] ExecInterpExpr
+    7.41%  postgres  postgres [.] s_lock
+    5.73%  postgres  postgres [.] SetLatch

after:
   rafia time: 6660.224
   rafia profile:
+   36.77%  postgres  postgres  [.] shm_mq_receive
+   19.33%  postgres  postgres  [.] s_lock
+   13.14%  postgres  postgres  [.] SetLatch
+    9.22%  postgres  postgres  [.] AllocSetReset
+    4.27%  postgres  postgres  [.] ExecGather
+    2.79%  postgres  postgres  [.] AllocSetAlloc

   lineitem time: 4507.416
   lineitem profile:
+   34.81%  postgres  postgres[.] shm_mq_receive
+   15.45%  postgres  postgres[.] s_lock
+   13.38%  postgres  postgres[.] SetLatch
+    9.87%  postgres  postgres[.] AllocSetReset
+    5.82%  postgres  postgres[.] ExecGather

   as quite clearly visible, avoiding the projection yields some major
   speedups.

   The following analysis here has the projection removed.

2) The spinlocks on both the sending and receiving sides are quite hot:

   rafia query leader:
+   36.16%  postgres  postgres[.] shm_mq_receive
+   19.49%  postgres  postgres[.] s_lock
+   13.24%  postgres  postgres[.] SetLatch

   The presence of s_lock shows us that we're clearly often contending
   on spinlocks, given that's the slow-path for SpinLockAcquire(). In
   shm_mq_receive the instruction profile shows:

   │   SpinLockAcquire(&mq->mq_mutex);
   │1 5ab:   mov    $0xa9b580,%ecx
   │         mov    $0x4a4,%edx
   │         mov    $0xa9b538,%esi
   │         mov    %r15,%rdi
   │   → callq  s_lock
   │   ↑ jmpq   2a1
   │   tas():
   │1 5c7:   mov    $0x1,%eax
 32.83 │   lock   xchg %al,(%r15)
   │   shm_mq_inc_bytes_read():
   │   SpinLockAcquire(&mq->mq_mutex);
and
  0.01 │   pop    %r15
  0.04 │   ← retq
   │   nop
   │   tas():
   │1 338:   mov    $0x1,%eax
 17.59 │   lock   xchg %al,(%r15)
   │

Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-16 Thread Rafia Sabih
On Tue, Oct 17, 2017 at 3:22 AM, Andres Freund  wrote:
> Hi Rafia,
>
> On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
>> head:
>> explain  analyse select * from t where i < 3000;
>>  QUERY PLAN
>
> Could you share how exactly you generated the data? Just so others can
> compare a bit better with your results?
>

Sure. I used generate_series(1, 1000);
Please find the attached script for the detailed steps.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


large_tbl_gen.sql
Description: Binary data



Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-16 Thread Andres Freund
Hi Rafia,

On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> head:
> explain  analyse select * from t where i < 3000;
>  QUERY PLAN

Could you share how exactly you generated the data? Just so others can
compare a bit better with your results?

Regards,

Andres




Re: [HACKERS] [POC] Faster processing at Gather node

2017-09-11 Thread Alexander Kuzmenkov
Thanks Rafia, Amit, now I understand the ideas behind the patch better. 
I'll see if I can look at the new one.


--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [POC] Faster processing at Gather node

2017-09-11 Thread Rafia Sabih
On Sat, Sep 9, 2017 at 8:14 AM, Amit Kapila  wrote:
> On Fri, Sep 8, 2017 at 11:07 PM, Alexander Kuzmenkov
>  wrote:
>> Hi Rafia,
>>
>> I like the idea of reducing locking overhead by sending tuples in bulk. The
>> implementation could probably be simpler: you could extend the API of shm_mq
>> to decouple notifying the receiver from actually putting data into the queue
>> (i.e., make shm_mq_notify_receiver public and make a variant of shm_mq_sendv
>> that doesn't send the notification).
>>
>
> Rafia can comment on details, but I would like to bring it to your
> notice that we need a kind of local buffer (queue) for gather merge
> processing as well, where the data needs to be fetched in order from
> the queues.  So there is always a chance that some of the workers have
> filled their queues while waiting for the master to extract the data.
> I think the patch posted by Rafia on the nearby thread [1] addresses
> both problems with one patch.
>
>
> [1] - 
> https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com
>

Thanks Alexander for your interest in this work. As rightly pointed
out by Amit, when experimenting with this patch we found that there
are cases when the master is busy and unable to read tuples from the
shared_queue, and the worker gets stuck as it cannot process tuples any
more. When experimenting along these lines, I found that Q12 of TPC-H
shows a great performance improvement when increasing
shared_tuple_queue_size [1].
It was then that we realised that merging this with the idea of giving
the illusion of a larger tuple queue size with a local queue [1] could
be more beneficial. To explain precisely what merging the two ideas
means: we now write tuples into the local_queue once the shared_queue
is full, and as soon as enough tuples have accumulated in the local
queue we copy them from the local to the shared queue in one memcpy
call. This gives good performance improvements in quite a few cases.
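
To give an idea of the flow, the send path in the worker looks roughly
like the sketch below (the LocalTupleQueue type and the local_queue_*
helpers are illustrative names rather than the exact functions in the
patch; shm_mq_send() is the only real API used):

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/shm_mq.h"

/* hypothetical backend-local staging buffer for already-framed tuples */
typedef struct LocalTupleQueue
{
    char       *buf;            /* palloc'd staging area */
    Size        used;           /* bytes currently staged */
    Size        flush_size;     /* flush once this much has accumulated */
} LocalTupleQueue;

/* hypothetical helpers */
extern void local_queue_append(LocalTupleQueue *lq, MinimalTuple tuple);
extern void local_queue_flush_to_shared(LocalTupleQueue *lq, shm_mq_handle *mqh);

static void
send_tuple_buffered(shm_mq_handle *mqh, LocalTupleQueue *lq, MinimalTuple tuple)
{
    /* fast path: nothing staged yet and the shared queue has room */
    if (lq->used == 0 &&
        shm_mq_send(mqh, tuple->t_len, tuple, true) == SHM_MQ_SUCCESS)
        return;

    /* shared queue full (or already buffering): stage the tuple locally */
    local_queue_append(lq, tuple);

    if (lq->used >= lq->flush_size)
    {
        /*
         * Copy all staged messages into the shared ring in one go and
         * notify the receiver once, instead of once per tuple.
         * (Detach/error handling elided.)
         */
        local_queue_flush_to_shared(lq, mqh);
    }
}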

I'll be glad if you could have a look at this patch and enlighten me
with your suggestions. :-)

[1] - 
https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/




Re: [HACKERS] [POC] Faster processing at Gather node

2017-09-08 Thread Amit Kapila
On Fri, Sep 8, 2017 at 11:07 PM, Alexander Kuzmenkov
 wrote:
> Hi Rafia,
>
> I like the idea of reducing locking overhead by sending tuples in bulk. The
> implementation could probably be simpler: you could extend the API of shm_mq
> to decouple notifying the receiver from actually putting data into the queue
> (i.e., make shm_mq_notify_receiver public and make a variant of shm_mq_sendv
> that doesn't send the notification).
>

Rafia can comment on details, but I would like to bring it to your
notice that we need a kind of local buffer (queue) for gather merge
processing as well, where the data needs to be fetched in order from
the queues.  So there is always a chance that some of the workers have
filled their queues while waiting for the master to extract the data.
I think the patch posted by Rafia on the nearby thread [1] addresses
both problems with one patch.


[1] - 
https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] Faster processing at Gather node

2017-09-08 Thread Alexander Kuzmenkov

Hi Rafia,

I like the idea of reducing locking overhead by sending tuples in bulk. 
The implementation could probably be simpler: you could extend the API 
of shm_mq to decouple notifying the receiver from actually putting data 
into the queue (i.e., make shm_mq_notify_receiver public and make a 
variant of shm_mq_sendv that doesn't send the notification). From Amit's 
letter I understand that you have already tried something along these 
lines and the performance wasn't good. What was the bottleneck then? If 
it's the locking around mq_bytes_read/written, it can be rewritten with 
atomics. I think it would be great to try this approach because it 
doesn't add much code, doesn't add any additional copying and improves 
shm_mq performance in general.
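
To spell out the API shape I have in mind (hypothetical signatures;
shm_mq_notify_receiver() currently exists only as a static helper
inside shm_mq.c, and no no-notify variant of shm_mq_sendv() exists
today):

#include "postgres.h"
#include "storage/shm_mq.h"

/* hypothetical additions to shm_mq's public API */
extern shm_mq_result shm_mq_sendv_nonotify(shm_mq_handle *mqh,
                                           shm_mq_iovec *iov, int iovcnt,
                                           bool nowait);
extern void shm_mq_notify_receiver(shm_mq_handle *mqh);   /* today a static helper */

/* caller side: queue a batch of tuples, then wake the leader once */
static void
send_batch(shm_mq_handle *mqh, shm_mq_iovec *iovs, int ntuples)
{
    int         i;

    for (i = 0; i < ntuples; i++)
        (void) shm_mq_sendv_nonotify(mqh, &iovs[i], 1, true);

    shm_mq_notify_receiver(mqh);    /* one SetLatch for the whole batch */
}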


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [POC] Faster processing at Gather node

2017-05-19 Thread Amit Kapila
On Fri, May 19, 2017 at 5:58 PM, Robert Haas  wrote:
> On Fri, May 19, 2017 at 7:55 AM, Rafia Sabih
>  wrote:
>> While analysing the performance of TPC-H queries for the newly developed
>> parallel operators, viz. parallel index, bitmap heap scan, etc., we noticed
>> that the time taken by the Gather node is significant. On investigation, we
>> found that the current method copies each tuple to the shared queue and
>> notifies the receiver. Since this copying is done in the shared queue, there
>> is a lot of locking and latching overhead.
>>
>> So, in this POC patch I tried to copy all the tuples into a local queue, thus
>> avoiding all the locks and latches. Once the local queue is filled to
>> capacity, the tuples are transferred to the shared queue. Once all the
>> tuples are transferred, the receiver is notified.
>
> What if, instead of doing this, we switched the shm_mq stuff to use atomics?
>

That is one of the very first things we tried, but it didn't show
any improvement, probably because sending tuples one by one over
shm_mq is not cheap.  Also, independently, we tried to reduce the
frequency of SetLatch (used to notify the receiver), but that didn't
improve the results either.  Now, I think one thing that can be tried
is to use atomics in shm_mq and reduce the frequency of notifying the
receiver, but I am not sure whether that can give us better results
than this idea.  There are a couple of other ideas that have been tried
to improve the speed of Gather, like avoiding the extra copy of the
tuple that we need to make before sending it
(tqueueReceiveSlot->ExecMaterializeSlot) and increasing the tuple
queue length, but none of those has shown any noticeable
improvement.  I am aware of all this because Dilip and I were involved
offlist in brainstorming ideas with Rafia to improve the speed of
Gather.  I think it might have been better to show the results of the
ideas that didn't work out, but I guess Rafia hasn't shared those on
the intuition that nobody would be interested in hearing what didn't
work out.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] Faster processing at Gather node

2017-05-19 Thread Robert Haas
On Fri, May 19, 2017 at 7:55 AM, Rafia Sabih
 wrote:
> While analysing the performance of TPC-H queries for the newly developed
> parallel operators, viz. parallel index, bitmap heap scan, etc., we noticed
> that the time taken by the Gather node is significant. On investigation, we
> found that the current method copies each tuple to the shared queue and
> notifies the receiver. Since this copying is done in the shared queue, there
> is a lot of locking and latching overhead.
>
> So, in this POC patch I tried to copy all the tuples into a local queue, thus
> avoiding all the locks and latches. Once the local queue is filled to
> capacity, the tuples are transferred to the shared queue. Once all the
> tuples are transferred, the receiver is notified.

What if, instead of doing this, we switched the shm_mq stuff to use atomics?
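
Roughly, that would mean something like the sketch below (a simplified
stand-in for the real shm_mq struct; pg_atomic_uint64 and the barrier
macros are the existing APIs, everything else is illustrative):

#include "postgres.h"
#include "port/atomics.h"

/* simplified: the spinlock-protected counters become lock-free atomics */
typedef struct atomic_mq_sketch
{
    pg_atomic_uint64 mq_bytes_read;     /* advanced only by the receiver */
    pg_atomic_uint64 mq_bytes_written;  /* advanced only by the sender */
    Size        mq_ring_size;
    char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
} atomic_mq_sketch;

/* sender: publish freshly copied bytes without taking the spinlock */
static void
atomic_mq_publish(atomic_mq_sketch *mq, Size nbytes)
{
    uint64      written = pg_atomic_read_u64(&mq->mq_bytes_written);

    /* ring contents must be visible before the counter moves */
    pg_write_barrier();
    pg_atomic_write_u64(&mq->mq_bytes_written, written + nbytes);
}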

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

