Re: [HACKERS] Should we cacheline align PGXACT?

2017-04-03 Thread Jim Van Fleet
pgsql-hackers-ow...@postgresql.org wrote on 04/03/2017 01:58:03 PM:

> From: Andres Freund <and...@anarazel.de>
> To: Alexander Korotkov <a.korot...@postgrespro.ru>
> Cc: David Steele <da...@pgmasters.net>, Ashutosh Sharma 
> <ashu.coe...@gmail.com>, Simon Riggs <si...@2ndquadrant.com>, Alvaro
> Herrera <alvhe...@2ndquadrant.com>, Robert Haas 
> <robertmh...@gmail.com>, Bernd Helmle <maili...@oopsware.de>, Tomas 
> Vondra <tomas.von...@2ndquadrant.com>, pgsql-hackers  hack...@postgresql.org>
> Date: 04/03/2017 01:59 PM
> Subject: Re: [HACKERS] Should we cacheline align PGXACT?
> Sent by: pgsql-hackers-ow...@postgresql.org
> 
> On 2017-03-25 19:35:35 +0300, Alexander Korotkov wrote:
> > On Wed, Mar 22, 2017 at 12:23 AM, David Steele <da...@pgmasters.net> wrote:
> > 
> > > Hi Alexander
> > >
> > > On 3/10/17 8:08 AM, Alexander Korotkov wrote:
> > >
> > > Results look good for me.  Idea of committing both of patches looks
> > >> attractive.
> > >> We have pretty much acceleration for read-only case and small
> > >> acceleration for read-write case.
> > >> I'll run benchmark on 72-cores machine as well.
> > >>
> > >
> > > Have you had a chance to run those tests yet?
> > >
> > 
> > I discovered an interesting issue.
> > I found that ccce90b3 (which was reverted) gives almost same effect as
> > PGXACT alignment on read-only test on 72-cores machine.
> 
> That's possibly because it changes alignment?
> 
> 
> > That shouldn't be related to the functionality of ccce90b3 itself, because
> > read-only test don't do anything with clog.  And that appears to be true.
> > Padding of PGPROC gives same positive effect as ccce90b3.  Padding patch
> > (pgproc-pad.patch) is attached.  It's curious that padding changes size of
> > PGPROC from 816 bytes to 848 bytes.  So, size of PGPROC remains 16-byte
> > aligned.  So, probably effect is related to distance between PGPROC
> > members...
> > 
> > See comparison of 16-bytes alignment of PGXACT + reduce PGXACT access vs.
> > padding of PGPROC.
> 
> My earlier testing had shown that padding everything is the best
> approach :/
>
My approach has generally been to pad "everything" as well.  In my 
testing on POWER, I padded PGXACT to 16 bytes.  To my surprise, with that 
padding in isolation, performance (on hammerdb) was slightly degraded.
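
As a rough illustration of what padding changes, the sketch below counts how many elements of a densely packed array cross a cache-line boundary. The 12-byte unpadded element size and the 64- and 128-byte line sizes are assumptions made for the example, not figures from this thread; only the 16-byte padded size comes from the message above. The helper name is hypothetical.

```python
def straddlers(elem_size, line_size, count):
    """Count elements of a packed array that cross a cache-line boundary.

    Assumes the array itself starts on a cache-line boundary.
    """
    crossing = 0
    for i in range(count):
        first_byte = i * elem_size
        last_byte = first_byte + elem_size - 1
        if first_byte // line_size != last_byte // line_size:
            crossing += 1
    return crossing


# Assumed sizes: 12 bytes unpadded (hypothetical), 16 bytes padded.
for line_size in (64, 128):
    for elem_size in (12, 16):
        n = straddlers(elem_size, line_size, 96)
        print(f"line={line_size} elem={elem_size}: {n} of 96 straddle a line")
```

This only models elements that span two lines. It does not model false sharing between neighboring elements, which is a separate effect and may matter more in practice.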

Jim Van Fleet 
> 
> I'm inclined to push this to the next CF, it seems we need a lot more
> benchmarking here.
> 
> Greetings,
> 
> Andres Freund
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
> 




Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-07 Thread Jim Van Fleet
Amit Kapila  wrote on 06/07/2017 07:34:06 AM:

...

> > The down side is that on smaller configurations (single socket) where there
> > is less "lock thrashing" in the storage subsystem and there are multiple
> > Lwlocks to take for an exclusive acquire, there is a decided downturn in
> > performance. On  hammerdb, the prototype was 6% worse than the base on a
> > single socket power configuration.
> >
> 
> I think any patch having 6% regression on one machine configuration
> and 16% improvement on another machine configuration is not a net win.
> However, if there is a way to address the regression, then it will
> look much more attractive.

I have to agree.
> 
> > If there is interest in this approach, I will submit a patch.
> >
> 
> The basic idea is clear from your description, but it will be better
> if you share the patch as well.  It will not only help people to
> review and provide you feedback but also allow them to test and see if
> they can reproduce the numbers you have mentioned in the mail.

OK -- would love the feedback and any suggestions on how to mitigate the 
low end problems.
> 
> There is some related work which was previously proposed in this area
> ("Cache the snapshot") [1] and it claims to reduce contention around
> ProcArrayLock.  I am not sure if that patch still applies, however, if
> you find it relevant and you are interested in evaluating the same,
> then we can request the author to post a rebased version if it doesn't
> apply.

Sokolov Yura has a patch which, to me, looks good for pgbench rw 
performance.  Does not do so well with hammerdb (about the same as base) 
on single socket and two socket.


> 
> [1] - https://www.postgresql.org/message-id/CAD__OuiwEi5sHe2wwQCK36Ac9QMhvJuqG3CfPN%2BOFCMb7rdruQ%40mail.gmail.com
> 
> -- 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
> 




Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-07 Thread Jim Van Fleet
Robert Haas  wrote on 06/07/2017 12:12:02 PM:


> > OK -- would love the feedback and any suggestions on how to mitigate the low
> > end problems.
> 
> Did you intend to attach a patch?
Yes I do -- tomorrow or Thursday -- needs a little cleaning up ...

> > Sokolov Yura has a patch which, to me, looks good for pgbench rw
> > performance.  Does not do so well with hammerdb (about the same as base) on
> > single socket and two socket.
> 
> Any idea why?  I think we will have to understand *why* certain things
> help in some situations and not others, not just *that* they do, in
> order to come up with a good solution to this problem.
Looking at the data now -- the LWLockAcquire philosophy is different -- at 
first glance I would have guessed "about the same" as the base, but I 
cannot yet explain why we have super pgbench rw performance and "the same" 
hammerdb performance. 

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 




Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-08 Thread Jim Van Fleet
pgsql-hackers-ow...@postgresql.org wrote on 06/07/2017 04:06:57 PM:

...
> > 
> > Did you intend to attach a patch?
> Yes I do -- tomorrow or Thursday -- needs a little cleaning up ...
meant Friday

> 
> > > Sokolov Yura has a patch which, to me, looks good for pgbench rw
> > > performance.  Does not do so well with hammerdb (about the same as base) on
> > > single socket and two socket.
> > 
> > Any idea why?  I think we will have to understand *why* certain things
> > help in some situations and not others, not just *that* they do, in
> > order to come up with a good solution to this problem.
> Looking at the data now -- LWLockAcquire philosophy is different -- 
> at first glance I would have guessed "about the same" as the base, 
> but I can not yet explain why we have super pgbench rw performance 
> and "the same" hammerdb performance. 
(Data taken from perf cycles when I invoked the performance data gathering 
script, generally in the middle of the run.)
In hammerdb two socket, the ProcArrayLock is the bottleneck in 
LWLockAcquire (GetSnapshotData makes about 75% of the calls to 
LWLockAcquire). With Sokolov's patch, LWLockAcquire (with LWLockAttemptLock 
included) is a little over 9%; pgbench, on the other hand, has 
LWLockAcquire at 1.3%, with GetSnapshotData making only 11% of the calls to 
LWLockAcquire. 

What I think that means is that there is no ProcArrayLock bottleneck in 
pgbench. GetSnapshotData runs the entire proc chain of PGXACTs, so the lock 
is held a rather long time. I am guessing that the other locks are held a 
much shorter time; Sokolov's patch handles the other locks better because 
of spinning. We see much more time in LWLockAcquire with hammerdb because 
of the spinning -- with the ProcArrayLock, spinning does not help much 
because of the longer hold time.

The spin count is relatively high (100/2), so I made it much smaller 
(20/2) in the hopes that the spin would still handle the shorter hold time 
locks but not be a bother with long hold times.
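
To make the trade-off concrete, here is a minimal spin-then-block sketch. It is not LWLockAcquire and not Sokolov's patch; the function name and the spin_limit parameter are hypothetical, and the sketch does not try to reproduce the "100/2" or "20/2" settings mentioned above. It only shows the general shape: a bounded number of cheap non-blocking attempts, then a blocking wait.

```python
import threading


def acquire_with_spin(lock, spin_limit):
    """Try the lock up to spin_limit times without blocking, then block.

    Returns the number of failed non-blocking attempts before the lock was
    obtained (0 means the first try succeeded).
    """
    for attempt in range(spin_limit):
        if lock.acquire(blocking=False):
            return attempt
    lock.acquire()  # spinning did not pay off: fall back to a blocking wait
    return spin_limit


lock = threading.Lock()
failed_tries = acquire_with_spin(lock, spin_limit=20)  # uncontended case
print("failed tries before success:", failed_tries)
lock.release()
```

With short hold times the loop tends to succeed before the fallback is reached; with long hold times every spin is wasted work, which matches the reasoning given for the ProcArrayLock above.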

Running pgbench with 96 users, the throughput was slightly lower at 70K tps 
vs 75K tps (vs base of 40K tps at 96 threads and peak of 58K at 64 threads); 
hammerdb two socket was slightly better (about 3%) than the peak base.

What all this tells me is that LWLockAcquire would (probably) benefit from 
some spinning.
> 
> > 
> > -- 
> > Robert Haas
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> > 




Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-06 Thread Jim Van Fleet
Hi Sokolov --

I tried your patch. I only had time for doing a few points on power8. 
pgbench rw  on two sockets is awesome! Keeps getting more throughput as 
threads are added -- in contrast to base and my prototype. I did not run 
single socket pgbench.

Hammerdb, 1 socket was in the same ballpark as the base, but slightly 
lower. 2 socket was also in the same ballpark as the base, again slightly 
lower.  I did not do a series of points (just one at the previous "sweet 
spot"), so the "final" results may be better. The ProcArrayLock multiple 
parts was lower except in two socket case. The performance data I 
collected for your patch on hammerdb showed the same sort of issues as 
the base.

I don't see much point in combining the two because of the ProcArrayLock 
down side -- that is, poor single socket performance. Unless we could 
come up with some heuristic to use one part on light loads and two on 
heavy (and still stay correct), then I don't see it ... With the 
combination, what I think we would see is awesome pgbench rw, awesome 
hammerdb 2 socket performance, and degraded single socket hammerdb.

Jim



From:    Sokolov Yura <y.soko...@postgrespro.ru>
To:      Jim Van Fleet <vanfl...@us.ibm.com>
Cc:      pgsql-hackers@postgresql.org
Date:    06/05/2017 03:28 PM
Subject: Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts
Sent by: pgsql-hackers-ow...@postgresql.org



Excuse me, Jim.

I was tired and misunderstood the proposal: I thought of ProcArray sharding, 
but the proposal is about ProcArrayLock sharding.

BTW, I just posted an improvement to LWLock:

https://www.postgresql.org/message-id/2968c0be065baab8865c4c95de3f435c%40postgrespro.ru

Would you mind testing against that, and together with that?

On June 5, 2017 at 11:11 PM, Sokolov Yura <y.soko...@postgrespro.ru> wrote:
Hi, Jim.

How do you ensure transaction order?

Example:
- You lock shard A and gather info. You find transaction T1 in progress.
- Then you unlock shard A.
- T1 completes. T2, which depends on T1, also completes. But T2 was on 
shard B.
- You lock shard B and gather info from it.
- You didn't see T2 as in progress, so you will then look it up in clog 
and find it committed.

Now you see T2 as committed, but T1 as in progress - a clear violation of 
transaction order.
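
The scenario above can be replayed with a small single-threaded model. This models the shard-by-shard scan that the question assumes, not the design actually proposed in this thread; the dictionaries, the "clog" mapping, and every function name are hypothetical stand-ins.

```python
IN_PROGRESS, COMMITTED = "in-progress", "committed"


def scan(shard):
    """Return the xids that are in progress in one shard right now."""
    return {xid for xid, state in shard.items() if state == IN_PROGRESS}


def visible_state(xid, snapshot, clog):
    """A transaction in the snapshot is treated as running; otherwise ask clog."""
    return IN_PROGRESS if xid in snapshot else clog[xid]


shard_a = {"T1": IN_PROGRESS}
shard_b = {"T2": IN_PROGRESS}
clog = {"T1": IN_PROGRESS, "T2": IN_PROGRESS}

snapshot = scan(shard_a)              # lock shard A, gather, unlock
for xid, shard in (("T1", shard_a), ("T2", shard_b)):
    shard[xid] = COMMITTED            # T1 commits, then T2 (which depends on T1)
    clog[xid] = COMMITTED
snapshot |= scan(shard_b)             # lock shard B, gather, unlock

t1_seen = visible_state("T1", snapshot, clog)
t2_seen = visible_state("T2", snapshot, clog)
print("T1:", t1_seen, "T2:", t2_seen)
```

The snapshot ends up treating T1 as running while T2 resolves as committed through the clog lookup, which is the ordering violation described in the list.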

Probably you've already solved this issue. If so it would be great to 
learn the solution.


On June 5, 2017 at 10:30 PM, Jim Van Fleet <vanfl...@us.ibm.com> wrote:
Hi,

I have been experimenting with splitting the ProcArrayLock into parts. 
That is, to acquire the ProcArrayLock in shared mode, it is only 
necessary to acquire one of the parts in shared mode; to acquire the lock 
in exclusive mode, all of the parts must be acquired in exclusive mode. 
For those interested, I have attached a design description of the change.

This approach has been quite successful on large systems with the hammerdb 
benchmark. With a prototype based on 10 master source and running on power8 
(model 8335-GCA with 2 sockets, 20 core), hammerdb improved by 16%; on 
Intel (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 2 socket, 44 core) with 
9.6 base and prototype, hammerdb improved by 4%. (Attached is a set of 
spreadsheets for power8.)

The down side is that on smaller configurations (single socket) where 
there is less "lock thrashing" in the storage subsystem and there are 
multiple LWLocks to take for an exclusive acquire, there is a decided 
downturn in performance. On hammerdb, the prototype was 6% worse than the 
base on a single socket power configuration.

If there is interest in this approach, I will submit a patch.

Jim Van Fleet









[HACKERS] HACKERS[PATCH] split ProcArrayLock into multiple parts

2017-06-09 Thread Jim Van Fleet
I left out the retry in LWLockAcquire.





ProcArrayLock_part.patch
Description: Binary data



[HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-05 Thread Jim Van Fleet
Hi,

I have been experimenting with splitting the ProcArrayLock into parts. 
That is, to acquire the ProcArrayLock in shared mode, it is only necessary 
to acquire one of the parts in shared mode; to acquire the lock in 
exclusive mode, all of the parts must be acquired in exclusive mode. For 
those interested, I have attached a design description of the change.
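
The shared/exclusive rule just described can be sketched as follows. This is an illustration under stated assumptions, not the attached design or the prototype: the RWLock and PartitionedLock classes are hypothetical, and choosing a part by backend_id modulo the number of parts is an assumption of the sketch.

```python
import threading


class RWLock:
    """Minimal reader-writer lock (no fairness guarantees)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class PartitionedLock:
    """Shared mode takes one part; exclusive mode takes every part."""

    def __init__(self, n_parts):
        self.parts = [RWLock() for _ in range(n_parts)]

    def acquire_shared(self, backend_id):
        # Assumption: each backend maps to a fixed part by its id.
        part = self.parts[backend_id % len(self.parts)]
        part.acquire_shared()
        return part

    def release_shared(self, part):
        part.release_shared()

    def acquire_exclusive(self):
        for part in self.parts:  # fixed order, so two writers cannot deadlock
            part.acquire_exclusive()

    def release_exclusive(self):
        for part in reversed(self.parts):
            part.release_exclusive()


lock = PartitionedLock(n_parts=2)
part = lock.acquire_shared(backend_id=3)   # touches only parts[1]
lock.release_shared(part)
lock.acquire_exclusive()                   # touches parts[0] and parts[1]
lock.release_exclusive()
```

Readers on different parts no longer contend on the same lock word, while an exclusive acquire pays for one acquisition per part. With n_parts=1 the structure degenerates to a single reader-writer lock.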

This approach has been quite successful on large systems with the hammerdb 
benchmark. With a prototype based on 10 master source and running on 
power8 (model 8335-GCA with 2 sockets, 20 core), hammerdb improved by 16%; 
on Intel (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 2 socket, 44 core) 
with 9.6 base and prototype, hammerdb improved by 4%. (Attached is a set of 
spreadsheets for power8.)

The down side is that on smaller configurations (single socket) where 
there is less "lock thrashing" in the storage subsystem and there are 
multiple LWLocks to take for an exclusive acquire, there is a decided 
downturn in performance. On hammerdb, the prototype was 6% worse than the 
base on a single socket power configuration.

If there is interest in this approach, I will submit a patch.

Jim Van Fleet





compare_10base_toPrototype.ods
Description: Binary data


patchDescription.odt
Description: Binary data



Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-05 Thread Jim Van Fleet
NP, Sokolov --

pgsql-hackers-ow...@postgresql.org wrote on 06/05/2017 03:26:46 PM:

> From: Sokolov Yura <y.soko...@postgrespro.ru>
> To: Jim Van Fleet <vanfl...@us.ibm.com>
> Cc: pgsql-hackers@postgresql.org
> Date: 06/05/2017 03:28 PM
> Subject: Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into 
> multiple parts
> Sent by: pgsql-hackers-ow...@postgresql.org
> 
> Excuse me, Jim.
> 
> I was tired and misunderstood the proposal: I thought of ProcArray 
> sharding, but the proposal is about ProcArrayLock sharding.
> 
> BTW, I just posted an improvement to LWLock:
> 
> https://www.postgresql.org/message-id/2968c0be065baab8865c4c95de3f435c%40postgrespro.ru
> 
> Would you mind testing against that, and together with that?

I will give them a try ..

Jim



Fw: [HACKERS] HACKERS[PATCH] split ProcArrayLock into multiple parts -- follow-up

2017-09-21 Thread Jim Van Fleet
Howdy --

Not to beat on a dead horse, or anything, but this fix was frowned upon 
because in one environment (one socket) it was 6% down and over 15% up in 
the right environment (two sockets).

So, why not add a configuration parameter which specifies the number of 
parts? Default is 1 which would be "exactly" the same as no parts and 
hence no degradation in the single socket environment -- and with 2, you 
get some positive performance.

Jim
----- Forwarded by Jim Van Fleet/Austin/Contr/IBM on 09/21/2017 03:37 PM -----

pgsql-hackers-ow...@postgresql.org wrote on 06/09/2017 01:39:35 PM:

> From: "Jim Van Fleet" <vanfl...@us.ibm.com>
> To: "Pgsql Hackers" <pgsql-hackers@postgresql.org>
> Date: 06/09/2017 01:41 PM
> Subject: [HACKERS] HACKERS[PATCH] split ProcArrayLock into multiple parts
> Sent by: pgsql-hackers-ow...@postgresql.org
> 
> I left out the retry in LWLockAcquire.
> 
> [attachment "ProcArrayLock_part.patch" deleted by Jim Van Fleet/
> Austin/Contr/IBM] 




Re: Fw: [HACKERS] HACKERS[PATCH] split ProcArrayLock into multiple parts -- follow-up

2017-09-21 Thread Jim Van Fleet
> On 2017-09-21 15:51:54 -0500, Jim Van Fleet wrote:
> > Not to beat on a dead horse, or anything, but this fix was frowned upon
> > because in one environment (one socket) it was 6% down and over 15% up in
> > the right environment (two sockets).
> 
> > So, why not add a configuration parameter which specifies the number of
> > parts? Default is 1 which would be "exactly" the same as no parts and
> > hence no degradation in the single socket environment -- and with 2, you
> > get some positive performance.
> 
> Several reasons:
> 
> - You'd either add a bunch of branches into performance-critical
>   parts, or you'd add a compile time flag, which most people would be
>   unable to toggle.
I agree, no compile time flags -- but there would be no extra testing in 
the main path either -- the value gets set at init and is not changed from 
there.
> - It'd be something hard to tune, because even on multi-socket machines
>   it'll be highly load dependent. E.g. workloads that largely are
>   bottlenecked in a single backend / few backends will probably regress
>   as well.
Workloads are hard to tune -- with the default, you have what you have 
today. If you "know" the issue is ProcArrayLock, then you have an 
alternative to try.
> 
> FWIW, you started a new thread with this message, that doesn't seem
> helpful?

Sorry about that -- my mistake.

Jim



Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-05 Thread Jim Van Fleet
Ran this change with hammerdb on a power 8 firestone

With 2 socket, 20 core
9.6 base    -- 451991 NOPM
0926_master -- 464385 NOPM
11_04master -- 449177 NOPM
11_04_patch -- 431423 NOPM
-- two socket patch is a little down from previous base runs

With one socket
9.6 base    -- 393727 NOPM
v10rc1_base -- 350958 NOPM
11_04master -- 306506 NOPM
11_04_patch -- 313179 NOPM
-- one socket 11_04 master is quite a bit down from 9.6 and v10rc1_base
-- the patch is up a bit over the base

Net -- the patch is about the same as current base on two socket, and on 
one socket -- consistent with your pgbench (?) findings

As an aside, it is perhaps a worry that one socket is down over 20% from 
9.6 and over 10% from v10rc1
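
For readers who want to check the percentages, the relative changes can be recomputed from the NOPM figures above. The helper name and the dictionary layout are only for illustration; the numbers are copied from this message.

```python
def pct_change(new, old):
    """Percentage change of new relative to old (negative means slower)."""
    return (new - old) / old * 100.0


two_socket = {"9.6 base": 451991, "0926_master": 464385,
              "11_04master": 449177, "11_04_patch": 431423}
one_socket = {"9.6 base": 393727, "v10rc1_base": 350958,
              "11_04master": 306506, "11_04_patch": 313179}

master_vs_96 = pct_change(one_socket["11_04master"], one_socket["9.6 base"])
master_vs_v10 = pct_change(one_socket["11_04master"], one_socket["v10rc1_base"])
patch_vs_master_1s = pct_change(one_socket["11_04_patch"], one_socket["11_04master"])
patch_vs_master_2s = pct_change(two_socket["11_04_patch"], two_socket["11_04master"])

print(f"one socket, 11_04master vs 9.6 base:    {master_vs_96:+.1f}%")
print(f"one socket, 11_04master vs v10rc1_base: {master_vs_v10:+.1f}%")
print(f"one socket, patch vs 11_04master:       {patch_vs_master_1s:+.1f}%")
print(f"two socket, patch vs 11_04master:       {patch_vs_master_2s:+.1f}%")
```

The one-socket regressions come out at roughly 22% against 9.6 and roughly 13% against v10rc1, in line with the "over 20%" and "over 10%" stated above.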

Jim

pgsql-hackers-ow...@postgresql.org wrote on 11/04/2017 06:08:31 AM:

> On hydra (PPC), these changes didn't help much.  Timings:
> 
> master: 29605.582, 29753.417, 30160.485
> patch: 28218.396, 27986.951, 26465.584
> 
> That's about a 5-6% improvement.  On my MacBook, though, the
> improvement was quite a bit more:
> 
> master: 21436.745, 20978.355, 19918.617
> patch: 15896.573, 15880.652, 15967.176
> 
> Median-to-median, that's about a 24% improvement.
> 
> Any reviews appreciated.
> 
> Thanks,
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> [attachment "shm-mq-less-spinlocks-v1.2.patch" deleted by Jim Van 
> Fleet/Austin/Contr/IBM] 




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Jim Van Fleet
Hi --

pgsql-hackers-ow...@postgresql.org wrote on 11/06/2017 09:47:22 AM:

> From: Andres Freund <and...@anarazel.de>

> 
> Hi,
> 
> Please don't top-quote on postgresql lists.
Sorry 
> 
> On 2017-11-06 09:44:24 -0600, Jim Van Fleet wrote:
> > > >hammerdb, in this configuration, runs a variant of tpcc
> > > 
> > > Hard to believe that any of the changes here are relevant in that 
> > > case - this is parallelism specific stuff. Whereas tpcc is oltp, right?
> 
> > correct
> 
> In that case, could you provide before/after profiles of the performance
> changing runs?
sure -- happy to share -- the gzipped files (which include trace, perf, 
netstat, system data) are large (9G and 13G).
Should I post them on the list or somewhere else (or trim them -- if so, 
what would you like to have?) 
> 
Jim




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Jim Van Fleet
Andres Freund  wrote on 11/05/2017 03:40:15 PM:

hammerdb, in this configuration, runs a variant of tpcc
> 
> What query(s) did you measure?
> 
> Andres
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
> 




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Jim Van Fleet
correct

> >hammerdb, in this configuration, runs a variant of tpcc
> 
> Hard to believe that any of the changes here are relevant in that 
> case - this is parallelism specific stuff. Whereas tpcc is oltp, right?
> 
> Andres
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>