Re: [HACKERS] Minor optimisation of XLogInsert()

2011-11-16 Thread Robert Haas
On Tue, Nov 15, 2011 at 6:24 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Patch adds cacheline padding around parts of XLogCtl.

 Idea from way back, but never got performance tested in a situation
 where WALInsertLock was the bottleneck.

 So this is speculative, in need of testing to investigate value.

I kicked off a round of benchmarking around this.  I'll post results
in a few hours when the run finishes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor optimisation of XLogInsert()

2011-11-16 Thread Robert Haas
On Wed, Nov 16, 2011 at 9:33 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 15, 2011 at 6:24 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Patch adds cacheline padding around parts of XLogCtl.

 Idea from way back, but never got performance tested in a situation
 where WALInsertLock was the bottleneck.

 So this is speculative, in need of testing to investigate value.

 I kicked off a round of benchmarking around this.  I'll post results
 in a few hours when the run finishes.

I thought that the best way to see any possible benefit of this patch
would be to use permanent tables (since unlogged tables won't
generate enough XLogInsert traffic to throttle the system) and lots
of clients.  So I ran tests at 32 clients and at 80 clients.  There
appeared to be a very small speedup.

resultswp.master.32:tps = 10275.238580 (including connections establishing)
resultswp.master.32:tps = 10231.634195 (including connections establishing)
resultswp.master.32:tps = 10220.907852 (including connections establishing)
resultswp.master.32:tps = 10115.534399 (including connections establishing)
resultswp.master.32:tps = 10048.268430 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10310.046371 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10269.839056 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10259.268030 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10242.861923 (including connections establishing)
resultswp.xloginsert-padding.32:tps = 10167.947496 (including connections establishing)

Taking the median of those five results, the patch seems to have sped
things up by 0.3%.  At 80 clients:

resultswp.master.80:tps = 7540.388586 (including connections establishing)
resultswp.master.80:tps = 7502.803369 (including connections establishing)
resultswp.master.80:tps = 7469.338925 (including connections establishing)
resultswp.master.80:tps = 7505.377256 (including connections establishing)
resultswp.master.80:tps = 7509.402704 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7541.305973 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7568.942936 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7533.128618 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7558.258839 (including connections establishing)
resultswp.xloginsert-padding.80:tps = 7562.585635 (including connections establishing)

Again taking the median of the five results, that's about a 0.7% speedup.

I also tried this over top of my flexlock patch.  I figured that the
benefit ought to be greater there, because with ProcArrayLock
contention reduced, WALInsertLock contention should be the whole
banana.  But:

resultswp.flexlock.32:tps = 11628.279679 (including connections establishing)
resultswp.flexlock.32:tps = 11556.254524 (including connections establishing)
resultswp.flexlock.32:tps = 11606.840931 (including connections establishing)
resultswp.flexlock-xloginsert-padding.32:tps = 11357.904459 (including connections establishing)
resultswp.flexlock-xloginsert-padding.32:tps = 11226.538104 (including connections establishing)
resultswp.flexlock-xloginsert-padding.32:tps = 11187.932415 (including connections establishing)

That's about a 3.3% slowdown.

resultswp.flexlock.80:tps = 11560.508772 (including connections establishing)
resultswp.flexlock.80:tps = 11673.255674 (including connections establishing)
resultswp.flexlock.80:tps = 11597.229252 (including connections establishing)
resultswp.flexlock-xloginsert-padding.80:tps = 11092.549726 (including connections establishing)
resultswp.flexlock-xloginsert-padding.80:tps = 11179.927207 (including connections establishing)
resultswp.flexlock-xloginsert-padding.80:tps = 2.771332 (including connections establishing)

That's even a little worse.

One possible explanation is that I don't believe that what you've done
here is actually sufficient to guarantee alignment on cache-line
boundary - ShmemAlloc() only guarantees MAXALIGN alignment unless the
allocation is at least BLCKSZ bytes.  So depending on exactly how much
shared memory gets allocated before XLogCtlData gets allocated, it's
possible that the change could actually end up splitting all of the
things you're trying to put on their own cache line across two cache
lines.  Might be worth fixing that and running this through its paces
again.

(I wonder if we shouldn't just align every shared memory allocation to
64 or 128 bytes.  It wouldn't waste much memory and it would make us
much more resistant to performance changes caused by minor
modifications to the shared memory layout.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Minor optimisation of XLogInsert()

2011-11-16 Thread Simon Riggs
On Wed, Nov 16, 2011 at 8:42 PM, Robert Haas robertmh...@gmail.com wrote:

 Taking the median of those five results, the patch seems to have sped
 things up by 0.3%.  At 80 clients:

Thanks for doing that. Even if we fix it as you suggest, it seems to
indicate that the WALInsertLock contention is real rather than false
contention. Which lends some clues as to how to proceed.

 One possible explanation is that I don't believe that what you've done
 here is actually sufficient to guarantee alignment on cache-line
 boundary - ShmemAlloc() only guarantees MAXALIGN alignment unless the
 allocation is at least BLCKSZ bytes.  So depending on exactly how much
 shared memory gets allocated before XLogCtlData gets allocated, it's
 possible that the change could actually end up splitting all of the
 things you're trying to put on their own cache line across two cache
 lines.  Might be worth fixing that and running this through its paces
 again.

 (I wonder if we shouldn't just align every shared memory allocation to
 64 or 128 bytes.  It wouldn't waste much memory and it would make us
 much more resistant to performance changes caused by minor
 modifications to the shared memory layout.)

I'll look at that, thanks for the suggestion.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Minor optimisation of XLogInsert()

2011-11-16 Thread Robert Haas
On Wed, Nov 16, 2011 at 5:15 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Nov 16, 2011 at 8:42 PM, Robert Haas robertmh...@gmail.com wrote:
 Taking the median of those five results, the patch seems to have sped
 things up by 0.3%.  At 80 clients:

 Thanks for doing that. Even if we fix it as you suggest it seems to
 indicate that the WALInsertLock contention is real rather than false
 contention. Which lends some clues as to how to proceed.

Sure thing.

My general impression from playing around with this over the last 6-7
months is that cache line sharing is just not that big a problem for
us.  In a case like WALInsertLock, I'm fairly certain that we're
actually putting processes to sleep on a regular basis - and the
overhead of a system call to go to sleep and another one to wake up
and the associated context switches dwarfs the cost of passing the
cache line around.  It's far more important to get rid of the sleeping
(which, sadly, is far harder than padding out the data structures).

There are some cases where the cache line really does seem to matter -
e.g. Pavan's PGPROC_MINIMAL patch delivers excellent results on
Itanium - but those seem to be fairly rare FWICT.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Minor optimisation of XLogInsert()

2011-11-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 (I wonder if we shouldn't just align every shared memory allocation to
 64 or 128 bytes.  It wouldn't waste much memory and it would make us
 much more resistant to performance changes caused by minor
 modifications to the shared memory layout.)

I could get behind this idea if we had a reasonably clear idea of the
hardware's cache line width, but AFAIK there is no portable way to
identify that.  (This is a pretty fatal objection to Simon's original
patch as well...)

regards, tom lane



Re: [HACKERS] Minor optimisation of XLogInsert()

2011-11-16 Thread Robert Haas
On Wed, Nov 16, 2011 at 11:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 (I wonder if we shouldn't just align every shared memory allocation to
 64 or 128 bytes.  It wouldn't waste much memory and it would make us
 much more resistant to performance changes caused by minor
 modifications to the shared memory layout.)

 I could get behind this idea if we had a reasonably clear idea of the
 hardware's cache line width, but AFAIK there is no portable way to
 identify that.  (This is a pretty fatal objection to Simon's original
 patch as well...)

I don't think it's very important to know the exact size.  As far as I
can tell, x64 is 64 bytes and Itanium and Power are 128 bytes.  If you
optimize for those, you'll also handle any smaller size (that's a
power of two, without which a lot of things we do are wrongheaded)
without wasting much memory.  If you run into hardware with a giant
256-byte or larger cache line, you might have some sharing, but you
can't win 'em all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
