Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Andres Freund
On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
 On Mon, Feb 3, 2014 at 3:38 PM, Andres Freund and...@2ndquadrant.com wrote:
   A quick hack (attached) making BufferDescriptor 64byte aligned indeed
   restored performance across all max_connections settings. It's not
   surprising that a misaligned buffer descriptor causes problems -
   there'll be plenty of false sharing of the spinlocks otherwise. Curious
   that the the intel machine isn't hurt much by this.
 
  What fiddling are you thinking of?
 
  Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before
  returning from ShmemAlloc() (and thereby ShmemInitStruct).
 
 There is something you have not drawn explicit attention to that is
 very interesting. If we take REL9_3_STABLE tip to be representative
 (built with full -O2 optimization, no assertions just debugging
 symbols), setting max_connections to 91 from 90 does not have the
 effect of making the BufferDescriptors array aligned; it has the
 effect of making it *misaligned*. You reported that 91 was much better
 than 90. I think that the problem actually occurs when the array *is*
 aligned!

I don't think you can learn much from the alignment in 9.3
vs. HEAD. Loads has changed since, most prominently and recently
Robert's LWLock work. That certainly has changed allocation patterns.
It will also depend on some other parameters, e.g. changing
max_wal_senders, max_background_workers will also change the
offset. It's not that 91 is intrinsically better, it just happened to
give a aligned BufferDescriptors array when the other parameters weren't
changed at the same time.

 I suspect that the scenario described in this article accounts for the
 quite noticeable effect reported: http://danluu.com/3c-conflict

I don't think that's applicable here. What's described there is relevant
for access patterns that are larger multiple of the cacheline size - but
our's is exactly cacheline sized. What can happen in such scenarios is
that all your accesses map to the same set of cachelines, so instead of
using most of the cache, you end up using only 8 or so (8 is a common
size of set associative caches these days).
Theoretically we could see something like that for shared_buffers
itself, but I *think* our accesses are too far spread around in them for
that to be a significant issue.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
 I suspect that the scenario described in this article accounts for the
 quite noticeable effect reported: http://danluu.com/3c-conflict

 I don't think that's applicable here.

Maybe, or maybe not, but I think it does say that we should be very wary
of proposals to force data structure alignment without any testing of the
consequences.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Andres Freund
On 2014-02-05 09:57:11 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
  I suspect that the scenario described in this article accounts for the
  quite noticeable effect reported: http://danluu.com/3c-conflict
 
  I don't think that's applicable here.
 
 Maybe, or maybe not, but I think it does say that we should be very wary
 of proposals to force data structure alignment without any testing of the
 consequences.

I agree it needs testing, but what the page is talking about really,
really doesn't have *anything* to do with this. What the author is
talking about is not storing things *page* aligned (I.e. 4k or a
multiple). The problem is that the beginning of each such page falls
into the same cacheline set and thus doesn't utilize the entire L1/L2/L3
but only the single set they map into.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Greg Stark
On Wed, Feb 5, 2014 at 3:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Maybe, or maybe not, but I think it does say that we should be very wary
 of proposals to force data structure alignment without any testing of the
 consequences.


Sure. see for instance
http://igoro.com/archive/gallery-of-processor-cache-effects/

From what I understand what Andres is suggesting is ensuring that each
BufferDescriptor is sitting entirely in one cache line. That seems
very unlikely to be worse than having a BufferDescriptor spanning two
cache lines and being on the same cache line as the adjacent
BufferDescriptors. But this all depends on knowing the length of the
cache lines. I see a lot of confusion online over whether cache lines
are 64 bytes, 128 bytes, or other length even just on Intel
architectures, let alone others.

I wonder if there are any generic tools to optimize array/structure
layouts based on cachegrind profiling or something like that. Then we
wouldn't need to know the oddities ourselves and optimize manually. We
could maybe even do it on the build farm and select the right profile
at build time by matching build target information.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Andres Freund
On 2014-02-05 16:14:01 +0100, Greg Stark wrote:
 I see a lot of confusion online over whether cache lines
 are 64 bytes, 128 bytes, or other length even just on Intel
 architectures, let alone others.

All current x86 processors use 64. But even if it were bigger/smaller,
they will be either 32, or 128. Neither benefits from touching more
cachelines than necessary. E.g. in the 128 case, we could still touch
two with the current code.
The effects referred to upthread only affect code with larger multiples
of the cacheline size. Not what we have here.

 I wonder if there are any generic tools to optimize array/structure
 layouts based on cachegrind profiling or something like that. Then we
 wouldn't need to know the oddities ourselves and optimize manually. We
 could maybe even do it on the build farm and select the right profile
 at build time by matching build target information.

There's profiling tools (e.g. perf's -e
stalled-cycles-(frontent|backend)), but I don't think there's more than
that.
And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it
just wasn't updated in the last 10 years.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Peter Geoghegan
On Wed, Feb 5, 2014 at 7:21 AM, Andres Freund and...@2ndquadrant.com wrote:
 All current x86 processors use 64. But even if it were bigger/smaller,
 they will be either 32, or 128. Neither benefits from touching more
 cachelines than necessary. E.g. in the 128 case, we could still touch
 two with the current code.

That's true, but I believe that a hardware optimization called
Adjacent Cache Line Prefetch is widely supported by recent
microarchitectures, and is sometimes enabled by default. I'm not
suggesting that that would necessarily radically alter the outcome,
but it is a consideration.



-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Peter Geoghegan
On Tue, Feb 4, 2014 at 4:24 PM, Peter Geoghegan p...@heroku.com wrote:
 There is something you have not drawn explicit attention to that is
 very interesting. If we take REL9_3_STABLE tip to be representative
 (built with full -O2 optimization, no assertions just debugging
 symbols), setting max_connections to 91 from 90 does not have the
 effect of making the BufferDescriptors array aligned; it has the
 effect of making it *misaligned*.

I spoke too soon; the effect is indeed reversed on master (i.e. bad
max_connection settings are misaligned, and vice-versa).


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-04 Thread Andres Freund
On 2014-02-04 00:38:19 +0100, Andres Freund wrote:
   A quick hack (attached) making BufferDescriptor 64byte aligned indeed
   restored performance across all max_connections settings. It's not
   surprising that a misaligned buffer descriptor causes problems -
   there'll be plenty of false sharing of the spinlocks otherwise. Curious
   that the the intel machine isn't hurt much by this.
 
  I think that is explained here:
 
  http://www.agner.org/optimize/blog/read.php?i=142v=t
 
  With Sandy Bridge, Misaligned memory operands [are] handled efficiently.
 
 No, I don't think so. Those improvements afair refer to unaligned
 accesses as in accessing a 4 byte variable at address % 4 != 0.

So, Christian did some benchmarking on the intel machine, and his
results were also lower than mine, and I've since confirmed that it's
also possible to reproduce the alignment problems on the intel machine.

Which imo means fixing this got more important...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-04 Thread Peter Geoghegan
On Tue, Feb 4, 2014 at 4:21 AM, Andres Freund and...@2ndquadrant.com wrote:
 Which imo means fixing this got more important...

I strongly agree.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-04 Thread Peter Geoghegan
On Mon, Feb 3, 2014 at 3:38 PM, Andres Freund and...@2ndquadrant.com wrote:
  A quick hack (attached) making BufferDescriptor 64byte aligned indeed
  restored performance across all max_connections settings. It's not
  surprising that a misaligned buffer descriptor causes problems -
  there'll be plenty of false sharing of the spinlocks otherwise. Curious
  that the the intel machine isn't hurt much by this.

 What fiddling are you thinking of?

 Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before
 returning from ShmemAlloc() (and thereby ShmemInitStruct).

There is something you have not drawn explicit attention to that is
very interesting. If we take REL9_3_STABLE tip to be representative
(built with full -O2 optimization, no assertions just debugging
symbols), setting max_connections to 91 from 90 does not have the
effect of making the BufferDescriptors array aligned; it has the
effect of making it *misaligned*. You reported that 91 was much better
than 90. I think that the problem actually occurs when the array *is*
aligned!

I suspect that the scenario described in this article accounts for the
quite noticeable effect reported: http://danluu.com/3c-conflict

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-03 Thread Peter Geoghegan
On Sun, Feb 2, 2014 at 7:13 AM, Andres Freund and...@2ndquadrant.com wrote:
 Just as reference, we're talking about a performance degradation from
 475963.613865 tps to 197744.913556 in a pgbench -S -cj64 just by setting
 max_connections to 90, from 91...

That's pretty terrible.

 So, I looked into this, and I am fairly certain it's because of the
 (mis-)alignment of the buffer descriptors. With certain max_connections
 settings InitBufferPool() happens to get 64byte aligned addresses, with
 others not. I checked the alignment with gdb to confirm that.

I find your diagnosis to be quite plausible.

 A quick hack (attached) making BufferDescriptor 64byte aligned indeed
 restored performance across all max_connections settings. It's not
 surprising that a misaligned buffer descriptor causes problems -
 there'll be plenty of false sharing of the spinlocks otherwise. Curious
 that the the intel machine isn't hurt much by this.

I think that is explained here:

http://www.agner.org/optimize/blog/read.php?i=142v=t

With Sandy Bridge, Misaligned memory operands [are] handled efficiently.

 Now all this hinges on the fact that by a mere accident
 BufferDescriptors are 64byte in size:

Are they 64 bytes in size on REL9_*_STABLE? How about on win64? I
think we're reasonably disciplined here already, but long is 32-bits
in length even on win64. Looks like it would probably be okay, but as
you say, it doesn't seem like something to leave to chance.

 We could polish up the attached patch and apply it to all the branches,
 the costs of memory are minimal. But I wonder if we shouldn't instead
 make ShmemInitStruct() always return cacheline aligned addresses. That
 will require some fiddling, but it might be a good idea nonetheless?

What fiddling are you thinking of?

 I think we should also consider some more reliable measures to have
 BufferDescriptors cacheline sized, rather than relying on the happy
 accident. Debugging alignment issues isn't fun, too much of a guessing
 game...

+1. Maybe make code that isn't appropriately aligned fail to compile?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-03 Thread Andres Freund
On 2014-02-03 15:17:13 -0800, Peter Geoghegan wrote:
 On Sun, Feb 2, 2014 at 7:13 AM, Andres Freund and...@2ndquadrant.com wrote:
  Just as reference, we're talking about a performance degradation from
  475963.613865 tps to 197744.913556 in a pgbench -S -cj64 just by setting
  max_connections to 90, from 91...

 That's pretty terrible.

Yea, I was scared myself.

  A quick hack (attached) making BufferDescriptor 64byte aligned indeed
  restored performance across all max_connections settings. It's not
  surprising that a misaligned buffer descriptor causes problems -
  there'll be plenty of false sharing of the spinlocks otherwise. Curious
  that the the intel machine isn't hurt much by this.

 I think that is explained here:

 http://www.agner.org/optimize/blog/read.php?i=142v=t

 With Sandy Bridge, Misaligned memory operands [are] handled efficiently.

No, I don't think so. Those improvements afair refer to unaligned
accesses as in accessing a 4 byte variable at address % 4 != 0.

  Now all this hinges on the fact that by a mere accident
  BufferDescriptors are 64byte in size:

 Are they 64 bytes in size on REL9_*_STABLE? How about on win64? I
 think we're reasonably disciplined here already, but long is 32-bits
 in length even on win64. Looks like it would probably be okay, but as
 you say, it doesn't seem like something to leave to chance.

I haven't checked any branches yet, but I don't remember any recent
changes to sbufdesc. And I think we're lucky enough that all the used
types are the same width across common 64bit OSs.
But I absolutely agree we shouldn't leave this to chance.

  We could polish up the attached patch and apply it to all the branches,
  the costs of memory are minimal. But I wonder if we shouldn't instead
  make ShmemInitStruct() always return cacheline aligned addresses. That
  will require some fiddling, but it might be a good idea nonetheless?

 What fiddling are you thinking of?

Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before
returning from ShmemAlloc() (and thereby ShmemInitStruct). After I'd
written the email I came across an interesting bit of code:

void *
ShmemAlloc(Size size)
{
...
/* extra alignment for large requests, since they are probably buffers 
*/
if (size = BLCKSZ)
newStart = BUFFERALIGN(newStart);

/*
 * Preferred alignment for disk I/O buffers.  On some CPUs, copies between
 * user space and kernel space are significantly faster if the user buffer
 * is aligned on a larger-than-MAXALIGN boundary.  Ideally this should be
 * a platform-dependent value, but for now we just hard-wire it.
 */
#define ALIGNOF_BUFFER  32

#define BUFFERALIGN(LEN)TYPEALIGN(ALIGNOF_BUFFER, (LEN))

So, we're already trying to make large allocations (which we'll be using
here) try to fit to a 32 byte alignment. Which unfortunately isn't
enough here, but it explains why max_connections has to be changed by
exactly 1 to achieve adequate performance...
So, the absolutely easiest thing would be to just change ALIGNOF_BUFFER
to 64. The current value was accurate in 2003 (common cacheline size
back then), but it really isn't anymore today. Anything relevant is
64byte.

But I really think we should also remove the BLCKSZ limit. There's
relatively few granular/dynamic usages of ShmemAlloc(), basically just
shared memory hashes. And I'd be rather unsurprised if aligning all
allocations for hashes resulted in a noticeable speedup. We have
frightening bottlenecks in those today.

  I think we should also consider some more reliable measures to have
  BufferDescriptors cacheline sized, rather than relying on the happy
  accident. Debugging alignment issues isn't fun, too much of a guessing
  game...

 +1. Maybe make code that isn't appropriately aligned fail to compile?

I was wondering about using some union trickery to make sure the array
only consists out of properly aligned types. That'd require some changes
but luckily code directly accessing the BufferDescriptors array isn't
too far spread.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers