Re: [HACKERS] LWLock cache line alignment

2005-02-03 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> It looks like padding out LWLock struct would ensure that each of those
> were in separate cache lines?

I've looked at this before and I think it's a nonstarter; increasing the
size of a spinlock to 128 bytes is just not reasonable.  (Remember there
are two per buffer.)  Also, there's no evidence it would actually help
anything, because the contention we have been able to measure is on only
one particular lock (BufMgrLock) anyway.  But feel free to try it to see
if you can see a difference.
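
For a sense of scale, here is a rough back-of-the-envelope sketch of the
shared-memory cost at stake; the ~24-byte unpadded LWLock size and the
10000-buffer setting are illustrative assumptions, not measurements.

#include <stdio.h>

int
main(void)
{
    const size_t lwlock_size   = 24;     /* assumed unpadded LWLock size */
    const size_t cache_line    = 128;    /* Intel P4/Xeon line size */
    const size_t nbuffers      = 10000;  /* illustrative shared_buffers */
    const size_t locks_per_buf = 2;      /* two LWLocks per buffer */
    size_t       nlocks        = nbuffers * locks_per_buf;

    /* per-buffer LWLock memory: packed vs. one lock per cache line */
    printf("packed: %zu kB\n", nlocks * lwlock_size / 1024);
    printf("padded: %zu kB\n", nlocks * cache_line / 1024);
    return 0;
}

Packed, the 20000 per-buffer locks come to roughly 470 kB; padded to a
128-byte line each, they would take about 2.5 MB of shared memory.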

regards, tom lane



Re: [HACKERS] LWLock cache line alignment

2005-02-03 Thread Neil Conway
Simon Riggs wrote:
> ...and BTW, what is MMCacheLock? Is that an attempt at padding already?
One would hope not, as it would be a totally braindead attempt :) It 
appears to have been formerly used by smgr/mm.c; when that was removed, 
the MMCacheLock should have been removed but was not. Barring any 
objections, I'll remove it from HEAD tomorrow.

> It looks like padding out LWLock struct would ensure that each of those
> were in separate cache lines?
Sounds like it might be worth trying; since it would be trivial to 
implement for testing purposes, I'd be curious to see if this improves 
performance, and/or has any effect on the CS storm issue.

-Neil


Re: [HACKERS] LWLock cache line alignment

2005-02-03 Thread Simon Riggs
From: Tom Lane [mailto:[EMAIL PROTECTED]] wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
>> It looks like padding out LWLock struct would ensure that each of
>> those were in separate cache lines?
>
> I've looked at this before and I think it's a nonstarter; increasing
> the size of a spinlock to 128 bytes is just not reasonable.  (Remember
> there are two per buffer.)

Well, the performance is unreasonably poor, so it's time to do something;
if padding is unreasonable for the general case, it may need to be
port-specific.

> Also, there's no evidence it would actually help anything, because the
> contention we have been able to measure is on only one particular lock
> (BufMgrLock) anyway.  But feel free to try it to see if you can see a
> difference.

Well, the Weird Context Switching issue isn't normal contention, which I
agree is stacked up against BufMgrLock.

Sharing cache lines doesn't cause measurable user-space contention, just
poor performance, because the cache-invalidate-and-refetch delay is
incurred many more times than the minimum (ideal).

I'm thinking that the 128 byte cache line on Intel is sufficiently larger
than the 64 byte cache line on AMD to tip us into different behaviour at
runtime.

Best Regards, Simon Riggs




Re: [HACKERS] LWLock cache line alignment

2005-02-03 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> From: Tom Lane [mailto:[EMAIL PROTECTED]] wrote
>> I've looked at this before and I think it's a nonstarter; increasing
>> the size of a spinlock to 128 bytes is just not reasonable.
>
> Well, the performance is unreasonably poor, so it's time to do
> something; if padding is unreasonable for the general case, it may need
> to be port-specific.

Well, it might be worth allocating a full 128 bytes just for the fixed
LWLocks (BufMgrLock and friends) and skimping on the per-buffer locks,
which should be seeing far less contention than the fixed locks anyway.
But first let's see some evidence that this actually helps?
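
A minimal sketch of what that split might look like; the LWLock field
layout shown is only approximate, and the names PaddedLWLock, FixedLWLocks
and BufferLWLocks are hypothetical rather than taken from the source tree.

#define CACHE_LINE_SIZE 128            /* target for Intel P4/Xeon */

/* Approximate shape of the current LWLock; fields are illustrative. */
typedef struct LWLock
{
    volatile unsigned char mutex;      /* spinlock protecting the LWLock */
    unsigned char  releaseOK;
    unsigned char  exclusive;
    int            shared;
    void          *head;               /* queue of waiting processes */
    void          *tail;
} LWLock;

/* Pad a lock out to a full cache line without changing its fields. */
typedef union PaddedLWLock
{
    LWLock lock;
    char   pad[CACHE_LINE_SIZE];
} PaddedLWLock;

/* Fixed locks (BufMgrLock and friends) each get a whole cache line... */
PaddedLWLock *FixedLWLocks;            /* NumFixedLWLocks entries */

/* ...while per-buffer locks stay packed to limit shared memory use. */
LWLock       *BufferLWLocks;           /* 2 * NBuffers entries */

Padding to 128 bytes only works as intended if the array itself also
starts on a 128-byte boundary, which is where the buffer-alignment setting
comes back into play.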

regards, tom lane



Re: [HACKERS] LWLock cache line alignment

2005-02-03 Thread Simon Riggs
From: Tom Lane [mailto:[EMAIL PROTECTED]] wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
>> From: Tom Lane [mailto:[EMAIL PROTECTED]] wrote
>>> I've looked at this before and I think it's a nonstarter; increasing
>>> the size of a spinlock to 128 bytes is just not reasonable.
>>
>> Well, the performance is unreasonably poor, so it's time to do
>> something; if padding is unreasonable for the general case, it may
>> need to be port-specific.
>
> But first let's see some evidence that this actually helps?

Yes, that's what we're planning. Currently we're just brainstorming ideas
for prototyping, rather than suggesting definitive things to go into the
codebase.

There are some other suggestions coming from our Unisys friends as well:
some additional Intel tweaks to avoid processor stalls and the like.
Again... I'll show the performance figures first.

> Well, it might be worth allocating a full 128 bytes just for the fixed
> LWLocks (BufMgrLock and friends) and skimping on the per-buffer locks,
> which should be seeing far less contention than the fixed locks anyway.

Yes, that seems like a likely way forward. Right now we're going to pad
the whole structure, to save prototyping time and because we can on a
test box. Working on this now.

My concern about the per-buffer locks is with the CLog and Subtrans
buffer pools. Because we have 8 buffers for each of those, it looks like
their locks will all be in a single cache line. That means fine-grained
locking is ineffective for those caches. With thousands of
shared_buffers, there's less of a problem with cache line contention.
Possibly Ken's suggestion of pseudo-randomising the allocation of locks
in LWLockAcquire would reduce the effect on those smaller buffer pools.
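
To illustrate the concern, here is a small standalone sketch that maps
eight consecutively allocated locks to cache-line numbers; the 16-byte
lock size is an assumed 32-bit figure, and the local array merely stands
in for the shared LWLock array.

#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 128            /* Intel P4/Xeon */
#define LWLOCK_SIZE      16            /* assumed 32-bit LWLock size */

int
main(void)
{
    /* Stand-in for 8 consecutively assigned SLRU buffer locks. */
    char        lwlock_array[8][LWLOCK_SIZE];
    int         slot;

    for (slot = 0; slot < 8; slot++)
    {
        uintptr_t   addr = (uintptr_t) &lwlock_array[slot];

        /* 8 x 16 bytes = 128 bytes: all slots report the same line
         * (or at most straddle two), so every acquisition ping-pongs
         * the same cache line between CPUs. */
        printf("buffer_locks[%d] -> cache line %lu\n",
               slot, (unsigned long) (addr / CACHE_LINE_SIZE));
    }
    return 0;
}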

Best Regards, Simon Riggs




Re: [HACKERS] LWLock cache line alignment

2005-02-03 Thread Kenneth Marshall
On Thu, Feb 03, 2005 at 06:26:16AM -0800, Simon Riggs wrote:
> From: Tom Lane [mailto:[EMAIL PROTECTED]] wrote:
>> Simon Riggs <[EMAIL PROTECTED]> writes:
>>> It looks like padding out LWLock struct would ensure that each of
>>> those were in separate cache lines?
>>
>> I've looked at this before and I think it's a nonstarter; increasing
>> the size of a spinlock to 128 bytes is just not reasonable.  (Remember
>> there are two per buffer.)
>
> Well, the performance is unreasonably poor, so it's time to do
> something; if padding is unreasonable for the general case, it may need
> to be port-specific.
>
>> Also, there's no evidence it would actually help anything, because the
>> contention we have been able to measure is on only one particular lock
>> (BufMgrLock) anyway.  But feel free to try it to see if you can see a
>> difference.
>
> Well, the Weird Context Switching issue isn't normal contention, which
> I agree is stacked up against BufMgrLock.
>
> Sharing cache lines doesn't cause measurable user-space contention,
> just poor performance, because the cache-invalidate-and-refetch delay
> is incurred many more times than the minimum (ideal).
>
> I'm thinking that the 128 byte cache line on Intel is sufficiently
> larger than the 64 byte cache line on AMD to tip us into different
> behaviour at runtime.

Would it be possible to randomize the placement of the LWLock structs to
reduce lock correlation in the buffers? I.e., ensure that buffer locks
are allocated at least one cache line apart. This might reduce
cache-coherency traffic while still maintaining a compact data structure.
Maybe a simple mod X offset?
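
One way to read the "mod X offset" idea, as a standalone sketch: keep the
lock array packed, but hand out per-buffer lock indices with a stride so
that consecutively assigned locks land on different cache lines. The
constants and the helper name below are purely illustrative.

#include <stdio.h>

#define CACHE_LINE_SIZE 128
#define LWLOCK_SIZE      16                                /* assumed */
#define LOCKS_PER_LINE  (CACHE_LINE_SIZE / LWLOCK_SIZE)    /* = 8 */

/*
 * Map the i-th requested buffer lock to a slot in a packed pool of
 * pool_size locks so that neighbouring requests are at least one cache
 * line apart: request i goes to line (i mod nlines) at offset
 * (i / nlines) within that line.
 */
static int
spread_lock_index(int i, int pool_size)
{
    int     nlines = pool_size / LOCKS_PER_LINE;

    return (i % nlines) * LOCKS_PER_LINE + (i / nlines);
}

int
main(void)
{
    int     i;

    for (i = 0; i < 16; i++)
        printf("request %2d -> slot %2d (cache line %d)\n",
               i, spread_lock_index(i, 64),
               spread_lock_index(i, 64) / LOCKS_PER_LINE);
    return 0;
}

For request counts up to the pool size this is a simple permutation, so
the array stays compact while neighbouring locks are spread across lines;
whether it actually helps the small CLog/Subtrans pools would still need
measuring.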

Ken



[HACKERS] LWLock cache line alignment

2005-02-02 Thread Simon Riggs

Following some advice from Intel,
http://www.intel.com/cd/ids/developer/asmo-na/eng/technologies/threading/20469.htm?page=2
I'm looking at whether the LWLock data structures may be within the same
cache line.

Intel uses 128 byte cache lines on its high end processors.

slru.c uses BUFFERALIGN, which is currently hardcoded in
pg_config_manual.h as
#define ALIGNOF_BUFFER  32
which seems to be the wrong setting for Intel CPUs, and possibly others.
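
For reference, BUFFERALIGN follows the usual round-up-to-a-multiple idiom
(via TYPEALIGN), so raising ALIGNOF_BUFFER is what widens the spacing. A
standalone sketch of the effect; the macro below mirrors that idiom rather
than quoting the exact source.

#include <stdio.h>
#include <stdint.h>

/* The usual align-up idiom: round LEN up to a multiple of ALIGNVAL. */
#define ALIGN_UP(len, alignval) \
    (((uintptr_t) (len) + ((alignval) - 1)) & ~((uintptr_t) ((alignval) - 1)))

int
main(void)
{
    uintptr_t   len = 40;              /* some arbitrary request size */

    printf("aligned to  32: %lu\n", (unsigned long) ALIGN_UP(len, 32));
    printf("aligned to 128: %lu\n", (unsigned long) ALIGN_UP(len, 128));
    return 0;   /* prints 64 and 128 respectively */
}

With 32-byte alignment, several separately aligned structures can still
share one 128-byte line; with 128-byte alignment, each aligned allocation
starts on its own line.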

In slru.c we have this code fragment:
/* Release shared lock, grab per-buffer lock instead */
LWLockRelease(shared->ControlLock);
LWLockAcquire(shared->buffer_locks[slotno], LW_EXCLUSIVE);

The purpose of this is to reduce contention by holding finer-grained
locks. ISTM what this does is drop one lock, then take another lock by
accessing an array (buffer_locks) which will be in the same cache line
for all locks, and then access the LWLock data structure, which again
will be all within the same cache line. ISTM that we have fine-grained
LWLocks, but not fine-grained cache lines. That means that all CLog and
Subtrans locks would be affected, since we have 8 of each.

For the other global LWLocks, the same thing applies, so BufMgrLock and
many other locks are effectively all the same from the cache's
perspective.

...and BTW, what is MMCacheLock? Is that an attempt at padding already?

It looks like padding out LWLock struct would ensure that each of those
were in separate cache lines?

Any views?

Best Regards, Simon Riggs

