Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Added to TODO list.

---

Simon Riggs wrote:

On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:

I wrote: Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. ... Maybe it'd be better to allocate the spinlocks off by themselves.

Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2):

    1    31s
    2    42s
    4    51s
    8    100s

with lwlock-separate added:

    1    31s
    2    52s
    4    106s
    8    213s

What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad?

(Just going back through the whole thread for completeness.) Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere?

Yes. That's the advice from Intel and AMD; but we should add that there is potential for improving performance also. The possible problem we were trying to avoid here was false sharing of the cache line by lock requestors of locks whose spin locks were adjacent in memory. Splitting the data structure was just one of the possible ways of avoiding that.
The above shows that the possible solution described above didn't work, but there could be others.

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

The lwlock structure in include/storage/lwlock.h looks like

typedef enum LWLockId
{
    BufMappingLock,
    BufFreelistLock,
    LockMgrLock,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALWriteLock,
    ...

Notice that the heavily contested locks (i.e. the first 3 and the WAL locks) are adjacent to at least one other heavily contested lock, so they are certain to be in the same cache line and therefore to cause false sharing (on all CPU types, not just Intel and AMD; ref: the manufacturers' optimization handbooks). This could be replaced by...
typedef enum LWLockId
{
    BufMappingLock,
    BufMappingLock_Padding1, BufMappingLock_Padding2, BufMappingLock_Padding3,
    BufMappingLock_Padding4, BufMappingLock_Padding5, BufMappingLock_Padding6,
    BufMappingLock_Padding7,
    BufFreelistLock,
    BufFreelistLock_Padding1, BufFreelistLock_Padding2, BufFreelistLock_Padding3,
    BufFreelistLock_Padding4, BufFreelistLock_Padding5, BufFreelistLock_Padding6,
    BufFreelistLock_Padding7,
    LockMgrLock,
    LockMgrLock_Padding1, LockMgrLock_Padding2, LockMgrLock_Padding3,
    LockMgrLock_Padding4, LockMgrLock_Padding5, LockMgrLock_Padding6,
    LockMgrLock_Padding7,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALInsertLock_Padding1, WALInsertLock_Padding2, WALInsertLock_Padding3,
    WALInsertLock_Padding4, WALInsertLock_Padding5, WALInsertLock_Padding6,
    WALInsertLock_Padding7,
    WALWriteLock,
    WALWriteLock_Padding1, WALWriteLock_Padding2, WALWriteLock_Padding3,
    WALWriteLock_Padding4, WALWriteLock_Padding5, WALWriteLock_Padding6,
    WALWriteLock_Padding7,
    ...

where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching.
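The padding arithmetic implied above can be sketched as follows; the 16-byte per-slot stride is an assumption for illustration (the thread only says the raw struct is circa 20 bytes, and the real stride depends on the array element size actually used):

```c
#include <assert.h>

/* With a 128-byte cache line and an assumed 16-byte stride per lock
 * slot, one real lock plus 7 dummy enum entries fills exactly one
 * line, matching the 7 _PaddingN entries per contested lock above. */
#define CACHE_LINE_SIZE  128
#define LWLOCK_SLOT_SIZE 16     /* assumed stride; not from the source */

#define LOCKS_PER_LINE (CACHE_LINE_SIZE / LWLOCK_SLOT_SIZE)
#define PADDING_SLOTS  (LOCKS_PER_LINE - 1)   /* dummies per real lock */
```

A larger stride would need fewer dummies; that is why the padding count has to track the platform's lock size.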
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Bruce Momjian pgman@candle.pha.pa.us wrote:

Added to TODO list.

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Regards, Qingqing

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Qingqing Zhou wrote:

Bruce Momjian pgman@candle.pha.pa.us wrote:

Added to TODO list.

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Oh, yes, thanks. I thought it had been done, but I couldn't find anything in the area of the code where he proposed the patch.

--
Bruce Momjian    http://candle.pha.pa.us
EnterpriseDB     http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Qingqing Zhou [EMAIL PROTECTED] writes:

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Not really --- that patch was intended to ensure that LWLocks don't unnecessarily cross two cache lines. It doesn't ensure that two different LWLocks aren't sharing a cache line. You could do that by increasing LWLOCK_PADDED_SIZE to the cache line size for your hardware, if you know what that is. I think a more effective answer might be to twiddle the order of enum LWLockId items so that the most heavily used LWLocks aren't close together. Haven't looked into it though.

regards, tom lane
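For reference, the padded-union arrangement Tom describes can be sketched like this; only the LWLOCK_PADDED_SIZE name comes from the actual code, and the LWLock fields here are simplified stand-ins:

```c
#include <assert.h>

/* Simplified stand-in for the real LWLock struct (circa 20-32 bytes);
 * the actual field list in lwlock.c differs. */
typedef struct LWLock
{
    int   mutex;        /* embedded spinlock */
    char  exclusive;    /* # of exclusive holders (0 or 1) */
    int   shared;       /* # of shared holders */
    void *head;         /* head of waiter queue */
    void *tail;         /* tail of waiter queue */
} LWLock;

/* Padding each lock to a power of 2 keeps a lock from straddling two
 * cache lines; raising this to the full cache-line size (e.g. 128)
 * would also keep two locks from sharing a line, as suggested above. */
#define LWLOCK_PADDED_SIZE 128

typedef union LWLockPadded
{
    LWLock lock;
    char   pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;
```

An array of LWLockPadded then places each lock at a 128-byte boundary without touching the lock code itself.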
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] wrote:

Not really --- that patch was intended to ensure that LWLocks don't unnecessarily cross two cache lines. It doesn't ensure that two different LWLocks aren't sharing a cache line. You could do that by increasing LWLOCK_PADDED_SIZE to the cache line size for your hardware, if you know what that is.

Exactly, this is one way -- if we make LWLOCK_PADDED_SIZE big enough, we can assure one lwlock per cacheline. If so, maybe we should plug a check like LMBench into ./configure to guess the current cacheline size. But this way looks like overkill -- a compromise is to pad only some of the LWLocks big enough but not all (for example, the buffer content lock array).

Regards, Qingqing
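A configure-time probe could also be approximated at run time. A minimal sketch, assuming the glibc-specific sysconf name and falling back to a conservative 128 bytes when the probe is unavailable or fails:

```c
#include <assert.h>
#include <unistd.h>

/* Guess the L1 data-cache line size.  _SC_LEVEL1_DCACHE_LINESIZE is a
 * glibc extension (not POSIX), hence the guard; the 128-byte fallback
 * is an assumed safe upper bound, not a measured value. */
static long
guess_cache_line_size(void)
{
    long sz = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return (sz > 0) ? sz : 128;
}
```

A build-time probe (as suggested) avoids doing this on every start, but either way the result would only feed a compile-time LWLOCK_PADDED_SIZE choice.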
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Thu, 03 Nov 2005 18:29:09 +0000 Simon Riggs [EMAIL PROTECTED] wrote:

On Thu, 2005-11-03 at 08:03 -0800, Mark Wong wrote:

On Tue, 01 Nov 2005 07:32:32 +0000 Simon Riggs [EMAIL PROTECTED] wrote:

Concerned about the awful checkpointing. Can you bump wal_buffers to 8192 just to make sure? That's way too high, but just to prove it. We need to reduce the number of blocks to be written at checkpoint.

    bgwriter_all_maxpages    5 -> 15
    bgwriter_all_percent     0.333
    bgwriter_delay           200
    bgwriter_lru_maxpages    5 -> 7
    bgwriter_lru_percent     1
    shared_buffers           set lower to 10 (which should cause some amusement on-list)

Okay, here goes, all with the same source base w/ the lw.patch:

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
only increased wal_buffers to 8192 from 2048
3242 notpm

That looks to me like a clear negative effect from increasing wal_buffers. Try putting it back down to 1024. Looks like we need to plug that gap.

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
3019 notpm (but more interesting graph)

Man that sucks. What the heck is happening there? Hackers - if you are watching you should see this graph - it shows some very poor behaviour. I'm not happy with that performance at all; any chance you could re-run that exact same test to see if we can get that repeatably?

I see you have vm.dirty_writeback_centisecs = 0 which pretty much means we aren't ever writing to disk by the pdflush daemons, even when the bgwriter is active. Could we set the bgwriter stuff back to default and try vm.dirty_writeback_centisecs = 500

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/47/
3309 notpm

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
Same as the previously listed run with shared_buffers lowered to 1
2503 notpm

Sorry, that was 100,000 not 10,000.
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/46/
2794 notpm

Looks like we need dates on the log_line_prefix so we can check the logs.

Oops again! I didn't check to make sure I had set this correctly before I ran the last two tests, I'll get on it.

...not sure about the oprofile results. Seems to show CreateLWLocks being as high as xlog_insert, which is mad. Either that shows startup time is excessive, or it means the oprofile timing range is too short. Not sure which.

Yeah, we've seen this before. I think I'll have to try pulling the oprofile cvs code to see if there's any improvement. I've been working with oprofile-0.9.1.

Mark
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 01 Nov 2005 07:32:32 +0000 Simon Riggs [EMAIL PROTECTED] wrote:

On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote:

On Thu, 20 Oct 2005 23:03:47 +0100 Simon Riggs [EMAIL PROTECTED] wrote:

On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:

This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

Ok, I've produced a few results on a 4 way (8 core) POWER 5 system, which I've just set up and probably needs a bit of tuning. I don't see much difference but I'm wondering if the cacheline sizes are dramatically different from Intel/AMD processors. I still need to take a closer look to make sure I haven't grossly mistuned anything, but I'll let everyone take a look:

Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently, so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS.) But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb).

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
2501 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and redid the disk layout.
Here's where I'm at now:

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
3257 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
3285 notpm

Still not much of a difference with the patch. A quick glance over the iostat data suggests I'm still not i/o bound, but the i/o wait is rather high according to vmstat. Will try to see if there's anything else obvious to get the load up higher.

OK, that's fine. I'm glad there's some gain, but not much yet. I think we should leave out doing any more tests on lw.patch for now.

Concerned about the awful checkpointing. Can you bump wal_buffers to 8192 just to make sure? That's way too high, but just to prove it. We need to reduce the number of blocks to be written at checkpoint.

    bgwriter_all_maxpages    5 -> 15
    bgwriter_all_percent     0.333
    bgwriter_delay           200
    bgwriter_lru_maxpages    5 -> 7
    bgwriter_lru_percent     1
    shared_buffers           set lower to 10 (which should cause some amusement on-list)

Okay, here goes, all with the same source base w/ the lw.patch:

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
only increased wal_buffers to 8192 from 2048
3242 notpm

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
3019 notpm (but more interesting graph)

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
Same as the previously listed run with shared_buffers lowered to 1
2503 notpm

Mark
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Thu, 20 Oct 2005 23:03:47 +0100 Simon Riggs [EMAIL PROTECTED] wrote:

On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:

This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

Ok, I've produced a few results on a 4 way (8 core) POWER 5 system, which I've just set up and probably needs a bit of tuning. I don't see much difference but I'm wondering if the cacheline sizes are dramatically different from Intel/AMD processors. I still need to take a closer look to make sure I haven't grossly mistuned anything, but I'll let everyone take a look:

Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently, so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS.) But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb).

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
2501 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and redid the disk layout.

Here's where I'm at now:

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
3257 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
3285 notpm

Still not much of a difference with the patch.
A quick glance over the iostat data suggests I'm still not i/o bound, but the i/o wait is rather high according to vmstat. Will try to see if there's anything else obvious to get the load up higher.

Mark
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote:

On Thu, 20 Oct 2005 23:03:47 +0100 Simon Riggs [EMAIL PROTECTED] wrote:

On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:

This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

Ok, I've produced a few results on a 4 way (8 core) POWER 5 system, which I've just set up and probably needs a bit of tuning. I don't see much difference but I'm wondering if the cacheline sizes are dramatically different from Intel/AMD processors. I still need to take a closer look to make sure I haven't grossly mistuned anything, but I'll let everyone take a look:

Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently, so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS.) But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb).

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
2501 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and redid the disk layout.

Here's where I'm at now:

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
3257 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
3285 notpm

Still not much of a difference with the patch.
A quick glance over the iostat data suggests I'm still not i/o bound, but the i/o wait is rather high according to vmstat. Will try to see if there's anything else obvious to get the load up higher.

OK, that's fine. I'm glad there's some gain, but not much yet. I think we should leave out doing any more tests on lw.patch for now.

Concerned about the awful checkpointing. Can you bump wal_buffers to 8192 just to make sure? That's way too high, but just to prove it. We need to reduce the number of blocks to be written at checkpoint.

    bgwriter_all_maxpages    5 -> 15
    bgwriter_all_percent     0.333
    bgwriter_delay           200
    bgwriter_lru_maxpages    5 -> 7
    bgwriter_lru_percent     1
    shared_buffers           set lower to 10 (which should cause some amusement on-list)

Best Regards, Simon Riggs
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:

This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

Ok, I've produced a few results on a 4 way (8 core) POWER 5 system, which I've just set up and probably needs a bit of tuning. I don't see much difference but I'm wondering if the cacheline sizes are dramatically different from Intel/AMD processors. I still need to take a closer look to make sure I haven't grossly mistuned anything, but I'll let everyone take a look:

Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently, so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS.) But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb).

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
2501 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Best Regards, Simon Riggs
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Wed, Oct 12, 2005 at 03:07:23PM -0400, Emil Briggs wrote:

where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

I don't have an 8 way available right now but I can run tests on a 4 way Opteron if that would be helpful.

Likewise I can test on a 4 way opteron running open solaris.

--
Jim C. Nasby, Sr. Engineering Consultant    [EMAIL PROTECTED]
Pervasive Software    http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf    cell: 512-569-9461
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Wed, 12 Oct 2005 18:44:50 +0100 Simon Riggs [EMAIL PROTECTED] wrote:

On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:

I wrote: Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. ... Maybe it'd be better to allocate the spinlocks off by themselves.

Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2):

    1    31s
    2    42s
    4    51s
    8    100s

with lwlock-separate added:

    1    31s
    2    52s
    4    106s
    8    213s

What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad?

(Just going back through the whole thread for completeness.) Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere?

Yes. That's the advice from Intel and AMD; but we should add that there is potential for improving performance also. The possible problem we were trying to avoid here was false sharing of the cache line by lock requestors of locks whose spin locks were adjacent in memory. Splitting the data structure was just one of the possible ways of avoiding that.
The above shows that the possible solution described above didn't work, but there could be others.

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

The lwlock structure in include/storage/lwlock.h looks like

typedef enum LWLockId
{
    BufMappingLock,
    BufFreelistLock,
    LockMgrLock,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALWriteLock,
    ...

Notice that the heavily contested locks (i.e. the first 3 and the WAL locks) are adjacent to at least one other heavily contested lock, so they are certain to be in the same cache line and therefore to cause false sharing (on all CPU types, not just Intel and AMD; ref: the manufacturers' optimization handbooks). This could be replaced by...
typedef enum LWLockId
{
    BufMappingLock,
    BufMappingLock_Padding1, BufMappingLock_Padding2, BufMappingLock_Padding3,
    BufMappingLock_Padding4, BufMappingLock_Padding5, BufMappingLock_Padding6,
    BufMappingLock_Padding7,
    BufFreelistLock,
    BufFreelistLock_Padding1, BufFreelistLock_Padding2, BufFreelistLock_Padding3,
    BufFreelistLock_Padding4, BufFreelistLock_Padding5, BufFreelistLock_Padding6,
    BufFreelistLock_Padding7,
    LockMgrLock,
    LockMgrLock_Padding1, LockMgrLock_Padding2, LockMgrLock_Padding3,
    LockMgrLock_Padding4, LockMgrLock_Padding5, LockMgrLock_Padding6,
    LockMgrLock_Padding7,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALInsertLock_Padding1, WALInsertLock_Padding2, WALInsertLock_Padding3,
    WALInsertLock_Padding4, WALInsertLock_Padding5, WALInsertLock_Padding6,
    WALInsertLock_Padding7,
    WALWriteLock,
    WALWriteLock_Padding1, WALWriteLock_Padding2, WALWriteLock_Padding3,
    WALWriteLock_Padding4, WALWriteLock_Padding5, WALWriteLock_Padding6,
    WALWriteLock_Padding7,
    ...

where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching.
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:

I wrote: Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. ... Maybe it'd be better to allocate the spinlocks off by themselves.

Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2):

    1    31s
    2    42s
    4    51s
    8    100s

with lwlock-separate added:

    1    31s
    2    52s
    4    106s
    8    213s

What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad?

(Just going back through the whole thread for completeness.) Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere?

Yes. That's the advice from Intel and AMD; but we should add that there is potential for improving performance also. The possible problem we were trying to avoid here was false sharing of the cache line by lock requestors of locks whose spin locks were adjacent in memory. Splitting the data structure was just one of the possible ways of avoiding that. The above shows that the possible solution described above didn't work, but there could be others.
One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

The lwlock structure in include/storage/lwlock.h looks like

typedef enum LWLockId
{
    BufMappingLock,
    BufFreelistLock,
    LockMgrLock,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALWriteLock,
    ...

Notice that the heavily contested locks (i.e. the first 3 and the WAL locks) are adjacent to at least one other heavily contested lock, so they are certain to be in the same cache line and therefore to cause false sharing (on all CPU types, not just Intel and AMD; ref: the manufacturers' optimization handbooks). This could be replaced by...

typedef enum LWLockId
{
    BufMappingLock,
    BufMappingLock_Padding1, BufMappingLock_Padding2, BufMappingLock_Padding3,
    BufMappingLock_Padding4, BufMappingLock_Padding5, BufMappingLock_Padding6,
    BufMappingLock_Padding7,
    BufFreelistLock,
    BufFreelistLock_Padding1, BufFreelistLock_Padding2, BufFreelistLock_Padding3,
    BufFreelistLock_Padding4, BufFreelistLock_Padding5, BufFreelistLock_Padding6,
    BufFreelistLock_Padding7,
    LockMgrLock,
    LockMgrLock_Padding1, LockMgrLock_Padding2, LockMgrLock_Padding3,
    LockMgrLock_Padding4, LockMgrLock_Padding5, LockMgrLock_Padding6,
    LockMgrLock_Padding7,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALInsertLock_Padding1, WALInsertLock_Padding2, WALInsertLock_Padding3,
    WALInsertLock_Padding4, WALInsertLock_Padding5, WALInsertLock_Padding6,
    WALInsertLock_Padding7,
    WALWriteLock,
    WALWriteLock_Padding1, WALWriteLock_Padding2, WALWriteLock_Padding3,
    WALWriteLock_Padding4, WALWriteLock_Padding5, WALWriteLock_Padding6,
    WALWriteLock_Padding7,
    ...
where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is.
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

I don't have an 8 way available right now but I can run tests on a 4 way Opteron if that would be helpful.

Emil
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Is there a TODO here? --- Tom Lane wrote: The test case I just posted shows that our spinlock code, which we had thought largely done, is once again becoming a performance bottleneck. It's time to resurrect some of the ideas we kicked around in early 2002, and didn't pursue because we decided spinlocks weren't our major performance problem. I started investigating this by trying to understand why the futex spinlock patch had helped the Red Hat customer I mentioned, when it hadn't done anything much in last year's experiments. It turns out that this customer was using an 8-way Opteron, and the futex patch helps a lot more on that architecture than others. Some random experimentation eventually turned up part of the reason: the x86 TAS assembly code we are using is pessimal for x86_64. Specifically, s_lock.h uses this for both architectures:

	/* Use a non-locking test before asserting the bus lock */
	__asm__ __volatile__(
		"	cmpb	$0,%1	\n"
		"	jne		1f	\n"
		"	lock			\n"
		"	xchgb	%0,%1	\n"
		"1: \n"

and it turns out that deleting the cmpb and jne makes things go significantly faster on Opterons. So the non-locking test is not a win at all on x86_64. As best I can tell from testing, removing it wins because it causes the rate of spinlock delays to drop significantly. Why this should be is not completely obvious. I speculate that the Opterons are capable of pulling in a copy of the cache line that contains the spinlock and then running 100 iterations of the cmpb test without ever noticing that the processor holding the lock has now released it. Without the cmpb, they are forced by the locking xchgb test to actually look at the real state of the spinlock each time. I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: it can only be a win if you expect a lot of contention for the spin lock, but in percentage terms we still have a very low conflict rate, so in most executions of the TAS macro, the cmpb is just wasted cycles.
Bottom line: we definitely don't want it for x86_64, and maybe not at all, but we need more research to decide the latter. The second reason that the futex patch is helping is that when a spinlock delay does occur, it allows the delaying process to be awoken almost immediately, rather than delaying 10 msec or more as the existing code does. However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. I looked into several other methods of doing the spinlock delay instead. I think all of these were suggested at one point or another in our earlier discussions of spinlocks: 1. Use sched_yield() if available: it does just what we want, ie, yield the processor without forcing any useless time delay before we can be rescheduled. This doesn't exist everywhere but it exists in recent Linuxen, so I tried it. It made for a beautiful improvement in my test case numbers: CPU utilization went to 100% and the context swap rate to almost nil. Unfortunately, I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) 2. Reduce the length of the select() delays in s_lock. The current code delays in quanta of 10msec, but on recent Linuxen (and I think other platforms too) the scheduling quantum is 1msec, so we can get the processor back faster if we ask for a 1msec delay.
I tried this and it is definitely a win on Linux; it doesn't seem to hurt anything on older systems either, they just round the delay up to 10msec like before. So I think we should do this, even though it's only a partial answer. 3. Modify the spin loop to do a little more work between TAS() tests. In the existing code there are only about half a dozen instructions total in the normal spin loop, most of them very fast, with the result that the spinning processor does its best to monopolize the system bus with locked probe instructions. This is obviously not good, as it'll interfere with the ability of the spinlock holder to complete its work and release the lock. (The bulk of the spinlock uses are for LWLocks, and with the current data structures the LWLock's own state is usually going to be in the same cache line as its spinlock.)
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
The fact that cmpb isn't helping proves that getting the cache line in a read-only fashion does *not* do enough to protect the xchgb in this way. But maybe another locking instruction would. Comments? regards, tom lane Hi, Tom! Perhaps this link will help: 'Replay: Unknown Features of the NetBurst Core' http://www.xbitlabs.com/articles/cpu/display/replay.html Best regards, Alexander Kirpa
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I have removed this TODO item: * Research use of sched_yield() for spinlock acquisition failure --- Tom Lane wrote: Greg Stark [EMAIL PROTECTED] writes: Marko Kreen marko@l-t.ee writes: (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) Maybe it should try sched_yield once and then use select after that? I tried that, actually, but it didn't seem to offer any particular benefit. (Maybe it would have helped more on older Linux kernels before they changed sched_yield?) I'm feeling even more disenchanted with sched_yield now that Marko pointed out that the behavior was changed recently. Here we have a construct that is not portable cross-platform, does not act as documented in its man page, and the kernel guys feel free to whack its behavior around in major ways without documenting that either. It seems to be a crap-shoot whether it will be useful on any particular machine or not. At least with the select() code we can be reasonably confident we know what will happen. regards, tom lane -- Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: On Sat, 17 Sep 2005, Tom Lane wrote: It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Are you speaking of http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php ? That doesn't seem directly relevant to the point, because it's for a 2-CPU machine; so there's no way to run a test case that uses more than one but less than all the processors. In either the one or all cases, performance ought to be pretty stable regardless of whether the kernel understands about any processor asymmetries that may exist in the hardware. Not to mention that I don't know of any asymmetries in a dual SPARC anyway. We really need to test this on comparable hardware, which I guess means we need Solaris/x86 on something with hyperthreading or known NUMA asymmetry. I have access to a 4-way Opteron 852 running Solaris 10. What patches would you like me to test? -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sun, 18 Sep 2005, Jim C. Nasby wrote: On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: On Sat, 17 Sep 2005, Tom Lane wrote: It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Are you speaking of http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php ? That doesn't seem directly relevant to the point, because it's for a 2-CPU machine; so there's no way to run a test case that uses more than one but less than all the processors. In either the one or all cases, performance ought to be pretty stable regardless of whether the kernel understands about any processor asymmetries that may exist in the hardware. Not to mention that I don't know of any asymmetries in a dual SPARC anyway. We really need to test this on comparable hardware, which I guess means we need Solaris/x86 on something with hyperthreading or known NUMA asymmetry. I have access to a 4-way Opteron 852 running Solaris 10. What patches would you like me to test? These ones here: http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I thought I'd throw SPARC into the equation (SPARC IIIi, in a dual SunFire v250):

vanilla HEAD from ~1 week ago:

bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done
real	1m49.037s	user	0m0.008s	sys	0m0.016s
real	2m3.482s	user	0m0.014s	sys	0m0.026s
real	3m54.215s	user	0m0.028s	sys	0m0.046s

With the spin-delay patch:

real	1m50.791s	user	0m0.008s	sys	0m0.016s
real	2m0.553s	user	0m0.015s	sys	0m0.026s
real	3m58.185s	user	0m0.027s	sys	0m0.048s

Padding the LWLock to 64 bytes:

bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done
real	1m49.594s	user	0m0.008s	sys	0m0.016s
real	1m59.141s	user	0m0.015s	sys	0m0.026s
real	3m56.236s	user	0m0.027s	sys	0m0.047s

Padded to 128 bytes (I'm not sure about the cache line size):

real	1m50.498s	user	0m0.008s	sys	0m0.016s
real	2m0.169s	user	0m0.015s	sys	0m0.026s
real	3m56.891s	user	0m0.027s	sys	0m0.048s

These results seem suspicious to me when compared to those we're seeing on x86 platforms, because padding the LWLock out to 64 or 128 bytes makes no difference (maybe the cache line is bigger?). The good thing is, the difference between 1 and 2 instances of the test is small -- precisely what we want. Thanks, Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom, What I think this means is that the kernel is scheduling the 2 processes onto 2 processors chosen-at-random, without awareness of whether those two processors are on the same chip (in the Xeon case) or have closer NUMA affinity (in the Opteron case). That would be consistent with my experience with HT, and the reason why many software vendors recommend disabling it. Not sure about NUMA. -- Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Josh Berkus josh@agliodbs.com writes: Tom, What I think this means is that the kernel is scheduling the 2 processes onto 2 processors chosen-at-random, without awareness of whether those two processors are on the same chip (in the Xeon case) or have closer NUMA affinity (in the Opteron case). That would be consistent with my experience with HT, and the reason why many software vendors recommend disabling it. Not sure about NUMA. What version of linux was this test with? What you describe was certainly well known for Linux 2.4 and earlier. It was well on its way to becoming established dogma cargo-cult style. However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. Also, those who recommend disabling HT on this basis should realize it didn't penalize the 2 process case the same way. You may be getting a benefit only in the cases where your system isn't heavily loaded. Ie, only when it doesn't really matter. -- greg
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Greg Stark ([EMAIL PROTECTED]) wrote: However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. The tests on the em64t at my place were using 2.6.12. I had thought 2.6 was better about this too, but I don't have another explanation for it. CONFIG_SMT was enabled, etc.. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: * Greg Stark ([EMAIL PROTECTED]) wrote: However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. The tests on the em64t at my place were using 2.6.12. I had thought 2.6 was better about this too, but I don't have another explanation for it. The 4-way Opteron I've been using at Red Hat is running 2.6.12-1.1398_FC4smp (Fedora Core 4 obviously). Red Hat in particular has been working hard in this area, and I thought that their recent kernels included NUMA fixes that weren't yet accepted upstream (at least not in the stable kernel branches). But it seems there's still a ways to go yet. It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sat, 17 Sep 2005, Tom Lane wrote: Stephen Frost [EMAIL PROTECTED] writes: * Greg Stark ([EMAIL PROTECTED]) wrote: However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. The tests on the em64t at my place were using 2.6.12. I had thought 2.6 was better about this too, but I don't have another explanation for it. The 4-way Opteron I've been using at Red Hat is running 2.6.12-1.1398_FC4smp (Fedora Core 4 obviously). Red Hat in particular has been working hard in this area, and I thought that their recent kernels included NUMA fixes that weren't yet accepted upstream (at least not in the stable kernel branches). But it seems there's still a ways to go yet. It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gavin Sherry [EMAIL PROTECTED] writes: On Sat, 17 Sep 2005, Tom Lane wrote: It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Are you speaking of http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php ? That doesn't seem directly relevant to the point, because it's for a 2-CPU machine; so there's no way to run a test case that uses more than one but less than all the processors. In either the one or all cases, performance ought to be pretty stable regardless of whether the kernel understands about any processor asymmetries that may exist in the hardware. Not to mention that I don't know of any asymmetries in a dual SPARC anyway. We really need to test this on comparable hardware, which I guess means we need Solaris/x86 on something with hyperthreading or known NUMA asymmetry. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad? Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere? Unfortunately I don't have root on the Opteron and can't run oprofile. But I'd really like to see some oprofile stats from these two cases so we can figure out what in the world is going on here. Can anyone help? I will try the patch here and see if it gives the same result. If so I could try to run with oprofile if you can give me a quick start. Best Regards, Michael Paesold
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad? Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere? I did some oprofile work and found that the cost seems to be because (1) there's an increase in spinlock contention (more time spent in s_lock), and (2) there's more time spent in LWLockAcquire/Release. I'm not entirely clear about the reason for (1), but (2) apparently is because of the extra cache line swapping, as posited above. In the oprofile trace it's clear that the extra cost comes exactly in the statements that touch fields of the shared LWLock, which are the places where you might have to wait to acquire a cache line. I thought for a bit that the problem might come from having chosen to put a pointer to the spinlock into each LWLock; fetching that pointer implies an additional access to the contended cache line. However, changing the data structure to a simple linear array of spinlocks didn't help. So that whole idea seems to have died a painful death. One other interesting result is that the data structure change neither helped nor hurt on an EM64T machine with 2 physical (4 logical) processors. This is also x86_64, but evidently the caching behavior is totally different from Opterons. One thing that did seem to help a little bit was padding the LWLocks to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring the array starts on a 32-byte boundary. 
This ensures that we won't have any LWLocks crossing cache lines --- contended access to such an LWLock would probably incur the sort of large penalty seen above, because you'd be trading two cache lines back and forth not one. It seems that the important locks are not split that way in CVS tip, because the gain wasn't much, but I wonder whether some effect like this might explain some of the unexplainable performance changes we've noticed in the past (eg, in the dbt2 results). A seemingly unrelated small change in the size of other data structures in shared memory might move things around enough to make a performance-critical lock cross a cache line boundary. On regular x86, the LWLock struct is by chance exactly 16 bytes, so there's no alignment issue. But 64-bit platforms and platforms where slock_t is bigger than char are exposed to this risk. I'm going to go ahead and make that change, since it doesn't seem likely to have any downside. It might be interesting to look into forcing proper alignment of the shared buffer headers as well. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Thu, 15 Sep 2005, Tom Lane wrote: One thing that did seem to help a little bit was padding the LWLocks to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring the array starts on a 32-byte boundary. This ensures that we won't have any LWLocks crossing cache lines --- contended access to such an LWLock would probably incur the sort of large penalty seen above, because you'd be trading two cache lines back and forth not one. It seems that the important locks are not split that way in CVS tip, because the gain wasn't much, but I wonder whether some effect like this might explain some of the unexplainable performance changes we've noticed in the past (eg, in the dbt2 results). A seemingly unrelated small change in the size of other data structures in shared memory might move things around enough to make a performance-critical lock cross a cache line boundary. What about padding the LWLock to 64 bytes on these architectures. Both P4 and Opteron have 64 byte cache lines, IIRC. This would ensure that a cacheline doesn't hold two LWLocks. Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I'm going to go ahead and make that change, since it doesn't seem likely to have any downside. It might be interesting to look into forcing proper alignment of the shared buffer headers as well. Just catching up on your mails - all of that sounds good so far. Everything mentioned so far talks about spinlocks in the general sense, rather than with respect to particular locks. The different lock types we have are held for different amounts of time and are also subject to differing amounts of contention. I think it would be useful to put in a capability to tune each class of lock according to the possibility for delay and contention on it:

- Long delay, low contention, e.g. CheckpointLock => waiter should sleep immediately
- Medium delay, high contention, e.g. WALWriteLock => waiter spins if at head of queue, else sleeps immediately
- Short delay, high contention, e.g. BufMappingLock => waiter spins and retries forever, cos the lock is coming soon
- Short delay, low contention, sometimes long waits at end of VACUUM, e.g. FreeSpaceLock => waiter spins, then eventually sleeps

I'm not sure whether you'll agree with my characterisation of these locks, but the main thing I'm trying to get across is that one-size-fits-all isn't the optimal approach. I'm not saying either that we end up with individual characterisations for each lock type, but a few key ones could be catered for differently. Best Regards, Simon Riggs
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gavin Sherry [EMAIL PROTECTED] writes: What about padding the LWLock to 64 bytes on these architectures. Both P4 and Opteron have 64 byte cache lines, IIRC. This would ensure that a cacheline doesn't hold two LWLocks. I tried that first, actually, but it was a net loss. I guess enlarging the array that much wastes too much cache space. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Thu, 15 Sep 2005, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: What about padding the LWLock to 64 bytes on these architectures. Both P4 and Opteron have 64 byte cache lines, IIRC. This would ensure that a cacheline doesn't hold two LWLocks. I tried that first, actually, but it was a net loss. I guess enlarging the array that much wastes too much cache space. Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and the cmpb/jump removed I get:

[EMAIL PROTECTED] pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done
real	0m54.362s	user	0m0.003s	sys	0m0.009s
real	1m9.788s	user	0m0.011s	sys	0m0.013s
real	2m8.870s	user	0m0.016s	sys	0m0.028s

[EMAIL PROTECTED] pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done
real	0m55.544s	user	0m0.006s	sys	0m0.007s
real	1m9.313s	user	0m0.007s	sys	0m0.018s
real	2m1.769s	user	0m0.017s	sys	0m0.027s

This compares to the following, which is unpadded but has cmpb/jump removed but is otherwise vanilla:

1: 55s	2: 111s	4: 207s

The decrease is small, but it's there. Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gregory Maxwell [EMAIL PROTECTED] writes: might be useful to align the structure so it always crosses two lines and measure the performance difference... the delta could be basically attributed to the cache line bouncing since even one additional bounce would overwhelm the other performance effects from the changed alignment. Good idea. I goosed the struct declaration and setup code to arrange that the BufMappingLock's spinlock and the rest of its data were in different cache lines instead of the same one. The results (still on Red Hat's 4-way Opteron):

previous best code (slock-no-cmpb and spin-delay-2):
	1 31s	2 42s	4 51s	8 100s
with LWLock padded to 32 bytes and correctly aligned:
	1 31s	2 41s	4 51s	8 97s
with LWLocks 32 bytes, but deliberately misaligned:
	1 30s	2 50s	4 102s	8 200s

There is no other reason than having to touch multiple cache lines for the second and third cases to be different: the array indexing code should be exactly the same. These last numbers are pretty close to what I got from the separated-spinlock patch:

	1 31s	2 52s	4 106s	8 213s

So it seems there's no doubt that it's the doubled cache traffic that was causing most of the problem there. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gavin Sherry [EMAIL PROTECTED] writes: Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and the cmpb/jump removed I get: [ 1 55s 2 69s 4 128s ] This compares to the following, which is unpadded but has cmpb/jump removed but is otherwise vanilla: 1: 55s 2: 111s 4: 207s Hmm, that's pretty significant. I tried it on a Xeon EM64T (thanks to Stephen Frost for providing access), also 2 phys 4 log, and get:

Yesterday's CVS tip:				1 32s	2 46s	4 88s	8 168s
plus no-cmpb and spindelay2:			1 32s	2 48s	4 100s	8 177s
plus just-committed code to pad LWLock to 32:	1 33s	2 50s	4 98s	8 179s
alter to pad to 64:				1 33s	2 38s	4 108s	8 180s

I don't know what to make of the 2-process time going down while 4-process goes up; that seems just weird. But both numbers are repeatable. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On 9/15/05, Tom Lane [EMAIL PROTECTED] wrote: Yesterday's CVS tip: 1 32s 2 46s 4 88s 8 168s plus no-cmpb and spindelay2: 1 32s 2 48s 4 100s 8 177s plus just-committed code to pad LWLock to 32: 1 33s 2 50s 4 98s 8 179s alter to pad to 64: 1 33s 2 38s 4 108s 8 180s I don't know what to make of the 2-process time going down while 4-process goes up; that seems just weird. But both numbers are repeatable. It is odd. In the two-process case there is, assuming random behavior, a 1/2 chance that you've already got the right line, but in the 4-process case only a 1/4 chance (since we're on a 4-way box). This would explain why we don't see as much cost in the intentionally misaligned case. You'd expect a similar pattern of improvement with the 64-byte alignment (some in the two-process case, but more in the 4-process case), but here we see more improvement in the two-way case. If I had to guess I might say that the 64-byte alignment is removing much of the unneeded line bouncing in the two-process case but is at the same time creating more risk of bouncing caused by aliasing. Since two processes have a 1/2 chance, the aliasing isn't a problem so the change is a win, but in the four-process case it's no longer a win because with aliasing there is still a lot of fighting over the cache lines even if you pack well, and the decrease in packing makes odd aliasing somewhat more likely. This might also explain why the misaligned case performed so poorly in the 4-process case, since the misalignment didn't just increase the cost 2x, it also increased the likelihood of a bogus bounce due to aliasing. If this is the case, then it may be possible through very careful memory alignment to make sure that no two high-contention locks that are likely to be contended at once share the same line (through either aliasing or through being directly within the same line).
Then again I could be completely wrong, my understanding of multiprocessor cache coherency is very limited, and I have no clue how cache aliasing fits into it... So the above is just uninformed conjecture.
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gregory Maxwell [EMAIL PROTECTED] writes: If I had to guess I might say that the 64byte alignment is removing much of the unneeded line bouncing in the two process case but is at the same time creating more risk of bouncing caused by aliasing. It's an idea ... not sure if it's right or not. One thing I have noticed both on Xeon HT (2-physical-4-logical) and on Opteron (4 real processors) is that the 2-process times are significantly more variable than the 1, 4, or 8-process times: repeating the measurements shows variance around the 10% level for 2 processes but only around 1% for the others. What I think this means is that the kernel is scheduling the 2 processes onto 2 processors chosen-at-random, without awareness of whether those two processors are on the same chip (in the Xeon case) or have closer NUMA affinity (in the Opteron case). The other test cases are all symmetric and there's no way for the kernel to blow it too badly, but in the 2-process case it will be very visible whether the kernel understands the hardware or not. And the impression I have (which might be out of date) is that current Linux kernel releases are not very bright about these things. How does that tie into the point at hand about different results for 64-byte vs 32-byte LWLocks? Not sure, but I feel it's related somehow. Anyway, it'd be interesting to get some test results for these things on SMP machines running non-Linux kernels. Also, we ought to be more rigorous about reporting exactly which kernel we are using for any particular test. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: But the cmpb instruction in the 8.0 version of TAS would have done that, and I think we've already established that the cmpb is a loss on most machines (except maybe single-physical-CPU Xeons). Note that this was a regular Pentium 4 system, not a Xeon. Best Regards, Michael Paesold ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane Sent: 13 September 2005 23:03 To: Marko Kreen Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. That would be a major pita for packagers of binary distributions. Regards, Dave. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: * Tom Lane ([EMAIL PROTECTED]) wrote: I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. I'd really hate to see this happen. I suspect you're thinking too hard about this. It's not like some LWLocks would need SMP spinlocks and others uniprocessor locks. And it's not like it's going to change dynamically. Wouldn't it just be a couple: if (multiprocessor) { complex version } else { simple version } Ugly to be sure but it's not going to spread to other areas of the source base and you know the if statement isn't going to be growing any more arms. It's no uglier than doing the same thing with an #ifdef anyways. If you really really want to have two versions and you think using function pointers would impose too big a performance penalty you could play linker games. But that way lies madness. -- greg ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
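[Greg's runtime branch can be sketched in a few lines. This is an illustration using C11 atomics, not PostgreSQL's actual tas(); the flag name and both code paths are assumptions. The "complex version" here is the unlocked pre-test that the thread debates (the cmpb), which only pays off on multiprocessors:]

```c
#include <stdatomic.h>
#include <stdbool.h>

static bool is_multiprocessor = true;   /* would be set once at startup */
static atomic_bool the_lock;

/* returns 0 if we acquired the spinlock, 1 if it was already held */
int tas(void)
{
    if (is_multiprocessor)
    {
        /* unlocked pre-test, analogous to the debated cmpb: a plain
         * read that skips the bus-locked op when the lock is visibly
         * taken, at the cost of an extra branch when it is free */
        if (atomic_load_explicit(&the_lock, memory_order_relaxed))
            return 1;
    }
    return atomic_exchange(&the_lock, true) ? 1 : 0;
}

void s_unlock(void)
{
    atomic_store(&the_lock, false);
}
```

[As Greg says, the branch predictor sees a never-changing condition, so the runtime cost is close to nil compared with an #ifdef.]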
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: Michael Paesold [EMAIL PROTECTED] writes: To have other data, I have retested the patches on a single-cpu Intel P4 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4 dual-Xeon results it's clear that this is in reality only one cpu. While the runtime for N=1 is better than the other system, for N=4 it's already worse. The situation with the patches is quite different, though. Unfortunately.
CVS tip from 2005-09-12:  1: 36s  2: 77s (cpu ~85%)  4: 159s (cpu ~98%)
only slock-no-cmpb:       1: 36s  2: 81s (cpu ~79%)  4: 177s (cpu ~94%)  (doesn't help this time)
Hm. This is the first configuration we've seen in which slock-no-cmpb was a loss. Could you double-check that result? The first tests were compiled with CFLAGS='-O2 -mcpu=pentium4 -march=pentium4'. I had redone the tests with just CFLAGS='-O2' yesterday. The difference was just about a second, but the result from the patch was the same. The results for N=4 and N=8 show the effect more clearly. configure: CFLAGS='-O2' --enable-casserts On RHEL 4.1, gcc (GCC) 3.4.3 20050227 (Red Hat 3.4.3-22.1)
CVS tip from 2005-09-12:  1: 37s  2: 78s  4: 159s  8: 324s
only slock-no-cmpb:       1: 37s  2: 82s (5%)  4: 178s (12%)  8: 362s (12%)
configure: --enable-casserts (Btw. I have always done make clean ; make ; make install between tests.) Best Regards, Michael Paesold I can't see any reasonable way to do runtime switching of the cmpb test --- whatever logic we put in to control it would cost as much or more than the cmpb anyway :-(. I think that has to be a compile-time choice. From my perspective it'd be acceptable to remove the cmpb only for x86_64, since only there does it seem to be a really significant win. On the other hand it seems that removing the cmpb is a net win on most x86 setups too, so maybe we should just do it and accept that there are some cases where it's not perfect. How many test cases do we have so far?
Summary of the effects without the cmpb instruction seems to be:
8-way Opteron: better
Dual/HT Xeon w/o EM64T: better
Dual/HT EM64T: better for N=cpus, worse for N>cpus (Stephen's)
HT P4 w/o EM64T: worse (stronger for N>cpus)
Have I missed other reports that did test the slock-no-cmpb.patch alone? Two of the systems with positive effects are x86_64, one is an older high-end Intel x86 chip. The negative effect is on a low-cost Pentium 4 with only hyperthreading. According to the mentioned thread's title, this was an optimization for hyperthreading, not regular multi-cpus. We could have more data, especially on newer and high-end systems. Could some of you test the slock-no-cmpb.patch? You'll need an otherwise idle system to get repeatable results. http://archives.postgresql.org/pgsql-hackers/2005-09/msg00565.php http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php I have re-attached the relevant files from Tom's posts because in the archive it's not clear anymore what should go into which file. See instructions in the first messages above. The patch applies to CVS tip with patch -p1 < slock-no-cmpb.patch Best Regards, Michael Paesold slock-no-cmpb.patch Description: Binary data test_setup.sql Description: Binary data test_run_small.sql Description: Binary data startn.sh Description: Binary data ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, Sep 13, 2005 at 08:14:47PM -0400, Stephen Frost wrote: I suppose another option would be to provide separate packages... Could this be done as a shared library so it's more 'plug-and-play' to switch between the two? I dunno, just trying to think about how to deal with this without making it suck terribly bad for me (as a Debian user and maintainer). Note that the Linux kernel has played with moving spinlock code out of line. Due to the effect of having so many, it ended up that the memory saved by moving the code out of line actually benefited overall. An unconditional call to a function can't be that expensive, surely. However, it would *have* to be in the same compile unit, no shared libraries. ELF imposes some overhead for calls in different compile units, and using function pointers won't save you (see Ulrich Drepper's paper on it). However, to make it flexible you would need a pointer to a function pointer. Fortunately these variables won't change often, so the function pointer and the function itself should be in the cache if used often enough. Finally, the kernel solves the problem by saying: if you compile for uniprocessor, optimize the code away, since a lot of the issues don't apply. My main concern is how you even detect the number of processors in a portable way... Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. pgpLWptHgoZDN.pgp Description: PGP signature
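[On Martijn's closing question, processor-count detection: there is no fully portable answer, but sysconf() covers most Unixen. A small sketch; the fallback policy (assume uniprocessor) is our assumption:]

```c
#include <unistd.h>

/* Returns the number of online CPUs, or 1 if it cannot tell.
 * _SC_NPROCESSORS_ONLN is supported by Linux/glibc and the BSDs but is
 * not mandated by POSIX, hence the guard and conservative fallback. */
long detect_ncpus(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n > 0)
        return n;
#endif
    return 1;   /* assume uniprocessor when detection fails */
}
```

[Note this tells you how many processors are online, not how many the postmaster will actually be allowed to use, so treating it as a hint rather than a hard configuration switch seems safer.]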
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom, et al., Updated, with full recompiles between everything and the new modification: N, runtime:
Tip:              1 31s  2 37s  4 86s  8 159s
no-cmpb:          1 32s  2 43s  4 83s  8 168s
spin:             1 32s  2 51s  4 84s  8 160s
spin+mod:         1 32s  2 51s  4 89s  8 158s
spin+no-cmpb:     1 32s  2 51s  4 87s  8 163s
spin+mod+no-cmpb: 1 32s  2 50s  4 86s  8 161s
Unfortunately, the results don't seem to be terribly consistent between runs anyway. Run 2:
Tip:              1 32s  2 43s  4 87s  8 160s
no-cmpb:          1 31s  2 47s  4 83s  8 167s
spin:             1 32s  2 52s  4 88s  8 154s
spin+no-cmpb:     1 32s  2 51s  4 102s 8 166s
spin+mod:         1 32s  2 53s  4 85s  8 154s
spin+mod+no-cmpb: 1 32s  2 51s  4 91s  8 161s
Hope it helps, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. I'm imagining that we insert a dummy locked instruction just ahead of the xchgb, which touches the spinlock in such a way as to not change its state. I tried this, using this tas code:

static __inline__ int
tas(volatile slock_t *lock)
{
	register slock_t _res = 1;
	register slock_t _dummy = 0;

	/* Use a locking test before trying to take the spinlock */
	/* xchg implies a LOCK prefix, so no need to say LOCK for it */
	__asm__ __volatile__(
		"	lock			\n"
		"	xaddb	%2,%1	\n"
		"	xchgb	%0,%1	\n"
:		"+q"(_res), "+m"(*lock), "+q"(_dummy)
:
:		"memory", "cc");
	return (int) _res;
}

At least on Opteron, it's a loser. The previous best results (with slock-no-cmpb and spin-delay patches) were
1 31s  2 42s  4 51s  8 100s
and with this instead of slock-no-cmpb,
1 33s  2 45s  4 55s  8 104s
The xadd may indeed be helping in terms of protecting the xchg from end-of-timeslice --- the rate of select() delays is really tiny, one every few seconds, which is better than I saw before. But the extra cost of the extra locked operation isn't getting repaid overall. Feel free to try it on other hardware, but it doesn't look promising. BTW, I also determined that on that 4-way Opteron box, the integer modulo idea doesn't make any difference --- that is, spin-delay and what Michael called spin-delay-2 are the same speed. I think I had tried the modulo before adding the variable spin delay, and it did help in that configuration; but most likely, it was just helping by stretching out the amount of time spent looping before entering the kernel. So we can drop that idea too. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
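[The dummy-locked-instruction idea above can be expressed portably with C11 atomics: a fetch_add of zero is a read-modify-write that leaves the lock unchanged but forces the CPU to take exclusive ownership of the cache line before the real exchange. A sketch for illustration only; this is our analogue of Tom's x86 asm, not PostgreSQL code, and it carries the same two-locked-ops cost that made the experiment a loser:]

```c
#include <stdatomic.h>

/* returns 0 if the lock was acquired, 1 if it was already held */
int tas_with_dummy(volatile atomic_char *lock)
{
    /* dummy locked op: adds 0, leaving the lock state unchanged, but
     * forcing this CPU to gain exclusive ownership of the cache line
     * (the portable counterpart of the "lock xaddb" above) */
    atomic_fetch_add(lock, 0);

    /* the real test-and-set, counterpart of the xchgb */
    return atomic_exchange(lock, 1) ? 1 : 0;
}
```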
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. ... Maybe it'd be better to allocate the spinlocks off by themselves. Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2):
1 31s  2 42s  4 51s  8 100s
with lwlock-separate added:
1 31s  2 52s  4 106s  8 213s
What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad? Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere? Unfortunately I don't have root on the Opteron and can't run oprofile. But I'd really like to see some oprofile stats from these two cases so we can figure out what in the world is going on here. Can anyone help? regards, tom lane

*** src/backend/storage/lmgr/lwlock.c.orig	Sat Aug 20 19:26:24 2005
--- src/backend/storage/lmgr/lwlock.c	Wed Sep 14 11:50:09 2005
***************
*** 27,35 ****
  #include "storage/spin.h"

  typedef struct LWLock
  {
! 	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
  	bool		releaseOK;		/* T if ok to release waiters */
  	char		exclusive;		/* # of exclusive holders (0 or 1) */
  	int			shared;			/* # of shared holders (0..MaxBackends) */
--- 27,44 ----
  #include "storage/spin.h"

+ /*
+  * Each LWLock has an associated spinlock, which we consider as locking the
+  * content of the LWLock struct as well as the associated queue of waiting
+  * PGPROCs.  To minimize cache-line contention on multiprocessors, the
+  * spinlocks are kept physically separate from their LWLocks --- we don't
+  * want a spinning processor to interfere with access to the LWLock for
+  * the processor holding the spinlock.
+  */
+ 
  typedef struct LWLock
  {
! 	slock_t    *mutex;			/* Protects LWLock and queue of PGPROCs */
  	bool		releaseOK;		/* T if ok to release waiters */
  	char		exclusive;		/* # of exclusive holders (0 or 1) */
  	int			shared;			/* # of shared holders (0..MaxBackends) */
***************
*** 39,44 ****
--- 48,65 ----
  } LWLock;

  /*
+  * We allocate CACHE_LINE_SIZE bytes for the fixed LWLocks' spinlocks.
+  * We assume that the fixed locks are few enough and heavily used enough
+  * that it's worth trying to put each one's spinlock in its own cache line.
+  * The dynamic locks are assumed to be many and less heavily hit, so they
+  * get just MAXALIGN space apiece.  (Packing spinlocks tightly sounds like a
+  * bad idea on machines where slock_t is just a byte ... even for lesser-used
+  * spinlocks, there would be enough locks per cache line to create a
+  * contention issue.)
+  */
+ #define CACHE_LINE_SIZE		64
+ 
+ /*
   * This points to the array of LWLocks in shared memory. Backends inherit
   * the pointer by fork from the postmaster. LWLockIds are indexes into
   * the array.
***************
*** 135,144 ****
  	Size		size;
  	int			numLocks = NumLWLocks();

! 	/* Allocate the LWLocks plus space for shared allocation counter. */
  	size = mul_size(numLocks, sizeof(LWLock));
! 	size = add_size(size, 2 * sizeof(int));

  	return size;
  }
--- 156,174 ----
  	Size		size;
  	int			numLocks = NumLWLocks();

! 	/* Space for the array of LWLock structs. */
  	size = mul_size(numLocks, sizeof(LWLock));
! 	/* Space for dynamic allocation counter and MAXALIGN padding. */
! 	size = add_size(size, 2 * sizeof(int) + MAXIMUM_ALIGNOF);
! 
! 	/* Space for the spinlocks --- see notes for CACHE_LINE_SIZE. */
! 	size = add_size(size,
! 					mul_size((int) NumFixedLWLocks,
! 							 Max(sizeof(slock_t), CACHE_LINE_SIZE)));
! 
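[The layout the patch aims for, one heavily-used spinlock alone per cache line so that spinning on one lock cannot bounce the line holding its neighbours, can be sketched with a padded struct. The constant, struct name and array size are illustrative assumptions, not the patch's actual declarations:]

```c
#include <stdalign.h>

#define CACHE_LINE_SIZE 64      /* assumption: 64-byte lines, as on x86 */

typedef struct
{
    volatile char lock;                           /* the slock_t itself */
    char padding[CACHE_LINE_SIZE - sizeof(char)]; /* keep neighbours out */
} PaddedSpinlock;

/* one padded spinlock per fixed LWLock; dynamic locks could be packed
 * more tightly (MAXALIGN apiece), as the patch's comment describes */
static alignas(CACHE_LINE_SIZE) PaddedSpinlock fixed_spinlocks[16];
```

[As Tom's numbers go on to show, padding alone is not automatically a win: separating the spinlock from its LWLock means two lines bounce between processors instead of one.]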
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I wrote: We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. I'm imagining that we insert a dummy locked instruction just ahead of the xchgb, which touches the spinlock in such a way as to not change its state. I tried this, using this tas code: ... I have tried it on the P4 w/ HT. CVS tip 1: 37s 2: 78s 4: 159s 8: 324 only slock-no-cmpb 1: 37s 2: 82s 4: 178s 8: 362 only dummy-locking 1: 44s 2: 96s 4: 197s ... I guess there is no reason to try any more. Best Regards, Michael Paesold ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: The second reason that the futex patch is helping is that when a spinlock delay does occur, it allows the delaying process to be awoken almost immediately, rather than delaying 10 msec or more as the existing code does. However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. Why do you think so? AFAIK in the uncontended case there will be no kernel access, only an atomic inc/dec. In the contended case you'll want a task switch anyway, so the futex managing should not matter. Also this mechanism is specifically optimized for inter-process locking, so I don't think you can get a more efficient mechanism out of the side-effects of generic syscalls. If you don't want Linux-specific locking in core code, then it's another matter... 1. Use sched_yield() if available: it does just what we want, ie, yield the processor without forcing any useless time delay before we can be rescheduled. This doesn't exist everywhere but it exists in recent Linuxen, so I tried it. It made for a beautiful improvement in my test case numbers: CPU utilization went to 100% and the context swap rate to almost nil. Unfortunately, I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) This is intended behaviour of sched_yield.
http://lwn.net/Articles/31462/ http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2 -- marko ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
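[Marko's point about the uncontended path staying in user space can be made concrete with a minimal futex sketch. Linux-only, and the wrapper names are ours; a real futex lock also needs waiter accounting and a wake on release, which is omitted here. The key property: acquiring a free lock is one compare-and-swap with no syscall, and the kernel is entered only to sleep on contention:]

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* sleeps only if *addr still equals expected; otherwise the kernel
 * returns immediately with EAGAIN, which closes the lost-wakeup race */
static long futex_wait(atomic_int *addr, int expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* wakes up to nwaiters processes sleeping on addr; returns count woken */
static long futex_wake(atomic_int *addr, int nwaiters)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}

/* uncontended acquire: one compare-and-swap, no kernel entry at all;
 * returns 0 on success, 1 if the lock is held (caller would futex_wait) */
int futex_trylock(atomic_int *lock)
{
    int expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1) ? 0 : 1;
}
```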
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: I'll do tomorrow morning (CEST, i.e. in about 11 hours). These are the tests with the change: if ((--spins % MAX_SPINS_PER_DELAY) == 0) to if (--spins == 0) I have called the resulting patch (spin-delay + this change) spin-delay-2.
again with only slock-no-cmpb applied:      1: 55s  4: 119s (cpu ~97%)
with only spin-delay-2 applied:             1: 56s  4: 125s (cpu ~99.5%)
with slock-no-cmpb and spin-delay-2 applied: 1: 56s  4: 125s (cpu ~100%)
(Note that the cpu averages are not calculated but estimated by looking at vmstat.) Yesterday's results:
CVS tip from 2005-09-12 ~16:00:         1: 57s  2: 82s  4: 124s  8: 237s
with only slock-no-cmpb.patch applied:  1: 55s  2: 79s  4: 119s  8: 229s
with only spin-delay.patch applied:     1: 56s  2: 79s  4: 124s  8: 235s
with both patches applied:              1: 55s  2: 78s  4: 124s  8: 235s
To have other data, I have retested the patches on a single-cpu Intel P4 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4 dual-Xeon results it's clear that this is in reality only one cpu. While the runtime for N=1 is better than the other system, for N=4 it's already worse. The situation with the patches is quite different, though. Unfortunately.
CVS tip from 2005-09-12:         1: 36s  2: 77s (cpu ~85%)   4: 159s (cpu ~98%)
only slock-no-cmpb:              1: 36s  2: 81s (cpu ~79%)   4: 177s (cpu ~94%)  (doesn't help this time)
only spin-delay:                 1: 36s  2: 86s (cpu =100%)  4: 157s (cpu =100%) (bad runtime for N=2 (repeatable), cpu not doing real work here?)
slock-no-cmpb and spin-delay:    1: 36s  2: 106s (cpu =100%) 4: 192s (cpu =100%) (it gets worse)
only spin-delay-2:               1: 36s  2: 85s (cpu =100%)  4: 160s (cpu =100%) (quite the same as spin-delay)
slock-no-cmpb and spin-delay-2:  1: 36s  2: 109s (cpu =100%) 4: 198s (cpu =100%) (worse again)
CS rate was low (20-50) for all tests, increasing for N>2, which has to be expected. For this system I see no case for applying these patches. Is there a portable way to detect the CPU we are running on? Do you have any other idea how to implement the delays?
Best Regards, Michael Paesold ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Marko Kreen marko@l-t.ee writes: (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) Maybe it should try sched_yield once and then use select after that? -- greg ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Marko Kreen marko@l-t.ee writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. Why do you think so? AFAIK in the uncontended case there will be no kernel access, only an atomic inc/dec. In the uncontended case, we never even enter s_lock() and so the entire mechanism of yielding is irrelevant. The problem that's being exposed by these test cases is that on multiprocessors, you can see a significant rate of spinlock contention (order of 100 events/second, which is still a tiny fraction of the number of TAS calls) and our existing mechanism for dealing with contention is just not efficient enough. In the contended case you'll want a task switch anyway, so the futex managing should not matter. No, we DON'T want a task switch. That's the entire point: in a multiprocessor, it's a good bet that the spinlock is held by a task running on another processor, and doing a task switch will take orders of magnitude longer than just spinning until the lock is released. You should yield only after spinning long enough to make it a strong probability that the spinlock is held by a process that's lost the CPU and needs to be rescheduled. If you don't want Linux-specific locking in core code, then it's another matter... Well, it's true, we don't particularly want a one-platform solution, but if it did what we wanted we might hold our noses and use it anyway. (I think, BTW, that using futexes at the spinlock level is misguided; what would be interesting would be to see if we could substitute for both LWLock and spinlock logic with one futex-based module.)
I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) This is intended behaviour of sched_yield. http://lwn.net/Articles/31462/ http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2 No; that page still says specifically So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Marko Kreen marko@l-t.ee writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: The second reason that the futex patch is helping is that when a spinlock delay does occur, it allows the delaying process to be awoken almost immediately, rather than delaying 10 msec or more as the existing code does. However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. Why do you think so? AFAIK in the uncontended case there will be no kernel access, only an atomic inc/dec. In the contended case you'll want a task switch anyway, so the futex managing should not matter. Also this mechanism is specifically optimized for inter-process locking, so I don't think you can get a more efficient mechanism out of the side-effects of generic syscalls. If you don't want Linux-specific locking in core code, then it's another matter... How does the futex-using code compare with the new patches on this new benchmark test? -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Greg Stark [EMAIL PROTECTED] writes: Marko Kreen marko@l-t.ee writes: (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) Maybe it should try sched_yield once and then use select after that? I tried that, actually, but it didn't seem to offer any particular benefit. (Maybe it would have helped more on older Linux kernels before they changed sched_yield?) I'm feeling even more disenchanted with sched_yield now that Marko pointed out that the behavior was changed recently. Here we have a construct that is not portable cross-platform, does not act as documented in its man page, and the kernel guys feel free to whack its behavior around in major ways without documenting that either. It seems to be a crap-shoot whether it will be useful on any particular machine or not. At least with the select() code we can be reasonably confident we know what will happen. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] writes: On contented case you'll want task switch anyway, so the futex managing should not matter. No, we DON'T want a task switch. That's the entire point: in a multiprocessor, it's a good bet that the spinlock is held by a task running on another processor, and doing a task switch will take orders of magnitude longer than just spinning until the lock is released. You should yield only after spinning long enough to make it a strong probability that the spinlock is held by a process that's lost the CPU and needs to be rescheduled. Does the futex code make any attempt to record the CPU of the process grabbing the lock? Clearly it wouldn't be a guarantee of anything but if it's only used for short-lived spinlocks while acquiring longer lived locks then maybe? No; that page still says specifically So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... Well it would be no worse than select or any other random i/o syscall. It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. -- greg ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom, All: It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. For people using this on Linux 2.6, which scheduler are you using? Deadline is the recommended one for databases, and does offer significant (+5-8%) benefits in some heavy-contention environments, at least in OSDL tests. -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Josh Berkus josh@agliodbs.com writes: Tom, All: It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. For people using this on Linux 2.6, which scheduler are you using? Deadline is the recommended one for databases, and does offer significant (+5-8%) benefits in some heavy-contention environments, at least in OSDL tests. I thought 'deadline' was an I/O scheduler, not a CPU scheduler? -Doug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Greg Stark [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] writes: No; that page still says specifically So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... Well it would be no worse than select or any other random i/o syscall. It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. People have complained on l-k several times about the 2.6 sched_yield() behavior; the response has basically been if you rely on any particular sched_yield() behavior for synchronization, your app is broken--it's not a synchronization primitive. -Doug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Douglas McNaught [EMAIL PROTECTED] writes: It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. People have complained on l-k several times about the 2.6 sched_yield() behavior; the response has basically been if you rely on any particular sched_yield() behavior for synchronization, your app is broken--it's not a synchronization primitive. They're not talking about this. They're talking about applications that spin on sched_yield() and expect it to reduce cpu load as if the process were calling sleep(). What Tom found was that some processes are never scheduled when sched_yield is called. There's no reason that should be happening. -- greg ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Greg Stark [EMAIL PROTECTED] writes: What Tom found was that some processes are never scheduled when sched_yield is called. There's no reason that should be happening. Yeah, that would probably be a bug... -Doug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005 12:21:45 -0400 Douglas McNaught [EMAIL PROTECTED] wrote: Josh Berkus josh@agliodbs.com writes: Tom, All: It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. For people using this on Linux 2.6, which scheduler are you using? Deadline is the recommended one for databases, and does offer significant (+5-8%) benefits in some heavy-contention environments, at least in OSDL tests. I thought 'deadline' was an I/O scheduler, not a CPU scheduler? Yes, that's correct. That's an i/o elevator algorithm. Mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Douglas McNaught [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: What Tom found was that some processes are never scheduled when sched_yield is called. There's no reason that should be happening. Yeah, that would probably be a bug... I suspect the kernel hackers might consider it deliberate in a NUMA-aware kernel. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: I'm feeling even more disenchanted with sched_yield now that Marko pointed out that the behavior was changed recently. Here we have a To be fair, I'm not entirely sure 'recently' is quite the right word. It sounds like it changed during the 2.5 development cycle, which was between major kernel releases, and looks like it was around Dec 2003. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Michael Paesold [EMAIL PROTECTED] writes: To have other data, I have retested the patches on a single-cpu Intel P4 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2.4 GHz dual-Xeon results it's clear that this is in reality only one cpu. While the runtime for N=1 is better than the other system, for N=4 it's already worse. The situation with the patches is quite different, though. Unfortunately. CVS tip from 2005-09-12: 1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%) only slock-no-cmpb: 1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%) (doesn't help this time) Hm. This is the first configuration we've seen in which slock-no-cmpb was a loss. Could you double-check that result? Is there a portable way to detect the CPU we are running on? Do you have any other idea how to implement the delays? I can't see any reasonable way to do runtime switching of the cmpb test --- whatever logic we put in to control it would cost as much or more than the cmpb anyway :-(. I think that has to be a compile-time choice. From my perspective it'd be acceptable to remove the cmpb only for x86_64, since only there does it seem to be a really significant win. On the other hand it seems that removing the cmpb is a net win on most x86 setups too, so maybe we should just do it and accept that there are some cases where it's not perfect. Looking back at the original discussion, I see that the cmpb was added with little discussion and no specific performance testing, so it shouldn't be presumed to have a lot of weight behind it: http://archives.postgresql.org/pgsql-patches/2003-12/msg00352.php The other thing that's going on in the patch is dynamic adjustment of spins_per_delay, and that could certainly be made much smarter than it is now. For instance, it'd cost almost nothing to make the lower and upper bounds for spins_per_delay be variables instead of constants. We could set the bounds during postmaster startup based on some criteria yet to be determined.
regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
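The variable-bounds idea at the end of the message above could be sketched roughly as follows. This is a hypothetical illustration, not PostgreSQL's actual code: the names, constants, and the adjustment heuristic are all invented; the real logic lives in s_lock.c.

```c
/* Hypothetical sketch of runtime-adjustable spins_per_delay bounds.
 * All names and constants here are invented for illustration. */
static int min_spins_per_delay = 10;    /* could be set once at postmaster startup */
static int max_spins_per_delay = 1000;  /* instead of being compile-time constants */

static int spins_per_delay = 100;

/* Called after each spinlock acquisition: if we never had to sleep,
 * spinning is paying off, so spin a bit longer next time; if we did
 * sleep, back off quickly.  Clamp to the runtime-settable bounds. */
static void
update_spins_per_delay(int had_to_sleep)
{
    if (had_to_sleep)
        spins_per_delay -= 10;
    else
        spins_per_delay += 1;

    if (spins_per_delay < min_spins_per_delay)
        spins_per_delay = min_spins_per_delay;
    if (spins_per_delay > max_spins_per_delay)
        spins_per_delay = max_spins_per_delay;
}
```

The only interesting change is that min_/max_spins_per_delay are ordinary variables a startup routine could set (e.g. from the detected CPU count), rather than constants baked in at compile time.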
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, Sep 13, 2005 at 10:10:13AM -0400, Tom Lane wrote: Marko Kreen marko@l-t.ee writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. Why do you think so? AFAIK in the uncontended case there will be no kernel access, only atomic inc/dec. In the uncontended case, we never even enter s_lock() and so the entire mechanism of yielding is irrelevant. The problem that's being exposed by these test cases is that on multiprocessors, you can see a significant rate of spinlock contention (order of 100 events/second, which is still a tiny fraction of the number of TAS calls) and our existing mechanism for dealing with contention is just not efficient enough. In the contended case you'll want a task switch anyway, so the futex managing should not matter. No, we DON'T want a task switch. That's the entire point: in a multiprocessor, it's a good bet that the spinlock is held by a task running on another processor, and doing a task switch will take orders of magnitude longer than just spinning until the lock is released. You should yield only after spinning long enough to make it a strong probability that the spinlock is held by a process that's lost the CPU and needs to be rescheduled. Hmm. I guess this could be separated into 2 cases: 1. Light load - both lock owner and lock requester won't get scheduled while busy (owner in critical section, waiter spinning.) 2. Big load - either or both of them gets scheduled while busy. (waiter is scheduled by OS or voluntarily by e.g. calling select()) So my impression is that currently you optimize for #1 at the expense of #2, while with futexes you'd optimize for #2 at the expense of #1.
Additionally I'm pretty convinced that futexes give you the most efficient implementation for #2, as the kernel knows which processes are waiting on a particular lock so it can make the best scheduling decisions. If you don't want Linux-specific locking in core code, then it's another matter... Well, it's true, we don't particularly want a one-platform solution, but if it did what we wanted we might hold our noses and use it anyway. (I think, BTW, that using futexes at the spinlock level is misguided; what would be interesting would be to see if we could substitute for both LWLock and spinlock logic with one futex-based module.) Use pthreads ;) I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) This is intended behaviour of sched_yield. http://lwn.net/Articles/31462/ http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2 No; that page still says specifically So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... My fault. As I saw that there was a problem with sched_yield, I said I bet this is because of a behaviour change and only skimmed the exact details. But the point that sched_yield is not meant for such usage still stands.
About fast yielding, the comment on sys_sched_yield() says: * sys_sched_yield - yield the current processor to other threads. * * this function yields the current CPU by moving the calling thread * to the expired array. If there are no other threads running on this * CPU then this function will return. So there is just nothing else to schedule on that CPU. -- marko ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
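Marko's futex suggestion can be sketched along the lines of Ulrich Drepper's "Futexes Are Tricky" mutex. This is a hedged, Linux-only illustration using GCC __atomic builtins and the raw futex syscall; it is not proposed PostgreSQL code, and all names here are invented. The point it demonstrates is the one made above: the uncontended paths never enter the kernel.

```c
/* Linux-only sketch of a futex-based lock (after Drepper's mutex).
 * Lock states: 0 = free, 1 = held, 2 = held with waiters. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int
cmpxchg(int *p, int oldv, int newv)
{
    int expected = oldv;
    __atomic_compare_exchange_n(p, &expected, newv, 0,
                                __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    return expected;            /* old value: == oldv iff the swap happened */
}

static void
futex_lock(int *f)
{
    int c = cmpxchg(f, 0, 1);   /* fast path: 0 -> 1, no kernel entry */
    if (c != 0)
    {
        do {
            /* mark the lock contended, then sleep until woken */
            if (c == 2 || cmpxchg(f, 1, 2) != 0)
                syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
        } while ((c = cmpxchg(f, 0, 2)) != 0);
    }
}

static void
futex_unlock(int *f)
{
    /* fast path: no waiters (state 1) means no syscall at all */
    if (__atomic_exchange_n(f, 0, __ATOMIC_RELEASE) == 2)
        syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

This is the trade-off discussed above: the kernel only gets involved once the lock is actually contended, at which point it can pick which waiter to run, at the cost of a full sleep/wake cycle where PostgreSQL's spinlock would prefer to keep spinning.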
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Marko Kreen marko@l-t.ee writes: Hmm. I guess this could be separated into 2 cases: 1. Light load - both lock owner and lock requester wont get scheduled while busy (owner in critical section, waiter spinning.) 2. Big load - either or both of them gets scheduled while busy. (waiter is scheduled by OS or voluntarily by eg. calling select()) Don't forget that the coding rules for our spinlocks say that you mustn't hold any such lock for more than a couple dozen instructions, and certainly any kernel call while holding the lock is Right Out. There is *no* case where the holder of a spinlock is going to voluntarily give up the CPU. The design intention was that the odds of losing the CPU while holding a spinlock would be negligibly small, simply because we don't hold it very long. About fast yielding, comment on sys_sched_yield() says: * sys_sched_yield - yield the current processor to other threads. * * this function yields the current CPU by moving the calling thread * to the expired array. If there are no other threads running on this * CPU then this function will return. Mph. So that's pretty much exactly what I suspected... I just had a thought: it seems that the reason we are seeing a significant issue here is that on SMP machines, the cost of trading exclusively-owned cache lines back and forth between processors is so high that the TAS instructions (specifically the xchgb, in the x86 cases) represent a significant fraction of backend execution time all by themselves. (We know this is the case due to oprofile results, see discussions from last April.) What that means is that there's a fair chance of a process losing its timeslice immediately after the xchgb. Which is precisely the scenario we do not want, if the process successfully acquired the spinlock by means of the xchgb. We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. 
I'm imagining that we insert a dummy locked instruction just ahead of the xchgb, which touches the spinlock in such a way as to not change its state. (xchgb won't do for this, but maybe one of the other lockable instructions will.) We do the xchgb just after this one. The idea is that if we don't own the cache line, the first instruction causes it to be faulted into the processor's cache, and if our timeslice expires while that is happening, we lose the processor without having acquired the spinlock. This assumes that once we've got the cache line, the xchgb that actually does the work can get executed with not much extra time spent and only low probability of someone else stealing the cache line back first. The fact that cmpb isn't helping proves that getting the cache line in a read-only fashion does *not* do enough to protect the xchgb in this way. But maybe another locking instruction would. Comments? regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
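Tom's dummy-locked-instruction idea can be rendered portably with GCC builtins rather than the inline asm he describes (in real TAS asm it would be something like a locked add of zero on the lock byte just before the xchgb). The function name and the slock_t typedef here are illustrative, not PostgreSQL's actual TAS:

```c
/* Hypothetical illustration of the "grab the cache line first" idea.
 * The locked add of zero is a dummy atomic write: it forces the cache
 * line into exclusive state without changing the lock's value, so if
 * the process loses its timeslice right there, it has not yet acquired
 * the spinlock.  The real test-and-set follows immediately. */
typedef volatile unsigned char slock_t;

static int
tas_with_line_prefetch(slock_t *lock)
{
    /* dummy locked no-op write: faults the line in exclusively */
    __sync_fetch_and_add(lock, 0);

    /* the actual acquisition attempt (the xchgb equivalent) */
    return __sync_lock_test_and_set(lock, 1) != 0;  /* nonzero = already held */
}
```

The contrast with the cmpb prefetch is exactly the point of the message above: cmpb pulls the line in shared (read-only) state, which evidently isn't enough, while a locked write pulls it in exclusive state.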
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. This had seemed like a good idea at the time, on the theory that once you'd obtained the spinlock you'd also have pulled in the LWLock contents. But that's only optimal under an assumption of low contention. If there's high contention for the spinlock, then another processor spinning on the lock will be continuously taking away ownership of the cache line and thus slowing down the guy who's got the lock and is trying to examine/update the LWLock contents. Maybe it'd be better to allocate the spinlocks off by themselves. Then, spinning processors would not affect the processor that's updating the LWLock; only when it finishes doing that and needs to clear the spinlock will it have to contend with the spinners for the cache line containing the spinlock. This would add an instruction or so to LWLockAcquire and LWLockRelease, and would be of no benefit on uniprocessors, but it might be worth doing for multiprocessors. Another patch to test ... I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
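The two layouts under discussion can be sketched as below. Field names only approximate the 8.1-era LWLock, and the padding scheme is an assumption about how "allocate the spinlocks off by themselves" might look; it is not the actual patch.

```c
/* Sketch of the two LWLock layouts being compared (illustrative only). */
#include <stddef.h>

#define CACHE_LINE_SIZE 64

typedef unsigned char slock_t;

/* Current layout: the protecting spinlock is embedded in the struct,
 * so the lock word and the state it protects share a cache line. */
typedef struct LWLockEmbedded
{
    slock_t mutex;          /* protects the fields below */
    char    exclusive;      /* # of exclusive holders (0 or 1) */
    int     shared;         /* # of shared holders */
    void   *head;           /* head of wait queue */
    void   *tail;           /* tail of wait queue */
} LWLockEmbedded;

/* Proposed layout: spinlocks live in their own padded array, so that
 * spinning processors hammer a different cache line than the one the
 * lock holder is updating. */
typedef struct PaddedSlock
{
    slock_t mutex;
    char    pad[CACHE_LINE_SIZE - sizeof(slock_t)];
} PaddedSlock;

typedef struct LWLockSeparate
{
    PaddedSlock *mutex;     /* points into the separate spinlock array */
    char    exclusive;
    int     shared;
    void   *head;
    void   *tail;
} LWLockSeparate;
```

The cost Tom mentions is visible here: LWLockAcquire/Release must chase one extra pointer, and releasing still ends with a write to the spinlock's cache line, which is where the patch turned out to lose.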
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005 Tom Lane wrote: I wrote: We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. This had seemed like a good idea at the time, on the theory that once you'd obtained the spinlock you'd also have pulled in the LWLock contents. But that's only optimal under an assumption of low contention. If there's high contention for the spinlock, then another processor spinning on the lock will be continuously taking away ownership of the cache line and thus slowing down the guy who's got the lock and is trying to examine/update the LWLock contents. If this were the case, perhaps fetching the spin lock with read-only permission first should have helped. Modern processors have store buffers so that a store miss does not slow down the processor. Therefore, if processor A has the spin lock and is examining and updating the LWLock on the same cache line, then as long as processor B doesn't try to write the same cache line, processor A won't be slowed down. What would happen is that multiple writes coalesce in the store buffer and are flushed to memory when processor A regains write permission. I still think your first scenario is possible. That's when, just as processor B finally gets the lock, its time slice runs out. This sounds to me likely to cause a chain of context switches. Maybe it'd be better to allocate the spinlocks off by themselves. Then, spinning processors would not affect the processor that's updating the LWLock; only when it finishes doing that and needs to clear the spinlock will it have to contend with the spinners for the cache line containing the spinlock.
This would add an instruction or so to LWLockAcquire and LWLockRelease, and would be of no benefit on uniprocessors, but it might be worth doing for multiprocessors. Another patch to test ... I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. I'd really hate to see this happen. In this case I don't think the change you're proposing would have much of an impact on a uniprocessor machine. Having separate compile-time options for uniprocessor and multiprocessor would open the gates for potentially other changes which *would* have a more serious impact on one or the other when compiled for the opposite. I think this would be a serious problem for binary distributions and correspondingly their users. Happy to test these changes, btw. Also, I'm redoing my tests from the other patches with all the various combinations; I'm a bit concerned that the recompiles I did (which were just 'make', not 'make clean', 'make') for them previously didn't actually recompile everything necessary. Sorry if the results differ. Hope to have that all done this evening. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005, Stephen Frost wrote: * Tom Lane ([EMAIL PROTECTED]) wrote: I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. I'd really hate to see this happen. In this case I don't think the change you're proposing would have much of an impact on a uniprocessor machine. Having separate compile-time options for uniprocessor and multiprocessor would open the gates for potentially other changes which *would* have a more serious impact on one or the other when compiled for the opposite. I think this would be a serious problem for binary distributions and correspondingly their users. It does make it painful for distribution/package maintainers but I think the potential benefits for single/multi-CPU architectures are high. It means that our lock intrinsic on uniprocessors can just be a lock/delay loop without any spinning -- which is a complete waste of time if you only have one CPU. Doing this at runtime involves some pretty significant uglification of low-level code, I think. Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Gavin Sherry ([EMAIL PROTECTED]) wrote: It does make it painful for distribution/package maintainers but I think the potential benefits for single/multi-CPU architectures are high. It means that our lock intrinsic on uniprocessors can just be a lock/delay loop without any spinning -- which is a complete waste of time if you only have one CPU. Doing this at runtime involves some pretty significant uglification of low-level code, I think. I suspect distributors would go for the multi-cpu setup (especially if a uniprocessor build is *broken* for multiprocessor) and then in a lot of cases you end up not actually getting any benefit. I'm afraid you'd also end up having to tell a lot of people who complain to recompile, who will then complain back to their distributors, etc. I suppose another option would be to provide separate packages... Could this be done as a shared library so it's more 'plug-and-play' to switch between the two? I dunno, just trying to think about how to deal with this without making it suck terribly bad for me (as a Debian user and maintainer). Certainly makes me wish there was a way to do it which made the kernel handle the uniprocessor/multiprocessor question and did locking as appropriate for those. Ah, well. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Min Xu (Hsu) [EMAIL PROTECTED] writes: ...If this were the case, perhaps first fetch the spin lock with read-only permission should have helped. But the cmpb instruction in the 8.0 version of TAS would have done that, and I think we've already established that the cmpb is a loss on most machines (except maybe single-physical-CPU Xeons). I suggested in my other message that it might help to grab write permission on the cache line before actually trying to acquire the spin lock --- but I don't see how getting read-only permission first will help. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: I suspect distributors would go for the multi-cpu setup (especially if a uniprocessor build is *broken* for multiprocessor) and then in a lot of cases you end up not actually getting any benefit. I'm afraid you'd also end up having to tell alot of people who complain to recompile, who will then complain back to their distributors, etc. Yeah. Being in charge of Red Hat's packaging of PG, I feel that pain as keenly as anybody ... and I *know* RH will not be interested in shipping two different packages. If we go this way, the RH distributions will use the --optimize-multi switch, because that's where the money is. The bottom line here is that we will have to make some compromises: if we want one-size-fits-all code, it will not be optimal for every single architecture. If we don't do one-size-fits-all, then we will pay for it in various other ways. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005 Tom Lane wrote: Min Xu (Hsu) [EMAIL PROTECTED] writes: ...If this were the case, perhaps first fetch the spin lock with read-only permission should have helped. But the cmpb instruction in the 8.0 version of TAS would have done that, and I think we've already established that the cmpb is a loss on most machines (except maybe single-physical-CPU Xeons). I suggested in my other message that it might help to grab write permission on the cache line before actually trying to acquire the spin lock --- but I don't see how getting read-only permission first will help. Yes, I agree. What I was trying to say was that if the second scenario you hypothesize were true, i.e. if fetching a writeable line before actually trying to acquire the spin lock would cause another processor to slow down its execution inside the critical section, then fetching read-only lines would have helped. As you said, however, experimental results show that fetching read-only lines didn't help, which led me to wonder whether the second scenario you described was really happening. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Min Xu (Hsu) [EMAIL PROTECTED] writes: ... As you said, however, experimental results shows fetching read-only lines didn't help, which led me wonder whether the second scenario your described was really happening. I don't know --- we haven't tried it. I do intend to work up some patches for these ideas, and then we can test them. Or, if you happen to be wide awake at the moment, feel free to take a shot at it ... I'm out of steam and won't get to it before tomorrow... regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005 Tom Lane wrote: Min Xu (Hsu) [EMAIL PROTECTED] writes: ... As you said, however, experimental results shows fetching read-only lines didn't help, which led me wonder whether the second scenario your described was really happening. I don't know --- we haven't tried it. I do intend to work up some patches for these ideas, and then we can test them. I see. I look forward to it. Or, if you happen to be wide awake at the moment, feel free to take a shot at it ... I'm out of steam and won't get to it before tomorrow... I am a newbie here. Please don't let my comments discourage you from doing anything. I very much enjoyed following your discussion and analysis of this problem. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: Comments and testing invited. I have tested the patches on a dual Xeon 2.4 GHz w/ HT (no EM64T). (Configured with CFLAGS='-O2 -mcpu=pentium4 -march=pentium4' --enable-cassert). The results were pretty stable (around 0.2 seconds). I would not trust the numbers for N=2; Linux, at least 2.4, is not good at avoiding scheduling two running processes on two different HTs of the same core. Those values also had the most variance (> 1s). All other measures were quite stable over several runs. CVS tip from 2005-09-12 ~16:00: 1: 57s 2: 82s 4: 124s 8: 237s with only slock-no-cmpb.patch applied: 1: 55s 2: 79s 4: 119s 8: 229s with only spin-delay.patch applied: 1: 56s 2: 79s 4: 124s 8: 235s with both patches applied: 1: 55s 2: 78s 4: 124s 8: 235s compare to 7.4.8 on the same machine ;-) 1: 92s 2: 235s 4: 474s 8: did not try ... It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch does not really help much on this machine. That seems to match Stephen Frost's results with EM64T, if I read them correctly. The cs rate is about 150 on CVS tip without patches and below 100 with the patches (all three cases). With 7.4.8 it's 23-28 with N>1. 8.1 is clearly the winner here. Great work, Tom. I hope some more data helps. Best Regards, Michael Paesold ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Michael Paesold [EMAIL PROTECTED] writes: It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch does not really help much on this machine. That seems to match Stephen Frost's results with EM64T, if I read them correctly. Yeah, it's interesting that you both see slock-no-cmpb as saving some cycles and the second patch as giving them back. I wonder whether the integer-modulo-to-slow-the-loop trick is counterproductive on your machines. You both seem to be using hardware that will recognize rep;nop and maybe that's all that's needed. I probably should have broken down the spindelay patch into multiple components. But it's only a small change --- could you try simplifying the patched line if ((--spins % MAX_SPINS_PER_DELAY) == 0) to if (--spins == 0) and see how the patch does that way? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
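The two loop variants Tom is comparing can be seen in a stripped-down model of the delay loop. This is a toy, not the real s_lock(): the constant names are stand-ins for the patch's MAX_SPINS_PER_DELAY, and the simulated lock release stands in for another process clearing the lock.

```c
/* Toy model of the s_lock() delay loop, to show the line under test. */
#define SPINS_PER_DELAY 100
#define TOTAL_SPINS     1000

static int lock_word  = 1;      /* 1 = held by the (simulated) owner */
static int reads_left = 250;

/* simulate the holder releasing the lock after 250 polls */
static int
lock_is_held(void)
{
    if (--reads_left <= 0)
        lock_word = 0;
    return lock_word;
}

/* returns how many times we would have called pg_usleep()/sched_yield() */
static int
spin_with_modulo(void)
{
    int spins  = TOTAL_SPINS;
    int delays = 0;

    while (lock_is_held())
    {
        /* the patched line: delay once every SPINS_PER_DELAY iterations.
         * The simplification being proposed is just "if (--spins == 0)"
         * with spins reset from SPINS_PER_DELAY after each delay. */
        if ((--spins % SPINS_PER_DELAY) == 0)
            delays++;
    }
    return delays;
}
```

The modulo form keeps one large countdown and takes a delay on every multiple of SPINS_PER_DELAY; the simplified form counts a small counter straight down, which is cheaper per iteration since it avoids the integer divide, which is exactly why removing the modulo might matter on CPUs where rep;nop already throttles the loop.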
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I probably should have broken down the spindelay patch into multiple components. But it's only a small change --- could you try simplifying the patched line if ((--spins % MAX_SPINS_PER_DELAY) == 0) to if (--spins == 0) and see how the patch does that way? I'll do tomorrow morning (CEST, i.e. in about 11 hours). Best Regards, Michael Paesold ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: ... and see how the patch does that way? BTW, please do look at vmstat 1 while running the test case corresponding to your number of processors. It's hard to tell from the runtime alone whether the patch is fully accomplishing its goal of reducing wasted cycles. If you see user CPU percent go to 100 and context swaps drop to a low value, though, you know it's working. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: it can only be a win if you expect a lot of contention for the spin lock, but in percentage terms we still have a very low conflict rate, so in most executions of the TAS macro, the cmpb is just wasted cycles. Bottom line: we definitely don't want it for x86_64, and maybe not at all, but we need more research to decide the latter. I think an important question is whether this is x86_64 in general, or Opteron-specific. It could be that it's not the same on Intel's EM64Ts. One reason this might behave differently for Opterons is that they're cc-NUMA instead of normal SMP. Something else to consider is the OS you're using. I've been told that Linux isn't that good at NUMA and FreeBSD might be better. Kurt ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Kurt Roeckx [EMAIL PROTECTED] writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: I think an important question is whether this is x86_64 in general, or Opteron-specific. It could be that it's not the same on Intel's EM64Ts. Good point --- anyone have one to try? Something else to consider is the OS you're using. I've been told that Linux isn't that good at NUMA and FreeBSD might be better. It's hard to see how the OS could affect behavior at the level of processor cache operations --- unless they did something truly spectacularly stupid, like mark main memory non-cacheable. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: Kurt Roeckx [EMAIL PROTECTED] writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: I think an important question is whether this is x86_64 in general, or Opteron-specific. It could be that it's not the same on Intel's EM64Ts. Good point --- anyone have one to try? I've got one I can test on. I need to upgrade the kernel and some other things on it though (it's running 2.6.8 atm, and an older Debian/unstable which I should probably bring up to current). I'll work on it starting now and post results once I get some. Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: My proposal therefore is to do #2, #3, and #4, and to modify the TAS assembly code at least on x86_64. Together, these changes bring my test case on a 4-way Opteron from N, runtime: 1 36s 2 61s 4 105s 8 198s to N, runtime: 1 35s 2 48s 4 58s 8 105s On em64t (2 proc + 2 HT, 3.4 GHz, 4G, 2.6.12), unpatched: N, runtime: 1 31s 2 47s 4 86s 8 159s and with the patches: N, runtime: 1 32s 2 53s 4 90s 8 169s CPU utilization is definitely higher when running with the patch though. Hope this helps, happy to do additional testing if necessary. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: N, runtime: 1 31s 2 47s 4 86s 8 159s N, runtime: 1 32s 2 53s 4 90s 8 169s Er, which (or both) of the two patches did you apply here? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: Stephen Frost [EMAIL PROTECTED] writes: em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: N, runtime: 1 31s 2 47s 4 86s 8 159s N, runtime: 1 32s 2 53s 4 90s 8 169s Er, which (or both) of the two patches did you apply here? Applied both, sorry that wasn't clear. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: * Tom Lane ([EMAIL PROTECTED]) wrote: Er, which (or both) of the two patches did you apply here? Applied both, sorry that wasn't clear. Thanks. If you've got the time, could you try the two patches separately and see what you get? regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] wrote: My proposal therefore is to do #2, #3, and #4, and to modify the TAS assembly code at least on x86_64. Together, these changes bring my test case on a 4-way Opteron from Some changes are based on tests and heuristics, so could we fold them into the configure script so that the choice is made automatically? Regards, Qingqing ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote:

    Thanks.  If you've got the time, could you try the two patches
    separately and see what you get?

Sure.

CVS HEAD:

  N  runtime
  1  31s
  2  47s
  4  86s
  8  159s

With just slock-no-cmpb.patch:

  N  runtime
  1  32s
  2  39s
  4  82s
  8  167s

With just spin-delay.patch:

  N  runtime
  1  32s
  2  52s
  4  94s
  8  164s

With both:

  N  runtime
  1  32s
  2  53s
  4  90s
  8  169s

Hope that helps,
Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] writes:

    Something else to consider is the OS you're using.  I've been told
    that Linux isn't that good in NUMA and FreeBSD might be better.

    It's hard to see how the OS could affect behavior at the level of
    processor cache operations --- unless they did something truly
    spectacularly stupid, like mark main memory non-cacheable.

Well, it could schedule processes onto processors in ways that force less-than-optimal memory usage patterns.  But maybe you should tell the Altix folk, with their 32-processor, 384GB NUMA machines, what you've been told about Linux not being that good at NUMA.  Really, these kinds of cargo-cult anecdotes are pretty pointless.

-- greg
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Stephen Frost ([EMAIL PROTECTED]) wrote:

    Thanks.  If you've got the time, could you try the two patches
    separately and see what you get?

    Sure. [...]

Just to be clear: this was from a completely default 'make install' using the Debian configure options from 8.0.3 (which aren't particularly interesting, really: nls, integer-datetimes, debug, disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld, enable-thread-safety).  If it'd be useful for the tests I can adjust whichever parameters are appropriate (work_mem, cache_size, etc.).  There's absolutely nothing else going on on the box except for these tests, a few shells, top, etc.

Enjoy,
Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Qingqing Zhou [EMAIL PROTECTED] writes:

    Some changes are based on tests and heuristics, so can we make them
    into the configure script so the choice could be made automatically?

It's a bit premature to propose that, when we don't yet know whether the suggested changes are a win or a loss under any conditions, let alone how to tell the difference automatically.  Right now we are in find-out-the-facts mode ...

regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote:

    I attach two proposed patches: the first removes the cmpb/jne from
    the x86 TAS assembly code, and the second one does the s_lock changes
    enumerated as points #2, #3, #4.  The first one in particular needs
    more testing to see if it hurts performance on any non-Opteron x86
    chips.  (If so, we'd just make it conditional to x86_64.)

2x PIII, 1G, 2G, FreeBSD 6.0Beta4.

8.1beta1 (2005-08-28):

  N  runtime
  1  85s
  2  139s
  4  220s

8.1beta1 (2005-08-28) + patch 1 (s_lock.h only):

  N  runtime
  1  89s
  2  137s
  4  229s

8.1beta1 (2005-08-28) + patch 2:

  N  runtime
  1  84s
  2  108s
  4  214s

Observe the interesting little speed improvement for patch 2 with 2 processes (it seems to be repeatable).  Let me know if you want to see vmstat output for any of these.

regards,
Mark