Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-27 Thread Josh Berkus
Dave,

> Are you testing this with Tom's code, you need to do a baseline
> measurement with 10 and then increase it, you will still get lots of cs,
> but it will be less.

No, that was just a test of 1000 straight up.  Tom outlined a method, but I
didn't see any code that would help me find a better level, other than just
trying each +100 increase one at a time.  This would take days of testing
...
-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-27 Thread Dave Cramer
Josh,

I think you can safely increase by orders of magnitude here, instead of
by +100.  My wild-ass guess is that the sweet spot is where the spin time
is approximately the time it takes to consume the resource.  So if
you have a really fast machine, the spin count should be higher.

Also, you have to take your memory bus speed into consideration: with the
pause instruction inserted in the loop, the timing is now dependent on
memory speed.
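
One way to put a number on that guess would be to time the spin loop
itself on the target machine.  The following is only a sketch under
assumptions: ns_per_spin() and spins_for_hold_time() are hypothetical
helpers that do not exist in s_lock.c, and the hold time passed in would
have to come from a separate measurement.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Spin-wait hint; "rep; nop" is the PAUSE instruction on x86. */
static inline void spin_pause(void)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("rep; nop" : : : "memory");
#else
	__asm__ __volatile__("" : : : "memory");	/* compiler barrier only */
#endif
}

/* Hypothetical helper: average nanoseconds per spin iteration. */
static double ns_per_spin(long iters)
{
	struct timespec t0, t1;
	volatile unsigned char lock = 1;	/* stands in for a held lock byte */
	long		i;

	assert(iters > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
	{
		if (lock == 0)			/* never true here; mimics re-reading the lock */
			break;
		spin_pause();
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / (double) iters;
}

/* Hypothetical helper: spins needed to cover an estimated hold time. */
static long spins_for_hold_time(double hold_ns, double ns_per_iter)
{
	long		spins;

	assert(ns_per_iter > 0.0);
	spins = (long) (hold_ns / ns_per_iter);
	return spins < 1 ? 1 : spins;
}
```

Calling spins_for_hold_time(hold_ns, ns_per_spin(1000000)) would give a
starting value for SPINS_PER_DELAY to test from, not a tuned answer.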

But... you need a baseline first.

Dave
-- 
Dave Cramer
519 939 0336
ICQ # 14675561




Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-27 Thread Josh Berkus
Dave,

> But... you need a baseline first.

A baseline on CS?   I have that 

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-26 Thread Josh Berkus
Dave,

> Yeah, I did some more testing myself, and actually get better numbers
> with increasing spins per delay to 1000, but my suspicion is that it is
> highly dependent on finding the right delay for the processor you are
> on.

Well, it certainly didn't help here:

procs                      memory      swap          io     system         cpu
 r  b   swpd     free   buff   cache   si   so    bi    bo   in     cs us sy id wa
 2  0      0 14870744 123872 1129912    0    0     0     0 1027 187341 48 27 26  0
 2  0      0 14869912 123872 1129912    0    0     0    48 1030 126490 65 18 16  0
 2  0      0 14867032 123872 1129912    0    0     0     0 1021 106046 72 16 12  0
 2  0      0 14869912 123872 1129912    0    0     0     0 1025  90256 76 14 10  0
 2  0      0 14870424 123872 1129912    0    0     0     0 1022 135249 63 22 16  0
 2  0      0 14872664 123872 1129912    0    0     0     0 1023     13 63 20 17  0
 1  0      0 14871128 123872 1129912    0    0     0    48 1024 155728 57 22 20  0
 2  0      0 14871128 123872 1129912    0    0     0     0 1028 189655 49 29 22  0
 2  0      0 14871064 123872 1129912    0    0     0     0 1018 190744 48 29 23  0
 2  0      0 14871064 123872 1129912    0    0     0     0 1027 186812 51 26 23  0


-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-26 Thread Dave Cramer
Are you testing this with Tom's code?  You need to do a baseline
measurement with 10 and then increase it.  You will still get lots of cs,
but it will be less.

Dave
-- 
Dave Cramer
519 939 0336
ICQ # 14675561




Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-25 Thread Andrew McMillan
On Thu, 2004-04-22 at 10:37 -0700, Josh Berkus wrote:
> Tom,
>
> > The tricky
> > part is that a slow adaptation rate means we can't have every backend
> > figuring this out for itself --- the right value would have to be
> > maintained globally, and I'm not sure how to do that without adding a
> > lot of overhead.
>
> This may be a moot point, since you've stated that changing the loop timing
> won't solve the problem, but what about making the test part of make?   I
> don't think too many systems are going to change processor architectures once
> in production, and those that do can be told to re-compile.

Sure they do - PostgreSQL is regularly provided as a pre-compiled
distribution.  I haven't compiled PostgreSQL for years, and we have it
running on dozens of machines, some SMP, some not, but most running
Debian Linux.

Even having a compiler _installed_ on one of our client's database
servers would usually be considered against security procedures, and
would get a black mark when the auditors came through.

Regards,
Andrew McMillan
-
Andrew @ Catalyst .Net .NZ  Ltd,  PO Box 11-053,  Manners St,  Wellington
WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201   MOB: +64(21)635-694  OFFICE: +64(4)499-2267
 Planning an election?  Call us!
-




Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Dave Cramer
Yeah, I did some more testing myself, and I actually get better numbers
after increasing SPINS_PER_DELAY to 1000, but my suspicion is that it is
highly dependent on finding the right delay for the processor you are
on.

My hypothesis is that if you spin for approximately as long as, or longer
than, the average time it takes to finish with the shared resource,
then this should reduce cs.

Certainly more ideas are required here.

Dave 
-- 
Dave Cramer
519 939 0336
ICQ # 14675561




Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Dave Cramer
More data

On a dual xeon with HTT enabled:

I tried increasing SPINS_PER_DELAY to 1000 and it works better.

SPINS_PER_DELAY    CS      ID     pgbench

100                250K    59%    230 TPS
1000               125K    55%    228 TPS

This is certainly heading in the right direction, although it looks
like it is highly dependent on the system you are running on.

--dc--   



On Wed, 2004-04-21 at 22:53, Josh Berkus wrote:
> Tom,
>
> > As far as I can tell, this does reduce the rate of semop's
> > significantly, but it does so by bringing the overall processing rate
> > to a crawl :-(.  I see 97% CPU idle time when using this patch.
> > I believe what is happening is that the select() delay in s_lock.c is
> > being hit frequently because the spin loop isn't allowed to run long
> > enough to let the other processor get out of the spinlock.
>
> Also, I tested it on production data, and it reduces the CSes by about 40%.
> An improvement, but not a magic bullet.
-- 
Dave Cramer
519 939 0336
ICQ # 14675561




Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Tom Lane
Dave Cramer [EMAIL PROTECTED] writes:
> My hypothesis is that if you spin approximately the same or more time
> than the average time it takes to get finished with the shared resource
> then this should reduce cs.

The only thing we use spinlocks for nowadays is to protect LWLocks, so
the average time involved is fairly small and stable --- or at least
that was the design intention.  What we seem to be seeing is that on SMP
machines, cache coherency issues cause the TAS step itself to be
expensive and variable.  However, in the experiments I did, strace'ing
showed that actual spin timeouts (manifested by the execution of a
delaying select()) weren't actually that common; the big source of
context switches is semop(), which indicates contention at the LWLock
level rather than the spinlock level.  So while tuning the spinlock
limit count might be a useful thing to do in general, I think it will
have only negligible impact on the particular problems we're discussing
in this thread.

regards, tom lane



Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Bruce Momjian
Josh Berkus wrote:
> Tom,
>
> > Having to recompile to run on single- vs dual-processor machines doesn't
> > seem like it would fly.
>
> Oh, I don't know.  Many applications require compiling for a target
> architecture; SQL Server, for example, won't use a 2nd processor without
> re-installation.   I'm not sure about Oracle.
>
> It certainly wasn't too long ago that Linux gurus were espousing re-compiling
> the kernel for the machine.
>
> And it's not like they would *have* to re-compile to use PostgreSQL after
> adding an additional processor.  Just if they wanted to maximize performance
> benefit.
>
> Also, this is a fairly rare circumstance, I think; to judge by my clients,
> once a database server is in production nobody touches the hardware.

A much simpler solution would be for the postmaster to run a test during
startup.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Josh Berkus
Tom,

> Having to recompile to run on single- vs dual-processor machines doesn't
> seem like it would fly.

Oh, I don't know.  Many applications require compiling for a target 
architecture; SQL Server, for example, won't use a 2nd processor without 
re-installation.   I'm not sure about Oracle.

It certainly wasn't too long ago that Linux gurus were espousing re-compiling
the kernel for the machine.

And it's not like they would *have* to re-compile to use PostgreSQL after
adding an additional processor, only if they wanted to maximize the
performance benefit.

Also, this is a fairly rare circumstance, I think; to judge by my clients, 
once a database server is in production nobody touches the hardware.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco




Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Josh Berkus
Tom,

> The tricky
> part is that a slow adaptation rate means we can't have every backend
> figuring this out for itself --- the right value would have to be
> maintained globally, and I'm not sure how to do that without adding a
> lot of overhead.

This may be a moot point, since you've stated that changing the loop timing 
won't solve the problem, but what about making the test part of make?   I 
don't think too many systems are going to change processor architectures once 
in production, and those that do can be told to re-compile.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Rod Taylor
On Thu, 2004-04-22 at 13:55, Tom Lane wrote:
> Josh Berkus [EMAIL PROTECTED] writes:
> > This may be a moot point, since you've stated that changing the loop timing
> > won't solve the problem, but what about making the test part of make?   I
> > don't think too many systems are going to change processor architectures once
> > in production, and those that do can be told to re-compile.
>
> Having to recompile to run on single- vs dual-processor machines doesn't
> seem like it would fly.

Is it something the postmaster could quickly determine and set a global
during the startup cycle?
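
A startup-time check of that kind might look roughly like the sketch
below.  Everything named here is an assumption for illustration:
spins_per_delay, choose_spins_per_delay(), detect_online_cpus() and
init_spin_tuning() are hypothetical and not PostgreSQL code, and
_SC_NPROCESSORS_ONLN is a common extension (glibc, the BSDs) rather than
something every platform provides.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical global: set once at startup, read by the spin loop. */
static int	spins_per_delay = 100;

/*
 * Policy only: on a uniprocessor the lock holder cannot run while we
 * spin, so spinning is pointless and one attempt is enough.
 */
static int choose_spins_per_delay(long ncpus, int smp_spins)
{
	assert(smp_spins >= 1);
	return (ncpus > 1) ? smp_spins : 1;
}

/* Number of online CPUs, or 1 if the platform cannot tell us. */
static long detect_online_cpus(void)
{
#ifdef _SC_NPROCESSORS_ONLN
	long		n = sysconf(_SC_NPROCESSORS_ONLN);

	return (n < 1) ? 1 : n;
#else
	return 1;				/* assumption: treat unknown as uniprocessor */
#endif
}

/* Would be called once during startup, before any backend is forked. */
static void init_spin_tuning(void)
{
	spins_per_delay = choose_spins_per_delay(detect_online_cpus(), 100);
}
```

This only distinguishes one CPU from several; it does not find the best
spin count for a given SMP machine, which is the harder part.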





Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-21 Thread Dave Cramer
attached.
-- 
Dave Cramer
519 939 0336
ICQ # 14675561
Index: backend/storage/lmgr/s_lock.c
===================================================================
RCS file: /usr/local/cvs/pgsql-server/src/backend/storage/lmgr/s_lock.c,v
retrieving revision 1.16
diff -c -r1.16 s_lock.c
*** backend/storage/lmgr/s_lock.c	8 Aug 2003 21:42:00 -0000	1.16
--- backend/storage/lmgr/s_lock.c	21 Apr 2004 20:27:34 -0000
***************
*** 76,82 ****
  	 * The select() delays are measured in centiseconds (0.01 sec) because 10
  	 * msec is a common resolution limit at the OS level.
  	 */
! #define SPINS_PER_DELAY		100
  #define NUM_DELAYS			1000
  #define MIN_DELAY_CSEC		1
  #define MAX_DELAY_CSEC		100
--- 76,82 ----
  	 * The select() delays are measured in centiseconds (0.01 sec) because 10
  	 * msec is a common resolution limit at the OS level.
  	 */
! #define SPINS_PER_DELAY		10
  #define NUM_DELAYS			1000
  #define MIN_DELAY_CSEC		1
  #define MAX_DELAY_CSEC		100
***************
*** 88,93 ****
--- 88,94 ----
  
  	while (TAS(lock))
  	{
+ 		__asm__ __volatile__ (" rep;nop": : :"memory");
  		if (++spins > SPINS_PER_DELAY)
  		{
  			if (++delays > NUM_DELAYS)
Index: include/storage/s_lock.h
===================================================================
RCS file: /usr/local/cvs/pgsql-server/src/include/storage/s_lock.h,v
retrieving revision 1.115.2.1
diff -c -r1.115.2.1 s_lock.h
*** include/storage/s_lock.h	4 Nov 2003 09:43:56 -0000	1.115.2.1
--- include/storage/s_lock.h	21 Apr 2004 20:26:25 -0000
***************
*** 103,110 ****
  	register slock_t _res = 1;
  
  	__asm__ __volatile__(
! 		"	lock			\n"
  		"	xchgb	%0,%1	\n"
  :		"=q"(_res), "=m"(*lock)
  :		"0"(_res));
  	return (int) _res;
--- 103,113 ----
  	register slock_t _res = 1;
  
  	__asm__ __volatile__(
! 		"   cmpb $0,%1  \n"
! 		"   jne 1f  \n"
! 		"	lock		\n"
  		"	xchgb	%0,%1	\n"
+ 		"   1:\n"
  :		"=q"(_res), "=m"(*lock)
  :		"0"(_res));
  	return (int) _res;



Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-21 Thread Tom Lane
Dave Cramer [EMAIL PROTECTED] writes:
> diff -c -r1.16 s_lock.c
> *** backend/storage/lmgr/s_lock.c	8 Aug 2003 21:42:00 -0000	1.16
> --- backend/storage/lmgr/s_lock.c	21 Apr 2004 20:27:34 -0000
> ***************
> *** 76,82 ****
>   	 * The select() delays are measured in centiseconds (0.01 sec) because 10
>   	 * msec is a common resolution limit at the OS level.
>   	 */
> ! #define SPINS_PER_DELAY		100
>   #define NUM_DELAYS			1000
>   #define MIN_DELAY_CSEC		1
>   #define MAX_DELAY_CSEC		100
> --- 76,82 ----
>   	 * The select() delays are measured in centiseconds (0.01 sec) because 10
>   	 * msec is a common resolution limit at the OS level.
>   	 */
> ! #define SPINS_PER_DELAY		10
>   #define NUM_DELAYS			1000
>   #define MIN_DELAY_CSEC		1
>   #define MAX_DELAY_CSEC		100


As far as I can tell, this does reduce the rate of semop's
significantly, but it does so by bringing the overall processing rate
to a crawl :-(.  I see 97% CPU idle time when using this patch.
I believe what is happening is that the select() delay in s_lock.c is
being hit frequently because the spin loop isn't allowed to run long
enough to let the other processor get out of the spinlock.
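
To make the arithmetic behind that concrete, here is a toy model of the
spin-then-delay loop.  It is a sketch, not the s_lock.c code:
simulate_wait() and wait_result are hypothetical, the per-spin cost and
hold time are made-up inputs, and the delay is fixed at the 10 msec
minimum (MIN_DELAY_CSEC = 1) instead of growing as the real loop's does.

```c
#include <assert.h>

#define DELAY_NS 10000000LL		/* 10 msec, i.e. MIN_DELAY_CSEC = 1 */

typedef struct
{
	long long	waited_ns;		/* total time spent before getting the lock */
	int			delays;			/* number of select()-style sleeps taken */
} wait_result;

/*
 * Model a waiter whose lock holder releases after hold_ns.  Each failed
 * test-and-set costs spin_ns; exceeding spins_per_delay costs a sleep.
 */
static wait_result simulate_wait(long long hold_ns, long long spin_ns,
								 int spins_per_delay)
{
	wait_result r = {0, 0};
	int			spins = 0;

	assert(spin_ns > 0 && spins_per_delay >= 1);
	while (r.waited_ns < hold_ns)		/* lock still held: TAS would fail */
	{
		r.waited_ns += spin_ns;
		if (++spins > spins_per_delay)
		{
			r.waited_ns += DELAY_NS;	/* real code sleeps in select() here */
			r.delays++;
			spins = 0;
		}
	}
	return r;
}
```

With an assumed 500 ns hold time and 10 ns per spin, a limit of 100 gets
the lock after 500 ns with no sleep, while a limit of 10 gives up after
11 spins and waits a full 10 msec for a lock that was about to be freed.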

regards, tom lane



Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-21 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
> For BSDOS it has:
>
> #if (CLIENT_OS == OS_FREEBSD) || (CLIENT_OS == OS_BSDOS) || \
>     (CLIENT_OS == OS_OPENBSD) || (CLIENT_OS == OS_NETBSD)
> { /* comment out if inappropriate for your *bsd - cyp (25/may/1999) */
>   int ncpus; size_t len = sizeof(ncpus);
>   int mib[2]; mib[0] = CTL_HW; mib[1] = HW_NCPU;
>   if (sysctl( &mib[0], 2, &ncpus, &len, NULL, 0 ) == 0)
>   //if (sysctlbyname("hw.ncpu", &ncpus, &len, NULL, 0 ) == 0)
>     cpucount = ncpus;
> }

Multiplied by how many platforms?  Ewww...

I was wondering about some sort of dynamic adaptation, roughly along the
lines of whenever a spin loop successfully gets the lock after
spinning, decrease the allowed loop count by one; whenever we fail to
get the lock after spinning, increase by 100; if the loop count reaches,
say, 1, decide we are on a uniprocessor and irreversibly set it to
1.  As written this would tend to incur a select() delay once per
hundred spinlock acquisitions, which is way too much, but I think we
could make it work with a sufficiently slow adaptation rate.  The tricky
part is that a slow adaptation rate means we can't have every backend
figuring this out for itself --- the right value would have to be
maintained globally, and I'm not sure how to do that without adding a
lot of overhead.
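
The adaptation rule described above could be sketched as follows, for a
single process only.  The names (spin_tuner, spin_tuner_update) are
hypothetical, UNIPROCESSOR_LIMIT is a placeholder value picked for
illustration, and the sketch does nothing about the hard part, which is
keeping one shared value across backends cheaply.

```c
#include <assert.h>
#include <stdbool.h>

#define SPIN_INCREMENT		100
#define UNIPROCESSOR_LIMIT	10000	/* placeholder threshold, an assumption */

typedef struct
{
	int			spins_allowed;	/* current spin limit before delaying */
	bool		uniprocessor;	/* once true, spins_allowed stays at 1 */
} spin_tuner;

static void spin_tuner_init(spin_tuner *t, int initial_spins)
{
	assert(initial_spins >= 1);
	t->spins_allowed = initial_spins;
	t->uniprocessor = false;
}

/*
 * Call after an acquisition that had to spin.  got_lock_by_spinning is
 * true if the lock was obtained within the spin limit, false if the
 * waiter ran out of spins and had to delay.
 */
static void spin_tuner_update(spin_tuner *t, bool got_lock_by_spinning)
{
	if (t->uniprocessor)
		return;					/* decision is irreversible */

	if (got_lock_by_spinning)
	{
		if (t->spins_allowed > 1)
			t->spins_allowed -= 1;
	}
	else
	{
		t->spins_allowed += SPIN_INCREMENT;
		if (t->spins_allowed >= UNIPROCESSOR_LIMIT)
		{
			/* spinning never helps: assume a single CPU */
			t->spins_allowed = 1;
			t->uniprocessor = true;
		}
	}
}
```

A slower adaptation rate would mean a smaller increment relative to the
decrement, which is exactly why per-backend state like this converges
too slowly to be useful on its own.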

regards, tom lane
