Re: [PATCHES] update i386 spinlock for hyperthreading

2003-12-29 Thread Manfred Spraul
Tom Lane wrote:

>> Is there an easy way to find out which LWLock is contended?

> Not from oprofile output, as far as I can think.  I've suspected for
> some time that the BufMgrLock is a major bottleneck, but have no proof.

Mark ran a DBT-2 test run with the attached statistics patch applied: it
collects stats about all lightweight locks and dumps them during
shutdown. The hottest locks are:

Lock                 Acquire     %contention   sleep calls
8 (WALInsertLock)      8679205   0.030410           263934
1 (LockMgrLock)       64089418   0.079783          5113215
5 (SInvalLock)        68396470   0.001298            88812
0 (BufMgrLock)       246307425   0.120293         29629089
The lock numbers are from 7.4, i.e. without the patch that removes 
ShmemIndexLock; I've checked in the assembly output that 8 really is 
WALInsertLock. (%contention is sleep calls divided by acquisitions.)

The scary part from the system perspective is the ~35 million context 
switches that were generated by the BufMgrLock and the LockMgrLock. I 
remember there were patches that tried other algorithms instead of the 
simple LRU for the buffer manager. Has anyone tried to change the 
locking of the buffer manager?

The effect of padding the lightweight locks to a full cacheline appears 
to be negligible: with the padding, there were around 4 million 
performance-monitor hits on the 'lock xchg' instruction. Without it 
(test run 300), there were 4.2 million hits.
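
For reference, a minimal sketch of the padding idea (assuming the
Pentium 4's 128-byte L2 line; this mirrors the int fill[20] in the
patch below, but is not the actual PostgreSQL struct):

	/* Sketch only: give each lock a full (assumed 128-byte) cache line
	 * so two cpus spinning on neighbouring locks never fight over one
	 * line. */
	typedef union LWLockPadded
	{
		struct
		{
			volatile unsigned char mutex;	/* spinlock guarding the fields */
			char		exclusive;	/* # of exclusive holders (0 or 1) */
			int			shared;		/* # of shared holders */
		}			lock;
		char		pad[128];		/* force sizeof == one cache line */
	} LWLockPadded;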

The complete data is at

http://developer.osdl.org/markw/dbt2-pgsql/303/

The db log with the lock stats is at 
http://developer.osdl.org/markw/dbt2-pgsql/303/db/log

(Warning: 6.9 MB)

--
   Manfred
Index: src/backend/storage/lmgr/lwlock.c
===
RCS file: /projects/cvsroot/pgsql-server/src/backend/storage/lmgr/lwlock.c,v
retrieving revision 1.19
diff -u -r1.19 lwlock.c
--- src/backend/storage/lmgr/lwlock.c   20 Dec 2003 17:31:21 -  1.19
+++ src/backend/storage/lmgr/lwlock.c   27 Dec 2003 22:51:36 -
@@ -36,6 +36,11 @@
 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs */
 	/* tail is undefined when head is NULL */
+	unsigned long long stat_acquire_total;
+	unsigned long long stat_acquire_fail;
+	unsigned long long stat_release_total;
+	unsigned long long stat_release_wakeup;
+	int			fill[20];
 } LWLock;
 
 /*
@@ -159,6 +164,10 @@
 	lock->shared = 0;
 	lock->head = NULL;
 	lock->tail = NULL;
+	lock->stat_acquire_total = 0;
+	lock->stat_acquire_fail = 0;
+	lock->stat_release_total = 0;
+	lock->stat_release_wakeup = 0;
 }
 
/*
@@ -245,6 +254,10 @@
 	if (retry)
 		lock->releaseOK = true;
 
+	lock->stat_acquire_total++;
+	if (retry)
+		lock->stat_acquire_fail++;
+
 	/* If I can get the lock, do so quickly. */
 	if (mode == LW_EXCLUSIVE)
 	{
@@ -440,6 +453,7 @@
 		Assert(lock->shared > 0);
 		lock->shared--;
 	}
+	lock->stat_release_total++;
 
 	/*
 	 * See if I need to awaken any waiters.  If I released a non-last
@@ -477,6 +491,8 @@
 		}
 	}
 
+	if (head)
+		lock->stat_release_wakeup++;
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease_NoHoldoff(&lock->mutex);
 
@@ -517,5 +533,19 @@
 		HOLD_INTERRUPTS();	/* match the upcoming RESUME_INTERRUPTS */
 
 		LWLockRelease(held_lwlocks[num_held_lwlocks - 1]);
 	}
 }
+
+void LWLockPrintStats(void);
+void
+LWLockPrintStats(void)
+{
+	int			i;
+	for (i = 0; i < LWLockCounter[0]; i++) {
+		volatile LWLock *lock = LWLockArray + i;
+		elog(LOG, "Lock %d: acquire_total %Ld acquire_fail %Ld release_total %Ld release_wakeup %Ld\n",
+			 i,
+			 lock->stat_acquire_total, lock->stat_acquire_fail,
+			 lock->stat_release_total, lock->stat_release_wakeup);
+	}
+}
Index: src/backend/postmaster/postmaster.c
===
RCS file: /projects/cvsroot/pgsql-server/src/backend/postmaster/postmaster.c,v
retrieving revision 1.353
diff -u -r1.353 postmaster.c
--- src/backend/postmaster/postmaster.c 25 Dec 2003 03:52:51 -  1.353
+++ src/backend/postmaster/postmaster.c 27 Dec 2003 22:51:38 -
@@ -1701,7 +1701,7 @@
 	errno = save_errno;
 }
 
-
+void LWLockPrintStats(void);
 
 /*
  * pmdie -- signal handler for processing various postmaster signals.
@@ -1733,6 +1733,7 @@
 			Shutdown = SmartShutdown;
 			ereport(LOG,
 					(errmsg("received smart

Re: [PATCHES] update i386 spinlock for hyperthreading

2003-12-29 Thread Bruce Momjian
Manfred Spraul wrote:
> The scary part from the system perspective is the ~35 million context
> switches that were generated by the BufMgrLock and the LockMgrLock. I
> remember there were patches that tried other algorithms instead of the
> simple LRU for the buffer manager. Has anyone tried to change the
> locking of the buffer manager?

CVS HEAD already has an Adaptive Replacement Cache (ARC) for buffer
replacement.
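
For readers unfamiliar with ARC, a rough sketch of its bookkeeping
follows (concept only; the type names are made up, this is not the
actual PostgreSQL freelist code):

	/* Concept sketch of ARC: two LRU lists of resident pages plus two
	 * "ghost" lists remembering recently evicted page ids, with an
	 * adaptive split between recency and frequency. */
	typedef struct PageList
	{
		int			len;	/* list length; the nodes are omitted here */
	} PageList;

	typedef struct ArcState
	{
		PageList	T1;		/* pages referenced once recently (recency) */
		PageList	T2;		/* pages referenced twice or more (frequency) */
		PageList	B1;		/* ghosts of pages evicted from T1 */
		PageList	B2;		/* ghosts of pages evicted from T2 */
		int			target_T1;	/* adaptive target for T1's size: a hit
								 * in B1 grows it, a hit in B2 shrinks it */
	} ArcState;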

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PATCHES] update i386 spinlock for hyperthreading

2003-12-27 Thread Manfred Spraul
Tom Lane wrote:

> Manfred Spraul [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> Don't you have to put it in a specific place in the loop to make that
>>> work?  If not, why not?

>> Rep;nop is just a short delay - that's all.

> That view seems to me to be directly contradicted by this statement:

>> The PAUSE instruction provides a hint to the processor that
>> the code sequence is a spin-wait loop. The processor uses this hint to
>> avoid the memory order violation in most situations, which greatly
>> improves processor performance.

> It's not apparent to me how a short delay translates into avoiding a
> memory order violation (possibly some docs on what that means exactly
> might help...).  I suspect strongly that there needs to be some near
> proximity between the PAUSE instruction and the lock-test instruction
> for this to work as advertised.  It would help if Intel were less coy
> about what the instruction really does.

My guess: a Pentium 4 cpu supports something like 250 uops in flight - it 
will have a dozen spinlock loop iterations in its pipeline. When the 
spinlock is released, it must figure out which of the loops should get 
it, and gets lost. My guess is that rep;nop delays the cpu by at least 
100 cpu ticks, and thus the pipeline will be empty before it proceeds. I 
don't have a Pentium 4, and the HP testdrive is down. Someone around who 
could run my test app?

 

>> This instruction does not change the architectural state of the
>> processor (that is, it performs essentially a delaying noop
>> operation).

> This can be rephrased as "we're not telling you what this instruction
> really does, because its interesting effects are below the level of the
> instruction set architecture".  Great.  How are we supposed to know
> how to use it?

There was a w_spinlock.pdf document with reference code. Google still 
finds it, but the links are dead :-(

>> I think a separate function is better than adding it into TAS: if it's
>> part of tas, then it would automatically be included by every
>> SpinLockAcquire call - unnecessary .text bloat.

> Why do you think it's unnecessary?  One thing that I find particularly
> vague in the quoted documentation is the statement that the PAUSE
> instruction is needed to avoid a delay when *exiting* the spin-wait
> loop.  Doesn't this mean that a PAUSE is needed in the success path
> when the first TAS succeeds (i.e., the normal no-contention path)?

IIRC: No.

> If not, why not?  If so, does it go before or after the lock
> instruction?

Neither: somewhere in the failure path.

> Also, if the principal effect is a short delay, do we really need it
> at all considering that our inner loop in s_lock is rather more than
> an xchgb followed by a conditional branch?  There will be time for
> the write queue to drain while we're incrementing and testing our
> spin counter (which I trust is in a register...).

> The reason I'm so full of questions is that I spent some time several
> days ago looking at exactly this issue, and came away with only the
> conclusion that I had to find some more-detailed documentation before
> I could figure out what we should do about the spinlocks for Xeons.

I'll try to find some more docs and post links.

The 2nd thing I would change is to add a nonatomic test in the slow 
path: locked instructions generate lots of bus traffic, and that's a 
waste of resources.
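
A sketch of that nonatomic test (test-and-test-and-set), with
illustrative names - this is not the actual s_lock() code:

	/* Spin with plain reads (no bus-locked instruction) until the lock
	 * looks free, then retry the atomic xchg. */
	static __inline__ int
	tatas_acquire(volatile unsigned char *lock)
	{
		unsigned char res;

		do
		{
			/* nonatomic test: just read until the lock appears free */
			while (*lock != 0)
				__asm__ __volatile__("rep; nop");	/* PAUSE between reads */

			/* atomic test-and-set; only this locks the bus */
			res = 1;
			__asm__ __volatile__("lock; xchgb %0,%1"
								 : "+q"(res), "+m"(*lock)
								 :
								 : "memory");
		} while (res != 0);		/* lost the race: back to reading */
		return 0;
	}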

Another question: regardless of the placement of rep;nop - a 10% speedup 
means that postgres spends far too much time in the spinlock code. 
I've looked at the oprofile dumps, and something like 1.2% of the total 
cpu time is spent in the TAS macro in LWLockAcquire. That's the hottest 
instruction in the whole profile; it eats more cpu cycles than the 
memcpy() calls that transfer data to/from the kernel.
Is there an easy way to find out which LWLock is contended?
--
   Manfred
/*
 * skel.cpp. skeleton for rdtsc benchmarks
 *
 * Copyright (C) 1999, 2001 by Manfred Spraul.
 *	All rights reserved except the rights granted by the GPL.
 *
 * Redistribution of this file is permitted under the terms of the GNU 
 * General Public License (GPL) version 2 or later.
 * $Header: /pub/home/manfred/cvs-tree/timetest/rep_nop.cpp,v 1.1 2001/04/07 19:38:33 manfred Exp $
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>

// disable local interrupts during benchmark
#undef USE_CLI

// define a cache flushing function
#undef CACHE_FLUSH

#ifdef USE_CLI
#include <sys/io.h>
#define CLI	"cli\n\t"
#define STI	"sti\n\t"
#else
#define CLI
#define STI
#define iopl(a)	do { } while(0)
#endif

// Intel recommends that a serializing instruction
// should be called before and after rdtsc.
// CPUID is a serializing instruction.
// .align 128: P4 L2 cache line size
#define read_rdtsc_before(time)		\
	__asm__ __volatile__(		\
		".align 128\n\t"	\
		"xor %%eax,%%eax\n\t"	\
		CLI			\
		"cpuid\n\t"		\
		"rdtsc\n\t"		\
		"mov %%eax,(%0)\n\t"	\
		"mov %%edx,4(%0)\n\t"	\
		"xor %%eax,%%eax\n\t"	\
		"cpuid\n\t"		\
		: /* no output */	\
		: "S"(time)		\
		: "eax", 

Re: [PATCHES] update i386 spinlock for hyperthreading

2003-12-27 Thread ohp
I have a bi-Xeon 2.6G hyperthreaded box if it helps... feel free.

Regards
On Sat, 27 Dec 2003, Manfred Spraul wrote:

> I don't have a Pentium 4, and the HP testdrive is down. Someone around
> who could run my test app?

-- 
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work)
6, Chemin d'Harraud Turrou   +33-5-61-50-97-01 (Fax)
31190 AUTERIVE   +33-6-07-63-80-64 (GSM)
FRANCE  Email: [EMAIL PROTECTED]
--
Make your life a dream, make your dream a reality. (St Exupery)

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PATCHES] update i386 spinlock for hyperthreading

2003-12-27 Thread Manfred Spraul
Tom Lane wrote:

> Anyway, I've committed your patch with some changes.

Thanks.

> BTW, I noticed a lot of concern in the Intel app notes about reserving
> 64 or even 128 bytes for each spinlock to avoid cache line conflicts.
> That seems excessive to me (we use a lot of spinlocks for buffers), but
> perhaps it is worth looking into.

This recommendation is usually ignored in the Linux kernel. A few very 
hot spinlocks have an exclusive cacheline, but most don't.
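
For illustration, giving only the hot locks their own line is a
one-attribute change in C (the 128 below is an assumption matching the
P4 line size; the kernel spells this ____cacheline_aligned):

	/* Sketch: a hot lock aligned to its own (assumed 128-byte) cache
	 * line; ordinary locks stay embedded in their owning struct and may
	 * share a line with neighbouring data. */
	struct hot_lock
	{
		volatile int locked;
	} __attribute__((aligned(128)));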

>> Is there an easy way to find out which LWLock is contended?

> Not from oprofile output, as far as I can think.  I've suspected for
> some time that the BufMgrLock is a major bottleneck, but have no proof.

I'll try to write a patch that dumps the LWLock usage and ask Mark to 
run it.

--
   Manfred
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] update i386 spinlock for hyperthreading

2003-12-26 Thread Tom Lane
Manfred Spraul [EMAIL PROTECTED] writes:
> Intel recommends to add a special pause instruction into spinlock busy
> loops. It's necessary for hyperthreading - without it, the cpu can't
> figure out that a logical thread does no useful work and incorrectly
> awards lots of execution resources to that thread. Additionally, it's
> supposed to reduce the time the cpu needs to recover from the
> (mispredicted) branch after the spinlock was obtained.

Don't you have to put it in a specific place in the loop to make that
work?  If not, why not?  I doubt that rep;nop is magic enough to
recognize the loop that will be generated from s_lock()'s code.

My guess is that it'd be more useful to insert the rep;nop into the
failure branch of the TAS macro and forget about the separate CPU_DELAY
construct.  This would allow you to control where exactly rep;nop
appears relative to the xchgb.
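
A minimal sketch of what that might look like (gcc inline asm;
illustrative only, not the actual TAS macro):

	/* rep;nop placed in the failure branch of a TAS-style test-and-set,
	 * so the PAUSE sits right next to the lock-test instruction. */
	static __inline__ int
	tas_with_pause(volatile unsigned char *lock)
	{
		unsigned char res = 1;

		__asm__ __volatile__("lock; xchgb %0,%1"
							 : "+q"(res), "+m"(*lock)
							 :
							 : "memory");
		if (res != 0)				/* failure branch: lock was held */
			__asm__ __volatile__("rep; nop");
		return res;					/* nonzero = TAS failed */
	}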

> Additionally I've increased the number of loops from 100 to 1000

I think this change is almost certainly counterproductive; for any
platform other than the Xeon, remove "almost".

> + #ifndef HAS_CPU_DELAY
> + #define CPU_DELAY() cpu_delay()
> +
> + static __inline__ void
> + cpu_delay(void)
> + {
> + }
> + #endif

This breaks every non-gcc compiler in the world (or at least all those
that don't recognize __inline__).  If you really want to keep CPU_DELAY,
consider

#ifndef CPU_DELAY
#define CPU_DELAY()
#endif

but as stated above, I'm dubious that the bottom of the s_lock loop
is the place to be adding anything anyway.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PATCHES] update i386 spinlock for hyperthreading

2003-12-26 Thread Manfred Spraul
Tom Lane wrote:

> Manfred Spraul [EMAIL PROTECTED] writes:
>> Intel recommends to add a special pause instruction into spinlock busy
>> loops. It's necessary for hyperthreading - without it, the cpu can't
>> figure out that a logical thread does no useful work and incorrectly
>> awards lots of execution resources to that thread. Additionally, it's
>> supposed to reduce the time the cpu needs to recover from the
>> (mispredicted) branch after the spinlock was obtained.

> Don't you have to put it in a specific place in the loop to make that
> work?  If not, why not?  I doubt that rep;nop is magic enough to
> recognize the loop that will be generated from s_lock()'s code.

Rep;nop is just a short delay - that's all. It means that the cpu 
pipelines have a chance to drain, and that the other thread gets enough 
cpu resources. Below is the full instruction documentation, from the 
latest ia32 doc set from Intel:

Improves the performance of spin-wait loops. When executing a  spin-wait 
loop,  a Pentium 4 or Intel Xeon processor suffers a severe performance 
penalty when exiting the loop because it detects a possible memory order 
violation. The PAUSE instruction provides a hint to the processor that 
the code sequence is a spin-wait loop. The processor uses this hint to 
avoid the memory order violation in most situations, which greatly 
improves processor performance. For this reason, it is recommended that 
a PAUSE instruction be placed in all spin-wait loops. An additional 
function of the PAUSE instruction is to reduce the power consumed by a 
Pentium 4 processor while executing a spin loop. The Pentium 4 processor 
can execute a spin-wait loop extremely quickly, causing the processor to 
consume a lot of power while it waits for the resource it is spinning on 
to become available. Inserting a pause instruction in a spin-wait loop 
greatly reduces the processor's power consumption. This instruction was 
introduced in the Pentium 4 processors, but is backward compatible with 
all IA-32 processors. In earlier IA-32 processors, the PAUSE instruction 
operates like a NOP instruction. The Pentium 4 and Intel Xeon processors 
implement the PAUSE instruction as a pre-defined delay. The delay is 
finite and can be zero for some processors. This instruction does not 
change the architectural state of the processor (that is, it performs 
essentially a delaying noop operation).


I think a separate function is better than adding it into TAS: if it's 
part of tas, then it would automatically be included by every 
SpinLockAcquire call - unnecessary .text bloat. Additionally, there 
might be other busy loops, in addition to TAS, that could use a delay 
function.
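
As a sketch, such a standalone delay function could be as small as this
(gcc inline asm assumed; on pre-Pentium 4 processors rep;nop simply
executes as a NOP):

	/* Separate delay function that any busy loop can call, not just TAS. */
	static __inline__ void
	cpu_delay(void)
	{
		__asm__ __volatile__("rep; nop");
	}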

I'll post a new patch that doesn't rely on __inline__ in the 
i386-independent part.

--
   Manfred
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] update i386 spinlock for hyperthreading

2003-12-26 Thread Tom Lane
Manfred Spraul [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> Don't you have to put it in a specific place in the loop to make that
>> work?  If not, why not?

> Rep;nop is just a short delay - that's all.

That view seems to me to be directly contradicted by this statement:

> The PAUSE instruction provides a hint to the processor that
> the code sequence is a spin-wait loop. The processor uses this hint to
> avoid the memory order violation in most situations, which greatly
> improves processor performance.

It's not apparent to me how a short delay translates into avoiding a
memory order violation (possibly some docs on what that means exactly
might help...).  I suspect strongly that there needs to be some near
proximity between the PAUSE instruction and the lock-test instruction
for this to work as advertised.  It would help if Intel were less coy
about what the instruction really does.

> This instruction does not change the architectural state of the
> processor (that is, it performs essentially a delaying noop
> operation).

This can be rephrased as "we're not telling you what this instruction
really does, because its interesting effects are below the level of the
instruction set architecture".  Great.  How are we supposed to know
how to use it?

> I think a separate function is better than adding it into TAS: if it's
> part of tas, then it would automatically be included by every
> SpinLockAcquire call - unnecessary .text bloat.

Why do you think it's unnecessary?  One thing that I find particularly
vague in the quoted documentation is the statement that the PAUSE
instruction is needed to avoid a delay when *exiting* the spin-wait
loop.  Doesn't this mean that a PAUSE is needed in the success path
when the first TAS succeeds (i.e., the normal no-contention path)?
If not, why not?  If so, does it go before or after the lock
instruction?

Also, if the principal effect is a short delay, do we really need it
at all considering that our inner loop in s_lock is rather more than
an xchgb followed by a conditional branch?  There will be time for
the write queue to drain while we're incrementing and testing our
spin counter (which I trust is in a register...).

The reason I'm so full of questions is that I spent some time several
days ago looking at exactly this issue, and came away with only the
conclusion that I had to find some more-detailed documentation before
I could figure out what we should do about the spinlocks for Xeons.
You have not convinced me that you know more about the issue than I do.
A 10% speedup is nice, but how do we know that that's what we should
expect to get?  Maybe there's a lot more to be won by doing it correctly
(for some value of correctly).

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly