Re: [PERFORM] futex results with dbt-3

2004-10-30 Thread Manfred Spraul
Tom Lane wrote:
It could be that I'm all wet and there is no relationship between the
cache line thrashing and the seemingly excessive BufMgrLock contention.
 

Is it important? The fix is identical in both cases: per-bucket locks 
for the hash table and a buffer aging strategy that doesn't need one 
global lock that must be acquired for every lookup.
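A minimal sketch of the per-bucket locking idea (illustrative C only, not PostgreSQL's buffer-table code; all names are invented): each hash bucket carries its own tiny spinlock, so two lookups contend only when they hash to the same bucket, instead of all serializing on one global lock.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define NBUCKETS 256

struct entry {
    unsigned key;
    void *value;
    struct entry *next;
};

/* One tiny spinlock per hash bucket: lookups that hash to different
 * buckets never touch the same lock (or the same cache line). */
struct bucket {
    atomic_flag lock;
    struct entry *head;
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
    for (int i = 0; i < NBUCKETS; i++)
        atomic_flag_clear(&table[i].lock);
}

static void bucket_lock(struct bucket *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;  /* spin */
}

static void bucket_unlock(struct bucket *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static void *table_lookup(unsigned key)
{
    struct bucket *b = &table[key % NBUCKETS];
    void *result = NULL;

    bucket_lock(b);                     /* only this bucket is locked */
    for (struct entry *e = b->head; e; e = e->next)
        if (e->key == key) { result = e->value; break; }
    bucket_unlock(b);
    return result;
}

static void table_insert(unsigned key, void *value)
{
    struct bucket *b = &table[key % NBUCKETS];
    struct entry *e = malloc(sizeof(*e));

    e->key = key;
    e->value = value;
    bucket_lock(b);
    e->next = b->head;
    b->head = e;
    bucket_unlock(b);
}
```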

--
   Manfred
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Tom Lane wrote:
Manfred Spraul [EMAIL PROTECTED] writes:
 

Has anyone tried to replace the whole lwlock implementation with 
pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is 
implemented with futexes, i.e. we would get a fast lock handling without 
os specific hacks.
   

"At least for Linux" does not strike me as equivalent to "without
OS-specific hacks".
 

For me, "at least for Linux" means that I have tested the patch with 
Linux. I'd expect that the patch works on most recent unices 
(pthread_rwlock_t is probably mandatory for Unix98 compatibility). You 
and others on this mailing list have access to other systems - my patch 
should be seen as a call for testers, not as a proposal for merging. I 
expect that Linux is not the only OS with fast user space semaphores, 
and if an OS has such objects, then the pthread_ locking functions are 
hopefully implemented by using them. IMHO it's better to support the 
standard function instead of trying to use the native (and OS specific) 
fast semaphore functions.

The bigger problem here is that the SMP locking bottlenecks we are
currently seeing are *hardware* issues (AFAICT anyway).  The only way
that futexes can offer a performance win is if they have a smarter way
of executing the basic atomic-test-and-set sequence than we do;
 

lwlocks operations are not a basic atomic-test-and-set sequence. They 
are spinlock, several nonatomic operations, spin_unlock.

--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Mark Wong wrote:
Here are some other details, per Manfred's request:
Linux 2.6.8.1 (on a gentoo distro)
 

How complicated are Tom's test scripts? His immediate reply was that I 
should retest with Fedora, to rule out any gentoo bugs.

I have a dual-cpu system with RH FC, I could use it for testing.
--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Tom Lane wrote:
Manfred Spraul [EMAIL PROTECTED] writes:
 

Tom Lane wrote:
   

The bigger problem here is that the SMP locking bottlenecks we are
currently seeing are *hardware* issues (AFAICT anyway).  The only way
that futexes can offer a performance win is if they have a smarter way
of executing the basic atomic-test-and-set sequence than we do;
 

lwlocks operations are not a basic atomic-test-and-set sequence. They 
are spinlock, several nonatomic operations, spin_unlock.
   

Right, and it is the spinlock that is the problem.  See discussions a
few months back: at least on Intel SMP machines, most of the problem
seems to have to do with trading the spinlock's cache line back and
forth between CPUs.
I'd disagree: cache line bouncing is one problem. If this happens then 
there is only one solution: the number of changes to that cache line 
must be reduced. The tools that are used in the Linux kernel are:
- hashing. An emergency approach if there is no other solution. I think 
Red Hat used it for the buffer cache in RH AS: instead of one buffer 
cache, there were lots of smaller buffer caches with individual locks. 
The cache was chosen based on the file position (probably mixed with 
some pointers to avoid overloading cache 0).
- For read-heavy loads: sequence locks. A reader reads a counter value 
and then accesses the data structure. At the end it checks if the 
counter was modified. If it's still the same value then it can continue, 
otherwise it must retry. Writers acquire a normal spinlock and then 
modify the counter value. RCU is the second option, but there are 
patents - please be careful before using that tool.
- complete rewrites that avoid the global lock. I think the global 
buffer cache is now gone, everything is handled per-file. I think there 
is a global list for buffer replacement, but at the top of the buffer 
replacement strategy is a simple clock algorithm. That means that 
simple lookups/accesses just set a (local) referenced bit and don't have 
to acquire a global lock. I know that this is the total opposite of ARC, 
but perhaps it's the only scalable solution. ARC could be used as the 
second-level strategy.
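The sequence-lock tool from the list above can be sketched as follows (a toy illustration, not the Linux kernel's seqlock_t; the writer is assumed to be serialized externally, e.g. by a spinlock, as in the kernel):

```c
#include <stdatomic.h>

/* A toy sequence lock: the writer bumps a counter before and after the
 * update, so the counter is odd while a write is in progress. A reader
 * samples the counter, copies the data, samples again, and retries if
 * the counter was odd or changed in between. */
typedef struct { long x, y; } point_t;

static atomic_uint seq;        /* odd while a write is in flight */
static point_t shared_pt;

static void write_point(long x, long y)
{
    atomic_fetch_add_explicit(&seq, 1, memory_order_acq_rel); /* -> odd */
    shared_pt.x = x;
    shared_pt.y = y;
    atomic_fetch_add_explicit(&seq, 1, memory_order_acq_rel); /* -> even */
}

static point_t read_point(void)
{
    point_t p;
    unsigned before, after;

    do {
        before = atomic_load_explicit(&seq, memory_order_acquire);
        p = shared_pt;          /* may race with a concurrent writer... */
        after = atomic_load_explicit(&seq, memory_order_acquire);
    } while ((before & 1) || before != after);  /* ...so check and retry */
    return p;
}
```

Readers never write the counter's cache line, which is the whole point for read-heavy loads.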

But: According to the descriptions the problem is a context switch 
storm. I don't see that cache line bouncing can cause a context switch 
storm. What causes the context switch storm? If it's the pg_usleep in 
s_lock, then my patch should help a lot: with pthread_rwlock locks, this 
line doesn't exist anymore.

--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Mark Wong wrote:
Pretty simple.  One to load the database and one to query it.  I'll 
attach them.

 

I've tested it on my dual-cpu computer:
- it works, both cpus run within the postmaster. It seems something in 
your gentoo setup is broken.
- the number of context switches is down slightly, but not 
significantly: the glibc implementation is more or less identical to the 
implementation currently in lwlock.c: a spinlock that protects a few 
variables that are used to implement the actual mutex, plus several wait 
queues: one for spinlock busy, one or two for the actual mutex code.

Around 25% of the context switches are from spinlock collisions, the 
rest are from actual mutex collisions. It might be possible to get rid 
of the spinlock collisions by writing a special, futex based semaphore 
function that only supports exclusive access [like sem_wait/sem_post], 
but I don't think that it's worth the effort: 75% of the context 
switches would remain.
What's needed is a buffer manager that can do lookups without a global lock.

--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Mark Wong wrote:
I've heard that simply linking to the pthreads libraries, regardless of
whether you're using them or not, creates significant overhead.  Has
anyone tried it for kicks?
 

That depends on the OS and the functions that are used. The typical 
worst case is buffered IO of single characters: the single-threaded 
implementation just copies and updates the buffer status, while the 
multi-threaded implementation contains full locking.

For most other functions there is no difference at all.
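The single-character case can be made concrete with POSIX stdio: on a threaded libc, plain getc() locks the FILE around every character, while flockfile() plus getc_unlocked() takes the lock once for the whole loop. A small sketch (standard POSIX calls; shown only to illustrate the locking difference, not as a benchmark):

```c
#include <stdio.h>

/* Counts characters with getc(), which on a threaded libc does a
 * lock/unlock cycle per character. */
static long count_chars_locked(FILE *f)
{
    long n = 0;
    while (getc(f) != EOF)
        n++;
    return n;
}

/* Same result, but flockfile() takes the stream lock once and
 * getc_unlocked() skips the per-character locking entirely. */
static long count_chars_unlocked(FILE *f)
{
    long n = 0;
    flockfile(f);
    while (getc_unlocked(f) != EOF)
        n++;
    funlockfile(f);
    return n;
}
```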
--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Josh Berkus
Manfred,

 How complicated are Tom's test scripts? His immediate reply was that I
 should retest with Fedora, to rule out any gentoo bugs.

We've done some testing on other Linuxes.  Linking in pthreads reduced 
CSes by < 15%, which had no appreciable impact on real performance.

Gavin/Neil's full futex patch was of greater benefit; while it did not 
reduce CSes very much (25%), somehow the real performance benefit was 
greater.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Tom Lane
Manfred Spraul [EMAIL PROTECTED] writes:
 But: According to the descriptions the problem is a context switch 
 storm. I don't see that cache line bouncing can cause a context switch 
 storm. What causes the context switch storm?

As best I can tell, the CS storm arises because the backends get into
some sort of lockstep timing that makes it far more likely than you'd
expect for backend A to try to enter the bufmgr when backend B is already
holding the BufMgrLock.  In the profiles we were looking at back in
April, it seemed that about 10% of the time was spent inside bufmgr
(which is bad enough in itself) but the odds of LWLock collision were
much higher than 10%, leading to many context swaps.

This is not totally surprising given that they are running identical
queries and so are running through loops of the same length, but still
it seems like there must be some effect driving their timing to converge
instead of diverge away from the point of conflict.

What I think (and here is where it's a leap of logic, cause I can't
prove it) is that the excessive time spent passing the spinlock cache
line back and forth is exactly the factor causing that convergence.
Somehow, the delay caused when a processor has to wait to get the cache
line contributes to keeping the backend loops in lockstep.

It is critical to understand that the CS storm is associated with LWLock
contention not spinlock contention: what we saw was a lot of semop()s
not a lot of select()s.

 If it's the pg_usleep in s_lock, then my patch should help a lot: with
 pthread_rwlock locks, this line doesn't exist anymore.

The profiles showed that s_lock() is hardly entered at all, and the
select() delay is reached even more seldom.  So changes in that area
will make exactly zero difference.  This is the surprising and
counterintuitive thing: oprofile clearly shows that very large fractions
of the CPU time are being spent at the initial TAS instructions in
LWLockAcquire and LWLockRelease, and yet those TASes hardly ever fail,
as proven by the fact that oprofile shows s_lock() is seldom entered.
So as far as the Postgres code can tell, there isn't any contention
worth mentioning for the spinlock.  This is indeed the way it was
designed to be, but when so much time is going to the TAS instructions,
you'd think there'd be more software-visible contention for the
spinlock.
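The pattern Tom describes - the TAS itself costing cache-line transfers even when it almost never fails - is why spin loops are often written test-and-test-and-set style: spin on a plain load (the cache line can stay shared between CPUs) and only attempt the bus-locked exchange once the lock was just observed free. A generic C11 sketch, not PostgreSQL's actual s_lock code:

```c
#include <stdatomic.h>

typedef atomic_int tas_lock_t;

static void tas_lock(tas_lock_t *l)
{
    for (;;) {
        /* Plain read: no bus-locked cycle, so the cache line can stay
         * shared between CPUs while we wait. */
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;  /* spin */
        /* The expensive atomic exchange (the "TAS"), attempted only
         * when the lock was just seen free. */
        if (atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
            return;
    }
}

static void tas_unlock(tas_lock_t *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}
```

Even this does not eliminate the transfer of the line to exclusive state on the final exchange, which is consistent with Tom's observation that the cost shows up at the TAS even without software-visible contention.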

It could be that I'm all wet and there is no relationship between the
cache line thrashing and the seemingly excessive BufMgrLock contention.
They are after all occurring at two very different levels of abstraction.
But I think there is some correlation that we just don't understand yet.

regards, tom lane



Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Tom Lane
Manfred Spraul [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 It could be that I'm all wet and there is no relationship between the
 cache line thrashing and the seemingly excessive BufMgrLock contention.
 
 Is it important? The fix is identical in both cases: per-bucket locks 
 for the hash table and a buffer aging strategy that doesn't need one 
 global lock that must be acquired for every lookup.

Reducing BufMgrLock contention is a good idea, but it's not really my
idea of a fix for this issue.  In the absence of a full understanding,
we may be fixing the wrong thing.  It's worth remembering that when we
first hit this issue, I made some simple changes that approximately
halved the number of BufMgrLock acquisitions by joining ReleaseBuffer
and ReadBuffer calls into ReleaseAndReadBuffer in all the places in the
test case's loop.  This made essentially no change in the CS storm
behavior :-(.  So I do not know how much contention we have to get rid
of to get the problem to go away, or even whether this is the right path
to take.

(I am unconvinced that either of those specific suggestions is The Right
Way to break up the bufmgrlock, either, but that's a different thread.)

regards, tom lane



Re: [PERFORM] futex results with dbt-3

2004-10-23 Thread Gaetano Mendola
Josh Berkus wrote:
 Tom,


The bigger problem here is that the SMP locking bottlenecks we are
currently seeing are *hardware* issues (AFAICT anyway).  The only way
that futexes can offer a performance win is if they have a smarter way
of executing the basic atomic-test-and-set sequence than we do;
and if so, we could read their code and adopt that method without having
to buy into any large reorganization of our code.


 Well, initial results from Gavin/Neil's patch seem to indicate that, while
 futexes do not cure the CSStorm bug, they do lessen its effects in terms of
 real performance loss.
I proposed weeks ago to see how the CSStorm is affected by sticking each
backend to one processor (where the process was born) using the
cpu-affinity capability (kernel 2.6). Is this proposal completely out of mind?
Regards
Gaetano Mendola





Re: [PERFORM] futex results with dbt-3

2004-10-23 Thread Tom Lane
Gaetano Mendola [EMAIL PROTECTED] writes:
 I proposed weeks ago to see how the CSStorm is affected by sticking
 each backend to one processor (where the process was born) using the
 cpu-affinity capability (kernel 2.6); is this proposal completely
 out of mind?

That was investigated long ago.  See for instance
http://archives.postgresql.org/pgsql-performance/2004-04/msg00313.php

regards, tom lane



Re: [PERFORM] futex results with dbt-3

2004-10-23 Thread Gaetano Mendola
Tom Lane wrote:
Gaetano Mendola [EMAIL PROTECTED] writes:
I proposed weeks ago to see how the CSStorm is affected by sticking each
backend to one processor (where the process was born) using the
cpu-affinity capability (kernel 2.6); is this proposal completely
out of mind?

That was investigated long ago.  See for instance
http://archives.postgresql.org/pgsql-performance/2004-04/msg00313.php
If I read correctly, this helps with the CSStorm; I guess it could also
help performance. Unfortunately I do not have any kernel 2.6 running
on SMP to give it a try.
Regards
Gaetano Mendola


Re: [PERFORM] futex results with dbt-3

2004-10-23 Thread Gaetano Mendola
Josh Berkus wrote:
| Gaetano,
|
|
|I proposed weeks ago to see how the CSStorm is affected by sticking each
|backend to one processor (where the process was born) using the
|cpu-affinity capability (kernel 2.6), is this proposal completely out of
|mind?
|
|
| I don't see how that would help.   The problem is not backends switching
| processors, it's the BufMgrLock needing to be swapped between processors.
This is not clear to me. What happens if, during a spinlock, a backend is
moved from one processor to another?
Regards
Gaetano Mendola



Re: [PERFORM] futex results with dbt-3

2004-10-21 Thread Mark Wong
On Thu, Oct 21, 2004 at 07:45:53AM +0200, Manfred Spraul wrote:
 Mark Wong wrote:
 
 Here are some other details, per Manfred's request:
 
 Linux 2.6.8.1 (on a gentoo distro)
   
 
 How complicated are Tom's test scripts? His immediate reply was that I 
 should retest with Fedora, to rule out any gentoo bugs.
 
 I have a dual-cpu system with RH FC, I could use it for testing.
 

Pretty simple.  One to load the database and one to query it.  I'll 
attach them.

Mark
drop table test_data;

create table test_data(f1 int);

insert into test_data values (random() * 100);
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;

create index test_index on test_data(f1);

vacuum verbose analyze test_data;
checkpoint;
-- force nestloop indexscan plan
set enable_seqscan to 0;
set enable_mergejoin to 0;
set enable_hashjoin to 0;

explain
select count(*) from test_data a, test_data b, test_data c
where a.f1 = b.f1 and b.f1 = c.f1;

select count(*) from test_data a, test_data b, test_data c
where a.f1 = b.f1 and b.f1 = c.f1;



Re: [PERFORM] futex results with dbt-3

2004-10-20 Thread Mark Wong
On Sun, Oct 17, 2004 at 09:39:33AM +0200, Manfred Spraul wrote:
 Neil wrote:
 
 . In any case, the futex patch
 uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. 
 
 Has anyone tried to replace the whole lwlock implementation with 
 pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is 
 implemented with futexes, i.e. we would get a fast lock handling without 
 os specific hacks. Perhaps other os contain user space pthread locks, too.
 Attached is an old patch. I tested it on a uniprocessor system a year 
 ago and it didn't provide much difference, but perhaps the scalability 
 is better. You'll have to add -lpthread to the library list for linking.

I've heard that simply linking to the pthreads libraries, regardless of
whether you're using them or not, creates significant overhead.  Has
anyone tried it for kicks?

Mark



Re: [PERFORM] futex results with dbt-3

2004-10-20 Thread Tom Lane
Manfred Spraul [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 The bigger problem here is that the SMP locking bottlenecks we are
 currently seeing are *hardware* issues (AFAICT anyway).  The only way
 that futexes can offer a performance win is if they have a smarter way
 of executing the basic atomic-test-and-set sequence than we do;
 
 lwlocks operations are not a basic atomic-test-and-set sequence. They 
 are spinlock, several nonatomic operations, spin_unlock.

Right, and it is the spinlock that is the problem.  See discussions a
few months back: at least on Intel SMP machines, most of the problem
seems to have to do with trading the spinlock's cache line back and
forth between CPUs.  It's difficult to see how a futex is going to avoid
that.

regards, tom lane



Re: [PERFORM] futex results with dbt-3

2004-10-20 Thread Dave Cramer
Forgive my naivete, but do futexes implement some priority algorithm for 
deciding which process gets control? One of the problems, as I understand 
it, is that Linux does (did?) not implement a priority algorithm, so it is 
possible for the context which just gave up control to be the next 
context woken up, which of course is a complete waste of time.

--dc--
Tom Lane wrote:
Manfred Spraul [EMAIL PROTECTED] writes:
 

Tom Lane wrote:
   

The bigger problem here is that the SMP locking bottlenecks we are
currently seeing are *hardware* issues (AFAICT anyway).  The only way
that futexes can offer a performance win is if they have a smarter way
of executing the basic atomic-test-and-set sequence than we do;
 

lwlocks operations are not a basic atomic-test-and-set sequence. They 
are spinlock, several nonatomic operations, spin_unlock.
   

Right, and it is the spinlock that is the problem.  See discussions a
few months back: at least on Intel SMP machines, most of the problem
seems to have to do with trading the spinlock's cache line back and
forth between CPUs.  It's difficult to see how a futex is going to avoid
that.
regards, tom lane
 

--
Dave Cramer
www.postgresintl.com
519 939 0336
ICQ#14675561


Re: [PERFORM] futex results with dbt-3

2004-10-19 Thread Manfred Spraul
Neil wrote:
. In any case, the futex patch
uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. 

Has anyone tried to replace the whole lwlock implementation with 
pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is 
implemented with futexes, i.e. we would get a fast lock handling without 
os specific hacks. Perhaps other os contain user space pthread locks, too.
Attached is an old patch. I tested it on a uniprocessor system a year 
ago and it didn't provide much difference, but perhaps the scalability 
is better. You'll have to add -lpthread to the library list for linking.

Regarding Neil's patch:
! /*
!  * XXX: is there a more efficient way to write this? Perhaps using
!  * decl...?
!  */
! static __inline__ slock_t
! atomic_dec(volatile slock_t *ptr)
! {
! 	slock_t prev = -1;
! 
! 	__asm__ __volatile__(
! 			"lock		\n"
! 			"xadd %0,%1	\n"
! 		: "=q" (prev)
! 		: "m" (*ptr), "0" (prev)
! 		: "memory", "cc");
! 
! 	return prev;
! }

xadd is not supported by original 80386 CPUs; it was added with the 
80486. There is no 80386 instruction that atomically decrements an 
integer and retrieves the old value. The only options are 
atomic_dec_test_zero or atomic_dec_test_negative - these can be 
implemented by looking at the sign/zero flag. Depending on what you want 
this may be enough. Or make the futex code conditional for > 80386 CPUs.
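The two interfaces being contrasted can be sketched with portable C11 atomics (illustrative only: on a real 80386 the first is not implementable at all, and the second would be a lock decl followed by a test of the zero flag rather than a fetch_sub):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* 80486+ contract (what xadd provides): decrement and learn the old
 * value. Compilers typically emit "lock xadd" for this on x86. */
static int atomic_dec_fetch_old(atomic_int *p)
{
    return atomic_fetch_sub(p, 1);      /* value *before* the decrement */
}

/* 80386-compatible contract: we only learn whether the result reached
 * zero. On a real 386 this would be "lock decl" plus reading the zero
 * flag - the old value is never available to the caller. */
static bool atomic_dec_test_zero(atomic_int *p)
{
    return atomic_fetch_sub(p, 1) - 1 == 0;
}
```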

--
   Manfred
--- p7.3.3.orig/src/backend/storage/lmgr/lwlock.c   2002-09-25 22:31:40.0 
+0200
+++ postgresql-7.3.3/src/backend/storage/lmgr/lwlock.c  2003-09-06 14:15:01.0 
+0200
@@ -26,6 +26,28 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#define USE_PTHREAD_LOCKS
+
+#ifdef USE_PTHREAD_LOCKS
+
+#include <pthread.h>
+#include <errno.h>
+typedef pthread_rwlock_t LWLock;
+
+inline static void
+InitLWLock(LWLock *p)
+{
+   pthread_rwlockattr_t rwattr;
+   int i;
+
+   pthread_rwlockattr_init(&rwattr);
+   pthread_rwlockattr_setpshared(&rwattr, PTHREAD_PROCESS_SHARED);
+   i = pthread_rwlock_init(p, &rwattr);
+   pthread_rwlockattr_destroy(&rwattr);
+   if (i)
+   elog(FATAL, "pthread_rwlock_init failed");
+}
+#else
 
 typedef struct LWLock
 {
@@ -38,6 +60,17 @@
/* tail is undefined when head is NULL */
 } LWLock;
 
+inline static void
+InitLWLock(LWLock *lock)
+{
+   SpinLockInit(&lock->mutex);
+   lock->releaseOK = true;
+   lock->exclusive = 0;
+   lock->shared = 0;
+   lock->head = NULL;
+   lock->tail = NULL;
+}
+#endif
 /*
  * This points to the array of LWLocks in shared memory.  Backends inherit
  * the pointer by fork from the postmaster.  LWLockIds are indexes into
@@ -61,7 +94,7 @@
 static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
 
 
-#ifdef LOCK_DEBUG
+#if defined(LOCK_DEBUG) && !defined(USE_PTHREAD_LOCKS)
 bool   Trace_lwlocks = false;
 
 inline static void
@@ -153,12 +186,7 @@
 */
 for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
{
-   SpinLockInit(&lock->mutex);
-   lock->releaseOK = true;
-   lock->exclusive = 0;
-   lock->shared = 0;
-   lock->head = NULL;
-   lock->tail = NULL;
+   InitLWLock(lock);
}
 
/*
@@ -185,7 +213,116 @@
return (LWLockId) (LWLockCounter[0]++);
 }
 
+#ifdef USE_PTHREAD_LOCKS
+/*
+ * LWLockAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ */
+void
+LWLockAcquire(LWLockId lockid, LWLockMode mode)
+{
+   int i;
+   PRINT_LWDEBUG("LWLockAcquire", lockid, &LWLockArray[lockid]);
+
+   /*
+* We can't wait if we haven't got a PGPROC.  This should only occur
+* during bootstrap or shared memory initialization.  Put an Assert
+* here to catch unsafe coding practices.
+*/
+   Assert(!(proc == NULL && IsUnderPostmaster));
+
+   /*
+* Lock out cancel/die interrupts until we exit the code section
+* protected by the LWLock.  This ensures that interrupts will not
+* interfere with manipulations of data structures in shared memory.
+*/
+   HOLD_INTERRUPTS();
+
+   if (mode == LW_EXCLUSIVE) {
+   i = pthread_rwlock_wrlock(&LWLockArray[lockid]);
+   } else {
+   i = pthread_rwlock_rdlock(&LWLockArray[lockid]);
+   }
+   if (i)
+   elog(FATAL, "Unexpected error from pthread_rwlock.");
+
+   /* Add lock to list of locks held by this backend */
+   Assert(num_held_lwlocks < MAX_SIMUL_LWLOCKS);
+   held_lwlocks[num_held_lwlocks++] = lockid;
+}
+
+/*
+ * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, return FALSE with no side-effects.
+ *
+ * If successful, cancel/die interrupts are held off until lock release.
+ */
+bool
+LWLockConditionalAcquire(LWLockId 

Re: [PERFORM] futex results with dbt-3

2004-10-19 Thread Tom Lane
Manfred Spraul [EMAIL PROTECTED] writes:
 Has anyone tried to replace the whole lwlock implementation with 
 pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is 
 implemented with futexes, i.e. we would get a fast lock handling without 
 os specific hacks.

"At least for Linux" does not strike me as equivalent to "without
OS-specific hacks".

The bigger problem here is that the SMP locking bottlenecks we are
currently seeing are *hardware* issues (AFAICT anyway).  The only way
that futexes can offer a performance win is if they have a smarter way
of executing the basic atomic-test-and-set sequence than we do;
and if so, we could read their code and adopt that method without having
to buy into any large reorganization of our code.

regards, tom lane



Re: [PERFORM] futex results with dbt-3

2004-10-19 Thread Josh Berkus
Tom,

 The bigger problem here is that the SMP locking bottlenecks we are
 currently seeing are *hardware* issues (AFAICT anyway). The only way
 that futexes can offer a performance win is if they have a smarter way
 of executing the basic atomic-test-and-set sequence than we do;
 and if so, we could read their code and adopt that method without having
 to buy into any large reorganization of our code.

Well, initial results from Gavin/Neil's patch seem to indicate that, while 
futexes do not cure the CSStorm bug, they do lessen its effects in terms of 
real performance loss.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] futex results with dbt-3

2004-10-19 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 The bigger problem here is that the SMP locking bottlenecks we are
 currently seeing are *hardware* issues (AFAICT anyway).

 Well, initial results from Gavin/Neil's patch seem to indicate that, while 
 futexes do not cure the CSStorm bug, they do lessen its effects in terms of 
 real performance loss.

It would be reasonable to expect that futexes would have a somewhat more
efficient code path in the case where you have to block (mainly because
SysV semaphores have such a heavyweight API, much more complex than we
really need).  However, the code path that is killing us is the one
where you *don't* actually need to block.  If we had a proper fix for
the problem then the context swap storm itself would go away, and
whatever advantage you might be able to measure now for futexes likewise
would go away.

In other words, I'm not real excited about a wholesale replacement of
code in order to speed up a code path that I don't want to be taking
in the first place; especially not if that replacement puts a fence
between me and working on the code path that I do care about.

regards, tom lane



Re: [PERFORM] futex results with dbt-3

2004-10-13 Thread Neil Conway
On Thu, 2004-10-14 at 04:57, Mark Wong wrote: 
 I have some DBT-3 (decision support) results using Gavin's original
 futex patch fix.

I sent an initial description of the futex patch to the mailing lists
last week, but it never appeared (from talking to Marc I believe it
exceeded the size limit on -performance). In any case, the futex patch
uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. The hope
is that using futexes will lead to better performance when there is
contention for spinlocks (e.g. on a busy SMP system). The original patch
was written by Stephen Hemminger at OSDL; Gavin and myself have done a
bunch of additional bugfixing and optimization, as well as added IA64
support.

I've attached a WIP copy of the patch to this email (it supports x86,
x86-64 (untested) and IA64 -- more architectures can be added at
request). I'll post a longer writeup when I submit the patch to
-patches.

 Definitely see some overall throughput performance improvement on the
 tests, about a 15% increase, but no change with respect to the number
 of context switches.

I'm glad to see that there is a performance improvement; in my own
testing on an 8-way P3 system provided by OSDL, I saw a similar
improvement in pgbench performance (50 concurrent clients, 1000
transactions each, scale factor 75; without the patch, TPS/sec was
between 180 and 185, with the patch TPS/sec was between 200 and 215).

As for context switching, there was some earlier speculation that the
patch might improve or even resolve the CS storm issue that some
people have experienced with SMP Xeon P4 systems. I don't think we have
enough evidence to answer this one way or the other at this point.

-Neil

Index: src/backend/storage/lmgr/s_lock.c
===
RCS file: /var/lib/cvs/pgsql/src/backend/storage/lmgr/s_lock.c,v
retrieving revision 1.32
diff -c -r1.32 s_lock.c
*** src/backend/storage/lmgr/s_lock.c	30 Aug 2004 23:47:20 -	1.32
--- src/backend/storage/lmgr/s_lock.c	13 Oct 2004 06:23:26 -
***
*** 15,26 
   */
  #include "postgres.h"
  
  #include <time.h>
- #include <unistd.h>
  
  #include "storage/s_lock.h"
  #include "miscadmin.h"
  
  /*
   * s_lock_stuck() - complain about a stuck spinlock
   */
--- 15,49 
   */
  #include "postgres.h"
  
+ #ifdef S_LOCK_TEST
+ #undef Assert
+ #define Assert(cond) DoAssert(cond, #cond, __FILE__, __LINE__)
+ 
+ #define DoAssert(cond, text, file, line)		\
+ 	if (!(cond))\
+ 	{			\
+ 		printf("ASSERTION FAILED! [%s], file = %s, line = %d\n", \
+ 			   text, file, line);\
+ 		abort();\
+ 	}
+ #endif
+ 
  #include <time.h>
  
  #include "storage/s_lock.h"
  #include "miscadmin.h"
  
+ #ifdef S_LOCK_TEST
+ #define LOCK_TEST_MSG()			\
+ 	do			\
+ 	{			\
+ 		fprintf(stdout, "*");	\
+ 		fflush(stdout);			\
+ 	} while (0);
+ #else
+ #define LOCK_TEST_MSG()
+ #endif
+ 
  /*
   * s_lock_stuck() - complain about a stuck spinlock
   */
***
*** 38,43 
--- 61,131 
  #endif
  }
  
+ #ifdef HAVE_FUTEX
+ /*
+  * futex_lock_contended() is similar to s_lock() for the normal TAS
+  * implementation of spinlocks. When this function is invoked, we have
+  * failed to immediately acquire the spinlock, so we should spin some
+  * number of times attempting to acquire the lock before invoking
+  * sys_futex() to have the kernel wake us up later. val is the
+  * current value of the mutex we saw when we tried to acquire it; it
+  * may have changed since then, of course.
+  */
+ void
+ futex_lock_contended(volatile slock_t *lock, slock_t val,
+ 	 const char *file, int line)
+ {
+ 	int loop_count = 0;
+ 
+ #define MAX_LOCK_WAIT		30
+ #define SPINS_BEFORE_WAIT	100
+ 
+ 	Assert(val != FUTEX_UNLOCKED);
+ 
+ 	if (val == FUTEX_LOCKED_NOWAITER)
+ 		val = atomic_exchange(lock, FUTEX_LOCKED_WAITER);
+ 
+ 	while (val != FUTEX_UNLOCKED)
+ 	{
+ 		static struct timespec delay = { .tv_sec = MAX_LOCK_WAIT,
+ 		 .tv_nsec = 0 };
+ 
+ 		LOCK_TEST_MSG();
+ 
+ #if defined(__i386__) || defined(__x86_64__)
+ 		/* See spin_delay() */
+ 		__asm__ __volatile__(" rep; nop\n");
+ #endif
+ 
+ 		/*
+ 		 * XXX: This code is derived from the Drepper algorithm, which
+ 		 * doesn't spin (why, I'm not sure). We should actually change
+ 		 * the lock status to "locked, with waiters" just before we wait
+ 		 * on the futex, not before we begin looping (that avoids a
+ 		 * system call when the lock is released).
+ 		 */
+ 
+ 		/* XXX: worth using __builtin_expect() here? */
+ 		if (++loop_count >= SPINS_BEFORE_WAIT)
+ 		{
+ 			loop_count = 0;
+ 			if (sys_futex(lock, FUTEX_OP_WAIT,
+ 		  FUTEX_LOCKED_WAITER, &delay))
+ 			{
+ if (errno == ETIMEDOUT)
+ 	s_lock_stuck(lock, file, line);
+ 			}
+ 		}
+ 
+ 		/*
+ 		 * Do a non-locking test before asserting the bus lock.
+ 		 */
+ 		if (*lock == FUTEX_UNLOCKED)
+ 			val = atomic_exchange(lock, FUTEX_LOCKED_WAITER);
+ 	}
+ }
+ 
+ #else
  
  /*
   * s_lock(lock) - platform-independent portion of