Re: [PERFORM] futex results with dbt-3

2004-10-30 Thread Manfred Spraul
Tom Lane wrote:
It could be that I'm all wet and there is no relationship between the
cache line thrashing and the seemingly excessive BufMgrLock contention.

Is it important? The fix is identical in both cases: per-bucket locks 
for the hash table and a buffer aging strategy that doesn't need one 
global lock that must be acquired for every lookup.
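
For illustration, a minimal sketch of the per-bucket idea (hypothetical
names, not the actual PostgreSQL buffer table): every hash chain gets its
own lock, so two lookups only collide when they hash to the same bucket.

#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 256

typedef struct Entry
{
    int           key;
    void         *value;
    struct Entry *next;
} Entry;

typedef struct
{
    pthread_mutex_t lock;   /* protects only this one chain */
    Entry          *head;
} Bucket;

/* assumed initialized elsewhere with pthread_mutex_init() */
static Bucket table[NBUCKETS];

void *
hash_lookup(int key)
{
    Bucket *b = &table[(unsigned) key % NBUCKETS];
    void   *result = NULL;
    Entry  *e;

    pthread_mutex_lock(&b->lock);   /* per-bucket, not global */
    for (e = b->head; e != NULL; e = e->next)
    {
        if (e->key == key)
        {
            result = e->value;
            break;
        }
    }
    pthread_mutex_unlock(&b->lock);
    return result;
}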

--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Mark Wong wrote:
I've heard that simply linking to the pthreads libraries, regardless of
whether you're using them or not, creates a significant overhead.  Has
anyone tried it for kicks?

That depends on the OS and the functions that are used. The typical
worst case is buffered I/O of single characters: the single-threaded
implementation just copies the byte and updates the buffer status,
while the multi-threaded implementation performs full locking around
each call.

For most other functions there is no difference at all.
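
To illustrate the single-character case (a sketch, standard POSIX
stdio): getc() in a thread-safe libc takes and releases the FILE lock on
every call, while getc_unlocked() inside one flockfile/funlockfile pair
is just the copy-and-update path.

#include <stdio.h>

/* one implicit lock/unlock per character in a thread-safe libc */
long
count_newlines_locked(FILE *f)
{
    long n = 0;
    int  c;

    while ((c = getc(f)) != EOF)
        if (c == '\n')
            n++;
    return n;
}

/* lock the FILE once; each getc_unlocked() just copies a byte and
 * updates the buffer status */
long
count_newlines_unlocked(FILE *f)
{
    long n = 0;
    int  c;

    flockfile(f);
    while ((c = getc_unlocked(f)) != EOF)
        if (c == '\n')
            n++;
    funlockfile(f);
    return n;
}
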
--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Mark Wong wrote:
Pretty simple.  One to load the database, and one to query it.  I'll
attach them.

I've tested it on my dual-CPU computer:
- it works; both CPUs run within the postmaster. It seems something in
your gentoo setup is broken.
- the number of context switches is down slightly, but not
significantly. The glibc implementation is more or less identical to
the current implementation in lwlock.c: a spinlock that protects a few
variables used to implement the actual mutex, plus several wait queues:
one for spinlock busy, one or two for the actual mutex code.

Around 25% of the context switches come from spinlock collisions, the
rest from actual mutex collisions. It might be possible to get rid of
the spinlock collisions by writing a special futex-based semaphore
function that supports only exclusive access [like sem_wait/sem_post]
(a sketch follows below), but I don't think it's worth the effort: 75%
of the context switches would remain.
What's needed is a buffer manager that can do lookups without a global
lock.
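
For the record, here is a rough sketch of that exclusive-only futex
semaphore (assumptions: Linux, C11 atomics; the three-state protocol is
the well-known futex mutex design, and this is untested against
PostgreSQL):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static long
sys_futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* state: 0 = free, 1 = locked, 2 = locked and waiters may exist */
void
futex_sem_wait(atomic_int *s)
{
    int c = 0;

    if (atomic_compare_exchange_strong(s, &c, 1))
        return;                          /* was free: locked, no syscall */
    do
    {
        int e = 1;

        /* if locked but uncontended, mark it contended (1 -> 2) */
        atomic_compare_exchange_strong(s, &e, 2);
        if (e != 0)                      /* still held by someone */
            sys_futex(s, FUTEX_WAIT, 2); /* sleep while the value is 2 */
        c = 0;
    } while (!atomic_compare_exchange_strong(s, &c, 2));
}

void
futex_sem_post(atomic_int *s)
{
    if (atomic_exchange(s, 0) == 2)      /* contended: wake one waiter */
        sys_futex(s, FUTEX_WAKE, 1);
}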

--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Tom Lane wrote:
Manfred Spraul <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

The bigger problem here is that the SMP locking bottlenecks we are
currently seeing are *hardware* issues (AFAICT anyway).  The only way
that futexes can offer a performance win is if they have a smarter way
of executing the basic atomic-test-and-set sequence than we do;

lwlock operations are not a basic atomic test-and-set sequence: they
are spin_lock, several non-atomic operations, spin_unlock.

Right, and it is the spinlock that is the problem.  See discussions a
few months back: at least on Intel SMP machines, most of the problem
seems to have to do with trading the spinlock's cache line back and
forth between CPUs.
I'd disagree: cache line bouncing is one problem, and if it happens
there is only one solution: the number of writes to that cache line
must be reduced. The tools used in the Linux kernel are:
- hashing. An emergency approach when there is no other solution. I
think Red Hat used it for the buffer cache in RH AS: instead of one
buffer cache there were lots of smaller buffer caches with individual
locks; the cache was chosen based on the file position (probably mixed
with some pointer bits to avoid overloading cache 0).
- for read-heavy loads: sequence locks. A reader reads a counter value
and then accesses the data structure; at the end it checks whether the
counter was modified. If it still has the same value the reader can
continue, otherwise it must retry. Writers acquire a normal spinlock
and then modify the counter value (see the sketch after this list).
RCU is the second option, but there are patents - please be careful
before using that tool.
- complete rewrites that avoid the global lock. I think the global
buffer cache is now gone; everything is handled per-file. I think there
is still a global list for buffer replacement, but at the top of the
buffer replacement strategy is a simple clock algorithm. That means
simple lookups/accesses just set a (local) referenced bit and don't
have to acquire a global lock - see the second sketch below. I know
this is the total opposite of ARC, but perhaps it's the only scalable
solution; ARC could be used as the second-level strategy.
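
Here is the promised sequence-lock sketch (C11 atomics for brevity; the
kernel version differs in detail, and the plain data reads would need
fences or atomics to be formally race-free):

#include <pthread.h>
#include <stdatomic.h>

typedef struct
{
    atomic_uint     seq;    /* even: stable, odd: write in progress */
    pthread_mutex_t wlock;  /* serializes writers */
    long            x, y;   /* the protected data */
} SeqLock;

/* Reader: takes no lock; retries until it saw a stable snapshot. */
void
seq_read(SeqLock *sl, long *x, long *y)
{
    unsigned s;

    do
    {
        while ((s = atomic_load(&sl->seq)) & 1)
            ;                       /* writer active: wait */
        *x = sl->x;
        *y = sl->y;
        /* if seq moved, a writer intervened: retry */
    } while (atomic_load(&sl->seq) != s);
}

/* Writer: normal lock, plus two counter bumps around the update. */
void
seq_write(SeqLock *sl, long x, long y)
{
    pthread_mutex_lock(&sl->wlock);
    atomic_fetch_add(&sl->seq, 1);  /* odd: readers start to retry */
    sl->x = x;
    sl->y = y;
    atomic_fetch_add(&sl->seq, 1);  /* even again: readers may pass */
    pthread_mutex_unlock(&sl->wlock);
}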
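
And the clock sketch (made-up names; only meant to show why lookups
stay off the global lock):

#include <stdatomic.h>
#include <stdbool.h>

#define NBUFFERS 1024

typedef struct
{
    atomic_bool referenced;     /* set by lookups, cleared by the sweep */
} ClockBuf;

static ClockBuf buffers[NBUFFERS];
static int      clock_hand;     /* only touched under the replacement lock */

/* Hot path: an access just sets a local bit - no global lock. */
static inline void
buf_touch(ClockBuf *b)
{
    atomic_store(&b->referenced, true);
}

/* Cold path: find a victim. The caller holds the single replacement
 * lock, which is tolerable because lookups never come here. */
int
clock_sweep(void)
{
    for (;;)
    {
        ClockBuf *b = &buffers[clock_hand];

        clock_hand = (clock_hand + 1) % NBUFFERS;
        /* second chance: recently referenced buffers survive one pass */
        if (!atomic_exchange(&b->referenced, false))
            return (int) (b - buffers); /* victim index */
    }
}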

But: according to the descriptions, the problem is a context-switch
storm, and I don't see how cache line bouncing can cause one. What does
cause the context-switch storm? If it's the pg_usleep in s_lock, then
my patch should help a lot: with pthread_rwlock locks, that line
doesn't exist anymore.

--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Tom Lane wrote:
Manfred Spraul <[EMAIL PROTECTED]> writes:

Has anyone tried to replace the whole lwlock implementation with
pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock
is implemented with futexes, i.e. we would get fast lock handling
without OS-specific hacks.

"At least for Linux" does not strike me as equivalent to "without
OS-specific hacks".

For me, "at least for Linux" means that I have tested the patch with
Linux. I'd expect the patch to work on most recent Unices
(pthread_rwlock_t is probably mandatory for Unix98 compatibility). You
and others on this mailing list have access to other systems - my patch
should be seen as a call for testers, not as a proposal for merging. I
expect that Linux is not the only OS with fast user-space semaphores,
and if an OS has such objects, then the pthread_ locking functions are
hopefully implemented using them. IMHO it's better to support the
standard functions than to use the native (and OS-specific) fast
semaphore functions directly.

The bigger problem here is that the SMP locking bottlenecks we are
currently seeing are *hardware* issues (AFAICT anyway).  The only way
that futexes can offer a performance win is if they have a smarter way
of executing the basic atomic-test-and-set sequence than we do;

lwlock operations are not a basic atomic test-and-set sequence: they
are spin_lock, several non-atomic operations, spin_unlock.

--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-25 Thread Manfred Spraul
Mark Wong wrote:
Here are some other details, per Manfred's request:
Linux 2.6.8.1 (on a gentoo distro)

How complicated are Tom's test scripts? His immediate reply was that I 
should retest with Fedora, to rule out any gentoo bugs.

I have a dual-CPU system with RH FC; I could use it for testing.
--
   Manfred


Re: [PERFORM] futex results with dbt-3

2004-10-19 Thread Manfred Spraul
Neil wrote:
. In any case, the "futex patch"
uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. 

Has anyone tried to replace the whole lwlock implementation with
pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock
is implemented with futexes, i.e. we would get fast lock handling
without OS-specific hacks. Perhaps other OSes contain user-space
pthread locks, too.
Attached is an old patch. I tested it on a uniprocessor system a year
ago and it didn't provide much of a difference, but perhaps the
scalability is better. You'll have to add -lpthread to the library list
for linking.

Regarding Neil's patch:
! /*
!  * XXX: is there a more efficient way to write this? Perhaps using
!  * decl...?
!  */
! static __inline__ slock_t
! atomic_dec(volatile slock_t *ptr)
! {
! 	slock_t prev = -1;
! 
! 	__asm__ __volatile__(
! 		"	lock		\n"
! 		"	xadd %0,%1	\n"
! 		:"=q"(prev)
! 		:"m"(*ptr), "0"(prev)
! 		:"memory", "cc");
! 
! 	return prev;
! }

xadd is not supported by the original 80386 CPU; it was added with the
80486. There is no 80386 instruction that atomically decrements an
integer and returns the old value. The only options are
atomic_dec_test_zero or atomic_dec_test_negative - those can be
implemented by looking at the zero/sign flag (a sketch follows below).
Depending on what you want, this may be enough. Or make the futex code
conditional on > 80386 CPUs.
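
A sketch of the zero-flag variant, in the same inline-asm style as
Neil's snippet ("lock; decl" exists on the 80386; the old value is
lost, but the zero flag tells us whether the counter just reached
zero):

static __inline__ int
atomic_dec_test_zero(volatile int *ptr)
{
	unsigned char zero;

	__asm__ __volatile__(
		"	lock; decl %0	\n"
		"	sete %1		\n"
		: "+m"(*ptr), "=q"(zero)
		:
		: "memory", "cc");

	return zero;	/* nonzero iff the counter is now 0 */
}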

--
   Manfred
--- p7.3.3.orig/src/backend/storage/lmgr/lwlock.c	2002-09-25 22:31:40.000000000 +0200
+++ postgresql-7.3.3/src/backend/storage/lmgr/lwlock.c	2003-09-06 14:15:01.000000000 +0200
@@ -26,6 +26,28 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#define USE_PTHREAD_LOCKS
+
+#ifdef USE_PTHREAD_LOCKS
+
+#include <pthread.h>
+#include <errno.h>
+typedef pthread_rwlock_t LWLock;
+
+inline static void
+InitLWLock(LWLock *p)
+{
+   pthread_rwlockattr_t rwattr;
+   int i;
+
+   pthread_rwlockattr_init(&rwattr);
+   pthread_rwlockattr_setpshared(&rwattr, PTHREAD_PROCESS_SHARED);
+   i=pthread_rwlock_init(p, &rwattr);
+   pthread_rwlockattr_destroy(&rwattr);
+   if (i)
+   elog(FATAL, "pthread_rwlock_init failed");
+}
+#else
 
 typedef struct LWLock
 {
@@ -38,6 +60,17 @@
/* tail is undefined when head is NULL */
 } LWLock;
 
+inline static void
+InitLWLock(LWLock *lock)
+{
+   SpinLockInit(&lock->mutex);
+   lock->releaseOK = true;
+   lock->exclusive = 0;
+   lock->shared = 0;
+   lock->head = NULL;
+   lock->tail = NULL;
+}
+#endif
 /*
  * This points to the array of LWLocks in shared memory.  Backends inherit
  * the pointer by fork from the postmaster.  LWLockIds are indexes into
@@ -61,7 +94,7 @@
 static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
 
 
-#ifdef LOCK_DEBUG
+#if defined(LOCK_DEBUG) && !defined(USE_PTHREAD_LOCKS)
 bool   Trace_lwlocks = false;
 
 inline static void
@@ -153,12 +186,7 @@
 */
for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
{
-   SpinLockInit(&lock->mutex);
-   lock->releaseOK = true;
-   lock->exclusive = 0;
-   lock->shared = 0;
-   lock->head = NULL;
-   lock->tail = NULL;
+   InitLWLock(lock);
}
 
/*
@@ -185,7 +213,116 @@
return (LWLockId) (LWLockCounter[0]++);
 }
 
+#ifdef USE_PTHREAD_LOCKS
+/*
+ * LWLockAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ */
+void
+LWLockAcquire(LWLockId lockid, LWLockMode mode)
+{
+   int i;
+   PRINT_LWDEBUG("LWLockAcquire", lockid, &LWLockArray[lockid]);
+
+   /*
+* We can't wait if we haven't got a PGPROC.  This should only occur
+* during bootstrap or shared memory initialization.  Put an Assert
+* here to catch unsafe coding practices.
+*/
+   Assert(!(proc == NULL && IsUnderPostmaster));
+
+   /*
+* Lock out cancel/die interrupts until we exit the code section
+* protected by the LWLock.  This ensures that interrupts will not
+* interfere with manipulations of data structures in shared memory.
+*/
+   HOLD_INTERRUPTS();
+
+   if (mode == LW_EXCLUSIVE) {
+   i = pthread_rwlock_wrlock(&LWLockArray[lockid]);
+   } else {
+   i = pthread_rwlock_rdlock(&LWLockArray[lockid]);
+   }
+   if (i)
+   elog(FATAL, "Unexpected error from pthread_rwlock.");
+
+   /* Add lock to list of locks held by this backend */
+   Assert(num_held_lwlocks < MAX_SIMUL_LWLOCKS);
+   held_lwlocks[num_held_lwlocks++] = lockid;
+}
+
+/*
+ * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, return FALSE with no side-effects.
+ *
+ * If successful, cancel/die interrupts are held off until lock release.
+ */
+bool

Re: [PERFORM] [HACKERS] fsync method checking

2004-03-26 Thread Manfred Spraul
[EMAIL PROTECTED] wrote:

Compare file sync methods with one 8k write:
   (o_dsync unavailable)
   open o_sync, write      6.270724
   write, fdatasync       13.275225
   write, fsync           13.359847

Odd. Which filesystem, which kernel? It seems fdatasync is broken and 
syncs the inode, too.
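
A minimal check (a sketch; assumes Linux and a scratch file on the
filesystem in question): rewriting the same 8k block never changes the
file size, so a working fdatasync should skip the inode write and come
out faster than fsync.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define LOOPS 200

static double
bench(int fd, int (*sync_fn) (int), const char *name)
{
    char           buf[8192];
    struct timeval t0, t1;
    int            i;

    memset(buf, 'x', sizeof(buf));
    gettimeofday(&t0, NULL);
    for (i = 0; i < LOOPS; i++)
    {
        /* rewrite the same block: size and allocation never change */
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
            perror("pwrite");
        if (sync_fn(fd) != 0)
            perror(name);
    }
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int
main(void)
{
    int fd = open("sync_test", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    printf("write, fdatasync  %f\n", bench(fd, fdatasync, "fdatasync"));
    printf("write, fsync      %f\n", bench(fd, fsync, "fsync"));
    close(fd);
    unlink("sync_test");
    return 0;
}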

--
   Manfred


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-26 Thread Manfred Spraul
Tom Lane wrote:

[EMAIL PROTECTED] writes:

I could certainly do some testing if you want to see how DBT-2 does.
Just tell me what to do. ;)

Just do some runs that are identical except for the wal_sync_method
setting.  Note that this should not have any impact on SELECT
performance, only insert/update/delete performance.

I've made a test run that compares fsync and fdatasync: the performance
was identical:
- with fdatasync: http://khack.osdl.org/stp/290607/
- with fsync: http://khack.osdl.org/stp/290483/

I don't understand why. Mark - is there a battery-backed write cache in
the RAID controller, or something similar that might skew the results?
The test generates quite a lot of WAL traffic - around 1.5 MB/sec.
Perhaps the writes are so large that the added overhead of syncing the
inode is not noticeable?
Is the pg_xlog directory on a separate drive?

Btw, it's possible to request such tests through the web-interface, see
http://www.osdl.org/lab_activities/kernel_testing/stp/script_param.html
--
   Manfred


Re: [PERFORM] [HACKERS] fsync method checking

2003-12-16 Thread Manfred Spraul
Bruce Momjian wrote:

	write                  0.000360
	write & fsync          0.001391
	write, close & fsync   0.001308
	open o_fsync, write    0.000924

That's 1 millisecond vs. 1.3 milliseconds. Neither value is realistic -
I guess the hardware write cache is on and the OS doesn't issue cache
flush commands. Realistic values are probably 5 ms vs. 5.3 ms - a 6%
difference, not 30%. How large is the syscall latency with BSD/OS 4.3?

One advantage of separate write and fsync calls is better performance
for the writes that are triggered within AdvanceXLInsertBuffer: I'm not
sure how often that's necessary, but it's a write while holding both
the WALWriteLock and the WALInsertLock. If every write contained an
implicit sync, that call would be much more expensive than necessary
(see the sketch below).
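
To make the trade-off concrete, a sketch of the deferred pattern
(hypothetical file name; not PostgreSQL code): the intermediate writes
stay cheap and the disk is only hit once, at commit.

#include <fcntl.h>
#include <unistd.h>

void
wal_flush_deferred(const char *buf, size_t len)
{
    int fd = open("walfile", O_WRONLY | O_CREAT, 0600);

    /* error handling omitted for brevity */
    write(fd, buf, len);    /* e.g. forced out by AdvanceXLInsertBuffer:
                             * only a copy into the OS cache */
    write(fd, buf, len);    /* more WAL pages: still cheap */
    fsync(fd);              /* one physical flush, paid at commit time */
    close(fd);
}

With O_SYNC on the descriptor, each of those write() calls would have
blocked on the disk individually.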

--
   Manfred