Re: [PERFORM] futex results with dbt-3
Tom Lane wrote: It could be that I'm all wet and there is no relationship between the cache line thrashing and the seemingly excessive BufMgrLock contention. Is it important? The fix is identical in both cases: per-bucket locks for the hash table and a buffer aging strategy that doesn't need one global lock that must be acquired for every lookup. -- Manfred ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] futex results with dbt-3
Mark Wong wrote: I've heard that simply linking to the pthreads libraries, regardless of whether you're using them or not, creates a significant overhead. Has anyone tried it for kicks? That depends on the OS and the functions that are used. The typical worst case is buffered IO of single characters: the single-threaded implementation just copies the byte and updates the buffer status, while the multi-threaded implementation contains full locking. For most other functions there is no difference at all. -- Manfred
Re: [PERFORM] futex results with dbt-3
Mark Wong wrote: Pretty simple. One to load the database, and one to query it. I'll attach them. I've tested it on my dual-cpu computer: - it works, both cpus run within the postmaster. It seems something in your gentoo setup is broken. - the number of context switches is down slightly, but not significantly: The glibc implementation is more or less identical to the implementation now in lwlock.c: a spinlock that protects a few variables that are used to implement the actual mutex, plus several wait queues: one for spinlock busy, one or two for the actual mutex code. Around 25% of the context switches are from spinlock collisions, the rest are from actual mutex collisions. It might be possible to get rid of the spinlock collisions by writing a special, futex-based semaphore function that only supports exclusive access [like sem_wait/sem_post], but I don't think that it's worth the effort: 75% of the context switches would remain. What's needed is a buffer manager that can do lookups without a global lock. -- Manfred ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] futex results with dbt-3
Tom Lane wrote: Manfred Spraul <[EMAIL PROTECTED]> writes: Tom Lane wrote: The bigger problem here is that the SMP locking bottlenecks we are currently seeing are *hardware* issues (AFAICT anyway). The only way that futexes can offer a performance win is if they have a smarter way of executing the basic atomic-test-and-set sequence than we do; lwlock operations are not a basic atomic-test-and-set sequence. They are spinlock, several nonatomic operations, spin_unlock. Right, and it is the spinlock that is the problem. See discussions a few months back: at least on Intel SMP machines, most of the problem seems to have to do with trading the spinlock's cache line back and forth between CPUs. I'd disagree: cache line bouncing is one problem. If this happens then there is only one solution: the number of changes to that cache line must be reduced. The tools that are used in the Linux kernel are:

- Hashing. An emergency approach if there is no other solution. I think Red Hat used it for the buffer cache in RH AS: instead of one buffer cache, there were lots of smaller buffer caches with individual locks. The cache was chosen based on the file position (probably mixed with some pointers to avoid overloading cache 0).

- For read-heavy loads: sequence locks. A reader reads a counter value and then accesses the data structure. At the end it checks if the counter was modified. If it's still the same value then it can continue, otherwise it must retry. Writers acquire a normal spinlock and then modify the counter value. RCU is the second option, but there are patents - please be careful before using that tool.

- Complete rewrites that avoid the global lock. I think the global buffer cache is now gone, everything is handled per-file. I think there is a global list for buffer replacement, but at the top of the buffer replacement strategy is a simple clock algorithm. That means that simple lookups/accesses just set a (local) referenced bit and don't have to acquire a global lock.
I know that this is the total opposite of ARC, but perhaps it's the only scalable solution. ARC could be used as the second-level strategy. But: according to the descriptions, the problem is a context switch storm, and I don't see how cache line bouncing can cause a context switch storm. What causes the context switch storm? If it's the pg_usleep in s_lock, then my patch should help a lot: with pthread_rwlock locks, this line doesn't exist anymore. -- Manfred ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] futex results with dbt-3
Tom Lane wrote: Manfred Spraul <[EMAIL PROTECTED]> writes: Has anyone tried to replace the whole lwlock implementation with pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is implemented with futexes, i.e. we would get fast lock handling without OS-specific hacks. "At least for Linux" does not strike me as equivalent to "without OS-specific hacks". For me, "at least for Linux" means that I have tested the patch with Linux. I'd expect that the patch works on most recent unices (pthread_rwlock_t is probably mandatory for Unix98 compatibility). You and others on this mailing list have access to other systems - my patch should be seen as a call for testers, not as a proposal for merging. I expect that Linux is not the only OS with fast user space semaphores, and if an OS has such objects, then the pthread_ locking functions are hopefully implemented by using them. IMHO it's better to support the standard function instead of trying to use the native (and OS-specific) fast semaphore functions. The bigger problem here is that the SMP locking bottlenecks we are currently seeing are *hardware* issues (AFAICT anyway). The only way that futexes can offer a performance win is if they have a smarter way of executing the basic atomic-test-and-set sequence than we do; lwlock operations are not a basic atomic-test-and-set sequence. They are spinlock, several nonatomic operations, spin_unlock. -- Manfred
Re: [PERFORM] futex results with dbt-3
Mark Wong wrote: Here are some other details, per Manfred's request: Linux 2.6.8.1 (on a gentoo distro) How complicated are Tom's test scripts? His immediate reply was that I should retest with Fedora, to rule out any gentoo bugs. I have a dual-cpu system with RH FC, I could use it for testing. -- Manfred ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] futex results with dbt-3
Neil wrote: In any case, the "futex patch" uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. Has anyone tried to replace the whole lwlock implementation with pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is implemented with futexes, i.e. we would get fast lock handling without OS-specific hacks. Perhaps other OSes contain user-space pthread locks, too. Attached is an old patch. I tested it on a uniprocessor system a year ago and it didn't make much difference, but perhaps the scalability is better. You'll have to add -lpthread to the library list for linking. Regarding Neil's patch:

! /*
!  * XXX: is there a more efficient way to write this? Perhaps using
!  * decl...?
!  */
! static __inline__ slock_t
! atomic_dec(volatile slock_t *ptr)
! {
!     slock_t prev = -1;
!
!     __asm__ __volatile__(
!         "    lock        \n"
!         "    xadd %0,%1  \n"
!         :"=q"(prev)
!         :"m"(*ptr), "0"(prev)
!         :"memory", "cc");
!
!     return prev;
! }

xadd is not supported by original 80386 CPUs; it was added with the 80486. There is no 80386 instruction that atomically decrements an integer and retrieves the old value. The only options are atomic_dec_test_zero or atomic_dec_test_negative - those can be implemented by looking at the sign/zero flag. Depending on what you want this may be enough. Or make the futex code conditional for > 80386 CPUs.
-- Manfred

--- p7.3.3.orig/src/backend/storage/lmgr/lwlock.c	2002-09-25 22:31:40.0 +0200
+++ postgresql-7.3.3/src/backend/storage/lmgr/lwlock.c	2003-09-06 14:15:01.0 +0200
@@ -26,6 +26,28 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#define USE_PTHREAD_LOCKS
+
+#ifdef USE_PTHREAD_LOCKS
+
+#include <pthread.h>
+#include
+typedef pthread_rwlock_t LWLock;
+
+inline static void
+InitLWLock(LWLock *p)
+{
+	pthread_rwlockattr_t rwattr;
+	int i;
+
+	pthread_rwlockattr_init(&rwattr);
+	pthread_rwlockattr_setpshared(&rwattr, PTHREAD_PROCESS_SHARED);
+	i = pthread_rwlock_init(p, &rwattr);
+	pthread_rwlockattr_destroy(&rwattr);
+	if (i)
+		elog(FATAL, "pthread_rwlock_init failed");
+}
+#else
 
 typedef struct LWLock
 {
@@ -38,6 +60,17 @@
 	/* tail is undefined when head is NULL */
 } LWLock;
 
+inline static void
+InitLWLock(LWLock *lock)
+{
+	SpinLockInit(&lock->mutex);
+	lock->releaseOK = true;
+	lock->exclusive = 0;
+	lock->shared = 0;
+	lock->head = NULL;
+	lock->tail = NULL;
+}
+#endif
 
 /*
  * This points to the array of LWLocks in shared memory. Backends inherit
  * the pointer by fork from the postmaster. LWLockIds are indexes into
@@ -61,7 +94,7 @@
 static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
 
-#ifdef LOCK_DEBUG
+#if defined(LOCK_DEBUG) && !defined(USE_PTHREAD_LOCKS)
 bool Trace_lwlocks = false;
 
 inline static void
@@ -153,12 +186,7 @@
 	 */
 	for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
 	{
-		SpinLockInit(&lock->mutex);
-		lock->releaseOK = true;
-		lock->exclusive = 0;
-		lock->shared = 0;
-		lock->head = NULL;
-		lock->tail = NULL;
+		InitLWLock(lock);
 	}
 
 	/*
@@ -185,7 +213,116 @@
 	return (LWLockId) (LWLockCounter[0]++);
 }
 
+#ifdef USE_PTHREAD_LOCKS
+/*
+ * LWLockAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ */
+void
+LWLockAcquire(LWLockId lockid, LWLockMode mode)
+{
+	int i;
+
+	PRINT_LWDEBUG("LWLockAcquire", lockid, &LWLockArray[lockid]);
+
+	/*
+	 * We can't wait if we haven't got a PGPROC. This should only occur
+	 * during bootstrap or shared memory initialization. Put an Assert
+	 * here to catch unsafe coding practices.
+	 */
+	Assert(!(proc == NULL && IsUnderPostmaster));
+
+	/*
+	 * Lock out cancel/die interrupts until we exit the code section
+	 * protected by the LWLock. This ensures that interrupts will not
+	 * interfere with manipulations of data structures in shared memory.
+	 */
+	HOLD_INTERRUPTS();
+
+	if (mode == LW_EXCLUSIVE) {
+		i = pthread_rwlock_wrlock(&LWLockArray[lockid]);
+	} else {
+		i = pthread_rwlock_rdlock(&LWLockArray[lockid]);
+	}
+	if (i)
+		elog(FATAL, "Unexpected error from pthread_rwlock.");
+
+	/* Add lock to list of locks held by this backend */
+	Assert(num_held_lwlocks < MAX_SIMUL_LWLOCKS);
+	held_lwlocks[num_held_lwlocks++] = lockid;
+}
+
+/*
+ * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, return FALSE with no side-effects.
+ *
+ * If successful, cancel/die interrupts are held off until lock release.
+ */
+bool
Re: [PERFORM] [HACKERS] fsync method checking
[EMAIL PROTECTED] wrote:

Compare file sync methods with one 8k write: (o_dsync unavailable)
open o_sync, write     6.270724
write, fdatasync      13.275225
write, fsync          13.359847

Odd. Which filesystem, which kernel? It seems fdatasync is broken and syncs the inode, too. -- Manfred
Re: [PERFORM] [HACKERS] fsync method checking
Tom Lane wrote: [EMAIL PROTECTED] writes: I could certainly do some testing if you want to see how DBT-2 does. Just tell me what to do. ;) Just do some runs that are identical except for the wal_sync_method setting. Note that this should not have any impact on SELECT performance, only insert/update/delete performance. I've made a test run that compares fsync and fdatasync: the performance was identical: - with fdatasync: http://khack.osdl.org/stp/290607/ - with fsync: http://khack.osdl.org/stp/290483/ I don't understand why. Mark - is there a battery-backed write cache in the raid controller, or something similar that might skew the results? The test generates quite a lot of wal traffic - around 1.5 MB/sec. Perhaps the writes are so large that the added overhead of syncing the inode is not noticeable? Is the pg_xlog directory on a separate drive? Btw, it's possible to request such tests through the web-interface, see http://www.osdl.org/lab_activities/kernel_testing/stp/script_param.html -- Manfred
Re: [PERFORM] [HACKERS] fsync method checking
Bruce Momjian wrote:

write                  0.000360
write & fsync          0.001391
write, close & fsync   0.001308
open o_fsync, write    0.000924

That's 1 millisecond vs. 1.3 milliseconds. Neither value is realistic - I guess the hardware write cache is on and the OS doesn't issue cache flush commands. Realistic values are probably 5 ms vs. 5.3 ms - 6%, not 30%. How large is the syscall latency with BSD/OS 4.3? One advantage of separate write and fsync calls is better performance for the writes that are triggered within AdvanceXLInsertBuffer: I'm not sure how often that's necessary, but it's a write while holding both the WALWriteLock and WALInsertLock. If every write contains an implicit sync, that call would be much more expensive than necessary. -- Manfred