Re: SMP performance degradation with sysbench
On Tue, 2007-03-20 at 10:29 +0800, Zhang, Yanmin wrote:
> On Wed, 2007-03-14 at 16:33 -0700, Siddha, Suresh B wrote:
> > On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> > > I would agree that it points to MySQL scalability issues, however the
> > > fact that such large gains come from tcmalloc is still interesting.
> >
> > What glibc version are you, Anton, and others using?
> >
> > Does that version have this fix included?
> >
> >     Dynamically size mmap threshold if the program frees mmapped blocks.
> >
> > http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc
>
> The *ROOT CAUSE* is that the dynamic thresholds don't apply to non-main arenas.
>
> To verify this idea, I created a small patch: when freeing a block, always
> check mp_.trim_threshold, even if the block is not in the main arena. The
> patch is only meant to verify the idea, not to be the final solution.
>
> --- glibc-2.5-20061008T1257_bak/malloc/malloc.c	2006-09-08 00:06:02.0 +0800
> +++ glibc-2.5-20061008T1257/malloc/malloc.c	2007-03-20 07:41:03.0 +0800
> @@ -4607,10 +4607,13 @@ _int_free(mstate av, Void_t* mem)
>    } else {
>      /* Always try heap_trim(), even if the top chunk is not
>         large, because the corresponding heap might go away. */
> +    if ((unsigned long)(chunksize(av->top)) >=
> +        (unsigned long)(mp_.trim_threshold)) {
>      heap_info *heap = heap_for_ptr(top(av));
>
>      assert(heap->ar_ptr == av);
>      heap_trim(heap, mp_.top_pad);
> +    }
>    }
>  }

I sent a new patch to the glibc maintainer but got no response, so I am resending it here.

Glibc uses arenas to decrease malloc/free contention among threads. But an arena shrinks aggressively, and therefore also grows aggressively. When a heap grows, mprotect is called; when a heap shrinks, mmap is called. In the kernel, both mmap and mprotect must hold the write lock of mm->mmap_sem, which introduces new contention. That contention effectively cancels out the benefit of the arenas. Here is a new patch to address this issue.
Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
---
--- glibc-2.5-20061008T1257_bak/malloc/malloc.c	2006-09-08 00:06:02.0 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c	2007-03-30 09:01:18.0 +0800
@@ -4605,12 +4605,13 @@ _int_free(mstate av, Void_t* mem)
       sYSTRIm(mp_.top_pad, av);
 #endif
   } else {
-    /* Always try heap_trim(), even if the top chunk is not
-       large, because the corresponding heap might go away. */
-    heap_info *heap = heap_for_ptr(top(av));
-
-    assert(heap->ar_ptr == av);
-    heap_trim(heap, mp_.top_pad);
+    if ((unsigned long)(chunksize(av->top)) >=
+        (unsigned long)(mp_.trim_threshold)) {
+      heap_info *heap = heap_for_ptr(top(av));
+
+      assert(heap->ar_ptr == av);
+      heap_trim(heap, mp_.top_pad);
+    }
   }
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: SMP performance degradation with sysbench
On Wed, 2007-03-14 at 16:33 -0700, Siddha, Suresh B wrote:
> On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> > I would agree that it points to MySQL scalability issues, however the
> > fact that such large gains come from tcmalloc is still interesting.
>
> What glibc version are you, Anton, and others using?
>
> Does that version have this fix included?
>
>     Dynamically size mmap threshold if the program frees mmapped blocks.
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc

Last week, I reproduced it on RHEL4U3 with glibc 2.3.4-2.19. Today, I installed RHEL5GA and reproduced it again. RHEL5GA uses glibc 2.5-12, which already includes the dynamic mmap threshold patch, so that patch does not resolve the issue. The problem really is related to glibc's multi-threaded malloc/free.

My Paxville machine has 16 logical CPUs (dual-core + HT). I disabled HT by hot-removing the last 8 logical processors.

I captured the scheduling statistics. With sysbench thread=8 (best performance), about 3.4% of context switches are caused by __down_read/__down_write_nested. With thread=10, the percentage becomes 11.83%.

I also captured the thread status with gdb. With thread=10, usually 2 threads are calling mprotect/mmap; with thread=8, no threads are calling mprotect/mmap. Such captures are random samples, but I repeated them many times. I think the increased percentage of context switches related to __down_read/__down_write_nested is caused by mprotect/mmap: both take the vm semaphore, so there is contention on the sema which drags performance down.

strace shows that mysqld often calls mprotect/mmap with the same data length, 61440. That is further evidence. gdb showed such mprotect calls come from init_io_malloc=>my_malloc=>malloc=>_int_malloc=>mprotect, and the mmap calls from _int_free=>mmap.

I checked the glibc sources and found the real call chains are malloc=>_int_malloc=>grow_heap=>mprotect and _int_free=>heap_trim=>mmap.

I guess the transaction processing of mysql/sysbench goes like this: mysql accepts a connection and initializes a block for it; after processing a couple of transactions, sysbench closes the connection; then the procedure restarts.

So why are there so many mprotect/mmap calls? Glibc uses arenas to speed up malloc/free in multi-threaded environments, but mp_.trim_threshold only controls the main arena. In function _int_free, FASTBIN_CONSOLIDATION_THRESHOLD might be helpful, but it is a fixed value. The *ROOT CAUSE* is that the dynamic thresholds don't apply to non-main arenas.

To verify this idea, I created a small patch: when freeing a block, always check mp_.trim_threshold, even if the block is not in the main arena. The patch is only meant to verify the idea, not to be the final solution.

--- glibc-2.5-20061008T1257_bak/malloc/malloc.c	2006-09-08 00:06:02.0 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c	2007-03-20 07:41:03.0 +0800
@@ -4607,10 +4607,13 @@ _int_free(mstate av, Void_t* mem)
   } else {
     /* Always try heap_trim(), even if the top chunk is not
        large, because the corresponding heap might go away. */
+    if ((unsigned long)(chunksize(av->top)) >=
+        (unsigned long)(mp_.trim_threshold)) {
     heap_info *heap = heap_for_ptr(top(av));

     assert(heap->ar_ptr == av);
     heap_trim(heap, mp_.top_pad);
+    }
   }
 }

With the patch, I recompiled glibc and reran sysbench/mysql. The result is good: when the thread number is larger than 8, the tps and response time (avg) stay smooth and don't drop severely.

Could anyone test it on an AMD machine?

Yanmin
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> I would agree that it points to MySQL scalability issues, however the
> fact that such large gains come from tcmalloc is still interesting.

What glibc version are you, Anton, and others using?

Does that version have this fix included?

    Dynamically size mmap threshold if the program frees mmapped blocks.

http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc

thanks,
suresh
Re: SMP performance degradation with sysbench
On 3/13/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> Nish Aravamudan wrote:
> > On 3/12/07, Anton Blanchard <[EMAIL PROTECTED]> wrote:
> >> Hi Nick,
> >>
> >> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> >> > at this, send me a mail (eg. especially with the sched_setscheduler issue,
> >> > you might be able to do something better).
> >>
> >> I took a look at this today and figured I'd document it:
> >>
> >> http://ozlabs.org/~anton/linux/sysbench/
> >>
> >> Bottom line: it looks like issues in the glibc malloc library; replacing
> >> it with the google malloc library fixes the negative scaling:
> >>
> >> # apt-get install libgoogle-perftools0
> >> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld
> >
> > Quick datapoint, still collecting data and trying to verify it's
> > always the case: on my 8-way Xeon, I'm actually seeing *much* worse
> > performance with libtcmalloc.so compared to mainline. Am generating
> > graphs and such still, but maybe someone else with x86_64 hardware
> > could try the google PRELOAD and see if it helps/hurts (to rule out
> > tester stupidity)?
>
> I wish I had an 8-way test platform :)
>
> Anyway, could you post some oprofile results?

Hopefully soon -- want to still make sure I'm not doing something dumb. Am also hoping to get some of the gdb backtraces like Anton had.

Thanks,
Nish
Re: SMP performance degradation with sysbench
Nish Aravamudan wrote:
> On 3/12/07, Anton Blanchard <[EMAIL PROTECTED]> wrote:
>> Hi Nick,
>>
>> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
>> > at this, send me a mail (eg. especially with the sched_setscheduler issue,
>> > you might be able to do something better).
>>
>> I took a look at this today and figured I'd document it:
>>
>> http://ozlabs.org/~anton/linux/sysbench/
>>
>> Bottom line: it looks like issues in the glibc malloc library; replacing
>> it with the google malloc library fixes the negative scaling:
>>
>> # apt-get install libgoogle-perftools0
>> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld
>
> Quick datapoint, still collecting data and trying to verify it's
> always the case: on my 8-way Xeon, I'm actually seeing *much* worse
> performance with libtcmalloc.so compared to mainline. Am generating
> graphs and such still, but maybe someone else with x86_64 hardware
> could try the google PRELOAD and see if it helps/hurts (to rule out
> tester stupidity)?

I wish I had an 8-way test platform :)

Anyway, could you post some oprofile results?
Re: SMP performance degradation with sysbench
On 3/12/07, Anton Blanchard <[EMAIL PROTECTED]> wrote:
> Hi Nick,
>
> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> > at this, send me a mail (eg. especially with the sched_setscheduler issue,
> > you might be able to do something better).
>
> I took a look at this today and figured I'd document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library; replacing
> it with the google malloc library fixes the negative scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Quick datapoint, still collecting data and trying to verify it's always the case: on my 8-way Xeon, I'm actually seeing *much* worse performance with libtcmalloc.so compared to mainline. Am generating graphs and such still, but maybe someone else with x86_64 hardware could try the google PRELOAD and see if it helps/hurts (to rule out tester stupidity)?

Thanks,
Nish
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 01:02:44PM +0100, Eric Dumazet wrote:
> On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:
> > My wild guess is that they're allocating memory after taking
> > futexes. If they do, something like this will happen:
> >
> > taskA                taskB                taskC
> > user lock
> >                      mmap_sem lock
> > mmap_sem -> schedule
> >                                           user lock -> schedule
> >
> > If taskB weren't there triggering more random thrashing over the
> > mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.
> >
> > I suspect the real fix is not to allocate memory or to run other
> > expensive syscalls that can block inside the futex critical sections...
>
> glibc malloc uses arenas, and trylock() only. It should not block, because
> if an arena is already locked, the thread automatically chooses another
> arena, and might create a new one if necessary.

Well, only allocation uses trylock; free uses a normal lock. glibc malloc will by default use the same arena for all threads; only when it sees contention during allocation does it give different threads different arenas. So, e.g., if mysql did all allocations while holding some global heap lock (so glibc wouldn't see any contention on allocation), but freeing were done outside of the application's critical section, you would see contention on the main arena's lock in the free path.

Calling malloc_stats(); from e.g. an atexit handler could give interesting details, especially if you recompile glibc malloc with -DTHREAD_STATS=1.

Jakub
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
> > They'll be sleeping in futex_wait in the kernel, I think. One thread
> > will hold the critical mutex, some will be off doing their own thing,
> > but importantly there will be many sleeping for the mutex to become
> > available.
>
> The initial assumption was that there was zero idle time with threads =
> cpus, and the idle time showed up only when the number of threads
> increased to double the number of cpus. If the idle time didn't increase
> with the number of threads, nothing would be suspect.

Well, I think more threads ~= more probability that this guy is going to be preempted while holding the mutex? This might be why FreeBSD works much better, because it looks like MySQL actually will set RT scheduling for those processes that take critical resources.

> > However, I tested with a bigger system and actually the idle time
> > comes before we saturate all CPUs. Also, increasing the aggressiveness
> > of the load balancer did not drop idle time at all, so it is not a case
> > of some runqueues idle while others have many threads on them.
>
> It'd be interesting to see the sysrq+t after the idle time increased.
>
> > I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> > glibc allocator. But I wonder if there are other improvements that glibc
> > can do here?
>
> My wild guess is that they're allocating memory after taking
> futexes. If they do, something like this will happen:
>
> taskA                taskB                taskC
> user lock
>                      mmap_sem lock
> mmap_sem -> schedule
>                                           user lock -> schedule
>
> If taskB weren't there triggering more random thrashing over the
> mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.
>
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical sections...

I would agree that it points to MySQL scalability issues, however the fact that such large gains come from tcmalloc is still interesting.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Re: SMP performance degradation with sysbench
On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:
> My wild guess is that they're allocating memory after taking
> futexes. If they do, something like this will happen:
>
> taskA                taskB                taskC
> user lock
>                      mmap_sem lock
> mmap_sem -> schedule
>                                           user lock -> schedule
>
> If taskB weren't there triggering more random thrashing over the
> mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.
>
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical sections...

glibc malloc uses arenas, and trylock() only. It should not block, because if an arena is already locked, the thread automatically chooses another arena, and might create a new one if necessary.

But yes, mmap_sem contention is a big problem, because it's also taken by the futex code (unfortunately).
Re: SMP performance degradation with sysbench
Eric Dumazet wrote:
> On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
> > I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> > glibc allocator. But I wonder if there are other improvements that glibc
> > can do here?
>
> I cooked a patch some time ago to speed up threaded apps and got no feedback.

Well, that doesn't help in this case. I tested, and the mmap_sem contention is not an issue.

> http://lkml.org/lkml/2006/8/9/26
>
> Maybe we have to wait for 32-core cpus before thinking of cache line
> bouncing...

The idea is a good one, and I was half way through implementing something similar myself at one point (some java apps hit this badly). It is just horribly sad that futexes are supposed to implement a _scalable_ thread synchronisation mechanism, whilst fundamentally relying on an mm-wide lock to operate.

I don't like your interface, but then again, the futex interface isn't exactly pretty anyway. You should resubmit the patch, and get the glibc guys to use it.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
> They'll be sleeping in futex_wait in the kernel, I think. One thread
> will hold the critical mutex, some will be off doing their own thing,
> but importantly there will be many sleeping for the mutex to become
> available.

The initial assumption was that there was zero idle time with threads = cpus, and the idle time showed up only when the number of threads increased to double the number of cpus. If the idle time didn't increase with the number of threads, nothing would be suspect.

> However, I tested with a bigger system and actually the idle time
> comes before we saturate all CPUs. Also, increasing the aggressiveness
> of the load balancer did not drop idle time at all, so it is not a case
> of some runqueues idle while others have many threads on them.

It'd be interesting to see the sysrq+t after the idle time increased.

> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?

My wild guess is that they're allocating memory after taking futexes. If they do, something like this will happen:

    taskA                taskB                taskC
    user lock
                         mmap_sem lock
    mmap_sem -> schedule
                                              user lock -> schedule

If taskB weren't there triggering more random thrashing over the mmap_sem, the lock holder wouldn't wait, and taskC wouldn't wait either.

I suspect the real fix is not to allocate memory or to run other expensive syscalls that can block inside the futex critical sections...
Re: SMP performance degradation with sysbench
On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?

I cooked a patch some time ago to speed up threaded apps and got no feedback.

http://lkml.org/lkml/2006/8/9/26

Maybe we have to wait for 32-core cpus before thinking of cache line bouncing...
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:
> > Well it wasn't iowait time. From Anton's analysis, I would probably
> > say it was time waiting for either the glibc malloc mutex or the MySQL
> > heap mutex.
>
> So it again makes little sense to me that this is idle time, unless some
> userland mutex has a usleep in the slow path, which would be very wrong;
> in the worst case they should yield() (yield can still waste lots of cpu
> if two tasks in the slow path call it while the holder is not scheduled,
> but at least it wouldn't be idle time).

They'll be sleeping in futex_wait in the kernel, I think. One thread will hold the critical mutex, some will be off doing their own thing, but importantly there will be many sleeping for the mutex to become available.

> Idle time is suspicious for a kernel issue in the scheduler or some
> userland inefficiency (the latter sounds more likely).

That is what I first suspected, because the dropoff appeared to happen exactly after we saturated the CPU count: it seems like a scheduler artifact. However, I tested with a bigger system and actually the idle time comes before we saturate all CPUs. Also, increasing the aggressiveness of the load balancer did not drop idle time at all, so it is not a case of some runqueues being idle while others have many threads on them.

I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose glibc allocator. But I wonder if there are other improvements that glibc can do here?

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:
> Well it wasn't iowait time. From Anton's analysis, I would probably
> say it was time waiting for either the glibc malloc mutex or the MySQL
> heap mutex.

So it again makes little sense to me that this is idle time, unless some userland mutex has a usleep in the slow path, which would be very wrong; in the worst case they should yield() (yield can still waste lots of cpu if two tasks in the slow path call it while the holder is not scheduled, but at least it wouldn't be idle time).

Idle time is suspicious for a kernel issue in the scheduler or some userland inefficiency (the latter sounds more likely).
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote:
> > Well ignoring the HT issue, I was seeing lots of idle time simply
> > because userspace could not keep up enough load to the scheduler.
> > There simply were fewer runnable tasks than CPU cores.
>
> When you said idle, I thought idle and not waiting for I/O. Waiting for
> I/O would hardly be a kernel issue ;). If they're not waiting for I/O
> and they're not scheduling in userland with nanosleep/pause, the cpu
> shouldn't go idle. Even if they're calling sched_yield in a loop, the
> cpu should account for zero idle time as far as I can tell.

Well, it wasn't iowait time. From Anton's analysis, I would probably say it was time waiting for either the glibc malloc mutex or the MySQL heap mutex.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote:
> Well ignoring the HT issue, I was seeing lots of idle time simply
> because userspace could not keep up enough load to the scheduler.
> There simply were fewer runnable tasks than CPU cores.

When you said idle I thought idle and not waiting for I/O. Waiting for I/O would be hardly a kernel issue ;). If they're not waiting for I/O and they're not scheduling in userland with nanosleep/pause, the cpu shouldn't go idle. Even if they're calling sched_yield in a loop, the cpu should account for zero idle time as far as I can tell.
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote:
>> Hi Anton,
>>
>> Very cool. Yeah I had come to the conclusion that it wasn't a kernel
>> issue, and basically was afraid to look into userspace ;)
>
> btw, regardless of what glibc is doing, still the cpu shouldn't go
> idle IMHO. Even if we're overscheduling and thrashing over the
> mmap_sem with threads (no idea if other OSes schedule the task away
> when they find the other cpu in the mmap critical section), or if
> we're overscheduling with futex locking, the cpu usage should remain
> 100% system time in the worst case. The only explanation for going
> idle legitimately could be on HT cpus, where HT may hurt more than
> help, but on real multicore it shouldn't happen.

Well ignoring the HT issue, I was seeing lots of idle time simply because userspace could not keep up enough load to the scheduler. There simply were fewer runnable tasks than CPU cores. But it wasn't a case of all CPUs going idle, just most of them ;)

-- SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote:
> Hi Anton,
>
> Very cool. Yeah I had come to the conclusion that it wasn't a kernel
> issue, and basically was afraid to look into userspace ;)

btw, regardless of what glibc is doing, still the cpu shouldn't go idle IMHO. Even if we're overscheduling and thrashing over the mmap_sem with threads (no idea if other OSes schedule the task away when they find the other cpu in the mmap critical section), or if we're overscheduling with futex locking, the cpu usage should remain 100% system time in the worst case. The only explanation for going idle legitimately could be on HT cpus, where HT may hurt more than help, but on real multicore it shouldn't happen.
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:
>> Well it wasn't iowait time. From Anton's analysis, I would probably
>> say it was time waiting for either the glibc malloc mutex or MySQL
>> heap mutex.
>
> So it again makes little sense to me that this is idle time, unless
> some userland mutex has a usleep in the slow path which would be very
> wrong, in the worst case they should yield() (yield can still waste
> lots of cpu if two tasks in the slow paths calls it while the holder
> is not scheduled, but at least it wouldn't be idle time).

They'll be sleeping in futex_wait in the kernel, I think. One thread will hold the critical mutex, some will be off doing their own thing, but importantly there will be many sleeping for the mutex to become available.

> Idle time is suspicious for a kernel issue in the scheduler or some
> userland inefficiency (the latter sounds more likely).

That is what I first suspected, because the dropoff appeared to happen exactly after we saturated the CPU count: it seems like a scheduler artifact. However, I tested with a bigger system and actually the idle time comes before we saturate all CPUs. Also, increasing the aggressiveness of the load balancer did not drop idle time at all, so it is not a case of some runqueues sitting idle while others have many threads on them.

I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose glibc allocator. But I wonder if there are other improvements that glibc can do here?

-- SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that
> glibc can do here?

I cooked a patch some time ago to speed up threaded apps and got no feedback.

http://lkml.org/lkml/2006/8/9/26

Maybe we have to wait for 32-core cpus before thinking of cache line bouncing...
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
> They'll be sleeping in futex_wait in the kernel, I think. One thread
> will hold the critical mutex, some will be off doing their own thing,
> but importantly there will be many sleeping for the mutex to become
> available.

The initial assumption was that there was zero idle time with threads = cpus, and the idle time showed up only when the number of threads increased to double the number of cpus. If the idle time wouldn't increase with the number of threads, nothing would be suspect.

> However, I tested with a bigger system and actually the idle time
> comes before we saturate all CPUs. Also, increasing the aggressiveness
> of the load balancer did not drop idle time at all, so it is not a
> case of some runqueues idle while others have many threads on them.

It'd be interesting to see the sysrq+t after the idle time increased.

> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that
> glibc can do here?

My wild guess is that they're allocating memory after taking futexes. If they do, something like this will happen:

    taskA                  taskB            taskC
    user lock
                           mmap_sem lock
    mmap sem - schedule
                                            user lock - schedule

If taskB wouldn't be there triggering more random thrashing over the mmap_sem, the lock holder wouldn't wait, and taskC wouldn't wait either. I suspect the real fix is not to allocate memory or to run other expensive syscalls that can block inside the futex critical sections...
Re: SMP performance degradation with sysbench
Eric Dumazet wrote:
> On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
>> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
>> glibc allocator. But I wonder if there are other improvements that
>> glibc can do here?
>
> I cooked a patch some time ago to speedup threaded apps and got no
> feedback.

Well that doesn't help in this case. I tested, and the mmap_sem contention is not an issue.

> http://lkml.org/lkml/2006/8/9/26
>
> Maybe we have to wait for 32 core cpu before thinking of cache line
> bouncings...

The idea is a good one, and I was half way through implementing something similar myself at one point (some Java apps hit this badly). It is just horribly sad that futexes are supposed to implement a _scalable_ thread synchronisation mechanism, whilst fundamentally relying on an mm-wide lock to operate.

I don't like your interface, but then again, the futex interface isn't exactly pretty anyway. You should resubmit the patch and get the glibc guys to use it.

-- SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:
> My wild guess is that they're allocating memory after taking futexes.
> If they do, something like this will happen:
>
>     taskA                  taskB            taskC
>     user lock
>                            mmap_sem lock
>     mmap sem - schedule
>                                            user lock - schedule
>
> If taskB wouldn't be there triggering more random thrashing over the
> mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait too.
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical
> sections...

glibc malloc uses arenas, and trylock() only. It should not block, because if an arena is already locked, the thread automatically chooses another arena, and might create a new one if necessary.

But yes, mmap_sem contention is a big problem, because it's also taken by the futex code (unfortunately).
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
>> They'll be sleeping in futex_wait in the kernel, I think. One thread
>> will hold the critical mutex, some will be off doing their own thing,
>> but importantly there will be many sleeping for the mutex to become
>> available.
>
> The initial assumption was that there was zero idle time with
> threads = cpus and the idle time showed up only when the number of
> threads increased to the double the number of cpus. If the idle time
> wouldn't increase with the number of threads, nothing would be
> suspect.

Well I think more threads ~= more probability that this guy is going to be preempted while holding the mutex? This might be why FreeBSD works much better, because it looks like MySQL actually will set RT scheduling for those processes that take critical resources.

>> However, I tested with a bigger system and actually the idle time
>> comes before we saturate all CPUs. Also, increasing the
>> aggressiveness of the load balancer did not drop idle time at all, so
>> it is not a case of some runqueues idle while others have many
>> threads on them.
>
> It'd be interesting to see the sysrq+t after the idle time increased.
>
>> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
>> glibc allocator. But I wonder if there are other improvements that
>> glibc can do here?
>
> My wild guess is that they're allocating memory after taking futexes.
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical
> sections...

I would agree that it points to MySQL scalability issues; however, the fact that such large gains come from tcmalloc is still interesting.

-- SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 01:02:44PM +0100, Eric Dumazet wrote:
> glibc malloc uses arenas, and trylock() only. It should not block
> because if an arena is already locked, thread automatically chose
> another arena, and might create a new one if necessary.

Well, it uses trylock only when allocating; free uses a normal lock. glibc malloc will by default use the same arena for all threads; only when it sees contention during allocation does it give different threads different arenas.

So, e.g. if mysql did all allocations while holding some global heap lock (thus glibc wouldn't see any contention on allocation), but freeing was done outside of the application's critical section, you would see contention on the main arena's lock in the free path.

Calling malloc_stats (); from e.g. an atexit handler could give interesting details, especially if you recompile glibc malloc with -DTHREAD_STATS=1.

Jakub
Re: SMP performance degradation with sysbench
On 3/12/07, Anton Blanchard [EMAIL PROTECTED] wrote:
> Hi Nick,
>
>> Anyway, I'll keep experimenting. If anyone from MySQL wants to help
>> look at this, send me a mail (eg. especially with the
>> sched_setscheduler issue, you might be able to do something better).
>
> I took a look at this today and figured I'd document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library;
> replacing it with the google malloc library fixes the negative
> scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Quick datapoint, still collecting data and trying to verify it's always the case: on my 8-way Xeon, I'm actually seeing *much* worse performance with libtcmalloc.so compared to mainline. Am generating graphs and such still, but maybe someone else with x86_64 hardware could try the google PRELOAD and see if it helps/hurts (to rule out tester stupidity)?

Thanks, Nish
Re: SMP performance degradation with sysbench
Nish Aravamudan wrote:
> On 3/12/07, Anton Blanchard [EMAIL PROTECTED] wrote:
>> Bottom line: it looks like issues in the glibc malloc library;
>> replacing it with the google malloc library fixes the negative
>> scaling:
>>
>> # apt-get install libgoogle-perftools0
>> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld
>
> Quick datapoint, still collecting data and trying to verify it's
> always the case: on my 8-way Xeon, I'm actually seeing *much* worse
> performance with libtcmalloc.so compared to mainline. Am generating
> graphs and such still, but maybe someone else with x86_64 hardware
> could try the google PRELOAD and see if it helps/hurts (to rule out
> tester stupidity)?

I wish I had an 8-way test platform :)

Anyway, could you post some oprofile results?
Re: SMP performance degradation with sysbench
On 3/13/07, Eric Dumazet [EMAIL PROTECTED] wrote:
> Nish Aravamudan wrote:
>> Quick datapoint, still collecting data and trying to verify it's
>> always the case: on my 8-way Xeon, I'm actually seeing *much* worse
>> performance with libtcmalloc.so compared to mainline. Am generating
>> graphs and such still, but maybe someone else with x86_64 hardware
>> could try the google PRELOAD and see if it helps/hurts (to rule out
>> tester stupidity)?
>
> I wish I had a 8-way test platform :)
>
> Anyway, could you post some oprofile results ?

Hopefully soon -- want to still make sure I'm not doing something dumb. Am also hoping to get some of the gdb backtraces like Anton had.

Thanks, Nish
Re: SMP performance degradation with sysbench
Anton Blanchard wrote:
> I took a look at this today and figured I'd document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library;
> replacing it with the google malloc library fixes the negative
> scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Hi Anton, thanks for the report.

glibc certainly has many scalability problems. One of the known problems is its (ab)use of mmap() to allocate one (yes: one!) page every time you fopen() a file, and then an munmap() at fclose() time. mmap()/munmap() should be avoided like hell in multithreaded programs.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP performance degradation with sysbench
Anton Blanchard wrote:
> I took a look at this today and figured Id document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library,
> replacing it with the google malloc library fixes the negative
> scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Hi Anton,

Very cool. Yeah I had come to the conclusion that it wasn't a kernel issue, and basically was afraid to look into userspace ;)

That bogus setscheduler thing must surely have never worked, though. I wonder if FreeBSD avoids the scalability issue because it is using SCHED_RR there, or because it has a decent threaded malloc implementation.

-- SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
Hi Nick,

> Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> at this, send me a mail (eg. especially with the sched_setscheduler issue,
> you might be able to do something better).

I took a look at this today and figured I'd document it:

http://ozlabs.org/~anton/linux/sysbench/

Bottom line: it looks like issues in the glibc malloc library; replacing it with the google malloc library fixes the negative scaling:

# apt-get install libgoogle-perftools0
# LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Anton
Re: SMP performance degradation with sysbench
On Tue, 2007-02-27 at 20:05 +0100, Lorenzo Allegrucci wrote:
> On Tue, 2007-02-27 at 09:02 -0500, Rik van Riel wrote:
>> That still doesn't fix the potential Linux problem that this
>> benchmark identified.
>>
>> To clarify: I don't care as much about MySQL performance as
>> I care about identifying and fixing this potential bug in
>> Linux.
>
> Here http://people.freebsd.org/~kris/scaling/mysql.html Kris Kennaway
> talks about a patch for FreeBSD 7 which addresses poor scalability
> of file descriptor locking and that it's responsible for almost all
> of the performance and scaling improvements.

How does Linux scale with many threads contending for the file descriptor lock? Has anyone tried to run the test with oprofile?
Re: SMP performance degradation with sysbench
On 2/27/07, Nish Aravamudan <[EMAIL PROTECTED]> wrote:
> On 2/27/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
>> Paulo Marques wrote:
>>> Rik van Riel wrote:
>>>> J.A. Magallón wrote:
>>>>> [...]
>>>>> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ?
>>>>
>>>> That still doesn't fix the potential Linux problem that this
>>>> benchmark identified.
>>>>
>>>> To clarify: I don't care as much about MySQL performance as
>>>> I care about identifying and fixing this potential bug in
>>>> Linux.
>>>
>>> IIRC a long time ago there was a change in the scheduler to prevent a
>>> low prio task running on a sibling of a hyperthreaded processor to slow
>>> down a higher prio task on another sibling of the same processor.
>>>
>>> Basically the scheduler would put the low prio task to sleep during an
>>> adequate task slice to allow the other sibling to run at full speed for
>>> a while.
>>>
>>> If that is the case, turning off CONFIG_SCHED_SMT would solve the problem.
>>
>> Note that Intel does make multicore HT processors, and hopefully when
>> this code works as intended it will result in more total throughput. My
>> supposition is that it currently is NOT working as intended, since
>> disabling SMT scheduling is reported to help.
>
> It does help, but we still drop off, clearly. Also, that's my baseline,
> so I'm not able to reproduce the *sharp* dropoff from the blog post yet.
>
>> A test with MC on and SMT off would be informative for where to look next.
>
> I'm rebooting my box with 2.6.20.1 and exactly this setup now.

Here are the results:

idle.png: average % idle over 120s runs from 1 to 32 threads
transactions.png: TPS over 120s runs from 1 to 32 threads

Hope the data is useful. All I can conclude right now is that SMT appears to help (contradicting what I said earlier), but that MC seems to have no effect (or no substantial effect).

Thanks, Nish

[attachments: idle.png, transactions.png (PNG images)]
Re: SMP performance degradation with sysbench
On 2/27/07, Bill Davidsen <[EMAIL PROTECTED]> wrote: Paulo Marques wrote: > Rik van Riel wrote: >> J.A. Magallón wrote: >>> [...] >>> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? >> >> That still doesn't fix the potential Linux problem that this >> benchmark identified. >> >> To clarify: I don't care as much about MySQL performance as >> I care about identifying and fixing this potential bug in >> Linux. > > IIRC a long time ago there was a change in the scheduler to prevent a > low prio task running on a sibling of a hyperthreaded processor to slow > down a higher prio task on another sibling of the same processor. > > Basically the scheduler would put the low prio task to sleep during an > adequate task slice to allow the other sibling to run at full speed for > a while. > > I don't know the scheduler code well enough, but comments like this one > make me think that the change is still in place: > >> /* >> * If an SMT sibling task has been put to sleep for priority >> * reasons reschedule the idle task to see if it can now run. >> */ >> if (rq->nr_running) { >> resched_task(rq->idle); >> ret = 1; >> } > > If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. > That may be the case, but in my opinion if this helps it doesn't "solve" the problem, because the real problem is that a process which is not on a HT is being treated as if it were. Note that Intel does make multicore HT processors, and hopefully when this code works as intended it will result in more total throughput. My supposition is that it currently is NOT working as intended, since disabling SMT scheduling is reported to help. It does help, but we still drop off, clearly. Also, that's my baseline, so I'm not able to reproduce the *sharp* dropoff from the blog post yet. A test with MC on and SMT off would be informative for where to look next. I'm rebooting my box with 2.6.20.1 and exactly this setup now. 
Thanks, Nish
Re: SMP performance degradation with sysbench
On 2/27/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Nish Aravamudan wrote: > On 2/26/07, Nick Piggin <[EMAIL PROTECTED]> wrote: > >> Rik van Riel wrote: >> > Lorenzo Allegrucci wrote: >> > >> >> Hi lkml, >> >> >> >> according to the test below (sysbench) Linux seems to have scalability >> >> problems beyond 8 client threads: >> >> http://jeffr-tech.livejournal.com/6268.html#cutid1 >> >> http://jeffr-tech.livejournal.com/5705.html >> >> Hardware is an 8-core amd64 system and jeffr seems willing to try more >> >> Linux versions on that machine. >> >> Anyway, is there anyone who can reproduce this? >> > >> > >> > I have reproduced it on a quad core test system. >> > >> > With 4 threads (on 4 cores) I get a high throughput, with >> > approximately 58% user time and 42% system time. >> > >> > With 8 threads (on 4 cores) I get way lower throughput, >> > with 37% user time, 29% system time 35% idle time! >> > >> > The maximum time taken per query also increases from >> > 0.0096s to 0.5273s. Ouch! >> > >> > I don't know if this is MySQL, glibc or Linux kernel, >> > but something strange is going on... >> >> Like you, I'm also seeing idle time start going up as threads increase. >> >> I initially thought this was a problem with the multiprocessor scheduler, >> because the pattern is exactly like some artificat in the load balancing. >> >> However, after looking at the stats, and testing a couple of things, I >> think it may not be after all. >> >> I've reproduced this on a 8-socket/16-way dual core Opteron. So far what >> I am seeing is that MySQL is having trouble putting enough load into the >> scheduler. > > > Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC > in .config) I posted about earlier. > > transactions.png resembles Nick's results pretty closely, in that a > drop-off occurs, at the same # of threads, too. That seems weird to > me, but I haven't thought about it too closely. 
> Shouldn't Nick's be dropping off closer to 16 threads (that would be 1 per core, then, right?) I don't think it is exactly a matter of processes >= cores, but rather just a general problem at higher concurrency. Ok, thanks for the clarification. -Nish
Re: SMP performance degradation with sysbench
Nish Aravamudan wrote: On 2/26/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Rik van Riel wrote: > Lorenzo Allegrucci wrote: > >> Hi lkml, >> >> according to the test below (sysbench) Linux seems to have scalability >> problems beyond 8 client threads: >> http://jeffr-tech.livejournal.com/6268.html#cutid1 >> http://jeffr-tech.livejournal.com/5705.html >> Hardware is an 8-core amd64 system and jeffr seems willing to try more >> Linux versions on that machine. >> Anyway, is there anyone who can reproduce this? > > > I have reproduced it on a quad core test system. > > With 4 threads (on 4 cores) I get a high throughput, with > approximately 58% user time and 42% system time. > > With 8 threads (on 4 cores) I get way lower throughput, > with 37% user time, 29% system time 35% idle time! > > The maximum time taken per query also increases from > 0.0096s to 0.5273s. Ouch! > > I don't know if this is MySQL, glibc or Linux kernel, > but something strange is going on... Like you, I'm also seeing idle time start going up as threads increase. I initially thought this was a problem with the multiprocessor scheduler, because the pattern is exactly like some artifact in the load balancing. However, after looking at the stats, and testing a couple of things, I think it may not be after all. I've reproduced this on an 8-socket/16-way dual core Opteron. So far what I am seeing is that MySQL is having trouble putting enough load into the scheduler. Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC in .config) I posted about earlier. transactions.png resembles Nick's results pretty closely, in that a drop-off occurs, at the same # of threads, too. That seems weird to me, but I haven't thought about it too closely. Shouldn't Nick's be dropping off closer to 16 threads (that would be 1 per core, then, right?) I don't think it is exactly a matter of processes >= cores, but rather just a general problem at higher concurrency. -- SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
Paulo Marques wrote: Rik van Riel wrote: J.A. Magallón wrote: [...] Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. IIRC a long time ago there was a change in the scheduler to prevent a low prio task running on a sibling of a hyperthreaded processor to slow down a higher prio task on another sibling of the same processor. Basically the scheduler would put the low prio task to sleep during an adequate task slice to allow the other sibling to run at full speed for a while. I don't know the scheduler code well enough, but comments like this one make me think that the change is still in place: /* * If an SMT sibling task has been put to sleep for priority * reasons reschedule the idle task to see if it can now run. */ if (rq->nr_running) { resched_task(rq->idle); ret = 1; } If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. That may be the case, but in my opinion if this helps it doesn't "solve" the problem, because the real problem is that a process which is not on a HT is being treated as if it were. Note that Intel does make multicore HT processors, and hopefully when this code works as intended it will result in more total throughput. My supposition is that it currently is NOT working as intended, since disabling SMT scheduling is reported to help. A test with MC on and SMT off would be informative for where to look next. -- Bill Davidsen <[EMAIL PROTECTED]> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot
Re: SMP performance degradation with sysbench
From: Robert Hancock <[EMAIL PROTECTED]> Subject: Re: SMP performance degradation with sysbench Date: Tue, 27 Feb 2007 18:20:25 -0600 Message-ID: <[EMAIL PROTECTED]> > Hiro Yoshioka wrote: > > Howdy, > > > > MySQL 5.0.26 had some scalability issues and it solved since 5.0.32 > > http://ossipedia.ipa.go.jp/capacity/EV0612260303/ > > (written in Japanese but you may read the graph. We compared > > 5.0.24 vs 5.0.32) > > > > The following is oprofile data > > ==> > > cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt > > <== > > CPU: Core Solo / Duo, speed 2666.76 MHz (estimated) > > Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit > > mask of 0x00 (Unhalted core cycles) count 10 > > samples %app name symbol name > > 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock > > 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock > > 18600010 6.6502 mysqld rec_get_offsets_func > > 18121328 6.4790 mysqld btr_search_guess_on_hash > > 11453095 4.0949 mysqld row_search_for_mysql > > > > MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core > > machine. > > Curious that it calls pthread_mutex_trylock (as opposed to > pthread_mutex_lock) so often. Maybe they're doing some kind of mutex > lock busy-looping? Yes, it is. Regards, Hiro
Re: SMP performance degradation with sysbench
On 2/26/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Rik van Riel wrote: > Lorenzo Allegrucci wrote: > >> Hi lkml, >> >> according to the test below (sysbench) Linux seems to have scalability >> problems beyond 8 client threads: >> http://jeffr-tech.livejournal.com/6268.html#cutid1 >> http://jeffr-tech.livejournal.com/5705.html >> Hardware is an 8-core amd64 system and jeffr seems willing to try more >> Linux versions on that machine. >> Anyway, is there anyone who can reproduce this? > > > I have reproduced it on a quad core test system. > > With 4 threads (on 4 cores) I get a high throughput, with > approximately 58% user time and 42% system time. > > With 8 threads (on 4 cores) I get way lower throughput, > with 37% user time, 29% system time 35% idle time! > > The maximum time taken per query also increases from > 0.0096s to 0.5273s. Ouch! > > I don't know if this is MySQL, glibc or Linux kernel, > but something strange is going on... Like you, I'm also seeing idle time start going up as threads increase. I initially thought this was a problem with the multiprocessor scheduler, because the pattern is exactly like some artifact in the load balancing. However, after looking at the stats, and testing a couple of things, I think it may not be after all. I've reproduced this on an 8-socket/16-way dual core Opteron. So far what I am seeing is that MySQL is having trouble putting enough load into the scheduler. Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC in .config) I posted about earlier. transactions.png resembles Nick's results pretty closely, in that a drop-off occurs, at the same # of threads, too. That seems weird to me, but I haven't thought about it too closely. Shouldn't Nick's be dropping off closer to 16 threads (that would be 1 per core, then, right?) idle.png is the average % idle according to sar over each run from 1 to 32 threads. This appears to confirm what Rik was seeing.
Not sure if my data is hurting or helping, but this box remains available for further tests. Thanks, Nish [Attachments: transactions.png, idle.png]
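A measurement loop like the one behind these graphs can be sketched as follows. This is a hypothetical reconstruction, not Nish's actual script: the sysbench 0.4 option names are assumed, MySQL connection flags are omitted, and sar is from sysstat. For each thread count it runs the OLTP test for 120s while sar samples CPU idle time once a second.

```shell
#!/bin/sh
# Hypothetical sketch of the per-thread-count measurement loop
# (sysbench 0.4 option names assumed; MySQL connection flags omitted).
command -v sysbench >/dev/null 2>&1 && command -v sar >/dev/null 2>&1 \
    || { echo "sysbench/sar not installed; skipping"; exit 0; }
for n in $(seq 1 32); do
    sar -u 1 120 > "sar-$n.txt" &                  # one CPU sample per second
    sysbench --test=oltp --num-threads="$n" --max-time=120 \
             --max-requests=0 run > "sysbench-$n.txt"
    wait                                           # let the sar sampler finish
    # average %idle over the run: last field of sar's "Average" line
    awk '/^Average/ { print $NF }' "sar-$n.txt"
done
```

Plotting the per-run TPS from the sysbench output against the awk-extracted average %idle gives the transactions.png/idle.png pair described above.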
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote: Howdy, MySQL 5.0.26 had some scalability issues and it solved since 5.0.32 http://ossipedia.ipa.go.jp/capacity/EV0612260303/ (written in Japanese but you may read the graph. We compared 5.0.24 vs 5.0.32) The following is oprofile data ==> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt <== CPU: Core Solo / Duo, speed 2666.76 MHz (estimated) Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 10 samples %app name symbol name 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock 18600010 6.6502 mysqld rec_get_offsets_func 18121328 6.4790 mysqld btr_search_guess_on_hash 11453095 4.0949 mysqld row_search_for_mysql MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core machine. Curious that it calls pthread_mutex_trylock (as opposed to pthread_mutex_lock) so often. Maybe they're doing some kind of mutex lock busy-looping? -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/
Re: SMP performance degradation with sysbench
On 2/27/07, Paulo Marques <[EMAIL PROTECTED]> wrote: Rik van Riel wrote: > J.A. Magallón wrote: >>[...] >> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? > > That still doesn't fix the potential Linux problem that this > benchmark identified. > > To clarify: I don't care as much about MySQL performance as > I care about identifying and fixing this potential bug in > Linux. IIRC a long time ago there was a change in the scheduler to prevent a low prio task running on a sibling of a hyperthreaded processor to slow down a higher prio task on another sibling of the same processor. Basically the scheduler would put the low prio task to sleep during an adequate task slice to allow the other sibling to run at full speed for a while. I don't know the scheduler code well enough, but comments like this one make me think that the change is still in place: If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. To chime in here, I was attempting to reproduce this on an 8-way Xeon box (4 dual-core). SCHED_SMT and SCHED_MC on led to scaling issues when above 4 threads (4 threads was the peak). To the point where I couldn't break 1000 transactions per second. Turning both off (with 2.6.20.1) gives much better performance through 16 threads. I am now running the cases from 17 to 32 to see if I can reproduce the problem at hand. I'll regenerate my data and post numbers soon. I don't know if anyone else has those on in their kernel .config, but I'd suggest turning them off, as Paulo said. Thanks, Nish
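Anyone wanting to check whether their own kernel has the options Nish disabled can look at the build config. A quick sketch (assumption: the config lives at /boot/config-$(uname -r); some distros expose /proc/config.gz instead):

```shell
#!/bin/sh
# Check whether CONFIG_SCHED_SMT / CONFIG_SCHED_MC are set in the running
# kernel's build config (config path assumed; varies by distro).
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    grep -E '^CONFIG_SCHED_(SMT|MC)' "$cfg" || echo "neither option set"
else
    echo "no $cfg; try: zcat /proc/config.gz | grep SCHED_"
fi
```

A line like `CONFIG_SCHED_SMT=y` means the sibling-aware balancing discussed in this thread is compiled in.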
Re: SMP performance degradation with sysbench
On Tue, 2007-02-27 at 09:02 -0500, Rik van Riel wrote: > J.A. Magallón wrote: > > On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <[EMAIL PROTECTED]> wrote: > > > >> Hiro Yoshioka wrote: > > >>> Another question. When the number of threads exceeds the number of > >>> CPU cores, we may get a lot of idle time. Then a workaround of > >>> MySQL is that do not creat threads which exceeds the number > >>> of CPU cores. Is it right? > >> Not really, that would make it impossible for MySQL to > >> handle more simultaneous database queries than the system > >> has CPUs. > >> > > > > I don't know myqsl internals, but you assume one thread per query. > > If its more like Apache, one long living thread for several connections ? > > Yes, they are longer lived client connections. One thread > per connection, just like Apache. > > > Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? > > That still doesn't fix the potential Linux problem that this > benchmark identified. > > To clarify: I don't care as much about MySQL performance as > I care about identifying and fixing this potential bug in > Linux. Here http://people.freebsd.org/~kris/scaling/mysql.html Kris Kennaway talks about a patch for FreeBSD 7 which addresses poor scalability of file descriptor locking and that it's responsible for almost all of the performance and scaling improvements.
Re: SMP performance degradation with sysbench
Rik van Riel wrote: J.A. Magallón wrote: [...] Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. IIRC a long time ago there was a change in the scheduler to prevent a low prio task running on a sibling of a hyperthreaded processor to slow down a higher prio task on another sibling of the same processor. Basically the scheduler would put the low prio task to sleep during an adequate task slice to allow the other sibling to run at full speed for a while. I don't know the scheduler code well enough, but comments like this one make me think that the change is still in place: /* * If an SMT sibling task has been put to sleep for priority * reasons reschedule the idle task to see if it can now run. */ if (rq->nr_running) { resched_task(rq->idle); ret = 1; } If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. -- Paulo Marques - www.grupopie.com "The face of a child can say it all, especially the mouth part of the face."
Re: SMP performance degradation with sysbench
J.A. Magallón wrote: On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <[EMAIL PROTECTED]> wrote: Hiro Yoshioka wrote: Another question. When the number of threads exceeds the number of CPU cores, we may get a lot of idle time. Then a workaround of MySQL is that do not create threads which exceeds the number of CPU cores. Is it right? Not really, that would make it impossible for MySQL to handle more simultaneous database queries than the system has CPUs. I don't know mysql internals, but you assume one thread per query. If its more like Apache, one long living thread for several connections ? Yes, they are longer lived client connections. One thread per connection, just like Apache. Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: SMP performance degradation with sysbench
On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <[EMAIL PROTECTED]> wrote: > Hiro Yoshioka wrote: > > Hi, > > > > From: Rik van Riel <[EMAIL PROTECTED]> > >> Hiro Yoshioka wrote: > >>> Howdy, > >>> > >>> MySQL 5.0.26 had some scalability issues and it solved since 5.0.32 > >>> http://ossipedia.ipa.go.jp/capacity/EV0612260303/ > >>> (written in Japanese but you may read the graph. We compared > >>> 5.0.24 vs 5.0.32) > > snip > >>> MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core > >>> machine. > >>> > >>> I think there are a lot of room to be inproved in MySQL implementation. > >> That's one aspect. > >> > >> The other aspect of the problem is that when the number of > >> threads exceeds the number of CPU cores, Linux no longer > >> manages to keep the CPUs busy and we get a lot of idle time. > >> > >> On the other hand, with the number of threads being equal to > >> the number of CPU cores, we are 100% CPU bound... > > > > I have a question. If so, what is the difference of kernel's > > view between SMP and CPU cores? > > None. Each schedulable entity (whether a fully fledged > CPU core or an SMT/HT thread) is treated the same. > And what are the SMT and Multi-Core scheduling options in the kernel config for? Because of this thread I re-read the help text, and it looks like one could de-select the SMT scheduler option, get a working SMP system, and see what difference it makes. I suppose its related to migration and cache flushing and so on, but where could I get more details? And more strange, what is the difference between multi-core and normal SMP configs? > > Another question. When the number of threads exceeds the number of > > CPU cores, we may get a lot of idle time. Then a workaround of > > MySQL is that do not creat threads which exceeds the number > > of CPU cores. Is it right? > > Not really, that would make it impossible for MySQL to > handle more simultaneous database queries than the system > has CPUs.
I don't know mysql internals, but you assume one thread per query. If its more like Apache, one long living thread for several connections ? Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? > Besides, it looks like this is not a problem in MySQL > per se (it works on FreeBSD) but some bug in Linux. -- J.A. Magallon \ Software is like sex: \ It's better when it's free Mandriva Linux release 2007.1 (Cooker) for i586 Linux 2.6.19-jam07 (gcc 4.1.2 20070115 (prerelease) (4.1.2-0.20070115.1mdv2007.1)) #2 SMP PREEMPT
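Magallón's question about when the SMT scheduler option matters can be answered per-machine: on x86, hyperthreading is active when "siblings" exceeds "cpu cores" in /proc/cpuinfo, and only then does CONFIG_SCHED_SMT's sibling-aware balancing have any work to do. A sketch (assumes an x86 /proc/cpuinfo layout; other architectures lack these fields):

```shell
#!/bin/sh
# Detect active SMT/HT by comparing "siblings" (logical CPUs per package)
# with "cpu cores" (physical cores per package) in /proc/cpuinfo.
awk -F': *' '
    $1 ~ /^siblings/  { s = $2 }
    $1 ~ /^cpu cores/ { c = $2 }
    END {
        if (s + 0 > c + 0) print "SMT/HT active"
        else               print "no SMT (or not x86)"
    }
' /proc/cpuinfo
```

On a box reporting "no SMT", deselecting CONFIG_SCHED_SMT costs nothing, which lines up with the reports in this thread that turning it off on non-HT Xeons and Opterons helped.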
Re: SMP performance degradation with sysbench
On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel [EMAIL PROTECTED] wrote: Hiro Yoshioka wrote: Hi, From: Rik van Riel [EMAIL PROTECTED] Hiro Yoshioka wrote: Howdy, MySQL 5.0.26 had some scalability issues and it solved since 5.0.32 http://ossipedia.ipa.go.jp/capacity/EV0612260303/ (written in Japanese but you may read the graph. We compared 5.0.24 vs 5.0.32) snip MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core machine. I think there are a lot of room to be inproved in MySQL implementation. That's one aspect. The other aspect of the problem is that when the number of threads exceeds the number of CPU cores, Linux no longer manages to keep the CPUs busy and we get a lot of idle time. On the other hand, with the number of threads being equal to the number of CPU cores, we are 100% CPU bound... I have a question. If so, what is the difference of kernel's view between SMP and CPU cores? None. Each schedulable entity (whether a fully fledged CPU core or an SMT/HT thread) is treated the same. And what do the SMT and Multi-Core scheduling options in the kernel config are for ? Because of this thread I re-read the help text, and it looks like on could de-select the SMT scheduler option, get a working SMP system, and see what difference ? I suppose its related to migration and cache flushing and so on, but where could I get more details ? And more strange, what is the difference between multi-core and normal SMP configs ? Another question. When the number of threads exceeds the number of CPU cores, we may get a lot of idle time. Then a workaround of MySQL is that do not creat threads which exceeds the number of CPU cores. Is it right? Not really, that would make it impossible for MySQL to handle more simultaneous database queries than the system has CPUs. I don't know myqsl internals, but you assume one thread per query. If its more like Apache, one long living thread for several connections ? 
Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? Besides, it looks like this is not a problem in MySQL per se (it works on FreeBSD) but some bug in Linux. -- J.A. Magallon jamagallon()ono!com \ Software is like sex: \ It's better when it's free Mandriva Linux release 2007.1 (Cooker) for i586 Linux 2.6.19-jam07 (gcc 4.1.2 20070115 (prerelease) (4.1.2-0.20070115.1mdv2007.1)) #2 SMP PREEMPT - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP performance degradation with sysbench
J.A. Magallón wrote: On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel [EMAIL PROTECTED] wrote: Hiro Yoshioka wrote: Another question. When the number of threads exceeds the number of CPU cores, we may get a lot of idle time. Then a workaround of MySQL is that do not creat threads which exceeds the number of CPU cores. Is it right? Not really, that would make it impossible for MySQL to handle more simultaneous database queries than the system has CPUs. I don't know myqsl internals, but you assume one thread per query. If its more like Apache, one long living thread for several connections ? Yes, they are longer lived client connections. One thread per connection, just like Apache. Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP performance degradation with sysbench
Rik van Riel wrote: J.A. Magallón wrote: [...] Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. IIRC a long time ago there was a change in the scheduler to prevent a low prio task running on a sibling of a hyperthreaded processor to slow down a higher prio task on another sibling of the same processor. Basically the scheduler would put the low prio task to sleep during an adequate task slice to allow the other sibling to run at full speed for a while. I don't know the scheduler code well enough, but comments like this one make me think that the change is still in place: /* * If an SMT sibling task has been put to sleep for priority * reasons reschedule the idle task to see if it can now run. */ if (rq-nr_running) { resched_task(rq-idle); ret = 1; } If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. -- Paulo Marques - www.grupopie.com The face of a child can say it all, especially the mouth part of the face. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP performance degradation with sysbench
On Tue, 2007-02-27 at 09:02 -0500, Rik van Riel wrote: J.A. Magallón wrote: On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel [EMAIL PROTECTED] wrote: Hiro Yoshioka wrote: Another question. When the number of threads exceeds the number of CPU cores, we may get a lot of idle time. Then a workaround of MySQL is that do not creat threads which exceeds the number of CPU cores. Is it right? Not really, that would make it impossible for MySQL to handle more simultaneous database queries than the system has CPUs. I don't know myqsl internals, but you assume one thread per query. If its more like Apache, one long living thread for several connections ? Yes, they are longer lived client connections. One thread per connection, just like Apache. Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. Here http://people.freebsd.org/~kris/scaling/mysql.html Kris Kennaway talks about a patch for FreeBSD 7 which addresses poor scalability of file descriptor locking and that it's responsible for almost all of the performance and scaling improvements. Chiacchiera con i tuoi amici in tempo reale! http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP performance degradation with sysbench
On 2/27/07, Paulo Marques [EMAIL PROTECTED] wrote: Rik van Riel wrote: J.A. Magallón wrote: [...] It's the same to answer 4+4 queries as 8 at half the speed, isn't it? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. IIRC a long time ago there was a change in the scheduler to prevent a low prio task running on a sibling of a hyperthreaded processor from slowing down a higher prio task on another sibling of the same processor. Basically the scheduler would put the low prio task to sleep during an adequate task slice to allow the other sibling to run at full speed for a while. I don't know the scheduler code well enough, but comments like this one make me think that the change is still in place: snip If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. To chime in here, I was attempting to reproduce this on an 8-way Xeon box (4 dual-core). SCHED_SMT and SCHED_MC on led to scaling issues above 4 threads (4 threads was the peak), to the point where I couldn't break 1000 transactions per second. Turning both off (with 2.6.20.1) gives much better performance through 16 threads. I am now running the cases from 17 to 32 to see if I can reproduce the problem at hand. I'll regenerate my data and post numbers soon. I don't know if anyone else has those on in their kernel .config, but I'd suggest turning them off, as Paulo said. Thanks, Nish
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote: Howdy, MySQL 5.0.26 had some scalability issues, and they have been solved since 5.0.32 http://ossipedia.ipa.go.jp/capacity/EV0612260303/ (written in Japanese but you may read the graph. We compared 5.0.24 vs 5.0.32) The following is oprofile data == cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt == CPU: Core Solo / Duo, speed 2666.76 MHz (estimated) Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 10 samples % app name symbol name 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock 18600010 6.6502 mysqld rec_get_offsets_func 18121328 6.4790 mysqld btr_search_guess_on_hash 11453095 4.0949 mysqld row_search_for_mysql MySQL tries to get a mutex but it spends about 16.8% of CPU on an 8-core machine. Curious that it calls pthread_mutex_trylock (as opposed to pthread_mutex_lock) so often. Maybe they're doing some kind of mutex lock busy-looping? -- Robert Hancock Saskatoon, SK, Canada To email, remove nospam from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/
Re: SMP performance degradation with sysbench
On 2/26/07, Nick Piggin [EMAIL PROTECTED] wrote: Rik van Riel wrote: Lorenzo Allegrucci wrote: Hi lkml, according to the test below (sysbench) Linux seems to have scalability problems beyond 8 client threads: http://jeffr-tech.livejournal.com/6268.html#cutid1 http://jeffr-tech.livejournal.com/5705.html Hardware is an 8-core amd64 system and jeffr seems willing to try more Linux versions on that machine. Anyway, is there anyone who can reproduce this? I have reproduced it on a quad core test system. With 4 threads (on 4 cores) I get a high throughput, with approximately 58% user time and 42% system time. With 8 threads (on 4 cores) I get way lower throughput, with 37% user time, 29% system time 35% idle time! The maximum time taken per query also increases from 0.0096s to 0.5273s. Ouch! I don't know if this is MySQL, glibc or Linux kernel, but something strange is going on... Like you, I'm also seeing idle time start going up as threads increase. I initially thought this was a problem with the multiprocessor scheduler, because the pattern is exactly like some artificat in the load balancing. However, after looking at the stats, and testing a couple of things, I think it may not be after all. I've reproduced this on a 8-socket/16-way dual core Opteron. So far what I am seeing is that MySQL is having trouble putting enough load into the scheduler. Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC in .config) I posted about earlier. transactions.png resembles Nick's results pretty closely, in that a drop-off occurs, at the same # of threads, too. That seems weird to me, but I haven't thought about it too closely. Shouldn't Nick's be dropping off closer to 16 threads (that would be 1 per core, then, right?) idle.png is the average % idle according to sar over each run from 1 to 32 threads. This appears to confirm what Rik was seeing. Not sure if my data is hurting or helping, but this box remains available for further tests. 
Thanks, Nish transactions.png Description: PNG image idle.png Description: PNG image
Re: SMP performance degradation with sysbench
From: Robert Hancock [EMAIL PROTECTED] Subject: Re: SMP performance degradation with sysbench Date: Tue, 27 Feb 2007 18:20:25 -0600 Message-ID: [EMAIL PROTECTED] Hiro Yoshioka wrote: Howdy, MySQL 5.0.26 had some scalability issues, and they have been solved since 5.0.32 http://ossipedia.ipa.go.jp/capacity/EV0612260303/ (written in Japanese but you may read the graph. We compared 5.0.24 vs 5.0.32) The following is oprofile data == cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt == CPU: Core Solo / Duo, speed 2666.76 MHz (estimated) Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 10 samples % app name symbol name 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock 18600010 6.6502 mysqld rec_get_offsets_func 18121328 6.4790 mysqld btr_search_guess_on_hash 11453095 4.0949 mysqld row_search_for_mysql MySQL tries to get a mutex but it spends about 16.8% of CPU on an 8-core machine. Curious that it calls pthread_mutex_trylock (as opposed to pthread_mutex_lock) so often. Maybe they're doing some kind of mutex lock busy-looping? Yes, it is. Regards, Hiro
Re: SMP performance degradation with sysbench
Paulo Marques wrote: Rik van Riel wrote: J.A. Magallón wrote: [...] It's the same to answer 4+4 queries as 8 at half the speed, isn't it? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. IIRC a long time ago there was a change in the scheduler to prevent a low prio task running on a sibling of a hyperthreaded processor from slowing down a higher prio task on another sibling of the same processor. Basically the scheduler would put the low prio task to sleep during an adequate task slice to allow the other sibling to run at full speed for a while. I don't know the scheduler code well enough, but comments like this one make me think that the change is still in place: /* * If an SMT sibling task has been put to sleep for priority * reasons reschedule the idle task to see if it can now run. */ if (rq->nr_running) { resched_task(rq->idle); ret = 1; } If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. That may be the case, but in my opinion if this helps it doesn't solve the problem, because the real problem is that a process which is not on a HT is being treated as if it were. Note that Intel does make multicore HT processors, and hopefully when this code works as intended it will result in more total throughput. My supposition is that it currently is NOT working as intended, since disabling SMT scheduling is reported to help. A test with MC on and SMT off would be informative for where to look next. -- Bill Davidsen [EMAIL PROTECTED] We have more to fear from the bungling of the incompetent than from the machinations of the wicked. - from Slashdot
Re: SMP performance degradation with sysbench
Nish Aravamudan wrote: On 2/26/07, Nick Piggin [EMAIL PROTECTED] wrote: Rik van Riel wrote: Lorenzo Allegrucci wrote: Hi lkml, according to the test below (sysbench) Linux seems to have scalability problems beyond 8 client threads: http://jeffr-tech.livejournal.com/6268.html#cutid1 http://jeffr-tech.livejournal.com/5705.html Hardware is an 8-core amd64 system and jeffr seems willing to try more Linux versions on that machine. Anyway, is there anyone who can reproduce this? I have reproduced it on a quad core test system. With 4 threads (on 4 cores) I get a high throughput, with approximately 58% user time and 42% system time. With 8 threads (on 4 cores) I get way lower throughput, with 37% user time, 29% system time 35% idle time! The maximum time taken per query also increases from 0.0096s to 0.5273s. Ouch! I don't know if this is MySQL, glibc or Linux kernel, but something strange is going on... Like you, I'm also seeing idle time start going up as threads increase. I initially thought this was a problem with the multiprocessor scheduler, because the pattern is exactly like some artificat in the load balancing. However, after looking at the stats, and testing a couple of things, I think it may not be after all. I've reproduced this on a 8-socket/16-way dual core Opteron. So far what I am seeing is that MySQL is having trouble putting enough load into the scheduler. Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC in .config) I posted about earlier. transactions.png resembles Nick's results pretty closely, in that a drop-off occurs, at the same # of threads, too. That seems weird to me, but I haven't thought about it too closely. Shouldn't Nick's be dropping off closer to 16 threads (that would be 1 per core, then, right?) I don't think it is exactly a matter of processes = cores, but rather just a general problem at higher concurrency. -- SUSE Labs, Novell Inc. 
Re: SMP performance degradation with sysbench
On 2/27/07, Nick Piggin [EMAIL PROTECTED] wrote: Nish Aravamudan wrote: On 2/26/07, Nick Piggin [EMAIL PROTECTED] wrote: Rik van Riel wrote: Lorenzo Allegrucci wrote: Hi lkml, according to the test below (sysbench) Linux seems to have scalability problems beyond 8 client threads: http://jeffr-tech.livejournal.com/6268.html#cutid1 http://jeffr-tech.livejournal.com/5705.html Hardware is an 8-core amd64 system and jeffr seems willing to try more Linux versions on that machine. Anyway, is there anyone who can reproduce this? I have reproduced it on a quad core test system. With 4 threads (on 4 cores) I get a high throughput, with approximately 58% user time and 42% system time. With 8 threads (on 4 cores) I get way lower throughput, with 37% user time, 29% system time 35% idle time! The maximum time taken per query also increases from 0.0096s to 0.5273s. Ouch! I don't know if this is MySQL, glibc or Linux kernel, but something strange is going on... Like you, I'm also seeing idle time start going up as threads increase. I initially thought this was a problem with the multiprocessor scheduler, because the pattern is exactly like some artificat in the load balancing. However, after looking at the stats, and testing a couple of things, I think it may not be after all. I've reproduced this on a 8-socket/16-way dual core Opteron. So far what I am seeing is that MySQL is having trouble putting enough load into the scheduler. Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC in .config) I posted about earlier. transactions.png resembles Nick's results pretty closely, in that a drop-off occurs, at the same # of threads, too. That seems weird to me, but I haven't thought about it too closely. Shouldn't Nick's be dropping off closer to 16 threads (that would be 1 per core, then, right?) I don't think it is exactly a matter of processes = cores, but rather just a general problem at higher concurrency. Ok, thanks for the clarification. 
-Nish
Re: SMP performance degradation with sysbench
On 2/27/07, Bill Davidsen [EMAIL PROTECTED] wrote: Paulo Marques wrote: Rik van Riel wrote: J.A. Magallón wrote: [...] It's the same to answer 4+4 queries as 8 at half the speed, isn't it? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. IIRC a long time ago there was a change in the scheduler to prevent a low prio task running on a sibling of a hyperthreaded processor from slowing down a higher prio task on another sibling of the same processor. Basically the scheduler would put the low prio task to sleep during an adequate task slice to allow the other sibling to run at full speed for a while. I don't know the scheduler code well enough, but comments like this one make me think that the change is still in place: /* * If an SMT sibling task has been put to sleep for priority * reasons reschedule the idle task to see if it can now run. */ if (rq->nr_running) { resched_task(rq->idle); ret = 1; } If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. That may be the case, but in my opinion if this helps it doesn't solve the problem, because the real problem is that a process which is not on a HT is being treated as if it were. Note that Intel does make multicore HT processors, and hopefully when this code works as intended it will result in more total throughput. My supposition is that it currently is NOT working as intended, since disabling SMT scheduling is reported to help. It does help, but we still drop off, clearly. Also, that's my baseline, so I'm not able to reproduce the *sharp* dropoff from the blog post yet. A test with MC on and SMT off would be informative for where to look next. I'm rebooting my box with 2.6.20.1 and exactly this setup now.
Thanks, Nish
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote: Hi, From: Rik van Riel <[EMAIL PROTECTED]> Hiro Yoshioka wrote: Howdy, MySQL 5.0.26 had some scalability issues, and they have been solved since 5.0.32 http://ossipedia.ipa.go.jp/capacity/EV0612260303/ (written in Japanese but you may read the graph. We compared 5.0.24 vs 5.0.32) snip MySQL tries to get a mutex but it spends about 16.8% of CPU on an 8-core machine. I think there is a lot of room to be improved in the MySQL implementation. That's one aspect. The other aspect of the problem is that when the number of threads exceeds the number of CPU cores, Linux no longer manages to keep the CPUs busy and we get a lot of idle time. On the other hand, with the number of threads being equal to the number of CPU cores, we are 100% CPU bound... I have a question. If so, what is the difference in the kernel's view between SMP and CPU cores? None. Each schedulable entity (whether a fully fledged CPU core or an SMT/HT thread) is treated the same. Another question. When the number of threads exceeds the number of CPU cores, we may get a lot of idle time. Then a workaround for MySQL is that it does not create threads which exceed the number of CPU cores. Is it right? Not really, that would make it impossible for MySQL to handle more simultaneous database queries than the system has CPUs. Besides, it looks like this is not a problem in MySQL per se (it works on FreeBSD) but some bug in Linux. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: SMP performance degradation with sysbench
Hi, From: Rik van Riel <[EMAIL PROTECTED]> > Hiro Yoshioka wrote: > > Howdy, > > > > MySQL 5.0.26 had some scalability issues, and they have been solved since 5.0.32 > > http://ossipedia.ipa.go.jp/capacity/EV0612260303/ > > (written in Japanese but you may read the graph. We compared > > 5.0.24 vs 5.0.32) snip > > MySQL tries to get a mutex but it spends about 16.8% of CPU on an 8-core > > machine. > > > > I think there is a lot of room to be improved in the MySQL implementation. > > That's one aspect. > > The other aspect of the problem is that when the number of > threads exceeds the number of CPU cores, Linux no longer > manages to keep the CPUs busy and we get a lot of idle time. > > On the other hand, with the number of threads being equal to > the number of CPU cores, we are 100% CPU bound... I have a question. If so, what is the difference in the kernel's view between SMP and CPU cores? Another question. When the number of threads exceeds the number of CPU cores, we may get a lot of idle time. Then a workaround for MySQL is that it does not create threads which exceed the number of CPU cores. Is it right? Regards, Hiro -- Hiro Yoshioka CTO/Miracle Linux Corporation http://blog.miraclelinux.com/yume/
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote: Howdy, MySQL 5.0.26 had some scalability issues, and they have been solved since 5.0.32 http://ossipedia.ipa.go.jp/capacity/EV0612260303/ (written in Japanese but you may read the graph. We compared 5.0.24 vs 5.0.32) The following is oprofile data ==> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt <== CPU: Core Solo / Duo, speed 2666.76 MHz (estimated) Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 10 samples % app name symbol name 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock 18600010 6.6502 mysqld rec_get_offsets_func 18121328 6.4790 mysqld btr_search_guess_on_hash 11453095 4.0949 mysqld row_search_for_mysql MySQL tries to get a mutex but it spends about 16.8% of CPU on an 8-core machine. I think there is a lot of room to be improved in the MySQL implementation. That's one aspect. The other aspect of the problem is that when the number of threads exceeds the number of CPU cores, Linux no longer manages to keep the CPUs busy and we get a lot of idle time. On the other hand, with the number of threads being equal to the number of CPU cores, we are 100% CPU bound... -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: SMP performance degradation with sysbench
Howdy, MySQL 5.0.26 had some scalability issues, and they have been solved since 5.0.32 http://ossipedia.ipa.go.jp/capacity/EV0612260303/ (written in Japanese but you may read the graph. We compared 5.0.24 vs 5.0.32) The following is oprofile data ==> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt <== CPU: Core Solo / Duo, speed 2666.76 MHz (estimated) Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 10 samples % app name symbol name 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock 18600010 6.6502 mysqld rec_get_offsets_func 18121328 6.4790 mysqld btr_search_guess_on_hash 11453095 4.0949 mysqld row_search_for_mysql MySQL tries to get a mutex but it spends about 16.8% of CPU on an 8-core machine. I think there is a lot of room to be improved in the MySQL implementation. On 2/27/07, Dave Jones <[EMAIL PROTECTED]> wrote: On Mon, Feb 26, 2007 at 04:04:01PM -0600, Pete Harlan wrote: > On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote: > > I found a couple of interesting issues so far. Firstly, the MySQL > > version that I'm using (5.0.26-Max) is making lots of calls to > > FYI, MySQL fixed some scalability problems in version 5.0.30, as > mentioned here: > > http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/ > > It may be worth using more recent sources than 5.0.26 if tracking down > scaling problems in MySQL. The blog post that originated this discussion ran tests on 5.0.33 Not that the mysql version should really matter. The key point here is that FreeBSD and Linux were running the *same* version, and FreeBSD was able to handle the situation better somehow.
Dave -- http://www.codemonkey.org.uk Regards, Hiro -- Hiro Yoshioka mailto:hyoshiok at miraclelinux.com
Re: SMP performance degradation with sysbench
On Mon, Feb 26, 2007 at 04:04:01PM -0600, Pete Harlan wrote: > On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote: > > I found a couple of interesting issues so far. Firstly, the MySQL > > version that I'm using (5.0.26-Max) is making lots of calls to > > FYI, MySQL fixed some scalability problems in version 5.0.30, as > mentioned here: > > http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/ > > It may be worth using more recent sources than 5.0.26 if tracking down > scaling problems in MySQL. The blog post that originated this discussion ran tests on 5.0.33 Not that the mysql version should really matter. The key point here is that FreeBSD and Linux were running the *same* version, and FreeBSD was able to handle the situation better somehow. Dave -- http://www.codemonkey.org.uk
Re: SMP performance degradation with sysbench
On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote: > I found a couple of interesting issues so far. Firstly, the MySQL > version that I'm using (5.0.26-Max) is making lots of calls to FYI, MySQL fixed some scalability problems in version 5.0.30, as mentioned here: http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/ It may be worth using more recent sources than 5.0.26 if tracking down scaling problems in MySQL. --Pete -- Pete Harlan ArtSelect, Inc. [EMAIL PROTECTED] http://www.artselect.com ArtSelect is a subsidiary of a21, Inc.
Re: SMP performance degradation with sysbench
Nick Piggin wrote: Rik van Riel wrote: Lorenzo Allegrucci wrote: Hi lkml, according to the test below (sysbench) Linux seems to have scalability problems beyond 8 client threads: http://jeffr-tech.livejournal.com/6268.html#cutid1 http://jeffr-tech.livejournal.com/5705.html Hardware is an 8-core amd64 system and jeffr seems willing to try more Linux versions on that machine. Anyway, is there anyone who can reproduce this? I have reproduced it on a quad core test system. With 4 threads (on 4 cores) I get a high throughput, with approximately 58% user time and 42% system time. With 8 threads (on 4 cores) I get way lower throughput, with 37% user time, 29% system time 35% idle time! The maximum time taken per query also increases from 0.0096s to 0.5273s. Ouch! I don't know if this is MySQL, glibc or Linux kernel, but something strange is going on... Like you, I'm also seeing idle time start going up as threads increase. I initially thought this was a problem with the multiprocessor scheduler, because the pattern is exactly like some artificat in the load balancing. "artificat" Wow. I must need some sleep :) Please excuse any other typos! -- SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
Rik van Riel wrote: Lorenzo Allegrucci wrote: Hi lkml, according to the test below (sysbench) Linux seems to have scalability problems beyond 8 client threads: http://jeffr-tech.livejournal.com/6268.html#cutid1 http://jeffr-tech.livejournal.com/5705.html Hardware is an 8-core amd64 system and jeffr seems willing to try more Linux versions on that machine. Anyway, is there anyone who can reproduce this? I have reproduced it on a quad core test system. With 4 threads (on 4 cores) I get a high throughput, with approximately 58% user time and 42% system time. With 8 threads (on 4 cores) I get way lower throughput, with 37% user time, 29% system time 35% idle time! The maximum time taken per query also increases from 0.0096s to 0.5273s. Ouch! I don't know if this is MySQL, glibc or Linux kernel, but something strange is going on... Like you, I'm also seeing idle time start going up as threads increase. I initially thought this was a problem with the multiprocessor scheduler, because the pattern is exactly like some artificat in the load balancing. However, after looking at the stats, and testing a couple of things, I think it may not be after all. I've reproduced this on a 8-socket/16-way dual core Opteron. So far what I am seeing is that MySQL is having trouble putting enough load into the scheduler. Virtually all of the sleep time is coming from unix_stream_recvmsg, which seems to be what the clients and server threads use to communicate with. There doesn't seem to be any other tell-tale event that the database is blocking on. It seems like it might at least partially be a problem with MySQL thread/connection management. I found a couple of interesting issues so far. Firstly, the MySQL version that I'm using (5.0.26-Max) is making lots of calls to sched_setscheduler attempting to fiddle with SCHED_OTHER priority in what looks like an attempt to boost CPU time while holding some resource. 
All these calls actually fail, because you cannot change SCHED_OTHER priority like that. Adding a hack to make it fall through to set_user_nice provides a boost which eliminates the cliff (but a downward degradation is still there). Secondly, I've raised the thread numbers from 16 to 32 for my system, which also provides a bit more (although doesn't help the downward slope). Combined, it looks like around 30-40% improvement past 16 threads. It isn't anything like making up for the dropoff seen in the blog link, but different systems, different mysql version... I wonder how close we are with this hack in place? Attached is a graph of my numbers, from 1 to 32 clients. plain = 2.6.20.1, sched is with the attached sched patch, and thread is with 32 rather than 16 clients. Anyway, I'll keep experimenting. If anyone from MySQL wants to help look at this, send me a mail (eg. especially with the sched_setscheduler issue, you might be able to do something better). Nick -- SUSE Labs, Novell Inc. --- kernel/sched.c.orig 2007-02-26 11:46:46.849841000 +0100 +++ kernel/sched.c 2007-02-26 12:04:09.283056000 +0100 @@ -4227,8 +4227,6 @@ recheck: (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) || (!p->mm && param->sched_priority > MAX_RT_PRIO-1)) return -EINVAL; - if (is_rt_policy(policy) != (param->sched_priority != 0)) - return -EINVAL; /* * Allow unprivileged RT tasks to decrease priority: @@ -4302,6 +4300,13 @@ recheck: rt_mutex_adjust_pi(p); + if (!is_rt_policy(policy)) { +if (param->sched_priority == 8) +set_user_nice(p, -20); +else +set_user_nice(p, param->sched_priority-6); + } + return 0; } EXPORT_SYMBOL_GPL(sched_setscheduler);
Re: SMP performance degradation with sysbench
Nick Piggin wrote:
> Like you, I'm also seeing idle time start going up as threads increase.
> I initially thought this was a problem with the multiprocessor
> scheduler, because the pattern is exactly like some artificat in the
> load balancing.

artificat

Wow. I must need some sleep :) Please excuse any other typos!

--
SUSE Labs, Novell Inc.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP performance degradation with sysbench
On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote:
> I found a couple of interesting issues so far. Firstly, the MySQL
> version that I'm using (5.0.26-Max) is making lots of calls to

FYI, MySQL fixed some scalability problems in version 5.0.30, as mentioned here: http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/

It may be worth using more recent sources than 5.0.26 if tracking down scaling problems in MySQL.

--Pete

--
Pete Harlan, ArtSelect, Inc. [EMAIL PROTECTED] http://www.artselect.com ArtSelect is a subsidiary of a21, Inc.
Re: SMP performance degradation with sysbench
On Mon, Feb 26, 2007 at 04:04:01PM -0600, Pete Harlan wrote:
> FYI, MySQL fixed some scalability problems in version 5.0.30, as
> mentioned here:
> http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/
> It may be worth using more recent sources than 5.0.26 if tracking down
> scaling problems in MySQL.

The blog post that originated this discussion ran tests on 5.0.33. Not that the MySQL version should really matter: the key point here is that FreeBSD and Linux were running the *same* version, and FreeBSD was able to handle the situation better somehow.

Dave

--
http://www.codemonkey.org.uk
Re: SMP performance degradation with sysbench
Howdy,

MySQL 5.0.26 had some scalability issues, and they were solved as of 5.0.32: http://ossipedia.ipa.go.jp/capacity/EV0612260303/ (written in Japanese, but you can read the graphs; we compared 5.0.24 vs 5.0.32).

The following is oprofile data:

== cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt ==
CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 10

samples   %        app name             symbol name
47097502  16.8391  libpthread-2.3.4.so  pthread_mutex_trylock
19636300   7.0207  libpthread-2.3.4.so  pthread_mutex_unlock
18600010   6.6502  mysqld               rec_get_offsets_func
18121328   6.4790  mysqld               btr_search_guess_on_hash
11453095   4.0949  mysqld               row_search_for_mysql

MySQL tries to get a mutex, and spends about 16.8% of CPU time on it on this 8-core machine. I think there is a lot of room for improvement in the MySQL implementation.

On 2/27/07, Dave Jones [EMAIL PROTECTED] wrote:
> The blog post that originated this discussion ran tests on 5.0.33. Not
> that the mysql version should really matter. The key point here is
> that FreeBSD and Linux were running the *same* version, and FreeBSD
> was able to handle the situation better somehow.
> Dave
> --
> http://www.codemonkey.org.uk

Regards,
Hiro

--
Hiro Yoshioka mailto:hyoshiok at miraclelinux.com
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote:
> MySQL 5.0.26 had some scalability issues, and they were solved as of
> 5.0.32: http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> (written in Japanese, but you can read the graphs; we compared 5.0.24
> vs 5.0.32).
> [oprofile data snipped]
> MySQL tries to get a mutex, and spends about 16.8% of CPU time on it
> on this 8-core machine. I think there is a lot of room for improvement
> in the MySQL implementation.

That's one aspect. The other aspect of the problem is that when the number of threads exceeds the number of CPU cores, Linux no longer manages to keep the CPUs busy and we get a lot of idle time. On the other hand, with the number of threads equal to the number of CPU cores, we are 100% CPU bound...

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: SMP performance degradation with sysbench
Hi,

From: Rik van Riel [EMAIL PROTECTED]
> That's one aspect. The other aspect of the problem is that when the
> number of threads exceeds the number of CPU cores, Linux no longer
> manages to keep the CPUs busy and we get a lot of idle time. On the
> other hand, with the number of threads equal to the number of CPU
> cores, we are 100% CPU bound...

I have a question. If so, what is the difference, from the kernel's point of view, between SMP and CPU cores?

Another question. When the number of threads exceeds the number of CPU cores, we may get a lot of idle time. Then a workaround for MySQL would be to not create more threads than there are CPU cores. Is that right?

Regards,
Hiro

--
Hiro Yoshioka, CTO/Miracle Linux Corporation http://blog.miraclelinux.com/yume/
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote:
> I have a question. If so, what is the difference, from the kernel's
> point of view, between SMP and CPU cores?

None. Each schedulable entity (whether a fully fledged CPU core or an SMT/HT thread) is treated the same.

> Another question. When the number of threads exceeds the number of CPU
> cores, we may get a lot of idle time. Then a workaround for MySQL
> would be to not create more threads than there are CPU cores. Is that
> right?

Not really; that would make it impossible for MySQL to handle more simultaneous database queries than the system has CPUs. Besides, it looks like this is not a problem in MySQL per se (it works on FreeBSD) but some bug in Linux.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: SMP performance degradation with sysbench
Lorenzo Allegrucci wrote:
> Hi lkml, according to the test below (sysbench) Linux seems to have
> scalability problems beyond 8 client threads:
> http://jeffr-tech.livejournal.com/6268.html#cutid1
> http://jeffr-tech.livejournal.com/5705.html
> Hardware is an 8-core amd64 system and jeffr seems willing to try more
> Linux versions on that machine. Anyway, is there anyone who can
> reproduce this?

I have reproduced it on a quad core test system. With 4 threads (on 4 cores) I get a high throughput, with approximately 58% user time and 42% system time. With 8 threads (on 4 cores) I get way lower throughput, with 37% user time, 29% system time, and 35% idle time! The maximum time taken per query also increases from 0.0096s to 0.5273s. Ouch! I don't know if this is MySQL, glibc or the Linux kernel, but something strange is going on...

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.