Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Thu, Mar 07, 2013 at 08:54:55AM -0700, Dave Kleikamp wrote:
> On 03/07/2013 06:55 AM, Chris Mason wrote:
> > On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> > > On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
> > >
> > > > Indeed. Though how well my patches will work with Oracle will
> > > > depend a lot on what kind of semctl syscalls they are doing.
> > > >
> > > > Does Oracle typically do one semop per semctl syscall, or does
> > > > it pass in a whole bunch at once?
> > >
> > > https://oss.oracle.com/~mason/sembench.c
> > >
> > > I think Chris wrote that to match a particular pattern of semaphore
> > > operations the database engine in question does. I haven't checked to
> > > see if it triggers the case in point though.
> > >
> > > Also, Chris has since left Oracle but maybe he knows who to poke.
> >
> > Dave Kleikamp (cc'd) took over my patches and did the most recent
> > benchmarking. Ported against 3.0:
> >
> > https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c
> >
> > The current versions are still in the 2.6.32 oracle kernel, but it looks
> > like they reverted this 3.0 commit. I think with Manfred's upstream
> > work my more complex approach wasn't required anymore, but hopefully
> > Dave can fill in details.
>
> From what I recall, I could never get better performance from your
> patches than we saw with Manfred's work alone. I can't remember the
> reasons for including and then reverting the patches from the 3.0
> (2.6.39) Oracle kernel, but in the end we weren't able to justify their
> inclusion.

Ok, so after this commit, oracle was happy:

commit fd5db42254518fbf241dc454e918598fbe494fa2
Author: Manfred Spraul
Date:   Wed May 26 14:43:40 2010 -0700

    ipc/sem.c: optimize update_queue() for bulk wakeup calls

But that doesn't explain why Davidlohr saw semtimedop at the top of the
oracle profiles in his runs.
Looking through the patches in this thread, I don't see anything that I'd
expect to slow down oracle TPC numbers.

I dealt with the ipc_perm lock a little differently:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commitdiff;h=78fe45325c8e2e3f4b6ebb1ee15b6c2e8af5ddb1;hp=8102e1ff9d667661b581209323faaf7a84f0f528

My code switched the ipc_rcu_hdr refcount to an atomic, which changed
where I needed the spinlock. It may make things easier in patches 3/4
and 4/4.

(some of this code was Jens, but at the time he made me promise to
pretend he never touched it)

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/07/2013 06:55 AM, Chris Mason wrote:
> On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> > On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
> >
> > > Indeed. Though how well my patches will work with Oracle will
> > > depend a lot on what kind of semctl syscalls they are doing.
> > >
> > > Does Oracle typically do one semop per semctl syscall, or does
> > > it pass in a whole bunch at once?
> >
> > https://oss.oracle.com/~mason/sembench.c
> >
> > I think Chris wrote that to match a particular pattern of semaphore
> > operations the database engine in question does. I haven't checked to
> > see if it triggers the case in point though.
> >
> > Also, Chris has since left Oracle but maybe he knows who to poke.
>
> Dave Kleikamp (cc'd) took over my patches and did the most recent
> benchmarking. Ported against 3.0:
>
> https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c
>
> The current versions are still in the 2.6.32 oracle kernel, but it looks
> like they reverted this 3.0 commit. I think with Manfred's upstream
> work my more complex approach wasn't required anymore, but hopefully
> Dave can fill in details.

From what I recall, I could never get better performance from your
patches than we saw with Manfred's work alone. I can't remember the
reasons for including and then reverting the patches from the 3.0
(2.6.39) Oracle kernel, but in the end we weren't able to justify their
inclusion.

> Here is some of the original discussion around the patch:
>
> https://lkml.org/lkml/2010/4/12/257
>
> In terms of how oracle uses IPC, the part that shows up in profiles is
> using semtimedop for bulk wakeups. They can configure things to use
> either a bunch of small arrays or a huge single array (and anything in
> between).
>
> There is one IPC semaphore per process and they use this to wait for
> some event (like a log commit). When the event comes in, everyone
> waiting is woken in bulk via a semtimedop call.
>
> So, single proc waking many waiters at once.
>
> -chris
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
>
> > Indeed. Though how well my patches will work with Oracle will
> > depend a lot on what kind of semctl syscalls they are doing.
> >
> > Does Oracle typically do one semop per semctl syscall, or does
> > it pass in a whole bunch at once?
>
> https://oss.oracle.com/~mason/sembench.c
>
> I think Chris wrote that to match a particular pattern of semaphore
> operations the database engine in question does. I haven't checked to
> see if it triggers the case in point though.
>
> Also, Chris has since left Oracle but maybe he knows who to poke.

Dave Kleikamp (cc'd) took over my patches and did the most recent
benchmarking. Ported against 3.0:

https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c

The current versions are still in the 2.6.32 oracle kernel, but it looks
like they reverted this 3.0 commit. I think with Manfred's upstream
work my more complex approach wasn't required anymore, but hopefully
Dave can fill in details.

Here is some of the original discussion around the patch:

https://lkml.org/lkml/2010/4/12/257

In terms of how oracle uses IPC, the part that shows up in profiles is
using semtimedop for bulk wakeups. They can configure things to use
either a bunch of small arrays or a huge single array (and anything in
between).

There is one IPC semaphore per process and they use this to wait for
some event (like a log commit). When the event comes in, everyone
waiting is woken in bulk via a semtimedop call.

So, single proc waking many waiters at once.

-chris
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:

> Indeed. Though how well my patches will work with Oracle will
> depend a lot on what kind of semctl syscalls they are doing.
>
> Does Oracle typically do one semop per semctl syscall, or does
> it pass in a whole bunch at once?

https://oss.oracle.com/~mason/sembench.c

I think Chris wrote that to match a particular pattern of semaphore
operations the database engine in question does. I haven't checked to
see if it triggers the case in point though.

Also, Chris has since left Oracle but maybe he knows who to poke.
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, Mar 5, 2013 at 11:13 PM, Davidlohr Bueso wrote:
>
> Digging into the _raw_spin_lock call:
>
> 17.86%  oracle  [kernel.kallsyms]  [k] _raw_spin_lock
>         |
>         --- _raw_spin_lock
>            |
>            |--49.55%-- sys_semtimedop
>            |          |
>            |          |--77.41%-- system_call
>            |          |           semtimedop
>            |          |           skgpwwait
>            |          |           ksliwat
>            |          |           kslwaitctx

Hmm. It looks like you cut that off a bit too early. This shows that
half the cases came from sys_semtimedop. Where did the other half come
from?

Linus
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, 2013-03-05 at 22:53 -0500, Rik van Riel wrote:
> On 03/05/2013 10:46 PM, Waiman Long wrote:
> > On 03/05/2013 03:53 PM, Rik van Riel wrote:
> >
> > > Indeed. Though how well my patches will work with Oracle will
> > > depend a lot on what kind of semctl syscalls they are doing.
> > >
> > > Does Oracle typically do one semop per semctl syscall, or does
> > > it pass in a whole bunch at once?
> >
> > I had collected a strace log of Oracle instance startup a while ago. In
> > the log, almost all of the semctl() calls are to set a single semaphore
> > value in one of the elements of the array using SETVAL. Also there are
> > far more semtimedop() than semctl() calls, about 100:1. Again, all the
> > semtimedop() operations are on a single element of the semaphore array.
>
> That is good to hear. Just what I was hoping when I started
> working on my patches. You should expect them tomorrow or
> Thursday.

Great, looking forward.

Thanks,
Davidlohr
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, 2013-03-05 at 07:40 -0800, Linus Torvalds wrote:
> On Tue, Mar 5, 2013 at 1:35 AM, Davidlohr Bueso wrote:
> >
> > The following set of patches are based on the discussion of holding the
> > ipc lock unnecessarily, such as for permissions and security checks:
>
> Ok, looks fine from a quick look (but then, so did your previous patch-set ;)
>
> You still open-code the spinlock in at least a few places (I saw
> sem_getref), but I still don't care deeply.
>
> > 2) While on an Oracle swingbench DSS (data mining) workload the
> > improvements are not as exciting as with Rik's benchmark, we can see
> > some positive numbers. For an 8 socket machine the following are the
> > percentages of %sys time incurred in the ipc lock:
>
> Ok, I hoped for it being more noticeable. Since that benchmark is less
> trivial than Rik's, can you do a perf record -fg of it and give a more
> complete picture of what the kernel footprint is - and in particular
> who now gets that ipc lock function? Is it purely semtimedop, or what?
> Look out for inlining - ipc_rcu_getref() looks like it would be
> inlined, for example.
>
> It would be good to get a "top twenty kernel functions" from the
> profile, along with some call data on where the lock callers are.. I
> know that Rik's benchmark *only* had that one call-site, I'm wondering
> if the swingbench one has slightly more complex behavior...

For a 400 user workload (the kernel functions remain basically the same
for any amount of users):

    17.86%  oracle   [kernel.kallsyms]  [k] _raw_spin_lock
     8.46%  swapper  [kernel.kallsyms]  [k] intel_idle
     5.51%  oracle   [kernel.kallsyms]  [k] try_atomic_semop
     5.05%  oracle   [kernel.kallsyms]  [k] update_sd_lb_stats
     2.81%  oracle   [kernel.kallsyms]  [k] tg_load_down
     2.41%  swapper  [kernel.kallsyms]  [k] update_blocked_averages
     2.38%  oracle   [kernel.kallsyms]  [k] idle_cpu
     2.37%  swapper  [kernel.kallsyms]  [k] native_write_msr_safe
     2.28%  oracle   [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
     1.84%  oracle   [kernel.kallsyms]  [k] update_blocked_averages
     1.79%  oracle   [kernel.kallsyms]  [k] update_queue
     1.73%  swapper  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
     1.29%  oracle   [kernel.kallsyms]  [k] native_write_msr_safe
     1.07%  java     [kernel.kallsyms]  [k] update_sd_lb_stats
     0.91%  swapper  [kernel.kallsyms]  [k] poll_idle
     0.86%  oracle   [kernel.kallsyms]  [k] try_to_wake_up
     0.80%  java     [kernel.kallsyms]  [k] tg_load_down
     0.72%  oracle   [kernel.kallsyms]  [k] load_balance
     0.67%  oracle   [kernel.kallsyms]  [k] __schedule
     0.67%  oracle   [kernel.kallsyms]  [k] cpumask_next_and

Digging into the _raw_spin_lock call:

    17.86%  oracle  [kernel.kallsyms]  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |--49.55%-- sys_semtimedop
               |          |
               |          |--77.41%-- system_call
               |          |           semtimedop
               |          |           skgpwwait
               |          |           ksliwat
               |          |           kslwaitctx

Thanks,
Davidlohr
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 10:46 PM, Waiman Long wrote:
> On 03/05/2013 03:53 PM, Rik van Riel wrote:
>
> > Indeed. Though how well my patches will work with Oracle will
> > depend a lot on what kind of semctl syscalls they are doing.
> >
> > Does Oracle typically do one semop per semctl syscall, or does
> > it pass in a whole bunch at once?
>
> I had collected a strace log of Oracle instance startup a while ago. In
> the log, almost all of the semctl() calls are to set a single semaphore
> value in one of the elements of the array using SETVAL. Also there are
> far more semtimedop() than semctl() calls, about 100:1. Again, all the
> semtimedop() operations are on a single element of the semaphore array.

That is good to hear. Just what I was hoping when I started
working on my patches. You should expect them tomorrow or
Thursday.

--
All rights reversed
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 03:53 PM, Rik van Riel wrote:
> On 03/05/2013 03:52 PM, Linus Torvalds wrote:
> > On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long wrote:
> > >
> > > The recommended kernel.sem value from Oracle is "250 32000 100 128". I have
> > > tried to reduce the maximum semaphores per array (1st value) while
> > > increasing the max number of arrays. That tends to reduce the ipc_lock
> > > contention in kernel, but it is against Oracle's recommendation.
> >
> > Ok, the Oracle recommendations seem to be assuming that we'd be scaling
> > the semaphore locking sanely, which we don't. Since we share one single
> > lock for all semaphores in the whole array, Oracle's recommendation does
> > the wrong thing for our ipc_lock contention.
> >
> > David's patch should make it much easier to do the locking more
> > fine-grained, and it sounds like Rik is actively working on that,
>
> Indeed. Though how well my patches will work with Oracle will
> depend a lot on what kind of semctl syscalls they are doing.
>
> Does Oracle typically do one semop per semctl syscall, or does
> it pass in a whole bunch at once?

I had collected a strace log of Oracle instance startup a while ago. In
the log, almost all of the semctl() calls are to set a single semaphore
value in one of the elements of the array using SETVAL. Also there are
far more semtimedop() than semctl() calls, about 100:1. Again, all the
semtimedop() operations are on a single element of the semaphore array.

Please note that the behavior of Oracle at startup time may not be
indicative of what it will do when running benchmarks like Swingbench.
However, I don't think there will be a dramatic change in behavior.

-Longman
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 03:52 PM, Linus Torvalds wrote:
> On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long wrote:
> >
> > The recommended kernel.sem value from Oracle is "250 32000 100 128". I have
> > tried to reduce the maximum semaphores per array (1st value) while
> > increasing the max number of arrays. That tends to reduce the ipc_lock
> > contention in kernel, but it is against Oracle's recommendation.
>
> Ok, the Oracle recommendations seem to be assuming that we'd be scaling
> the semaphore locking sanely, which we don't. Since we share one single
> lock for all semaphores in the whole array, Oracle's recommendation does
> the wrong thing for our ipc_lock contention.
>
> David's patch should make it much easier to do the locking more
> fine-grained, and it sounds like Rik is actively working on that,

Indeed. Though how well my patches will work with Oracle will
depend a lot on what kind of semctl syscalls they are doing.

Does Oracle typically do one semop per semctl syscall, or does
it pass in a whole bunch at once?

--
All rights reversed
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long wrote:
>
> The recommended kernel.sem value from Oracle is "250 32000 100 128". I have
> tried to reduce the maximum semaphores per array (1st value) while
> increasing the max number of arrays. That tends to reduce the ipc_lock
> contention in kernel, but it is against Oracle's recommendation.

Ok, the Oracle recommendations seem to be assuming that we'd be scaling
the semaphore locking sanely, which we don't. Since we share one single
lock for all semaphores in the whole array, Oracle's recommendation does
the wrong thing for our ipc_lock contention.

At the same time, I have to say that Oracle's recommendation is the
right thing to do, and it's really a kernel limitation that we scale
badly with lots of semaphores in the array. I'm surprised this hasn't
really come up before. It seems such a basic scalability issue for such
a traditional Unix load. And while everybody hates the SysV IPC stuff,
it's not like it's all *that* complicated. We've had people who worked
on much more fundamental and complex scalability things.

David's patch should make it much easier to do the locking more
fine-grained, and it sounds like Rik is actively working on that, so I'm
hopeful that we can actually do this right in the not too distant
future. The fact that oracle recommends using large semaphore arrays
actually makes me very hopeful that they use semaphores correctly, so
that if we just do our scalability work, you'd get the full advantage of
it..

Linus
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 12:10 PM, Rik van Riel wrote:
> On 03/05/2013 04:35 AM, Davidlohr Bueso wrote:
>
> > 2) While on an Oracle swingbench DSS (data mining) workload the
> > improvements are not as exciting as with Rik's benchmark, we can see
> > some positive numbers. For an 8 socket machine the following are the
> > percentages of %sys time incurred in the ipc lock:
> >
> > Baseline (3.9-rc1):
> > 100 swingbench users: 8,74%
> > 400 swingbench users: 21,86%
> > 800 swingbench users: 84,35%
> >
> > With this patchset:
> > 100 swingbench users: 8,11%
> > 400 swingbench users: 19,93%
> > 800 swingbench users: 77,69%
>
> Does the swingbench DSS workload use multiple semaphores, or just one?
>
> Your patches look like a great start to make the semaphores more
> scalable. If the swingbench DSS workload uses multiple semaphores, I
> have ideas for follow-up patches to make things scale better.
>
> What does ipcs output look like while running swingbench DSS?

For Oracle, the semaphores are set up when the instance is started
irrespective of the workload. For an 8-socket 80-core test system, the
output of ipcs looks like:

------ Semaphore Arrays --------
key        semid    owner   perms  nsems
0x00000000 0        root    600    1
0x00000000 65537    root    600    1
0xcd9652f0 4718594  oracle  640    226
0xcd9652f1 4751363  oracle  640    226
0xcd9652f2 4784132  oracle  640    226
0xcd9652f3 4816901  oracle  640    226
0xcd9652f4 4849670  oracle  640    226

The recommended kernel.sem value from Oracle is "250 32000 100 128". I
have tried to reduce the maximum semaphores per array (1st value) while
increasing the max number of arrays. That tends to reduce the ipc_lock
contention in kernel, but it is against Oracle's recommendation.

-Longman
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 04:35 AM, Davidlohr Bueso wrote:

> 2) While on an Oracle swingbench DSS (data mining) workload the
> improvements are not as exciting as with Rik's benchmark, we can see
> some positive numbers. For an 8 socket machine the following are the
> percentages of %sys time incurred in the ipc lock:
>
> Baseline (3.9-rc1):
> 100 swingbench users: 8,74%
> 400 swingbench users: 21,86%
> 800 swingbench users: 84,35%
>
> With this patchset:
> 100 swingbench users: 8,11%
> 400 swingbench users: 19,93%
> 800 swingbench users: 77,69%

Does the swingbench DSS workload use multiple semaphores, or just one?

Your patches look like a great start to make the semaphores more
scalable. If the swingbench DSS workload uses multiple semaphores, I
have ideas for follow-up patches to make things scale better.

What does ipcs output look like while running swingbench DSS?

--
All rights reversed
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, Mar 5, 2013 at 1:35 AM, Davidlohr Bueso wrote:
>
> The following set of patches are based on the discussion of holding the
> ipc lock unnecessarily, such as for permissions and security checks:

Ok, looks fine from a quick look (but then, so did your previous patch-set ;)

You still open-code the spinlock in at least a few places (I saw
sem_getref), but I still don't care deeply.

> 2) While on an Oracle swingbench DSS (data mining) workload the
> improvements are not as exciting as with Rik's benchmark, we can see
> some positive numbers. For an 8 socket machine the following are the
> percentages of %sys time incurred in the ipc lock:

Ok, I hoped for it being more noticeable. Since that benchmark is less
trivial than Rik's, can you do a perf record -fg of it and give a more
complete picture of what the kernel footprint is - and in particular
who now gets that ipc lock function? Is it purely semtimedop, or what?
Look out for inlining - ipc_rcu_getref() looks like it would be
inlined, for example.

It would be good to get a "top twenty kernel functions" from the
profile, along with some call data on where the lock callers are.. I
know that Rik's benchmark *only* had that one call-site, I'm wondering
if the swingbench one has slightly more complex behavior...

Linus
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 12:10 PM, Rik van Riel wrote:
> On 03/05/2013 04:35 AM, Davidlohr Bueso wrote:
> > 2) While on an Oracle swingbench DSS (data mining) workload the
> > improvements are not as exciting as with Rik's benchmark, we can see
> > some positive numbers. For an 8 socket machine the following are the
> > percentages of %sys time incurred in the ipc lock:
> >
> > Baseline (3.9-rc1):
> > 100 swingbench users: 8,74%
> > 400 swingbench users: 21,86%
> > 800 swingbench users: 84,35%
> >
> > With this patchset:
> > 100 swingbench users: 8,11%
> > 400 swingbench users: 19,93%
> > 800 swingbench users: 77,69%
>
> Does the swingbench DSS workload use multiple semaphores, or just one?
>
> Your patches look like a great start to make the semaphores more
> scalable. If the swingbench DSS workload uses multiple semaphores, I
> have ideas for follow-up patches to make things scale better.
>
> What does ipcs output look like while running swingbench DSS?

For Oracle, the semaphores are set up when the instance is started,
irrespective of the workload. For an 8-socket, 80-core test system, the
output of ipcs looks like:

------ Semaphore Arrays --------
key         semid     owner     perms     nsems
0x          0         root      600       1
0x          65537     root      600       1
0xcd9652f0  4718594   oracle    640       226
0xcd9652f1  4751363   oracle    640       226
0xcd9652f2  4784132   oracle    640       226
0xcd9652f3  4816901   oracle    640       226
0xcd9652f4  4849670   oracle    640       226

The recommended kernel.sem value from Oracle is "250 32000 100 128"
(SEMMSL, SEMMNS, SEMOPM, SEMMNI). I have tried to reduce the maximum
number of semaphores per array (the 1st value) while increasing the
maximum number of arrays. That tends to reduce the ipc_lock contention
in the kernel, but it is against Oracle's recommendation.

-Longman
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long waiman.l...@hp.com wrote:
> The recommended kernel.sem value from Oracle is 250 32000 100 128. I
> have tried to reduce the maximum semaphores per array (1st value) while
> increasing the max number of arrays. That tends to reduce the ipc_lock
> contention in kernel, but it is against Oracle's recommendation.

Ok, the Oracle recommendations seem to be assuming that we'd be scaling
the semaphore locking sanely, which we don't. Since we share one single
lock for all semaphores in the whole array, Oracle's recommendation does
the wrong thing for our ipc_lock contention.

At the same time, I have to say that Oracle's recommendation is the
right thing to do, and it's really a kernel limitation that we scale
badly with lots of semaphores in the array.

I'm surprised this hasn't really come up before. It seems such a basic
scalability issue for such a traditional Unix load. And while everybody
hates the SysV IPC stuff, it's not like it's all *that* complicated.
We've had people who worked on much more fundamental and complex
scalability things.

David's patch should make it much easier to do the locking more
fine-grained, and it sounds like Rik is actively working on that, so I'm
hopeful that we can actually do this right in the not too distant
future. The fact that Oracle recommends using large semaphore arrays
actually makes me very hopeful that they use semaphores correctly, so
that if we just do our scalability work, you'd get the full advantage
of it..

              Linus
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 03:52 PM, Linus Torvalds wrote:
> On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long waiman.l...@hp.com wrote:
> > The recommended kernel.sem value from Oracle is 250 32000 100 128. I
> > have tried to reduce the maximum semaphores per array (1st value)
> > while increasing the max number of arrays. That tends to reduce the
> > ipc_lock contention in kernel, but it is against Oracle's
> > recommendation.
>
> Ok, the Oracle recommendations seem to be assuming that we'd be scaling
> the semaphore locking sanely, which we don't. Since we share one single
> lock for all semaphores in the whole array, Oracle's recommendation
> does the wrong thing for our ipc_lock contention.
>
> David's patch should make it much easier to do the locking more
> fine-grained, and it sounds like Rik is actively working on that,

Indeed. Though how well my patches will work with Oracle will depend a
lot on what kind of semctl syscalls they are doing.

Does Oracle typically do one semop per semctl syscall, or does it pass
in a whole bunch at once?

-- 
All rights reversed
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 03:53 PM, Rik van Riel wrote:
> On 03/05/2013 03:52 PM, Linus Torvalds wrote:
> > On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long waiman.l...@hp.com wrote:
> > > The recommended kernel.sem value from Oracle is 250 32000 100 128.
> > > I have tried to reduce the maximum semaphores per array (1st value)
> > > while increasing the max number of arrays. That tends to reduce the
> > > ipc_lock contention in kernel, but it is against Oracle's
> > > recommendation.
> >
> > Ok, the Oracle recommendations seem to be assuming that we'd be
> > scaling the semaphore locking sanely, which we don't. Since we share
> > one single lock for all semaphores in the whole array, Oracle's
> > recommendation does the wrong thing for our ipc_lock contention.
> >
> > David's patch should make it much easier to do the locking more
> > fine-grained, and it sounds like Rik is actively working on that,
>
> Indeed. Though how well my patches will work with Oracle will depend a
> lot on what kind of semctl syscalls they are doing.
>
> Does Oracle typically do one semop per semctl syscall, or does it pass
> in a whole bunch at once?

I had collected a strace log of Oracle instance startup a while ago. In
the log, almost all of the semctl() calls set a single semaphore value
in one of the elements of the array using SETVAL. Also, there are far
more semtimedop() calls than semctl() calls, about 100:1. Again, all
the semtimedop() operations are on a single element of the semaphore
array.

Please note that the behavior of Oracle at startup time may not be
indicative of what it will do when running benchmarks like Swingbench.
However, I don't think there will be a dramatic change in behavior.

-Longman
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On 03/05/2013 10:46 PM, Waiman Long wrote:
> On 03/05/2013 03:53 PM, Rik van Riel wrote:
> > Indeed. Though how well my patches will work with Oracle will depend
> > a lot on what kind of semctl syscalls they are doing.
> >
> > Does Oracle typically do one semop per semctl syscall, or does it
> > pass in a whole bunch at once?
>
> I had collected a strace log of Oracle instance startup a while ago.
> In the log, almost all of the semctl() calls set a single semaphore
> value in one of the elements of the array using SETVAL. Also, there
> are far more semtimedop() than semctl(), about 100:1. Again, all the
> semtimedop() operations are on a single element of the semaphore
> array.

That is good to hear. Just what I was hoping when I started working on
my patches. You should expect them tomorrow or Thursday.

-- 
All rights reversed
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, 2013-03-05 at 07:40 -0800, Linus Torvalds wrote:
> On Tue, Mar 5, 2013 at 1:35 AM, Davidlohr Bueso davidlohr.bu...@hp.com wrote:
> > The following set of patches are based on the discussion of holding
> > the ipc lock unnecessarily, such as for permissions and security
> > checks:
>
> Ok, looks fine from a quick look (but then, so did your previous
> patch-set ;)
>
> You still open-code the spinlock in at least a few places (I saw
> sem_getref), but I still don't care deeply.
>
> > 2) While on an Oracle swingbench DSS (data mining) workload the
> > improvements are not as exciting as with Rik's benchmark, we can see
> > some positive numbers. For an 8 socket machine the following are the
> > percentages of %sys time incurred in the ipc lock:
>
> Ok, I hoped for it being more noticeable. Since that benchmark is less
> trivial than Rik's, can you do a perf record -fg of it and give a more
> complete picture of what the kernel footprint is - and in particular
> who now gets that ipc lock function? Is it purely semtimedop, or what?
> Look out for inlining - ipc_rcu_getref() looks like it would be
> inlined, for example.
>
> It would be good to get a "top twenty kernel functions" from the
> profile, along with some call data on where the lock callers are..
> I know that Rik's benchmark *only* had that one call-site, I'm
> wondering if the swingbench one has slightly more complex behavior...

For a 400 user workload (the kernel functions remain basically the same
for any amount of users):

    17.86%  oracle   [kernel.kallsyms]  [k] _raw_spin_lock
     8.46%  swapper  [kernel.kallsyms]  [k] intel_idle
     5.51%  oracle   [kernel.kallsyms]  [k] try_atomic_semop
     5.05%  oracle   [kernel.kallsyms]  [k] update_sd_lb_stats
     2.81%  oracle   [kernel.kallsyms]  [k] tg_load_down
     2.41%  swapper  [kernel.kallsyms]  [k] update_blocked_averages
     2.38%  oracle   [kernel.kallsyms]  [k] idle_cpu
     2.37%  swapper  [kernel.kallsyms]  [k] native_write_msr_safe
     2.28%  oracle   [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
     1.84%  oracle   [kernel.kallsyms]  [k] update_blocked_averages
     1.79%  oracle   [kernel.kallsyms]  [k] update_queue
     1.73%  swapper  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
     1.29%  oracle   [kernel.kallsyms]  [k] native_write_msr_safe
     1.07%  java     [kernel.kallsyms]  [k] update_sd_lb_stats
     0.91%  swapper  [kernel.kallsyms]  [k] poll_idle
     0.86%  oracle   [kernel.kallsyms]  [k] try_to_wake_up
     0.80%  java     [kernel.kallsyms]  [k] tg_load_down
     0.72%  oracle   [kernel.kallsyms]  [k] load_balance
     0.67%  oracle   [kernel.kallsyms]  [k] __schedule
     0.67%  oracle   [kernel.kallsyms]  [k] cpumask_next_and

Digging into the _raw_spin_lock call:

    17.86%  oracle  [kernel.kallsyms]  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |--49.55%-- sys_semtimedop
               |          |
               |          |--77.41%-- system_call
               |          |           semtimedop
               |          |           skgpwwait
               |          |           ksliwat
               |          |           kslwaitctx

Thanks,
Davidlohr
Re: [PATCH v2 0/4] ipc: reduce ipc lock contention
On Tue, 2013-03-05 at 22:53 -0500, Rik van Riel wrote:
> On 03/05/2013 10:46 PM, Waiman Long wrote:
> > On 03/05/2013 03:53 PM, Rik van Riel wrote:
> > > Indeed. Though how well my patches will work with Oracle will
> > > depend a lot on what kind of semctl syscalls they are doing.
> > >
> > > Does Oracle typically do one semop per semctl syscall, or does it
> > > pass in a whole bunch at once?
> >
> > I had collected a strace log of Oracle instance startup a while ago.
> > In the log, almost all of the semctl() calls set a single semaphore
> > value in one of the elements of the array using SETVAL. Also, there
> > are far more semtimedop() than semctl(), about 100:1. Again, all the
> > semtimedop() operations are on a single element of the semaphore
> > array.
>
> That is good to hear. Just what I was hoping when I started working on
> my patches. You should expect them tomorrow or Thursday.

Great, looking forward.

Thanks,
Davidlohr