Re: [PATCH -v9][RFC] mutex: implement adaptive spinning
2009/1/14 Chris Mason chris.ma...@oracle.com:

On Tue, 2009-01-13 at 18:21 +0100, Peter Zijlstra wrote:
On Tue, 2009-01-13 at 08:49 -0800, Linus Torvalds wrote:
So do a v10, and ask people to test.

---
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra a.p.zijls...@chello.nl
Date: Mon Jan 12 14:01:47 CET 2009

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

I've spent a bunch of time on this one, and noticed earlier today that I
still had bits of CONFIG_FTRACE compiling. I wasn't actually tracing
anything, but it seems to have had a big performance hit. The bad news is
the simple spin got much much faster, dbench 50 coming in at 1282MB/s
instead of 580MB/s. (Other benchmarks give similar results.)

v10 is better than not spinning, but it's in the 5-10% range. So, I've
been trying to find ways to close the gap, just to understand exactly
where it is different.

If I take out:

        /*
         * If there are pending waiters, join them.
         */
        if (!list_empty(&lock->wait_list))
                break;

v10 pops dbench 50 up to 1800MB/s. The other tests soundly beat my
spinning and aren't less fair. But clearly this isn't a good solution.

I tried a few variations, like only checking the wait list once before
looping, which helps some. Are there other suggestions on better tuning
options?

(some thoughts/speculations)

Perhaps for highly-contended mutexes the spinning implementation may
quickly degrade [*] to the non-spinning one (i.e. the current sleep-wait
mutex) and then just stay in this state until a moment of time when there
are no waiters [**] -- i.e. list_empty(&lock->wait_list) == 1 -- and
waiters can start spinning again.

What may trigger [*]:

(1) obviously, an owner scheduling out. Even if it happens rarely
(otherwise, it's not a target scenario for our optimization), due to [**]
it may take quite some time until waiters are able to spin again.
Let's say, waiters (almost) never block (and possibly, such cases would be
better off just using a spinlock after some refactoring, if possible).

(2) need_resched() is triggered for one of the waiters.

(3) !owner && rt_task(p) -- quite unlikely, but possible (there are 2 race
windows).

Of course, the question is whether it really takes a noticeable amount of
time to get out of the [**] state. I'd imagine it can be a case for
highly-contended locks. If this is the case indeed, then which of 1, 2, 3
gets triggered the most?

Have you tried removing the need_resched() checks? So we kind of emulate
real spinlocks here.

-chris

--
Best regards,
Dmitry Adamushko
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v9][RFC] mutex: implement adaptive spinning
* Chris Mason chris.ma...@oracle.com wrote:

v10 is better than not spinning, but it's in the 5-10% range. So, I've
been trying to find ways to close the gap, just to understand exactly
where it is different.

If I take out:

        /*
         * If there are pending waiters, join them.
         */
        if (!list_empty(&lock->wait_list))
                break;

v10 pops dbench 50 up to 1800MB/s. The other tests soundly beat my
spinning and aren't less fair. But clearly this isn't a good solution.

i think since we already decided that it's ok to be somewhat unfair (_all_
batching constructs introduce unfairness, so the question is never 'should
we?' but 'by how much?'), we should just take this out and enjoy the speed ...

	Ingo
Re: [PATCH -v9][RFC] mutex: implement adaptive spinning
On Wed, 14 Jan 2009, Ingo Molnar wrote:

* Chris Mason chris.ma...@oracle.com wrote:

v10 is better than not spinning, but it's in the 5-10% range. So, I've
been trying to find ways to close the gap, just to understand exactly
where it is different.

If I take out:

        /*
         * If there are pending waiters, join them.
         */
        if (!list_empty(&lock->wait_list))
                break;

v10 pops dbench 50 up to 1800MB/s. The other tests soundly beat my
spinning and aren't less fair. But clearly this isn't a good solution.

i think since we already decided that it's ok to be somewhat unfair (_all_
batching constructs introduce unfairness, so the question is never 'should
we?' but 'by how much?'), we should just take this out and enjoy the speed

Yeah, let's just do it. It's going to be unfair, but let's see if the
unfairness is going to actually be noticeable in real life. I suspect it
isn't, and if it is, we can look at it again and see if there are other
approaches.

If it makes _that_ much of a difference on dbench, it's worth being unfair
even if it's just for some dubious benchmark.

		Linus
Re: [PATCH -v9][RFC] mutex: implement adaptive spinning
On Wed, 2009-01-14 at 07:43 -0800, Linus Torvalds wrote:
On Wed, 14 Jan 2009, Ingo Molnar wrote:

* Chris Mason chris.ma...@oracle.com wrote:

v10 is better than not spinning, but it's in the 5-10% range. So, I've
been trying to find ways to close the gap, just to understand exactly
where it is different. If I take out:

        /*
         * If there are pending waiters, join them.
         */
        if (!list_empty(&lock->wait_list))
                break;

v10 pops dbench 50 up to 1800MB/s. The other tests soundly beat my
spinning and aren't less fair. But clearly this isn't a good solution.

i think since we already decided that it's ok to be somewhat unfair (_all_
batching constructs introduce unfairness, so the question is never 'should
we?' but 'by how much?'), we should just take this out and enjoy the speed

Ok, numbers first, incremental below:

* dbench 50 (higher is better):
spin        1282MB/s
v10          548MB/s
v10 no wait 1868MB/s

* 4k creates (numbers in files/second, higher is better):
spin        avg 200.60 median 193.20 std 19.71 high 305.93 low 186.82
v10         avg 180.94 median 175.28 std 13.91 high 229.31 low 168.73
v10 no wait avg 232.18 median 222.38 std 22.91 high 314.66 low 209.12

* File stats (numbers in seconds, lower is better):
spin        2.27s
v10         5.1s
v10 no wait 1.6s

This patch brings v10 up to v10 no wait. The changes are smaller than they
look, I just moved the need_resched checks in __mutex_lock_common after
the cmpxchg.

-chris

diff --git a/kernel/mutex.c b/kernel/mutex.c
index 0d5336d..c2d47b7 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -171,12 +171,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		struct thread_info *owner;
 
 		/*
-		 * If there are pending waiters, join them.
-		 */
-		if (!list_empty(&lock->wait_list))
-			break;
-
-		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
@@ -184,6 +178,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+			lock_acquired(&lock->dep_map, ip);
+			mutex_set_owner(lock);
+			preempt_enable();
+			return 0;
+		}
+
 		/*
 		 * When there's no owner, we might have preempted between the
 		 * owner acquiring the lock and setting the owner field. If
@@ -192,14 +193,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 */
 		if (!owner && (need_resched() || rt_task(task)))
 			break;
-
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
-
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
 		 * everything in this loop to be re-loaded. We don't need
Re: [PATCH -v8][RFC] mutex: implement adaptive spinning
On Mon, 2009-01-12 at 19:32 +0200, Avi Kivity wrote:

Peter Zijlstra wrote:
Spinlocks can use 'pure' MCS locks.

How about this, then. In mutex_lock(), keep wait_lock locked and only
release it when scheduling out. Waiter spinning naturally follows. If
spinlocks are cache friendly (are they today?)

(no they're not, Nick's ticket locks still spin on a shared cacheline
IIRC -- the MCS locks mentioned could fix this)

we inherit that. If there is no contention on the mutex, then we don't
need to reacquire the wait_lock on mutex_unlock() (not that the atomic op
is that expensive these days).

That might actually work, although we'd have to move the
__mutex_slowpath_needs_to_unlock() branch outside wait_lock, otherwise
we'll deadlock :-)

It might be worth trying this if we get serious fairness issues with the
current construct.
[PATCH -v11][RFC] mutex: implement adaptive spinning
Full series, including changelogs available at:

  http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/

and should shortly appear in a git tree near Ingo :-)

  mutex: small cleanup
  mutex: preemption fixes
  mutex: implement adaptive spinning
  mutex: set owner in mutex_lock() only
  mutex: adaptive spin for debug too
  mutex: adaptive spin performance tweaks

---
 include/linux/mutex.h   |    5 +-
 include/linux/sched.h   |    2 
 kernel/mutex-debug.c    |    9 ---
 kernel/mutex-debug.h    |   18 ---
 kernel/mutex.c          |  118 
 kernel/mutex.h          |   22 
 kernel/sched.c          |   71 +++-
 kernel/sched_features.h |    1 
 8 files changed, 204 insertions(+), 42 deletions(-)

Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -10,6 +10,11 @@
  * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
  * David Howells for suggestions and improvements.
  *
+ *  - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ *    from the -rt tree, where it was originally implemented for rtmutexes
+ *    by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ *    and Sven Dietrich.
+ *
  * Also see Documentation/mutex-design.txt.
  */
 #include <linux/mutex.h>
@@ -46,6 +51,7 @@ __mutex_init(struct mutex *lock, const c
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
+	mutex_clear_owner(lock);
 
 	debug_mutex_init(lock, name, key);
 }
@@ -91,6 +97,7 @@ void inline __sched mutex_lock(struct mu
 	 * 'unlocked' into 'locked' state.
 	 */
 	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+	mutex_set_owner(lock);
 }
 
 EXPORT_SYMBOL(mutex_lock);
@@ -115,6 +122,14 @@ void __sched mutex_unlock(struct mutex *
 	 * The unlocking fastpath is the 0->1 transition from 'locked'
 	 * into 'unlocked' state:
 	 */
+#ifndef CONFIG_DEBUG_MUTEXES
+	/*
+	 * When debugging is enabled we must not clear the owner before time,
+	 * the slow path will always be taken, and that clears the owner field
+	 * after verifying that it was indeed current.
+	 */
+	mutex_clear_owner(lock);
+#endif
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
 }
 
@@ -129,21 +144,71 @@ __mutex_lock_common(struct mutex *lock, 
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
-	unsigned int old_val;
 	unsigned long flags;
 
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+#ifdef CONFIG_SMP
+	/*
+	 * Optimistic spinning.
+	 *
+	 * We try to spin for acquisition when we find that there are no
+	 * pending waiters and the lock owner is currently running on a
+	 * (different) CPU.
+	 *
+	 * The rationale is that if the lock owner is running, it is likely to
+	 * release the lock soon.
+	 *
+	 * Since this needs the lock owner, and this mutex implementation
+	 * doesn't track the owner atomically in the lock field, we need to
+	 * track it non-atomically.
+	 */
+
+	for (;;) {
+		struct thread_info *owner;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = ACCESS_ONCE(lock->owner);
+		if (owner && !mutex_spin_on_owner(lock, owner))
+			break;
+
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+			lock_acquired(&lock->dep_map, ip);
+			preempt_enable();
+			return 0;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(task)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+#endif
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
 	debug_mutex_add_waiter(lock, &waiter, task_thread_info(task));
 
 	/* add waiting tasks to the end of the waitqueue (FIFO):
[PATCH -v11 delta] mutex: implement adaptive spinning
* Linus Torvalds torva...@linux-foundation.org wrote:

If I take out:

        /*
         * If there are pending waiters, join them.
         */
        if (!list_empty(&lock->wait_list))
                break;

v10 pops dbench 50 up to 1800MB/s. The other tests soundly beat my
spinning and aren't less fair. But clearly this isn't a good solution.

i think since we already decided that it's ok to be somewhat unfair (_all_
batching constructs introduce unfairness, so the question is never 'should
we?' but 'by how much?'), we should just take this out and enjoy the speed

Yeah, let's just do it. It's going to be unfair, but let's see if the
unfairness is going to actually be noticeable in real life. I suspect it
isn't, and if it is, we can look at it again and see if there are other
approaches. If it makes _that_ much of a difference on dbench, it's worth
being unfair even if it's just for some dubious benchmark.

Below is the final v10->v11 delta, for review. Beyond the tweak from
Chris, there's small cleanups and the debug simplification from Johannes
Weiner. We think this is the final version and are doing final tests now.

	Ingo

diff --git a/kernel/mutex.c b/kernel/mutex.c
index 0d5336d..703dec2 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -148,7 +148,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	preempt_disable();
 	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-#if defined(CONFIG_SMP) && !defined(CONFIG_DEBUG_MUTEXES)
+#ifdef CONFIG_SMP
 	/*
 	 * Optimistic spinning.
 	 *
@@ -162,21 +162,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	 * Since this needs the lock owner, and this mutex implementation
 	 * doesn't track the owner atomically in the lock field, we need to
 	 * track it non-atomically.
-	 *
-	 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-	 * to serialize everything.
 	 */
 
 	for (;;) {
 		struct thread_info *owner;
 
 		/*
-		 * If there are pending waiters, join them.
-		 */
-		if (!list_empty(&lock->wait_list))
-			break;
-
-		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
@@ -184,6 +175,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+			lock_acquired(&lock->dep_map, ip);
+			preempt_enable();
+			return 0;
+		}
+
 		/*
 		 * When there's no owner, we might have preempted between the
 		 * owner acquiring the lock and setting the owner field. If
@@ -193,13 +190,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 */
 		if (!owner && (need_resched() || rt_task(task)))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
-
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
 		 * everything in this loop to be re-loaded. We don't need
@@ -209,7 +199,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		cpu_relax();
 	}
 #endif
-
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
@@ -263,7 +252,6 @@ done:
 	lock_acquired(&lock->dep_map, ip);
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, current_thread_info());
-	mutex_set_owner(lock);
 
 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
@@ -439,10 +427,8 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count)
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
-	if (likely(prev == 1)) {
-		mutex_set_owner(lock);
+	if (likely(prev == 1))
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
-	}
 
 	/* Set it back to 0 if there are no waiters: */
 	if (likely(list_empty(&lock->wait_list)))
Re: [PATCH -v11][RFC] mutex: implement adaptive spinning
On Wed, 2009-01-14 at 18:18 +0100, Nick Piggin wrote:

@@ -173,21 +237,21 @@ __mutex_lock_common(struct mutex *lock,
 		spin_unlock_mutex(&lock->wait_lock, flags);
 		debug_mutex_free_waiter(&waiter);
+		preempt_enable();
 		return -EINTR;
 	}
 	__set_task_state(task, state);
 
 	/* didnt get the lock, go to sleep: */
 	spin_unlock_mutex(&lock->wait_lock, flags);
-	schedule();
+	__schedule();

Why does this need to do a preempt-disabled schedule? After we schedule
away, the next task can do arbitrary things or reschedule itself, so if we
have not anticipated such a condition here, then I can't see what
__schedule protects. At least a comment is in order?

From:
http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/mutex-preempt.patch

Subject: mutex: preemption fixes
From: Peter Zijlstra a.p.zijls...@chello.nl
Date: Wed Jan 14 15:36:26 CET 2009

The problem is that dropping the spinlock right before schedule is a
voluntary preemption point and can cause a schedule, right after which we
schedule again. Fix this inefficiency by keeping preemption disabled until
we schedule, do this by explicitly disabling preemption and providing a
schedule() variant that assumes preemption is already disabled.

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl

Pity to add the call overhead to schedule just for this case.

Good point, seeing any way around that?

BTW. __schedule shouldn't need to be asmlinkage?

TBH I've no clue, probably not, Ingo?
Re: [PATCH -v8][RFC] mutex: implement adaptive spinning
Nick Piggin wrote:

(no they're not, Nick's ticket locks still spin on a shared cacheline
IIRC -- the MCS locks mentioned could fix this)

It reminds me. I wrote a basic variation of MCS spinlocks a while back.
And converted dcache lock to use it, which showed large dbench
improvements on a big machine (of course for different reasons than the
dbench improvements in this thread).

  http://lkml.org/lkml/2008/8/28/24

Each lock object is sane in size because a given set of spin-local queues
may only be used once per lock stack. But any spinlocks within a mutex
acquisition will always be at the bottom of such a stack anyway, by
definition.

If you can use any code or concept for your code, that would be great.

Does it make sense to replace 'nest' with a per-cpu counter that's
incremented on each lock? I guess you'd have to search for the value of
nest on unlock, but it would be a very short search (typically length 1,
2 if lock sorting is used to avoid deadlocks).

I think you'd need to make the lock store the actual node pointer, not the
cpu number, since the values of nest would be different on each cpu.

That would allow you to replace spinlocks with mcs_locks wholesale.

--
error compiling committee.c: too many arguments to function
Re: [PATCH -v9][RFC] mutex: implement adaptive spinning
2009/1/14 Chris Mason chris.ma...@oracle.com:
On Wed, 2009-01-14 at 12:18 +0100, Dmitry Adamushko wrote:
2009/1/14 Chris Mason chris.ma...@oracle.com:
On Tue, 2009-01-13 at 18:21 +0100, Peter Zijlstra wrote:
On Tue, 2009-01-13 at 08:49 -0800, Linus Torvalds wrote:
So do a v10, and ask people to test.

---
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra a.p.zijls...@chello.nl
Date: Mon Jan 12 14:01:47 CET 2009

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

I've spent a bunch of time on this one, and noticed earlier today that I
still had bits of CONFIG_FTRACE compiling. I wasn't actually tracing
anything, but it seems to have had a big performance hit. The bad news is
the simple spin got much much faster, dbench 50 coming in at 1282MB/s
instead of 580MB/s. (Other benchmarks give similar results.)

v10 is better than not spinning, but it's in the 5-10% range. So, I've
been trying to find ways to close the gap, just to understand exactly
where it is different.

If I take out:

        /*
         * If there are pending waiters, join them.
         */
        if (!list_empty(&lock->wait_list))
                break;

v10 pops dbench 50 up to 1800MB/s. The other tests soundly beat my
spinning and aren't less fair. But clearly this isn't a good solution.

I tried a few variations, like only checking the wait list once before
looping, which helps some. Are there other suggestions on better tuning
options?

(some thoughts/speculations)

Perhaps for highly-contended mutexes the spinning implementation may
quickly degrade [*] to the non-spinning one (i.e. the current sleep-wait
mutex) and then just stay in this state until a moment of time when there
are no waiters [**] -- i.e. list_empty(&lock->wait_list) == 1 -- and
waiters can start spinning again.

It is actually ok if the highly contended mutexes don't degrade as long as
they are highly contended and the holder isn't likely to schedule.

Yes, my point was that they likely do fall back (degrade) to the
wait-on-the-list (non-spinning) behavior with dbench, and that's why the
performance numbers are similar (5-10% IIRC) to those of the 'normal'
mutex. And the thing is that for the highly contended mutexes, it's not
easy to switch back to the (fast) spinning-wait -- just because most of
the time there is someone waiting for the lock (and if this 'someone' is
not a spinner, none of the new mutex_lock() requests can do
busy-waiting/spinning).

But whatever, without the list_empty() check it's not relevant any more.

What may trigger [*]:

(1) obviously, an owner scheduling out. Even if it happens rarely
(otherwise, it's not a target scenario for our optimization), due to [**]
it may take quite some time until waiters are able to spin again. Let's
say, waiters (almost) never block (and possibly, such cases would be
better off just using a spinlock after some refactoring, if possible).

(2) need_resched() is triggered for one of the waiters.

(3) !owner && rt_task(p) -- quite unlikely, but possible (there are 2 race
windows).

Of course, the question is whether it really takes a noticeable amount of
time to get out of the [**] state. I'd imagine it can be a case for
highly-contended locks. If this is the case indeed, then which of 1, 2, 3
gets triggered the most?

Sorry, I don't have stats on that.

Have you tried removing the need_resched() checks? So we kind of emulate
real spinlocks here.

Unfortunately, the need_resched() checks deal with a few of the ugly
corners. They are more important without the waiter list check. Basically
if we spun without the need_resched() checks, the process who wants to
unlock might not be able to schedule back in.

Yeah, for the owner == NULL case. If the owner was preempted in the fast
path right after taking the lock and before calling mutex_set_owner().

btw., I wonder... would an additional preempt_disable/enable() in the fast
path harm that much? We could avoid the preemption scenario above.

-chris

--
Best regards,
Dmitry Adamushko
[GIT PULL] adaptive spinning mutexes
* Peter Zijlstra pet...@infradead.org wrote:

Full series, including changelogs available at:

  http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/

and should shortly appear in a git tree near Ingo :-)

Linus,

Please pull the adaptive-mutexes-for-linus git tree from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git adaptive-mutexes-for-linus

We dropped two fresh patches from v11 for the time being: the two debug
patches, they had test failures [they triggered mutex debugging false
positives]. So this tree is v10 (which got a lot of testing already) plus
Chris's performance patch. It passes all x86 runtime tests here.

The cross build matrix looks good too:

 testing 10 architectures.
 [ syncing linus ... ]
 testing     alpha:  -git: pass (  21), -tip: pass (  21)
 testing       arm:  -git: pass (   5), -tip: pass (   5)
 testing  blackfin:  -git: pass (  20), -tip: pass (  20)
 testing      cris:  -git: pass (  32), -tip: pass (  32)
 testing      ia64:  -git: pass ( 153), -tip: pass ( 153)
 testing      m32r:  -git: pass (  21), -tip: pass (  21)
 testing      m68k:  -git: pass (  34), -tip: pass (  34)
 testing      mips:  -git: pass (   4), -tip: pass (   4)
 testing   powerpc:  -git: pass (  11), -tip: pass (  11)
 testing     sparc:  -git: pass (   0), -tip: pass (   0)

Passes all and no new warnings.
So in theory this is good enough as a pre-rc2 pull too, should you feel
tempted ;-)

	Thanks,

		Ingo

--
Chris Mason (1):
      mutex: adaptive spinning, performance tweaks

Peter Zijlstra (3):
      mutex: small cleanup
      mutex: preemption fixes
      mutex: implement adaptive spinning

 include/linux/mutex.h   |    5 +-
 include/linux/sched.h   |    2 +
 kernel/mutex-debug.c    |    9 +---
 kernel/mutex-debug.h    |   18 ---
 kernel/mutex.c          |  121 ---
 kernel/mutex.h          |   22 -
 kernel/sched.c          |   71 ++-
 kernel/sched_features.h |    1 +
 8 files changed, 209 insertions(+), 40 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 7a0e5c4..3069ec7 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
 	struct thread_info	*owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
 	const char		*name;
 	void			*magic;
 #endif
@@ -68,7 +70,6 @@ struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
 #ifdef CONFIG_DEBUG_MUTEXES
-	struct mutex		*lock;
 	void			*magic;
 #endif
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cae9b8..c34b137 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -328,7 +328,9 @@ extern signed long schedule_timeout(signed long timeout);
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+asmlinkage void __schedule(void);
 asmlinkage void schedule(void);
+extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
diff --git a/kernel/mutex-debug.c b/kernel/mutex-debug.c
index 1d94160..50d022e 100644
--- a/kernel/mutex-debug.c
+++ b/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-	lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 
 	/* Mark the current thread as blocked on the lock: */
 	ti->task->blocked_on = waiter;
-	waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -82,7 +76,7 @@ void debug_mutex_unlock(struct mutex *lock)
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
 	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	mutex_clear_owner(lock);
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +89,6 @@ void debug_mutex_init(struct mutex *lock, const char *name,
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
 	lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-	lock->owner
Re: [GIT PULL] adaptive spinning mutexes
On Wed, 2009-01-14 at 19:33 +0100, Ingo Molnar wrote:

* Peter Zijlstra pet...@infradead.org wrote:

Full series, including changelogs available at:

  http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/

and should shortly appear in a git tree near Ingo :-)

Linus,

Please pull the adaptive-mutexes-for-linus git tree from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git adaptive-mutexes-for-linus

I was going to put this into the btrfs tree, but since you have a branch
just for adaptive mutexes, is it easier to put there?

From: Chris Mason chris.ma...@oracle.com

Btrfs: stop spinning on mutex_trylock and let the adaptive code spin for us

Mutexes now spin internally and the btrfs spin is no longer required for
performance.

Signed-off-by: Chris Mason chris.ma...@oracle.com

diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 39bae77..40ba8e8 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -37,16 +37,6 @@
 int btrfs_tree_lock(struct extent_buffer *eb)
 {
-	int i;
-
-	if (mutex_trylock(&eb->mutex))
-		return 0;
-	for (i = 0; i < 512; i++) {
-		cpu_relax();
-		if (mutex_trylock(&eb->mutex))
-			return 0;
-	}
-	cpu_relax();
 	mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(eb));
 	return 0;
 }
Re: [GIT PULL] adaptive spinning mutexes
* Ingo Molnar mi...@elte.hu wrote:

Linus,

Please pull the adaptive-mutexes-for-linus git tree from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git adaptive-mutexes-for-linus

We dropped two fresh patches from v11 for the time being: the two debug
patches, they had test failures [they triggered mutex debugging false
positives]. So this tree is v10 (which got a lot of testing already) plus
Chris's performance patch. It passes all x86 runtime tests here.

Latest performance figures, on a 2-socket 16-way Nehalem test-system,
running the code above, measured via "test-mutex V 128 10" VFS
creat+unlink scalability test on tmpfs and ext3:

                     no-spin     spin
  [tmpfs] avg ops/sec: 291038   392865   (+34.9%)
  [ext3]  avg ops/sec: 283291   435674   (+53.7%)

Those numbers impress the heck out of me, rarely do we have such kind of
speedups these days, for any established functionality.

We still have the /sys/debug/sched_features tunable under
CONFIG_SCHED_DEBUG=y, so should this cause any performance regressions
somewhere, it can be pinned down and blamed back on this change easily,
without bisection and without rebooting the box.

	Ingo
Re: [GIT PULL] adaptive spinning mutexes
* Peter Zijlstra a.p.zijls...@chello.nl wrote:

On Wed, 2009-01-14 at 10:53 -0800, Andrew Morton wrote:
On Wed, 14 Jan 2009 19:33:19 +0100 Ingo Molnar mi...@elte.hu wrote:

Please pull the adaptive-mutexes-for-linus git tree

fear

- It seems a major shortcoming that the feature is disabled if
  CONFIG_DEBUG_MUTEXES=y. It means that lots of people won't test it.

Yes, that's a bit unfortunate, a simple patch to enable that is:

I must admit I'm a bit stumped on why that debug check triggers, I
couldn't reproduce today, but Ingo ran into it quite quickly :/

Yes, the debug patch caused this false positive warning on one of my
test-systems:

Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 255762
------------[ cut here ]------------
WARNING: at kernel/mutex-debug.c:77 debug_mutex_unlock+0x94/0xde()
Hardware name:
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.29-rc1-tip-00983-ge1af3bd-dirty #1
Call Trace:
 [8024f2f7] warn_slowpath+0xd8/0xf7
 [8024f5ec] ? wake_up_klogd+0x9/0x2f
 [80270fd1] ? graph_lock+0x27/0x66
 [80275329] ? validate_chain+0xd4d/0xd9f
 [802714c4] ? save_trace+0x3f/0xb2
 [80275b4f] ? __lock_acquire+0x7d4/0x832
 [80275c5f] ? lock_acquire+0xb2/0xc2
 [8025091c] ? cpu_maps_update_begin+0x17/0x19
 [80271d21] ? trace_hardirqs_off+0xd/0xf
 [80270a8e] debug_mutex_unlock+0x94/0xde
 [80906d71] __mutex_unlock_slowpath+0xdd/0x152
 [80906df4] mutex_unlock+0xe/0x10
 [80250955] cpu_maps_update_done+0x15/0x17
 [808ce8b5] register_cpu_notifier+0x2c/0x32
 [80d7683e] page_alloc_init+0x10/0x12
 [80d5ac45] start_kernel+0x1ba/0x422
 [80d5a140] ? early_idt_handler+0x0/0x73
 [80d5a2c3] x86_64_start_reservations+0xae/0xb2
 [80d5a421] x86_64_start_kernel+0x137/0x146
---[ end trace a7919e7f17c0a725 ]---
Kernel command line: root=/dev/sda1 earlyprintk=serial,ttyS0,115200 console=ttyS0,115200 console=tty 5 profile=0 debug initcall_debug apic=debug apic=verbose ignore_loglevel sysrq_always_enabled pci=nomsi
kernel profiling enabled (shift: 0)
debug: sysrq always enabled.
Initializing CPU#0 So we left that change out from this pull request. It's not a big deal i think - mutex debugging always had a different fast-path from no-debug mutexes anyway (or a lack of fast-path to begin with). So the performance characteristics were always subtly different. Now they might be more different - but we'll fix that too. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] adaptive spinning mutexes
On Wed, 2009-01-14 at 11:36 -0800, Andrew Morton wrote: Do people enable CONFIG_SCHED_DEBUG? Well, I have it always enabled, but I've honestly no idea if that makes me weird. CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty small subset? Could be, do you fancy me doing a sysctl? shouldn't be hard.
Re: [GIT PULL] adaptive spinning mutexes
* Andrew Morton a...@linux-foundation.org wrote: On Wed, 14 Jan 2009 20:00:08 +0100 Ingo Molnar mi...@elte.hu wrote: * Andrew Morton a...@linux-foundation.org wrote: On Wed, 14 Jan 2009 19:33:19 +0100 Ingo Molnar mi...@elte.hu wrote: Please pull the adaptive-mutexes-for-linus git tree fear - It seems a major shortcoming that the feature is disabled if CONFIG_DEBUG_MUTEXES=y. It means that lots of people won't test it. ^^^? - When people hit performance/latency oddities, it would be nice if they had a /proc knob with which they could disable this feature at runtime. This would also be useful for comparative performance testing. Yeah. From my other mail: We still have the /sys/debug/sched_features tunable under CONFIG_SCHED_DEBUG=y, so should this cause any performance regressions somewhere, it can be pinned down and blamed back on this change easily, without bisection and without rebooting the box. This kind of easy knob was included early on - this is how all those spin versus no-spin numbers were done. Do people enable CONFIG_SCHED_DEBUG? If they suspect performance problems and want to analyze them? Note that CONFIG_SCHED_DEBUG=y is also the default. CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty small subset? Those two are the default config settings actually, so i'd expect it to be the most commonly occurring combination. Ingo
Re: [GIT PULL] adaptive spinning mutexes
* Peter Zijlstra a.p.zijls...@chello.nl wrote: On Wed, 2009-01-14 at 11:36 -0800, Andrew Morton wrote: Do people enable CONFIG_SCHED_DEBUG? Well, I have it always enabled, but I've honestly no idea if that makes me weird. CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty small subset? Could be, do you fancy me doing a sysctl? shouldn't be hard. i dunno, why another fancy sysctl for something that fits quite nicely into the existing sched_features scheme that we've been using for such purposes for the past 3-4 kernel releases? we always provided various toggles for new scheduler features via /sys/debug/sched_features, so that people can do performance regression testing, and it works quite well. Ingo
Re: [GIT PULL] adaptive spinning mutexes
On Wed, 14 Jan 2009 21:14:35 +0100 Ingo Molnar mi...@elte.hu wrote: * Andrew Morton a...@linux-foundation.org wrote: On Wed, 14 Jan 2009 20:00:08 +0100 Ingo Molnar mi...@elte.hu wrote: * Andrew Morton a...@linux-foundation.org wrote: On Wed, 14 Jan 2009 19:33:19 +0100 Ingo Molnar mi...@elte.hu wrote: Please pull the adaptive-mutexes-for-linus git tree fear - It seems a major shortcoming that the feature is disabled if CONFIG_DEBUG_MUTEXES=y. It means that lots of people won't test it. ^^^? - When people hit performance/latency oddities, it would be nice if they had a /proc knob with which they could disable this feature at runtime. This would also be useful for comparative performance testing. Yeah. From my other mail: We still have the /sys/debug/sched_features tunable under CONFIG_SCHED_DEBUG=y, so should this cause any performance regressions somewhere, it can be pinned down and blamed back on this change easily, without bisection and without rebooting the box. This kind of easy knob was included early on - this is how all those spin versus no-spin numbers were done. Do people enable CONFIG_SCHED_DEBUG? If they suspect performance problems and want to analyze them? The vast majority of users do not and usually cannot compile their own kernels. Note that CONFIG_SCHED_DEBUG=y is also the default.

akpm:/usr/src/25> echo $ARCH
x86_64
akpm:/usr/src/25> make defconfig
*** Default configuration is based on 'x86_64_defconfig'
#
# configuration written to .config
#
akpm:/usr/src/25> grep SCHED_DEBUG .config
# CONFIG_SCHED_DEBUG is not set

CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty small subset? Those two are the default config settings actually, so i'd expect it to be the most commonly occurring combination.
akpm:/usr/src/25> grep DEBUG_MUTEXES .config
# CONFIG_DEBUG_MUTEXES is not set
Re: [GIT PULL] adaptive spinning mutexes
On Wed, 14 Jan 2009 21:27:36 +0100 Ingo Molnar mi...@elte.hu wrote: * Peter Zijlstra a.p.zijls...@chello.nl wrote: On Wed, 2009-01-14 at 11:36 -0800, Andrew Morton wrote: Do people enable CONFIG_SCHED_DEBUG? Well, I have it always enabled, but I've honestly no idea if that makes me weird. CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty small subset? Could be, do you fancy me doing a sysctl? shouldn't be hard. i dunno, why another fancy sysctl for something that fits quite nicely into the existing sched_features scheme that we've been using for such purposes for the past 3-4 kernel releases? we always provided various toggles for new scheduler features via /sys/debug/sched_features, so that people can do performance regression testing, and it works quite well. If we know that this control will be reliably available in packaged kernels then fine. But how do we arrange for that?
Re: [GIT PULL] adaptive spinning mutexes
* Andrew Morton a...@linux-foundation.org wrote: Do people enable CONFIG_SCHED_DEBUG? If they suspect performance problems and want to analyze them? The vast majority of users do not and usually cannot compile their own kernels. ... which they derive from distro kernels or some old .config they always used, via 'make oldconfig'. You are arguing against well-established facts here. If you don't believe my word for it, here's an analysis of all kernel configs posted to lkml in the past 8 months:

$ grep ^CONFIG_SCHED_DEBUG linux-kernel | wc -l
424
$ grep 'CONFIG_SCHED_DEBUG is not' linux-kernel | wc -l
109

i.e. CONFIG_SCHED_DEBUG=y is set in 80% of the configs. A large majority of testers have it enabled, and /sys/debug/sched_features was always a good mechanism that we used for runtime toggles. Note that CONFIG_SCHED_DEBUG=y is also the default.

akpm:/usr/src/25> echo $ARCH
x86_64
akpm:/usr/src/25> make defconfig
*** Default configuration is based on 'x86_64_defconfig'

x86 defconfig is used too, but it's a pretty rare usage. Under default i mean the customary meaning of default config: it's the default if you come via 'make oldconfig' or if you derive your config from a distro config:

| config SCHED_DEBUG
|	bool "Collect scheduler debugging info"
|	depends on DEBUG_KERNEL && PROC_FS
|	default y

Ingo
Re: [GIT PULL] adaptive spinning mutexes
On Wed, 14 Jan 2009 21:51:22 +0100 Ingo Molnar mi...@elte.hu wrote: * Andrew Morton a...@linux-foundation.org wrote: Do people enable CONFIG_SCHED_DEBUG? If they suspect performance problems and want to analyze them? The vast majority of users do not and usually cannot compile their own kernels. ... which they derive from distro kernels or some old .config they always used, via 'make oldconfig'. You are arguing against well-established facts here. If you don't believe my word for it, here's an analysis of all kernel configs posted to lkml in the past 8 months:

$ grep ^CONFIG_SCHED_DEBUG linux-kernel | wc -l
424
$ grep 'CONFIG_SCHED_DEBUG is not' linux-kernel | wc -l
109

i.e. CONFIG_SCHED_DEBUG=y is set in 80% of the configs. A large majority of testers have it enabled, and /sys/debug/sched_features was always a good mechanism that we used for runtime toggles. You just disproved your own case :( Note that CONFIG_SCHED_DEBUG=y is also the default.

akpm:/usr/src/25> echo $ARCH
x86_64
akpm:/usr/src/25> make defconfig
*** Default configuration is based on 'x86_64_defconfig'

x86 defconfig is used too, but it's a pretty rare usage. Under default i mean the customary meaning of default config: it's the default if you come via 'make oldconfig' or if you derive your config from a distro config:

| config SCHED_DEBUG
|	bool "Collect scheduler debugging info"
|	depends on DEBUG_KERNEL && PROC_FS
|	default y

This simply isn't reliable.
Re: [GIT PULL] adaptive spinning mutexes
On Wed, 14 Jan 2009 22:14:58 +0100 Ingo Molnar mi...@elte.hu wrote: * Andrew Morton a...@linux-foundation.org wrote: On Wed, 14 Jan 2009 21:51:22 +0100 Ingo Molnar mi...@elte.hu wrote: * Andrew Morton a...@linux-foundation.org wrote: Do people enable CONFIG_SCHED_DEBUG? If they suspect performance problems and want to analyze them? The vast majority of users do not and usually cannot compile their own kernels. ... which they derive from distro kernels or some old .config they always used, via 'make oldconfig'. You are arguing against well-established facts here. If you dont believe my word for it, here's an analysis of all kernel configs posted to lkml in the past 8 months: $ grep ^CONFIG_SCHED_DEBUG linux-kernel | wc -l 424 $ grep 'CONFIG_SCHED_DEBUG is not' linux-kernel | wc -l 109 i.e. CONFIG_SCHED_DEBUG=y is set in 80% of the configs. A large majority of testers has it enabled and /sys/debug/sched_features was always a good mechanism that we used for runtime toggles. You just disproved your own case :( how so? 80% is not enough? No. It really depends on what distros do. I also checked Fedora and it has SCHED_DEBUG=y in its kernel rpms. If all distros set SCHED_DEBUG=y then fine. But if they do this then we should do this at the kernel.org level, and make it a hard-to-turn-off thing via CONFIG_EMBEDDED=y. note that there's also a performance issue here: we generally _dont want_ a debug sysctl overhead in the mutex code or in any fastpath for that matter. So making it depend on SCHED_DEBUG is useful. sched_feat() features get optimized out at build time when SCHED_DEBUG is disabled. So it gives us the best of two worlds: the utility of sysctls in the SCHED_DEBUG=y, and they get compiled out in the !SCHED_DEBUG case. I'm not detecting here a sufficient appreciation of the number of sched-related regressions we've seen in recent years, nor of the difficulty encountered in diagnosing and fixing them. 
Let alone the difficulty getting those fixes propagated out a *long* time after the regression was added. You're taking a whizzy new feature which drastically changes a critical core kernel feature and jamming it into mainline with a vestigial amount of testing coverage without giving sufficient care and thought to the practical lessons which we have learned from doing this in the past. This is a highly risky change. It's not that the probability of failure is high - the problem is that the *cost* of the improbable failure is high. We should seek to minimize that cost.
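The sched_feat() build-time elimination Ingo describes can be sketched in userspace C. This is a hedged illustration, loosely modeled on the kernel/sched_features.h pattern; the names (`__FEAT_OWNER_SPIN`, `owner_spin_enabled`) are illustrative, not the kernel's exact symbols. The point: with the debug config off, the feature word is a compile-time constant, so disabled-feature branches fold away and the fastpath pays nothing for the knob.

```c
#include <assert.h>

enum { __FEAT_OWNER_SPIN };             /* feature bit indices */

#ifdef CONFIG_SCHED_DEBUG
/* debug build: a runtime-writable word, toggled via debugfs */
unsigned int sysctl_sched_features = 1u << __FEAT_OWNER_SPIN;
#define sched_feat(x) (sysctl_sched_features & (1u << __FEAT_##x))
#else
/* non-debug build: a compile-time constant -- the compiler deletes
 * any `if (!sched_feat(...))` branch that can never be taken */
enum { sysctl_sched_features = 1u << __FEAT_OWNER_SPIN };
#define sched_feat(x) (sysctl_sched_features & (1u << __FEAT_##x))
#endif

/* hypothetical fastpath gate, as the adaptive-spin code might use it */
int owner_spin_enabled(void)
{
    return sched_feat(OWNER_SPIN) != 0;
}
```

Compiled without CONFIG_SCHED_DEBUG, `owner_spin_enabled()` reduces to `return 1;`, which is why the toggle costs nothing in production builds.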
Re: [GIT PULL] adaptive spinning mutexes
* Ingo Molnar mi...@elte.hu wrote: You just disproved your own case :( how so? 80% is not enough? I also checked Fedora and it has SCHED_DEBUG=y in its kernel rpms. Ubuntu has CONFIG_SCHED_DEBUG=y as well in their kernels. note that there's also a performance issue here: we generally _dont want_ a debug sysctl overhead in the mutex code or in any fastpath for that matter. So making it depend on SCHED_DEBUG is useful. sched_feat() features get optimized out at build time when SCHED_DEBUG is disabled. So it gives us the best of two worlds: the utility of sysctls in the SCHED_DEBUG=y case, and they get compiled out in the !SCHED_DEBUG case. There's a third issue as well: the toggle _is_ intentionally debug-only, while sysctls are non-debug, and we _really_ don't want feature asymmetry like that. It would just splinter the application space: if there are significant performance variances then apps will just go the path of least resistance: instead of debugging the performance problems properly, the first group of applications will be tuned for the sysctl_mutex_spin=0 case, the second group of applications will be tuned for the sysctl_mutex_spin=1 case. And if an enterprise distro decided to flip the default around we'd have a real tuning mess. /sys/debug/sched_features strikes the right kind of balance IMO: it's not always available and it's explicitly in debugfs with no stable ABI, so apps cannot standardize on being able to tweak it, but it's convenient enough in practice for developers to depend on it, performance analysis is easy, etc., etc. Ingo
Re: [GIT PULL] adaptive spinning mutexes
On Wed, Jan 14, 2009 at 22:41, Ingo Molnar mi...@elte.hu wrote: * Ingo Molnar mi...@elte.hu wrote: You just disproved your own case :( how so? 80% is not enough? I also checked Fedora and it has SCHED_DEBUG=y in its kernel rpms. Ubuntu has CONFIG_SCHED_DEBUG=y as well in their kernels.

$ cat /etc/SuSE-release
openSUSE 11.1 (x86_64)
VERSION = 11.1
$ uname -a
Linux nga 2.6.27.7-9-default #1 SMP 2008-12-04 18:10:04 +0100 x86_64 x86_64 x86_64 GNU/Linux
$ zgrep SCHED_DEBUG /proc/config.gz
CONFIG_SCHED_DEBUG=y
$ zgrep DEBUG_MUTEX /proc/config.gz
# CONFIG_DEBUG_MUTEXES is not set

Kay
Re: [GIT PULL] adaptive spinning mutexes
* Kay Sievers kay.siev...@vrfy.org wrote: On Wed, Jan 14, 2009 at 22:41, Ingo Molnar mi...@elte.hu wrote: * Ingo Molnar mi...@elte.hu wrote: You just disproved your own case :( how so? 80% is not enough? I also checked Fedora and it has SCHED_DEBUG=y in its kernel rpms. Ubuntu has CONFIG_SCHED_DEBUG=y as well in their kernels.

$ cat /etc/SuSE-release
openSUSE 11.1 (x86_64)
VERSION = 11.1
$ uname -a
Linux nga 2.6.27.7-9-default #1 SMP 2008-12-04 18:10:04 +0100 x86_64 x86_64 x86_64 GNU/Linux
$ zgrep SCHED_DEBUG /proc/config.gz
CONFIG_SCHED_DEBUG=y
$ zgrep DEBUG_MUTEX /proc/config.gz
# CONFIG_DEBUG_MUTEXES is not set

Fedora has mutex debugging disabled too:

# CONFIG_DEBUG_MUTEXES is not set

So the 3 main Linux distros, generating ~95% of the kerneloops.org feedback traffic, all have SCHED_DEBUG=y, and at least two have !DEBUG_MUTEXES. (possibly Ubuntu has it disabled too) Ingo
Re: [GIT PULL] adaptive spinning mutexes
* Andrew Morton a...@linux-foundation.org wrote: I also checked Fedora and it has SCHED_DEBUG=y in its kernel rpms. If all distros set SCHED_DEBUG=y then fine. 95% of the distros and a significant majority of the lkml traffic. And no, we generally don't provide knobs for essential performance features of core Linux kernel primitives - so the existence of OWNER_SPIN in /sys/debug/sched_features is an exception already. We don't have any knob to switch ticket spinlocks to old-style spinlocks. We don't have any knob to switch the page allocator from LIFO to FIFO. We don't have any knob to turn off the coalescing of vmas in the MM. We don't have any knob to turn the mmap_sem from an rwsem to a mutex to a spinlock. Why? Because such design and implementation details are what make Linux Linux, and we stand by those decisions for better or worse. And we do try to eliminate as many 'worse' situations as possible, but we don't provide knobs galore. We offer flexibility in our willingness to fix any genuine performance issues in our source code. The thing is that apps tend to gravitate towards solutions with the least short-term cost. If a super important enterprise app can solve their performance problem by either redesigning their broken code, or by turning off a feature we have in the kernel in their install scripts (which we made so easy to tune via a stable sysctl), guess which variant they will choose? Even if they hurt all other apps in the process. note that there's also a performance issue here: we generally _dont want_ a debug sysctl overhead in the mutex code or in any fastpath for that matter. So making it depend on SCHED_DEBUG is useful. sched_feat() features get optimized out at build time when SCHED_DEBUG is disabled. So it gives us the best of two worlds: the utility of sysctls in the SCHED_DEBUG=y case, and they get compiled out in the !SCHED_DEBUG case.
I'm not detecting here a sufficient appreciation of the number of sched-related regressions we've seen in recent years, nor of the difficulty encountered in diagnosing and fixing them. Let alone the difficulty getting those fixes propagated out a *long* time after the regression was added. The bugzilla you just dug out in another thread does not seem to apply, so i'm not sure what you are referring to. Regarding historic tendencies, we have numbers like:

                   [v2.6.14]  [v2.6.29]
                  Semaphores | Mutexes
                             | no-spin   spin
  [tmpfs] ops/sec:    50713  |  291038  392865  (+34.9%)
  [ext3]  ops/sec:    45214  |  283291  435674  (+53.7%)

10x performance improvement on ext3, compared to 2.6.14. I'm sure there will be other numbers that go down - but the thing is, we've _never_ been good at finding the worst-possible workload cases during development. You're taking a whizzy new feature which drastically changes a critical core kernel feature and jamming it into mainline with a vestigial amount of testing coverage without giving sufficient care and thought to the practical lessons which we have learned from doing this in the past. If you look at the whole existence of /sys/debug/sched_features you'll see how careful we've been about performance regressions. We made it a sched_feat() exactly because if a change goes wrong and becomes a step backwards then it's a oneliner to turn it default-off. We made use of that facility in the past and we have a number of debug knobs there right now:

# cat /debug/sched_features
NEW_FAIR_SLEEPERS NORMALIZED_SLEEPER WAKEUP_PREEMPT START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK NO_DOUBLE_TICK ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD NO_WAKEUP_OVERLAP LAST_BUDDY OWNER_SPIN

All of those ~16 scheduler knobs were done out of caution, to make sure that if we change some scheduling aspect there's a convenient way to debug performance or interactivity regressions, without forcing people into bisection and/or reboots, etc.
This is a highly risky change. It's not that the probability of failure is high - the problem is that the *cost* of the improbable failure is high. We should seek to minimize that cost. It never mattered much to the efficiency of finding performance regressions whether a feature sat tight for 4 kernel releases in -mm or went upstream in a week. It _does_ matter to stability - but not significantly to performance. What matters most to getting performance right is testing exposure and variance, not length of the integration period. Easy revertability helps too - and that is a given here - it's literally a oneliner to disable it. See that oneliner below. Ingo Index: linux/kernel/sched_features.h === ---
Re: [PATCH -v11][RFC] mutex: implement adaptive spinning
On Wed, Jan 14, 2009 at 06:22:36PM +0100, Peter Zijlstra wrote: On Wed, 2009-01-14 at 18:18 +0100, Nick Piggin wrote:

@@ -173,21 +237,21 @@ __mutex_lock_common(struct mutex *lock,
 			spin_unlock_mutex(&lock->wait_lock, flags);
 			debug_mutex_free_waiter(&waiter);
+			preempt_enable();
 			return -EINTR;
 		}
 		__set_task_state(task, state);

 		/* didnt get the lock, go to sleep: */
 		spin_unlock_mutex(&lock->wait_lock, flags);
-		schedule();
+		__schedule();

Why does this need to do a preempt-disabled schedule? After we schedule away, the next task can do arbitrary things or reschedule itself, so if we have not anticipated such a condition here, then I can't see what __schedule protects. At least a comment is in order? From: http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/mutex-preempt.patch Subject: mutex: preemption fixes From: Peter Zijlstra a.p.zijls...@chello.nl Date: Wed Jan 14 15:36:26 CET 2009 The problem is that dropping the spinlock right before schedule is a voluntary preemption point and can cause a schedule, right after which we schedule again. Fix this inefficiency by keeping preemption disabled until we schedule; do this by explicitly disabling preemption and providing a schedule() variant that assumes preemption is already disabled. Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl Pity to add the call overhead to schedule just for this case. Good point, seeing any way around that? Hmm, well this is rather a slow path, I would say. I'd prefer not to modify schedule in this way (if we just get scheduled back on after being switched away, the subsequent call to schedule is going to be cache hot and not do too much work). preempt_enable_no_resched() maybe, if you really care; that would close up the window even more. But is it really worthwhile? We'd want to see numbers (when in doubt, keep it simpler). BTW. __schedule shouldn't need to be asmlinkage? TBH I've no clue, probably not, Ingo?
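The alternative Nick suggests can be sketched in kernel-style pseudocode (not runnable outside the kernel; the surrounding slowpath is abbreviated, and the exact placement is an assumption based on the -v11 hunk above). `preempt_enable_no_resched()` is the real kernel primitive: it re-enables preemption without checking for a pending reschedule, so the window between dropping the wait_lock and calling schedule() is no longer a voluntary preemption point:

```
	/* sketch: mutex slowpath tail, preemption disabled at mutex_lock() entry */
	spin_unlock_mutex(&lock->wait_lock, flags);

	/*
	 * Re-enable preemption but do NOT run the reschedule check here:
	 * we are about to schedule() anyway, so an extra preemption between
	 * the unlock and the schedule() would just be a wasted round trip.
	 */
	preempt_enable_no_resched();
	schedule();
```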
Re: [PATCH -v8][RFC] mutex: implement adaptive spinning
On Wed, Jan 14, 2009 at 07:23:12PM +0200, Avi Kivity wrote: Nick Piggin wrote: (no they're not, Nick's ticket locks still spin on a shared cacheline IIRC -- the MCS locks mentioned could fix this) It reminds me. I wrote a basic variation of MCS spinlocks a while back. And converted the dcache lock to use it, which showed large dbench improvements on a big machine (of course for different reasons than the dbench improvements in this thread). http://lkml.org/lkml/2008/8/28/24 Each lock object is sane in size because a given set of spin-local queues may only be used once per lock stack. But any spinlocks within a mutex acquisition will always be at the bottom of such a stack anyway, by definition. If you can use any code or concept for your code, that would be great. Does it make sense to replace 'nest' with a per-cpu counter that's incremented on each lock? I guess you'd have to search for the value of nest on unlock, but it would be a very short search (typically length 1, 2 if lock sorting is used to avoid deadlocks). I think you'd need to make the lock store the actual node pointer, not the cpu number, since the values of nest would be different on each cpu. That would allow you to replace spinlocks with mcs_locks wholesale. nest could get quite large (and basically is unbounded), though. I think there would definitely be variations (NMCS locks are interesting). But OTOH I think that for _most_ spinlocks, optimising for the spinning case is wrong. For some really tricky global spinlocks, yes, I think MCS is a good idea. But more so as a stopgap until hopefully more scalable algorithms are written.
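The MCS scheme being referenced can be sketched in userspace C11 atomics. This is a minimal illustration, not Nick's patch (his kernel version adds the per-CPU queue nodes and the 'nest' bookkeeping discussed above); the `mcs_node`/`mcs_lock` names are mine. The key property is exactly the one noted in the parenthetical: each waiter spins on its *own* node's flag, a private cacheline, rather than on the shared lock word:

```c
#include <stdatomic.h>
#include <pthread.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_int locked;
};

typedef _Atomic(struct mcs_node *) mcs_lock_t;   /* tail pointer; NULL = free */

void mcs_lock(mcs_lock_t *lock, struct mcs_node *me)
{
    struct mcs_node *pred;

    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, 1, memory_order_relaxed);

    /* enqueue at the tail; pred is the previous tail (or NULL if free) */
    pred = atomic_exchange_explicit(lock, me, memory_order_acq_rel);
    if (pred) {
        /* link in behind pred, then spin on our own node only */
        atomic_store_explicit(&pred->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }
}

void mcs_unlock(mcs_lock_t *lock, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);

    if (!succ) {
        /* no known successor: try to swing the tail back to empty */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(lock, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* a successor is mid-enqueue; wait for it to link itself in */
        while (!(succ = atomic_load_explicit(&me->next, memory_order_acquire)))
            ;
    }
    atomic_store_explicit(&succ->locked, 0, memory_order_release);
}
```

The node lives on the acquirer's stack for the duration of the critical section, which is why a given set of nodes "may only be used once per lock stack" as the mail puts it: each nested acquisition needs its own node, but a node is free for reuse as soon as its unlock completes.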
Re: [GIT PULL] adaptive spinning mutexes
On Wed, Jan 14, 2009 at 01:35:29PM -0800, Andrew Morton wrote: You're taking a whizzy new feature which drastically changes a critical core kernel feature and jamming it into mainline with a vestigial amount of testing coverage without giving sufficient care and thought to the practical lessons which we have learned from doing this in the past. This is a highly risky change. It's not that the probability of failure is high - the problem is that the *cost* of the improbable failure is high. We should seek to minimize that cost. There is very little downside to waiting for at least the next release cycle. What's the case for making an exception and merging it right now? It actually still seems to be generating a lot of changes and discussion right up until yesterday...
Re: Warning and BUG with btrfs and corrupted image
On Tue, 2009-01-13 at 15:43 +0100, Eric Sesterhenn wrote: * Chris Mason (chris.ma...@oracle.com) wrote: On Tue, 2009-01-13 at 15:21 +0100, Eric Sesterhenn wrote: Hi, when mounting an intentionally corrupted btrfs filesystem i get the following warning and bug message. The image can be found here www.cccmz.de/~snakebyte/btrfs.2.img.bck.bz2 Thanks for looking at things Aside from catching checksumming errors, we're not quite ready for fuzzer style attacks. The code will be hardened for this but it isn't yet. Does this mean i should stop trying to break it for now or are you interested in further reports? For now, you'll be able to break it at will ;) We'll send out a 'please send us corrupt images' notice when we've hardened things. -chris
Re: [btrfs-progs 1/4] Add man/mkfs.btrfs.8.in
On Tue, 2009-01-13 at 18:18 +0530, Goldwyn Rodrigues wrote: Add man/mkfs.btrfs.8.in Kept the file name with the .in suffix, so that further processing such as BUILD_DATE, BUILD_VERSION etc. could be included later. All man pages included in the man directory to avoid file cluttering. Thanks for starting these! I'll queue them up. -chris
[PATCH] BTRFS: Mark '__init' for btrfs_init_cachep, btrfs_init_sysfs,btrfs_interface_init
These functions are only called by 'static int __init init_btrfs_fs(void)', so also mark them as '__init'. Signed-off-by: Qinghuang Feng qhfeng.ker...@gmail.com
---
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eee060f..7e03ec8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2045,7 +2045,7 @@ int btrfs_write_inode(struct inode *inode, int wait);
 void btrfs_dirty_inode(struct inode *inode);
 struct inode *btrfs_alloc_inode(struct super_block *sb);
 void btrfs_destroy_inode(struct inode *inode);
-int btrfs_init_cachep(void);
+int __init btrfs_init_cachep(void);
 void btrfs_destroy_cachep(void);
 long btrfs_ioctl_trans_end(struct file *file);
 struct inode *btrfs_ilookup(struct super_block *s, u64 objectid,
@@ -2089,7 +2089,7 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
 			struct btrfs_root *root, int cache_only);
 
 /* sysfs.c */
-int btrfs_init_sysfs(void);
+int __init btrfs_init_sysfs(void);
 void btrfs_exit_sysfs(void);
 int btrfs_sysfs_add_super(struct btrfs_fs_info *fs);
 int btrfs_sysfs_add_root(struct btrfs_root *root);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8adfe05..7170c4a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4521,7 +4521,7 @@ struct kmem_cache *btrfs_cache_create(const char *name, size_t size,
 				SLAB_MEM_SPREAD | extra_flags), ctor);
 }
 
-int btrfs_init_cachep(void)
+int __init btrfs_init_cachep(void)
 {
 	btrfs_inode_cachep = btrfs_cache_create("btrfs_inode_cache",
 					sizeof(struct btrfs_inode),
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 0a14b49..0a3fc0d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -649,7 +649,7 @@ static struct miscdevice btrfs_misc = {
 	.fops		= &btrfs_ctl_fops
 };
 
-static int btrfs_interface_init(void)
+static int __init btrfs_interface_init(void)
 {
 	return misc_register(&btrfs_misc);
 }
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a240b6f..bd73c97 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -254,7 +254,7 @@ void btrfs_sysfs_del_super(struct btrfs_fs_info *fs)
 	wait_for_completion(&fs->kobj_unregister);
 }
 
-int btrfs_init_sysfs(void)
+int __init btrfs_init_sysfs(void)
 {
 	btrfs_kset = kset_create_and_add("btrfs", NULL, fs_kobj);
 	if (!btrfs_kset)
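For context on what the annotation buys: in the kernel, `__init` expands to a section attribute that places the function in `.init.text`, which is discarded once boot completes, so marking boot-only functions frees their memory. The userspace sketch below mimics the mechanism with a custom section; the `__demo_init` macro, section name, and `demo_interface_init` function are illustrative stand-ins, not kernel code:

```c
#include <assert.h>

/* Userspace analogue of the kernel's __init: emit the function body into
 * a dedicated linker section, the way the kernel groups boot-only code
 * into .init.text so it can be freed after initialization. */
#define __demo_init __attribute__((__section__(".demo.init.text")))

static int __demo_init demo_interface_init(void)
{
    /* stand-in for e.g. btrfs_interface_init(); runs once at "boot" */
    return 0;
}

int demo_boot(void)
{
    return demo_interface_init();
}
```

The patch itself is safe precisely because `init_btrfs_fs()` is the only caller and is itself `__init`: nothing can reach these functions after the init sections are released.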
Re: [PATCH -v11][RFC] mutex: implement adaptive spinning
On Thu, 2009-01-15 at 01:46 +0100, Nick Piggin wrote: Hmm, well this is rather a slow path, I would say. I'd prefer not to modify schedule in this way (if we just get scheduled back on after being switched away, the subsequent call to schedule is going to be cache hot and not do too much work). preempt_enable_noresched maybe if you really care, would close up the window even more. But is it really worthwhile? We'd want to see numbers (when in doubt, keep it simpler). I initially did the preempt_enable_no_resched() thing and that showed some improvement for PREEMPT=y kernels (lost the numbers though). When I redid all the patches I tried closing that last hole by doing that __schedule() thing, never realizing that schedule() would then get extra overhead,.. d'0h. I agree that that isn't worth it. I shall revert to preempt_enable_no_resched() and try to get some new numbers.
Re: [PATCH -v11][RFC] mutex: implement adaptive spinning
On Thu, Jan 15, 2009 at 08:44:03AM +0100, Peter Zijlstra wrote: On Thu, 2009-01-15 at 01:46 +0100, Nick Piggin wrote: Hmm, well this is rather a slow path, I would say. I'd prefer not to modify schedule in this way (if we just get scheduled back on after being switched away, the subsequent call to schedule is going to be cache hot and not do too much work). preempt_enable_noresched maybe if you really care, would close up the window even more. But is it really worthwhile? We'd want to see numbers (when in doubt, keep it simpler). I initially did the preempt_enable_no_resched() thing and that showed some improvement for PREEMPT=y kernels (lost the numbers though). When I redid all the patches I tried closing that last hole by doing that __schedule() thing, never realizing that schedule() would then get extra overhead,.. d'0h. I agree that that isn't worth it. I shall revert to preempt_enable_no_resched() and try to get some new numbers. Thanks!