Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Hey, Peter.

On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
> > Here's the combined patch I was planning on testing but didn't get
> > to (yet).  It implements two things - hard limit on spin duration
> > and early break if the owner also is spinning on a mutex.
>
> This is going to give massive conflicts with
>
>   https://lkml.org/lkml/2011/3/2/286
>   https://lkml.org/lkml/2011/3/2/282
>
> which I was planning to stuff into .40

I see.  Adapting shouldn't be hard.  The patch is in proof-of-concept
stage anyway.

> > + * Forward progress is guaranteed regardless of locking ordering by never
> > + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
> > + * mutex_trylock(), which doesn't have to follow the usual locking
> > + * ordering, also uses this function.
>
> While that puts a limit on things it'll still waste time.  I'd much
> rather pass a trylock argument to mutex_spin_on_owner() and then bail
> on owner also spinning.

Do we guarantee or enforce that the lock ownership can't be transferred
to a different task?  If we do, the recursive spinning detection is
enough to guarantee forward progress.

> > +		if (task_thread_info(rq->curr) != owner ||
> > +		    rq->spinning_on_mutex || need_resched() ||
> > +		    local_clock() > start + MAX_MUTEX_SPIN_NS) {
>
> While we did our best with making local_clock() cheap, I'm still
> fairly uncomfortable with putting it in such a tight loop.

That's one thing I didn't really understand.  It seems the spinning
code tried to be light on CPU cycle usage, but we're wasting CPU cycles
there anyway.  If the spinning can become smarter using some CPU
cycles, isn't that a gain?  Why is the spinning code optimizing for
less CPU cycles?

Also, it would be great if you can share what you're using for locking
performance measurements.  My attempts with dbench didn't end too
well. :-(

Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, 2011-03-30 at 10:17 +0200, Tejun Heo wrote:
> Hey, Peter.
>
> On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
> > > Here's the combined patch I was planning on testing but didn't
> > > get to (yet).  It implements two things - hard limit on spin
> > > duration and early break if the owner also is spinning on a
> > > mutex.
> >
> > This is going to give massive conflicts with
> >
> >   https://lkml.org/lkml/2011/3/2/286
> >   https://lkml.org/lkml/2011/3/2/282
> >
> > which I was planning to stuff into .40
>
> I see.  Adapting shouldn't be hard.  The patch is in proof-of-concept
> stage anyway.
>
> > > + * Forward progress is guaranteed regardless of locking ordering by never
> > > + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
> > > + * mutex_trylock(), which doesn't have to follow the usual locking
> > > + * ordering, also uses this function.
> >
> > While that puts a limit on things it'll still waste time.  I'd much
> > rather pass a trylock argument to mutex_spin_on_owner() and then
> > bail on owner also spinning.
>
> Do we guarantee or enforce that the lock ownership can't be
> transferred to a different task?  If we do, the recursive spinning
> detection is enough to guarantee forward progress.

The only way to switch owner is for the current owner to release and a
new owner to acquire the lock.  Also we already bail the spin loop when
owner changes.

> > > +		if (task_thread_info(rq->curr) != owner ||
> > > +		    rq->spinning_on_mutex || need_resched() ||
> > > +		    local_clock() > start + MAX_MUTEX_SPIN_NS) {
> >
> > While we did our best with making local_clock() cheap, I'm still
> > fairly uncomfortable with putting it in such a tight loop.
>
> That's one thing I didn't really understand.  It seems the spinning
> code tried to be light on CPU cycle usage, but we're wasting CPU
> cycles there anyway.  If the spinning can become smarter using some
> CPU cycles, isn't that a gain?  Why is the spinning code optimizing
> for less CPU cycles?
Loop exit latency mostly, the lighter the loop, the faster you're
through, the less time you waste once the condition is true, but yeah,
what you say is true too, it's a balance game as always.

> Also, it would be great if you can share what you're using for
> locking performance measurements.  My attempts with dbench didn't end
> too well. :-(

The thing I used for the initial implementation is mentioned in the
changelog (0d66bf6d3):

    Testing with Ingo's test-mutex application
    (http://lkml.org/lkml/2006/1/8/50) gave a 345% boost for VFS
    scalability on my testbox:

     # ./test-mutex-shm V 16 10 | grep "^avg ops"
     avg ops/sec:               296604

     # ./test-mutex-shm V 16 10 | grep "^avg ops"
     avg ops/sec:               85870

I've no idea how heavy that is on trylock though, you'd have to look
at that.  Chris did some improvements using dbench in ac6e60ee4 but
clearly that isn't working for you.
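The "owner also spinning" break that the patch implements via a per-runqueue rq->spinning_on_mutex flag can be modeled in userspace. Below is a minimal sketch with hypothetical names (`struct spinner_state`, `worth_spinning()`); it only illustrates the flag protocol, not the kernel's speculative owner access:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace model of the recursive-spin detection discussed above:
 * each potential owner advertises whether it is itself spinning on
 * some mutex, and a would-be spinner refuses to spin on an owner
 * that is spinning in turn (it goes to sleep instead).
 */
struct spinner_state {
	atomic_bool spinning_on_mutex;	/* analogue of rq->spinning_on_mutex */
};

/* Decide whether spinning on @owner can pay off at all. */
static bool worth_spinning(struct spinner_state *owner)
{
	/*
	 * If the owner is itself spinning, it cannot release our lock
	 * until *its* spin ends, so optimistic spinning is pointless.
	 */
	return !atomic_load(&owner->spinning_on_mutex);
}
```

This is the detection half only; forward progress additionally relies on spinners clearing their own flag when they stop, as the patch does before returning.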
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Excerpts from Tejun Heo's message of 2011-03-29 12:37:02 -0400:
> Hello, guys.
>
> I've been running dbench 50 for a few days now and the result is,
> well, I don't know how to call it.  The problem was that the original
> patch didn't do anything because x86 fastpath code didn't call into
> the generic slowpath at all.
>
>   static inline int __mutex_fastpath_trylock(atomic_t *count,
>                                              int (*fail_fn)(atomic_t *))
>   {
>           if (likely(atomic_cmpxchg(count, 1, 0) == 1))
>                   return 1;
>           else
>                   return 0;
>   }
>
> So, I thought that I probably was doing unconscious data selection
> while I was running the test before sending out the patches.  Maybe I
> was seeing what I wanted to see, so I ran tests in larger scale more
> methodologically.
>
> I first started with ten consecutive runs and then doubled it with
> intervening reboot and then basically ended up doing that twice for
> four configurations (I didn't do two runs of simple and refactor but
> just averaged the two).
>
> The hardware is mostly the same except that I switched to a hard
> drive instead of SSD as hard drives tend to be slower but more
> consistent in performance numbers.  On each run, the filesystem is
> recreated and the system was rebooted after every ten runs.  The
> numbers are the reported throughput in MiB/s at the end of each run.
>
>   https://spreadsheets.google.com/ccc?key=0AsbaQh2SFt66dDdxOGZWVVlIbEdIOWRQLURVVUNYSXc&hl=en
>
> Here are the descriptions of the eight columns.
>
>   simple         only with patch to make btrfs use mutex
>   refactor       mutex_spin() factored out
>   spin           mutex_spin() applied to the unused trylock slowpath
>   spin-1         ditto
>   spin-fixed     x86 trylock fastpath updated to use generic slowpath
>   spin-fixed-1   ditto
>   code-layout    refactor + dummy function added to mutex.c
>   code-layout-1  ditto
>
> After running the simple, refactor and spin ones, I was convinced
> that there definitely was something which was causing the difference.
> The averages were apart by more than 1.5 sigma, but I couldn't
> explain what could have caused such difference.
I have another workload that is interesting for this, basically N (100
or so) procs all doing stats on a bunch of files in the same directory.
This hammers very hard on the btree lookups.

During the first set of mutex spinning patches, doing any kind of
breakout on the owner spin (when the owner hasn't changed) made it much
slower.  You might want to try again with the 2.6.39-rc1 btrfs (or pull
my for-linus-unmerged tree into .38) because this gets rid of an
unneeded spinlock for the root node.

All your work convinced me to dig out my seqlock patches for read only
tree block operations.  I think we can take your current btrfs patches
and combine the seqlock stuff for a huge benefit.

In this case, the only thing we're really missing is a way to
mutex_lock without the cond_resched()

The biggest trylock user in btrfs is only there so we can keep the
spinning lock instead of the blocking lock.  Since you're getting rid
of the whole spin vs block setup, we can switch directly to a lock.

-chris
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:
> In this case, the only thing we're really missing is a way to
> mutex_lock without the cond_resched()

So you're trying to explicitly avoid a voluntary preemption point?
Seems like a bad idea, normally people add those :-)
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Excerpts from Peter Zijlstra's message of 2011-03-30 07:52:04 -0400:
> On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:
> > In this case, the only thing we're really missing is a way to
> > mutex_lock without the cond_resched()
>
> So you're trying to explicitly avoid a voluntary preemption point?
> Seems like a bad idea, normally people add those :-)

Yeah, but the btrfs fast path (when we're able to spin today) looks
like this:

	spin_lock(parent)
	binary search, select slot
	pull block for that slot out of cache
	spin_lock(child)
	spin_unlock(parent)

If we switch it all to mutexes:

	mutex_lock(parent)
	binary search, select slot
	pull block for that slot out of cache
	mutex_lock(child)
	mutex_unlock(parent)

Most of the spinning vs blocking benefits in btrfs came from doing
special things (like dropping the parent lock) when we know we're
going to block with the parent lock held.  Surprise blocking in
mutex_lock isn't great.

It would probably be enough to just move the cond_resched() after the
spinning portion of the mutex_lock()

-chris
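The hand-over-hand (lock-coupling) pattern Chris describes can be sketched in userspace with POSIX mutexes. `struct node` and `walk_to_leaf()` below are hypothetical names, not btrfs code; a single `child` pointer stands in for "the selected slot":

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Lock coupling: take the child's lock before dropping the parent's,
 * so the path being followed is always covered by at least one held
 * lock and no other writer can splice it out from under us.
 */
struct node {
	pthread_mutex_t lock;
	int key;
	struct node *child;	/* stand-in for "pull block for slot" */
};

/* Walk root -> leaf hand-over-hand; returns the still-locked leaf. */
static struct node *walk_to_leaf(struct node *root)
{
	struct node *cur = root;

	pthread_mutex_lock(&cur->lock);
	while (cur->child) {
		struct node *next = cur->child;

		pthread_mutex_lock(&next->lock);	/* lock child first... */
		pthread_mutex_unlock(&cur->lock);	/* ...then drop parent */
		cur = next;
	}
	return cur;	/* caller must unlock */
}
```

The point Chris makes applies directly to the inner `lock(child)`: with spinlocks it never sleeps, while a plain mutex_lock there can block with the parent lock still held.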
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Hello, guys.

I've been running dbench 50 for a few days now and the result is,
well, I don't know how to call it.  The problem was that the original
patch didn't do anything because x86 fastpath code didn't call into
the generic slowpath at all.

  static inline int __mutex_fastpath_trylock(atomic_t *count,
                                             int (*fail_fn)(atomic_t *))
  {
          if (likely(atomic_cmpxchg(count, 1, 0) == 1))
                  return 1;
          else
                  return 0;
  }

So, I thought that I probably was doing unconscious data selection
while I was running the test before sending out the patches.  Maybe I
was seeing what I wanted to see, so I ran tests in larger scale more
methodologically.

I first started with ten consecutive runs and then doubled it with
intervening reboot and then basically ended up doing that twice for
four configurations (I didn't do two runs of simple and refactor but
just averaged the two).

The hardware is mostly the same except that I switched to a hard drive
instead of SSD as hard drives tend to be slower but more consistent in
performance numbers.  On each run, the filesystem is recreated and the
system was rebooted after every ten runs.  The numbers are the
reported throughput in MiB/s at the end of each run.

  https://spreadsheets.google.com/ccc?key=0AsbaQh2SFt66dDdxOGZWVVlIbEdIOWRQLURVVUNYSXc&hl=en

Here are the descriptions of the eight columns.

  simple         only with patch to make btrfs use mutex
  refactor       mutex_spin() factored out
  spin           mutex_spin() applied to the unused trylock slowpath
  spin-1         ditto
  spin-fixed     x86 trylock fastpath updated to use generic slowpath
  spin-fixed-1   ditto
  code-layout    refactor + dummy function added to mutex.c
  code-layout-1  ditto

After running the simple, refactor and spin ones, I was convinced that
there definitely was something which was causing the difference.  The
averages were apart by more than 1.5 sigma, but I couldn't explain
what could have caused such difference.  The code-layout runs were my
desperate attempts to find an explanation for what's going on.
Addition of mutex_spin to the unused trylock generic path makes gcc
arrange functions differently.  Without it, trylock functions end up
in between lock and unlock functions; with it, they are located at the
end.  I commented out the unused trylock slowpath function and added a
dummy function at the end to make gcc generate similar assembly
layout.

At this point, the only conclusions I can draw are,

* Using adaptive spinning on mutex_trylock() doesn't seem to buy
  anything according to btrfs dbench 50 runs.

and much more importantly,

* btrfs dbench 50 runs are probably not good for measuring subtle
  mutex performance differences.  Maybe it's too macro and there are
  larger scale tendencies which skew the result unless the number of
  runs is vastly increased (but 40 runs already take over eight
  hours).

If anyone can provide an explanation on what's going on, I'll be super
happy.  Otherwise, for now, I'll just leave it alone. :-(

Thanks.

--
tejun
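For reference, the quoted x86 `__mutex_fastpath_trylock()` is a single cmpxchg on the mutex count. A userspace analogue with C11 atomics (toy names; illustration only - the real kernel count additionally encodes the presence of waiters as negative values, which this sketch omits):

```c
#include <stdatomic.h>

/*
 * Toy model of the mutex-count fastpath: count is 1 when unlocked,
 * 0 when locked, and trylock succeeds iff one compare-and-swap moves
 * it from 1 to 0.  On failure it returns immediately - this is the
 * "tries just once" behavior the thread is discussing.
 */
typedef struct {
	atomic_int count;
} toy_mutex_t;

static int toy_mutex_trylock(toy_mutex_t *m)
{
	int expected = 1;

	/* Analogue of atomic_cmpxchg(count, 1, 0) == 1. */
	return atomic_compare_exchange_strong(&m->count, &expected, 0);
}

static void toy_mutex_unlock(toy_mutex_t *m)
{
	atomic_store(&m->count, 1);
}
```

Because the fastpath succeeds or fails entirely inline, a patch that only touches the generic slowpath never runs on x86 trylock - which is exactly the bug described above.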
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Here's the combined patch I was planning on testing but didn't get to
(yet).  It implements two things - hard limit on spin duration and
early break if the owner also is spinning on a mutex.

Thanks.

Index: work1/include/linux/sched.h
===================================================================
--- work1.orig/include/linux/sched.h
+++ work1/include/linux/sched.h
@@ -359,7 +359,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: work1/kernel/sched.c
===================================================================
--- work1.orig/kernel/sched.c
+++ work1/kernel/sched.c
@@ -536,6 +536,10 @@ struct rq {
 	struct hrtimer hrtick_timer;
 #endif
 
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+	bool spinning_on_mutex;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	/* latency stats */
 	struct sched_info rq_sched_info;
@@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 /*
- * Look out! owner is an entirely speculative pointer
- * access and not reliable.
+ * Maximum mutex owner spin duration in nsecs.  Don't spin more than
+ * DEF_TIMESLICE.
+ */
+#define MAX_MUTEX_SPIN_NS	(DEF_TIMESLICE * 1000000000LLU / HZ)
+
+/**
+ * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
+ * @lock: the mutex to spin on
+ * @owner: the current owner (speculative pointer)
+ *
+ * The caller is trying to acquire @lock held by @owner.  If @owner is
+ * currently running, it might get unlocked soon and spinning on it can
+ * save the overhead of sleeping and waking up.
+ *
+ * Note that @owner is completely speculative and may be completely
+ * invalid.  It should be accessed very carefully.
+ *
+ * Forward progress is guaranteed regardless of locking ordering by never
+ * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
+ * mutex_trylock(), which doesn't have to follow the usual locking
+ * ordering, also uses this function.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if the lock was released and the caller should retry locking.
+ * %false if the caller better go sleeping.
  */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
+bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 {
+	unsigned long start;
 	unsigned int cpu;
 	struct rq *rq;
+	bool ret = true;
 
 	if (!sched_feat(OWNER_SPIN))
-		return 0;
+		return false;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	/*
@@ -4039,7 +4071,7 @@ int mutex_spin_on_owner(struct mutex *lo
 	 * the mutex owner just released it and exited.
 	 */
 	if (probe_kernel_address(&owner->cpu, cpu))
-		return 0;
+		return false;
 #else
 	cpu = owner->cpu;
 #endif
@@ -4049,15 +4081,17 @@ int mutex_spin_on_owner(struct mutex *lo
 	 * the cpu field may no longer be valid.
 	 */
 	if (cpu >= nr_cpumask_bits)
-		return 0;
+		return false;
 
 	/*
 	 * We need to validate that we can do a
 	 * get_cpu() and that we have the percpu area.
 	 */
 	if (!cpu_online(cpu))
-		return 0;
+		return false;
 
+	this_rq()->spinning_on_mutex = true;
+	start = local_clock();
 	rq = cpu_rq(cpu);
 
 	for (;;) {
@@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
 			 * we likely have heavy contention. Return 0 to quit
 			 * optimistic spinning and not contend further:
 			 */
-			if (lock->owner)
-				return 0;
+			ret = !lock->owner;
 			break;
 		}
 
 		/*
-		 * Is that owner really running on that cpu?
+		 * Quit spinning if any of the following is true.
+		 *
+		 * - The owner isn't running on that cpu.
+		 * - The owner also is spinning on a mutex.
+		 * - Someone else wants to use this cpu.
+		 * - We've been spinning for too long.
		 */
-		if (task_thread_info(rq->curr) != owner || need_resched())
-			return 0;
+		if (task_thread_info(rq->curr) != owner ||
+		    rq->spinning_on_mutex || need_resched() ||
+		    local_clock() > start + MAX_MUTEX_SPIN_NS) {
+			ret = false;
+			break;
+		}
 
 		arch_mutex_cpu_relax();
 	}
 
-	return 1;
+	this_rq()->spinning_on_mutex = false;
+	return ret;
 }
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
> Here's the combined patch I was planning on testing but didn't get to
> (yet).  It implements two things - hard limit on spin duration and
> early break if the owner also is spinning on a mutex.

This is going to give massive conflicts with

  https://lkml.org/lkml/2011/3/2/286
  https://lkml.org/lkml/2011/3/2/282

which I was planning to stuff into .40

> @@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
>  /*
> + * Maximum mutex owner spin duration in nsecs.  Don't spin more than
> + * DEF_TIMESLICE.
> + */
> +#define MAX_MUTEX_SPIN_NS	(DEF_TIMESLICE * 1000000000LLU / HZ)

DEF_TIMESLICE is SCHED_RR only, so its use here is dubious at best,
also I bet we have something like NSEC_PER_SEC to avoid counting '0's.

> +
> +/**
> + * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
> + * @lock: the mutex to spin on
> + * @owner: the current owner (speculative pointer)
> + *
> + * The caller is trying to acquire @lock held by @owner.  If @owner is
> + * currently running, it might get unlocked soon and spinning on it can
> + * save the overhead of sleeping and waking up.
> + *
> + * Note that @owner is completely speculative and may be completely
> + * invalid.  It should be accessed very carefully.
> + *
> + * Forward progress is guaranteed regardless of locking ordering by never
> + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
> + * mutex_trylock(), which doesn't have to follow the usual locking
> + * ordering, also uses this function.

While that puts a limit on things it'll still waste time.  I'd much
rather pass a trylock argument to mutex_spin_on_owner() and then bail
on owner also spinning.

> + * CONTEXT:
> + * Preemption disabled.
> + *
> + * RETURNS:
> + * %true if the lock was released and the caller should retry locking.
> + * %false if the caller better go sleeping.
>  */
> -int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
> +bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
>  {

> @@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
>  			 * we likely have heavy contention. Return 0 to quit
>  			 * optimistic spinning and not contend further:
>  			 */
> +			ret = !lock->owner;
>  			break;
>  		}
>
>  		/*
> -		 * Is that owner really running on that cpu?
> +		 * Quit spinning if any of the following is true.
> +		 *
> +		 * - The owner isn't running on that cpu.
> +		 * - The owner also is spinning on a mutex.
> +		 * - Someone else wants to use this cpu.
> +		 * - We've been spinning for too long.
>  		 */
> +		if (task_thread_info(rq->curr) != owner ||
> +		    rq->spinning_on_mutex || need_resched() ||
> +		    local_clock() > start + MAX_MUTEX_SPIN_NS) {

While we did our best with making local_clock() cheap, I'm still
fairly uncomfortable with putting it in such a tight loop.

> +			ret = false;
> +			break;
> +		}
>
>  		arch_mutex_cpu_relax();
>  	}
>
> +	this_rq()->spinning_on_mutex = false;
> +	return ret;
>  }
>  #endif
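Peter's NSEC_PER_SEC suggestion amounts to the following. The HZ and DEF_TIMESLICE values below are illustrative build-time choices for the sake of a self-contained example, not taken from any particular kernel config:

```c
/*
 * Express the spin budget via NSEC_PER_SEC instead of a literal with
 * nine zeros, as Peter suggests.  DEF_TIMESLICE is a jiffies count
 * (100ms worth), so jiffies -> ns is a multiply by NSEC_PER_SEC / HZ.
 */
#define NSEC_PER_SEC	1000000000ULL
#define HZ		1000			/* illustrative */
#define DEF_TIMESLICE	(100 * HZ / 1000)	/* 100ms in jiffies */

#define MAX_MUTEX_SPIN_NS	(DEF_TIMESLICE * NSEC_PER_SEC / HZ)
```

With these values the budget works out to 100,000,000 ns (100ms), the same quantity the patch spells out with a bare `1000000000LLU`.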
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Hello, Steven, Linus.

On Thu, Mar 24, 2011 at 09:38:58PM -0700, Linus Torvalds wrote:
> On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt <rost...@goodmis.org> wrote:
> > But now, mutex_trylock(B) becomes a spinner too, and since the B's
> > owner is running (spinning on A) it will spin as well waiting for
> > A's owner to release it.  Unfortunately, A's owner is also spinning
> > waiting for B to release it.  If both A and B's owners are real
> > time tasks, then boom! deadlock.
>
> Hmm. I think you're right.  And it looks pretty fundamental - I don't
> see any reasonable approach to avoid it.

Hmmm... I have an idea.  Will play with it a bit and post if it works
out okay.

> I think the RT issue is a red herring too - afaik, you can get a
> deadlock with two perfectly normal processes too.  Of course, for
> non-RT tasks, any other process will eventually disturb the situation
> and you'd get kicked out due to need_resched(), but even that might
> be avoided for a long time if there are other CPU's - leading to tons
> of wasted CPU time.

Yeap, need_resched() currently is the only thing which limits the
duration of spinning when the owner continues to run.

Thanks.

--
tejun
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Hello,

On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
>           USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
>  SIMPLE  61107   354977    217  8099529   845.100 MB/sec
>  SPIN    63140   364888    214  6840527   879.077 MB/sec
>
> On various runs, the adaptive spinning trylock consistently posts
> higher throughput.  The amount of difference varies but it
> outperforms consistently.

I've been running more of these tests and am having doubts about the
consistency.  It seems that, even on a fresh filesystem, some random
initial condition seems to have persistent effect on the whole run.
I'll run more tests and report back.

Thanks.

--
tejun
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
* Tejun Heo <t...@kernel.org> wrote:

> Hello,
>
> On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
> >           USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
> >  SIMPLE  61107   354977    217  8099529   845.100 MB/sec
> >  SPIN    63140   364888    214  6840527   879.077 MB/sec
> >
> > On various runs, the adaptive spinning trylock consistently posts
> > higher throughput.  The amount of difference varies but it
> > outperforms consistently.
>
> I've been running more of these tests and am having doubts about the
> consistency.  It seems that, even on a fresh filesystem, some random
> initial condition seems to have persistent effect on the whole run.
> I'll run more tests and report back.

Ok, and there's the deadlock issue as well which Steve noticed.  I'll
zap the patches from tip:core/urgent and let's do this via
tip:core/locking with a .40 timeframe and plenty of time for testing.

Thanks,

	Ingo
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, Mar 25, 2011 at 6:39 AM, Steven Rostedt <rost...@goodmis.org> wrote:
> On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
> > Adaptive owner spinning used to be applied only to mutex_lock().
> > This patch applies it also to mutex_trylock().
> >
> > btrfs has developed custom locking to avoid excessive context
> > switches in its btree implementation.  Generally, doing away with
> > the custom implementation and just using the mutex shows better
> > behavior; however, there's an interesting distinction in the custom
> > implementation of trylock.  It distinguishes between simple trylock
> > and tryspin, where the former just tries once and then fails while
> > the latter does some spinning before giving up.
> >
> > Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
> > just once.  I got curious whether using adaptive spinning on
> > mutex_trylock() would be beneficial and it seems so, for btrfs
> > anyway.
> >
> > The following results are from dbench 50 run on an opteron two
> > socket eight core machine with 4GiB of memory and an OCZ vertex
> > SSD.  During the run, disk stays mostly idle and all CPUs are fully
> > occupied and the difference in locking performance becomes quite
> > visible.
> >
> > SIMPLE is with the locking simplification patch[1] applied.  i.e.
> > it basically just uses mutex.  SPIN is with this patch applied on
> > top - mutex_trylock() uses adaptive spinning.
> >
> >           USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
> >  SIMPLE  61107   354977    217  8099529   845.100 MB/sec
> >  SPIN    63140   364888    214  6840527   879.077 MB/sec
> >
> > On various runs, the adaptive spinning trylock consistently posts
> > higher throughput.  The amount of difference varies but it
> > outperforms consistently.
> >
> > In general, using adaptive spinning on trylock makes sense as
> > trylock failure usually leads to costly unlock-relock sequence.
> > [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
> >
> > Signed-off-by: Tejun Heo <t...@kernel.org>
>
> I'm curious about the effects that this has on those places that do:
>
> again:
> 	mutex_lock(A);
> 	if (!mutex_trylock(B)) {
> 		mutex_unlock(A);
> 		goto again;
> 	}
>
> Where the normal locking order is:
>
> 	B -> A
>
> If another location does:
>
> 	mutex_lock(B);
> 	[...]
> 	mutex_lock(A);
>
> But another process has A already, and is running, it may spin
> waiting for A as A's owner is still running.  But now,
> mutex_trylock(B) becomes a spinner too, and since the B's owner is
> running (spinning on A) it will spin as well waiting for A's owner to
> release it.  Unfortunately, A's owner is also spinning waiting for B
> to release it.
>
> If both A and B's owners are real time tasks, then boom! deadlock.

Turning try_lock into an indefinitely spinning one breaks its
semantics, so deadlock is to be expected.  But what's wrong in this
scenario if try_lock spins a bit before giving up?

Regards,
Andrey
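The pattern in Steven's first snippet is the classic trylock backoff: A is taken out of the documented B -> A order, so the code trylocks B and drops A entirely when B is unavailable, letting in-order lockers make progress. A userspace sketch of that pattern (hypothetical helper name, not kernel code):

```c
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

/*
 * Documented order elsewhere is B then A; here we want A first, so B
 * must be trylocked and, on failure, A must be dropped completely
 * before retrying.  A trylock that spins while the holder spins on A
 * defeats this backoff - which is the deadlock Steven describes.
 */
static void lock_both_backoff(void)
{
	for (;;) {
		pthread_mutex_lock(&A);
		if (pthread_mutex_trylock(&B) == 0)
			return;			/* got both, in "wrong" order safely */
		pthread_mutex_unlock(&A);	/* back off: let B -> A users run */
		sched_yield();
	}
}
```

The backoff only works because the trylock fails fast; that is the semantic Andrey's "spin a bit" proposal would weaken and an unbounded spin would break outright.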
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, 2011-03-25 at 07:53 +0100, Tejun Heo wrote:
> Hello, Steven, Linus.
>
> On Thu, Mar 24, 2011 at 09:38:58PM -0700, Linus Torvalds wrote:
> > On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt <rost...@goodmis.org> wrote:
> > > But now, mutex_trylock(B) becomes a spinner too, and since the
> > > B's owner is running (spinning on A) it will spin as well waiting
> > > for A's owner to release it.  Unfortunately, A's owner is also
> > > spinning waiting for B to release it.  If both A and B's owners
> > > are real time tasks, then boom! deadlock.
> >
> > Hmm. I think you're right.  And it looks pretty fundamental - I
> > don't see any reasonable approach to avoid it.
>
> Hmmm... I have an idea.  Will play with it a bit and post if it works
> out okay.

One solution is to have this be only done on explicit trylocks.
Perhaps introduce a mutex_trylock_spin()?  Then when the developer
knows that this scenario does not exist, they can convert
mutex_trylocks() into this spinning version.

> > I think the RT issue is a red herring too - afaik, you can get a
> > deadlock with two perfectly normal processes too.  Of course, for
> > non-RT tasks, any other process will eventually disturb the
> > situation and you'd get kicked out due to need_resched(), but even
> > that might be avoided for a long time if there are other CPU's -
> > leading to tons of wasted CPU time.
>
> Yeap, need_resched() currently is the only thing which limits the
> duration of spinning when the owner continues to run.

Yeah, I was about to complain about the long latencies that this could
cause, then I realized that RT tasks would in fact deadlock the system
here, which I thought was a bigger problem, and focused on that issue.

-- Steve
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
> Turning try_lock into an indefinitely spinning one breaks its
> semantics, so deadlock is to be expected.  But what's wrong in this
> scenario if try_lock spins a bit before giving up?

Because that will cause this scenario to spin that little longer
always, and introduce latencies that did not exist before.  Either the
solution does not break this scenario, or it should not go in.

-- Steve
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, 2011-03-25 at 09:10 -0400, Steven Rostedt wrote:
> One solution is to have this be only done on explicit trylocks.
> Perhaps introduce a mutex_trylock_spin()?  Then when the developer
> knows that this scenario does not exist, they can convert
> mutex_trylocks() into this spinning version.

I'm not sure this is even worth it, as I'm looking at the
btrfs/extent-tree.c code; this is the main reason to use
mutex_trylock().  I guess what you see in your benchmarks is that
trylock contention happens mostly in the non-deadlock scenario.  But I
bet you have latencies when it does happen, but the benefit seems to
outweigh it in the results.

I wonder what happens if you run dbench as an RT task.

-- Steve
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt <rost...@goodmis.org> wrote:
> On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
> > Turning try_lock into an indefinitely spinning one breaks its
> > semantics, so deadlock is to be expected.  But what's wrong in this
> > scenario if try_lock spins a bit before giving up?
>
> Because that will cause this scenario to spin that little longer
> always, and introduce latencies that did not exist before.  Either
> the solution does not break this scenario, or it should not go in.

Broken semantics and extra latency are two separate issues.  If the
former is fixed, the latter is easily handled by introducing a new
mutex_trylock_spin call that lets one either stick to existing
behavior (try/fail) or choose a new one where the latency penalty is
justified by locking patterns.

Regards,
Andrey
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, 2011-03-25 at 16:50 +0300, Andrey Kuzmin wrote:
> On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt <rost...@goodmis.org> wrote:
> > On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
> > > Turning try_lock into an indefinitely spinning one breaks its
> > > semantics, so deadlock is to be expected.  But what's wrong in
> > > this scenario if try_lock spins a bit before giving up?
> >
> > Because that will cause this scenario to spin that little longer
> > always, and introduce latencies that did not exist before.  Either
> > the solution does not break this scenario, or it should not go in.
>
> Broken semantics and extra latency are two separate issues.  If the
> former is fixed, the latter is easily handled by introducing a new
> mutex_trylock_spin call that lets one either stick to existing
> behavior (try/fail) or choose a new one where the latency penalty is
> justified by locking patterns.

For those wanting a more RT deterministic OS, I will argue against
latency penalties.

-- Steve
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
* Steven Rostedt rost...@goodmis.org wrote:

> On Fri, 2011-03-25 at 16:50 +0300, Andrey Kuzmin wrote:
>> Broken semantics and extra latency are two separate issues. If the
>> former is fixed, the latter is easily handled by introducing a new
>> mutex_trylock_spin() call that lets one either stick to the existing
>> behavior (try/fail) or choose a new one where the latency penalty is
>> justified by locking patterns.
>
> For those wanting a more RT deterministic OS, I will argue against
> latency penalties.

Later mails from Tejun suggest that the benchmark results are varied,
and that it's not a clear win after all.

It's possible that if useless spinning is introduced then that might
explain such workload variations. I.e. it's not really 'latencies' but
'unnecessary overhead spent spinning' - and also 'extra
non-deterministic noise' - none of which help consistent performance.

Thanks,

	Ingo
[PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Adaptive owner spinning used to be applied only to mutex_lock(). This
patch applies it also to mutex_trylock().

btrfs has developed custom locking to avoid excessive context switches
in its btree implementation. Generally, doing away with the custom
implementation and just using the mutex shows better behavior; however,
there's an interesting distinction in the custom implementation of
trylock. It distinguishes between simple trylock and tryspin, where the
former just tries once and then fails while the latter does some
spinning before giving up.

Currently, mutex_trylock() doesn't use adaptive spinning. It tries just
once. I got curious whether using adaptive spinning on mutex_trylock()
would be beneficial and it seems so, for btrfs anyway.

The following results are from dbench 50 run on an opteron two socket
eight core machine with 4GiB of memory and an OCZ vertex SSD. During
the run, the disk stays mostly idle and all CPUs are fully occupied,
and the difference in locking performance becomes quite visible.

SIMPLE is with the locking simplification patch[1] applied, i.e. it
basically just uses mutex. SPIN is with this patch applied on top -
mutex_trylock() uses adaptive spinning.

          USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
 SIMPLE  61107   354977    217  8099529  845.100 MB/sec
 SPIN    63140   364888    214  6840527  879.077 MB/sec

On various runs, the adaptive spinning trylock consistently posts
higher throughput. The amount of difference varies but it outperforms
consistently.

In general, using adaptive spinning on trylock makes sense as trylock
failure usually leads to a costly unlock-relock sequence.
[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

Signed-off-by: Tejun Heo t...@kernel.org
LKML-Reference: 20110323153727.gb12...@htj.dyndns.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@redhat.com
Cc: Chris Mason chris.ma...@oracle.com
---
 kernel/mutex.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: work/kernel/mutex.c
===================================================================
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -443,6 +443,15 @@ static inline int __mutex_trylock_slowpa
 	unsigned long flags;
 	int prev;
 
+	preempt_disable();
+
+	if (mutex_spin(lock)) {
+		mutex_set_owner(lock);
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		preempt_enable();
+		return 1;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
@@ -456,6 +465,7 @@ static inline int __mutex_trylock_slowpa
 		atomic_set(&lock->count, 0);
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
+	preempt_enable();
 
 	return prev == 1;
 }
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
> Adaptive owner spinning used to be applied only to mutex_lock(). This
> patch applies it also to mutex_trylock().
>
> [...]
>
> In general, using adaptive spinning on trylock makes sense as trylock
> failure usually leads to a costly unlock-relock sequence.
>
> [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
>
> Signed-off-by: Tejun Heo t...@kernel.org

I'm curious about the effects that this has on those places that do:

again:
	mutex_lock(A);
	if (!mutex_trylock(B)) {
		mutex_unlock(A);
		goto again;
	}

Where the normal locking order is:

	B -> A

If another location does:

	mutex_lock(B);
	[...]
	mutex_lock(A);

but another process has A already and is running, it may spin waiting
for A, as A's owner is still running.

But now, mutex_trylock(B) becomes a spinner too: since B's owner is
running (spinning on A), it keeps spinning, waiting for A's owner to
release A so that B's owner can make progress. Unfortunately, A's owner
is that very trylock spinner, which is itself spinning waiting for B to
be released. If both A's and B's owners are real time tasks, then boom!
deadlock.

-- Steve
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt rost...@goodmis.org wrote:
> But now, mutex_trylock(B) becomes a spinner too, and since B's owner
> is running (spinning on A) it will spin as well. Unfortunately, A's
> owner is also spinning waiting for B to be released. If both A's and
> B's owners are real time tasks, then boom! deadlock.

Hmm. I think you're right. And it looks pretty fundamental - I don't
see any reasonable approach to avoid it.

I think the RT issue is a red herring too - afaik, you can get a
deadlock with two perfectly normal processes too. Of course, for non-RT
tasks, any other process will eventually disturb the situation and
you'd get kicked out due to need_resched(), but even that might be
avoided for a long time if there are other CPUs - leading to tons of
wasted CPU time.

	Linus