Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Tejun Heo
Hey, Peter.

On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:
 On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
  Here's the combined patch I was planning on testing but didn't get to
  (yet).  It implements two things - hard limit on spin duration and
  early break if the owner also is spinning on a mutex.
 
 This is going to give massive conflicts with
 
  https://lkml.org/lkml/2011/3/2/286
  https://lkml.org/lkml/2011/3/2/282
 
 which I was planning to stuff into .40

I see.  Adapting shouldn't be hard.  The patch is in proof-of-concept
stage anyway.

  + * Forward progress is guaranteed regardless of locking ordering by never
  + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
  + * mutex_trylock(), which doesn't have to follow the usual locking
  + * ordering, also uses this function.
 
 While that puts a limit on things it'll still waste time. I'd much
 rather pass a trylock argument to mutex_spin_on_owner() and then bail
 on owner also spinning.

Do we guarantee or enforce that the lock ownership can't be
transferred to a different task?  If we do, the recursive spinning
detection is enough to guarantee forward progress.

  +   if (task_thread_info(rq->curr) != owner ||
  +   rq->spinning_on_mutex || need_resched() ||
  +   local_clock() > start + MAX_MUTEX_SPIN_NS) {
 
 While we did our best with making local_clock() cheap, I'm still fairly
 uncomfortable with putting it in such a tight loop.

That's one thing I didn't really understand.  It seems the spinning
code tried to be light on CPU cycle usage, but we're wasting CPU
cycles there anyway.  If the spinning can become smarter using some
CPU cycles, isn't that a gain?  Why is the spinning code optimizing
for fewer CPU cycles?

Also, it would be great if you can share what you're using for locking
performance measurements.  My attempts with dbench didn't end too
well. :-(

Thanks.

-- 
tejun


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Peter Zijlstra
On Wed, 2011-03-30 at 10:17 +0200, Tejun Heo wrote:
 Hey, Peter.
 
 On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:
  On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
   Here's the combined patch I was planning on testing but didn't get to
   (yet).  It implements two things - hard limit on spin duration and
   early break if the owner also is spinning on a mutex.
  
  This is going to give massive conflicts with
  
   https://lkml.org/lkml/2011/3/2/286
   https://lkml.org/lkml/2011/3/2/282
  
  which I was planning to stuff into .40
 
 I see.  Adapting shouldn't be hard.  The patch is in proof-of-concept
 stage anyway.
 
   + * Forward progress is guaranteed regardless of locking ordering by never
   + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
   + * mutex_trylock(), which doesn't have to follow the usual locking
   + * ordering, also uses this function.
  
  While that puts a limit on things it'll still waste time. I'd much
  rather pass a trylock argument to mutex_spin_on_owner() and then bail
  on owner also spinning.
 
 Do we guarantee or enforce that the lock ownership can't be
 transferred to a different task?  If we do, the recursive spinning
 detection is enough to guarantee forward progress.

The only way to switch owner is for the current owner to release and a
new owner to acquire the lock. Also we already bail the spin loop when
owner changes.

   +   if (task_thread_info(rq->curr) != owner ||
   +   rq->spinning_on_mutex || need_resched() ||
   +   local_clock() > start + MAX_MUTEX_SPIN_NS) {
  
  While we did our best with making local_clock() cheap, I'm still fairly
  uncomfortable with putting it in such a tight loop.
 
 That's one thing I didn't really understand.  It seems the spinning
 code tried to be light on CPU cycle usage, but we're wasting CPU
 cycles there anyway.  If the spinning can become smarter using some
 CPU cycles, isn't that a gain?  Why is the spinning code optimizing
 for fewer CPU cycles?

Loop exit latency, mostly: the lighter the loop, the faster you're
through it and the less time you waste once the condition becomes true.
But yeah, what you say is true too; it's a balance game as always.
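
As an illustration of that balance (not something proposed in this
thread), the clock read could be amortized by sampling it only every few
iterations, at the cost of a slightly fuzzier bound.  Names are the ones
from the quoted hunk above; the other exit conditions are omitted for
brevity:

	unsigned int loops = 0;

	for (;;) {
		if (task_thread_info(rq->curr) != owner || need_resched())
			break;
		/* keep the common iteration light: read the clock only
		 * once every 64 spins */
		if (!(++loops & 63) &&
		    local_clock() > start + MAX_MUTEX_SPIN_NS)
			break;
		arch_mutex_cpu_relax();
	}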

 Also, it would be great if you can share what you're using for locking
 performance measurements.  My attempts with dbench didn't end too
 well. :-(

The thing I used for the initial implementation is mentioned in the
changelog (0d66bf6d3):

Testing with Ingo's test-mutex application 
(http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   296604

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   85870

I've no idea how heavy that is on trylock though, you'd have to look at
that.

Chris did some improvements using dbench in ac6e60ee4 but clearly that
isn't working for you.




Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Chris Mason
Excerpts from Tejun Heo's message of 2011-03-29 12:37:02 -0400:
 Hello, guys.
 
 I've been running dbench 50 for a few days now and the result is,
 well, I don't know what to call it.
 
 The problem was that the original patch didn't do anything because x86
 fastpath code didn't call into the generic slowpath at all.
 
   static inline int __mutex_fastpath_trylock(atomic_t *count,
  int (*fail_fn)(atomic_t *))
   {
   if (likely(atomic_cmpxchg(count, 1, 0) == 1))
   return 1;
   else
   return 0;
   }
 
 So, I thought that I probably was doing unconscious data selection
 while I was running the test before sending out the patches.  Maybe I
 was seeing what I wanted to see, so I ran tests at a larger scale, more
 methodically.
 
 I first started with ten consecutive runs and then doubled it with
 intervening reboot and then basically ended up doing that twice for
 four configurations (I didn't do two runs of simple and refactor but
 just averaged the two).
 
 The hardware is mostly the same except that I switched to a hard drive
 instead of SSD as hard drives tend to be slower but more consistent in
 performance numbers.  On each run, the filesystem is recreated and the
 system was rebooted after every ten runs.  The numbers are the
 reported throughput in MiB/s at the end of each run.
 
   
https://spreadsheets.google.com/ccc?key=0AsbaQh2SFt66dDdxOGZWVVlIbEdIOWRQLURVVUNYSXc&hl=en
 
 Here are the descriptions of the eight columns.
 
   simple         only with patch to make btrfs use mutex
   refactor       mutex_spin() factored out
   spin           mutex_spin() applied to the unused trylock slowpath
   spin-1         ditto
   spin-fixed     x86 trylock fastpath updated to use generic slowpath
   spin-fixed-1   ditto
   code-layout    refactor + dummy function added to mutex.c
   code-layout-1  ditto
 
 After running the simple, refactor and spin ones, I was convinced that
 there definitely was something which was causing the difference.  The
 averages were apart by more than 1.5 sigma, but I couldn't explain
 what could have caused such a difference.

I have another workload that is interesting for this, basically N (100
or so) procs all doing stats on a bunch of files in the same directory.
This hammers very hard on the btree lookups.

During the first set of mutex spinning patches, doing any kind of
breakout on the owner spin (when the owner hasn't changed) made it much
slower.

You might want to try again with the 2.6.39-rc1 btrfs (or pull my
for-linus-unmerged tree into .38) because this gets rid of an unneeded
spinlock for the root node.

All your work convinced me to dig out my seqlock patches for read only
tree block operations.  I think we can take your current btrfs patches
and combine the seqlock stuff for a huge benefit.
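
As a point of reference, here is a minimal sketch of what
seqlock-protected read-only tree block access could look like.  struct
tree_block and its fields are invented for illustration; they are not
the actual btrfs structures or Chris's patches:

	#include <linux/seqlock.h>

	struct tree_block {
		seqlock_t	lock;		/* guards the fields below */
		int		nritems;
		/* ... keys, block pointers ... */
	};

	/* Lockless read; retry if a writer raced with us. */
	static int tree_block_nritems(struct tree_block *b)
	{
		unsigned int seq;
		int nritems;

		do {
			seq = read_seqbegin(&b->lock);	/* snapshot sequence count */
			nritems = b->nritems;		/* speculative read */
		} while (read_seqretry(&b->lock, seq));	/* writer raced, go again */

		return nritems;
	}

Readers never block and never bounce the lock cacheline in the common
case; writers wrap their modifications in write_seqlock()/
write_sequnlock().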

In this case, the only thing we're really missing is a way to mutex_lock
without the cond_resched()

The biggest trylock user in btrfs is only there so we can keep the
spinning lock instead of the blocking lock.  Since you're getting rid of
the whole spin vs block setup, we can switch directly to a lock.

-chris


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Peter Zijlstra
On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:
 
 In this case, the only thing we're really missing is a way to mutex_lock
 without the cond_resched() 

So you're trying to explicitly avoid a voluntary preemption point? Seems
like a bad idea, normally people add those :-)


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Chris Mason
Excerpts from Peter Zijlstra's message of 2011-03-30 07:52:04 -0400:
 On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:
  
  In this case, the only thing we're really missing is a way to mutex_lock
  without the cond_resched() 
 
 So you're trying to explicitly avoid a voluntary preemption point? Seems
 like a bad idea, normally people add those :-)

Yeah, but the btrfs fast path (when we're able to spin today) looks like
this:

spin_lock(parent)
binary search, select slot
pull block for that slot out of cache
spin_lock(child)
spin_unlock(parent)

If we switch it all to mutexes:

mutex_lock(parent)
binary search, select slot
pull block for that slot out of cache
mutex_lock(child)
mutex_unlock(parent)

Most of the spinning vs blocking benefits in btrfs came from doing
special things (like dropping the parent lock) when we know we're going
to block with the parent lock held.  Surprise blocking in mutex_lock
isn't great.

It would probably be enough to just move the cond_resched() after the
spinning portion of the mutex_lock()
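
As a sketch of what that reordering means (simplified pseudocode;
fastpath_acquire(), mutex_spin() and slowpath_acquire() are stand-ins
for the real internals, not actual kernel functions):

	void mutex_lock_sketch(struct mutex *lock)
	{
		if (fastpath_acquire(lock))	/* uncontended cmpxchg */
			return;

		if (mutex_spin(lock))		/* optimistic owner spin */
			return;			/* acquired without sleeping */

		cond_resched();			/* moved: take the voluntary
						 * preemption point only once
						 * we would block anyway */
		slowpath_acquire(lock);		/* wait list + schedule() */
	}

That way callers keep the no-schedule behavior whenever the spin
succeeds.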

-chris


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-29 Thread Tejun Heo
Hello, guys.

I've been running dbench 50 for a few days now and the result is,
well, I don't know what to call it.

The problem was that the original patch didn't do anything because x86
fastpath code didn't call into the generic slowpath at all.

  static inline int __mutex_fastpath_trylock(atomic_t *count,
 int (*fail_fn)(atomic_t *))
  {
  if (likely(atomic_cmpxchg(count, 1, 0) == 1))
  return 1;
  else
  return 0;
  } 


So, I thought that I probably was doing unconscious data selection
while I was running the test before sending out the patches.  Maybe I
was seeing what I wanted to see, so I ran tests at a larger scale, more
methodically.

I first started with ten consecutive runs and then doubled it with
intervening reboot and then basically ended up doing that twice for
four configurations (I didn't do two runs of simple and refactor but
just averaged the two).

The hardware is mostly the same except that I switched to a hard drive
instead of SSD as hard drives tend to be slower but more consistent in
performance numbers.  On each run, the filesystem is recreated and the
system was rebooted after every ten runs.  The numbers are the
reported throughput in MiB/s at the end of each run.

  
https://spreadsheets.google.com/ccc?key=0AsbaQh2SFt66dDdxOGZWVVlIbEdIOWRQLURVVUNYSXc&hl=en

Here are the descriptions of the eight columns.

  simple         only with patch to make btrfs use mutex
  refactor       mutex_spin() factored out
  spin           mutex_spin() applied to the unused trylock slowpath
  spin-1         ditto
  spin-fixed     x86 trylock fastpath updated to use generic slowpath
  spin-fixed-1   ditto
  code-layout    refactor + dummy function added to mutex.c
  code-layout-1  ditto

After running the simple, refactor and spin ones, I was convinced that
there definitely was something which was causing the difference.  The
averages were apart by more than 1.5 sigma, but I couldn't explain
what could have caused such a difference.

The code-layout runs were my desperate attempts to find an explanation
for what's going on.  Addition of mutex_spin to the unused trylock
generic path makes gcc arrange functions differently.  Without it,
trylock functions end up in between lock and unlock functions; with it,
they are located at the end.  I commented out the unused trylock
slowpath function and added a dummy function at the end to make gcc
generate similar assembly layout.

At this point, the only conclusions I can draw are,

* Using adaptive spinning on mutex_trylock() doesn't seem to buy
  anything according to btrfs dbench 50 runs.

and much more importantly,

* btrfs dbench 50 runs are probably not good for measuring subtle
  mutex performance differences.  Maybe it's too macro and there are
  larger scale tendencies which skew the result unless the number of
  runs is vastly increased (but 40 runs are already over eight
  hours).

If anyone can provide an explanation on what's going on, I'll be super
happy.  Otherwise, for now, I'll just leave it alone.  :-(

Thanks.

-- 
tejun


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-29 Thread Tejun Heo
Here's the combined patch I was planning on testing but didn't get to
(yet).  It implements two things - hard limit on spin duration and
early break if the owner also is spinning on a mutex.

Thanks.

Index: work1/include/linux/sched.h
===
--- work1.orig/include/linux/sched.h
+++ work1/include/linux/sched.h
@@ -359,7 +359,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: work1/kernel/sched.c
===
--- work1.orig/kernel/sched.c
+++ work1/kernel/sched.c
@@ -536,6 +536,10 @@ struct rq {
struct hrtimer hrtick_timer;
 #endif
 
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+   bool spinning_on_mutex;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
/* latency stats */
struct sched_info rq_sched_info;
@@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 /*
- * Look out! owner is an entirely speculative pointer
- * access and not reliable.
+ * Maximum mutex owner spin duration in nsecs.  Don't spin more than
+ * DEF_TIMESLICE.
+ */
+#define MAX_MUTEX_SPIN_NS  (DEF_TIMESLICE * 1000000000LLU / HZ)
+
+/**
+ * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
+ * @lock: the mutex to spin on
+ * @owner: the current owner (speculative pointer)
+ *
+ * The caller is trying to acquire @lock held by @owner.  If @owner is
+ * currently running, it might get unlocked soon and spinning on it can
+ * save the overhead of sleeping and waking up.
+ *
+ * Note that @owner is completely speculative and may be completely
+ * invalid.  It should be accessed very carefully.
+ *
+ * Forward progress is guaranteed regardless of locking ordering by never
+ * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
+ * mutex_trylock(), which doesn't have to follow the usual locking
+ * ordering, also uses this function.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if the lock was released and the caller should retry locking.
+ * %false if the caller better go sleeping.
  */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
+bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 {
+   unsigned long start;
unsigned int cpu;
struct rq *rq;
+   bool ret = true;
 
if (!sched_feat(OWNER_SPIN))
-   return 0;
+   return false;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
/*
@@ -4039,7 +4071,7 @@ int mutex_spin_on_owner(struct mutex *lo
 * the mutex owner just released it and exited.
 */
if (probe_kernel_address(&owner->cpu, cpu))
-   return 0;
+   return false;
 #else
cpu = owner->cpu;
 #endif
@@ -4049,15 +4081,17 @@ int mutex_spin_on_owner(struct mutex *lo
 * the cpu field may no longer be valid.
 */
if (cpu >= nr_cpumask_bits)
-   return 0;
+   return false;
 
/*
 * We need to validate that we can do a
 * get_cpu() and that we have the percpu area.
 */
if (!cpu_online(cpu))
-   return 0;
+   return false;
 
+   this_rq()->spinning_on_mutex = true;
+   start = local_clock();
rq = cpu_rq(cpu);
 
for (;;) {
@@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
 * we likely have heavy contention. Return 0 to quit
 * optimistic spinning and not contend further:
 */
-   if (lock->owner)
-   return 0;
+   ret = !lock->owner;
break;
}
 
/*
-* Is that owner really running on that cpu?
+* Quit spinning if any of the following is true.
+*
+* - The owner isn't running on that cpu.
+* - The owner also is spinning on a mutex.
+* - Someone else wants to use this cpu.
+* - We've been spinning for too long.
 */
-   if (task_thread_info(rq->curr) != owner || need_resched())
-   return 0;
+   if (task_thread_info(rq->curr) != owner ||
+   rq->spinning_on_mutex || need_resched() ||
+   local_clock() > start + MAX_MUTEX_SPIN_NS) {
+   ret = false;
+   break;
+   }
 
arch_mutex_cpu_relax();
}
 
-   return 1;
+   this_rq()->spinning_on_mutex = false;
+   return ret;
 }

Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-29 Thread Peter Zijlstra
On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
 Here's the combined patch I was planning on testing but didn't get to
 (yet).  It implements two things - hard limit on spin duration and
 early break if the owner also is spinning on a mutex.

This is going to give massive conflicts with

 https://lkml.org/lkml/2011/3/2/286
 https://lkml.org/lkml/2011/3/2/282

which I was planning to stuff into .40


 @@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
  
  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
  /*
 + * Maximum mutex owner spin duration in nsecs.  Don't spin more than
 + * DEF_TIMESLICE.
 + */
 +#define MAX_MUTEX_SPIN_NS  (DEF_TIMESLICE * 1000000000LLU / HZ)

DEF_TIMESLICE is SCHED_RR only, so its use here is dubious at best, also
I bet we have something like NSEC_PER_SEC to avoid counting '0's.
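
Presumably something along these lines, as one concrete reading of that
suggestion (untested; DEF_TIMESLICE is in jiffies, so scale by
nanoseconds per jiffy):

	#define MAX_MUTEX_SPIN_NS	(DEF_TIMESLICE * (NSEC_PER_SEC / HZ))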

 +
 +/**
 + * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
 + * @lock: the mutex to spin on
 + * @owner: the current owner (speculative pointer)
 + *
 + * The caller is trying to acquire @lock held by @owner.  If @owner is
 + * currently running, it might get unlocked soon and spinning on it can
 + * save the overhead of sleeping and waking up.
 + *
 + * Note that @owner is completely speculative and may be completely
 + * invalid.  It should be accessed very carefully.
 + *
 + * Forward progress is guaranteed regardless of locking ordering by never
 + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
 + * mutex_trylock(), which doesn't have to follow the usual locking
 + * ordering, also uses this function.

While that puts a limit on things it'll still waste time. I'd much
 rather pass a trylock argument to mutex_spin_on_owner() and then bail
on owner also spinning.

 + * CONTEXT:
 + * Preemption disabled.
 + *
 + * RETURNS:
 + * %true if the lock was released and the caller should retry locking.
 + * %false if the caller better go sleeping.
   */
 -int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 +bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
  {

 @@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
* we likely have heavy contention. Return 0 to quit
* optimistic spinning and not contend further:
*/
 + ret = !lock->owner;
   break;
   }
  
   /*
 -  * Is that owner really running on that cpu?
 +  * Quit spinning if any of the following is true.
 +  *
 +  * - The owner isn't running on that cpu.
 +  * - The owner also is spinning on a mutex.
 +  * - Someone else wants to use this cpu.
 +  * - We've been spinning for too long.
*/
 + if (task_thread_info(rq->curr) != owner ||
 + rq->spinning_on_mutex || need_resched() ||
 + local_clock() > start + MAX_MUTEX_SPIN_NS) {

While we did our best with making local_clock() cheap, I'm still fairly
uncomfortable with putting it in such a tight loop.

 + ret = false;
 + break;
 + }
  
   arch_mutex_cpu_relax();
   }
  
 + this_rq()->spinning_on_mutex = false;
 + return ret;
  }
  #endif
  





Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Tejun Heo
Hello, Steven, Linus.

On Thu, Mar 24, 2011 at 09:38:58PM -0700, Linus Torvalds wrote:
 On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt rost...@goodmis.org wrote:
 
  But now, mutex_trylock(B) becomes a spinner too, and since the B's owner
  is running (spinning on A) it will spin as well waiting for A's owner to
  release it. Unfortunately, A's owner is also spinning waiting for B to
  release it.
 
  If both A and B's owners are real time tasks, then boom! deadlock.
 
 Hmm. I think you're right. And it looks pretty fundamental - I don't
 see any reasonable approach to avoid it.

Hmmm... I have an idea.  Will play with it a bit and post if it works
out okay.

 I think the RT issue is a red herring too - afaik, you can get a
 deadlock with two perfectly normal processes too. Of course, for
 non-RT tasks, any other process will eventually disturb the situation
 and you'd get kicked out due to need_resched(), but even that might be
 avoided for a long time if there are other CPU's - leading to tons of
 wasted CPU time.

Yeap, need_resched() currently is the only thing which limits the
duration of spinning when the owner continues to run.

Thanks.

-- 
tejun


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Tejun Heo
Hello,

On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
 USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
  SIMPLE 61107  354977    217  8099529  845.100 MB/sec
  SPIN   63140  364888    214  6840527  879.077 MB/sec
 
 On various runs, the adaptive spinning trylock consistently posts
 higher throughput.  The amount of difference varies but it outperforms
 consistently.

I've been running more of these tests and am having doubts about the
consistency.  It seems that, even on a fresh filesystem, some random
initial condition seems to have persistent effect on the whole run.
I'll run more tests and report back.

Thanks.

-- 
tejun


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Ingo Molnar

* Tejun Heo t...@kernel.org wrote:

 Hello,
 
 On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
  USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
   SIMPLE 61107  354977    217  8099529  845.100 MB/sec
   SPIN   63140  364888    214  6840527  879.077 MB/sec
  
  On various runs, the adaptive spinning trylock consistently posts
  higher throughput.  The amount of difference varies but it outperforms
  consistently.
 
 I've been running more of these tests and am having doubts about the
 consistency.  It seems that, even on a fresh filesystem, some random
 initial condition seems to have persistent effect on the whole run.
 I'll run more tests and report back.

Ok, and there's the deadlock issue as well which Steve noticed.

I'll zap the patches from tip:core/urgent and let's do this via tip:core/locking 
with a .40 timeframe and plenty of time for testing.

Thanks,

Ingo


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Andrey Kuzmin
On Fri, Mar 25, 2011 at 6:39 AM, Steven Rostedt rost...@goodmis.org wrote:
 On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
 Adaptive owner spinning used to be applied only to mutex_lock().  This
 patch applies it also to mutex_trylock().

 btrfs has developed custom locking to avoid excessive context switches
 in its btree implementation.  Generally, doing away with the custom
 implementation and just using the mutex shows better behavior;
 however, there's an interesting distinction in the custom implementation
 of trylock. It distinguishes between simple trylock and tryspin,
 where the former just tries once and then fails while the latter does
 some spinning before giving up.

 Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
 just once.  I got curious whether using adaptive spinning on
 mutex_trylock() would be beneficial and it seems so, for btrfs anyway.

 The following results are from dbench 50 run on an opteron two
 socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
 During the run, disk stays mostly idle and all CPUs are fully occupied
 and the difference in locking performance becomes quite visible.

 SIMPLE is with the locking simplification patch[1] applied.  i.e. it
 basically just uses mutex.  SPIN is with this patch applied on top -
 mutex_trylock() uses adaptive spinning.

         USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
  SIMPLE 61107  354977    217  8099529  845.100 MB/sec
  SPIN   63140  364888    214  6840527  879.077 MB/sec

 On various runs, the adaptive spinning trylock consistently posts
 higher throughput.  The amount of difference varies but it outperforms
 consistently.

 In general, using adaptive spinning on trylock makes sense as trylock
 failure usually leads to costly unlock-relock sequence.

 [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

 Signed-off-by: Tejun Heo t...@kernel.org

 I'm curious about the effects that this has on those places that do:

 again:
        mutex_lock(A);
        if (!mutex_trylock(B)) {
                mutex_unlock(A);
                goto again;
        }


 Where the normal locking order is:
  B -> A

 If another location does:

        mutex_lock(B);
        [...]
        mutex_lock(A);

 But another process has A already, and is running, it may spin waiting
 for A as A's owner is still running.

 But now, mutex_trylock(B) becomes a spinner too, and since the B's owner
 is running (spinning on A) it will spin as well waiting for A's owner to
 release it. Unfortunately, A's owner is also spinning waiting for B to
 release it.

 If both A and B's owners are real time tasks, then boom! deadlock.

Turning try_lock into an indefinitely spinning one breaks its semantics,
so deadlock is to be expected. But what's wrong in this scenario if
try_lock spins a bit before giving up?

Regards,
Andrey


 -- Steve



Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Steven Rostedt
On Fri, 2011-03-25 at 07:53 +0100, Tejun Heo wrote:
 Hello, Steven, Linus.
 
 On Thu, Mar 24, 2011 at 09:38:58PM -0700, Linus Torvalds wrote:
  On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt rost...@goodmis.org wrote:
  
   But now, mutex_trylock(B) becomes a spinner too, and since the B's owner
   is running (spinning on A) it will spin as well waiting for A's owner to
   release it. Unfortunately, A's owner is also spinning waiting for B to
   release it.
  
   If both A and B's owners are real time tasks, then boom! deadlock.
  
  Hmm. I think you're right. And it looks pretty fundamental - I don't
  see any reasonable approach to avoid it.
 
 Hmmm... I have an idea.  Will play with it a bit and post if it works
 out okay.

One solution is to have this be only done on explicit trylocks. Perhaps
introduce a mutex_trylock_spin()? Then when the developer knows that
this scenario does not exist, they can convert mutex_trylocks() into
this spinning version.
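
A minimal sketch of such an opt-in variant, reusing the mutex_spin()
helper from Tejun's refactoring patch.  This is hypothetical code; no
such function exists in the tree, and the body just mirrors the trylock
slowpath from patch 2/2:

	int mutex_trylock_spin(struct mutex *lock)
	{
		int ret;

		preempt_disable();
		ret = mutex_spin(lock);	/* bounded optimistic spin;
					 * acquires the lock on success */
		if (ret) {
			mutex_set_owner(lock);
			mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
		}
		preempt_enable();

		return ret ? 1 : mutex_trylock(lock);	/* final plain attempt */
	}

Callers that know the reversed-order retry pattern cannot bite them
would convert; everyone else keeps the try-once mutex_trylock().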

 
  I think the RT issue is a red herring too - afaik, you can get a
  deadlock with two perfectly normal processes too. Of course, for
  non-RT tasks, any other process will eventually disturb the situation
  and you'd get kicked out due to need_resched(), but even that might be
  avoided for a long time if there are other CPU's - leading to tons of
  wasted CPU time.
 
 Yeap, need_resched() currently is the only thing which limits the
 duration of spinning when the owner continues to run.

Yeah, I was about to complain about the long latencies that this could
cause, then I realized that RT tasks would in fact deadlock the system
here, which I thought was a bigger problem, and focused on that issue.

-- Steve




Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Steven Rostedt
On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
 Turning try_lock into an indefinitely spinning one breaks its semantics,
 so deadlock is to be expected. But what's wrong in this scenario if
 try_lock spins a bit before giving up? 

Because that will cause this scenario to spin that little longer
always, and introduce latencies that did not exist before. Either the
solution does not break this scenario, or it should not go in.

-- Steve




Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Steven Rostedt
On Fri, 2011-03-25 at 09:10 -0400, Steven Rostedt wrote:

 One solution is to have this be only done on explicit trylocks. Perhaps
 introduce a mutex_trylock_spin()? Then when the developer knows that
 this scenario does not exist, they can convert mutex_trylocks() into
 this spinning version.
 

I'm not sure this is even worth it. Looking at the
btrfs/extent-tree.c code, this is the main reason to use mutex_trylock().

I guess what you see in your benchmarks is that trylock contention
happens mostly in the non-deadlock scenario. But I bet you have
latencies when it does happen, but the benefit seems to outweigh them in
the results.

I wonder what happens if you run dbench as an RT task.

-- Steve




Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Andrey Kuzmin
On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt rost...@goodmis.org wrote:
 On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
 Turning try_lock into an indefinitely spinning one breaks its semantics,
 so deadlock is to be expected. But what's wrong in this scenario if
 try_lock spins a bit before giving up?

 Because that will cause this scenario to spin that little longer
 always, and introduce latencies that did not exist before. Either the
 solution does not break this scenario, or it should not go in.

Broken semantics and extra latency are two separate issues. If the
former is fixed, the latter is easily handled by introducing a new
mutex_trylock_spin() call that lets one either stick to the existing
behavior (try/fail) or choose a new one where the latency penalty is
justified by locking patterns.

Regards,
Andrey


 -- Steve





Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Steven Rostedt
On Fri, 2011-03-25 at 16:50 +0300, Andrey Kuzmin wrote:
 On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt rost...@goodmis.org wrote:
  On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
  Turning try_lock into an indefinitely spinning one breaks its semantics,
  so deadlock is to be expected. But what's wrong in this scenario if
  try_lock spins a bit before giving up?
 
  Because that will cause this scenario to spin that little longer
  always, and introduce latencies that did not exist before. Either the
  solution does not break this scenario, or it should not go in.
 
 Broken semantics and extra latency are two separate issues. If the
 former is fixed, the latter is easily handled by introducing a new
 mutex_trylock_spin() call that lets one either stick to the existing
 behavior (try/fail) or choose a new one where the latency penalty is
 justified by locking patterns.
 

For those wanting a more RT deterministic OS, I will argue against
latency penalties.

-- Steve




Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Ingo Molnar

* Steven Rostedt rost...@goodmis.org wrote:

 On Fri, 2011-03-25 at 16:50 +0300, Andrey Kuzmin wrote:
  On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt rost...@goodmis.org wrote:
   On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
   Turning try_lock into an indefinitely spinning one breaks its semantics,
   so deadlock is to be expected. But what's wrong in this scenario if
   try_lock spins a bit before giving up?
  
   Because that will cause this scenario to spin that little longer
   always, and introduce latencies that did not exist before. Either the
   solution does not break this scenario, or it should not go in.
  
  Broken semantics and extra latency are two separate issues. If the
  former is fixed, the latter is easily handled by introducing a new
  mutex_trylock_spin() call that lets one either stick to the existing
  behavior (try/fail) or choose a new one where the latency penalty is
  justified by locking patterns.
  
 
 For those wanting a more RT deterministic OS, I will argue against
 latency penalties.

Later mails from Tejun suggest that the benchmark results are varied, and that 
it's not a clear win after all.

It's possible that if useless spinning is introduced then that might explain 
such workload variations. I.e. it's not really 'latencies' but 'unnecessary 
overhead spent spinning' - and also 'extra non-deterministic noise' - none of 
which help consistent performance.

Thanks,

Ingo


[PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Tejun Heo
Adaptive owner spinning used to be applied only to mutex_lock().  This
patch applies it also to mutex_trylock().

btrfs has developed custom locking to avoid excessive context switches
in its btree implementation.  Generally, doing away with the custom
implementation and just using the mutex shows better behavior;
however, there's an interesting distinction in the custom implementation
of trylock.  It distinguishes between simple trylock and tryspin,
where the former just tries once and then fails while the latter does
some spinning before giving up.

Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
just once.  I got curious whether using adaptive spinning on
mutex_trylock() would be beneficial and it seems so, for btrfs anyway.

The following results are from dbench 50 run on an opteron two
socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
During the run, disk stays mostly idle and all CPUs are fully occupied
and the difference in locking performance becomes quite visible.

SIMPLE is with the locking simplification patch[1] applied.  i.e. it
basically just uses mutex.  SPIN is with this patch applied on top -
mutex_trylock() uses adaptive spinning.

USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
 SIMPLE 61107  354977    217  8099529  845.100 MB/sec
 SPIN   63140  364888    214  6840527  879.077 MB/sec

On various runs, the adaptive spinning trylock consistently posts
higher throughput.  The amount of difference varies but it outperforms
consistently.

In general, using adaptive spinning on trylock makes sense as trylock
failure usually leads to costly unlock-relock sequence.

[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

Signed-off-by: Tejun Heo t...@kernel.org
LKML-Reference: 20110323153727.gb12...@htj.dyndns.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@redhat.com
Cc: Chris Mason chris.ma...@oracle.com
---
 kernel/mutex.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: work/kernel/mutex.c
===
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -443,6 +443,15 @@ static inline int __mutex_trylock_slowpa
unsigned long flags;
int prev;
 
+   preempt_disable();
+
+   if (mutex_spin(lock)) {
+   mutex_set_owner(lock);
+   mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+   preempt_enable();
+   return 1;
+   }
+
spin_lock_mutex(&lock->wait_lock, flags);
 
prev = atomic_xchg(&lock->count, -1);
@@ -456,6 +465,7 @@ static inline int __mutex_trylock_slowpa
atomic_set(&lock->count, 0);
 
spin_unlock_mutex(&lock->wait_lock, flags);
+   preempt_enable();
 
return prev == 1;
 }


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Steven Rostedt
On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
 Adaptive owner spinning used to be applied only to mutex_lock().  This
 patch applies it also to mutex_trylock().
 
 btrfs has developed custom locking to avoid excessive context switches
 in its btree implementation.  Generally, doing away with the custom
 implementation and just using the mutex shows better behavior;
 however, there's an interesting distinction in the custom implementation
 of trylock.  It distinguishes between simple trylock and tryspin,
 where the former just tries once and then fails while the latter does
 some spinning before giving up.
 
 Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
 just once.  I got curious whether using adaptive spinning on
 mutex_trylock() would be beneficial and it seems so, for btrfs anyway.
 
 The following results are from dbench 50 run on an opteron two
 socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
 During the run, disk stays mostly idle and all CPUs are fully occupied
 and the difference in locking performance becomes quite visible.
 
 SIMPLE is with the locking simplification patch[1] applied.  i.e. it
 basically just uses mutex.  SPIN is with this patch applied on top -
 mutex_trylock() uses adaptive spinning.
 
 USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
  SIMPLE 61107  354977    217  8099529  845.100 MB/sec
  SPIN   63140  364888    214  6840527  879.077 MB/sec
 
 On various runs, the adaptive spinning trylock consistently posts
 higher throughput.  The amount of difference varies but it outperforms
 consistently.
 
 In general, using adaptive spinning on trylock makes sense as trylock
 failure usually leads to costly unlock-relock sequence.
 
 [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
 
 Signed-off-by: Tejun Heo t...@kernel.org

I'm curious about the effects that this has on those places that do:

again:
	mutex_lock(A);
	if (!mutex_trylock(B)) {
		mutex_unlock(A);
		goto again;
	}


Where the normal locking order is:
 B -> A

If another location does:

mutex_lock(B);
[...]
mutex_lock(A);

But another process has A already, and is running, it may spin waiting
for A as A's owner is still running.

But now, mutex_trylock(B) becomes a spinner too, and since the B's owner
is running (spinning on A) it will spin as well waiting for A's owner to
release it. Unfortunately, A's owner is also spinning waiting for B to
release it.

If both A and B's owners are real time tasks, then boom! deadlock.

-- Steve



Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Linus Torvalds
On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt rost...@goodmis.org wrote:

 But now, mutex_trylock(B) becomes a spinner too, and since the B's owner
 is running (spinning on A) it will spin as well waiting for A's owner to
 release it. Unfortunately, A's owner is also spinning waiting for B to
 release it.

 If both A and B's owners are real time tasks, then boom! deadlock.

Hmm. I think you're right. And it looks pretty fundamental - I don't
see any reasonable approach to avoid it.

I think the RT issue is a red herring too - afaik, you can get a
deadlock with two perfectly normal processes too. Of course, for
non-RT tasks, any other process will eventually disturb the situation
and you'd get kicked out due to need_resched(), but even that might be
avoided for a long time if there are other CPU's - leading to tons of
wasted CPU time.

   Linus