Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 07:32:05AM -0400, Kent Overstreet wrote:
> It does strike me that the whole optimistic spin algorithm
> (mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
> out. They've been growing more optimizations I see, and the optimizations mostly
> aren't specific to either lock.

There's a few unfortunate differences between the two; but yes it would
be good if we could reduce some of the duplication found there.

One of the distinct differences is that mutex (now) has the lock state
and owner in a single atomic word, while rwsem still tracks the owner in
a separate word and thus needs to account for 'inconsistent' owner
state.

And then there's warts such as ww_mutex and rwsem_owner_is_reader and
similar.


Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 06:18:04AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> > 
> > No.. and most certainly not without a _very_ good reason.
> 
> Ok, can I ask why?

Because it is an internal helper for lock implementations that want to
do optimistic spinning, it isn't a lock on its own and lacks several
things you would expect.

Using it is tricky and I don't trust random module authors to get 1+1
right, let alone use this thing correctly (no judgement on your code,
just in general).
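
For background, a minimal sketch (not from this thread) of roughly how a lock
implementation wraps osq_lock()/osq_unlock() around its optimistic spin loop,
in the style of mutex_optimistic_spin(); my_lock, my_owner_running() and the
state field are made-up placeholders:

struct my_lock {
	atomic_t			state;	/* 0 == unlocked */
	struct task_struct		*owner;
	struct optimistic_spin_queue	osq;
};

static bool my_owner_running(struct my_lock *lock);	/* hypothetical helper */

static bool my_optimistic_spin(struct my_lock *lock)
{
	bool acquired = false;

	preempt_disable();
	/* osq_lock() ensures only one CPU at a time spins on the lock word. */
	if (!osq_lock(&lock->osq))
		goto out;

	while (my_owner_running(lock)) {
		if (atomic_cmpxchg(&lock->state, 0, 1) == 0) {
			acquired = true;
			break;
		}
		if (need_resched())
			break;
		cpu_relax();
	}

	osq_unlock(&lock->osq);
out:
	preempt_enable();
	return acquired;
}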

> Here's what it's for:

I'll try and have a look soon :-) But does that really _have_ to live in
a module?


Re: [PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 06:13:53AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > > bcachefs makes use of them - also, add a proper lg_lock_init()
> > 
> > Why?! lglocks are horrid things, we got rid of them for a reason. They
> > have terrifying worst case preemption off latencies.
> 
> Ah. That was missing from your commit message.

Yeah, sorry, sometimes it's hard to state what is obvious to oneself :/

> > Why can't you use something like per-cpu rwsems?
> 
> Well,
> 
>  a) in my use case, lg_global_lock() pretty much isn't used in normal operation,
> it's only called when starting mark and sweep gc (which is not needed
> anymore and disabled by default, it'll eventually get rolled into online
> fsck) and for device resize
> 
>  b) I'm using it in conjunction with percpu counters, and technically yes I
> certainly _could_ use per-cpu sleepable locks (mutexes would make more sense
> for me than rwsems), there's a bit of a clash there and it's going to be a
> bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
> makes any sense in that case, so it'd mean auditing and converting all the
> code that touches the relevant data structures).

Well, lg is a reader-writer style lock per definition, as you want
concurrency on the local and full exclusion against the global, so I'm
not sure how mutexes fit into this.

In any case, have a look at percpu_down_read_preempt_disable() and
percpu_up_read_preempt_enable(); they're a bit of a hack but they should
work for you I think.

They will sleep at down_read, but the entire actual critical section
will be with preemption disabled -- therefore it had better be short and
bounded, and the RT guys will thank you for not using spinlock_t under
it (use raw_spinlock_t if you have to).

The (global) writer side will block and be preemptible like normal.
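
A minimal sketch of that usage (assuming the 4.17-era percpu-rwsem API; the
counter and lock names below are placeholders, not bcachefs code):

static DEFINE_PER_CPU(u64, my_counter);		/* placeholder per-cpu data */
static DEFINE_STATIC_PERCPU_RWSEM(mark_lock);	/* placeholder lock name */

static void local_update(void)
{
	/* May sleep here, but the critical section runs with preemption off. */
	percpu_down_read_preempt_disable(&mark_lock);

	this_cpu_inc(my_counter);

	percpu_up_read_preempt_enable(&mark_lock);
}

static void global_sync(void)
{
	/* Writer side blocks and is preemptible like a normal rwsem. */
	percpu_down_write(&mark_lock);
	/* ... walk all per-cpu state ... */
	percpu_up_write(&mark_lock);
}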

> If you're really that dead set against lglocks I might just come up with a new
> lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if the
> global lock is held...

See above, we already have this ;-)


Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:

No.. and most certainly not without a _very_ good reason.


Re: [PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> bcachefs makes use of them - also, add a proper lg_lock_init()

Why?! lglocks are horrid things, we got rid of them for a reason. They
have terrifying worst case preemption off latencies.

Why can't you use something like per-cpu rwsems?


Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread Peter Zijlstra
On Fri, Apr 21, 2017 at 08:03:13AM -0600, Jens Axboe wrote:
> You have it so easy - the code is completely standalone, building a
> small test framework around it and measuring performance in _user space_
> is trivial.

Something like this you mean:

  
https://lkml.kernel.org/r/20170322165144.dtidvvbxey7w5...@hirez.programming.kicks-ass.net




Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread Peter Zijlstra
On Mon, Feb 20, 2017 at 07:41:01AM -0800, James Bottomley wrote:
> On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> > On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > > Now when new refcount_t type and API are finally merged
> > > (see include/linux/refcount.h), the following
> > > patches convert various refcounters in the block subsystem from
> > > atomic_t to refcount_t. By doing this we prevent intentional or
> > > accidental underflows or overflows that can lead to use-after-free
> > > vulnerabilities.
> 
> This description isn't right ... nothing is prevented; we get warnings
> on saturation and use after free with this.

The thing that is prevented is overflow and then a use-after-free by
making it a leak.

Modular stuff: you put and free at (n+1) mod n; by saturating at n-1
we'll never get there.

So you lose use-after-free, you gain a resource leak. The general idea
being that use-after-free is a nice trampoline for exploits, leaks are
'only' a DoS.
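
A userspace sketch of that wrap-versus-saturate argument (illustrative only,
not the kernel's refcount_t implementation):

#include <limits.h>

static unsigned int wrap_inc(unsigned int c)
{
	return c + 1;				/* atomic_t style: UINT_MAX + 1 == 0 */
}

static unsigned int sat_inc(unsigned int c)
{
	return c == UINT_MAX ? c : c + 1;	/* refcount_t style: pins at the ceiling */
}

/*
 * With wrap_inc(), enough gets wrap the counter to 0; the next put then
 * appears to drop the last reference and frees the object while live
 * references remain: a use-after-free.
 *
 * With sat_inc(), the counter sticks at the ceiling and never reaches 0
 * again, so the object is merely leaked: a DoS, not an exploit trampoline.
 */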



Re: bio linked list corruption.

2016-12-06 Thread Peter Zijlstra
On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote:
> Adding the scheduler people to the participants list, and re-attaching
> the patch, because while this patch is internal to the VM code, the
> issue itself is not.
> 
> There might well be other cases where somebody goes "wake_up_all()"
> will wake everybody up, so I can put the wait queue head on the stack,
> and then after I've woken people up I can return".
> 
> Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
> does make sure that everybody is woken up, but the usual autoremove
> wake function only removes the wakeup entry if the process was woken
> up by that particular wakeup. If something else had woken it up, the
> entry remains on the list, and the waker in this case returned with
> the wait head still containing entries.
> 
> Which is deadly when the wait queue head is on the stack.

Yes, very much so.

> So I'm wondering if we should make that "synchronous_wake_function()"
> available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
> that uses it.

We could also do some debug code that tracks the ONSTACK-ness and warns
on autoremove. Something like the below, equally untested.

> 
> Of course, I'm really hoping that this shmem.c use is the _only_ such
> case.  But I doubt it.

$ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
28

---


diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2408e8d5c05c..199faaa89847 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -39,6 +39,9 @@ struct wait_bit_queue {
 struct __wait_queue_head {
spinlock_t  lock;
struct list_head	task_list;
+#ifdef CONFIG_DEBUG_WAITQUEUE
+   int onstack;
+#endif
 };
 typedef struct __wait_queue_head wait_queue_head_t;
 
@@ -56,9 +59,18 @@ struct task_struct;
 #define DECLARE_WAITQUEUE(name, tsk)   \
wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)
 
-#define __WAIT_QUEUE_HEAD_INITIALIZER(name) {  \
+#ifdef CONFIG_DEBUG_WAITQUEUE
+#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
+#else
+#define ___WAIT_QUEUE_ONSTACK(onstack)
+#endif
+
+#define ___WAIT_QUEUE_HEAD_INITIALIZER(name, onstack)  {   \
.lock   = __SPIN_LOCK_UNLOCKED(name.lock),  \
-   .task_list  = { &(name).task_list, &(name).task_list } }
+   .task_list  = { &(name).task_list, &(name).task_list }, \
+   ___WAIT_QUEUE_ONSTACK(onstack) }
+
+#define __WAIT_QUEUE_HEAD_INITIALIZER(name) ___WAIT_QUEUE_HEAD_INITIALIZER(name, 0)
 
 #define DECLARE_WAIT_QUEUE_HEAD(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
@@ -80,11 +92,12 @@ extern void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct
 
 #ifdef CONFIG_LOCKDEP
 # define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
-   ({ init_waitqueue_head(&name); name; })
+   ({ init_waitqueue_head(&name); (name).onstack = 1; name; })
 # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
 #else
-# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
+# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
+   wait_queue_head_t name = ___WAIT_QUEUE_HEAD_INITIALIZER(name, 1)
 #endif
 
 static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 9453efe9b25a..746d00117d08 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -156,6 +156,13 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */
 
+static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
+{
+#ifdef CONFIG_DEBUG_WAITQUEUE
+   WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
+#endif
+}
+
 /*
  * Note: we use "set_current_state()" _after_ the wait-queue add,
  * because we need a memory barrier there on SMP, so that any
@@ -178,6 +185,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 	if (list_empty(&wait->task_list))
 		__add_wait_queue(q, wait);
 	set_current_state(state);
+	prepare_debug(q, wait);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait);
@@ -192,6 +200,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 	if (list_empty(&wait->task_list))
 		__add_wait_queue_tail(q, wait);
 	set_current_state(state);
+	prepare_debug(q, wait);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
@@ -235,6 +244,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 		}
 		set_current_state(state);
 	}
+	prepare_debug(q, wait);
 	spin_unlock_irqrestore(&q->lock, 

Re: [PATCH] rwsem: add rwsem_is_contended

2013-08-31 Thread Peter Zijlstra
On Fri, Aug 30, 2013 at 10:14:01AM -0400, Josef Bacik wrote:
 Btrfs uses an rwsem to control access to its extent tree.  Threads will hold a
 read lock on this rwsem while they scan the extent tree, and if need_resched()
 they will drop the lock and schedule.  The transaction commit needs to take a
 write lock for this rwsem for a very short period to switch out the commit
 roots.  If there are a lot of threads doing this caching operation we can starve
 out the committers which slows everybody out.  To address this we want to add
 this functionality to see if our rwsem has anybody waiting to take a write lock
 so we can drop it and schedule for a bit to allow the commit to continue.


 +/*
 + * check to see if the rwsem we're holding has anybody waiting to acquire it.
 + */
 +int rwsem_is_contended(struct rw_semaphore *sem)
 +{
 + int ret = 0;
 + unsigned long flags;
 +
 + if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
 + return 1;
 + if (!list_empty(&sem->wait_list))
 + ret = 1;
 + raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 + return ret;
 +}
 +
 +EXPORT_SYMBOL(rwsem_is_contended);

Modeled after spin_is_contended(), so no problem with that. One thing I
was wondering about is if it wants to be called
rwsem_is_write_contended() or similar, since it explicitly only tests
for pending writers.
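
For reference, a sketch of the usage pattern the changelog describes
(illustrative only; scan_more()/scan_one() and the rwsem are placeholders,
not the actual btrfs code):

static bool scan_more(void);		/* placeholder */
static void scan_one(void);		/* placeholder */

static void scan_with_backoff(struct rw_semaphore *sem)
{
	down_read(sem);
	while (scan_more()) {
		if (need_resched() || rwsem_is_contended(sem)) {
			/* Get out of the committer's way for a bit. */
			up_read(sem);
			cond_resched();
			down_read(sem);
		}
		scan_one();
	}
	up_read(sem);
}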



Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Peter Zijlstra
On Wed, 2011-03-30 at 10:17 +0200, Tejun Heo wrote:
 Hey, Peter.
 
 On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:
  On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
   Here's the combined patch I was planning on testing but didn't get to
   (yet).  It implements two things - hard limit on spin duration and
   early break if the owner also is spinning on a mutex.
  
  This is going to give massive conflicts with
  
   https://lkml.org/lkml/2011/3/2/286
   https://lkml.org/lkml/2011/3/2/282
  
  which I was planning to stuff into .40
 
 I see.  Adapting shouldn't be hard.  The patch is in proof-of-concept
 stage anyway.
 
   + * Forward progress is guaranteed regardless of locking ordering by never
   + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
   + * mutex_trylock(), which doesn't have to follow the usual locking
   + * ordering, also uses this function.
  
  While that puts a limit on things it'll still waste time. I'd much
  rather pass a trylock argument to mutex_spin_on_owner() and then bail
  on owner also spinning.
 
 Do we guarantee or enforce that the lock ownership can't be
 transferred to a different task?  If we do, the recursive spinning
 detection is enough to guarantee forward progress.

The only way to switch owner is for the current owner to release and a
new owner to acquire the lock. Also we already bail the spin loop when
owner changes.

   + if (task_thread_info(rq->curr) != owner ||
   + rq->spinning_on_mutex || need_resched() ||
   + local_clock() > start + MAX_MUTEX_SPIN_NS) {
  
  While we did our best with making local_clock() cheap, I'm still fairly
  uncomfortable with putting it in such a tight loop.
 
 That's one thing I didn't really understand.  It seems the spinning
 code tried to be light on CPU cycle usage, but we're wasting CPU
 cycles there anyway.  If the spinning can become smarter using some
 CPU cycles, isn't that a gain?  Why is the spinning code optimizing
 for less CPU cycles?

Loop exit latency, mostly: the lighter the loop, the faster you get
through it and the less time you waste once the condition becomes true.
But yeah, what you say is true too, it's a balance game as always.

 Also, it would be great if you can share what you're using for locking
 performance measurements.  My attempts with dbench didn't end too
 well. :-(

The thing I used for the initial implementation is mentioned in the
changelog (0d66bf6d3):

Testing with Ingo's test-mutex application 
(http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep ^avg ops
 avg ops/sec:   296604

 # ./test-mutex-shm V 16 10 | grep ^avg ops
 avg ops/sec:   85870

I've no idea how heavy that is on trylock though, you'd have to look at
that.

Chris did some improvements using dbench in ac6e60ee4 but clearly that
isn't working for you.




Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Peter Zijlstra
On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:
 
 In this case, the only thing we're really missing is a way to mutex_lock
 without the cond_resched() 

So you're trying to explicitly avoid a voluntary preemption point? Seems
like a bad idea, normally people add those :-)


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-29 Thread Peter Zijlstra
On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
 Here's the combined patch I was planning on testing but didn't get to
 (yet).  It implements two things - hard limit on spin duration and
 early break if the owner also is spinning on a mutex.

This is going to give massive conflicts with

 https://lkml.org/lkml/2011/3/2/286
 https://lkml.org/lkml/2011/3/2/282

which I was planning to stuff into .40


 @@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
  
  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
  /*
 + * Maximum mutex owner spin duration in nsecs.  Don't spin more then
 + * DEF_TIMESLICE.
 + */
 +#define MAX_MUTEX_SPIN_NS(DEF_TIMESLICE * 10LLU / HZ)

DEF_TIMESLICE is SCHED_RR only, so its use here is dubious at best, also
I bet we have something like NSEC_PER_SEC to avoid counting '0's.

 +
 +/**
 + * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
 + * @lock: the mutex to spin on
 + * @owner: the current owner (speculative pointer)
 + *
 + * The caller is trying to acquire @lock held by @owner.  If @owner is
 + * currently running, it might get unlocked soon and spinning on it can
 + * save the overhead of sleeping and waking up.
 + *
 + * Note that @owner is completely speculative and may be completely
 + * invalid.  It should be accessed very carefully.
 + *
 + * Forward progress is guaranteed regardless of locking ordering by never
 + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
 + * mutex_trylock(), which doesn't have to follow the usual locking
 + * ordering, also uses this function.

While that puts a limit on things it'll still waste time. I'd much
rather pass a trylock argument to mutex_spin_on_owner() and then bail
on owner also spinning.

 + * CONTEXT:
 + * Preemption disabled.
 + *
 + * RETURNS:
 + * %true if the lock was released and the caller should retry locking.
 + * %false if the caller better go sleeping.
   */
 -int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 +bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
  {

 @@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
* we likely have heavy contention. Return 0 to quit
* optimistic spinning and not contend further:
*/
 + ret = !lock->owner;
   break;
   }
  
   /*
 -  * Is that owner really running on that cpu?
 +  * Quit spinning if any of the followings is true.
 +  *
 +  * - The owner isn't running on that cpu.
 +  * - The owner also is spinning on a mutex.
 +  * - Someone else wants to use this cpu.
 +  * - We've been spinning for too long.
*/
 + if (task_thread_info(rq->curr) != owner ||
 + rq->spinning_on_mutex || need_resched() ||
 + local_clock() > start + MAX_MUTEX_SPIN_NS) {

While we did our best with making local_clock() cheap, I'm still fairly
uncomfortable with putting it in such a tight loop.

 + ret = false;
 + break;
 + }
  
   arch_mutex_cpu_relax();
   }
  
 + this_rq()->spinning_on_mutex = false;
 + return ret;
  }
  #endif
  





Re: [PATCH 1/3] lib: introduce some memory copy macros and functions

2010-09-02 Thread Peter Zijlstra
On Thu, 2010-09-02 at 13:44 +0800, Miao Xie wrote:
 On Wed, 01 Sep 2010 17:25:36 +0200, Peter Zijlstra wrote:
  On Wed, 2010-09-01 at 18:36 +0800, Miao Xie wrote:
  + * This program is free software; you can redistribute it and/or modify it
  + * under the terms of the GNU General Public License as published by the 
  Free
  + * Software Foundation; either version 2.1 of the License, or (at your 
  option)
  + * any later version.
 
  The kernel is GPL v2, see COPYING. GPL v2.1 is not a suitable license.
 
 Ok, I will change it.

Are you allowed to change it? From what I understand the FSF owns the
copyright of that code; at least, that's what the preamble I cut away
implied.




Re: [PATCH 1/3] lib: introduce some memory copy macros and functions

2010-09-02 Thread Peter Zijlstra
On Thu, 2010-09-02 at 09:55 +0200, Peter Zijlstra wrote:
 On Thu, 2010-09-02 at 13:44 +0800, Miao Xie wrote:
  On Wed, 01 Sep 2010 17:25:36 +0200, Peter Zijlstra wrote:
   On Wed, 2010-09-01 at 18:36 +0800, Miao Xie wrote:
   + * This program is free software; you can redistribute it and/or modify 
   it
   + * under the terms of the GNU General Public License as published by 
   the Free
   + * Software Foundation; either version 2.1 of the License, or (at your 
   option)
   + * any later version.
  
   The kernel is GPL v2, see COPYING. GPL v2.1 is not a suitable license.
  
  Ok, I will change it.
 
 Are you allowed to change it? From what I understand the FSF owns the
 copyright of that code; at least, that's what the preamble I cut away
 implied.

n/m, I should probably have read more inbox before replying.



Re: [PATCH 1/3] lib: introduce some memory copy macros and functions

2010-09-01 Thread Peter Zijlstra
On Wed, 2010-09-01 at 18:36 +0800, Miao Xie wrote:
 + * This program is free software; you can redistribute it and/or modify it
 + * under the terms of the GNU General Public License as published by the Free
 + * Software Foundation; either version 2.1 of the License, or (at your 
 option)
 + * any later version. 

The kernel is GPL v2, see COPYING. GPL v2.1 is not a suitable license.


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-26 Thread Peter Zijlstra
On Wed, 2010-08-25 at 20:09 -0700, David Rientjes wrote:
  Oh, we can determine an upper bound.  You might just not like it.
  Actually ext3/ext4 shouldn't be as bad as XFS, which Dave estimated to
  be around 400k for a transaction.  My guess is that the worst case for
  ext3/ext4 is probably around 256k or so; like XFS, most of the time,
  it would be a lot less.  (At least, if data != journalled; if we are
  doing data journalling and every single data block begins with
  0xc03b3998U, we'll need to allocate a 4k page for every single data
  block written.)  We could dynamically calculate an upper bound if we
  had to.  Of course, if ext3/ext4 is attached to a network block
  device, then it could get a lot worse than 256k, of course.
  
 On my 8GB machine, /proc/zoneinfo says the min watermark for ZONE_NORMAL 
 is 5086 pages, or ~20MB.  GFP_ATOMIC would allow access to ~12MB of that, 
 so perhaps we should consider this is an acceptable abuse of GFP_ATOMIC as 
 a fallback behavior when GFP_NOFS or GFP_NOIO fails? 

Agreed with the fact that 400k isn't much to worry about.

Not agreed with the GFP_ATOMIC stmt.

Direct reclaim already has PF_MEMALLOC, but then we also need a
concurrency limit on that, otherwise you can still easily blow through
your reserves by having 100 concurrent users of it, resulting in an
upper bound of 4k instead, which will be too much.

There were patches to limit the direct reclaim contexts, not sure they
ever got anywhere..

It is something to consider in the re-design of the whole
direct-reclaim/writeback paths though..


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:
 Part of the problem is that we have a few places in the kernel where
 failure is really not an option --- or rather, if we're going to fail
 while we're in the middle of doing a commit, our choices really are
 (a) retry the loop in the jbd layer (which Andrew really doesn't
 like), (b) keep our own private cache of free memory so we don't fail
 and/or loop, (c) fail the file system and mark it read-only, or (d)
 panic. 

d) do the allocation before you're committed to going fwd and can still
fail and back out.
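
A sketch of (d) for concreteness (illustrative only; struct thing,
start_transaction(), use() and commit_transaction() are placeholders):

static int fs_operation(void)
{
	struct thing *t = kmalloc(sizeof(*t), GFP_NOFS);

	if (!t)
		return -ENOMEM;		/* nothing committed yet, safe to back out */

	start_transaction();		/* point of no return comes after the allocation */
	use(t);
	commit_transaction();
	return 0;
}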


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 07:57 -0400, Ted Ts'o wrote:
 On Wed, Aug 25, 2010 at 01:35:32PM +0200, Peter Zijlstra wrote:
  On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:
   Part of the problem is that we have a few places in the kernel where
   failure is really not an option --- or rather, if we're going to fail
   while we're in the middle of doing a commit, our choices really are
   (a) retry the loop in the jbd layer (which Andrew really doesn't
   like), (b) keep our own private cache of free memory so we don't fail
   and/or loop, (c) fail the file system and mark it read-only, or (d)
   panic. 
  
  d) do the allocation before you're committed to going fwd and can still
  fail and back out.
 
 Sure in some cases that can be done, but the commit has to happen at
 some point, or we run out of journal space, at which point we're back
 to (c) or (d).

Well (b) sounds a lot saner than either of those. Simply revert to a
state that is sub-optimal but has bounded memory use and reserve that
memory up-front. That way you can always get out of a tight memory spot.

It's what the block layer has always done to avoid the memory deadlock
situation, it has a private stash of BIOs that is big enough to always
service some IO, and as long as IO is happening stuff keeps moving fwd
and we don't deadlock.

Filesystems might have a slightly harder time creating such a bounded
state because there might be more involved like journals and the like,
but still it should be possible to create something like that (my swap
over nfs patches created such a state for the network rx side of
things).
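
The block-layer trick described above, roughly, in API form (illustrative
sketch; the names and the pool size are placeholders):

static mempool_t *bio_reserve;

static int init_reserve(struct kmem_cache *cache)
{
	/* Pre-allocate enough objects up front to always service some IO. */
	bio_reserve = mempool_create_slab_pool(16, cache);
	return bio_reserve ? 0 : -ENOMEM;
}

static void *get_one(void)
{
	/*
	 * Falls back to the private stash when the page allocator has
	 * nothing, and sleeps for objects to be returned via mempool_free()
	 * rather than failing -- forward progress as long as IO completes.
	 */
	return mempool_alloc(bio_reserve, GFP_NOIO);
}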

Also, we cannot let our fear of crappy userspace get in the way of doing
sensible things. Your example of write(2) returning -ENOMEM is not
correct though, the syscall (and the page_mkwrite callback for mmap()s)
happens before we actually dirty the data and need to write things out,
so we can always simply wait for memory to become available to dirty.




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 14:48 +0200, Peter Zijlstra wrote:

   On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:

(a) retry the loop in the jbd layer (which Andrew really doesn't
like), (b) keep our own private cache of free memory so we don't fail
and/or loop, 

 Well (b) sounds a lot saner than either of those. Simply revert to a
 state that is sub-optimal but has bounded memory use and reserve that
 memory up-front. That way you can always get out of a tight memory spot.

Also, there's a good reason for disliking (a): it's a deadlock scenario.
Suppose we need to write out data to get free pages, but the writing out
is blocked on requiring free pages.

There's really nothing the page allocator can do to help you there; it's
a situation you have to avoid getting into.




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 09:20 -0400, Theodore Tso wrote:
 On Aug 25, 2010, at 8:52 AM, Peter Zijlstra wrote:
  Also, there's a good reason for disliking (a), its a deadlock scenario,
  suppose we need to write out data to get free pages, but the writing out
  is blocked on requiring free pages.
  
  There's really nothing the page allocator can do to help you there, its
  a situation you have to avoid getting into.
 
 Well, if all of these users start having their own private pools of
 emergency memory, I'm not sure that's such a great idea either.
 
 And in some cases, there *is* extra memory.  For example, if the
 reason why the page allocator failed is because there isn't enough
 memory in the current process's cgroup, maybe it's important enough
 that the kernel code might decide to say, violate the cgroup
 constraints --- it's more important that we not bring down the entire
 system than to honor whatever yahoo decided that a particular cgroup
 has been set down to something ridiculous like 512mb, when there's
 plenty of free physical memory --- but not in that cgroup.

I'm not sure, but I think the cgroup thing doesn't account kernel
allocations, in which case the above problem doesn't exist.

For the cpuset case we punch through the cpuset constraints for kernel
allocations (unless __GFP_HARDWALL).


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 09:20 -0400, Theodore Tso wrote:
 Well, if all of these users start having their own private pools of
 emergency memory, I'm not sure that's such a great idea either.

That's a secondary problem, and could be reduced by something like the
memory reservation system I provided in the swap-over-nfs patches. Of
course, when all these users can actually happen concurrently there's
really nothing much you can do about it.




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 23:24 +1000, Dave Chinner wrote:
 
 That is, the guarantee that we will always make progress simply does
 not exist in filesystems, so a mempool-like concept seems to me to
 be doomed from the start 

While I appreciate that it might be somewhat (a lot) harder for a
filesystem to provide that guarantee, I'd be deeply worried about your
claim that it's impossible.

It would render a system without swap very prone to deadlocks. Even with
the very tight dirty page accounting we currently have you can fill all
your memory with anonymous pages, at which point there's nothing free
and you require writeout of dirty pages to succeed.



Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:
 There are other places where we can fail safely (for example, in jbd's
 start_this_handle, although that just pushes the layer up the stack,
 and ultimately, to userspace where most userspace programs don't
 really expect ENOMEM to get returned by a random disk write 

While talking with Chris about this, if you can indeed push the error
out that far you can basically ensure this particular fs-op does not
complicate the journal commit and thereby limit the number of extra
entries in your journal, and thus the amount of memory required.

So once stuff starts failing, push the ops back out of the filesystem
code, force a journal commit, and then let these ops retry. There is no
need to actually push the -ENOMEM all the way back to userspace.
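
A sketch of that retry scheme (illustrative only; fs_op_prepare() and
journal_force_commit() are placeholders):

static int fs_op_prepare(void);		/* placeholder: can fail with -ENOMEM */
static void journal_force_commit(void);	/* placeholder */

static int do_fs_op(void)
{
	int ret;

	while ((ret = fs_op_prepare()) == -ENOMEM) {
		/* Back out of the fs code, shrink the journal, then retry. */
		journal_force_commit();
	}
	return ret;	/* -ENOMEM never escapes to userspace */
}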




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 16:11 -0500, Christoph Lameter wrote:
 We already have __GFP_NOFAIL behavior for slab allocations since
 a __GFP_NOFAIL flag is passed through to the page allocator if no objects
 are available.
 
 The page allocator will do the same as when called directly with
 __GFP_NOFAIL and so we have consistent behavior regardless of the
 allocator used.

Thanks for quoting the context to which you're replying, that really
helps parsing things..



Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 16:53 -0400, Ted Ts'o wrote:
 On Wed, Aug 25, 2010 at 03:35:42PM +0200, Peter Zijlstra wrote:
  
  While I appreciate that it might be somewhat (a lot) harder for a
  filesystem to provide that guarantee, I'd be deeply worried about your
  claim that its impossible.
  
  It would render a system without swap very prone to deadlocks. Even with
  the very tight dirty page accounting we currently have you can fill all
  your memory with anonymous pages, at which point there's nothing free
  and you require writeout of dirty pages to succeed.
 
 For file systems that do delayed allocation, the situation is very
 similar to swapping over NFS.  Sometimes in order to make some free
 memory, you need to spend some free memory... 

Which means you need to be able to compute a bounded amount of that
memory.

  which implies that for
 these file systems, being more aggressive about triggering writeout,
 and being more aggressive about throttling processes which are
 creating too many dirty pages, especially dirty delayed allocaiton
 pages (regardless of whether this is via write(2) or accessing mmapped
 memory), is a really good idea.

That seems unrelated, the VM has a strict dirty limit and controls
writeback when needed. That part works.

 A pool of free pages which is reserved for routines that are doing
 page cleaning would probably also be a good idea.  Maybe that's just
 retrying with GFP_ATOMIC if a normal allocation fails, or maybe we
 need our own special pool, or maybe we need to dynamically resize the
 GFP_ATOMIC pool based on how many subsystems might need to use it

We have a smallish reserve, accessible with PF_MEMALLOC, but its use is
not regulated nor bounded; it just mostly works well enough.




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-24 Thread Peter Zijlstra
On Tue, 2010-08-24 at 03:50 -0700, David Rientjes wrote:
 These were added as helper functions for documentation and auditability.
 No future callers should be added.

git grep GFP_NOFAIL isn't auditable enough?

might as well declare these functions deprecated if you really want to
do this.

FWIW I think mason said he'd fix btrfs to not suck like this 'soon'.


Re: [patch 4/5] btrfs: add nofail variant of set_extent_dirty

2010-08-24 Thread Peter Zijlstra
On Tue, 2010-08-24 at 03:50 -0700, David Rientjes wrote:
 Add set_extent_dirty_nofail().  This function is equivalent to
 set_extent_dirty(), except that it will never fail because of allocation
 failure and instead loop forever trying to allocate memory.
 
 If the first allocation attempt fails, a warning will be emitted,
 including a call trace.  Subsequent failures will suppress this warning.
 
 This was added as a helper function for documentation and auditability.
 No future callers should be added.
 
 Signed-off-by: David Rientjes rient...@google.com
 ---
  fs/btrfs/extent-tree.c |8 
  fs/btrfs/extent_io.c   |   19 +++
  fs/btrfs/extent_io.h   |2 ++
  3 files changed, 25 insertions(+), 4 deletions(-)

I'd much rather someone help mason to clean this up.


Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Peter Zijlstra
On Mon, 2009-01-12 at 19:32 +0200, Avi Kivity wrote:
 Peter Zijlstra wrote:
  Spinlocks can use 'pure' MCS locks.

 
 How about this, then.  In mutex_lock(), keep wait_lock locked and only 
 release it when scheduling out.  Waiter spinning naturally follows.  If 
 spinlocks are cache friendly (are they today?) 

(no they're not, Nick's ticket locks still spin on a shared cacheline
IIRC -- the MCS locks mentioned could fix this)

 we inherit that.  If 
 there is no contention on the mutex, then we don't need to reacquire the 
 wait_lock on mutex_unlock() (not that the atomic op is that expensive 
 these days).

That might actually work, although we'd have to move the
__mutex_slowpath_needs_to_unlock() branch outside wait_lock otherwise
we'll deadlock :-)

It might be worth trying this if we get serious fairness issues with the
current construct.
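
For reference, a userspace sketch of the classic MCS queue lock mentioned
above (C11 atomics, illustrative only): each waiter spins on its own node
rather than on the shared lock word, so spinners don't bounce the owner's
cacheline.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;
};

static void mcs_lock_acquire(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *prev;

	atomic_store(&me->next, NULL);
	atomic_store(&me->locked, true);

	prev = atomic_exchange(&l->tail, me);	/* join the queue */
	if (!prev)
		return;				/* queue was empty: lock acquired */

	atomic_store(&prev->next, me);		/* link behind the previous waiter */
	while (atomic_load(&me->locked))
		;				/* spin on our own node only */
}

static void mcs_lock_release(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *next = atomic_load(&me->next);

	if (!next) {
		struct mcs_node *expected = me;

		/* No visible successor: try to mark the queue empty. */
		if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
			return;
		while (!(next = atomic_load(&me->next)))
			;			/* successor is still linking in */
	}
	atomic_store(&next->locked, false);	/* hand the lock to the successor */
}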



[PATCH -v11][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Peter Zijlstra
Full series, including changelogs available at:

 http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/

and should shortly appear in a git tree near Ingo :-)

 mutex: small cleanup
 mutex: preemption fixes
 mutex: implement adaptive spinning
 mutex: set owner in mutex_lock() only
 mutex: adaptive spin for debug too
 mutex: adaptive spin performance tweaks

---
 include/linux/mutex.h   |5 +-
 include/linux/sched.h   |2 
 kernel/mutex-debug.c|9 ---
 kernel/mutex-debug.h|   18 ---
 kernel/mutex.c  |  118 
 kernel/mutex.h  |   22 
 kernel/sched.c  |   71 +++-
 kernel/sched_features.h |1 
 8 files changed, 204 insertions(+), 42 deletions(-)

Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -10,6 +10,11 @@
  * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
  * David Howells for suggestions and improvements.
  *
+ *  - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ *from the -rt tree, where it was originally implemented for rtmutexes
+ *by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ *and Sven Dietrich.
+ *
  * Also see Documentation/mutex-design.txt.
  */
 #include <linux/mutex.h>
@@ -46,6 +51,7 @@ __mutex_init(struct mutex *lock, const c
atomic_set(&lock->count, 1);
spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);
+   mutex_clear_owner(lock);
 
debug_mutex_init(lock, name, key);
 }
@@ -91,6 +97,7 @@ void inline __sched mutex_lock(struct mu
 * 'unlocked' into 'locked' state.
 */
__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+   mutex_set_owner(lock);
 }
 
 EXPORT_SYMBOL(mutex_lock);
@@ -115,6 +122,14 @@ void __sched mutex_unlock(struct mutex *
 * The unlocking fastpath is the 0-1 transition from 'locked'
 * into 'unlocked' state:
 */
+#ifndef CONFIG_DEBUG_MUTEXES
+   /*
+* When debugging is enabled we must not clear the owner before time,
+* the slow path will always be taken, and that clears the owner field
+* after verifying that it was indeed current.
+*/
+   mutex_clear_owner(lock);
+#endif
__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
 }
 
@@ -129,21 +144,71 @@ __mutex_lock_common(struct mutex *lock, 
 {
struct task_struct *task = current;
struct mutex_waiter waiter;
-   unsigned int old_val;
unsigned long flags;
 
+   preempt_disable();
+   mutex_acquire(&lock->dep_map, subclass, 0, ip);
+#ifdef CONFIG_SMP
+   /*
+* Optimistic spinning.
+*
+* We try to spin for acquisition when we find that there are no
+* pending waiters and the lock owner is currently running on a
+* (different) CPU.
+*
+* The rationale is that if the lock owner is running, it is likely to
+* release the lock soon.
+*
+* Since this needs the lock owner, and this mutex implementation
+* doesn't track the owner atomically in the lock field, we need to
+* track it non-atomically.
+*/
+
+   for (;;) {
+   struct thread_info *owner;
+
+   /*
+* If there's an owner, wait for it to either
+* release the lock or go to sleep.
+*/
+   owner = ACCESS_ONCE(lock->owner);
+   if (owner && !mutex_spin_on_owner(lock, owner))
+   break;
+
+   if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+   lock_acquired(&lock->dep_map, ip);
+   preempt_enable();
+   return 0;
+   }
+
+   /*
+* When there's no owner, we might have preempted between the
+* owner acquiring the lock and setting the owner field. If
+* we're an RT task that will live-lock because we won't let
+* the owner complete.
+*/
+   if (!owner && (need_resched() || rt_task(task)))
+   break;
+
+   /*
+* The cpu_relax() call is a compiler barrier which forces
+* everything in this loop to be re-loaded. We don't need
+* memory barriers as we'll eventually observe the right
+* values at the cost of a few extra spins.
+*/
+   cpu_relax();
+   }
+#endif
spin_lock_mutex(&lock->wait_lock, flags);
 
debug_mutex_lock_common(lock, &waiter);
-   mutex_acquire(&lock->dep_map, subclass, 0, ip);
debug_mutex_add_waiter(lock, &waiter, task_thread_info(task));
 
/* add waiting tasks to the end of the waitqueue (FIFO

Re: [PATCH -v11][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Peter Zijlstra
On Wed, 2009-01-14 at 18:18 +0100, Nick Piggin wrote:

  @@ -173,21 +237,21 @@ __mutex_lock_common(struct mutex *lock, 
  spin_unlock_mutex(&lock->wait_lock, flags);
   
  debug_mutex_free_waiter(&waiter);
  +   preempt_enable();
  return -EINTR;
  }
  __set_task_state(task, state);
   
  /* didnt get the lock, go to sleep: */
  spin_unlock_mutex(&lock->wait_lock, flags);
  -   schedule();
  +   __schedule();
 
 Why does this need to do a preempt-disabled schedule? After we schedule
 away, the next task can do arbitrary things or reschedule itself, so if
 we have not anticipated such a condition here, then I can't see what
 __schedule protects. At least a comment is in order?

From:
http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/mutex-preempt.patch

Subject: mutex: preemption fixes
From: Peter Zijlstra a.p.zijls...@chello.nl
Date: Wed Jan 14 15:36:26 CET 2009

The problem is that dropping the spinlock right before schedule is a voluntary
preemption point and can cause a schedule, right after which we schedule again.

Fix this inefficiency by keeping preemption disabled until we schedule, do this
by explicitly disabling preemption and providing a schedule() variant that
assumes preemption is already disabled.

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl

 Pity to add the call overhead to schedule just for this case.

Good point, seeing any way around that?

  BTW. __schedule shouldn't need to be asmlinkage?

TBH I've no clue, probably not, Ingo?



Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Peter Zijlstra
On Wed, 2009-01-14 at 11:36 -0800, Andrew Morton wrote:

 Do people enable CONFIG_SCHED_DEBUG?

Well, I have it always enabled, but I've honestly no idea if that makes
me weird.

 CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty
 small subset?

Could be, do you fancy me doing a sysctl? shouldn't be hard.



Re: [PATCH -v11][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Peter Zijlstra
On Thu, 2009-01-15 at 01:46 +0100, Nick Piggin wrote:

 Hmm, well this is rather a slow path, I would say. I'd prefer not to
 modify schedule in this way (if we just get scheduled back on after
 being switched away, the subsequent call to schedule is going to be
 cache hot and not do too much work).
 
 preempt_enable_noresched maybe if you really care, would close up the
 window even more. But is it really worthwhile? We'd want to see numbers
 (when in doubt, keep it  simpler).

I initially did the preempt_enable_no_resched() thing and that showed
some improvement for PREEMPT=y kernels (lost the numbers though).

When I redid all the patches I tried closing that last hole by doing
that __schedule() thing, never realizing that schedule() would then get
extra overhead... d'oh.

I agree that that isn't worth it. I shall revert to
preempt_enable_no_resched() and try to get some new numbers.
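
A sketch of the two variants being weighed, for clarity (not the actual
patch; spin_unlock_mutex() is the mutex.c-internal helper of that era):

static void sleep_with_resched_window(struct mutex *lock, unsigned long flags)
{
	spin_unlock_mutex(&lock->wait_lock, flags);
	/*
	 * preempt_enable() is a voluntary preemption point: we can be
	 * scheduled away and back here, only to call schedule() again
	 * right after -- two context switches back to back.
	 */
	preempt_enable();
	schedule();
	preempt_disable();
}

static void sleep_without_resched_window(struct mutex *lock, unsigned long flags)
{
	spin_unlock_mutex(&lock->wait_lock, flags);
	/* Skip the resched check; the schedule() below does the real work. */
	preempt_enable_no_resched();
	schedule();
	preempt_disable();
}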



[PATCH -v9][RFC] mutex: implement adaptive spinning

2009-01-13 Thread Peter Zijlstra
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra a.p.zijls...@chello.nl
Date: Mon Jan 12 14:01:47 CET 2009

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 304% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep ^avg ops
 avg ops/sec:   298932

 # ./test-mutex-shm V 16 10 | grep ^avg ops
 avg ops/sec:   98287

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is means to disable this behaviour runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_OWNER_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo OWNER_SPIN > /debug/sched_features

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
---
Changes since -v8:
 - dropped the fairness

Changes since -v7:
 - made all the mutex ops fully non-preemptible
 - implemented FIFO fairness using the wait_list
 - fixed owner tracking issues

 include/linux/mutex.h   |5 +
 include/linux/sched.h   |2 
 kernel/mutex-debug.c|9 ---
 kernel/mutex-debug.h|   18 +++---
 kernel/mutex.c  |  128 +---
 kernel/mutex.h  |   22 +++-
 kernel/sched.c  |   71 +-
 kernel/sched_features.h |1 
 8 files changed, 216 insertions(+), 40 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
	atomic_t		count;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
	struct thread_info	*owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
	const char		*name;
	void			*magic;
 #endif
@@ -68,7 +70,6 @@ struct mutex_waiter {
	struct list_head	list;
	struct task_struct	*task;
 #ifdef CONFIG_DEBUG_MUTEXES
-	struct mutex		*lock;
	void			*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -328,7 +328,9 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+asmlinkage void __schedule(void);
 asmlinkage void schedule(void);
+extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
-   waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -82,7 +76,7 @@ void debug_mutex_unlock(struct mutex *lo
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+   mutex_clear_owner(lock);
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +89,6 @@ void debug_mutex_init(struct mutex *lock
debug_check_no_locks_freed

Re: [PATCH -v9][RFC] mutex: implement adaptive spinning

2009-01-13 Thread Peter Zijlstra
On Tue, 2009-01-13 at 08:16 -0800, Linus Torvalds wrote:
 
 On Tue, 13 Jan 2009, Peter Zijlstra wrote:
  
  Change mutex contention behaviour such that it will sometimes busy wait on
  acquisition - moving its behaviour closer to that of spinlocks.
 
 Okey, dokey. Looks reasonable, but I wonder if this part came from v8 and 
 wasn't intentional:
 
  +   if (atomic_xchg(&lock->count, -1) == 1) {
  +   lock_acquired(&lock->dep_map, ip);
  +   mutex_set_owner(lock);
  +   preempt_enable();
  +   return 0;
  +   }
 
 Now you're forcing the slow-path on unlock. Maybe it was intentional, 
 maybe it wasn't. Did you perhaps mean
 
    if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
 
 here? I thought we agreed it was safe, if only because it should be 
 equivalent to just having done mutex_trylock() instead of a real lock 
 sequence.

Yes, that was an 'accident' from -v8, yes we did think the cmpxchg was
good, however I did get some spurious lockups on -v7, and I only noticed
the thing after I'd done most of the testing, so I decided to let it be
for now.

Let me put the cmpxchg back in and see if this is all still good (only
3*2*2 configs to test :-).


Re: [PATCH -v9][RFC] mutex: implement adaptive spinning

2009-01-13 Thread Peter Zijlstra
On Tue, 2009-01-13 at 17:21 +0100, Peter Zijlstra wrote:
 On Tue, 2009-01-13 at 08:16 -0800, Linus Torvalds wrote:
  
  On Tue, 13 Jan 2009, Peter Zijlstra wrote:
   
   Change mutex contention behaviour such that it will sometimes busy wait on
   acquisition - moving its behaviour closer to that of spinlocks.
  
  Okey, dokey. Looks reasonable, but I wonder if this part came from v8 and 
  wasn't intentional:
  
   + if (atomic_xchg(&lock->count, -1) == 1) {
   + lock_acquired(&lock->dep_map, ip);
   + mutex_set_owner(lock);
   + preempt_enable();
   + return 0;
   + }
  
  Now you're forcing the slow-path on unlock. Maybe it was intentional, 
  maybe it wasn't. Did you perhaps mean
  
   if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
  
  here? I thought we agreed it was safe, if only because it should be 
  equivalent to just having done mutex_trylock() instead of a real lock 
  sequence.
 
 Yes, that was an 'accident' from -v8, yes we did think the cmpxchg was
 good, however I did get some spurious lockups on -v7, and I only noticed
 the thing after I'd done most of the testing, so I decided to let it be
 for now.
 
 Let me put the cmpxchg back in and see if this is all still good (only
 3*2*2 configs to test :-).

Ok, tested only 1, but that was the one I remember lockups from -- and
that seems to be good with the cmpxchg.

Do you fancy me sending v10 or will you make that change locally?


Re: [PATCH -v9][RFC] mutex: implement adaptive spinning

2009-01-13 Thread Peter Zijlstra
On Tue, 2009-01-13 at 08:49 -0800, Linus Torvalds wrote:
 
 So do a v10, and ask people to test.

---
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra a.p.zijls...@chello.nl
Date: Mon Jan 12 14:01:47 CET 2009

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 304% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep ^avg ops
 avg ops/sec:   298932

 # ./test-mutex-shm V 16 10 | grep ^avg ops
 avg ops/sec:   98287

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is means to disable this behaviour runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_OWNER_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo OWNER_SPIN > /debug/sched_features

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
---
Changes since -v9:
 - cmpxchg acquire in the spin loop

Changes since -v8:
 - dropped the fairness

Changes since -v7:
 - made all the mutex ops fully non-preemptible
 - implemented FIFO fairness using the wait_list
 - fixed owner tracking issues

 include/linux/mutex.h   |5 +
 include/linux/sched.h   |2 
 kernel/mutex-debug.c|9 ---
 kernel/mutex-debug.h|   18 +++---
 kernel/mutex.c  |  128 +---
 kernel/mutex.h  |   22 +++-
 kernel/sched.c  |   71 +-
 kernel/sched_features.h |1 
 8 files changed, 216 insertions(+), 40 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
 	struct thread_info	*owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
 	const char		*name;
 	void			*magic;
 #endif
@@ -68,7 +70,6 @@ struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
 #ifdef CONFIG_DEBUG_MUTEXES
-	struct mutex		*lock;
 	void			*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -328,7 +328,9 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+asmlinkage void __schedule(void);
 asmlinkage void schedule(void);
+extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock-wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock-owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
 	ti->task->blocked_on = waiter;
-	waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -82,7 +76,7 @@ void debug_mutex_unlock(struct mutex *lo
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
 	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	mutex_clear_owner(lock);

Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 18:13 +0200, Avi Kivity wrote:

 One thing that worries me here is that the spinners will spin on a 
 memory location in struct mutex, which means that the cacheline holding 
 the mutex (which is likely to be under write activity from the owner) 
 will be continuously shared by the spinners, slowing the owner down when 
 it needs to unshare it.  One way out of this is to spin on a location in 
 struct mutex_waiter, and have the mutex owner touch it when it schedules 
 out.

Yeah, that is what pure MCS locks do -- however I don't think it's a
feasible strategy for this spin/sleep hybrid.

 So:
 - each task_struct has an array of currently owned mutexes, appended to 
 by mutex_lock()

That's not going to fly I think. Lockdep does this but its very
expensive and has some issues. We're currently at 48 max owners, and
still some code paths manage to exceed that.

 - mutex waiters spin on mutex_waiter.wait, which they initialize to zero
 - when switching out of a task, walk the mutex list, and for each mutex, 
 bump each waiter's wait variable, and clear the owner array

Which is O(n).

 - when unlocking a mutex, bump the nearest waiter's wait variable, and 
 remove from the owner array
 
 Something similar might be done to spinlocks to reduce cacheline 
 contention from spinners and the owner.

Spinlocks can use 'pure' MCS locks.
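
For reference, the core of a 'pure' MCS lock is small -- a self-contained
userspace sketch with C11 atomics (illustrative only, not the kernel's
implementation): each waiter queues a node of its own and spins only on that
node, which is exactly the cacheline property Avi is after.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *)	next;
	atomic_bool			locked;	/* true while we must keep waiting */
};

struct mcs_lock {
	_Atomic(struct mcs_node *)	tail;
};

static void mcs_lock_acquire(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *prev;

	atomic_store(&me->next, NULL);
	atomic_store(&me->locked, true);

	prev = atomic_exchange(&l->tail, me);
	if (!prev)
		return;			/* lock was free, we own it */

	atomic_store(&prev->next, me);	/* link behind the previous waiter */
	while (atomic_load(&me->locked))
		;			/* spin on our own node only */
}

static void mcs_lock_release(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *next = atomic_load(&me->next);

	if (!next) {
		struct mcs_node *expected = me;
		/* no known successor: try to swing the tail back to empty */
		if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
			return;
		/* a successor is mid-queueing; wait for it to link itself */
		while (!(next = atomic_load(&me->next)))
			;
	}
	atomic_store(&next->locked, false);	/* hand the lock over */
}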



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 08:20 -0800, Linus Torvalds wrote:
 
 On Mon, 12 Jan 2009, Linus Torvalds wrote:
  
  You made it back into the locked version.
 
 Btw, even if you probably had some reason for this, one thing to note is 
 that I think Chris' performance testing showed that the version using a 
 lock was inferior to his local btrfs hack, while the unlocked version 
 actually beat his hack.
 
 Maybe I misunderstood his numbers, though. But if I followed that sub-part 
 of the test right, it really means that the locked version is pointless - 
 it will never be able to replace peoples local hacks for this same thing, 
 because it just doesn't give the performance people are looking for.
 
 Since the whole (and _only_) point of this thing is to perform well, 
 that's a big deal.

Like said in reply to Chris' email, I just wanted to see if fairness was
worth the effort, because the pure unlocked spin showed significant
unfairness (and I know some people really care about some level of
fairness).

Initial testing with the simple test-mutex thing didn't show too bad
numbers.



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 12:14 -0500, Chris Mason wrote:
 On Mon, 2009-01-12 at 17:50 +0100, Peter Zijlstra wrote:
   
   (the file stat run is total run time, so lower is better.  The other
   numbers are files or MB per second, so higher is better)
   
   For the file create run, v8 had much lower system time than v7,
   averaging 1s of sys time per proc instead of 1.6s.
  
  Right, how about the spread in completion time, because that is the only
  reason I tried this fairness stuff, because you reported massive
  differences there.
  
 
 I reran the numbers with a slightly different kernel config and they
 have changed somewhat.  These are just for the 4k file create run, all
 numbers in files created per second (and the numbers are stable across
 runs)
 
 v8 avg 176.90 median 171.85 std 12.49 high 215.97 low 165.54
 v7 avg 169.02 median 163.77 std 16.82 high 267.95 low 157.95

Any opinions on the fairness matter, will -v9 be unlocked and unfair
again?



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 19:33 +0200, Boaz Harrosh wrote:

 Which brings me back to my initial reaction to this work. Do we need
 two flavors of Mutex? some program sections need Fairness, some need
 performance. Some need low-latency, some need absolute raw CPU power.

Thing is, its the kernel, we cannot have such knobs per task. So we have
to pick one and stick with it -- the 'best' we can do is what PREEMPT_RT
does and replace the thing whole sale at build time.

 Because at the end of the day spinning in a saturated CPU work-load
 that does not care about latency, eats away cycles that could be spent
 on computation. Think multi-threaded video processing for example. 
 Thing I would like to measure is 
 1 - how many times we spin and at the end get a lock
 2 - how many times we spin and at the end sleep.
 3 - how many times we sleep like before.
 vs. In old case CPU spent on scheduling. Just to see if we are actually 
 loosing
 cycles at the end.

Feel free to do so ;-)



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Thu, 2009-01-08 at 11:54 -0800, Linus Torvalds wrote:

 I was pretty sure that adding the unlocked loop should provably not change 
 the mutex lock semantics. Why? Because it's just basically equivalent to 
 just doing the mutex_trylock() without really changing anything really 
 fundamental in the mutex logic.
 
 And that argument is sadly totally bogus. 

It fails for the RT case, yes. It should still be true for regular tasks
- if the owner tracking was accurate.

 The thing is, we used to have this guarantee that any contention would 
 always go into the slowpath, and then in the slow-path we serialize using 
 the spinlock. 
 
 So I think the bug is still there, we just hid it better by breaking out 
 of the loop with that if (need_resched()) always eventually triggering. 
 And it would be ok if it really is guaranteed to _eventually_ trigger, and 
 I guess with timeslices it eventually always will, but I suspect we could 
 have some serious latency spikes.

Yes, the owner getting preempted after acquiring the lock, but before
setting the owner can give some nasties :-(

I initially did that preempt_disable/enable around the fast path, but I
agree that slowing down the fast path is unwelcome.

Alternatively we could go back to block on !owner, with the added
complexity of not breaking out of the spin on lock->owner != owner
when !lock->owner, so that the premature owner clearing of the unlock
fast path will not force a schedule right before we get a chance to
acquire the lock.

Let me do that..

 The problem? Setting lock-count to 0. That will mean that the next 
 mutex_unlock() will not necessarily enter the slowpath at all, and won't 
 necessarily wake things up like it should.

That's exactly what __mutex_fastpath_trylock() does (or can do,
depending on the implementation), so if regular mutexes are correct in
the face of a trylock stealing the lock in front of a woken up waiter,
then we're still good.

That said, I'm not seeing how mutexes aren't broken already.

say A locks it: counter 1->0
then B contends: counter 0->-1, added to wait list
then C contends: counter -1, added to wait list
then A releases: counter -1->1, wake someone up, say B
then D trylocks: counter 1->0
so B is back to wait list
then D releases, 0->1, no wakeup

Aaah, B going back to sleep sets it to -1

Therefore, I think we're good.


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Fri, 2009-01-09 at 10:59 -0500, Steven Rostedt wrote:

  
  Adding that blocking on !owner utterly destroys everything.
 
 I was going to warn you about that ;-)
 
 Without the check for !owner, you are almost guaranteed to go to sleep 
 every time. Here's why:
 
 You are spinning and thus have a hot cache on that CPU.
 
 The owner goes to unlock but will be in a cold cache. It sets lock->owner 
 to NULL, but is still in cold cache so it is a bit slower.
 
 Once the spinner sees the NULL, it shoots out of the spin but sees the 
 lock is still not available then goes to sleep. All before the owner could 
 release it. This could probably happen at every contention. Thus, you lose 
 the benefit of spinning. You probably make things worse because you add a 
 spin before every sleep.

Which is why I changed the inner loop to:

  l_owner = ACCESS_ONCE(lock->owner)
  if (l_owner && l_owner != owner)
        break

So that that would continue spinning.
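
Modelled in plain userspace C (names invented, not the actual patch), the
point is that the unlocker clears ->owner before it releases the lock word,
so a NULL owner has to be read as "release in progress, keep spinning" rather
than "go to sleep":

#include <stdatomic.h>
#include <stddef.h>

struct model_lock {
	atomic_int	locked;	/* 1 = held, 0 = free */
	_Atomic(void *)	owner;	/* cleared *before* 'locked' on release */
};

/* returns 1 when the lock was acquired, 0 when the caller should sleep */
static int model_spin_acquire(struct model_lock *m, void *self)
{
	void *owner = atomic_load(&m->owner);	/* who we started spinning on */

	for (;;) {
		void *l_owner;
		int expected = 0;

		if (atomic_compare_exchange_weak(&m->locked, &expected, 1)) {
			atomic_store(&m->owner, self);
			return 1;
		}

		l_owner = atomic_load(&m->owner);
		if (l_owner && l_owner != owner)
			return 0;	/* a new owner took over: stop spinning */
		/* l_owner == NULL: release in progress, keep racing for it */
	}
}

static void model_release(struct model_lock *m)
{
	atomic_store(&m->owner, NULL);	/* owner goes away first ...   */
	atomic_store(&m->locked, 0);	/* ... the lock word only later */
}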


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Fri, 2009-01-09 at 11:44 -0500, Steven Rostedt wrote:

 When we get to the schedule() it then needs to be a:
 
   preempt_enable_no_resched();
   schedule();

On that note:

Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -220,7 +220,9 @@ __mutex_lock_common(struct mutex *lock, 
__set_task_state(task, state);
 
/* didnt get the lock, go to sleep: */
+   preempt_disable();
 	spin_unlock_mutex(&lock->wait_lock, flags);
+	preempt_enable_no_resched();
 	schedule();
 	spin_lock_mutex(&lock->wait_lock, flags);
}


actually improves mutex performance on PREEMPT=y


[PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 15:18 +0100, Ingo Molnar wrote:

 [ A detail, -tip testing found that the patch breaks mutex debugging:
 
   =
   [ BUG: bad unlock balance detected! ]
   -
   swapper/0 is trying to release lock (cpu_add_remove_lock) at:
   [8089f540] mutex_unlock+0xe/0x10
   but there are no more locks to release!
 
  but that's a detail for -v7 ;-) ]

Here it is..

I changed the optimistic spin cmpxchg trickery to not require the -1000
state, please double and tripple check the code.

This was done because the interaction between trylock_slowpath and that
-1000 state hurt my head.

---
Subject: 
From: Peter Zijlstra a.p.zijls...@chello.nl
Date: Thu Jan 08 09:41:22 CET 2009

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
---
 include/linux/mutex.h   |4 +-
 include/linux/sched.h   |1 
 kernel/mutex-debug.c|   24 ++-
 kernel/mutex-debug.h|   18 ++-
 kernel/mutex.c  |   76 ++--
 kernel/mutex.h  |   22 -
 kernel/sched.c  |   66 +
 kernel/sched_features.h |1 
 8 files changed, 185 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct thread_info  *owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
const char  *name;
void*magic;
 #endif
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock-wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock-owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -80,9 +75,25 @@ void debug_mutex_unlock(struct mutex *lo
return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
+#if 0
+	/*
+	 * XXX something is iffy with owner tracking, but lockdep - which has
+	 * similar checks - finds everything dandy, so disable for now.
+	 */
+	if (lock->owner != current_thread_info()) {
+		printk(KERN_ERR "%p %p\n", lock->owner, current_thread_info());
+		if (lock->owner) {
+			printk(KERN_ERR "%d %s\n",
+				lock->owner->task->pid,
+				lock->owner->task->comm);
+		}
+		printk(KERN_ERR "%d %s\n",
+			current_thread_info()->task->pid,
+			current_thread_info()->task->comm);
+	}
 	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+#endif
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock-owner != current_thread_info());
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +106,6 @@ void debug_mutex_init(struct mutex *lock
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(lock-dep_map, name, key, 0);
 #endif
-   lock-owner = NULL;
lock-magic = lock;
 }
 
Index: linux-2.6/kernel/mutex-debug.h
===
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock-wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-   lock-owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
@@ -35,6 +27,16 @@ extern void debug_mutex_unlock(struct mu
 extern void debug_mutex_init(struct mutex *lock, const char *name,
 struct lock_class_key *key);
 
+static inline void mutex_set_owner(struct mutex *lock)
+{
+	lock->owner = current_thread_info();
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+	lock->owner = NULL;
+}
+
 #define spin_lock_mutex(lock, flags)   \
do {\
struct mutex *l = container_of(lock, struct mutex, wait_lock); \
Index: linux-2.6/kernel

Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 10:09 -0500, Steven Rostedt wrote:
 On Thu, 8 Jan 2009, Peter Zijlstra wrote:
  Index: linux-2.6/kernel/sched.c
  ===
  --- linux-2.6.orig/kernel/sched.c
  +++ linux-2.6/kernel/sched.c
  @@ -4672,6 +4672,72 @@ need_resched_nonpreemptible:
   }
   EXPORT_SYMBOL(schedule);
   
  +#ifdef CONFIG_SMP
  +/*
  + * Look out! owner is an entirely speculative pointer
  + * access and not reliable.
  + */
  +int spin_on_owner(struct mutex *lock, struct thread_info *owner)
  +{
  +   unsigned int cpu;
  +   struct rq *rq;
  +   int ret = 1;
  +
  +   if (unlikely(!sched_feat(OWNER_SPIN)))
 
 I would remove the unlikely, if someone turns OWNER_SPIN off, then you 
 have the wrong decision being made. Choices by users should never be in a 
 likely or unlikely annotation. It's discrimination ;-)

in the unlikely case we schedule(), that seems expensive enough to want
to make the spin case ever so slightly faster.

  +   return 0;
  +
  +   preempt_disable();
  +#ifdef CONFIG_DEBUG_PAGEALLOC
  +   /*
  +* Need to access the cpu field knowing that
  +* DEBUG_PAGEALLOC could have unmapped it if
  +* the mutex owner just released it and exited.
  +*/
  +	if (probe_kernel_address(&owner->cpu, cpu))
  +		goto out;
  +#else
  +	cpu = owner->cpu;
  +#endif
  +
  +	/*
  +	 * Even if the access succeeded (likely case),
  +	 * the cpu field may no longer be valid.
  +	 */
  +	if (cpu >= nr_cpumask_bits)
  +   goto out;
  +
  +   /*
  +* We need to validate that we can do a
  +* get_cpu() and that we have the percpu area.
  +*/
  +   if (!cpu_online(cpu))
  +   goto out;
 
 Should we need to do a get_cpu or something? Couldn't the CPU disappear 
 between these two calls. Or does it do a stop-machine and the preempt 
 disable will protect us?

Did you miss the preempt_disable() a bit up?

  +
  +   rq = cpu_rq(cpu);
  +
  +   for (;;) {
  +		if (lock->owner != owner)
  +			break;
  +
  +		/*
  +		 * Is that owner really running on that cpu?
  +		 */
  +		if (task_thread_info(rq->curr) != owner)
  +   break;
  +
  +   if (need_resched()) {
  +   ret = 0;
  +   break;
  +   }
  +
  +   cpu_relax();
  +   }
  +out:
  +   preempt_enable_no_resched();
  +   return ret;
  +}
  +#endif



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 10:28 -0500, Steven Rostedt wrote:
 On Thu, 8 Jan 2009, Peter Zijlstra wrote:

  in the unlikely case we schedule(), that seems expensive enough to want
  to make the spin case ever so slightly faster.
 
 OK, that makes sense, but I would comment that. Otherwise, it just looks 
 like another misuse of the unlikely annotation.

OK, sensible enough.

   Should we need to do a get_cpu or something? Couldn't the CPU disappear 
   between these two calls. Or does it do a stop-machine and the preempt 
   disable will protect us?
  
  Did you miss the preempt_disable() a bit up?
 
 No, let me rephrase it better. Does the preempt_disable protect against
 another CPU from going off line? Does taking a CPU off line do a 
 stop_machine?

Yes and yes.


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 08:58 -0800, Linus Torvalds wrote:
 
 Ok, I've gone through -v7, and I'm sure you're all shocked to hear it, but 
 I have no complaints. 

*cheer*, except I guess we need to figure out what goes bad for Chris.

 Except that you dropped all the good commit 
 commentary you had earlier ;)

Yeah, I've yet to add that back, will do.

 The patch looks pretty good (except for the big #if 0 block in 
 mutex-debug.c that I hope gets fixed, but I can't even really claim that I 
 can be bothered), the locking looks fine (ie no locking at all), and the 
 numbers seem pretty convinving.
 
 Oh, and I think the open-coded
 
   atomic_cmpxchg(count, 1, 0) == 1
 
 could possibly just be replaced with a simple __mutex_fastpath_trylock(). 
 I dunno.

__mutex_fastpath_trylock() isn't always that neat -- see
include/asm-generic/mutex-xchg.h -- and it's a NOP on DEBUG_MUTEXES.

Note how I used old_val for the list_empty() thing as well, we could
possibly drop that extra condition though.
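
For readers without the header handy, the idea behind the xchg-based fastpath
is roughly the following (a sketch of the concept only, not the actual
asm-generic code): the trylock unconditionally xchg's the count to "locked",
and when the old value said "locked with waiters" it has just destroyed that
state and needs a fixup -- which is the not-so-neat part.

#include <stdatomic.h>

/* 1 = unlocked, 0 = locked, negative = locked with waiters */
static int sketch_xchg_trylock(atomic_int *count)
{
	int old = atomic_exchange(count, 0);	/* unconditionally claim it */

	if (old == 1)
		return 1;			/* it was free: we now own it */

	/*
	 * Not so neat: if old was negative we have just overwritten the
	 * "there are sleepers" state with plain "locked", so it has to be
	 * put back (modulo further races the real code must care about)
	 * before reporting failure.
	 */
	if (old < 0)
		atomic_store(count, old);

	return 0;
}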



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 11:13 -0800, Linus Torvalds wrote:
 
 Well, at least we do unless you enable that broken paravirt support. I'm 
 not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I 
 don't much care.

Because the virtual cpu that has the ticket might not get scheduled for
a while, even though another vcpu with a spinner is scheduled.

The whole (para)virt is a nightmare in that respect.
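
The problem is easy to see from a minimal ticket lock sketch (userspace,
illustrative only): the queue is strictly FIFO, so if the vcpu holding the
next ticket is not scheduled, every later vcpu spins uselessly behind it.

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint	next;	/* next ticket to hand out */
	atomic_uint	owner;	/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* strictly FIFO: nobody can jump the queue, even if
			 * the vcpu in front of us isn't running right now */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);
}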


[PATCH -v4][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 8% boost for VFS scalability on my testbox:

  # echo MUTEX_SPIN > /debug/sched_features
  # ./test-mutex V 16 10
  2 CPUs, running 16 parallel test-tasks.
  checking VFS performance.

  avg ops/sec:               74910

  # echo NO_MUTEX_SPIN > /debug/sched_features
  # ./test-mutex V 16 10
  2 CPUs, running 16 parallel test-tasks.
  checking VFS performance.

  avg ops/sec:               68804

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is a means to disable this behaviour at runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_MUTEX_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo MUTEX_SPIN > /debug/sched_features

There's also a few new statistic fields in /proc/sched_debug
(available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):

 # grep mtx /proc/sched_debug
  .mtx_spin  : 2387
  .mtx_sched : 2283
  .mtx_spin  : 1277
  .mtx_sched : 1700

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Reviewed-and-signed-off-by: Ingo Molnar mi...@elte.hu
---
 include/linux/mutex.h   |4 +-
 include/linux/sched.h   |2 +
 kernel/mutex-debug.c|   10 +--
 kernel/mutex-debug.h|8 -
 kernel/mutex.c  |   46 ++
 kernel/mutex.h  |2 -
 kernel/sched.c  |   73 +++
 kernel/sched_debug.c|2 +
 kernel/sched_features.h |1 +
 9 files changed, 115 insertions(+), 33 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 7a0e5c4..c007b4e 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct thread_info  *owner;
const char  *name;
void*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cae9b8..d8fa96b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -328,6 +328,8 @@ extern signed long schedule_timeout(signed long timeout);
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state,
+  unsigned long *flags);
 asmlinkage void schedule(void);
 
 struct nsproxy;
diff --git a/kernel/mutex-debug.c b/kernel/mutex-debug.c
index 1d94160..0564680 100644
--- a/kernel/mutex-debug.c
+++ b/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock-wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock-owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex *lock, struct 
mutex_waiter *waiter,
 
/* Mark the current thread as blocked on the lock: */
ti-task-blocked_on = waiter;
-   waiter-lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lock)
return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	/* DEBUG_LOCKS_WARN_ON(lock->owner != current); */
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);

Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 15:50 +0100, Frédéric Weisbecker wrote:
 2009/1/7 Peter Zijlstra pet...@infradead.org:
  Change mutex contention behaviour such that it will sometimes busy wait on
  acquisition - moving its behaviour closer to that of spinlocks.
 
  This concept got ported to mainline from the -rt tree, where it was 
  originally
  implemented for rtmutexes by Steven Rostedt, based on work by Gregory 
  Haskins.
 
  Testing with Ingo's test-mutex application 
  (http://lkml.org/lkml/2006/1/8/50)
  gave a 8% boost for VFS scalability on my testbox:
 
   # echo MUTEX_SPIN  /debug/sched_features
   # ./test-mutex V 16 10
   2 CPUs, running 16 parallel test-tasks.
   checking VFS performance.
 
   avg ops/sec:74910
 
   # echo NO_MUTEX_SPIN  /debug/sched_features
   # ./test-mutex V 16 10
   2 CPUs, running 16 parallel test-tasks.
   checking VFS performance.
 
   avg ops/sec:68804
 
  The key criteria for the busy wait is that the lock owner has to be running 
  on
  a (different) cpu. The idea is that as long as the owner is running, there 
  is a
  fair chance it'll release the lock soon, and thus we'll be better off 
  spinning
  instead of blocking/scheduling.
 
  Since regular mutexes (as opposed to rtmutexes) do not atomically track the
  owner, we add the owner in a non-atomic fashion and deal with the races in
  the slowpath.
 
  Furthermore, to ease the testing of the performance impact of this new code,
  there is means to disable this behaviour runtime (without having to reboot
  the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
  by issuing the following command:
 
   # echo NO_MUTEX_SPIN  /debug/sched_features
 
  This command re-enables spinning again (this is also the default):
 
   # echo MUTEX_SPIN  /debug/sched_features
 
  There's also a few new statistic fields in /proc/sched_debug
  (available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):
 
   # grep mtx /proc/sched_debug
   .mtx_spin  : 2387
   .mtx_sched : 2283
   .mtx_spin  : 1277
   .mtx_sched : 1700
 
  Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
  Reviewed-and-signed-off-by: Ingo Molnar mi...@elte.hu
  ---

 Sorry I haven't read all the previous talk about the older version.
 But it is possible that, in hopefully rare cases, you enter
 mutex_spin_or_schedule
 multiple times, and try to spin for the same lock each of these times.
 
 For each of the above break,
 
 _if you exit the spin because the mutex is unlocked, and someone else
 grab it before you
 _ or simply the owner changed...
 
 then you will enter again in mutex_spin_or_schedule, you have some chances 
 that
 rq->curr == the new owner, and then you will spin again.
 And this situation can almost really make you behave like a spinlock...

You understand correctly, that is indeed possible.

 Shouldn't it actually try only one time to spin, and if it calls again
 mutex_spin_or_schedule()
 then it would be better to schedule()  ?

I don't know, maybe code it up and find a benchmark where it makes a
difference. :-)
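
(What that would look like, very roughly and only as a sketch of Frédéric's
suggestion, is a one-shot flag around the spin attempt -- the names below are
the patch's, the flag is the hypothetical addition: the first trip through the
wait loop calls mutex_spin_or_schedule(), any later trip just schedule()s.)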


Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 10:22 -0500, Steven Rostedt wrote:
 Peter, nice work!

Thanks!

  +   }
  +
  +   if (!spin) {
  +   schedstat_inc(this_rq(), mtx_sched);
  +   __set_task_state(task, state);
 
 I still do not know why you set state here instead of in the mutex code.
 Yes, you prevent changing the state if we do not schedule, but there's 
 nothing wrong with setting it before hand. We may even be able to cache 
 the owner and keep the locking of the wait_lock out of here. But then I 
 see that it may be used to protect the sched_stat counters.

I was about to say because we need task_rq(owner) and can only deref
owner while holding that lock, but I found a way around it by using
task_cpu() which is exported.

Compile tested only so far...

---
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,8 +329,8 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
-extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state,
-  unsigned long *flags);
+extern void mutex_spin_or_schedule(struct mutex_waiter *waiter,
+  struct task_struct *owner, int cpu);
 asmlinkage void schedule(void);
 
 struct nsproxy;
Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -12,7 +12,8 @@
  *
  *  - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
  *from the -rt tree, where it was originally implemented for rtmutexes
- *by Steven Rostedt, based on work by Gregory Haskins.)
+ *by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ *and Sven Dietrich.
  *
  * Also see Documentation/mutex-design.txt.
  */
@@ -157,6 +158,9 @@ __mutex_lock_common(struct mutex *lock, 
 	lock_contended(&lock->dep_map, ip);
 
for (;;) {
+   int cpu = 0;
+   struct task_struct *l_owner;
+
/*
 * Lets try to take the lock again - this is needed even if
 * we get here for the first time (shortly after failing to
@@ -184,8 +188,15 @@ __mutex_lock_common(struct mutex *lock, 
return -EINTR;
}
 
-		mutex_spin_or_schedule(&waiter, state, &flags);
+		__set_task_state(task, state);
+		l_owner = ACCESS_ONCE(lock->owner);
+		if (l_owner)
+			cpu = task_cpu(l_owner);
+		spin_unlock_mutex(&lock->wait_lock, flags);
+		mutex_spin_or_schedule(&waiter, l_owner, cpu);
+		spin_lock_mutex(&lock->wait_lock, flags);
}
+   __set_task_state(task, TASK_RUNNING);
 
 done:
 	lock_acquired(&lock->dep_map, ip);
Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4600,42 +4600,37 @@ pick_next_task(struct rq *rq, struct tas
}
 }
 
-#ifdef CONFIG_DEBUG_MUTEXES
-# include mutex-debug.h
-#else
-# include mutex.h
-#endif
-
-void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state,
-   unsigned long *flags)
+void mutex_spin_or_schedule(struct mutex_waiter *waiter,
+   struct task_struct *owner, int cpu)
 {
-	struct mutex *lock = waiter->lock;
 	struct task_struct *task = waiter->task;
-	struct task_struct *owner = lock->owner;
+	struct mutex *lock = waiter->lock;
 	struct rq *rq;
 	int spin = 0;
 
 	if (likely(sched_feat(MUTEX_SPIN) && owner)) {
-		rq = task_rq(owner);
+		rq = cpu_rq(cpu);
 		spin = (rq->curr == owner);
}
 
if (!spin) {
+   preempt_disable();
schedstat_inc(this_rq(), mtx_sched);
-		__set_task_state(task, state);
-		spin_unlock_mutex(&lock->wait_lock, *flags);
+		preempt_enable();
+
 		schedule();
-		spin_lock_mutex(&lock->wait_lock, *flags);
 		return;
 	}
 
+	preempt_disable();
 	schedstat_inc(this_rq(), mtx_spin);
-	spin_unlock_mutex(&lock->wait_lock, *flags);
+   preempt_enable();
+
for (;;) {
struct task_struct *l_owner;
 
/* Stop spinning when there's a pending signal. */
-   if (signal_pending_state(state, task))
+		if (signal_pending_state(task->state, task))
break;
 
/* Mutex got unlocked, try to acquire. */
@@ -4666,7 +4661,6 @@ void

[PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 08:25 -0800, Linus Torvalds wrote:
 
 On Wed, 7 Jan 2009, Peter Zijlstra wrote:
 
  Change mutex contention behaviour such that it will sometimes busy wait on
  acquisition - moving its behaviour closer to that of spinlocks.
 
 Ok, this one looks _almost_ ok.
 
 The only problem is that I think you've lost the UP case. 
 
 In UP, you shouldn't have the code to spin, and the spin_or_schedule() 
 should fall back to just the schedule case.
 
 It migth also be worthwhile to try to not set the owner, and re-organize 
 that a bit (by making it a inline function that sets the owner only for 
 CONFIG_SMP or lockdep/debug). 

As you wish ;-)

---
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra a.p.zijls...@chello.nl
Date: Tue, 6 Jan 2009 12:32:12 +0100

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 8% boost for VFS scalability on my testbox:

  # echo MUTEX_SPIN > /debug/sched_features
  # ./test-mutex V 16 10
  2 CPUs, running 16 parallel test-tasks.
  checking VFS performance.

  avg ops/sec:               74910

  # echo NO_MUTEX_SPIN > /debug/sched_features
  # ./test-mutex V 16 10
  2 CPUs, running 16 parallel test-tasks.
  checking VFS performance.

  avg ops/sec:               68804

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is a means to disable this behaviour at runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_MUTEX_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo MUTEX_SPIN > /debug/sched_features

There's also a few new statistic fields in /proc/sched_debug
(available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):

 # grep mtx /proc/sched_debug
  .mtx_spin  : 2387
  .mtx_sched : 2283
  .mtx_spin  : 1277
  .mtx_sched : 1700

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Reviewed-and-signed-off-by: Ingo Molnar mi...@elte.hu
---
 include/linux/mutex.h   |6 ++--
 include/linux/sched.h   |2 +
 kernel/mutex-debug.c|   10 ---
 kernel/mutex-debug.h|   13 +++--
 kernel/mutex.c  |   66 +---
 kernel/mutex.h  |   13 -
 kernel/sched.c  |   63 +
 kernel/sched_debug.c|2 +
 kernel/sched_features.h |1 
 9 files changed, 146 insertions(+), 30 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
+   struct task_struct  *owner;
+#endif
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct thread_info  *owner;
const char  *name;
void*magic;
 #endif
@@ -67,8 +69,8 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,6 +329,8 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *waiter,
+  struct task_struct *owner, int cpu);
 asmlinkage void schedule(void);
 
 struct nsproxy;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig

Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 12:55 -0800, Linus Torvalds wrote:

   /*
* Look out! thread is an entirely speculative pointer
* access and not reliable.
*/
   void loop_while_oncpu(struct mutex *lock, struct thread_struct *thread)
   {
   for (;;) {
   unsigned cpu;
   struct runqueue *rq;
 
   if (lock->owner != thread)
           break;
 
   /*
    * Need to access the cpu field knowing that
    * DEBUG_PAGEALLOC could have unmapped it if
    * the mutex owner just released it and exited.
    */
   if (__get_user(cpu, &thread->cpu))
   break;
 
   /*
* Even if the access succeeded (likely case),
* the cpu field may no longer be valid. FIXME:
* this needs to validate that we can do a
* get_cpu() and that we have the percpu area.
*/
   if (cpu >= NR_CPUS)
           break;
 
   if (!cpu_online(cpu))
           break;
 
   /*
    * Is that thread really running on that cpu?
    */
   rq = cpu_rq(cpu);
   if (task_thread_info(rq->curr) != thread)
   break;
 
   cpu_relax();
   }
   }

Do we really have to re-do all that code every loop?

void loop_while_oncpu(struct mutex *lock, struct thread_struct *thread)
{
unsigned cpu;
struct runqueue *rq;

/*
 * Need to access the cpu field knowing that
 * DEBUG_PAGEALLOC could have unmapped it if
 * the mutex owner just released it and exited.
 */
if (__get_user(cpu, &thread->cpu))
return;

/*
 * Even if the access succeeded (likely case),
 * the cpu field may no longer be valid. FIXME:
 * this needs to validate that we can do a
 * get_cpu() and that we have the percpu area.
 */
if (cpu >= NR_CPUS)
return;

if (!cpu_online(cpu))
return;

rq = cpu_rq(cpu);

for (;;) {
if (lock->owner != thread)
break;

/*
 * Is that thread really running on that cpu?
 */
if (task_thread_info(rq->curr) != thread)
break;

cpu_relax();
}
}

Also, it would still need to do the funny:

 l_owner = ACCESS_ONCE(lock->owner)
 if (l_owner && l_owner != thread)
   break;

thing, to handle the premature non-atomic lock-owner tracking.


Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 13:58 -0800, Linus Torvalds wrote:
 
 On Wed, 7 Jan 2009, Peter Zijlstra wrote:
  
  Do we really have to re-do all that code every loop?
 
 No, you're right, we can just look up the cpu once. Which makes Andrew's 
 argument that probe_kernel_address() isn't in any hot path even more 
 true.
 
  Also, it would still need to do the funny:
  
   l_owner = ACCESS_ONCE(lock->owner)
   if (l_owner && l_owner != thread)
 break;
 
 Why? That would fall out of the 
 
   if (lock->owner != thread)
   break;
 
 part. We don't actually care that it only happens once: this all has 
 _known_ races, and the cpu_relax() is a barrier.
 
 And notice how the _caller_ handles the owner == NULL case by not even 
 calling this, and looping over just the state in the lock itself. That was 
 in the earlier emails. So this approach is actually pretty different from 
 the case that depended on the whole spinlock thing.

Ah, so now you do loop on !owner, previuosly you insisted we'd go to
sleep on !owner. Yes, with !owner spinning that is indeed not needed.

 +#ifdef CONFIG_SMP
 + /* Optimistic spinning.. */
 + for (;;) {
 + struct thread_struct *owner;
 + int oldval = atomic_read(&lock->count);
 +
 + if (oldval <= MUTEX_SLEEPERS)
 + break;
 + if (oldval == 1) {
 + oldval = atomic_cmpxchg(&lock->count, oldval, 0);
 + if (oldval == 1) {
 + lock->owner = task_thread_info(task);
 + return 0;
 + }
 + } else {
 + /* See who owns it, and spin on him if anybody */
 + owner = lock->owner;
 + if (owner)
 + spin_on_owner(lock, owner);
 + }
 + cpu_relax();
 + }
 +#endif

Hmm, still wouldn't the spin_on_owner() loopyness and the above need
that need_resched() check you mentioned to that it can fall into the
slow path and go to sleep?



Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 15:32 -0800, Linus Torvalds wrote:
 
 On Wed, 7 Jan 2009, Steven Rostedt wrote:
  
  What would be interesting is various benchmarks against all three.
  
  1) no mutex spinning.
  2) get_task_struct() implementation.
  3) spin_or_sched implementation.
 
 One of the issues is that I cannot convince myself that (2) is even 
 necessarily correct. At least not without having all cases happen inder 
 the mutex spinlock - which they don't. Even with the original patch, the 
 uncontended cases set and cleared the owner field outside the lock.

Yes, 2 isn't feasible for regular mutexes as we have non-atomic owner
tracking.

I've since realized the whole rtmutex thing is fundamentally difference
on a few levels:

  a) we have atomic owner tracking (that's the lock itself, it holds the
task_pointer as a cookie), and 

  b) we need to do that whole enqueue on the waitlist thing because we
need to do the PI propagation and such to figure out if the current task
is even allowed to acquire -- that is, only the highest waiting and or
lateral steal candidates are allowed to spin acquire.




Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 14:16 +0100, Ingo Molnar wrote:
 * Gregory Haskins ghask...@novell.com wrote:
 
  Ingo Molnar wrote:
   There's no time or spin-rate based heuristics in this at all (i.e. these 
   mutexes are not 'adaptive' at all!),
  
  FYI: The original adaptive name was chosen in the -rt implementation 
  to reflect that the locks can adaptively spin or sleep, depending on 
  conditions.  I realize this is in contrast to the typical usage of the 
  term when it is in reference to the spin-time being based on some 
  empirical heuristics, etc as you mentioned.  Sorry for the confusion.
 
 the current version of the -rt spinny-mutexes bits were mostly written by 
 Steve, right? Historically it all started out with a more classic 
 adaptive mutexes patchset so the name stuck i guess.

Yeah, Gregory and co. started with the whole thing and showed there was
significant performance to be gained, after that Steve rewrote it from
scratch reducing it to this minimalist heuristic, with help from Greg.

(At least, that is how I remember it, please speak up if I got things
wrong)


Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 13:50 -0800, Linus Torvalds wrote:
 
 On Tue, 6 Jan 2009, Peter Zijlstra wrote:
  
  With Linus' mutex_spin_or_schedule() function the whole - keeping
  owner's task_struct alive issue goes away,.. now if only the thing would
  boot...
 
 Can you post the patch, so that we can see if we can find some silly error 
 that we can ridicule you over?

Sure, see below..

I think I'm seeing why it deadlocks..

One of the things the previous patch did wrong was that it never tracked
the owner in the mutex fast path -- I initially didn't notice because I
had all debugging infrastructure enabled, and that short circuits all
the fast paths.

So I added a lame fast path owner tracking:

  preempt_disable();
  mutex_fast_path_lock();
  lock->owner = current;
  preempt_enable();

and a similar clear on the unlock side.

Which is exactly what causes the deadlock -- or livelock more
accurately. Since the contention code spins while !owner, and the unlock
code clears owner before release, we spin so hard we never release the
cacheline to the other cpu and therefore get stuck.

Now I just looked what kernel/rtmutex.c does, it keeps track of the
owner in the lock field and uses cmpxchg to change ownership. Regular
mutexes however use atomic_t (not wide enough for void *) and hand
crafted assembly fast paths for all archs.

I think I'll just hack up an atomic_long_t atomic_lock_cmpxchg mutex
implementation -- so we can at least test this on x86 to see if it's
worth continuing this way.

Converting all hand crafted asm only to find out it degrades performance
doesn't sound too cool :-)
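
The shape of that experiment would be roughly the following (a userspace
sketch with C11 atomics, names invented; the kernel version would use
atomic_long_t and cmpxchg, and would still need the wait-list slowpath which
is left out here) -- the owner's task pointer itself is the lock word, so lock
state and owner get updated in one atomic operation, rtmutex style:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* The owner's task pointer *is* the lock word: NULL = unlocked. */
struct cookie_mutex {
	_Atomic(void *)	owner;
};

static bool cookie_trylock(struct cookie_mutex *m, void *self)
{
	void *expected = NULL;

	/* one cmpxchg both acquires the lock and records the owner */
	return atomic_compare_exchange_strong(&m->owner, &expected, self);
}

static void cookie_unlock(struct cookie_mutex *m, void *self)
{
	void *expected = self;

	/* releasing clears the owner in the same atomic word */
	atomic_compare_exchange_strong(&m->owner, &expected, NULL);
}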

---
 include/linux/mutex.h |4 +-
 include/linux/sched.h |1 
 kernel/mutex-debug.c  |   10 --
 kernel/mutex-debug.h  |8 -
 kernel/mutex.c|   80 ++
 kernel/mutex.h|2 -
 kernel/sched.c|   44 +++
 7 files changed, 110 insertions(+), 39 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct thread_info  *owner;
const char  *name;
void*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,6 +329,7 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *, long state, unsigned 
long *flags);
 asmlinkage void schedule(void);
 
 struct nsproxy;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock-wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock-owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
ti-task-blocked_on = waiter;
-   waiter-lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lo
return;
 
DEBUG_LOCKS_WARN_ON(lock-magic != lock);
-   DEBUG_LOCKS_WARN_ON(lock-owner != current_thread_info());
+   /* DEBUG_LOCKS_WARN_ON(lock-owner != current); */
DEBUG_LOCKS_WARN_ON(!lock-wait_list.prev  !lock-wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock-owner != current_thread_info());
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(lock-dep_map, name, key, 0);
 #endif
-   lock-owner = NULL;
lock-magic = lock;
 }
 
Index: linux-2.6/kernel/mutex-debug.h

Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 23:43 +0100, Peter Zijlstra wrote:

 @@ -115,6 +117,7 @@ void __sched mutex_unlock(struct mutex *
* The unlocking fastpath is the 0-1 transition from 'locked'
* into 'unlocked' state:
*/
 + lock->owner = NULL;
   __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
  }
  

 +void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, 
 unsigned long *flags)
 +{
 + struct mutex *lock = waiter->lock;
 + struct task_struct *task = waiter->task;
 + struct task_struct *owner = lock->owner;
 + struct rq *rq;
 +
 + if (!owner)
 + goto do_schedule;
 +
 + rq = task_rq(owner);
 +
 + if (rq->curr != owner) {
 +do_schedule:
 + __set_task_state(task, state);
 + spin_unlock_mutex(&lock->wait_lock, *flags);
 + schedule();
 + } else {
 + spin_unlock_mutex(&lock->wait_lock, *flags);
 + for (;;) {
 + /* Stop spinning when there's a pending signal. */
 + if (signal_pending_state(state, task))
 + break;
 +
 + /* Owner changed, bail to revalidate state */
 + if (lock->owner != owner)
 + break;
 +
 + /* Owner stopped running, bail to revalidate state */
 + if (rq->curr != owner)
 + break;
 +
 + cpu_relax();
 + }
 + }
 + spin_lock_mutex(&lock->wait_lock, *flags);
 +}

That's not going to work, we set owner to NULL, which means pending
spinners get schedule()ed out instead of racing to acquire.

I suppose the below would fix that... really sleep time now

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4626,8 +4626,8 @@ do_schedule:
if (signal_pending_state(state, task))
break;
 
-   /* Owner changed, bail to revalidate state */
-	if (lock->owner != owner)
+   /* Mutex got unlocked, race to acquire. */
+   if (!mutex_is_locked(lock))
break;
 
/* Owner stopped running, bail to revalidate state */



Re: Btrfs v0.16 released

2008-08-07 Thread Peter Zijlstra
On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:

 There are still more disk format changes planned, but we're making every
 effort to get them out of the way as quickly as we can.  You can see the
 major features we have planned on the development timeline:
 
 http://btrfs.wiki.kernel.org/index.php/Development_timeline

Just took a peek, seems to be slightly out of date as it still lists the
single mutex thingy.

Also, how true is the IO-error and disk-full claim?



Re: Btrfs v0.16 released

2008-08-07 Thread Peter Zijlstra
On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:

 * Fine grained btree locking.  The large fs_mutex is finally gone.
 There is still some work to do on the locking during extent allocation,
 but the code is much more scalable than it was.

Cool - will try to find a cycle to stare at the code ;-)

 * Helper threads for checksumming and other background tasks.  Most CPU
 intensive operations have been pushed off to helper threads to take
 advantage of SMP machines.  Streaming read and write throughput now
 scale to disk speed even with checksumming on.

Can this lead to the same Priority Inversion issues as seen with
kjournald?
