Re: Need help with a lockdep splat, possibly perf related?

2019-07-03 Thread Peter Zijlstra
On Wed, Jul 03, 2019 at 09:54:06AM -0400, Josef Bacik wrote:
> Hello,
> 
> I've been seeing a variation of the following splat recently and I have no
> earthly idea what it's trying to tell me.

That you have a lock cycle; aka. deadlock, obviously :-)

> I either get this one, or I get one
> that tells me the same thing except it's complaining about &cpuctx_mutex 
> instead
> of sb_pagefaults. 

Timing on where the cycle closes, most likely.

> There is no place we take the reloc_mutex and then do the
> pagefaults stuff,

That's not needed. What the below tells us is that:

  btrfs->bio/mq->cpuhotplug->perf->mmap_sem->btrfs

is a cycle. The basic problem seems to be that btrfs, through blk_mq,
has the cpuhotplug lock inside mmap_sem, while perf (and I imagine
quite a few other places) allows pagefaults while holding the cpuhotplug
lock.

This then presents a deadlock potential. Some of the actual chains are
clarified below; hope this helps.
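
In shorthand, the inversion looks something like this (sketch only, with
made-up function names, collapsed to the two end locks; the real chain
goes through the #5..#8 links spelled out below):

  /* fault side: pagefault -> btrfs -> ... -> cpu hotplug */
  static void fault_side(struct mm_struct *mm)
  {
          down_read(&mm->mmap_sem);               /* #9 */
          /* btrfs_page_mkwrite() -> ... -> bio/blk-mq setup -> */
          get_online_cpus();                      /* #6, taken under mmap_sem */
          put_online_cpus();
          up_read(&mm->mmap_sem);
  }

  /* perf side: ctx mutex (held under cpu hotplug elsewhere) -> pagefault */
  static void perf_side(struct perf_event *event)
  {
          mutex_lock(&event->ctx->mutex);         /* #8 */
          /* perf_read() -> copy_to_user() can fault: */
          down_read(&current->mm->mmap_sem);      /* #9 inside #8: cycle closes */
          up_read(&current->mm->mmap_sem);
          mutex_unlock(&event->ctx->mutex);
  }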

> so I don't know where it's getting these dependencies from.
> The stacktraces make no sense because there's perf stuff in here, but it 
> doesn't
> show me a path where we would be holding _any_ btrfs locks, so I'm not sure 
> how
> we gain those dependencies.  Can you tell me where I'm being stupid?  Thanks,
> 
>  ==
>  WARNING: possible circular locking dependency detected
>  5.2.0-rc7-00100-ga4067edd7814 #671 Not tainted
>  --
>  python3.6/44461 is trying to acquire lock:
>  674af011 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x3c/0x70
> 
>  but task is already holding lock:
>  91a6f027 (sb_pagefaults){.+.+}, at: btrfs_page_mkwrite+0x6a/0x4f0
> 
>  which lock already depends on the new lock.
> 
> 
>  the existing dependency chain (in reverse order) is:
> 
>  -> #10 (sb_pagefaults){.+.+}:
> __sb_start_write+0x12f/0x1d0
> btrfs_page_mkwrite+0x6a/0x4f0
> do_page_mkwrite+0x2b/0x70
> __handle_mm_fault+0x6f2/0x10e0
> handle_mm_fault+0x179/0x360
> __do_page_fault+0x24d/0x4d0
> page_fault+0x1e/0x30

    do_user_address_fault()
#9    down_read(->mmap_sem)
        handle_mm_fault()
          do_page_mkwrite()
            btrfs_page_mkwrite()
#10           __sb_start_write()

>  -> #9 (&mm->mmap_sem#2){}:
> __might_fault+0x6b/0x90
> _copy_to_user+0x1e/0x70
> perf_read+0x1a9/0x2c0
> vfs_read+0x9b/0x150
> ksys_read+0x5c/0xd0
> do_syscall_64+0x4a/0x1b0
> entry_SYSCALL_64_after_hwframe+0x49/0xbe

    perf_read()
#8    perf_event_ctx_lock()
        copy_to_user()
          #PF
#9          down_read(->mmap_sem)

> 
>  -> #8 (&cpuctx_mutex){+.+.}:
> __mutex_lock+0x81/0x8f0
> perf_event_init_cpu+0x9c/0x150
> perf_event_init+0x1d0/0x1fe
> start_kernel+0x365/0x513
> secondary_startup_64+0xa4/0xb0

    perf_event_init_cpu()
#7    mutex_lock(&pmus_lock);
#8      mutex_lock(&ctx->lock);

>  -> #7 (pmus_lock){+.+.}:
> __mutex_lock+0x81/0x8f0
> perf_event_init_cpu+0x69/0x150
> cpuhp_invoke_callback+0xb8/0x950
> _cpu_up.constprop.29+0xad/0x140
> do_cpu_up+0x92/0xe0
> smp_init+0xcf/0xd4
> kernel_init_freeable+0x13a/0x290
> kernel_init+0xa/0x110
> ret_from_fork+0x24/0x30

    _cpu_up()
#6    cpus_write_lock()
        ...
        perf_event_init_cpu()
#7        mutex_lock(&pmus_lock)

>  -> #6 (cpu_hotplug_lock.rw_sem){}:
> cpus_read_lock+0x43/0x90
> kmem_cache_create_usercopy+0x28/0x250
> kmem_cache_create+0x18/0x20
> bioset_init+0x157/0x2a0
> init_bio+0xa1/0xa7
> do_one_initcall+0x67/0x2f4
> kernel_init_freeable+0x1ec/0x290
> kernel_init+0xa/0x110
> ret_from_fork+0x24/0x30

    bio_find_or_create_slab()
#5    mutex_lock(&bio_slab_lock)
        kmem_cache_create()
          kmem_cache_create_usercopy()
#6          get_online_cpus()


>  -> #5 (bio_slab_lock){+.+.}:
> __mutex_lock+0x81/0x8f0
> bioset_init+0xb0/0x2a0
> blk_alloc_queue_node+0x80/0x2d0
> blk_mq_init_queue+0x1b/0x60
> 
> 0xa0022131
> do_one_initcall+0x67/0x2f4
> do_init_module+0x5a/0x22e
> load_module+0x1ebc/0x2570
> __do_sys_finit_module+0xb2/0xc0
> do_syscall_64+0x4a/0x1b0
> entry_SYSCALL_64_after_hwframe+0x49/0xbe

This one is weird, but something like:

#4  mutex_lock(&loop_ctl_mutex)
      loop_add()
        blk_mq_init_queue()
          blk_alloc_queue_node()
            bioset_init()
              bio_find_or_create_slab()
#5              mutex_lock(&bio_slab_lock)

is entirely possible when you look at drivers/block/loop.c

>  -> #4 (loop_ctl_mutex){+.+.}:
> __mutex_lock+0x81/0x8f0
> 
> __blkdev_ge

Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 07:32:05AM -0400, Kent Overstreet wrote:
> It does strike me that the whole optimistic spin algorithm
> (mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
> out. They've been growing more optimizations I see, and the optimizations 
> mostly
> aren't specific to either locks.

There are a few unfortunate differences between the two; but yes, it
would be good if we could reduce some of the duplication found there.

One of the distinct differences is that mutex (now) has the lock state
and owner in a single atomic word, while rwsem still tracks the owner in
a separate word and thus needs to account for 'inconsistent' owner
state.
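
Roughly, and very much simplified (these are not the real field layouts):

  /* mutex: owner task pointer and state flags share one atomic word */
  struct mutex_like {
          atomic_long_t           owner;  /* task_struct pointer | flag bits */
  };

  /* rwsem: count and owner are separate words, so the owner snapshot can
   * be stale/inconsistent and the spin code has to cope with that */
  struct rwsem_like {
          atomic_long_t           count;
          struct task_struct      *owner; /* written after acquire, may lag */
  };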

And then there's warts such as ww_mutex and rwsem_owner_is_reader and
similar.


Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 06:18:04AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> > 
> > No.. and most certainly not without a _very_ good reason.
> 
> Ok, can I ask why?

Because it is an internal helper for lock implementations that want to
do optimistic spinning, it isn't a lock on its own and lacks several
things you would expect.

Using it is tricky and I don't trust random module authors to get 1+1
right, let alone use this thing correctly (no judgement on your code,
just in general).

> Here's what it's for:

I'll try and have a look soon :-) But does that really _have_ to live in
a module?


Re: [PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 06:13:53AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > > bcachefs makes use of them - also, add a proper lg_lock_init()
> > 
> > Why?! lglocks are horrid things, we got rid of them for a reason. They
> > have terrifying worst case preemption off latencies.
> 
> Ah. That was missing from your commit message.

Yeah, sorry, sometimes it's hard to state what is obvious to oneself :/

> > Why can't you use something like per-cpu rwsems?
> 
> Well,
> 
>  a) in my use case, lg_global_lock() pretty much isn't used in normal 
> operation,
> it's only called when starting mark and sweep gc (which is not needed
> anymore and disabled by default, it'll eventually get rolled into online
> fsck) and for device resize
> 
>  b) I'm using it in conjunction with percpu counters, and technically yes I
> certainly _could_ use per-cpu sleepable locks (mutexes would make more 
> sense
> for me than rwsems), there's a bit of a clash there and it's going to be a
> bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
> makes any sense in that case, so it'd mean auditing and converting all the
> code that touches the relevant data structures).

Well, lg is a reader-writer style lock by definition, as you want
concurrency on the local side and full exclusion against the global
side, so I'm not sure how mutexes fit into this.

In any case, have a look at percpu_down_read_preempt_disable() and
percpu_up_read_preempt_enable(); they're a bit of a hack but they should
work for you I think.

They will sleep at down_read, but the entire actual critical section
will be with preemption disabled -- therefore it had better be short and
bounded, and the RT guys will thank you for not using spinlock_t under
it (use raw_spinlock_t if you have to).

The (global) writer side will block and be preemptible like normal.
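
A minimal usage sketch of that pattern (illustrative names; assumes the
percpu_rw_semaphore got set up with percpu_init_rwsem() elsewhere):

  static struct percpu_rw_semaphore gc_lock;      /* illustrative */
  static DEFINE_PER_CPU(u64, nr_buckets);         /* illustrative counter */

  static void local_update(u64 n)                 /* hot path */
  {
          percpu_down_read_preempt_disable(&gc_lock);
          __this_cpu_add(nr_buckets, n);          /* short, bounded, no spinlock_t */
          percpu_up_read_preempt_enable(&gc_lock);
  }

  static u64 global_sum(void)                     /* rare path, may sleep */
  {
          u64 sum = 0;
          int cpu;

          percpu_down_write(&gc_lock);            /* excludes all readers */
          for_each_possible_cpu(cpu)
                  sum += per_cpu(nr_buckets, cpu);
          percpu_up_write(&gc_lock);
          return sum;
  }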

> If you're really that dead set against lglocks I might just come up with a new
> lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if 
> the
> global lock is held...

See above, we already have this ;-)


Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:

No.. and most certainly not without a _very_ good reason.


Re: [PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> bcachefs makes use of them - also, add a proper lg_lock_init()

Why?! lglocks are horrid things, we got rid of them for a reason. They
have terrifying worst case preemption off latencies.

Why can't you use something like per-cpu rwsems?


Re: [PATCH 0/5] v2: block subsystem refcounter conversions

2017-04-21 Thread Peter Zijlstra
On Fri, Apr 21, 2017 at 08:03:13AM -0600, Jens Axboe wrote:
> You have it so easy - the code is completely standalone, building a
> small test framework around it and measuring performance in _user space_
> is trivial.

Something like this you mean:

  
https://lkml.kernel.org/r/20170322165144.dtidvvbxey7w5...@hirez.programming.kicks-ass.net




Re: [PATCH 0/5] block subsystem refcounter conversions

2017-02-20 Thread Peter Zijlstra
On Mon, Feb 20, 2017 at 07:41:01AM -0800, James Bottomley wrote:
> On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> > On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > > Now when new refcount_t type and API are finally merged
> > > (see include/linux/refcount.h), the following
> > > patches convert various refcounters in the block susystem from 
> > > atomic_t to refcount_t. By doing this we prevent intentional or 
> > > accidental underflows or overflows that can led to use-after-free
> > > vulnerabilities.
> 
> This description isn't right ... nothing is prevented; we get warnings
> on saturation and use after free with this.

The thing that is prevented is overflow and then a use-after-free by
making it a leak.

Modular arithmetic: you put and free at (n+1) mod n; by saturating at
n-1 we'll never get there.

So you lose the use-after-free, you gain a resource leak. The general idea
being that use-after-free is a nice trampoline for exploits, leaks are
'only' a DoS.
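
As a toy illustration of the wrap-vs-saturate point (plain C, nothing
like the actual refcount_t implementation):

  #include <stdint.h>
  #include <stdio.h>

  static uint32_t wrap_inc(uint32_t c)            /* atomic_t-style */
  {
          return c + 1;                           /* UINT32_MAX + 1 == 0 */
  }

  static uint32_t sat_inc(uint32_t c)             /* refcount_t-style */
  {
          return c == UINT32_MAX ? c : c + 1;     /* sticks at the ceiling */
  }

  int main(void)
  {
          /* A wrapped counter reaches 0 again, so a later put frees while
           * references still exist (use-after-free); a saturated counter
           * never comes back down, so the object is leaked instead. */
          printf("wrap: %u  saturate: %u\n",
                 wrap_inc(UINT32_MAX), sat_inc(UINT32_MAX));
          return 0;
  }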



Re: bio linked list corruption.

2016-12-06 Thread Peter Zijlstra
On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote:
> Adding the scheduler people to the participants list, and re-attaching
> the patch, because while this patch is internal to the VM code, the
> issue itself is not.
> 
> There might well be other cases where somebody goes "wake_up_all()"
> will wake everybody up, so I can put the wait queue head on the stack,
> and then after I've woken people up I can return".
> 
> Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
> does make sure that everybody is woken up, but the usual autoremove
> wake function only removes the wakeup entry if the process was woken
> up by that particular wakeup. If something else had woken it up, the
> entry remains on the list, and the waker in this case returned with
> the wait head still containing entries.
> 
> Which is deadly when the wait queue head is on the stack.

Yes, very much so.

> So I'm wondering if we should make that "synchronous_wake_function()"
> available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
> that uses it.

We could also do some debug code that tracks the ONSTACK-ness and warns
on autoremove. Something like the below, equally untested.

> 
> Of course, I'm really hoping that this shmem.c use is the _only_ such
> case.  But I doubt it.

$ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
28

---


diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2408e8d5c05c..199faaa89847 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -39,6 +39,9 @@ struct wait_bit_queue {
 struct __wait_queue_head {
spinlock_t  lock;
struct list_headtask_list;
+#ifdef CONFIG_DEBUG_WAITQUEUE
+   int onstack;
+#endif
 };
 typedef struct __wait_queue_head wait_queue_head_t;
 
@@ -56,9 +59,18 @@ struct task_struct;
 #define DECLARE_WAITQUEUE(name, tsk)   \
wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)
 
-#define __WAIT_QUEUE_HEAD_INITIALIZER(name) {  \
+#ifdef CONFIG_DEBUG_WAITQUEUE
+#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
+#else
+#define ___WAIT_QUEUE_ONSTACK(onstack)
+#endif
+
+#define ___WAIT_QUEUE_HEAD_INITIALIZER(name, onstack)  {   \
.lock   = __SPIN_LOCK_UNLOCKED(name.lock),  \
-   .task_list  = { &(name).task_list, &(name).task_list } }
+   .task_list  = { &(name).task_list, &(name).task_list }, \
+   ___WAIT_QUEUE_ONSTACK(onstack) }
+
+#define __WAIT_QUEUE_HEAD_INITIALIZER(name) ___WAIT_QUEUE_HEAD_INITIALIZER(name, 0)
 
 #define DECLARE_WAIT_QUEUE_HEAD(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
@@ -80,11 +92,12 @@ extern void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct
 
 #ifdef CONFIG_LOCKDEP
 # define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
-   ({ init_waitqueue_head(&name); name; })
+   ({ init_waitqueue_head(&name); (name).onstack = 1; name; })
 # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
 #else
-# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
+# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
+   wait_queue_head_t name = ___WAIT_QUEUE_HEAD_INITIALIZER(name, 1)
 #endif
 
 static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 9453efe9b25a..746d00117d08 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -156,6 +156,13 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */
 
+static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
+{
+#ifdef CONFIG_DEBUG_WAITQUEUE
+   WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
+#endif
+}
+
 /*
  * Note: we use "set_current_state()" _after_ the wait-queue add,
  * because we need a memory barrier there on SMP, so that any
@@ -178,6 +185,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
if (list_empty(&wait->task_list))
__add_wait_queue(q, wait);
set_current_state(state);
+   prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait);
@@ -192,6 +200,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
if (list_empty(&wait->task_list))
__add_wait_queue_tail(q, wait);
set_current_state(state);
+   prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
@@ -235,6 +244,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
}
set_current_state(state);
}
+   prepare_debug(q, wait);
spin_un

Re: [PATCH] rwsem: add rwsem_is_contended

2013-08-31 Thread Peter Zijlstra
On Fri, Aug 30, 2013 at 10:14:01AM -0400, Josef Bacik wrote:
> Btrfs uses an rwsem to control access to its extent tree.  Threads will hold a
> read lock on this rwsem while they scan the extent tree, and if need_resched()
> they will drop the lock and schedule.  The transaction commit needs to take a
> write lock for this rwsem for a very short period to switch out the commit
> roots.  If there are a lot of threads doing this caching operation we can 
> starve
> out the committers which slows everybody out.  To address this we want to add
> this functionality to see if our rwsem has anybody waiting to take a write 
> lock
> so we can drop it and schedule for a bit to allow the commit to continue.


> +/*
> + * check to see if the rwsem we're holding has anybody waiting to acquire it.
> + */
> +int rwsem_is_contended(struct rw_semaphore *sem)
> +{
> + int ret = 0;
> + unsigned long flags;
> +
> + if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> + return 1;
> + if (!list_empty(&sem->wait_list))
> + ret = 1;
> + raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
> + return ret;
> +}
> +
> +EXPORT_SYMBOL(rwsem_is_contended);

Modeled after spin_is_contended(), so no problem with that. One thing I
was wondering about is if it wants to be called
rwsem_is_write_contended() or similar, since it explicitly only tests
for pending writers.
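
For reference, the intended caller pattern would look roughly like this
(hypothetical btrfs-side names, not part of this patch):

  down_read(&fs_info->extent_sem);                /* hypothetical rwsem */
  while (have_more_extents(ctl)) {                /* hypothetical loop */
          cache_one_extent(ctl);

          if (need_resched() || rwsem_is_contended(&fs_info->extent_sem)) {
                  up_read(&fs_info->extent_sem);
                  cond_resched();                 /* let the committer take the write lock */
                  down_read(&fs_info->extent_sem);
          }
  }
  up_read(&fs_info->extent_sem);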



Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Peter Zijlstra
On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:
> 
> In this case, the only thing we're really missing is a way to mutex_lock
> without the cond_resched() 

So you're trying to explicitly avoid a voluntary preemption point? Seems
like a bad idea, normally people add those :-)


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-30 Thread Peter Zijlstra
On Wed, 2011-03-30 at 10:17 +0200, Tejun Heo wrote:
> Hey, Peter.
> 
> On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
> > > Here's the combined patch I was planning on testing but didn't get to
> > > (yet).  It implements two things - hard limit on spin duration and
> > > early break if the owner also is spinning on a mutex.
> > 
> > This is going to give massive conflicts with
> > 
> >  https://lkml.org/lkml/2011/3/2/286
> >  https://lkml.org/lkml/2011/3/2/282
> > 
> > which I was planning to stuff into .40
> 
> I see.  Adapting shouldn't be hard.  The patch is in proof-of-concept
> stage anyway.
> 
> > > + * Forward progress is guaranteed regardless of locking ordering by never
> > > + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
> > > + * mutex_trylock(), which doesn't have to follow the usual locking
> > > + * ordering, also uses this function.
> > 
> > While that puts a limit on things it'll still waste time. I'd much
> > rather pass an trylock argument to mutex_spin_on_owner() and then bail
> > on owner also spinning.
> 
> Do we guarantee or enforce that the lock ownership can't be
> transferred to a different task?  If we do, the recursive spinning
> detection is enough to guarantee forward progress.

The only way to switch owner is for the current owner to release and a
new owner to acquire the lock. Also we already bail the spin loop when
owner changes.

> > > + if (task_thread_info(rq->curr) != owner ||
> > > + rq->spinning_on_mutex || need_resched() ||
> > > + local_clock() > start + MAX_MUTEX_SPIN_NS) {
> > 
> > While we did our best with making local_clock() cheap, I'm still fairly
> > uncomfortable with putting it in such a tight loop.
> 
> That's one thing I didn't really understand.  It seems the spinning
> code tried to be light on CPU cycle usage, but we're wasting CPU
> cycles there anyway.  If the spinning can become smarter using some
> CPU cycles, isn't that a gain?  Why is the spinning code optimizing
> for less CPU cycles?

Loop exit latency, mostly: the lighter the loop, the faster you're
through it and the less time you waste once the condition becomes true.
But yeah, what you say is true too, it's a balance game as always.

> Also, it would be great if you can share what you're using for locking
> performance measurements.  My attempts with dbench didn't end too
> well. :-(

The thing I used for the initial implementation is mentioned in the
changelog (0d66bf6d3):

Testing with Ingo's test-mutex application 
(http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   296604

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   85870

I've no idea how heavy that is on trylock though, you'd have to look at
that.

Chris did some improvements using dbench in ac6e60ee4 but clearly that
isn't working for you.




Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-29 Thread Peter Zijlstra
On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:
> Here's the combined patch I was planning on testing but didn't get to
> (yet).  It implements two things - hard limit on spin duration and
> early break if the owner also is spinning on a mutex.

This is going to give massive conflicts with

 https://lkml.org/lkml/2011/3/2/286
 https://lkml.org/lkml/2011/3/2/282

which I was planning to stuff into .40


> @@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
>  
>  #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
>  /*
> + * Maximum mutex owner spin duration in nsecs.  Don't spin more then
> + * DEF_TIMESLICE.
> + */
> +#define MAX_MUTEX_SPIN_NS  (DEF_TIMESLICE * 10LLU / HZ)

DEF_TIMESLICE is SCHED_RR only, so its use here is dubious at best, also
I bet we have something like NSEC_PER_SEC to avoid counting '0's.

> +
> +/**
> + * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
> + * @lock: the mutex to spin on
> + * @owner: the current owner (speculative pointer)
> + *
> + * The caller is trying to acquire @lock held by @owner.  If @owner is
> + * currently running, it might get unlocked soon and spinning on it can
> + * save the overhead of sleeping and waking up.
> + *
> + * Note that @owner is completely speculative and may be completely
> + * invalid.  It should be accessed very carefully.
> + *
> + * Forward progress is guaranteed regardless of locking ordering by never
> + * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
> + * mutex_trylock(), which doesn't have to follow the usual locking
> + * ordering, also uses this function.

While that puts a limit on things it'll still waste time. I'd much
rather pass an trylock argument to mutex_spin_on_owner() and then bail
on owner also spinning.

> + * CONTEXT:
> + * Preemption disabled.
> + *
> + * RETURNS:
> + * %true if the lock was released and the caller should retry locking.
> + * %false if the caller better go sleeping.
>   */
> -int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
> +bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
>  {

> @@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
>* we likely have heavy contention. Return 0 to quit
>* optimistic spinning and not contend further:
>*/
> + ret = !lock->owner;
>   break;
>   }
>  
>   /*
> -  * Is that owner really running on that cpu?
> +  * Quit spinning if any of the followings is true.
> +  *
> +  * - The owner isn't running on that cpu.
> +  * - The owner also is spinning on a mutex.
> +  * - Someone else wants to use this cpu.
> +  * - We've been spinning for too long.
>*/
> + if (task_thread_info(rq->curr) != owner ||
> + rq->spinning_on_mutex || need_resched() ||
> + local_clock() > start + MAX_MUTEX_SPIN_NS) {

While we did our best with making local_clock() cheap, I'm still fairly
uncomfortable with putting it in such a tight loop.

> + ret = false;
> + break;
> + }
>  
>   arch_mutex_cpu_relax();
>   }
>  
> + this_rq()->spinning_on_mutex = false;
> + return ret;
>  }
>  #endif
>  





Re: [PATCH 1/3] lib: introduce some memory copy macros and functions

2010-09-02 Thread Peter Zijlstra
On Thu, 2010-09-02 at 09:55 +0200, Peter Zijlstra wrote:
> On Thu, 2010-09-02 at 13:44 +0800, Miao Xie wrote:
> > On Wed, 01 Sep 2010 17:25:36 +0200, Peter Zijlstra wrote:
> > > On Wed, 2010-09-01 at 18:36 +0800, Miao Xie wrote:
> > >> + * This program is free software; you can redistribute it and/or modify 
> > >> it
> > >> + * under the terms of the GNU General Public License as published by 
> > >> the Free
> > >> + * Software Foundation; either version 2.1 of the License, or (at your 
> > >> option)
> > >> + * any later version.
> > >
> > > The kernel is GPL v2, see COPYING. GPL v2.1 is not a suitable license.
> > 
> > Ok, I will change it.
> 
> Are you allowed to change it? From what I understand the FSF owns the
> copyright of that code, at least that's what the preamble I cut away
> implied.

n/m, I should probably have read more inbox before replying.



Re: [PATCH 1/3] lib: introduce some memory copy macros and functions

2010-09-02 Thread Peter Zijlstra
On Thu, 2010-09-02 at 13:44 +0800, Miao Xie wrote:
> On Wed, 01 Sep 2010 17:25:36 +0200, Peter Zijlstra wrote:
> > On Wed, 2010-09-01 at 18:36 +0800, Miao Xie wrote:
> >> + * This program is free software; you can redistribute it and/or modify it
> >> + * under the terms of the GNU General Public License as published by the 
> >> Free
> >> + * Software Foundation; either version 2.1 of the License, or (at your 
> >> option)
> >> + * any later version.
> >
> > The kernel is GPL v2, see COPYING. GPL v2.1 is not a suitable license.
> 
> Ok, I will change it.

Are you allowed to change it? From what I understand the FSF owns the
copyright of that code, at least that's what the preamble I cut away
implied.




Re: [PATCH 1/3] lib: introduce some memory copy macros and functions

2010-09-01 Thread Peter Zijlstra
On Wed, 2010-09-01 at 18:36 +0800, Miao Xie wrote:
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2.1 of the License, or (at your 
> option)
> + * any later version. 

The kernel is GPL v2, see COPYING. GPL v2.1 is not a suitable license.


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-26 Thread Peter Zijlstra
On Wed, 2010-08-25 at 20:09 -0700, David Rientjes wrote:
> > Oh, we can determine an upper bound.  You might just not like it.
> > Actually ext3/ext4 shouldn't be as bad as XFS, which Dave estimated to
> > be around 400k for a transaction.  My guess is that the worst case for
> > ext3/ext4 is probably around 256k or so; like XFS, most of the time,
> > it would be a lot less.  (At least, if data != journalled; if we are
> > doing data journalling and every single data block begins with
> > 0xc03b3998U, we'll need to allocate a 4k page for every single data
> > block written.)  We could dynamically calculate an upper bound if we
> > had to.  Of course, if ext3/ext4 is attached to a network block
> > device, then it could get a lot worse than 256k, of course.
> > 
> On my 8GB machine, /proc/zoneinfo says the min watermark for ZONE_NORMAL 
> is 5086 pages, or ~20MB.  GFP_ATOMIC would allow access to ~12MB of that, 
> so perhaps we should consider this is an acceptable abuse of GFP_ATOMIC as 
> a fallback behavior when GFP_NOFS or GFP_NOIO fails? 

Agreed with the fact that 400k isn't much to worry about.

Not agreed with the GFP_ATOMIC stmt.

Direct reclaim already has PF_MEMALLOC, but then we also need a
concurrency limit on that, otherwise you can still easily blow through
your reserves by having 100 concurrent users of it, resulting in an
upper bound of 4k instead, which will be too much.

There were patches to limit the direct reclaim contexts, not sure they
ever got anywhere..

It is something to consider in the re-design of the whole
direct-reclaim/writeback paths though..


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 16:53 -0400, Ted Ts'o wrote:
> On Wed, Aug 25, 2010 at 03:35:42PM +0200, Peter Zijlstra wrote:
> > 
> > While I appreciate that it might be somewhat (a lot) harder for a
> > filesystem to provide that guarantee, I'd be deeply worried about your
> > claim that its impossible.
> > 
> > It would render a system without swap very prone to deadlocks. Even with
> > the very tight dirty page accounting we currently have you can fill all
> > your memory with anonymous pages, at which point there's nothing free
> > and you require writeout of dirty pages to succeed.
> 
> For file systems that do delayed allocation, the situation is very
> similar to swapping over NFS.  Sometimes in order to make some free
> memory, you need to spend some free memory... 

Which means you need to be able to compute a bounded amount of that
memory.

>  which implies that for
> these file systems, being more aggressive about triggering writeout,
> and being more aggressive about throttling processes which are
> creating too many dirty pages, especially dirty delayed allocaiton
> pages (regardless of whether this is via write(2) or accessing mmapped
> memory), is a really good idea.

That seems unrelated, the VM has a strict dirty limit and controls
writeback when needed. That part works.

> A pool of free pages which is reserved for routines that are doing
> page cleaning would probably also be a good idea.  Maybe that's just
> retrying with GFP_ATOMIC if a normal allocation fails, or maybe we
> need our own special pool, or maybe we need to dynamically resize the
> GFP_ATOMIC pool based on how many subsystems might need to use it

We have a smallish reserve, accessible with PF_MEMALLOC, but its use is
not regulated nor bounded, it just mostly works good enough.




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 14:11 -0700, David Rientjes wrote:
> 
> There's still no hard guarantee that the memory will allocatable 
> (GFP_KERNEL, the compaction, then GFP_ATOMIC could all still fail), but I 
> don't see how continuously looping the page allocator is possibly supposed 
> to help in these situations. 

Why do you think I'm a proponent of that behaviour?

I've been arguing that the existence of GFP_NOFAIL is the bug, and I
started the whole discussion because your patchset didn't outline the
purpose of its existence, it merely changes __GFP_NOFAIL usage into
$foo_nofail() functions, which on its own is a rather daft change.

Optimizing the page allocator by removing those conditional from its
innards into an outer loop not used by most callers seems a fine goal,
but you didn't state that.

Also, I like the postfix proposed by Andi better: _i_suck() :-)


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 16:11 -0500, Christoph Lameter wrote:
> We already have __GFP_NOFAIL behavior for slab allocations since
> a __GFP_NOFAIL flag is passed through to the page allocator if no objects
> are available.
> 
> The page allocator will do the same as when called directly with
> __GFP_NOFAIL and so we have consistent behavior regardless of the
> allocator used.

Thanks for quoting the context to which you're replying, that really
helps parsing things...



Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 13:43 -0700, David Rientjes wrote:
> The cpusets case is actually the easiest to fix: use GFP_ATOMIC.  

I don't think that's a valid usage of GFP_ATOMIC, I think we should
fallback to outside the cpuset for kernel allocations by default.


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:
> There are other places where we can fail safely (for example, in jbd's
> start_this_handle, although that just pushes the layer up the stack,
> and ultimately, to userspace where most userspace programs don't
> really expect ENOMEM to get returned by a random disk write 

While talking with Chris about this, if you can indeed push the error
out that far you can basically ensure this particular fs-op does not
complicate the journal commit and thereby limit the number of extra
entries in your journal, and thus the amount of memory required.

So once stuff starts failing, push the ops back out of the filesystem
code, force a journal commit, and then let these ops retry. There is no
need to actually push the -ENOMEM all the way back to userspace.
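
In rough pseudo-code, with entirely made-up helper names:

  retry:
          ret = fs_do_operation(inode, args);     /* backs out cleanly on ENOMEM */
          if (ret == -ENOMEM) {
                  fs_force_journal_commit(sb);    /* bounds the journal, frees memory */
                  fs_wait_for_commit(sb);
                  goto retry;                     /* retry the op, don't return ENOMEM */
          }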




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 23:24 +1000, Dave Chinner wrote:
> 
> That is, the guarantee that we will always make progress simply does
> not exist in filesystems, so a mempool-like concept seems to me to
> be doomed from the start 

While I appreciate that it might be somewhat (a lot) harder for a
filesystem to provide that guarantee, I'd be deeply worried about your
claim that its impossible.

It would render a system without swap very prone to deadlocks. Even with
the very tight dirty page accounting we currently have you can fill all
your memory with anonymous pages, at which point there's nothing free
and you require writeout of dirty pages to succeed.



Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 09:20 -0400, Theodore Tso wrote:
> Well, if all of these users start having their own private pools of
> emergency memory, I'm not sure that's such a great idea either.

That's a secondary problem, and could be reduced by something like the
memory reservation system I provided in the swap-over-nfs patches. Of
course, when all these users can actually happen concurrently there's
really nothing much you can do about it.




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 09:20 -0400, Theodore Tso wrote:
> On Aug 25, 2010, at 8:52 AM, Peter Zijlstra wrote:
> > Also, there's a good reason for disliking (a), its a deadlock scenario,
> > suppose we need to write out data to get free pages, but the writing out
> > is blocked on requiring free pages.
> > 
> > There's really nothing the page allocator can do to help you there, its
> > a situation you have to avoid getting into.
> 
> Well, if all of these users start having their own private pools of
> emergency memory, I'm not sure that's such a great idea either.
> 
> And in some cases, there *is* extra memory.  For example, if the
> reason why the page allocator failed is because there isn't enough
> memory in the current process's cgroup, maybe it's important enough
> that the kernel code might decide to say, "violate the cgroup
> constraints --- it's more important that we not bring down the entire
> system" than to honor whatever yahoo decided that a particular cgroup
> has been set down to something ridiculous like 512mb, when there's
> plenty of free physical memory --- but not in that cgroup.

I'm not sure, but I think the cgroup thing doesn't account kernel
allocations, in which case the above problem doesn't exist.

For the cpuset case we punch through the cpuset constraints for kernel
allocations (unless __GFP_HARDWALL).


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 14:48 +0200, Peter Zijlstra wrote:

> > > On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:

> > > > (a) retry the loop in the jbd layer (which Andrew really doesn't
> > > > like), (b) keep our own private cache of free memory so we don't fail
> > > > and/or loop, 

> Well (b) sounds a lot saner than either of those. Simply revert to a
> state that is sub-optimal but has bounded memory use and reserve that
> memory up-front. That way you can always get out of a tight memory spot.

Also, there's a good reason for disliking (a), its a deadlock scenario,
suppose we need to write out data to get free pages, but the writing out
is blocked on requiring free pages.

There's really nothing the page allocator can do to help you there, its
a situation you have to avoid getting into.




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 07:57 -0400, Ted Ts'o wrote:
> On Wed, Aug 25, 2010 at 01:35:32PM +0200, Peter Zijlstra wrote:
> > On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:
> > > Part of the problem is that we have a few places in the kernel where
> > > failure is really not an option --- or rather, if we're going to fail
> > > while we're in the middle of doing a commit, our choices really are
> > > (a) retry the loop in the jbd layer (which Andrew really doesn't
> > > like), (b) keep our own private cache of free memory so we don't fail
> > > and/or loop, (c) fail the file system and mark it read-only, or (d)
> > > panic. 
> > 
> > d) do the allocation before you're committed to going fwd and can still
> > fail and back out.
> 
> Sure in some cases that can be done, but the commit has to happen at
> some point, or we run out of journal space, at which point we're back
> to (c) or (d).

Well (b) sounds a lot saner than either of those. Simply revert to a
state that is sub-optimal but has bounded memory use and reserve that
memory up-front. That way you can always get out of a tight memory spot.

Its what the block layer has always done to avoid the memory deadlock
situation, it has a private stash of BIOs that is big enough to always
service some IO, and as long as IO is happening stuff keeps moving fwd
and we don't deadlock.
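
That BIO stash is built on the mempool API; a minimal sketch of the same
trick for some other object type (illustrative names and sizes):

  static struct kmem_cache *foo_cache;            /* illustrative */
  static mempool_t *foo_pool;

  static int __init foo_reserve_init(void)
  {
          foo_cache = KMEM_CACHE(foo, 0);         /* hypothetical struct foo */
          if (!foo_cache)
                  return -ENOMEM;
          foo_pool = mempool_create_slab_pool(16, foo_cache); /* 16 emergency objects */
          return foo_pool ? 0 : -ENOMEM;
  }

  static struct foo *foo_get(void)
  {
          /* Falls back to the preallocated reserve when the allocator
           * fails, and waits for an element to be returned if the
           * reserve is empty -- so it keeps making progress as long as
           * previously allocated objects keep completing. */
          return mempool_alloc(foo_pool, GFP_NOIO);
  }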

Filesystems might have a slightly harder time creating such a bounded
state because there might be more involved like journals and the like,
but still it should be possible to create something like that (my swap
over nfs patches created such a state for the network rx side of
things).

Also, we cannot let our fear of crappy userspace get in the way of doing
sensible things. Your example of write(2) returning -ENOMEM is not
correct though, the syscall (and the page_mkwrite callback for mmap()s)
happens before we actually dirty the data and need to write things out,
so we can always simply wait for memory to become available to dirty.




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Peter Zijlstra
On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:
> Part of the problem is that we have a few places in the kernel where
> failure is really not an option --- or rather, if we're going to fail
> while we're in the middle of doing a commit, our choices really are
> (a) retry the loop in the jbd layer (which Andrew really doesn't
> like), (b) keep our own private cache of free memory so we don't fail
> and/or loop, (c) fail the file system and mark it read-only, or (d)
> panic. 

d) do the allocation before you're committed to going fwd and can still
fail and back out.
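
That is, something along these lines (sketch, made-up names):

  struct jbd_item *it;

  it = kmalloc(sizeof(*it), GFP_NOFS);            /* still allowed to fail here */
  if (!it)
          return -ENOMEM;                         /* back out, nothing committed yet */

  start_commit(handle);                           /* from here on failure is not an option */
  use_preallocated(handle, it);                   /* consumes the reservation, no allocation */
  finish_commit(handle);
  return 0;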


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-24 Thread Peter Zijlstra
On Tue, 2010-08-24 at 23:55 +1000, Dave Chinner wrote:
> no one has noticed
> that XFS has been doing these "allocations can't fail" loop in
> kmem_alloc() and kmem_zone_alloc(), well, forever 

I occasionally check if its still there and cry a little ;-)


Re: [patch 4/5] btrfs: add nofail variant of set_extent_dirty

2010-08-24 Thread Peter Zijlstra
On Tue, 2010-08-24 at 03:50 -0700, David Rientjes wrote:
> Add set_extent_dirty_nofail().  This function is equivalent to
> set_extent_dirty(), except that it will never fail because of allocation
> failure and instead loop forever trying to allocate memory.
> 
> If the first allocation attempt fails, a warning will be emitted,
> including a call trace.  Subsequent failures will suppress this warning.
> 
> This was added as a helper function for documentation and auditability.
> No future callers should be added.
> 
> Signed-off-by: David Rientjes 
> ---
>  fs/btrfs/extent-tree.c |8 
>  fs/btrfs/extent_io.c   |   19 +++
>  fs/btrfs/extent_io.h   |2 ++
>  3 files changed, 25 insertions(+), 4 deletions(-)

I'd much rather someone helped Mason clean this up.


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-24 Thread Peter Zijlstra
On Tue, 2010-08-24 at 03:50 -0700, David Rientjes wrote:
> These were added as helper functions for documentation and auditability.
> No future callers should be added.

git grep GFP_NOFAIL isn't auditable enough?

might as well declare these functions deprecated if you really want to
do this.

FWIW I think Mason said he'd fix btrfs to not suck like this 'soon'.


Re: [PATCH -v11][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Peter Zijlstra
On Thu, 2009-01-15 at 01:46 +0100, Nick Piggin wrote:

> Hmm, well this is rather a slow path, I would say. I'd prefer not to
> modify schedule in this way (if we just get scheduled back on after
> being switched away, the subsequent call to schedule is going to be
> cache hot and not do too much work).
> 
> preempt_enable_noresched maybe if you really care, would close up the
> window even more. But is it really worthwhile? We'd want to see numbers
> (when in doubt, keep it  simpler).

I initially did the preempt_enable_no_resched() thing and that showed
some improvement for PREEMPT=y kernels (lost the numbers though).

When I redid all the patches I tried closing that last hole by doing
that __schedule() thing, never realizing that schedule() would then get
extra overhead... d'oh.

I agree that that isn't worth it. I shall revert to
preempt_enable_no_resched() and try to get some new numbers.



Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Peter Zijlstra
On Wed, 2009-01-14 at 11:36 -0800, Andrew Morton wrote:

> Do people enable CONFIG_SCHED_DEBUG?

Well, I have it always enabled, but I've honestly no idea if that makes
me weird.

> CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty
> small subset?

Could be, do you fancy me doing a sysctl? shouldn't be hard.



Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Peter Zijlstra
On Wed, 2009-01-14 at 10:53 -0800, Andrew Morton wrote:
> On Wed, 14 Jan 2009 19:33:19 +0100 Ingo Molnar  wrote:
> 
> > Please pull the adaptive-mutexes-for-linus git tree
> 
> 
> 
> - It seems a major shortcoming that the feature is disabled if
>   CONFIG_DEBUG_MUTEXES=y.  It means that lots of people won't test it.

Yes, that's a bit unfortunate, a simple patch to enable that is:

I must admit I'm a bit stumped on why that debug check triggers, I
couldn't reproduce today, but Ingo ran into it quite quickly :/

Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -74,7 +74,7 @@ void debug_mutex_unlock(struct mutex *lo
return;
 
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+   /* DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); */
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
mutex_clear_owner(lock);
 }
Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -148,7 +148,7 @@ __mutex_lock_common(struct mutex *lock, 
 
preempt_disable();
mutex_acquire(&lock->dep_map, subclass, 0, ip);
-#if defined(CONFIG_SMP) && !defined(CONFIG_DEBUG_MUTEXES)
+#ifdef CONFIG_SMP
/*
 * Optimistic spinning.
 *
@@ -162,9 +162,6 @@ __mutex_lock_common(struct mutex *lock, 
 * Since this needs the lock owner, and this mutex implementation
 * doesn't track the owner atomically in the lock field, we need to
 * track it non-atomically.
-*
-* We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-* to serialize everything.
 */
 
for (;;) {


> - When people hit performance/latency oddities, it would be nice if
>   they had a /proc knob with which they could disable this feature at
>   runtime.
> 
>   This would also be useful for comparative performance testing.

CONFIG_SCHED_DEBUG=y

echo NO_MUTEX_SPIN > /debug/sched_features




Re: [PATCH -v11][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Peter Zijlstra
On Wed, 2009-01-14 at 18:18 +0100, Nick Piggin wrote:

> > @@ -173,21 +237,21 @@ __mutex_lock_common(struct mutex *lock, 
> > spin_unlock_mutex(&lock->wait_lock, flags);
> >  
> > debug_mutex_free_waiter(&waiter);
> > +   preempt_enable();
> > return -EINTR;
> > }
> > __set_task_state(task, state);
> >  
> > /* didnt get the lock, go to sleep: */
> > spin_unlock_mutex(&lock->wait_lock, flags);
> > -   schedule();
> > +   __schedule();
> 
> Why does this need to do a preempt-disabled schedule? After we schedule
> away, the next task can do arbitrary things or reschedule itself, so if
> we have not anticipated such a condition here, then I can't see what
> __schedule protects. At least a comment is in order?

From:
http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/mutex-preempt.patch

Subject: mutex: preemption fixes
From: Peter Zijlstra 
Date: Wed Jan 14 15:36:26 CET 2009

The problem is that dropping the spinlock right before schedule is a voluntary
preemption point and can cause a schedule, right after which we schedule again.

Fix this inefficiency by keeping preemption disabled until we schedule, do this
by explicitly disabling preemption and providing a schedule() variant that
assumes preemption is already disabled.

Signed-off-by: Peter Zijlstra 

> Pity to add the call overhead to schedule just for this case.

Good point, seeing any way around that?

>  BTW. __schedule shouldn't need to be asmlinkage?

TBH I've no clue, probably not, Ingo?



[PATCH -v11][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Peter Zijlstra
Full series, including changelogs available at:

 http://programming.kicks-ass.net/kernel-patches/mutex-adaptive-spin/

and should shortly appear in a git tree near Ingo :-)

 mutex: small cleanup
 mutex: preemption fixes
 mutex: implement adaptive spinning
 mutex: set owner in mutex_lock() only
 mutex: adaptive spin for debug too
 mutex: adaptive spin performance tweaks

---
 include/linux/mutex.h   |5 +-
 include/linux/sched.h   |2 
 kernel/mutex-debug.c|9 ---
 kernel/mutex-debug.h|   18 ---
 kernel/mutex.c  |  118 
 kernel/mutex.h  |   22 
 kernel/sched.c  |   71 +++-
 kernel/sched_features.h |1 
 8 files changed, 204 insertions(+), 42 deletions(-)

Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -10,6 +10,11 @@
  * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
  * David Howells for suggestions and improvements.
  *
+ *  - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ *from the -rt tree, where it was originally implemented for rtmutexes
+ *by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ *and Sven Dietrich.
+ *
  * Also see Documentation/mutex-design.txt.
  */
 #include 
@@ -46,6 +51,7 @@ __mutex_init(struct mutex *lock, const c
atomic_set(&lock->count, 1);
spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);
+   mutex_clear_owner(lock);
 
debug_mutex_init(lock, name, key);
 }
@@ -91,6 +97,7 @@ void inline __sched mutex_lock(struct mu
 * 'unlocked' into 'locked' state.
 */
__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+   mutex_set_owner(lock);
 }
 
 EXPORT_SYMBOL(mutex_lock);
@@ -115,6 +122,14 @@ void __sched mutex_unlock(struct mutex *
 * The unlocking fastpath is the 0->1 transition from 'locked'
 * into 'unlocked' state:
 */
+#ifndef CONFIG_DEBUG_MUTEXES
+   /*
+* When debugging is enabled we must not clear the owner before time,
+* the slow path will always be taken, and that clears the owner field
+* after verifying that it was indeed current.
+*/
+   mutex_clear_owner(lock);
+#endif
__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
 }
 
@@ -129,21 +144,71 @@ __mutex_lock_common(struct mutex *lock, 
 {
struct task_struct *task = current;
struct mutex_waiter waiter;
-   unsigned int old_val;
unsigned long flags;
 
+   preempt_disable();
+   mutex_acquire(&lock->dep_map, subclass, 0, ip);
+#ifdef CONFIG_SMP
+   /*
+* Optimistic spinning.
+*
+* We try to spin for acquisition when we find that there are no
+* pending waiters and the lock owner is currently running on a
+* (different) CPU.
+*
+* The rationale is that if the lock owner is running, it is likely to
+* release the lock soon.
+*
+* Since this needs the lock owner, and this mutex implementation
+* doesn't track the owner atomically in the lock field, we need to
+* track it non-atomically.
+*/
+
+   for (;;) {
+   struct thread_info *owner;
+
+   /*
+* If there's an owner, wait for it to either
+* release the lock or go to sleep.
+*/
+   owner = ACCESS_ONCE(lock->owner);
+   if (owner && !mutex_spin_on_owner(lock, owner))
+   break;
+
+   if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+   lock_acquired(&lock->dep_map, ip);
+   preempt_enable();
+   return 0;
+   }
+
+   /*
+* When there's no owner, we might have preempted between the
+* owner acquiring the lock and setting the owner field. If
+* we're an RT task that will live-lock because we won't let
+* the owner complete.
+*/
+   if (!owner && (need_resched() || rt_task(task)))
+   break;
+
+   /*
+* The cpu_relax() call is a compiler barrier which forces
+* everything in this loop to be re-loaded. We don't need
+* memory barriers as we'll eventually observe the right
+* values at the cost of a few extra spins.
+*/
+   cpu_relax();
+   }
+#endif
spin_lock_mutex(&lock->wait_lock, flags);
 
debug_mutex_lock_common(lock, &waiter);
-   mut

Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Peter Zijlstra
On Mon, 2009-01-12 at 19:32 +0200, Avi Kivity wrote:
> Peter Zijlstra wrote:
> > Spinlocks can use 'pure' MCS locks.
> >   
> 
> How about this, then.  In mutex_lock(), keep wait_lock locked and only 
> release it when scheduling out.  Waiter spinning naturally follows.  If 
> spinlocks are cache friendly (are they today?) 

(no they're not, Nick's ticket locks still spin on a shared cacheline
IIRC -- the MCS locks mentioned could fix this)

> we inherit that.  If 
> there is no contention on the mutex, then we don't need to reacquire the 
> wait_lock on mutex_unlock() (not that the atomic op is that expensive 
> these days).

That might actually work, although we'd have to move the
__mutex_slowpath_needs_to_unlock() branch outside wait_lock otherwise
we'll deadlock :-)

It might be worth trying this if we get serious fairness issues with the
current construct.



Re: [PATCH -v9][RFC] mutex: implement adaptive spinning

2009-01-13 Thread Peter Zijlstra
On Tue, 2009-01-13 at 08:49 -0800, Linus Torvalds wrote:
> 
> So do a v10, and ask people to test.

---
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra 
Date: Mon Jan 12 14:01:47 CET 2009

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 304% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   298932

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   98287

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is a means to disable this behaviour at runtime (without having to reboot
the system) when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_OWNER_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo OWNER_SPIN > /debug/sched_features

Signed-off-by: Peter Zijlstra 
---
Changes since -v9:
 - cmpxchg acquire in the spin loop

Changes since -v8:
 - dropped the fairness

Changes since -v7:
 - made all the mutex ops fully non-preemptible
 - implemented FIFO fairness using the wait_list
 - fixed owner tracking issues

 include/linux/mutex.h   |5 +
 include/linux/sched.h   |2 
 kernel/mutex-debug.c|9 ---
 kernel/mutex-debug.h|   18 +++---
 kernel/mutex.c  |  128 +---
 kernel/mutex.h  |   22 +++-
 kernel/sched.c  |   71 +-
 kernel/sched_features.h |1 
 8 files changed, 216 insertions(+), 40 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct thread_info  *owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
const char  *name;
void*magic;
 #endif
@@ -68,7 +70,6 @@ struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct mutex*lock;
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -328,7 +328,9 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+asmlinkage void __schedule(void);
 asmlinkage void schedule(void);
+extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
-   waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -82,7 +76,7 @@ void debug_mutex_unlock(struct mutex *lo
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(

Re: [PATCH -v9][RFC] mutex: implement adaptive spinning

2009-01-13 Thread Peter Zijlstra
On Tue, 2009-01-13 at 17:21 +0100, Peter Zijlstra wrote:
> On Tue, 2009-01-13 at 08:16 -0800, Linus Torvalds wrote:
> > 
> > On Tue, 13 Jan 2009, Peter Zijlstra wrote:
> > > 
> > > Change mutex contention behaviour such that it will sometimes busy wait on
> > > acquisition - moving its behaviour closer to that of spinlocks.
> > 
> > Okey, dokey. Looks reasonable, but I wonder if this part came from v8 and 
> > wasn't intentional:
> > 
> > > + if (atomic_xchg(&lock->count, -1) == 1) {
> > > + lock_acquired(&lock->dep_map, ip);
> > > + mutex_set_owner(lock);
> > > + preempt_enable();
> > > + return 0;
> > > + }
> > 
> > Now you're forcing the slow-path on unlock. Maybe it was intentional, 
> > maybe it wasn't. Did you perhaps mean
> > 
> > if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> > 
> > here? I thought we agreed it was safe, if only because it should be 
> > equivalent to just having done "mutex_trylock()" instead of a "real" lock 
> > sequence.
> 
> Yes, that was an 'accident' from -v8, yes we did think the cmpxchg was
> good, however I did get some spurious lockups on -v7, and I only noticed
> the thing after I'd done most of the testing, so I decided to let it be
> for now.
> 
> Let me put the cmpxchg back in and see if this is all still good (only
> 3*2*2 configs to test :-).

Ok, tested only 1, but that was the one I remember lockups from -- and
that seems to be good with the cmpxchg.

Do you fancy me sending v10 or will you make that change locally?


Re: [PATCH -v9][RFC] mutex: implement adaptive spinning

2009-01-13 Thread Peter Zijlstra
On Tue, 2009-01-13 at 08:16 -0800, Linus Torvalds wrote:
> 
> On Tue, 13 Jan 2009, Peter Zijlstra wrote:
> > 
> > Change mutex contention behaviour such that it will sometimes busy wait on
> > acquisition - moving its behaviour closer to that of spinlocks.
> 
> Okey, dokey. Looks reasonable, but I wonder if this part came from v8 and 
> wasn't intentional:
> 
> > +   if (atomic_xchg(&lock->count, -1) == 1) {
> > +   lock_acquired(&lock->dep_map, ip);
> > +   mutex_set_owner(lock);
> > +   preempt_enable();
> > +   return 0;
> > +   }
> 
> Now you're forcing the slow-path on unlock. Maybe it was intentional, 
> maybe it wasn't. Did you perhaps mean
> 
>   if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> 
> here? I thought we agreed it was safe, if only because it should be 
> equivalent to just having done "mutex_trylock()" instead of a "real" lock 
> sequence.

Yes, that was an 'accident' from -v8, yes we did think the cmpxchg was
good, however I did get some spurious lockups on -v7, and I only noticed
the thing after I'd done most of the testing, so I decided to let it be
for now.

Let me put the cmpxchg back in and see if this is all still good (only
3*2*2 configs to test :-).


[PATCH -v9][RFC] mutex: implement adaptive spinning

2009-01-13 Thread Peter Zijlstra
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra 
Date: Mon Jan 12 14:01:47 CET 2009

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 304% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   298932

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   98287

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is a means to disable this behaviour at runtime (without having to reboot
the system) when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_OWNER_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo OWNER_SPIN > /debug/sched_features

Signed-off-by: Peter Zijlstra 
---
Changes since -v8:
 - dropped the fairness

Changes since -v7:
 - made all the mutex ops fully non-preemptible
 - implemented FIFO fairness using the wait_list
 - fixed owner tracking issues

 include/linux/mutex.h   |5 +
 include/linux/sched.h   |2 
 kernel/mutex-debug.c|9 ---
 kernel/mutex-debug.h|   18 +++---
 kernel/mutex.c  |  128 +---
 kernel/mutex.h  |   22 +++-
 kernel/sched.c  |   71 +-
 kernel/sched_features.h |1 
 8 files changed, 216 insertions(+), 40 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct thread_info  *owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
const char  *name;
void*magic;
 #endif
@@ -68,7 +70,6 @@ struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct mutex*lock;
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -328,7 +328,9 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+asmlinkage void __schedule(void);
 asmlinkage void schedule(void);
+extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
-   waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -82,7 +76,7 @@ void debug_mutex_unlock(struct mutex *lo
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+   mutex_clear_owner(lock);
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +89,6 @@ void debug_mutex_init(s

Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 19:33 +0200, Boaz Harrosh wrote:

> Which brings me back to my initial reaction to this work. Do we need
> two flavors of Mutex? some program sections need Fairness, some need
> performance. Some need low-latency, some need absolute raw CPU power.

Thing is, it's the kernel; we cannot have such knobs per task. So we have
to pick one and stick with it -- the 'best' we can do is what PREEMPT_RT
does and replace the thing wholesale at build time.

> Because at the end of the day spinning in a saturated CPU work-load
> that does not care about latency, eats away cycles that could be spent
> on computation. Think multi-threaded video processing for example. 
> Thing I would like to measure is 
> 1 - how many times we spin and at the end get a lock
> 2 - how many times we spin and at the end sleep.
> 3 - how many times we sleep like before.
> vs. In old case CPU spent on scheduling. Just to see if we are actually
> losing cycles at the end.

Feel free to do so ;-)



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 12:14 -0500, Chris Mason wrote:
> On Mon, 2009-01-12 at 17:50 +0100, Peter Zijlstra wrote:
> > > 
> > > (the file stat run is total run time, so lower is better.  The other
> > > numbers are files or MB per second, so higher is better)
> > > 
> > > For the file create run, v8 had much lower system time than v7,
> > > averaging 1s of sys time per proc instead of 1.6s.
> > 
> > Right, how about the spread in completion time, because that is the only
> > reason I tried this fairness stuff, because you reported massive
> > differences there.
> > 
> 
> I reran the numbers with a slightly different kernel config and they
> have changed somewhat.  These are just for the 4k file create run, all
> numbers in files created per second (and the numbers are stable across
> runs)
> 
> v8 avg 176.90 median 171.85 std 12.49 high 215.97 low 165.54
> v7 avg 169.02 median 163.77 std 16.82 high 267.95 low 157.95

Any opinions on the fairness matter, will -v9 be unlocked and unfair
again?



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 08:20 -0800, Linus Torvalds wrote:
> 
> On Mon, 12 Jan 2009, Linus Torvalds wrote:
> > 
> > You made it back into the locked version.
> 
> Btw, even if you probably had some reason for this, one thing to note is 
> that I think Chris' performance testing showed that the version using a 
> lock was inferior to his local btrfs hack, while the unlocked version 
> actually beat his hack.
> 
> Maybe I misunderstood his numbers, though. But if I followed that sub-part 
> of the test right, it really means that the locked version is pointless - 
> it will never be able to replace peoples local hacks for this same thing, 
> because it just doesn't give the performance people are looking for.
> 
> Since the whole (and _only_) point of this thing is to perform well, 
> that's a big deal.

Like said in reply to Chris' email, I just wanted to see if fairness was
worth the effort, because the pure unlocked spin showed significant
unfairness (and I know some people really care about some level of
fairness).

Initial testing with the simple test-mutex thing didn't show too bad
numbers.



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 18:13 +0200, Avi Kivity wrote:

> One thing that worries me here is that the spinners will spin on a 
> memory location in struct mutex, which means that the cacheline holding 
> the mutex (which is likely to be under write activity from the owner) 
> will be continuously shared by the spinners, slowing the owner down when 
> it needs to unshare it.  One way out of this is to spin on a location in 
> struct mutex_waiter, and have the mutex owner touch it when it schedules 
> out.

Yeah, that is what pure MCS locks do -- however I don't think its a
feasible strategy for this spin/sleep hybrid.

> So:
> - each task_struct has an array of currently owned mutexes, appended to 
> by mutex_lock()

That's not going to fly, I think. Lockdep does this but it's very
expensive and has some issues. We're currently at 48 max owners, and
still some code paths manage to exceed that.

> - mutex waiters spin on mutex_waiter.wait, which they initialize to zero
> - when switching out of a task, walk the mutex list, and for each mutex, 
> bump each waiter's wait variable, and clear the owner array

Which is O(n).

> - when unlocking a mutex, bump the nearest waiter's wait variable, and 
> remove from the owner array
> 
> Something similar might be done to spinlocks to reduce cacheline 
> contention from spinners and the owner.

Spinlocks can use 'pure' MCS locks.



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
On Mon, 2009-01-12 at 11:45 -0500, Chris Mason wrote:
> On Mon, 2009-01-12 at 08:20 -0800, Linus Torvalds wrote:
> > 
> > On Mon, 12 Jan 2009, Linus Torvalds wrote:
> > > 
> > > You made it back into the locked version.
> > 
> > Btw, even if you probably had some reason for this, one thing to note is 
> > that I think Chris' performance testing showed that the version using a 
> > lock was inferior to his local btrfs hack, while the unlocked version 
> > actually beat his hack.
> > 
> 
> The spinning hack was faster than everything before v7 (we'll call it
> the Linus-fix), and the v7 code was much faster than my spin.
> 
> This is somewhere in between, with slightly better fairness than v7.
> 
>                   spin         v7           v8
> dbench 50         580MB/s      789MB/s      421MB/s
> file creates      152 file/s   162 file/s   195 file/s
> file stat         3.8s total   2.3s total   5.3s total
> 
> (the file stat run is total run time, so lower is better.  The other
> numbers are files or MB per second, so higher is better)
> 
> For the file create run, v8 had much lower system time than v7,
> averaging 1s of sys time per proc instead of 1.6s.

Right, how about the spread in completion time, because that is the only
reason I tried this fairness stuff, because you reported massive
differences there.



[PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Peter Zijlstra
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra 
Date: Mon Jan 12 14:01:47 CET 2009

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   296604

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   85870

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is a means to disable this behaviour at runtime (without having to reboot
the system) when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_OWNER_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo OWNER_SPIN > /debug/sched_features

Signed-off-by: Peter Zijlstra 
---
Changes since -v7:
 - made all the mutex ops fully non-preemptible
   - this fixes the getting preempted with owner=null issues
 - implemented FIFO fairness using the wait_list
 - fixed owner tracking issues

Dmitry noted that since we now have FIFO order, we need not put all
spinners to rest when one falls into schedule, but only those behind
it in the queue -- I haven't yet implemented this, but it should be
doable without a list-walk.

PREEMPT_NONE, PREEMPT_VOLUNTARY work as advertised
PREEMPT=y still seems to madly over-preempt or something.

 include/linux/mutex.h   |   10 ++-
 include/linux/sched.h   |3 
 kernel/mutex-debug.c|9 --
 kernel/mutex-debug.h|   18 +++--
 kernel/mutex.c  |  154 ++--
 kernel/mutex.h  |   22 ++
 kernel/sched.c  |   82 -
 kernel/sched_features.h |1 
 8 files changed, 257 insertions(+), 42 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,11 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct thread_info  *owner;
+   int waiters;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
const char  *name;
void*magic;
 #endif
@@ -67,8 +70,11 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_SMP
+   int spin;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -328,7 +328,10 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+asmlinkage void __schedule(void);
 asmlinkage void schedule(void);
+extern int mutex_spin_on_owner(struct mutex_waiter *waiter,
+  struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
-   waiter->lock = lo

Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Fri, 2009-01-09 at 11:44 -0500, Steven Rostedt wrote:

> When we get to the schedule() it then needs to be a:
> 
>   preempt_enable_no_resched();
>   schedule();

On that note:

Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -220,7 +220,9 @@ __mutex_lock_common(struct mutex *lock, 
__set_task_state(task, state);
 
/* didnt get the lock, go to sleep: */
+   preempt_disable();
spin_unlock_mutex(&lock->wait_lock, flags);
+   preempt_enable_no_resched();
schedule();
spin_lock_mutex(&lock->wait_lock, flags);
}


actually improves mutex performance on PREEMPT=y


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Fri, 2009-01-09 at 10:59 -0500, Steven Rostedt wrote:

> > 
> > Adding that blocking on !owner utterly destroys everything.
> 
> I was going to warn you about that ;-)
> 
> Without the check for !owner, you are almost guaranteed to go to sleep 
> every time. Here's why:
> 
> You are spinning and thus have a hot cache on that CPU.
> 
> The owner goes to unlock but will be in a cold cache. It sets lock->owner 
> to NULL, but is still in cold cache so it is a bit slower.
> 
> Once the spinner sees the NULL, it shoots out of the spin but sees the 
> lock is still not available then goes to sleep. All before the owner could 
> release it. This could probably happen at every contention. Thus, you lose 
> the benefit of spinning. You probably make things worse because you add a 
> spin before every sleep.

Which is why I changed the inner loop to:

  l_owner = ACCESS_ONCE(lock->owner);
  if (l_owner && l_owner != owner)
	break;

So that that would continue spinning.


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Fri, 2009-01-09 at 11:47 +0100, Peter Zijlstra wrote:

> > So I think the bug is still there, we just hid it better by breaking out 
> > of the loop with that "if (need_resched())" always eventually triggering. 
> > And it would be ok if it really is guaranteed to _eventually_ trigger, and 
> > I guess with timeslices it eventually always will, but I suspect we could 
> > have some serious latency spikes.
> 
> Yes, the owner getting preempted after acquiring the lock, but before
> setting the owner can give some nasties :-(
> 
> I initially did that preempt_disable/enable around the fast path, but I
> agree that slowing down the fast path is unwelcome.
> 
> Alternatively we could go back to block on !owner, with the added
> complexity of not breaking out of the spin on lock->owner != owner
> when !lock->owner, so that the premature owner clearing of the unlock
> fast path will not force a schedule right before we get a chance to
> acquire the lock.
> 
> Let me do that..

Ok a few observations..

Adding that need_resched() in the outer loop utterly destroys the
performance gain for PREEMPT=y. Voluntary preemption is mostly good, but
gives somewhat unstable results.

Adding that blocking on !owner utterly destroys everything.

Going to look into where that extra preemption comes from.


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Thu, 2009-01-08 at 11:54 -0800, Linus Torvalds wrote:

> I was pretty sure that adding the unlocked loop should provably not change 
> the mutex lock semantics. Why? Because it's just basically equivalent to 
> just doing the mutex_trylock() without really changing anything really 
> fundamental in the mutex logic.
> 
> And that argument is sadly totally bogus. 

It fails for the RT case, yes. It should still be true for regular tasks
- if the owner tracking was accurate.

> The thing is, we used to have this guarantee that any contention would 
> always go into the slowpath, and then in the slow-path we serialize using 
> the spinlock. 
> 
> So I think the bug is still there, we just hid it better by breaking out 
> of the loop with that "if (need_resched())" always eventually triggering. 
> And it would be ok if it really is guaranteed to _eventually_ trigger, and 
> I guess with timeslices it eventually always will, but I suspect we could 
> have some serious latency spikes.

Yes, the owner getting preempted after acquiring the lock, but before
setting the owner can give some nasties :-(

I initially did that preempt_disable/enable around the fast path, but I
agree that slowing down the fast path is unwelcome.

Alternatively we could go back to block on !owner, with the added
complexity of not breaking out of the spin on lock->owner != owner
when !lock->owner, so that the premature owner clearing of the unlock
fast path will not force a schedule right before we get a chance to
acquire the lock.

Let me do that..

> The problem? Setting "lock->count" to 0. That will mean that the next 
> "mutex_unlock()" will not necessarily enter the slowpath at all, and won't 
> necessarily wake things up like it should.

That's exactly what __mutex_fastpath_trylock() does (or can do,
depending on the implementation), so if regular mutexes are correct in
the face of a trylock stealing the lock in front of a woken up waiter,
then we're still good.

That said, I'm not seeing how mutexes aren't broken already.

say A locks it: counter 1->0
then B contends: counter 0->-1, added to wait list
then C contends: counter -1, added to wait list
then A releases: counter -1->1, wake someone up, say B
then D trylocks: counter 1->0
so B is back to wait list
then D releases, 0->1, no wakeup

Aaah, B going back to sleep sets it to -1

Therefore, I think we're good.


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Thu, 2009-01-08 at 11:13 -0800, Linus Torvalds wrote:
> 
> On Thu, 8 Jan 2009, Chris Mason wrote:
> > 
> > It is less fair though, the 50 proc parallel creates had a much bigger
> > span between the first and last proc's exit time.  This isn't a huge
> > shock, I think it shows the hot path is closer to a real spin lock.
> 
> Actually, the real spin locks are now fair. We use ticket locks on x86.

> We _could_ certainly aim for using ticket locks for mutexes too, that 
> might be quite nice.

Not really, ticket locks cannot handle a spinner going away - and we
need that here.

I've googled around a bit and MCS locks
(http://www.cs.rice.edu/~johnmc/papers/asplos91.pdf) look like a viable
way to gain fairness in our situation.
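
For readers who don't have the paper handy, a minimal sketch of the MCS
queue-lock idea (illustrative only: the names mcs_lock/mcs_node are made up
here, and memory barriers / compiler annotations are omitted). Each waiter
spins on a flag in its *own* node, so the lock cacheline is not bounced
between all the spinners, and the unlock path hands the lock to the next
queued waiter in strict FIFO order:

struct mcs_node {
	struct mcs_node *next;
	int locked;
};

struct mcs_lock {
	struct mcs_node *tail;
};

static void mcs_lock(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	node->next = NULL;
	node->locked = 0;

	prev = xchg(&lock->tail, node);	/* atomically join the queue tail */
	if (!prev)
		return;			/* queue was empty: lock acquired */

	prev->next = node;		/* link in behind the previous waiter */
	while (!node->locked)
		cpu_relax();		/* spin on our own cacheline only */
}

static void mcs_unlock(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next = node->next;

	if (!next) {
		/* nobody visibly queued behind us: try to reset the tail */
		if (cmpxchg(&lock->tail, node, NULL) == node)
			return;
		/* a new waiter raced in; wait for it to link itself */
		while (!(next = node->next))
			cpu_relax();
	}
	next->locked = 1;		/* hand the lock to the next waiter */
}

The strict queue handoff is also why a waiter cannot simply give up and go
to sleep without extra machinery -- the same property that makes plain
ticket locks unsuitable for the spin-then-sleep hybrid discussed here.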



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 11:13 -0800, Linus Torvalds wrote:
> 
> Well, at least we do unless you enable that broken paravirt support. I'm 
> not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I 
> don't much care.

Because the virtual cpu that has the ticket might not get scheduled for
a while, even though another vcpu with a spinner is scheduled.

The whole (para)virt is a nightmare in that respect.
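
To make the problem concrete, here is a tiny ticket-lock sketch (hypothetical
code for illustration, not the actual x86 implementation; atomic_t is used
for brevity). Acquisition order is strictly FIFO, so if the hypervisor
deschedules the vcpu that holds the next ticket, every later spinner burns
cycles waiting on a vcpu that isn't even running:

struct ticket_lock {
	atomic_t next;			/* next ticket to hand out */
	atomic_t serving;		/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *lock)
{
	int ticket = atomic_inc_return(&lock->next) - 1;	/* take a ticket */

	while (atomic_read(&lock->serving) != ticket)
		cpu_relax();		/* strict FIFO: wait for our number */
}

static void ticket_unlock(struct ticket_lock *lock)
{
	atomic_inc(&lock->serving);	/* pass the lock to the next ticket holder */
}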


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 08:58 -0800, Linus Torvalds wrote:
> 
> Ok, I've gone through -v7, and I'm sure you're all shocked to hear it, but 
> I have no complaints. 

*cheer*, except I guess we need to figure out what goes bad for Chris.

> Except that you dropped all the good commit 
> commentary you had earlier ;)

Yeah, I've yet to add that back, will do.

> The patch looks pretty good (except for the big "#if 0" block in 
> mutex-debug.c that I hope gets fixed, but I can't even really claim that I 
> can be bothered), the locking looks fine (ie no locking at all), and the 
> numbers seem pretty convinving.
> 
> Oh, and I think the open-coded
> 
>   atomic_cmpxchg(count, 1, 0) == 1
> 
> could possibly just be replaced with a simple __mutex_fastpath_trylock(). 
> I dunno.

__mutex_fastpath_trylock() isn't always that neat -- see
include/asm-generic/mutex-xchg.h -- and it's a NOP on DEBUG_MUTEXES.

Note how I used old_val for the list_empty() thing as well, we could
possibly drop that extra condition though.



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 10:28 -0500, Steven Rostedt wrote:
> On Thu, 8 Jan 2009, Peter Zijlstra wrote:

> > in the unlikely case we schedule(), that seems expensive enough to want
> > to make the spin case ever so slightly faster.
> 
> OK, that makes sense, but I would comment that. Otherwise, it just looks 
> like another misuse of the unlikely annotation.

OK, sensible enough.

> > > Should we need to do a "get_cpu" or something? Couldn't the CPU disappear 
> > > between these two calls. Or does it do a stop-machine and the preempt 
> > > disable will protect us?
> > 
> > Did you miss the preempt_disable() a bit up?
> 
> No, let me rephrase it better. Does the preempt_disable protect against
> another CPU from going off line? Does taking a CPU off line do a 
> stop_machine?

Yes and yes.


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 10:09 -0500, Steven Rostedt wrote:
> On Thu, 8 Jan 2009, Peter Zijlstra wrote:
> > Index: linux-2.6/kernel/sched.c
> > ===
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -4672,6 +4672,72 @@ need_resched_nonpreemptible:
> >  }
> >  EXPORT_SYMBOL(schedule);
> >  
> > +#ifdef CONFIG_SMP
> > +/*
> > + * Look out! "owner" is an entirely speculative pointer
> > + * access and not reliable.
> > + */
> > +int spin_on_owner(struct mutex *lock, struct thread_info *owner)
> > +{
> > +   unsigned int cpu;
> > +   struct rq *rq;
> > +   int ret = 1;
> > +
> > +   if (unlikely(!sched_feat(OWNER_SPIN)))
> 
> I would remove the "unlikely", if someone turns OWNER_SPIN off, then you 
> have the wrong decision being made. Choices by users should never be in a 
> "likely" or "unlikely" annotation. It's discrimination ;-)

in the unlikely case we schedule(), that seems expensive enough to want
to make the spin case ever so slightly faster.

> > +   return 0;
> > +
> > +   preempt_disable();
> > +#ifdef CONFIG_DEBUG_PAGEALLOC
> > +   /*
> > +* Need to access the cpu field knowing that
> > +* DEBUG_PAGEALLOC could have unmapped it if
> > +* the mutex owner just released it and exited.
> > +*/
> > +   if (probe_kernel_address(&owner->cpu, cpu))
> > +   goto out;
> > +#else
> > +   cpu = owner->cpu;
> > +#endif
> > +
> > +   /*
> > +* Even if the access succeeded (likely case),
> > +* the cpu field may no longer be valid.
> > +*/
> > +   if (cpu >= nr_cpumask_bits)
> > +   goto out;
> > +
> > +   /*
> > +* We need to validate that we can do a
> > +* get_cpu() and that we have the percpu area.
> > +*/
> > +   if (!cpu_online(cpu))
> > +   goto out;
> 
> Should we need to do a "get_cpu" or something? Couldn't the CPU disappear 
> between these two calls. Or does it do a stop-machine and the preempt 
> disable will protect us?

Did you miss the preempt_disable() a bit up?

> > +
> > +   rq = cpu_rq(cpu);
> > +
> > +   for (;;) {
> > +   if (lock->owner != owner)
> > +   break;
> > +
> > +   /*
> > +* Is that owner really running on that cpu?
> > +*/
> > +   if (task_thread_info(rq->curr) != owner)
> > +   break;
> > +
> > +   if (need_resched()) {
> > +   ret = 0;
> > +   break;
> > +   }
> > +
> > +   cpu_relax();
> > +   }
> > +out:
> > +   preempt_enable_no_resched();
> > +   return ret;
> > +}
> > +#endif



[PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Thu, 2009-01-08 at 15:18 +0100, Ingo Molnar wrote:

> [ A detail, -tip testing found that the patch breaks mutex debugging:
> 
>   =
>   [ BUG: bad unlock balance detected! ]
>   -
>   swapper/0 is trying to release lock (cpu_add_remove_lock) at:
>   [] mutex_unlock+0xe/0x10
>   but there are no more locks to release!
> 
>  but that's a detail for -v7 ;-) ]

Here it is..

I changed the optimistic spin cmpxchg trickery to not require the -1000
state; please double and triple check the code.

This was done because the interaction between trylock_slowpath and that
-1000 state hurt my head.

---
Subject: 
From: Peter Zijlstra 
Date: Thu Jan 08 09:41:22 CET 2009

Signed-off-by: Peter Zijlstra 
---
 include/linux/mutex.h   |4 +-
 include/linux/sched.h   |1 
 kernel/mutex-debug.c|   24 ++-
 kernel/mutex-debug.h|   18 ++-
 kernel/mutex.c  |   76 ++--
 kernel/mutex.h  |   22 -
 kernel/sched.c  |   66 +
 kernel/sched_features.h |1 
 8 files changed, 185 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct thread_info  *owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
const char  *name;
void*magic;
 #endif
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -80,9 +75,25 @@ void debug_mutex_unlock(struct mutex *lo
return;
 
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
+#if 0
+   /*
+* XXX something is iffy with owner tracking, but lockdep - which has
+* similar checks - finds everything dandy, so disable for now.
+*/
+   if (lock->owner != current_thread_info()) {
+   printk(KERN_ERR "%p %p\n", lock->owner, current_thread_info());
+   if (lock->owner) {
+   printk(KERN_ERR "%d %s\n",
+   lock->owner->task->pid,
+   lock->owner->task->comm);
+   }
+   printk(KERN_ERR "%d %s\n",
+   current_thread_info()->task->pid,
+   current_thread_info()->task->comm);
+   }
DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+#endif
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +106,6 @@ void debug_mutex_init(struct mutex *lock
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-   lock->owner = NULL;
lock->magic = lock;
 }
 
Index: linux-2.6/kernel/mutex-debug.h
===
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-   lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
@@ -35,6 +27,16 @@ extern void debug_mutex_unlock(struct mu
 extern void debug_mutex_init(struct mutex *lock, const char *name,
 struct lock_class_key *key);
 
+static inline void mutex_set_owner(struct mutex *lock)
+{
+   lock->owner = current_thread_info();
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+   lock->owner = NULL;
+}
+
 #define spin_lock_mutex(lock, flags)   \
do {

[PATCH -v6][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Peter Zijlstra
On Wed, 2009-01-07 at 15:10 -0800, Linus Torvalds wrote:

> Please take all my patches to be pseudo-code. They've neither been 
> compiled nor tested, and I'm just posting them in the hope that somebody 
> else will then do things in the direction I think is the proper one ;)

Linux opteron 2.6.28-tip #585 SMP PREEMPT Thu Jan 8 10:38:09 CET 2009 x86_64 
x86_64 x86_64 GNU/Linux

[r...@opteron bench]# echo NO_OWNER_SPIN > /debug/sched_features; ./timec -e 
-5,-4,-3,-2 ./test-mutex V 16 10
2 CPUs, running 16 parallel test-tasks.
checking VFS performance.
| loops/sec: 69415
avg ops/sec:   74996
average cost per op:   0.00 usecs
average cost per lock: 0.00 usecs
average cost per unlock:   0.00 usecs
max cost per op:   0.00 usecs
max cost per lock: 0.00 usecs
max cost per unlock:   0.00 usecs
average deviance per op:   0.00 usecs

 Performance counter stats for './test-mutex':

   12098.324578  task clock ticks (msecs)

   1081  CPU migrations   (events)
   7102  context switches (events)
   2763  pagefaults   (events)
   12098.324578  task clock ticks (msecs)

 Wall-clock time elapsed: 12026.804839 msecs

[r...@opteron bench]# echo OWNER_SPIN > /debug/sched_features; ./timec -e 
-5,-4,-3,-2 ./test-mutex V 16 10
2 CPUs, running 16 parallel test-tasks.
checking VFS performance.
| loops/sec: 208147
avg ops/sec:   228126
average cost per op:   0.00 usecs
average cost per lock: 0.00 usecs
average cost per unlock:   0.00 usecs
max cost per op:   0.00 usecs
max cost per lock: 0.00 usecs
max cost per unlock:   0.00 usecs
average deviance per op:   0.00 usecs

 Performance counter stats for './test-mutex':

   22280.283224  task clock ticks (msecs)

117  CPU migrations   (events)
   5711  context switches (events)
   2781  pagefaults   (events)
   22280.283224  task clock ticks (msecs)

 Wall-clock time elapsed: 12307.053737 msecs

* WOW *

---
Subject: mutex: implement adaptive spin
From: Peter Zijlstra 
Date: Thu Jan 08 09:41:22 CET 2009

Signed-off-by: Peter Zijlstra 
---
 include/linux/mutex.h   |4 +-
 include/linux/sched.h   |1 
 kernel/mutex-debug.c|7 
 kernel/mutex-debug.h|   18 +-
 kernel/mutex.c  |   81 ++--
 kernel/mutex.h  |   22 +++--
 kernel/sched.c  |   63 +
 kernel/sched_features.h |1 
 8 files changed, 170 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct thread_info  *owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
const char  *name;
void*magic;
 #endif
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -82,7 +77,6 @@ void debug_mutex_unlock(struct mutex *lo
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +89,6 @@ void debug_mutex_init(struct mutex *lock
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-   lock->owner = NULL;
lock->magic = lock;
 }
 
Index: linux-2.6/kernel/mutex-debug.h
===
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-   lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
struct mu

Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 15:32 -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Jan 2009, Steven Rostedt wrote:
> > 
> > What would be interesting is various benchmarks against all three.
> > 
> > 1) no mutex spinning.
> > 2) get_task_struct() implementation.
> > 3) spin_or_sched implementation.
> 
> One of the issues is that I cannot convince myself that (2) is even 
> necessarily correct. At least not without having all cases happen inder 
> the mutex spinlock - which they don't. Even with the original patch, the 
> uncontended cases set and cleared the owner field outside the lock.

Yes, 2 isn't feasible for regular mutexes as we have non-atomic owner
tracking.

I've since realized the whole rtmutex thing is fundamentally different
on a few levels:

  a) we have atomic owner tracking (that's the lock itself, it holds the
task_pointer as a cookie), and 

  b) we need to do that whole enqueue on the waitlist thing because we
need to do the PI propagation and such to figure out if the current task
is even allowed to acquire -- that is, only the highest waiting and or
lateral steal candidates are allowed to spin acquire.




Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 13:58 -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> > 
> > Do we really have to re-do all that code every loop?
> 
> No, you're right, we can just look up the cpu once. Which makes Andrew's 
> argument that "probe_kernel_address()" isn't in any hot path even more 
> true.
> 
> > Also, it would still need to do the funny:
> > 
> >  l_owner = ACCESS_ONCE(lock->owner)
> >  if (l_owner && l_owner != thread)
> >break;
> 
> Why? That would fall out of the 
> 
>   if (lock->owner != thread)
>   break;
> 
> part. We don't actually care that it only happens once: this all has 
> _known_ races, and the "cpu_relax()" is a barrier.
> 
> And notice how the _caller_ handles the "owner == NULL" case by not even 
> calling this, and looping over just the state in the lock itself. That was 
> in the earlier emails. So this approach is actually pretty different from 
> the case that depended on the whole spinlock thing.

Ah, so now you do loop on !owner; previously you insisted we'd go to
sleep on !owner. Yes, with !owner spinning that is indeed not needed.

> +#ifdef CONFIG_SMP
> + /* Optimistic spinning.. */
> + for (;;) {
> + struct thread_struct *owner;
> + int oldval = atomic_read(&lock->count);
> +
> + if (oldval <= MUTEX_SLEEPERS)
> + break;
> + if (oldval == 1) {
> + oldval = atomic_cmpxchg(&lock->count, oldval, 0);
> + if (oldval == 1) {
> + lock->owner = task_thread_info(task);
> + return 0;
> + }
> + } else {
> + /* See who owns it, and spin on him if anybody */
> + owner = lock->owner;
> + if (owner)
> + spin_on_owner(lock, owner);
> + }
> + cpu_relax();
> + }
> +#endif

Hmm, still wouldn't the spin_on_owner() loopiness and the above need
that need_resched() check you mentioned, so that it can fall into the
slow path and go to sleep?



Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 12:55 -0800, Linus Torvalds wrote:

>   /*
>* Look out! "thread" is an entirely speculative pointer
>* access and not reliable.
>*/
>   void loop_while_oncpu(struct mutex *lock, struct thread_struct *thread)
>   {
>   for (;;) {
>   unsigned cpu;
>   struct runqueue *rq;
> 
>   if (lock->owner != thread)
>   break;
> 
>   /*
>* Need to access the cpu field knowing that
>* DEBUG_PAGEALLOC could have unmapped it if
>* the mutex owner just released it and exited.
>*/
>   if (__get_user(cpu, &thread->cpu))
>   break;
> 
>   /*
>* Even if the access succeeded (likely case),
>* the cpu field may no longer be valid. FIXME:
>* this needs to validate that we can do a
>* get_cpu() and that we have the percpu area.
>*/
>   if (cpu >= NR_CPUS)
>   break;
> 
>   if (!cpu_online(cpu))
>   break;
> 
>   /*
>* Is that thread really running on that cpu?
>*/
>   rq = cpu_rq(cpu);
>   if (task_thread_info(rq->curr) != thread)
>   break;
> 
>   cpu_relax();
>   }
>   }

Do we really have to re-do all that code every loop?

void loop_while_oncpu(struct mutex *lock, struct thread_struct *thread)
{
	unsigned cpu;
	struct runqueue *rq;

	/*
	 * Need to access the cpu field knowing that
	 * DEBUG_PAGEALLOC could have unmapped it if
	 * the mutex owner just released it and exited.
	 */
	if (__get_user(cpu, &thread->cpu))
		return;

	/*
	 * Even if the access succeeded (likely case),
	 * the cpu field may no longer be valid. FIXME:
	 * this needs to validate that we can do a
	 * get_cpu() and that we have the percpu area.
	 */
	if (cpu >= NR_CPUS)
		return;

	if (!cpu_online(cpu))
		return;

	rq = cpu_rq(cpu);

	for (;;) {
		if (lock->owner != thread)
			break;

		/*
		 * Is that thread really running on that cpu?
		 */
		if (task_thread_info(rq->curr) != thread)
			break;

		cpu_relax();
	}
}

Also, it would still need to do the funny:

 l_owner = ACCESS_ONCE(lock->owner);
 if (l_owner && l_owner != thread)
	break;

thing, to handle the premature non-atomic lock->owner tracking.


[PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 08:25 -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> >
> > Change mutex contention behaviour such that it will sometimes busy wait on
> > acquisition - moving its behaviour closer to that of spinlocks.
> 
> Ok, this one looks _almost_ ok.
> 
> The only problem is that I think you've lost the UP case. 
> 
> In UP, you shouldn't have the code to spin, and the "spin_or_schedule()" 
> should fall back to just the schedule case.
> 
> It might also be worthwhile to try to not set the owner, and re-organize 
> that a bit (by making it an inline function that sets the owner only for 
> CONFIG_SMP or lockdep/debug). 

As you wish ;-)

---
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra 
Date: Tue, 6 Jan 2009 12:32:12 +0100

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 8% boost for VFS scalability on my testbox:

  # echo MUTEX_SPIN > /debug/sched_features
  # ./test-mutex V 16 10
  2 CPUs, running 16 parallel test-tasks.
  checking VFS performance.

  avg ops/sec:74910

  # echo NO_MUTEX_SPIN > /debug/sched_features
  # ./test-mutex V 16 10
  2 CPUs, running 16 parallel test-tasks.
  checking VFS performance.

  avg ops/sec:68804

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is a means to disable this behaviour at runtime (without having to reboot
the system) when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_MUTEX_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo MUTEX_SPIN > /debug/sched_features

There's also a few new statistic fields in /proc/sched_debug
(available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):

 # grep mtx /proc/sched_debug
  .mtx_spin                      : 2387
  .mtx_sched                     : 2283
  .mtx_spin                      : 1277
  .mtx_sched                     : 1700

Signed-off-by: Peter Zijlstra 
Reviewed-and-signed-off-by: Ingo Molnar 
---
 include/linux/mutex.h   |6 ++--
 include/linux/sched.h   |2 +
 kernel/mutex-debug.c|   10 ---
 kernel/mutex-debug.h|   13 +++--
 kernel/mutex.c  |   66 +---
 kernel/mutex.h  |   13 -
 kernel/sched.c  |   63 +
 kernel/sched_debug.c|2 +
 kernel/sched_features.h |1 
 9 files changed, 146 insertions(+), 30 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
+   struct task_struct  *owner;
+#endif
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct thread_info  *owner;
const char  *name;
void*magic;
 #endif
@@ -67,8 +69,8 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,6 +329,8 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *waiter,
+  struct task_struct *owner, int cpu);
 asmlinkage void schedule(void);
 
 struct nsproxy;
Index: linux-2.6/kernel/mutex-debug.c

Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 10:22 -0500, Steven Rostedt wrote:
> Peter, nice work!

Thanks!

> > +   }
> > +
> > +   if (!spin) {
> > +   schedstat_inc(this_rq(), mtx_sched);
> > +   __set_task_state(task, state);
> 
> I still do not know why you set state here instead of in the mutex code.
> Yes, you prevent changing the state if we do not schedule, but there's 
> nothing wrong with setting it before hand. We may even be able to cache 
> the owner and keep the locking of the wait_lock out of here. But then I 
> see that it may be used to protect the sched_stat counters.

I was about to say because we need task_rq(owner) and can only deref
owner while holding that lock, but I found a way around it by using
task_cpu() which is exported.

Compile tested only so far...

---
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,8 +329,8 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
-extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state,
-  unsigned long *flags);
+extern void mutex_spin_or_schedule(struct mutex_waiter *waiter,
+  struct task_struct *owner, int cpu);
 asmlinkage void schedule(void);
 
 struct nsproxy;
Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -12,7 +12,8 @@
  *
  *  - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
  *from the -rt tree, where it was originally implemented for rtmutexes
- *by Steven Rostedt, based on work by Gregory Haskins.)
+ *by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ *and Sven Dietrich.
  *
  * Also see Documentation/mutex-design.txt.
  */
@@ -157,6 +158,9 @@ __mutex_lock_common(struct mutex *lock, 
lock_contended(&lock->dep_map, ip);
 
for (;;) {
+   int cpu = 0;
+   struct task_struct *l_owner;
+
/*
 * Lets try to take the lock again - this is needed even if
 * we get here for the first time (shortly after failing to
@@ -184,8 +188,15 @@ __mutex_lock_common(struct mutex *lock, 
return -EINTR;
}
 
-   mutex_spin_or_schedule(&waiter, state, &flags);
+   __set_task_state(task, state);
+   l_owner = ACCESS_ONCE(lock->owner);
+   if (l_owner)
+   cpu = task_cpu(l_owner);
+   spin_unlock_mutex(&lock->wait_lock, flags);
+   mutex_spin_or_schedule(&waiter, l_owner, cpu);
+   spin_lock_mutex(&lock->wait_lock, flags);
}
+   __set_task_state(task, TASK_RUNNING);
 
 done:
lock_acquired(&lock->dep_map, ip);
Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4600,42 +4600,37 @@ pick_next_task(struct rq *rq, struct tas
}
 }
 
-#ifdef CONFIG_DEBUG_MUTEXES
-# include "mutex-debug.h"
-#else
-# include "mutex.h"
-#endif
-
-void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state,
-   unsigned long *flags)
+void mutex_spin_or_schedule(struct mutex_waiter *waiter,
+   struct task_struct *owner, int cpu)
 {
-   struct mutex *lock = waiter->lock;
struct task_struct *task = waiter->task;
-   struct task_struct *owner = lock->owner;
+   struct mutex *lock = waiter->lock;
struct rq *rq;
int spin = 0;
 
if (likely(sched_feat(MUTEX_SPIN) && owner)) {
-   rq = task_rq(owner);
+   rq = cpu_rq(cpu);
spin = (rq->curr == owner);
}
 
if (!spin) {
+   preempt_disable();
schedstat_inc(this_rq(), mtx_sched);
-   __set_task_state(task, state);
-   spin_unlock_mutex(&lock->wait_lock, *flags);
+   preempt_enable();
+
schedule();
-   spin_lock_mutex(&lock->wait_lock, *flags);
return;
}
 
+   preempt_disable();
schedstat_inc(this_rq(), mtx_spin);
-   spin_unlock_mutex(&lock->wait_lock, *flags);
+   preempt_enable();
+
for (;;) {
struct task_struct *l_owner;
 
/* Stop spinning when there's a pending signal. */
-   if

Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
On Wed, 2009-01-07 at 15:50 +0100, Frédéric Weisbecker wrote:
> 2009/1/7 Peter Zijlstra :
> > Change mutex contention behaviour such that it will sometimes busy wait on
> > acquisition - moving its behaviour closer to that of spinlocks.
> >
> > This concept got ported to mainline from the -rt tree, where it was 
> > originally
> > implemented for rtmutexes by Steven Rostedt, based on work by Gregory 
> > Haskins.
> >
> > Testing with Ingo's test-mutex application 
> > (http://lkml.org/lkml/2006/1/8/50)
> > gave a 8% boost for VFS scalability on my testbox:
> >
> >  # echo MUTEX_SPIN > /debug/sched_features
> >  # ./test-mutex V 16 10
> >  2 CPUs, running 16 parallel test-tasks.
> >  checking VFS performance.
> >
> >  avg ops/sec:74910
> >
> >  # echo NO_MUTEX_SPIN > /debug/sched_features
> >  # ./test-mutex V 16 10
> >  2 CPUs, running 16 parallel test-tasks.
> >  checking VFS performance.
> >
> >  avg ops/sec:68804
> >
> > The key criteria for the busy wait is that the lock owner has to be running 
> > on
> > a (different) cpu. The idea is that as long as the owner is running, there 
> > is a
> > fair chance it'll release the lock soon, and thus we'll be better off 
> > spinning
> > instead of blocking/scheduling.
> >
> > Since regular mutexes (as opposed to rtmutexes) do not atomically track the
> > owner, we add the owner in a non-atomic fashion and deal with the races in
> > the slowpath.
> >
> > Furthermore, to ease the testing of the performance impact of this new code,
> > there is means to disable this behaviour runtime (without having to reboot
> > the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
> > by issuing the following command:
> >
> >  # echo NO_MUTEX_SPIN > /debug/sched_features
> >
> > This command re-enables spinning again (this is also the default):
> >
> >  # echo MUTEX_SPIN > /debug/sched_features
> >
> > There's also a few new statistic fields in /proc/sched_debug
> > (available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):
> >
> >  # grep mtx /proc/sched_debug
> >  .mtx_spin  : 2387
> >  .mtx_sched : 2283
> >  .mtx_spin  : 1277
> >  .mtx_sched : 1700
> >
> > Signed-off-by: Peter Zijlstra 
> > Reviewed-and-signed-off-by: Ingo Molnar 
> > ---

> Sorry I haven't read all the previous talk about the older version.
> But it is possible that, in hopefully rare cases, you enter
> mutex_spin_or_schedule
> multiple times, and try to spin for the same lock each of these times.
> 
> For each of the above break,
> 
> _if you exit the spin because the mutex is unlocked, and someone else
> grab it before you
> _ or simply the owner changed...
> 
> then you will enter again in mutex_spin_or_schedule, you have some chances 
> that
> rq->curr == the new owner, and then you will spin again.
> And this situation can almost really make you behave like a spinlock...

You understand correctly; that is indeed possible.

> Shouldn't it actually try only one time to spin, and if it calls again
> mutex_spin_or_schedule()
> then it would be better to schedule()  ?

I don't know, maybe code it up and find a benchmark where it makes a
difference. :-)
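
For reference, a rough sketch of that one-shot idea against the -v4
mutex_spin_or_schedule() -- 'spun_once' here is a hypothetical new field
in struct mutex_waiter, cleared when the waiter is initialised; it is not
part of the posted patch:

	if (likely(sched_feat(MUTEX_SPIN) && owner)) {
		rq = cpu_rq(cpu);
		/* hypothetical: only the first pass per acquisition may spin */
		spin = (rq->curr == owner) && !waiter->spun_once;
		waiter->spun_once = 1;
	}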


[PATCH -v4][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Peter Zijlstra
Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave an 8% boost for VFS scalability on my testbox:

  # echo MUTEX_SPIN > /debug/sched_features
  # ./test-mutex V 16 10
  2 CPUs, running 16 parallel test-tasks.
  checking VFS performance.

  avg ops/sec:74910

  # echo NO_MUTEX_SPIN > /debug/sched_features
  # ./test-mutex V 16 10
  2 CPUs, running 16 parallel test-tasks.
  checking VFS performance.

  avg ops/sec:68804

The key criterion for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is a means to disable this behaviour at runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_MUTEX_SPIN > /debug/sched_features

This command re-enables spinning (this is also the default):

 # echo MUTEX_SPIN > /debug/sched_features

There are also a few new statistics fields in /proc/sched_debug
(available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):

 # grep mtx /proc/sched_debug
  .mtx_spin  : 2387
  .mtx_sched : 2283
  .mtx_spin  : 1277
  .mtx_sched     : 1700

Signed-off-by: Peter Zijlstra 
Reviewed-and-signed-off-by: Ingo Molnar 
---
 include/linux/mutex.h   |4 +-
 include/linux/sched.h   |2 +
 kernel/mutex-debug.c|   10 +--
 kernel/mutex-debug.h|8 -
 kernel/mutex.c  |   46 ++
 kernel/mutex.h  |2 -
 kernel/sched.c  |   73 +++
 kernel/sched_debug.c|2 +
 kernel/sched_features.h |1 +
 9 files changed, 115 insertions(+), 33 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 7a0e5c4..c007b4e 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct thread_info  *owner;
const char  *name;
void*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cae9b8..d8fa96b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -328,6 +328,8 @@ extern signed long schedule_timeout(signed long timeout);
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state,
+  unsigned long *flags);
 asmlinkage void schedule(void);
 
 struct nsproxy;
diff --git a/kernel/mutex-debug.c b/kernel/mutex-debug.c
index 1d94160..0564680 100644
--- a/kernel/mutex-debug.c
+++ b/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
-   waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lock)
return;
 
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+   /* DEBUG_LOCKS_WARN_ON(lock->owner != current); */
  

Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Wed, 2009-01-07 at 11:57 +0800, Lai Jiangshan wrote:
> Peter Zijlstra wrote:
> > +void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, 
> > unsigned long *flags)
> > +{
> > +   struct mutex *lock = waiter->lock;
> > +   struct task_struct *task = waiter->task;
> > +   struct task_struct *owner = lock->owner;
> > +   struct rq *rq;
> > +
> > +   if (!owner)
> > +   goto do_schedule;
> > +
> > +   rq = task_rq(owner);
> > +
> > +   if (rq->curr != owner) {
> > +do_schedule:
> > +   __set_task_state(task, state);
> > +   spin_unlock_mutex(&lock->wait_lock, *flags);
> > +   schedule();
> > +   } else {
> > +   spin_unlock_mutex(&lock->wait_lock, *flags);
> > +   for (;;) {
> > +   /* Stop spinning when there's a pending signal. */
> > +   if (signal_pending_state(state, task))
> > +   break;
> > +
> > +   /* Owner changed, bail to revalidate state */
> > +   if (lock->owner != owner)
> > +   break;
> > +
> > +   /* Owner stopped running, bail to revalidate state */
> > +   if (rq->curr != owner)
> > +   break;
> > +
> 
> 2 questions from my immature thought:
> 
> 1) Do we need keep gcc from optimizing when we access lock->owner
>and rq->curr in the loop?

cpu_relax() is a compiler barrier iirc.
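
(On x86, for example, cpu_relax() is rep_nop(), and its "memory" clobber
is what forces lock->owner and rq->curr to be re-read on every iteration
-- roughly:)

	/* arch/x86/include/asm/processor.h, roughly */
	static inline void rep_nop(void)
	{
		asm volatile("rep; nop" ::: "memory");	/* PAUSE + compiler barrier */
	}
	#define cpu_relax()	rep_nop()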

> 2) "if (rq->curr != owner)" need become smarter.
>schedule()
>{
>   select_next
>   rq->curr = next;
>   contex_swith
>}
> we also spin when owner is select_next-ing in schedule().
> but select_next is not fast enough.

I'm not sure what you're saying here..


Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 23:43 +0100, Peter Zijlstra wrote:

> @@ -115,6 +117,7 @@ void __sched mutex_unlock(struct mutex *
>* The unlocking fastpath is the 0->1 transition from 'locked'
>* into 'unlocked' state:
>*/
> + lock->owner = NULL;
>   __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
>  }
>  

> +void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, 
> unsigned long *flags)
> +{
> + struct mutex *lock = waiter->lock;
> + struct task_struct *task = waiter->task;
> + struct task_struct *owner = lock->owner;
> + struct rq *rq;
> +
> + if (!owner)
> + goto do_schedule;
> +
> + rq = task_rq(owner);
> +
> + if (rq->curr != owner) {
> +do_schedule:
> + __set_task_state(task, state);
> + spin_unlock_mutex(&lock->wait_lock, *flags);
> + schedule();
> + } else {
> + spin_unlock_mutex(&lock->wait_lock, *flags);
> + for (;;) {
> + /* Stop spinning when there's a pending signal. */
> + if (signal_pending_state(state, task))
> + break;
> +
> + /* Owner changed, bail to revalidate state */
> + if (lock->owner != owner)
> + break;
> +
> + /* Owner stopped running, bail to revalidate state */
> + if (rq->curr != owner)
> + break;
> +
> + cpu_relax();
> + }
> + }
> + spin_lock_mutex(&lock->wait_lock, *flags);
> +}

That's not going to work: we set owner to NULL, which means pending
spinners get schedule()ed out instead of racing to acquire.

I suppose the below would fix that... really sleep time now

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4626,8 +4626,8 @@ do_schedule:
if (signal_pending_state(state, task))
break;
 
-   /* Owner changed, bail to revalidate state */
-   if (lock->owner != owner)
+   /* Mutex got unlocked, race to acquire. */
+   if (!mutex_is_locked(lock))
break;
 
/* Owner stopped running, bail to revalidate state */



Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 14:22 -0800, Linus Torvalds wrote:
> 
> On Tue, 6 Jan 2009, Peter Zijlstra wrote:
> > 
> > One of the things the previous patch did wrong was that it never tracked
> > the owner in the mutex fast path -- I initially didn't notice because I
> > had all debugging infrastructure enabled, and that short circuits all
> > the fast paths.
> 
> Why even worry?

Wrong mind-set, as you rightly point out below.

> Just set the owner non-atomically. We can consider a NULL owner in the 
> slow-path to be a "let's schedule" case. After all, it's just a 
> performance tuning thing after all, and the window is going to be very 
> small, so it won't even be relevant from a performance tuning angle.
> 
> So there's _no_ reason why the fast-path can't just set owner, and no 
> reason to disable preemption.

Agreed, when viewed from that angle all the little races don't matter.

> I also think the whole preempt_disable/enable around the 
> mutex_spin_or_schedule() is pure garbage.
> 
> In fact, I suspect that's the real bug you're hitting: you're enabling 
> preemption while holding a spinlock. That is NOT a good idea.

spinlocks also fiddle with preempt_count, so that should all work out --
although granted, it does look funny.
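
Concretely, on a CONFIG_PREEMPT kernel the counts nest like this
(illustration only, not code from the patch):

	spin_lock(&lock->wait_lock);	/* preempt_count: 0 -> 1 */
	preempt_disable();		/* preempt_count: 1 -> 2 */
	schedstat_inc(this_rq(), mtx_sched);
	preempt_enable();		/* preempt_count: 2 -> 1, no resched (still > 0) */
	spin_unlock(&lock->wait_lock);	/* preempt_count: 1 -> 0, preemption point */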



Indeed, the below does boot -- which means I get to sleep now ;-)

---
 include/linux/mutex.h |4 +--
 include/linux/sched.h |1 
 kernel/mutex-debug.c  |   10 
 kernel/mutex-debug.h  |8 --
 kernel/mutex.c|   62 +++---
 kernel/mutex.h|2 -
 kernel/sched.c|   44 +++
 7 files changed, 92 insertions(+), 39 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct thread_info  *owner;
const char  *name;
void*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,6 +329,7 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *, long state, unsigned long *flags);
 asmlinkage void schedule(void);
 
 struct nsproxy;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
-   waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lo
return;
 
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+   /* DEBUG_LOCKS_WARN_ON(lock->owner != current); */
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-   lock->owner = NULL;
lock->magic = lock;
 }
 
Index: linux-2.6/kernel/mutex-debug.h
===
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debu

Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 13:50 -0800, Linus Torvalds wrote:
> 
> On Tue, 6 Jan 2009, Peter Zijlstra wrote:
> > 
> > With Linus' mutex_spin_or_schedule() function the whole - keeping
> > owner's task_struct alive issue goes away,.. now if only the thing would
> > boot...
> 
> Can you post the patch, so that we can see if we can find some silly error 
> that we can ridicule you over?

Sure, see below..

I think I'm seeing why it deadlocks..

One of the things the previous patch did wrong was that it never tracked
the owner in the mutex fast path -- I initially didn't notice because I
had all debugging infrastructure enabled, and that short circuits all
the fast paths.

So I added some lame fast-path owner tracking:

  preempt_disable();
  mutex_fast_path_lock();
  lock->owner = current;
  preempt_enable();

and a similar clear on the unlock side.

Which is exactly what causes the deadlock -- or livelock more
accurately. Since the contention code spins while !owner, and the unlock
code clears owner before release, we spin so hard we never release the
cacheline to the other cpu and therefore get stuck.
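
Roughly, the interleaving described above (illustration only):

	CPU0 (fast-path unlock)           CPU1 (contention code)
	-----------------------           ----------------------
	lock->owner = NULL;               spins while (!lock->owner),
	__mutex_fastpath_unlock(...);     re-reading lock->owner so hard
	                                  that the cacheline never gets
	                                  handed back to CPU0, and the
	                                  release never completes.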

Now I just looked at what kernel/rtmutex.c does: it keeps track of the
owner in the lock field and uses cmpxchg to change ownership. Regular
mutexes however use atomic_t (not wide enough for void *) and hand
crafted assembly fast paths for all archs.

I think I'll just hack up an atomic_long_t atomic_lock_cmpxchg mutex
implementation -- so we can at least test this on x86 to see if its
worth continuing this way.

Converting all hand crafted asm only to find out it degrades performance
doesn't sound too cool :-)
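
Roughly, the idea would look something like this -- a sketch only, the
'owner_count' field and the helper names below are made up for
illustration, not the eventual implementation:

	/* fold 'count' and 'owner' into one cmpxchg-able word; 0 == unlocked */
	static inline int mutex_fast_trylock(struct mutex *lock)
	{
		return atomic_long_cmpxchg(&lock->owner_count, 0L,
					   (long)current) == 0L;
	}

	static inline int mutex_fast_unlock(struct mutex *lock)
	{
		/* fails (-> slowpath) if we are not the recorded owner */
		return atomic_long_cmpxchg(&lock->owner_count,
					   (long)current, 0L) == (long)current;
	}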

---
 include/linux/mutex.h |4 +-
 include/linux/sched.h |1 
 kernel/mutex-debug.c  |   10 --
 kernel/mutex-debug.h  |8 -
 kernel/mutex.c|   80 ++
 kernel/mutex.h|2 -
 kernel/sched.c|   44 +++
 7 files changed, 110 insertions(+), 39 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct thread_info  *owner;
const char  *name;
void*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,6 +329,7 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *, long state, unsigned long *flags);
 asmlinkage void schedule(void);
 
 struct nsproxy;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
-   waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lo
return;
 
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+   /* DEBUG_LOCKS_WARN_ON(lock->owner != current); */
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(&lock->dep_map, n

Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 13:42 -0800, Paul E. McKenney wrote:
> Preemptable RCU needs to be faster.  Got it -- and might have a way
> to do it by eliminating the irq disabling and cutting way back on the
> number of operations that must be performed.  It would probably still
> be necessary to access the task structure.
> 
> Or is something other than the raw performance of rcu_read_lock() and
> rcu_read_unlock() at issue here?

With Linus' mutex_spin_or_schedule() function, the whole issue of
keeping the owner's task_struct alive goes away... now if only the
thing would boot...


Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 14:16 +0100, Ingo Molnar wrote:
> * Gregory Haskins  wrote:
> 
> > Ingo Molnar wrote:
> > > There's no time or spin-rate based heuristics in this at all (i.e. these 
> > > mutexes are not 'adaptive' at all!),
> > 
> > FYI: The original "adaptive" name was chosen in the -rt implementation 
> > to reflect that the locks can adaptively spin or sleep, depending on 
> > conditions.  I realize this is in contrast to the typical usage of the 
> > term when it is in reference to the spin-time being based on some 
> > empirical heuristics, etc as you mentioned.  Sorry for the confusion.
> 
> the current version of the -rt spinny-mutexes bits were mostly written by 
> Steve, right? Historically it all started out with a more classic 
> "adaptive mutexes" patchset so the name stuck i guess.

Yeah, Gregory and co. started with the whole thing and showed there was
significant performance to be gained; after that, Steve rewrote it from
scratch, reducing it to this minimalist heuristic, with help from Greg.

(At least, that is how I remember it, please speak up if I got things
wrong)


Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
On Tue, 2009-01-06 at 13:10 +0100, Ingo Molnar wrote:
> * Peter Zijlstra  wrote:
> 
> > +++ linux-2.6/kernel/mutex.c
> > @@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
> > atomic_set(&lock->count, 1);
> > spin_lock_init(&lock->wait_lock);
> > INIT_LIST_HEAD(&lock->wait_list);
> > +   lock->owner = NULL;
> >  
> > debug_mutex_init(lock, name, key);
> >  }
> > @@ -120,6 +121,28 @@ void __sched mutex_unlock(struct mutex *
> >  
> >  EXPORT_SYMBOL(mutex_unlock);
> >  
> > +#ifdef CONFIG_SMP
> > +static int adaptive_wait(struct mutex_waiter *waiter,
> > +struct task_struct *owner, long state)
> > +{
> > +   for (;;) {
> > +   if (signal_pending_state(state, waiter->task))
> > +   return 0;
> > +   if (waiter->lock->owner != owner)
> > +   return 0;
> > +   if (!task_is_current(owner))
> > +   return 1;
> > +   cpu_relax();
> > +   }
> > +}
> > +#else
> 
> Linus, what do you think about this particular approach of spin-mutexes? 
> It's not the typical spin-mutex i think.
> 
> The thing i like most about Peter's patch (compared to most other adaptive 
> spinning approaches i've seen, which all sucked as they included various 
> ugly heuristics complicating the whole thing) is that it solves the "how 
> long should we spin" question elegantly: we spin until the owner runs on a 
> CPU.

s/until/as long as/

> So on shortly held locks we degenerate to spinlock behavior, and only 
> long-held blocking locks [with little CPU time spent while holding the 
> lock - say we wait for IO] we degenerate to classic mutex behavior.
> 
> There's no time or spin-rate based heuristics in this at all (i.e. these 
> mutexes are not 'adaptive' at all!), and it degenerates to our primary and 
> well-known locking behavior in the important boundary situations.

Well, it adapts to the situation, choosing between spinning vs blocking.
But what's in a name ;-)

> A couple of other properties i like about it:
> 
>  - A spinlock user can be changed to a mutex with no runtime impact. (no 
>increase in scheduling) This might enable us to convert/standardize 
>some of the uglier locking constructs within ext2/3/4?

I think a lot of stuff there is bit (spin) locks.
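
(For example, jbd's per-buffer state lock is a bit spin lock on a flag
bit in b_state; roughly:)

	/* include/linux/jbd.h, roughly */
	static inline void jbd_lock_bh_state(struct buffer_head *bh)
	{
		bit_spin_lock(BH_State, &bh->b_state);
	}

	static inline void jbd_unlock_bh_state(struct buffer_head *bh)
	{
		bit_spin_unlock(BH_State, &bh->b_state);
	}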

> It's hard to tell how it would impact inbetween workloads - i guess it 
> needs to be measured on a couple of workloads.

Matthew volunteered to run something IIRC.


[PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Peter Zijlstra
Subject: mutex: adaptive spin
From: Peter Zijlstra 
Date: Tue Jan 06 12:32:12 CET 2009

Based on the code in -rt, provide adaptive spins on generic mutexes.

Signed-off-by: Peter Zijlstra 
---
 include/linux/mutex.h |4 ++--
 include/linux/sched.h |1 +
 kernel/mutex-debug.c  |   11 ++-
 kernel/mutex-debug.h  |8 
 kernel/mutex.c|   46 +++---
 kernel/mutex.h|2 --
 kernel/sched.c|5 +
 7 files changed, 49 insertions(+), 28 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-   struct thread_info  *owner;
const char  *name;
void*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
struct list_headlist;
struct task_struct  *task;
-#ifdef CONFIG_DEBUG_MUTEXES
struct mutex*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
void*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -249,6 +249,7 @@ extern void init_idle(struct task_struct
 extern void init_idle_bootup_task(struct task_struct *idle);
 
 extern int runqueue_is_locked(void);
+extern int task_is_current(struct task_struct *p);
 extern void task_rq_unlock_wait(struct task_struct *p);
 
 extern cpumask_var_t nohz_cpu_mask;
Index: linux-2.6/kernel/mutex-debug.c
===
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-   lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 
/* Mark the current thread as blocked on the lock: */
ti->task->blocked_on = waiter;
-   waiter->lock = lock;
 }
 
 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,9 @@ void debug_mutex_unlock(struct mutex *lo
return;
 
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+   DEBUG_LOCKS_WARN_ON(lock->owner != current);
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-   DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+   DEBUG_LOCKS_WARN_ON(lock->owner != current);
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +89,6 @@ void debug_mutex_init(struct mutex *lock
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-   lock->owner = NULL;
lock->magic = lock;
 }
 
Index: linux-2.6/kernel/mutex-debug.h
===
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-   lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
atomic_set(&lock->count, 1);
spin_lock_init(&lock->wait_lock);
INIT_LIST_HEAD(&lock->wait_list);
+   lock->owner = NULL;
 
debug_mutex_init(lock, name, key);
 }
@@ -120,6 +121,28 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
+#ifdef CONFIG_SMP
+static int adaptive_wait(struct mutex_waiter *waiter,
+struct task_struct *owner, long state)
+{
+   for (;;) {
+   if (signal_pending_state(state, waiter->task))
+   return 0;
+   if (waiter->lock->owner != owner)
+  

Re: Btrfs for mainline

2009-01-04 Thread Peter Zijlstra
On Sat, 2009-01-03 at 12:17 -0700, Matthew Wilcox wrote:
> > - locking.c needs a lot of cleanup.
> > If combination spinlocks/mutexes are really a win they should be 
> > in the generic mutex framework. And I'm still dubious on the
> hardcoded 
> > numbers.
> 
> I don't think this needs to be cleaned up before merge.  I've spent
> an hour or two looking at it, and while we can do a somewhat better
> job as part of the generic mutex framework, it's quite tricky (due to
> the different  implementations).  It has the potential to
> introduce some hard-to-hit bugs in the generic mutexes, and there's some
> API discussions to have.

I'm really opposed to having this in some filesystem. Please remove it
before merging it.

The -rt tree has adaptive spin patches for the rtmutex code; it's really
not all that hard to do -- the rtmutex code is way more tricky than the
regular mutexes due to all the PI fluff.

For kernel-only locking, the simple rule -- spin iff the lock holder is
running -- proved to be simple enough. Any added heuristics like max spin
count etc. only made things worse. The whole idea, though, did make sense
and certainly improved performance.
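
In pseudo-C the rule boils down to roughly this (illustration only;
try_to_take_lock() stands in for the actual acquisition attempt, and
task_is_current() is the helper from the RFC patch):

	for (;;) {
		if (try_to_take_lock(lock))
			return;
		/* owner gone or not on a cpu: stop spinning and block */
		if (!lock->owner || !task_is_current(lock->owner))
			break;
		cpu_relax();
	}
	schedule();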

We've also been looking at doing adaptive spins for futexes, although
that does get a little more complex; furthermore, we've never gotten
around to actually writing any code for that.



Re: Btrfs v0.16 released

2008-08-07 Thread Peter Zijlstra
On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:

> * Fine grained btree locking.  The large fs_mutex is finally gone.
> There is still some work to do on the locking during extent allocation,
> but the code is much more scalable than it was.

Cool - will try to find a cycle to stare at the code ;-)

> * Helper threads for checksumming and other background tasks.  Most CPU
> intensive operations have been pushed off to helper threads to take
> advantage of SMP machines.  Streaming read and write throughput now
> scale to disk speed even with checksumming on.

Can this lead to the same Priority Inversion issues as seen with
kjournald?



Re: Btrfs v0.16 released

2008-08-07 Thread Peter Zijlstra
On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:

> There are still more disk format changes planned, but we're making every
> effort to get them out of the way as quickly as we can.  You can see the
> major features we have planned on the development timeline:
> 
> http://btrfs.wiki.kernel.org/index.php/Development_timeline

Just took a peek; it seems to be slightly out of date, as it still lists
the single-mutex thingy.

Also, how true is the IO-error and disk-full claim?
