Re: sem_lock() vs qspinlocks

2016-05-25 Thread Boqun Feng
On Mon, May 23, 2016 at 10:52:09AM -0700, Linus Torvalds wrote:
> On Mon, May 23, 2016 at 5:25 AM, Peter Zijlstra  wrote:
> >
> > Paul has smp_mb__after_unlock_lock() for the RCpc 'upgrade'. How about
> > something like:
> >
> > smp_mb__after_lock()
> 
> I'd much rather make the naming be higher level. It's not necessarily

Speaking of higher level, I realize the problem here is similar to the
problem we discussed last year:

http://lkml.kernel.org/r/20151112070915.gc6...@fixme-laptop.cn.ibm.com

The problem here is about synchronization between two spinlocks, while that
one is about synchronization between a spinlock and ordinary variables.

(One consequence of this similarity is that qspinlock on x86 may be broken
in the do_exit() code in the same way spinlocks are on AARCH64 and PPC,
because a variable LOAD inside a qspinlock critical section could be
reordered before the STORE part of the qspinlock acquisition.)
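
A rough sketch of that reordering (L is a qspinlock, X an ordinary
variable; the names are only for illustration):

	CPU0				CPU1
	----				----
	spin_lock(L);			WRITE_ONCE(X, 1);
	r0 = READ_ONCE(X);		smp_mb();
	...				spin_unlock_wait(L);

Because the _Q_LOCKED_VAL store is unordered inside the lock's ACQUIRE,
CPU0's load of X can be satisfied before its lock store becomes visible;
CPU1's spin_unlock_wait() then sees L unlocked and proceeds even though
r0 == 0, so each side misses the other.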


For the problem we found last year, the current solution for AARCH64 and
PPC is a slightly heavyweight spin_unlock_wait() to pair with spin_lock():

AARCH64: 
http://lkml.kernel.org/r/1448624646-15863-1-git-send-email-will.dea...@arm.com
PPC: 
http://lkml.kernel.org/r/1461130033-70898-1-git-send-email-boqun.f...@gmail.com 
(not merged yet)

Another solution that works on PPC is what Paul McKenney suggested, using
smp_mb__after_unlock_lock():

http://lkml.kernel.org/r/20151112144004.gu3...@linux.vnet.ibm.com

which is pretty much the same as the spinlock synchronization primitive we
are discussing here.


So I'm thinking: if we are going to introduce some primitives for
synchronizing two spinlocks (or even a spinlock and a mutex) anyway, could
we go a little higher level and reuse/invent primitives that also solve the
synchronization problem we have between spinlocks (spin_unlock_wait()) and
normal variables?
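
For example (a sketch only; the helper name and the s/x fields are made up
here), the spin_unlock_wait()-vs-variable pairing could then be written as:

	WRITE_ONCE(x, 1);
	smp_mb();
	spin_unlock_wait(&s->lock);
	/*
	 * Hypothetical generic helper: upgrade the control dependency of
	 * the wait loop to ACQUIRE, so the whole critical section we just
	 * waited on is visible.
	 */
	smp_acquire__after_unlock_wait();
	r = READ_ONCE(s->state);

The helper name is invented; the point is only where the explicit ordering
would sit.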

One benefit of this is that we could drop the complex implementations of
spin_unlock_wait() on AARCH64 and PPC.

Thoughts?

Regards,
Boqun

> going to be a "mb", and while the problem is about smp, the primitives
> it is synchronizing aren't actually smp-specific (ie you're
> synchronizing a lock that is relevant on UP too).
> 
> So I'd just call it something like
> 
> spin_lock_sync_after_lock();
> 
> because different locks might have different levels of serialization
> (ie maybe a spinlock needs one thing, and a mutex needs another - if
> we start worrying about ordering between spin_lock and
> mutex_is_locked(), for example, or between mutex_lock() and
> spin_is_locked()).
> 
> Hmm?
> 
>  Linus




Re: sem_lock() vs qspinlocks

2016-05-24 Thread Peter Zijlstra
On Sat, May 21, 2016 at 03:49:20PM +0200, Manfred Spraul wrote:

> >I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
> >because I suspect the netfilter code is broken without it.
> >
> >And it seems intuitive to assume that if we return from unlock_wait() we
> >can indeed observe the critical section we waited on.

> Then !spin_is_locked() and spin_unlock_wait() would be different with
> regards to memory barriers.
> Would that really help?

We could fix that I think; something horrible like:

static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
{
	int locked;

	smp_mb();
	locked = atomic_read(&lock->val) & _Q_LOCKED_MASK;
	smp_acquire__after_ctrl_dep();
	return locked;
}

Which if used in a conditional like:

	spin_lock(A);
	if (spin_is_locked(B)) {
		spin_unlock(A);
		spin_lock(B);
		...
	}

would still provide the ACQUIRE semantics required. The only difference
is that it would provide it to _both_ branches, which might be a little
more expensive.

> My old plan was to document the rules, and define a generic
> smp_acquire__after_spin_is_unlocked.
> https://lkml.org/lkml/2015/3/1/153

Yeah; I more or less forgot all that.

Now, I too think having the thing documented is good; _however_ I also
think having primitives that actually do what you assume them to do is a
good thing.

spin_unlock_wait() not actually serializing against the spin_unlock() is
really surprising and subtle.




Re: sem_lock() vs qspinlocks

2016-05-23 Thread Linus Torvalds
On Mon, May 23, 2016 at 5:25 AM, Peter Zijlstra  wrote:
>
> Paul has smp_mb__after_unlock_lock() for the RCpc 'upgrade'. How about
> something like:
>
> smp_mb__after_lock()

I'd much rather make the naming be higher level. It's not necessarily
going to be a "mb", and while the problem is about smp, the primitives
it is synchronizing aren't actually smp-specific (ie you're
synchronizing a lock that is relevant on UP too).

So I'd just call it something like

spin_lock_sync_after_lock();

because different locks might have different levels of serialization
(ie maybe a spinlock needs one thing, and a mutex needs another - if
we start worrying about ordering between spin_lock and
mutex_is_locked(), for example, or between mutex_lock() and
spin_is_locked()).
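
For qspinlock that would presumably boil down to something like this
(sketch only; a fully ordered implementation such as the old ticket lock
could make it a no-op):

	static inline void queued_spin_lock_sync_after_lock(void)
	{
		/*
		 * Order the unordered _Q_LOCKED_VAL store of the preceding
		 * spin_lock() against whatever the caller does next.
		 */
		smp_mb();
	}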

Hmm?

 Linus


Re: sem_lock() vs qspinlocks

2016-05-23 Thread Peter Zijlstra
On Fri, May 20, 2016 at 10:00:45AM -0700, Linus Torvalds wrote:

> So I do wonder if we should make that smp_mb() be something the
> *caller* has to do, and document rules for it. IOW, introduce a new
> spinlock primitive called "spin_lock_synchronize()", and then spinlock
> implementations that have this non-atomic behavior with an unordered
> store would do something like
> 
> static inline void queued_spin_lock_synchronize(struct qspinlock *a,
> 						 struct qspinlock *b)
> {
> 	smp_mb();
> }
> 
> and then we'd document that *if* you need ordering guarantees between
> 
>spin_lock(a);
>.. spin_is_locked/spin_wait_lock(b) ..
> 
> you have to have a
> 
> spin_lock_synchronize(a, b);
> 
> in between.

So I think I favour the explicit barrier. But my 'problem' is that we
now have _two_ different scenarios in which we need to order two
different spinlocks.

The first is the RCpc vs RCsc spinlock situation (currently only on
PowerPC), where the spin_unlock() + spin_lock() 'barrier' is not
transitive.

And the second is this 'new' situation, where the store is unordered and
is not observable until a release, which is fundamentally so on PPC and
ARM64 but also possible due to lock implementation choices like with our
qspinlock, which makes it manifest even on x86.

Now, ideally we'd be able to use one barrier construct for both; but while
there is overlap, they're not the same. And I'd be somewhat reluctant to
issue superfluous smp_mb()s just because; it is an expensive instruction.

Paul has smp_mb__after_unlock_lock() for the RCpc 'upgrade'. How about
something like:

smp_mb__after_lock()

?
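
(For reference, Paul's existing primitive sits right after the lock
acquisition, roughly:

	spin_unlock(&A);
	spin_lock(&B);
	smp_mb__after_unlock_lock();	/* upgrade the unlock+lock pair to a
					   full, transitive barrier */

with A and B standing in for whatever locks are involved; a new
smp_mb__after_lock() would go in the same spot, just with the other
pairing in mind.)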


OTOH; even if we document this, it is something that is easy to forget
or miss. It is not like Documentation/memory-barriers.txt is in want of
more complexity.


Re: sem_lock() vs qspinlocks

2016-05-22 Thread Peter Zijlstra
On Sun, May 22, 2016 at 10:43:08AM +0200, Manfred Spraul wrote:
> How would we handle mixed spin_lock()/mutex_lock() code?
> For the IPC code, I would like to replace the outer lock with a mutex.
> The code only uses spinlocks, because at the time it was written, the mutex
> code didn't contain a busy wait.
> With a mutex, the code would become simpler (all the
> lock/unlock/kmalloc/relock parts could be removed).
> 
> The result would be something like:
> 
>   mutex_lock(A)   spin_lock(B)
>   spin_unlock_wait(B) if (!mutex_is_locked(A))
>   do_something()		  do_something()
> 

Should work similarly, but we'll have to audit mutex for these same
issues. I'll put it on todo.


Re: sem_lock() vs qspinlocks

2016-05-22 Thread Manfred Spraul

Hi Peter,


On 05/20/2016 06:04 PM, Peter Zijlstra wrote:

On Fri, May 20, 2016 at 05:21:49PM +0200, Peter Zijlstra wrote:


Let me write a patch..

OK, something like the below then.. lemme go build that and verify that
too fixes things.

---
Subject: locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()

Similar to commits:

   51d7d5205d33 ("powerpc: Add smp_mb() to arch_spin_is_locked()")
   d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")

qspinlock suffers from the fact that the _Q_LOCKED_VAL store is
unordered inside the ACQUIRE of the lock.

And while this is not a problem for the regular mutual exclusive
critical section usage of spinlocks, it breaks creative locking like:

	spin_lock(A)			spin_lock(B)
	spin_unlock_wait(B)		if (!spin_is_locked(A))
	do_something()			  do_something()

In that both CPUs can end up running do_something() at the same time,
because our _Q_LOCKED_VAL store can drop past the spin_unlock_wait() /
spin_is_locked() loads (even on x86!!).

How would we handle mixed spin_lock()/mutex_lock() code?
For the IPC code, I would like to replace the outer lock with a mutex.
The code only uses spinlocks, because at the time it was written, the 
mutex code didn't contain a busy wait.
With a mutex, the code would become simpler (all the 
lock/unlock/kmalloc/relock parts could be removed).


The result would be something like:

	mutex_lock(A)			spin_lock(B)
	spin_unlock_wait(B)		if (!mutex_is_locked(A))
	do_something()			  do_something()

--
Manfred


Re: sem_lock() vs qspinlocks

2016-05-21 Thread Davidlohr Bueso

On Sat, 21 May 2016, Peter Zijlstra wrote:


On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:

On Fri, 20 May 2016, Linus Torvalds wrote:


>Oh, I definitely agree on the stable part, and yes, the "split things
>up" model should come later if people agree that it's a good thing.

The backporting part is quite nice, yes, but ultimately I think I prefer
Linus' suggestion making things explicit, as opposed to consulting the
spinlock implying barriers. I also hate to have an smp_mb() (particularly
for spin_is_locked) given that we are not optimizing for the common case
(regular mutual excl).


I'm confused; we _are_ optimizing for the common case. spin_is_locked()
is very unlikely to be used. And arguably should be used less in favour
of lockdep_assert_held().


Indeed we are.

But by 'common case' I was really thinking about spin_is_locked() vs
spin_unlock_wait(). The former is the more common of the two, and the one
which mostly will _not_ be used for lock correctness purposes, hence it
doesn't need that new smp_mb(). So allowing users to explicitly state the
ordering they need (ie spin_lock_synchronize()) seems like the better
long-term alternative. OTOH, with your approach all such bugs are
automatically fixed :)

Thanks,
Davidlohr


Re: sem_lock() vs qspinlocks

2016-05-21 Thread Manfred Spraul

On 05/21/2016 09:37 AM, Peter Zijlstra wrote:

On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:

As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more tempting
to use for locking correctness. For example, taking a look at
nf_conntrack_all_lock(), it too likes to get smart with spin_unlock_wait()
-- also for finer graining purposes. While not identical to sems, it goes
like:

nf_conntrack_all_lock():		nf_conntrack_lock():
spin_lock(B);				spin_lock(A);

					if (bar) { // false
bar = 1;				   ...
					}
[loop ctrl-barrier]
spin_unlock_wait(A);
foo();					foo();

If the spin_unlock_wait() doesn't yet see the store that makes A visibly
locked, we could end up with both threads in foo(), no? (Although I'm
unsure about that ctrl barrier and which archs could fall into it. The
point was to see in-tree examples of creative thinking with locking.)

I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
because I suspect the netfilter code is broken without it.

And it seems intuitive to assume that if we return from unlock_wait() we
can indeed observe the critical section we waited on.
Then !spin_is_locked() and spin_unlock_wait() would be different with 
regards to memory barriers.

Would that really help?

My old plan was to document the rules, and define a generic 
smp_acquire__after_spin_is_unlocked.

https://lkml.org/lkml/2015/3/1/153

No one supported it, so it ended up as
ipc_smp_acquire__after_spin_is_unlocked().

Should we move it to linux/spinlock.h?
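
Something like the following, presumably (just a sketch of a generic
version; the name and location are still open):

	/*
	 * Provide ACQUIRE ordering after having observed, via the control
	 * dependency of a !spin_is_locked()/spin_unlock_wait() loop, that a
	 * lock was released; pairs with the RELEASE of that spin_unlock().
	 */
	#define smp_acquire__after_spin_is_unlocked()	smp_rmb()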

Who needs it?
- ipc/sem.c (but please start from the version from linux-next as 
reference, it is far less convoluted compared to the current code)

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/ipc/sem.c

- nf_conntrack

- task_rq_lock() perhaps needs smp_acquire__after_ctrl_dep
(I haven't yet figured out what happened to the proposed patch)
https://lkml.org/lkml/2015/2/17/129

--
Manfred


Re: sem_lock() vs qspinlocks

2016-05-21 Thread Peter Zijlstra
On Sat, May 21, 2016 at 12:01:00AM -0400, Waiman Long wrote:
> On 05/20/2016 08:59 PM, Davidlohr Bueso wrote:
> >On Fri, 20 May 2016, Peter Zijlstra wrote:
> >
> >>On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:
> >>
> Similarly, and I know you hate it, but afaict, then semantically
> queued_spin_is_contended() ought to be:
> 
> -   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> +   return atomic_read(&lock->val);
> 
> >>
> >>>Looking for contended lock, you need to consider the lock waiters
> >>>also. So
> >>>looking at the whole word is right.
> >>
> >>No, you _only_ need to look at the lock waiters.
> >
> >Is there anyway to do this in a single atomic_read? My thought is that
> >otherwise
> >we could further expand the race window

It's inherently racy; the arrival of a contender is subject to timing. No
point in trying to fix what can't be fixed.

> The existing code is doing that, but I would argue that including the
> locked, but uncontended case isn't a bad idea.

It _IS_ a bad idea, you get unconditional lock-breaks.

It's the same as:

	#define spin_is_contended(l)	(true)

Because the only reason you're using spin_is_contended() is if you're
holding it.
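
I.e. the current form, which only reports queued waiters, is the right one
(modulo the exact declaration, this is just the existing helper shown in
the diff above, for reference):

	static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
	{
		/* anything beyond the locked byte means someone is waiting */
		return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
	}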


Re: sem_lock() vs qspinlocks

2016-05-21 Thread Peter Zijlstra
On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:
> On Fri, 20 May 2016, Linus Torvalds wrote:
> 
> 
> >Oh, I definitely agree on the stable part, and yes, the "split things
> >up" model should come later if people agree that it's a good thing.
> 
> The backporting part is quite nice, yes, but ultimately I think I prefer
> Linus' suggestion making things explicit, as opposed to consulting the
> spinlock implying barriers. I also hate to have an smp_mb() (particularly
> for spin_is_locked) given that we are not optimizing for the common case
> (regular mutual excl).

I'm confused; we _are_ optimizing for the common case. spin_is_locked()
is very unlikely to be used. And arguably should be used less in favour
of lockdep_assert_held().

> As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more tempting
> to use for locking correctness. For example, taking a look at
> nf_conntrack_all_lock(), it too likes to get smart with spin_unlock_wait()
> -- also for finer graining purposes. While not identical to sems, it goes
> like:
> 
> nf_conntrack_all_lock():		nf_conntrack_lock():
> spin_lock(B);				spin_lock(A);
>
> 					if (bar) { // false
> bar = 1;				   ...
> 					}
> [loop ctrl-barrier]
> spin_unlock_wait(A);
> foo();					foo();
> 
> If the spin_unlock_wait() doesn't yet see the store that makes A visibly
> locked, we could end up with both threads in foo(), no? (Although I'm
> unsure about that ctrl barrier and which archs could fall into it. The
> point was to see in-tree examples of creative thinking with locking.)

I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
because I suspect the netfilter code is broken without it.

And it seems intuitive to assume that if we return from unlock_wait() we
can indeed observe the critical section we waited on.

Something a little like so; but then for _all_ implementations.

---
 include/asm-generic/qspinlock.h |  6 ++++++
 include/linux/compiler.h        | 13 ++++++++-----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 6bd05700d8c9..2f2eddd3e1f9 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -135,6 +135,12 @@ static inline void queued_spin_unlock_wait(struct qspinlock *lock)
 	smp_mb();
 	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
 		cpu_relax();
+
+	/*
+	 * Match the RELEASE of the spin_unlock() we just observed. Thereby
+	 * ensuring we observe the whole critical section that ended.
+	 */
+	smp_acquire__after_ctrl_dep();
 }
 
 #ifndef virt_spin_lock
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 793c0829e3a3..3c4bc8160947 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -304,21 +304,24 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int size)
 	__u.__val;				\
 })
 
+/*
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
 /**
  * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
  * @cond: boolean expression to wait for
  *
  * Equivalent to using smp_load_acquire() on the condition variable but employs
  * the control dependency of the wait to reduce the barrier on many platforms.
- *
- * The control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. ACQUIRE.
  */
 #define smp_cond_acquire(cond)	do {		\
 	while (!(cond))				\
 		cpu_relax();			\
-	smp_rmb(); /* ctrl + rmb := acquire */	\
+	smp_acquire__after_ctrl_dep();		\
 } while (0)
 
 #endif /* __KERNEL__ */


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Waiman Long

On 05/20/2016 08:59 PM, Davidlohr Bueso wrote:

On Fri, 20 May 2016, Peter Zijlstra wrote:


On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:


>Similarly, and I know you hate it, but afaict, then semantically
>queued_spin_is_contended() ought to be:
>
>-   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
>+   return atomic_read(&lock->val);
>


Looking for contended lock, you need to consider the lock waiters also.
So looking at the whole word is right.


No, you _only_ need to look at the lock waiters.


Is there any way to do this in a single atomic_read? My thought is that
otherwise we could further expand the race window of when the lock is and
isn't contended (as returned to the user). Ie avoiding crap like:

atomic_read(&lock->val) && atomic_read(&lock->val) != _Q_LOCKED_VAL

In any case, falsely returning for the 'locked, uncontended' case, vs
completely ignoring waiters, is probably the lesser evil :).

Thanks,
Davidlohr


The existing code is doing that, but I would argue that including the 
locked, but uncontended case isn't a bad idea.


Cheers,
Longman


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Linus Torvalds
On Fri, May 20, 2016 at 5:48 PM, Davidlohr Bueso  wrote:
>
> I can verify that this patch fixes the issue.

Ok, I've applied it to my tree.

 Linus


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Davidlohr Bueso

On Fri, 20 May 2016, Peter Zijlstra wrote:


On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:


>Similarly, and I know you hate it, but afaict, then semantically
>queued_spin_is_contended() ought to be:
>
>-   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
>+   return atomic_read(&lock->val);
>



Looking for contended lock, you need to consider the lock waiters also. So
looking at the whole word is right.


No, you _only_ need to look at the lock waiters.


Is there any way to do this in a single atomic_read? My thought is that
otherwise we could further expand the race window of when the lock is and
isn't contended (as returned to the user). Ie avoiding crap like:

atomic_read(&lock->val) && atomic_read(&lock->val) != _Q_LOCKED_VAL

In any case, falsely returning for the 'locked, uncontended' case, vs completely
ignoring waiters is probably the lesser evil :).

Thanks,
Davidlohr


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Davidlohr Bueso

On Fri, 20 May 2016, Linus Torvalds wrote:



Oh, I definitely agree on the stable part, and yes, the "split things
up" model should come later if people agree that it's a good thing.


The backporting part is quite nice, yes, but ultimately I think I prefer
Linus' suggestion making things explicit, as opposed to consulting the
spinlock implying barriers. I also hate to have an smp_mb() (particularly
for spin_is_locked) given that we are not optimizing for the common case
(regular mutual excl).

As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more tempting
to use for locking correctness. For example, taking a look at
nf_conntrack_all_lock(), it too likes to get smart with spin_unlock_wait()
-- also for finer graining purposes. While not identical to sems, it goes
like:

nf_conntrack_all_lock():		nf_conntrack_lock():
spin_lock(B);				spin_lock(A);

					if (bar) { // false
bar = 1;				   ...
					}
[loop ctrl-barrier]
spin_unlock_wait(A);
foo();					foo();

If the spin_unlock_wait() doesn't yet see the store that makes A visibly
locked, we could end up with both threads in foo(), no? (Although I'm
unsure about that ctrl barrier and which archs could fall into it. The
point was to see in-tree examples of creative thinking with locking.)


Should I take the patch as-is, or should I just wait for a pull
request from the locking tree? Either is ok by me.


I can verify that this patch fixes the issue.

Thanks,
Davidlohr


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Linus Torvalds
On Fri, May 20, 2016 at 2:06 PM, Peter Zijlstra  wrote:
>>
>> See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
>> a big comment atop of it that now becomes nonsensical with this patch.
>
> Not quite; we still need that I think.

I think so too, but it's the *comment* that is nonsensical.

The comment says that "spin_unlock_wait() and !spin_is_locked() are
not memory barriers", and clearly now those instructions *are* memory
barriers with your patch.

However, the semaphore code wants a memory barrier after the _read_ in
the spin_unlock_wait(), which it doesn't get.

So that is part of why I don't like the "hide memory barriers inside
the implementation".

Because once the operations aren't atomic (exactly like the spinlock
is now no longer atomic on x86: it's a separate read-with-acquire
followed by an unordered store for the queued case), the barrier
semantics within such an operation get very screwy.  There may be
barriers, but they aren't barriers to *everything*, they are just
barriers to part of the non-atomic operation.

If we were to make the synchronization explicit, we'd still have to
deal with all the subtle semantics, but now the subtle semantics would
at least be *explicit*. And it would make it much easier to explain
the barriers in that ipc semaphore code.
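
For instance, the complex-op side of sem_lock() might then read something
like this (just a sketch; both primitive names are the hypothetical ones
from the spin_lock_synchronize() proposal):

	spin_lock(&sma->sem_perm.lock);
	spin_lock_synchronize(&sma->sem_perm.lock,
			      &sma->sem_base[0].lock);	/* order our lock store */
	for (i = 0; i < sma->sem_nsems; i++)
		spin_unlock_wait(&sma->sem_base[i].lock);
	spin_lock_acquire_after_unlock();	/* see the finished simple ops */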

>> Now, I'd take Peter's patch as-is, because I don't think any of this
>> matters from a *performance* standpoint, and Peter's patch is much
>> smaller and simpler.
>
> I would suggest you do this and also mark it for stable v4.2 and later.

Oh, I definitely agree on the stable part, and yes, the "split things
up" model should come later if people agree that it's a good thing.

Should I take the patch as-is, or should I just wait for a pull
request from the locking tree? Either is ok by me.

  Linus


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 10:00:45AM -0700, Linus Torvalds wrote:
> On Fri, May 20, 2016 at 9:04 AM, Peter Zijlstra  wrote:
> > +* queued_spin_lock_slowpath() can ACQUIRE the lock before
> > +* issuing the unordered store that sets _Q_LOCKED_VAL.
> 
> Ugh. This was my least favorite part of the queued locks, and I really
> liked the completely unambiguous semantics of the x86 atomics-based
> versions we used to have.
> 
> But I guess we're stuck with it.

Yeah, I wasn't too happy either when I realized it today. We _could_
make these atomic ops, but then we're just making things slower for this
one weird case :/

> That said, it strikes me that almost all of the users of
> "spin_is_locked()" are using it for verification purposes (not locking
> correctness),

Right; although I feel those people should be using
lockdep_assert_held() for this instead. That not only compiles out when
!LOCKDEP but also asserts the current task/cpu is the lock owner, not
someone else.
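
E.g. for a pure sanity check (assertion only, no ordering; the lock name is
just an example):

	/* instead of: */
	BUG_ON(!spin_is_locked(&sma->sem_perm.lock));
	/* write: */
	lockdep_assert_held(&sma->sem_perm.lock);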

> and that the people who are playing games with locking
> correctness are few and already have to play *other* games anyway.
> 
> See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
> a big comment atop of it that now becomes nonsensical with this patch.

Not quite; we still need that I think.

We now have:

	spin_lock(A);
	smp_mb();
	while (!spin_is_locked(B))
		cpu_relax();
	smp_rmb();

And that control dependency together with the rmb form a load-acquire on
the unlocked B, which matches the release of the spin_unlock(B) and
ensures we observe the whole previous critical section we waited for.

The new smp_mb() doesn't help with that.

> Now, I'd take Peter's patch as-is, because I don't think any of this
> matters from a *performance* standpoint, and Peter's patch is much
> smaller and simpler.

I would suggest you do this and also mark it for stable v4.2 and later.

> But the reason I think it might be a good thing to introduce those
> spin_lock_synchronize() and spin_lock_acquire_after_unlock() concepts
> would be to make it very very clear what those subtle implementations
> in mutexes and the multi-level locks in the ipc layer are doing and
> what they rely on.

We can always do the fancy stuff on top, but that isn't going to need
backporting to all stable trees, this is.

I'll think a little more on the explicit document vs simple thing.


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 04:44:19PM -0400, Waiman Long wrote:
> On 05/20/2016 07:58 AM, Peter Zijlstra wrote:
> >On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> >>As such, the following restores the behavior of the ticket locks and 'fixes'
> >>(or hides?) the bug in sems. Naturally incorrect approach:
> >>
> >>@@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> >>
> >>	for (i = 0; i < sma->sem_nsems; i++) {
> >>		sem = sma->sem_base + i;
> >>-		spin_unlock_wait(&sem->lock);
> >>+		while (atomic_read(&sem->lock))
> >>+			cpu_relax();
> >>	}
> >>	ipc_smp_acquire__after_spin_is_unlocked();
> >>}
> >The actual bug is clear_pending_set_locked() not having acquire
> >semantics. And the above 'fixes' things because it will observe the old
> >pending bit or the locked bit, so it doesn't matter if the store
> >flipping them is delayed.
> 
> The clear_pending_set_locked() is not the only place where the lock is set.
> If there is more than one waiter, the queuing path will be used instead.
> The set_locked(), which is also an unordered store, will then be used to set
> the lock.

Ah yes. I didn't get that far. One case was enough :-)


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:

> >Similarly, and I know you hate it, but afaict, then semantically
> >queued_spin_is_contended() ought to be:
> >
> >-   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> >+   return atomic_read(&lock->val);
> >

> Looking for contended lock, you need to consider the lock waiters also. So
> looking at the whole word is right.

No, you _only_ need to look at the lock waiters.


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Waiman Long

On 05/20/2016 11:00 AM, Davidlohr Bueso wrote:

On Fri, 20 May 2016, Peter Zijlstra wrote:


On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
In addition, this makes me wonder if queued_spin_is_locked() should
then be:

-	return atomic_read(&lock->val);
+	return atomic_read(&lock->val) & _Q_LOCKED_MASK;

And avoid considering pending waiters as locked.


Probably


Similarly, and I know you hate it, but afaict, then semantically
queued_spin_is_contended() ought to be:

-	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+	return atomic_read(&lock->val);

Thanks,
Davidlohr


Looking for contended lock, you need to consider the lock waiters also. 
So looking at the whole word is right.


Cheers,
Longman


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Waiman Long

On 05/20/2016 07:58 AM, Peter Zijlstra wrote:

On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:

As such, the following restores the behavior of the ticket locks and 'fixes'
(or hides?) the bug in sems. Naturally incorrect approach:

@@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)

	for (i = 0; i < sma->sem_nsems; i++) {
		sem = sma->sem_base + i;
-		spin_unlock_wait(&sem->lock);
+		while (atomic_read(&sem->lock))
+			cpu_relax();
	}
	ipc_smp_acquire__after_spin_is_unlocked();
}

The actual bug is clear_pending_set_locked() not having acquire
semantics. And the above 'fixes' things because it will observe the old
pending bit or the locked bit, so it doesn't matter if the store
flipping them is delayed.


The clear_pending_set_locked() is not the only place where the lock is 
set. If there is more than one waiter, the queuing path will be used 
instead. The set_locked(), which is also an unordered store, will then 
be used to set the lock.


Cheers,
Longman
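
For context, both stores Waiman refers to were plain WRITE_ONCE()s in kernel/locking/qspinlock.c
of that time (the byte-access variant used when NR_CPUS < 16K); a rough sketch reconstructed from
that code, not a verbatim quote. Neither store implies any ordering by itself:

static __always_inline void set_locked(struct qspinlock *lock)
{
        struct __qspinlock *l = (void *)lock;

        /* plain byte store of _Q_LOCKED_VAL; no ACQUIRE semantics */
        WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
}

static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
{
        struct __qspinlock *l = (void *)lock;

        /* *,1,0 -> *,0,1 in a single halfword store; also unordered */
        WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
}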


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Linus Torvalds
On Fri, May 20, 2016 at 9:04 AM, Peter Zijlstra  wrote:
> +* queued_spin_lock_slowpath() can ACQUIRE the lock before
> +* issuing the unordered store that sets _Q_LOCKED_VAL.

Ugh. This was my least favorite part of the queued locks, and I really
liked the completely unambiguous semantics of the x86 atomics-based
versions we used to have.

But I guess we're stuck with it.

That said, it strikes me that almost all of the users of
"spin_is_locked()" are using it for verification purposes (not locking
correctness), and that the people who are playing games with locking
correctness are few and already have to play *other* games anyway.

See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
a big comment atop of it that now becomes nonsensical with this patch.

So I do wonder if we should make that smp_mb() be something the
*caller* has to do, and document rules for it. IOW, introduce a new
spinlock primitive called "spin_lock_synchronize()", and then spinlock
implementations that have this non-atomic behavior with an unordered
store would do something like

static inline void queued_spin_lock_synchronize(struct qspinlock
*a, struct qspinlock *b)
{
smp_mb();
}

and then we'd document that *if* you need ordering guarantees between

   spin_lock(a);
   .. spin_is_locked/spin_unlock_wait(b) ..

you have to have a

spin_lock_synchronize(a, b);

in between.

A spin-lock implementation with the old x86 atomic semantics would
make it a no-op.

We should also introduce something like that
"splin_lock_acquire_after_unlock()" so that the ipc/sem.c behavior
would be documented too. Partly because some spinlock implementations
might have stronger unlock ordering and that could be a no-op for
them, but mostly for documentation of the rules.

Now, I'd take Peter's patch as-is, because I don't think any of this
matters from a *performance* standpoint, and Peter's patch is much
smaller and simpler.

But the reason I think it might be a good thing to introduce those
spin_lock_synchronize() and spin_lock_acquire_after_unlock() concepts
would be to make it very very clear what those subtle implementations
in mutexes and the multi-level locks in the ipc layer are doing and
what they rely on.

Comments? There are arguments for Peter's simple approach too ("robust
and simpler interface" vs "make our requirements explicit").

   Linus
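
As a concrete, purely hypothetical rendering of the above, the documented rule could look like
this for an ipc-style fast path; spin_lock_synchronize() and try_fast_path() exist only in this
sketch, the barrier choice is exactly what the mail proposes:

#include <linux/types.h>
#include <linux/spinlock.h>

/*
 * Hypothetical interface: a full barrier for qspinlock, possibly a no-op
 * for an implementation whose lock ACQUIRE already orders the lock-word store.
 */
static inline void spin_lock_synchronize(spinlock_t *a, spinlock_t *b)
{
        smp_mb();
}

/* Documented rule: required between taking one lock and testing the other. */
static bool try_fast_path(spinlock_t *per_object, spinlock_t *global)
{
        spin_lock(per_object);
        spin_lock_synchronize(per_object, global);
        if (!spin_is_locked(global))
                return true;            /* fast path is safe */
        spin_unlock(per_object);
        return false;                   /* caller falls back to the global lock */
}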


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Davidlohr Bueso

On Fri, 20 May 2016, Peter Zijlstra wrote:


The problem is that the clear_pending_set_locked() is an unordered
store, therefore this store can be delayed until no later than
spin_unlock() (which orders against it due to the address dependency).

This opens numerous races; for example:

ipc_lock_object(&sma->sem_perm);
sem_wait_array(sma);

false   ->   spin_is_locked(&sma->sem_perm.lock)

is entirely possible, because sem_wait_array() consists of pure reads,
so the store can pass all that, even on x86.


I had pondered the unordered stores in clear_pending_set_locked() for arm,
for example, but I _certainly_ missed this for x86 inside the ACQUIRE region.

Thanks,
Davidlohr
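
For readers without ipc/sem.c in front of them, the two sides that race here are, condensed
(the wrapper names below are made up for the sketch; the locking calls are the ones from
ipc/sem.c of that era, with all the complex_count bookkeeping left out):

/* Complex (multi-sop) side: take the array-wide lock, then wait for every
 * in-flight simple op to drop its per-semaphore lock. */
static void complex_side(struct sem_array *sma)
{
        ipc_lock_object(&sma->sem_perm);        /* spin_lock(&sma->sem_perm.lock) */
        sem_wait_array(sma);                    /* spin_unlock_wait() on each sem->lock */
        /* ... operate on several semaphores ... */
}

/* Simple (single-sop) side: take the per-semaphore lock, then check that no
 * complex operation holds the array-wide lock. */
static bool simple_fastpath(struct sem_array *sma, struct sem *sem)
{
        spin_lock(&sem->lock);
        if (!spin_is_locked(&sma->sem_perm.lock))
                return true;            /* proceed under sem->lock only */
        spin_unlock(&sem->lock);
        return false;                   /* fall back to ipc_lock_object() */
}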


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 05:21:49PM +0200, Peter Zijlstra wrote:

> Let me write a patch..

OK, something like the below then.. lemme go build that and verify that
too fixes things.

---
Subject: locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()

Similar to commits:

  51d7d5205d33 ("powerpc: Add smp_mb() to arch_spin_is_locked()")
  d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent 
lockers")

qspinlock suffers from the fact that the _Q_LOCKED_VAL store is
unordered inside the ACQUIRE of the lock.

And while this is not a problem for the regular mutual exclusive
critical section usage of spinlocks, it breaks creative locking like:

spin_lock(A)            spin_lock(B)
spin_unlock_wait(B)     if (!spin_is_locked(A))
do_something()            do_something()

In that both CPUs can end up running do_something at the same time,
because our _Q_LOCKED_VAL store can drop past the spin_unlock_wait()
spin_is_locked() loads (even on x86!!).

To avoid making the normal case slower, add smp_mb()s to the less used
spin_unlock_wait() / spin_is_locked() side of things to avoid this
problem.

Reported-by: Davidlohr Bueso 
Reported-by: Giovanni Gherdovich 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/asm-generic/qspinlock.h | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 35a52a880b2f..6bd05700d8c9 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -28,7 +28,30 @@
  */
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-   return atomic_read(&lock->val);
+   /*
+* queued_spin_lock_slowpath() can ACQUIRE the lock before
+* issuing the unordered store that sets _Q_LOCKED_VAL.
+*
+* See both smp_cond_acquire() sites for more detail.
+*
+* This however means that in code like:
+*
+*   spin_lock(A)   spin_lock(B)
+*   spin_unlock_wait(B)spin_is_locked(A)
+*   do_something() do_something()
+*
+* Both CPUs can end up running do_something() because the store
+* setting _Q_LOCKED_VAL will pass through the loads in
+* spin_unlock_wait() and/or spin_is_locked().
+*
+* Avoid this by issuing a full memory barrier between the spin_lock()
+* and the loads in spin_unlock_wait() and spin_is_locked().
+*
+* Note that regular mutual exclusion doesn't care about this
+* delayed store.
+*/
+   smp_mb();
+   return atomic_read(&lock->val) & _Q_LOCKED_MASK;
 }
 
 /**
@@ -108,6 +131,8 @@ static __always_inline void queued_spin_unlock(struct 
qspinlock *lock)
  */
 static inline void queued_spin_unlock_wait(struct qspinlock *lock)
 {
+   /* See queued_spin_is_locked() */
+   smp_mb();
while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
cpu_relax();
 }


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 05:05:05PM +0200, Peter Zijlstra wrote:
> On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:
> > On Fri, 20 May 2016, Peter Zijlstra wrote:
> > 
> > >On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > >> In addition, this makes me wonder if queued_spin_is_locked() should then 
> > >> be:
> > >>
> > >>- return atomic_read(&lock->val);
> > >>+ return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> > >>
> > >>And avoid considering pending waiters as locked.
> > >
> > >Probably
> > 
> > Similarly, and I know you hate it, but afaict, then semantically
> > queued_spin_is_contended() ought to be:
> > 
> > -   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> > +   return atomic_read(&lock->val);
> 
> Nah, that would make it return true for (0,0,1), ie. uncontended locked.

FWIW, the only usage of spin_is_contended() should be for lock breaking,
see spin_needbreak().

This also means that

  #define spin_is_contended(l) (false)

is a valid implementation, where the only down-side is worse latency.

This is done (together with GENERIC_LOCKBREAK), to allow trivial
test-and-set spinlock implementations; as these cannot tell if the lock
is contended.
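
The lock-breaking consumer looks roughly like this (shape of spin_needbreak() from
include/linux/sched.h of roughly that vintage; a sketch, not a verbatim copy):

#include <linux/spinlock.h>

/* A false result only delays the lock break; correctness is unaffected. */
static inline int spin_needbreak(spinlock_t *lock)
{
#ifdef CONFIG_PREEMPT
        return spin_is_contended(lock);
#else
        return 0;
#endif
}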


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Davidlohr Bueso

On Fri, 20 May 2016, Peter Zijlstra wrote:


On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:

On Fri, 20 May 2016, Peter Zijlstra wrote:

>On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>> In addition, this makes me wonder if queued_spin_is_locked() should then be:
>>
>>-   return atomic_read(&lock->val);
>>+   return atomic_read(&lock->val) & _Q_LOCKED_MASK;
>>
>>And avoid considering pending waiters as locked.
>
>Probably

Similarly, and I know you hate it, but afaict, then semantically
queued_spin_is_contended() ought to be:

-   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+   return atomic_read(&lock->val);


Nah, that would make it return true for (0,0,1), ie. uncontended locked.


Right, and we want:

(*, 1, 1)
(*, 1, 0)
(n, 0, 0)

I may be missing some combinations, it's still early.
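
Spelled out against the (tail, pending, locked) word with the masks discussed above, the states
line up roughly as follows (a comment-style tabulation, not code from the tree):

/*
 * (tail, pending, locked)           is_locked            is_contended
 *                                (& _Q_LOCKED_MASK)   (& ~_Q_LOCKED_MASK)
 *
 * (0, 0, 0)  free                    no                   no
 * (0, 0, 1)  held, no waiters        yes                  no
 * (*, 1, 0)  pending hand-over       no                   yes
 * (*, 1, 1)  held plus pending       yes                  yes
 * (n, 0, 0)  queue hand-over         no                   yes
 * (n, 0, 1)  held plus queue         yes                  yes
 */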


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 10:05:33PM +0800, Boqun Feng wrote:
> On Fri, May 20, 2016 at 01:58:19PM +0200, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > As such, the following restores the behavior of the ticket locks and 
> > > 'fixes'
> > > (or hides?) the bug in sems. Naturally incorrect approach:
> > > 
> > > @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> > > 
> > >   for (i = 0; i < sma->sem_nsems; i++) {
> > >   sem = sma->sem_base + i;
> > > -   spin_unlock_wait(&sem->lock);
> > > +   while (atomic_read(&sem->lock))
> > > +   cpu_relax();
> > >   }
> > >   ipc_smp_acquire__after_spin_is_unlocked();
> > > }
> > 
> > The actual bug is clear_pending_set_locked() not having acquire
> > semantics. And the above 'fixes' things because it will observe the old
> > pending bit or the locked bit, so it doesn't matter if the store
> > flipping them is delayed.
> > 
> > The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
> > states that that acquire is sufficient, but this is incorrect in the
> > face of spin_is_locked()/spin_unlock_wait() usage only looking at the
> > lock byte.
> > 
> > The problem is that the clear_pending_set_locked() is an unordered
> > store, therefore this store can be delayed until no later than
> > spin_unlock() (which orders against it due to the address dependency).
> > 
> > This opens numerous races; for example:
> > 
> > ipc_lock_object(&sma->sem_perm);
> > sem_wait_array(sma);
> > 
> > false   ->  spin_is_locked(&sma->sem_perm.lock)
> > 
> > is entirely possible, because sem_wait_array() consists of pure reads,
> > so the store can pass all that, even on x86.
> > 
> > The below 'hack' seems to solve the problem.
> > 
> > _However_ this also means the atomic_cmpxchg_relaxed() in the locked:
> > branch is equally wrong -- although not visible on x86. And note that
> > atomic_cmpxchg_acquire() would not in fact be sufficient either, since
> > the acquire is on the LOAD not the STORE of the LL/SC.
> > 
> > I need a break of sorts, because after twisting my head around the sem
> > code and then the qspinlock code I'm wrecked. I'll try and make a proper
> > patch if people can indeed confirm my thinking here.
> > 
> 
> I think your analysis is right, however, the problem only exists if we
> have the following use pattern, right?
> 
>   CPU 0                   CPU 1
>   =====                   =====
>   spin_lock(A);           spin_lock(B);
>   spin_unlock_wait(B);    spin_unlock_wait(A);
>   do_something();         do_something();

More or less yes. The semaphore code is like:

spin_lock(A)            spin_lock(B)
spin_unlock_wait(B)     spin_is_locked(A)

which shows that both spin_is_locked() and spin_unlock_wait() are in the
same class.

> , which ends up CPU 0 and 1 both running do_something(). And actually
> this can be simply fixed by add smp_mb() between spin_lock() and
> spin_unlock_wait() on both CPU, or add an smp_mb() in spin_unlock_wait()
> as PPC does in 51d7d5205d338 "powerpc: Add smp_mb() to arch_spin_is_locked()".

Right and arm64 does in d86b8da04dfa. Curiously you only fixed
spin_is_locked() and Will only fixed spin_unlock_wait, while AFAIU we
need to have _BOTH_ fixed.

Now looking at the PPC code, spin_unlock_wait() as per
arch/powerpc/lib/locks.c actually does include the extra smp_mb().

> So if relaxed/acquire atomics and clear_pending_set_locked() work fine
> in other situations, a proper fix would be fixing the
> spin_is_locked()/spin_unlock_wait() or their users?

Right; the relaxed stores work fine for the 'regular' mutual exclusive
critical section usage of locks. And yes, I think only the case you
outlined can care about it.

Let me write a patch..
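
For reference, the powerpc side of that pair of fixes (51d7d5205d338) is essentially the shape
below; sketched from memory of the commit, not quoted verbatim:

/* arch/powerpc/include/asm/spinlock.h after 51d7d5205d338 (sketch) */
static inline int arch_spin_is_locked(arch_spinlock_t *lock)
{
        smp_mb();       /* order the caller's prior lock acquisition against this read */
        return !arch_spin_value_unlocked(*lock);
}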


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:
> On Fri, 20 May 2016, Peter Zijlstra wrote:
> 
> >On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> >> In addition, this makes me wonder if queued_spin_is_locked() should then 
> >> be:
> >>
> >>-   return atomic_read(&lock->val);
> >>+   return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> >>
> >>And avoid considering pending waiters as locked.
> >
> >Probably
> 
> Similarly, and I know you hate it, but afaict, then semantically
> queued_spin_is_contended() ought to be:
> 
> -   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> +   return atomic_read(&lock->val);

Nah, that would make it return true for (0,0,1), ie. uncontended locked.




Re: sem_lock() vs qspinlocks

2016-05-20 Thread Davidlohr Bueso

On Fri, 20 May 2016, Peter Zijlstra wrote:


On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:

 In addition, this makes me wonder if queued_spin_is_locked() should then be:

-   return atomic_read(&lock->val);
+   return atomic_read(&lock->val) & _Q_LOCKED_MASK;

And avoid considering pending waiters as locked.


Probably


Similarly, and I know you hate it, but afaict, then semantically
queued_spin_is_contended() ought to be:

-   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+   return atomic_read(&lock->val);

Thanks,
Davidlohr


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Boqun Feng
Hi Peter,

On Fri, May 20, 2016 at 01:58:19PM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > As such, the following restores the behavior of the ticket locks and 'fixes'
> > (or hides?) the bug in sems. Naturally incorrect approach:
> > 
> > @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> > 
> > for (i = 0; i < sma->sem_nsems; i++) {
> > sem = sma->sem_base + i;
> > -   spin_unlock_wait(&sem->lock);
> > +   while (atomic_read(&sem->lock))
> > +   cpu_relax();
> > }
> > ipc_smp_acquire__after_spin_is_unlocked();
> > }
> 
> The actual bug is clear_pending_set_locked() not having acquire
> semantics. And the above 'fixes' things because it will observe the old
> pending bit or the locked bit, so it doesn't matter if the store
> flipping them is delayed.
> 
> The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
> states that that acquire is sufficient, but this is incorrect in the
> face of spin_is_locked()/spin_unlock_wait() usage only looking at the
> lock byte.
> 
> The problem is that the clear_pending_set_locked() is an unordered
> store, therefore this store can be delayed until no later than
> spin_unlock() (which orders against it due to the address dependency).
> 
> This opens numerous races; for example:
> 
>   ipc_lock_object(&sma->sem_perm);
>   sem_wait_array(sma);
> 
>   false   ->  spin_is_locked(&sma->sem_perm.lock)
> 
> is entirely possible, because sem_wait_array() consists of pure reads,
> so the store can pass all that, even on x86.
> 
> The below 'hack' seems to solve the problem.
> 
> _However_ this also means the atomic_cmpxchg_relaxed() in the locked:
> branch is equally wrong -- although not visible on x86. And note that
> atomic_cmpxchg_acquire() would not in fact be sufficient either, since
> the acquire is on the LOAD not the STORE of the LL/SC.
> 
> I need a break of sorts, because after twisting my head around the sem
> code and then the qspinlock code I'm wrecked. I'll try and make a proper
> patch if people can indeed confirm my thinking here.
> 

I think your analysis is right, however, the problem only exists if we
have the following use pattern, right?

CPU 0                   CPU 1
=====                   =====
spin_lock(A);           spin_lock(B);
spin_unlock_wait(B);    spin_unlock_wait(A);
do_something();         do_something();

, which ends up with CPU 0 and 1 both running do_something(). And actually
this can be simply fixed by adding an smp_mb() between spin_lock() and
spin_unlock_wait() on both CPUs, or adding an smp_mb() in spin_unlock_wait()
as PPC does in 51d7d5205d338 "powerpc: Add smp_mb() to arch_spin_is_locked()".
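
In caller-side form, that fix is simply the following; a sketch with hypothetical locks A/B and
do_something(), only the placement of the barrier matters:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(A);
static DEFINE_SPINLOCK(B);
static void do_something(void);

static void cpu0(void)
{
        spin_lock(&A);
        smp_mb();               /* keep the lock-word store from passing the loads below */
        spin_unlock_wait(&B);   /* at least one CPU now observes the other's lock */
        do_something();
        spin_unlock(&A);
}

static void cpu1(void)
{
        spin_lock(&B);
        smp_mb();
        spin_unlock_wait(&A);
        do_something();
        spin_unlock(&B);
}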

So if relaxed/acquire atomics and clear_pending_set_locked() work fine
in other situations, a proper fix would be fixing the
spin_is_locked()/spin_unlock_wait() or their users?

Regards,
Boqun

> ---
>  kernel/locking/qspinlock.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index ce2f75e32ae1..348e172e774f 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -366,6 +366,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, 
> u32 val)
>* *,1,0 -> *,0,1
>*/
>   clear_pending_set_locked(lock);
> + smp_mb();
>   return;
>  
>   /*


signature.asc
Description: PGP signature


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> As such, the following restores the behavior of the ticket locks and 'fixes'
> (or hides?) the bug in sems. Naturally incorrect approach:
> 
> @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> 
>   for (i = 0; i < sma->sem_nsems; i++) {
>   sem = sma->sem_base + i;
> -   spin_unlock_wait(&sem->lock);
> +   while (atomic_read(&sem->lock))
> +   cpu_relax();
>   }
>   ipc_smp_acquire__after_spin_is_unlocked();
> }

The actual bug is clear_pending_set_locked() not having acquire
semantics. And the above 'fixes' things because it will observe the old
pending bit or the locked bit, so it doesn't matter if the store
flipping them is delayed.

The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
states that that acquire is sufficient, but this is incorrect in the
face of spin_is_locked()/spin_unlock_wait() usage only looking at the
lock byte.

The problem is that the clear_pending_set_locked() is an unordered
store, therefore this store can be delayed until no later than
spin_unlock() (which orders against it due to the address dependency).

This opens numerous races; for example:

ipc_lock_object(&sma->sem_perm);
sem_wait_array(sma);

false   ->  spin_is_locked(&sma->sem_perm.lock)

is entirely possible, because sem_wait_array() consists of pure reads,
so the store can pass all that, even on x86.
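
(For completeness, sem_wait_array() really is nothing but loads; condensed from ipc/sem.c of
that era, with the comments trimmed and paraphrased:)

static void sem_wait_array(struct sem_array *sma)
{
        int i;
        struct sem *sem;

        if (sma->complex_count) {
                /* complex mode already active; the previous complex op did the waiting */
                return;
        }

        for (i = 0; i < sma->sem_nsems; i++) {
                sem = sma->sem_base + i;
                spin_unlock_wait(&sem->lock);           /* pure reads */
        }
        ipc_smp_acquire__after_spin_is_unlocked();      /* smp_rmb() at the time */
}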

The below 'hack' seems to solve the problem.

_However_ this also means the atomic_cmpxchg_relaxed() in the locked:
branch is equally wrong -- although not visible on x86. And note that
atomic_cmpxchg_acquire() would not in fact be sufficient either, since
the acquire is on the LOAD not the STORE of the LL/SC.

I need a break of sorts, because after twisting my head around the sem
code and then the qspinlock code I'm wrecked. I'll try and make a proper
patch if people can indeed confirm my thinking here.

---
 kernel/locking/qspinlock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ce2f75e32ae1..348e172e774f 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -366,6 +366,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 * *,1,0 -> *,0,1
 */
clear_pending_set_locked(lock);
+   smp_mb();
return;
 
/*


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Mel Gorman
On Fri, May 20, 2016 at 12:09:01PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra  wrote:
> 
> > On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > > Specifically
> > > > for the 'cascade_cond' and 'cascade_flock' programs, which exhibit 
> > > > hangs in libc's
> > > > semop() blocked waiting for zero.
> > > 
> > > OK; so I've been running:
> > > 
> > > while :; do
> > >   bin/cascade_cond -E -C 200 -L -S -W -T 200  -I 200 ;
> > >   bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 500 ;
> > > done
> > > 
> > > for a few minutes now and its not stuck and my machine didn't splat.
> > > 
> > > Am I not doing it right?
> > 
> > Hooray, it went *bang*..
> 
> I suspect a required step was to post about failure to reproduce!
> 

It is known that the bug is both intermittent and not all machines can
reproduce the problem. If it fails to reproduce, it's not necessarily a
methodology error and can simply be a function of luck.

-- 
Mel Gorman
SUSE Labs


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > Specifically
> > > for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs 
> > > in libc's
> > > semop() blocked waiting for zero.
> > 
> > OK; so I've been running:
> > 
> > while :; do
> >   bin/cascade_cond -E -C 200 -L -S -W -T 200  -I 200 ;
> >   bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 500 ;
> > done
> > 
> > for a few minutes now and its not stuck and my machine didn't splat.
> > 
> > Am I not doing it right?
> 
> Hooray, it went *bang*..

I suspect a required step was to post about failure to reproduce!

Ingo


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 11:07:49AM +0200, Giovanni Gherdovich wrote:
> --- run_cascade.sh -
> #!/bin/bash
> 
> TESTCASE=$1
> CASCADE_PATH="libmicro-1-installed/bin-x86_64"
> 
> case $TESTCASE in
> c_flock_200)
> BINNAME="cascade_flock"
> COMMAND="$CASCADE_PATH/cascade_flock -E -D 6 -L -S -W \
>  -N c_flock_200 \
>  -P 200 -I 500"
> # c_flock_200 is supposed to last 60 seconds.
> SLEEPTIME=70
> ;;
> c_cond_10)
> BINNAME="cascade_cond"
> COMMAND="$CASCADE_PATH/cascade_cond -E -C 2000 -L -S -W \
> -N c_cond_10 \
> -T 10 -I 3000"
> # c_cond_10 terminates in less than 1 second.
> SLEEPTIME=5
> ;;
> *)
> echo "Unknown test case" >&2
> exit 1
> ;;
> esac
> 
> ERRORS=0
> uname -a
> for i in {1..10} ; do
> {
> eval $COMMAND &
> } >/dev/null 2>&1
> sleep $SLEEPTIME
> if pidof $BINNAME >/dev/null ; then
> echo Run \#$i: $TESTCASE hangs
> for PID in $(pidof $BINNAME) ; do
> head -1 /proc/$PID/stack
> done | sort | uniq -c
> ERRORS=$((ERRORS+1))
> killall $BINNAME
> else
> echo Run \#$i: $TESTCASE exits successfully
> fi
> done
> echo $TESTCASE hanged $ERRORS times.
> 

Thanks, that's a much nicer script than mine ;-)


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > Specifically
> > for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs in 
> > libc's
> > semop() blocked waiting for zero.
> 
> OK; so I've been running:
> 
> while :; do
>   bin/cascade_cond -E -C 200 -L -S -W -T 200  -I 200 ;
>   bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 500 ;
> done
> 
> for a few minutes now and its not stuck and my machine didn't splat.
> 
> Am I not doing it right?

Hooray, it went *bang*.. OK, lemme have a harder look at this semaphore
code.


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> Specifically
> for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs in 
> libc's
> semop() blocked waiting for zero.

OK; so I've been running:

while :; do
  bin/cascade_cond -E -C 200 -L -S -W -T 200  -I 200 ;
  bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 500 ;
done

for a few minutes now and its not stuck and my machine didn't splat.

Am I not doing it right?


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Fri, May 20, 2016 at 10:13:15AM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> 
> > [1] https://hg.java.net/hg/libmicro~hg-repo
> 
> So far I've managed to install mercurial and clone this thing, but it
> doesn't actually build :/
> 
> I'll try harder..

The stuff needs this..

---
diff -r 7dd95b416c3c Makefile.com
--- a/Makefile.com  Thu Jul 26 12:56:00 2012 -0700
+++ b/Makefile.com  Fri May 20 10:18:08 2016 +0200
@@ -107,7 +107,7 @@
echo "char compiler_version[] = \""`$(COMPILER_VERSION_CMD)`"\";" > 
tattle.h
echo "char CC[] = \""$(CC)"\";" >> tattle.h
echo "char extra_compiler_flags[] = \""$(extra_CFLAGS)"\";" >> tattle.h
-   $(CC) -o tattle $(CFLAGS) -I. ../tattle.c libmicro.a -lrt -lm
+   $(CC) -o tattle $(CFLAGS) -I. ../tattle.c libmicro.a -lrt -lm -lpthread
 
 $(ELIDED_BENCHMARKS):  ../elided.c
$(CC) -o $(@) ../elided.c


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:

> [1] https://hg.java.net/hg/libmicro~hg-repo

So far I've managed to install mercurial and clone this thing, but it
doesn't actually build :/

I'll try harder..


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:

> However, this is semantically different to
> what was previously done with ticket locks in that spin_unlock_wait() will 
> always observe
> all waiters by adding itself to the tail.


static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
__ticket_t head = READ_ONCE(lock->tickets.head);

for (;;) {
struct __raw_tickets tmp = READ_ONCE(lock->tickets);
/*
 * We need to check "unlocked" in a loop, tmp.head == head
 * can be false positive because of overflow.
 */
if (__tickets_equal(tmp.head, tmp.tail) ||
!__tickets_equal(tmp.head, head))
break;

cpu_relax();
}
}


I'm not seeing that (although I think I agreed yesterday on IRC). Note
how we observe the head and then loop until either the lock is unlocked
(head == tail) or simply head isn't what it used to be.

And head is the lock holder end of the queue; see arch_spin_unlock()
incrementing it.

So the ticket lock too should only wait for the current lock holder to
go away, not any longer.


Re: sem_lock() vs qspinlocks

2016-05-20 Thread Peter Zijlstra
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>  In addition, this makes me wonder if queued_spin_is_locked() should then be:
> 
> - return atomic_read(&lock->val);
> + return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> 
> And avoid considering pending waiters as locked.

Probably


sem_lock() vs qspinlocks

2016-05-19 Thread Davidlohr Bueso

Hi,

Giovanni ran into a pretty reproducible situation in which the libmicro
benchmark[1] shows a functional regression in sysv semaphores, on upstream
kernels. Specifically for the 'cascade_cond' and 'cascade_flock' programs,
which exhibit hangs in libc's semop() blocked waiting for zero.
Alternatively, the following splats may appear:

[  692.991258] BUG: unable to handle kernel NULL pointer dereference (null)
[  692.992062] IP: [] unmerge_queues+0x2f/0x70
[  692.992062] PGD 862fab067 PUD 858bbc067 PMD 0
[  692.992062] Oops:  [#1] SMP
[  692.992062] Modules linked in: ...
[  692.992062] CPU: 18 PID: 7398 Comm: cascade_flock Tainted: GE   
4.6.0-juancho2-default+ #18
[  692.992062] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
GRNDSDP1.86B.0030.R03.1405061547 05/06/2014
[  692.992062] task: 88084a7e9640 ti: 880854748000 task.ti: 
880854748000
[  692.992062] RIP: 0010:[]  [] 
unmerge_queues+0x2f/0x70
[  692.992062] RSP: 0018:88085474bce8  EFLAGS: 00010216
[  692.992062] RAX:  RBX:  RCX: 88086cc3d0d0
[  692.992062] RDX: 88086cc3d0d0 RSI: 88086cc3d0d0 RDI: 88086cc3d040
[  692.992062] RBP: 88085474bce8 R08: 0007 R09: 88086cc3d088
[  692.992062] R10:  R11: 00a1597ea64c R12: 88085474bd90
[  692.992062] R13: 88086cc3d040 R14:  R15: 
[  692.992062] FS:  7faa46a2d700() GS:88086e50() 
knlGS:
[  692.992062] CS:  0010 DS:  ES:  CR0: 80050033
[  692.992062] CR2:  CR3: 000862faa000 CR4: 001406e0
[  692.992062] Stack:
[  692.992062]  88085474bf38 812a2ac3 810c3995 
88084a7e9640
[  692.992062]   81cb1f48 0002 
880400038000
[  692.992062]  88084a7e9640 88085474bd40 fffc 
88085474bd40
[  692.992062] Call Trace:
[  692.992062]  [] SYSC_semtimedop+0x833/0xc00
[  692.992062]  [] ? __wake_up_common+0x55/0x90
[  692.992062]  [] ? kmem_cache_alloc+0x1e0/0x200
[  692.992062]  [] ? locks_alloc_lock+0x1b/0x70
[  692.992062]  [] ? locks_insert_lock_ctx+0x93/0xa0
[  692.992062]  [] ? flock_lock_inode+0xf4/0x220
[  692.992062]  [] ? locks_lock_inode_wait+0x47/0x160
[  692.992062]  [] ? kmem_cache_alloc+0x1e0/0x200
[  692.992062]  [] ? locks_alloc_lock+0x1b/0x70
[  692.992062]  [] ? locks_free_lock+0x4f/0x60
[  692.992062]  [] SyS_semop+0x10/0x20
[  692.992062]  [] entry_SYSCALL_64_fastpath+0x1a/0xa4
[  692.992062] Code: 00 55 8b 47 7c 48 89 e5 85 c0 75 53 48 8b 4f 48 4c 8d 4f 48 4c 
39 c9 48 8b 11 48 89 ce 75 08 eb 36 48 89 d1 48 89 c2 48 8b 41 28 <0f> b7 00 48 
c1 e0 06 48 03 47 40 4c 8b 40 18 48 89 70 18 48 83
[  692.992062] RIP  [] unmerge_queues+0x2f/0x70
[  692.992062]  RSP 
[  692.992062] CR2: 
[  693.882179] ---[ end trace 5605f108ab79cdb2 ]---

Or,

[  463.567641] BUG: unable to handle kernel paging request at fffa
[  463.576246] IP: [] perform_atomic_semop.isra.5+0xcf/0x170
[  463.584553] PGD 1c0d067 PUD 1c0f067 PMD 0
[  463.590071] Oops:  [#1] SMP
[  463.594667] Modules linked in: ...
[  463.664710] Supported: Yes
[  463.668682] CPU: 6 PID: 2912 Comm: cascade_cond Not tainted 4.4.3-29-default 
#1
[  463.677230] Hardware name: SGI.COM C2112-4GP3/X10DRT-P, BIOS 1.0b 04/07/2015
[  463.685588] task: 88105dba0b40 ti: 8808fc7e task.ti: 
8808fc7e
[  463.694366] RIP: 0010:[]  [] 
perform_atomic_semop.isra.5+0xcf/0x170
[  463.705084] RSP: 0018:8808fc7e3c60  EFLAGS: 00010217
[  463.711610] RAX:  RBX: 88085d22dae0 RCX: 5732f1e7
[  463.719952] RDX: fffa RSI: 88085d22dad0 RDI: 88085d22da80
[  463.728270] RBP:  R08: fff7 R09: 
[  463.736561] R10:  R11: 0206 R12: 88085d22da88
[  463.745001] R13: 88085d22dad0 R14: ffc0 R15: 88085d22da40
[  463.753450] FS:  7f30fd9e5700() GS:88085fac() knlGS:
[  463.762684] CS:  0010 DS:  ES:  CR0: 80050033
[  463.769731] CR2: fffa CR3: 00017bc09000 CR4: 001406e0
[  463.778039] Stack:
[  463.781130]  8808fc7e3d50 88085d22dad0 8126dfe1 4d226800
[  463.789704]  88085d22da80 0001 88085d22da88 0001
[  463.798254]  0001 88085d22da40 8808fc7e3d50 8808fc7e3da0
[  463.806758] Call Trace:
[  463.810305]  [] update_queue+0xa1/0x180
[  463.816706]  [] do_smart_update+0x45/0xf0
[  463.823276]  [] SYSC_semtimedop+0x3d1/0xb00
[  463.830035]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[  463.838608] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x71
[  463.846331]
[  463.848747] Leftover inexact backtrace:
[  463.848747]
[  463.855853] Code: 80 00 00 81 f9 ff ff 00 00 0f 87 98 00 00 00 66 45 89 0b 48 83 c2 06 44 89 00 48 39
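
For reference, "blocked waiting for zero" means a semop() call with
sem_op == 0, which sleeps until the semaphore's value drops to zero.
Here is a minimal, self-contained userspace sketch of such an operation
(illustrative only, not code from libmicro; error handling kept to a
minimum):

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
	struct sembuf wait_for_zero = {
		.sem_num = 0,	/* first (and only) semaphore in the set */
		.sem_op  = 0,	/* sem_op == 0: sleep until the value is 0 */
		.sem_flg = 0,
	};
	int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

	if (semid < 0) {
		perror("semget");
		return 1;
	}

	/*
	 * Blocks until the semaphore value reaches zero; this is the
	 * kind of call the cascade programs were seen stuck in. (On
	 * Linux a freshly created set starts at 0, so this standalone
	 * call returns immediately.)
	 */
	if (semop(semid, &wait_for_zero, 1) < 0)
		perror("semop");

	semctl(semid, 0, IPC_RMID);	/* remove the semaphore set */
	return 0;
}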
