Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-26 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Darren Hart wrote:
> On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> > So now your code melts down to:
> > 
> >  write(hb->waiters)   |   write(uaddr)
> >  mb                   |   read(hb->waiters)
> >  read(uaddr)
> > 
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.
> 
> Does seem to be the case.

Actually not. It's protected by an atomic_inc() between the
write(uaddr) and read(hb->waiters) on the waker side.

It took me a while to realize that get_futex_key_refs() is providing
the protection inadvertently. But as I explained, we cannot rely on
that for all architectures.

Thanks,

tglx








--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Darren Hart wrote:
> On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> > > which restores the ordering guarantee that the hash bucket lock
> > > provided so far.
> > 
> > Actually that's not true by design, it just happens to work.
> > 
> > atomic_inc() on x86 is a "lock incl".
> > 
> >  The LOCK prefix just guarantees that the cache line which is affected
> >  by the INCL is locked. And it guarantees that locked operations
> >  serialize all outstanding load and store operations.
> > 
> >  But Documentation/atomic_ops.txt says about atomic_inc():
> > 
> >  "One very important aspect of these two routines is that they DO NOT
> >   require any explicit memory barriers.  They need only perform the
> >   atomic_t counter update in an SMP safe manner."
> > 
> >  So while this has a barrier on x86, it's not guaranteed.
> 
> 
> But it is guaranteed to be "in an SMP safe manner"... which I guess just
> means that two writes will not intermix bytes, but no guarantee that the
> value will be visible to other CPUs unless some kind of barrier is
> explicitly imposed.
> 
> Correct?

Yep, that's what it says.
 
> > So now your code melts down to:
> > 
> >  write(hb->waiters)   |   write(uaddr)
> >  mb                   |   read(hb->waiters)
> >  read(uaddr)
> > 
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.
> 
> Does seem to be the case.

You must be aware that between the write(uaddr) and the
read(hb->waiters) there is the syscall, i.e. a user/kernel space
transition.

sysenter/syscall have no documented barrier semantics, and we don't
know whether the actual hardware implementation provides one, or
whether the memory ops between modifying uaddr and reaching the
read(hb->waiters) point (including the real atomic op on the waiter
side) are good enough to paper over the issue.

Thanks,

tglx





Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Darren Hart
On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> On Sat, 23 Nov 2013, Thomas Gleixner wrote:
> > On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> > So with the atomic ops you are changing that to:
> > 
> > CPU 0                              CPU 1
> > 
> > val = *futex;
> > futex_wait(futex, val);
> > 
> > spin_lock(&hb->lock);
> > 
> > atomic_inc(&hb->waiters);
> > uval = *futex;
> >                                    *futex = newval;
> > 
> > if (uval != val) {                 futex_wake();
> >    spin_unlock(&hb->lock);         if (!atomic_read(&hb->waiters))
> >    return;                            return;
> > }
> >                                    spin_lock(&hb->lock);
> > plist_add(hb, self);
> > spin_unlock(&hb->lock);
> > schedule();                        wakeup_waiters(hb);
> > ...
> > 
> > which restores the ordering guarantee that the hash bucket lock
> > provided so far.
> 
> Actually that's not true by design, it just happens to work.
> 
> atomic_inc() on x86 is a "lock incl".
> 
>  The LOCK prefix just guarantees that the cache line which is affected
>  by the INCL is locked. And it guarantees that locked operations
>  serialize all outstanding load and store operations.
> 
>  But Documentation/atomic_ops.txt says about atomic_inc():
> 
>  "One very important aspect of these two routines is that they DO NOT
>   require any explicit memory barriers.  They need only perform the
>   atomic_t counter update in an SMP safe manner."
> 
>  So while this has a barrier on x86, it's not guaranteed.


But it is guaranteed to be "in an SMP safe manner"... which I guess just
means that two writes will not intermix bytes, but no guarantee that the
value will be visible to other CPUs unless some kind of barrier is
explicitly imposed.

Correct?


> atomic_read() is a simple read.
> 
>  This does not guarantee anything. And if you read
>  Documentation/atomic_ops.txt it's well documented:
> 
>  "*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"
> 
> 
> So now your code melts down to:
> 
>  write(hb->waiters)   |   write(uaddr)
>  mb                   |   read(hb->waiters)
>  read(uaddr)
> 
> I fear you simply managed to make the window small enough that your
> testing was no longer able to expose it.

Does seem to be the case.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Davidlohr Bueso wrote:
> On Mon, 2013-11-25 at 18:32 +0100, Thomas Gleixner wrote:
> > If the smp_mb() is heavy weight, then it will hurt massively in the
> > case where the hash bucket is not empty, because we add the cost of
> > the smp_mb() for no gain.
> > 
> > In that context it would also be helpful to measure the overhead on
> > x86 for the !empty case.
> 
> Absolutely, I will add these comparisons. If we do notice that we end up
> hurting the !empty case, would the current patch using atomic ops still
> be considered? We have made sure that none of the changes in this set
> affects performance on other workloads/smaller systems.

Please read my last reply to the atomic ops approach.

Aside from that, we need numbers for a significant range of !x86.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Thomas Gleixner wrote:
> On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> So with the atomic ops you are changing that to:
> 
> CPU 0                              CPU 1
> 
> val = *futex;
> futex_wait(futex, val);
> 
> spin_lock(&hb->lock);
> 
> atomic_inc(&hb->waiters);
> uval = *futex;
>                                    *futex = newval;
> 
> if (uval != val) {                 futex_wake();
>    spin_unlock(&hb->lock);         if (!atomic_read(&hb->waiters))
>    return;                            return;
> }
>                                    spin_lock(&hb->lock);
> plist_add(hb, self);
> spin_unlock(&hb->lock);
> schedule();                        wakeup_waiters(hb);
> ...
> 
> which restores the ordering guarantee that the hash bucket lock
> provided so far.

Actually that's not true by design, it just happens to work.

atomic_inc() on x86 is a "lock incl".

 The LOCK prefix just guarantees that the cache line which is affected
 by the INCL is locked. And it guarantees that locked operations
 serialize all outstanding load and store operations.

 But Documentation/atomic_ops.txt says about atomic_inc():

 "One very important aspect of these two routines is that they DO NOT
  require any explicit memory barriers.  They need only perform the
  atomic_t counter update in an SMP safe manner."

 So while this has a barrier on x86, it's not guaranteed.

atomic_read() is a simple read.

 This does not guarantee anything. And if you read
 Documentation/atomic_ops.txt it's well documented:

 "*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"


So now your code melts down to:

 write(hb->waiters)   |   write(uaddr)
 mb                   |   read(hb->waiters)
 read(uaddr)

I fear you simply managed to make the window small enough that your
testing was no longer able to expose it.

Thanks,

tglx






Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Davidlohr Bueso
On Mon, 2013-11-25 at 18:32 +0100, Thomas Gleixner wrote:
> On Mon, 25 Nov 2013, Peter Zijlstra wrote:
> > On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> > > On Sat, 23 Nov 2013, Linus Torvalds wrote:
> > > 
> > > > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  
> > > > wrote:
> > > > >
> > > > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > > > space value. The comment in the code is pretty nonsensical:
> > > > >
> > > > >* On the other hand, we insert q and release the hash-bucket only
> > > > >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> > > > >* absorb a wakeup if *uaddr does not match the desired values
> > > > >* while the syscall executes.
> > > > >
> > > > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > > > value. We just have to dequeue in all the error handling cases, but
> > > > > for the fast path it does not matter at all.
> > > > >
> > > > > CPU 0                              CPU 1
> > > > >
> > > > > val = *futex;
> > > > > futex_wait(futex, val);
> > > > >
> > > > > spin_lock(&hb->lock);
> > > > >
> > > > > plist_add(hb, self);
> > > > > smp_wmb();
> > > > >
> > > > > uval = *futex;
> > > > >                                    *futex = newval;
> > > > >                                    futex_wake();
> > > > >
> > > > > smp_rmb();
> > > > > if (plist_empty(hb))
> > > > >    return;
> > > > > ...
> > > > 
> > > > This would seem to be a nicer approach indeed, without needing the
> > > > extra atomics.
> > > 
> > > I went through the issue with Peter and he noticed, that we need
> > > smp_mb() in both places. That's what we have right now with the
> > > spin_lock() and it is required as we need to guarantee that
> > > 
> > >  The waiter observes the change to the uaddr value after it added
> > >  itself to the plist
> > > 
> > >  The waker observes plist not empty if the change to uaddr was made
> > >  after the waiter checked the value.
> > > 
> > > 
> > >   write(plist)        |   write(futex_uaddr)
> > >   mb()                |   mb()
> > >   read(futex_uaddr)   |   read(plist)
> > > 
> > > The spin_lock mb() on the waiter side does not help here because it
> > > happens before the write(plist) and not after it.
> > 
> > Ah, note that spin_lock() is only a smp_mb() on x86; in general it's an
> > ACQUIRE barrier, which is weaker than a full mb and will not suffice in
> > this case even if it were in the right place.
> 
> So now the question is whether this lockless empty check optimization
> which seems to be quite nice on x86 for a particular workload will
> have any negative side effects on other architectures.
> 
> If the smp_mb() is heavy weight, then it will hurt massively in the
> case where the hash bucket is not empty, because we add the cost of
> the smp_mb() for no gain.
> 
> In that context it would also be helpful to measure the overhead on
> x86 for the !empty case.

Absolutely, I will add these comparisons. If we do notice that we end up
hurting the !empty case, would the current patch using atomic ops still
be considered? We have made sure that none of the changes in this set
affects performance on other workloads/smaller systems.

Thanks,
Davidlohr



Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Peter Zijlstra
On Mon, Nov 25, 2013 at 06:32:55PM +0100, Thomas Gleixner wrote:
> In that context it would also be helpful to measure the overhead on
> x86 for the !empty case.

Yes, because mfence is by no means a cheap instruction on x86.


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Peter Zijlstra wrote:
> On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> > On Sat, 23 Nov 2013, Linus Torvalds wrote:
> > 
> > > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  
> > > wrote:
> > > >
> > > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > > space value. The comment in the code is pretty nonsensical:
> > > >
> > > >* On the other hand, we insert q and release the hash-bucket only
> > > >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> > > >* absorb a wakeup if *uaddr does not match the desired values
> > > >* while the syscall executes.
> > > >
> > > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > > value. We just have to dequeue in all the error handling cases, but
> > > > for the fast path it does not matter at all.
> > > >
> > > > CPU 0                              CPU 1
> > > >
> > > > val = *futex;
> > > > futex_wait(futex, val);
> > > >
> > > > spin_lock(&hb->lock);
> > > >
> > > > plist_add(hb, self);
> > > > smp_wmb();
> > > >
> > > > uval = *futex;
> > > >                                    *futex = newval;
> > > >                                    futex_wake();
> > > >
> > > > smp_rmb();
> > > > if (plist_empty(hb))
> > > >    return;
> > > > ...
> > > 
> > > This would seem to be a nicer approach indeed, without needing the
> > > extra atomics.
> > 
> > I went through the issue with Peter and he noticed, that we need
> > smp_mb() in both places. That's what we have right now with the
> > spin_lock() and it is required as we need to guarantee that
> > 
> >  The waiter observes the change to the uaddr value after it added
> >  itself to the plist
> > 
> >  The waker observes plist not empty if the change to uaddr was made
> >  after the waiter checked the value.
> > 
> > 
> > write(plist)        |   write(futex_uaddr)
> > mb()                |   mb()
> > read(futex_uaddr)   |   read(plist)
> > 
> > The spin_lock mb() on the waiter side does not help here because it
> > happens before the write(plist) and not after it.
> 
> Ah, note that spin_lock() is only a smp_mb() on x86; in general it's an
> ACQUIRE barrier, which is weaker than a full mb and will not suffice in
> this case even if it were in the right place.

So now the question is whether this lockless empty check optimization
which seems to be quite nice on x86 for a particular workload will
have any negative side effects on other architectures.

If the smp_mb() is heavy weight, then it will hurt massively in the
case where the hash bucket is not empty, because we add the cost of
the smp_mb() for no gain.

In that context it would also be helpful to measure the overhead on
x86 for the !empty case.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Peter Zijlstra
On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
> On Sat, 23 Nov 2013, Linus Torvalds wrote:
> 
> > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
> > >
> > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > space value. The comment in the code is pretty nonsensical:
> > >
> > >* On the other hand, we insert q and release the hash-bucket only
> > >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> > >* absorb a wakeup if *uaddr does not match the desired values
> > >* while the syscall executes.
> > >
> > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > value. We just have to dequeue in all the error handling cases, but
> > > for the fast path it does not matter at all.
> > >
> > > CPU 0                              CPU 1
> > >
> > > val = *futex;
> > > futex_wait(futex, val);
> > >
> > > spin_lock(&hb->lock);
> > >
> > > plist_add(hb, self);
> > > smp_wmb();
> > >
> > > uval = *futex;
> > >                                    *futex = newval;
> > >                                    futex_wake();
> > >
> > > smp_rmb();
> > > if (plist_empty(hb))
> > >    return;
> > > ...
> > 
> > This would seem to be a nicer approach indeed, without needing the
> > extra atomics.
> 
> I went through the issue with Peter and he noticed, that we need
> smp_mb() in both places. That's what we have right now with the
> spin_lock() and it is required as we need to guarantee that
> 
>  The waiter observes the change to the uaddr value after it added
>  itself to the plist
> 
>  The waker observes plist not empty if the change to uaddr was made
>  after the waiter checked the value.
> 
> 
>   write(plist)        |   write(futex_uaddr)
>   mb()                |   mb()
>   read(futex_uaddr)   |   read(plist)
> 
> The spin_lock mb() on the waiter side does not help here because it
> happens before the write(plist) and not after it.

Ah, note that spin_lock() is only a smp_mb() on x86; in general it's an
ACQUIRE barrier, which is weaker than a full mb and will not suffice in
this case even if it were in the right place.


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Linus Torvalds wrote:

> On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
> >
> > Now the question is why we queue the waiter _AFTER_ reading the user
> > space value. The comment in the code is pretty nonsensical:
> >
> >* On the other hand, we insert q and release the hash-bucket only
> >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> >* absorb a wakeup if *uaddr does not match the desired values
> >* while the syscall executes.
> >
> > There is no reason why we cannot queue _BEFORE_ reading the user space
> > value. We just have to dequeue in all the error handling cases, but
> > for the fast path it does not matter at all.
> >
> > CPU 0                              CPU 1
> >
> > val = *futex;
> > futex_wait(futex, val);
> >
> > spin_lock(&hb->lock);
> >
> > plist_add(hb, self);
> > smp_wmb();
> >
> > uval = *futex;
> >                                    *futex = newval;
> >                                    futex_wake();
> >
> > smp_rmb();
> > if (plist_empty(hb))
> >    return;
> > ...
> 
> This would seem to be a nicer approach indeed, without needing the
> extra atomics.

I went through the issue with Peter and he noticed, that we need
smp_mb() in both places. That's what we have right now with the
spin_lock() and it is required as we need to guarantee that

 The waiter observes the change to the uaddr value after it added
 itself to the plist

 The waker observes plist not empty if the change to uaddr was made
 after the waiter checked the value.


 write(plist)        |   write(futex_uaddr)
 mb()                |   mb()
 read(futex_uaddr)   |   read(plist)

The spin_lock mb() on the waiter side does not help here because it
happens before the write(plist) and not after it.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Davidlohr Bueso wrote:
> On Sat, 2013-11-23 at 19:46 -0800, Linus Torvalds wrote:
> > On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
> > >
> > > Now the question is why we queue the waiter _AFTER_ reading the user
> > > space value. The comment in the code is pretty nonsensical:
> > >
> > >* On the other hand, we insert q and release the hash-bucket only
> > >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> > >* absorb a wakeup if *uaddr does not match the desired values
> > >* while the syscall executes.
> > >
> > > There is no reason why we cannot queue _BEFORE_ reading the user space
> > > value. We just have to dequeue in all the error handling cases, but
> > > for the fast path it does not matter at all.
> > >
> > > CPU 0                              CPU 1
> > >
> > > val = *futex;
> > > futex_wait(futex, val);
> > >
> > > spin_lock(&hb->lock);
> > >
> > > plist_add(hb, self);
> > > smp_wmb();
> > >
> > > uval = *futex;
> > >                                    *futex = newval;
> > >                                    futex_wake();
> > >
> > > smp_rmb();
> > > if (plist_empty(hb))
> > >    return;
> > > ...
> > 
> > This would seem to be a nicer approach indeed, without needing the
> > extra atomics.
> 
> Yep, I think we can all agree that doing this optimization without atomic
> ops is a big plus.
> 
> > 
> > Davidlohr, mind trying Thomas' approach?
> 
> I just took a quick look and it seems pretty straightforward, but not
> without some details to consider. We basically have to redo/reorder
> futex_wait_setup(), which checks that uval == val, and
> futex_wait_queue_me(), which adds the task to the list and blocks. Now,
> both futex_wait() and futex_wait_requeue_pi() have this logic, but since
> we don't use futex_wake() to wakeup tasks on pi futex_qs, I believe it's
> ok to only change futex_wait(), while the order of the uval checking
> doesn't matter for futex_wait_requeue_pi() so it can stay as is.

There is no mechanism which prevents a futex_wake() call on the inner
futex of the wait_requeue_pi mechanism. So no, we have to change both.

Futexes are no place for belief. Either you understand them
completely or you just leave them alone.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Davidlohr Bueso wrote:
 On Sat, 2013-11-23 at 19:46 -0800, Linus Torvalds wrote:
  On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner t...@linutronix.de wrote:
  
   Now the question is why we queue the waiter _AFTER_ reading the user
   space value. The comment in the code is pretty non sensical:
  
  * On the other hand, we insert q and release the hash-bucket only
  * after testing *uaddr.  This guarantees that futex_wait() will NOT
  * absorb a wakeup if *uaddr does not match the desired values
  * while the syscall executes.
  
   There is no reason why we cannot queue _BEFORE_ reading the user space
   value. We just have to dequeue in all the error handling cases, but
   for the fast path it does not matter at all.
  
   CPU 0   CPU 1
  
   val = *futex;
   futex_wait(futex, val);
  
   spin_lock(hb-lock);
  
   plist_add(hb, self);
   smp_wmb();
  
   uval = *futex;
   *futex = newval;
   futex_wake();
  
   smp_rmb();
   if (plist_empty(hb))
  return;
   ...
  
  This would seem to be a nicer approach indeed, without needing the
  extra atomics.
 
 Yep, I think we can all agree that doing this optization without atomic
 ops is a big plus.
 
  
  Davidlohr, mind trying Thomas' approach?
 
 I just took a quick look and it seems pretty straightforward, but not
 without some details to consider. We basically have to redo/reorder
 futex_wait_setup(), which checks that uval == val, and
 futex_wait_queue_me(), which adds the task to the list and blocks. Now,
 both futex_wait() and futex_wait_requeue_pi() have this logic, but since
 we don't use futex_wake() to wakeup tasks on pi futex_qs, I believe it's
 ok to only change futex_wait(), while the order of the uval checking
 doesn't matter for futex_wait_requeue_pi() so it can stay as is.

There is no mechanism which prevents a futex_wake() call on the inner
futex of the wait_requeue_pi mechanism. So no, we have to change both.

futexes are no place for believe. Either you understand them
completely or you just leave them alone.

Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Linus Torvalds wrote:

 On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner t...@linutronix.de wrote:
 
  Now the question is why we queue the waiter _AFTER_ reading the user
  space value. The comment in the code is pretty non sensical:
 
 * On the other hand, we insert q and release the hash-bucket only
 * after testing *uaddr.  This guarantees that futex_wait() will NOT
 * absorb a wakeup if *uaddr does not match the desired values
 * while the syscall executes.
 
  There is no reason why we cannot queue _BEFORE_ reading the user space
  value. We just have to dequeue in all the error handling cases, but
  for the fast path it does not matter at all.
 
  CPU 0   CPU 1
 
  val = *futex;
  futex_wait(futex, val);
 
  spin_lock(hb-lock);
 
  plist_add(hb, self);
  smp_wmb();
 
  uval = *futex;
  *futex = newval;
  futex_wake();
 
  smp_rmb();
  if (plist_empty(hb))
 return;
  ...
 
 This would seem to be a nicer approach indeed, without needing the
 extra atomics.

I went through the issue with Peter and he noticed, that we need
smp_mb() in both places. That's what we have right now with the
spin_lock() and it is required as we need to guarantee that

 The waiter observes the change to the uaddr value after it added
 itself to the plist

 The waker observes plist not empty if the change to uaddr was made
 after the waiter checked the value.


write(plist)|   write(futex_uaddr)
mb()|   mb()
read(futex_uaddr)   |   read(plist)

The spin_lock mb() on the waiter side does not help here because it
happpens before the write(plist) and not after it.

Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Peter Zijlstra
On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
 On Sat, 23 Nov 2013, Linus Torvalds wrote:
 
  On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner t...@linutronix.de wrote:
  
   Now the question is why we queue the waiter _AFTER_ reading the user
   space value. The comment in the code is pretty non sensical:
  
  * On the other hand, we insert q and release the hash-bucket only
  * after testing *uaddr.  This guarantees that futex_wait() will NOT
  * absorb a wakeup if *uaddr does not match the desired values
  * while the syscall executes.
  
   There is no reason why we cannot queue _BEFORE_ reading the user space
   value. We just have to dequeue in all the error handling cases, but
   for the fast path it does not matter at all.
  
   CPU 0   CPU 1
  
   val = *futex;
   futex_wait(futex, val);
  
   spin_lock(hb-lock);
  
   plist_add(hb, self);
   smp_wmb();
  
   uval = *futex;
   *futex = newval;
   futex_wake();
  
   smp_rmb();
   if (plist_empty(hb))
  return;
   ...
  
  This would seem to be a nicer approach indeed, without needing the
  extra atomics.
 
 I went through the issue with Peter and he noticed, that we need
 smp_mb() in both places. That's what we have right now with the
 spin_lock() and it is required as we need to guarantee that
 
  The waiter observes the change to the uaddr value after it added
  itself to the plist
 
  The waker observes plist not empty if the change to uaddr was made
  after the waiter checked the value.
 
 
   write(plist)|   write(futex_uaddr)
   mb()|   mb()
   read(futex_uaddr)   |   read(plist)
 
 The spin_lock mb() on the waiter side does not help here because it
 happpens before the write(plist) and not after it.

Ah, note that spin_lock() is only a smp_mb() on x86, in general its an
ACQUIRE barrier which is weaker than a full mb and will not suffice in
this case even it if were in the right place.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Peter Zijlstra wrote:
 On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
  On Sat, 23 Nov 2013, Linus Torvalds wrote:
  
   On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner t...@linutronix.de 
   wrote:
   
Now the question is why we queue the waiter _AFTER_ reading the user
space value. The comment in the code is pretty non sensical:
   
   * On the other hand, we insert q and release the hash-bucket only
   * after testing *uaddr.  This guarantees that futex_wait() will NOT
   * absorb a wakeup if *uaddr does not match the desired values
   * while the syscall executes.
   
There is no reason why we cannot queue _BEFORE_ reading the user space
value. We just have to dequeue in all the error handling cases, but
for the fast path it does not matter at all.
   
CPU 0   CPU 1
   
val = *futex;
futex_wait(futex, val);
   
spin_lock(hb-lock);
   
plist_add(hb, self);
smp_wmb();
   
uval = *futex;
*futex = newval;
futex_wake();
   
smp_rmb();
if (plist_empty(hb))
   return;
...
   
   This would seem to be a nicer approach indeed, without needing the
   extra atomics.
  
  I went through the issue with Peter and he noticed, that we need
  smp_mb() in both places. That's what we have right now with the
  spin_lock() and it is required as we need to guarantee that
  
   The waiter observes the change to the uaddr value after it added
   itself to the plist
  
   The waker observes plist not empty if the change to uaddr was made
   after the waiter checked the value.
  
  
  write(plist)|   write(futex_uaddr)
  mb()|   mb()
  read(futex_uaddr)   |   read(plist)
  
  The spin_lock mb() on the waiter side does not help here because it
  happens before the write(plist) and not after it.
 
 Ah, note that spin_lock() is only a smp_mb() on x86; in general it's an
 ACQUIRE barrier, which is weaker than a full mb and will not suffice in
 this case even if it were in the right place.

So now the question is whether this lockless empty check optimization
which seems to be quite nice on x86 for a particular workload will
have any negative side effects on other architectures.

If the smp_mb() is heavyweight, then it will hurt massively in the
case where the hash bucket is not empty, because we add the price for
the smp_mb() just for no gain.

In that context it would also be helpful to measure the overhead on
x86 for the !empty case.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Peter Zijlstra
On Mon, Nov 25, 2013 at 06:32:55PM +0100, Thomas Gleixner wrote:
 In that context it would also be helpful to measure the overhead on
 x86 for the !empty case.

Yes, because mfence is by no means a cheap instruction on x86.


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Davidlohr Bueso
On Mon, 2013-11-25 at 18:32 +0100, Thomas Gleixner wrote:
 On Mon, 25 Nov 2013, Peter Zijlstra wrote:
  On Mon, Nov 25, 2013 at 05:23:51PM +0100, Thomas Gleixner wrote:
   On Sat, 23 Nov 2013, Linus Torvalds wrote:
   
On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner t...@linutronix.de 
wrote:

 Now the question is why we queue the waiter _AFTER_ reading the user
  space value. The comment in the code is pretty nonsensical:

* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr.  This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.

 There is no reason why we cannot queue _BEFORE_ reading the user space
 value. We just have to dequeue in all the error handling cases, but
 for the fast path it does not matter at all.

 CPU 0   CPU 1

 val = *futex;
 futex_wait(futex, val);

 spin_lock(&hb->lock);

 plist_add(hb, self);
 smp_wmb();

 uval = *futex;
 *futex = newval;
 futex_wake();

 smp_rmb();
 if (plist_empty(hb))
return;
 ...

This would seem to be a nicer approach indeed, without needing the
extra atomics.
   
   I went through the issue with Peter and he noticed, that we need
   smp_mb() in both places. That's what we have right now with the
   spin_lock() and it is required as we need to guarantee that
   
The waiter observes the change to the uaddr value after it added
itself to the plist
   
The waker observes plist not empty if the change to uaddr was made
after the waiter checked the value.
   
   
 write(plist)|   write(futex_uaddr)
 mb()|   mb()
 read(futex_uaddr)   |   read(plist)
   
   The spin_lock mb() on the waiter side does not help here because it
   happens before the write(plist) and not after it.
  
  Ah, note that spin_lock() is only a smp_mb() on x86; in general it's an
  ACQUIRE barrier, which is weaker than a full mb and will not suffice in
  this case even if it were in the right place.
 
 So now the question is whether this lockless empty check optimization
 which seems to be quite nice on x86 for a particular workload will
 have any negative side effects on other architectures.
 
 If the smp_mb() is heavyweight, then it will hurt massively in the
 case where the hash bucket is not empty, because we add the price for
 the smp_mb() just for no gain.
 
 In that context it would also be helpful to measure the overhead on
 x86 for the !empty case.

Absolutely, I will add these comparisons. If we do notice that we end up
hurting the !empty case, would the current patch using atomic ops still
be considered? We have made sure that none of the changes in this set
affects performance on other workloads/smaller systems.

Thanks,
Davidlohr



Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Sat, 23 Nov 2013, Thomas Gleixner wrote:
 On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
 So with the atomic ops you are changing that to:
 
 CPU 0                                   CPU 1
 
 val = *futex;
 futex_wait(futex, val);
 
 spin_lock(&hb->lock);
 
 atomic_inc(&hb->waiters);
 uval = *futex;
                                         *futex = newval;
 
 if (uval != val) {                      futex_wake();
    spin_unlock(&hb->lock);              if (!atomic_read(&hb->waiters))
    return;                                 return;
 }
                                         spin_lock(&hb->lock);
 plist_add(hb, self);
 spin_unlock(&hb->lock);
 schedule();                             wakeup_waiters(hb);
 ...
 
 which restores the ordering guarantee, which the hash bucket lock
 provided so far.

Actually that's not true by design, it just happens to work.

atomic_inc() on x86 is a lock incl.

 The LOCK prefix just guarantees that the cache line which is affected
 by the INCL is locked. And it guarantees that locked operations
 serialize all outstanding load and store operations.

 But Documentation/atomic_ops.txt says about atomic_inc():

 One very important aspect of these two routines is that they DO NOT
  require any explicit memory barriers.  They need only perform the
  atomic_t counter update in an SMP safe manner.

 So while this has a barrier on x86, it's not guaranteed.

atomic_read() is a simple read.

 This does not guarantee anything. And if you read
 Documentation/atomic_ops.txt it's well documented:

 *** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***


So now your code melts down to:

 write(hb->waiters)   |   write(uaddr)
 mb                   |   read(hb->waiters)
 read(uaddr)          |

I fear you simply managed to make the window small enough that your
testing was no longer able to expose it.

Thanks,

tglx






Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Davidlohr Bueso wrote:
 On Mon, 2013-11-25 at 18:32 +0100, Thomas Gleixner wrote:
  If the smp_mb() is heavyweight, then it will hurt massively in the
  case where the hash bucket is not empty, because we add the price for
  the smp_mb() just for no gain.
  
  In that context it would also be helpful to measure the overhead on
  x86 for the !empty case.
 
 Absolutely, I will add these comparisons. If we do notice that we end up
 hurting the !empty case, would the current patch using atomic ops still
 be considered? We have made sure that none of the changes in this set
 affects performance on other workloads/smaller systems.

Please read my last reply to the atomic ops approach.

Aside of that we need numbers for a significant range of !x86.

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Darren Hart
On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
 On Sat, 23 Nov 2013, Thomas Gleixner wrote:
  On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
  So with the atomic ops you are changing that to:
  
  CPU 0                                   CPU 1
  
  val = *futex;
  futex_wait(futex, val);
  
  spin_lock(&hb->lock);
  
  atomic_inc(&hb->waiters);
  uval = *futex;
                                          *futex = newval;
  
  if (uval != val) {                      futex_wake();
     spin_unlock(&hb->lock);              if (!atomic_read(&hb->waiters))
     return;                                 return;
  }
                                          spin_lock(&hb->lock);
  plist_add(hb, self);
  spin_unlock(&hb->lock);
  schedule();                             wakeup_waiters(hb);
  ...
  
  which restores the ordering guarantee, which the hash bucket lock
  provided so far.
 
 Actually that's not true by design, it just happens to work.
 
 atomic_inc() on x86 is a lock incl.
 
  The LOCK prefix just guarantees that the cache line which is affected
  by the INCL is locked. And it guarantees that locked operations
  serialize all outstanding load and store operations.
 
  But Documentation/atomic_ops.txt says about atomic_inc():
 
  One very important aspect of these two routines is that they DO NOT
   require any explicit memory barriers.  They need only perform the
   atomic_t counter update in an SMP safe manner.
 
  So while this has a barrier on x86, it's not guaranteed.


But it is guaranteed to be in an SMP safe manner... which I guess just
means that two writes will not intermix bytes, but no guarantee that the
value will be visible to other CPUs unless some kind of barrier is
explicitly imposed.

Correct?


 atomic_read() is a simple read.
 
  This does not guarantee anything. And if you read
  Documentation/atomic_ops.txt it's well documented:
 
  *** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***
 
 
 So now your code melts down to:
 
  write(hb->waiters)   |   write(uaddr)
  mb                   |   read(hb->waiters)
  read(uaddr)          |
 
 I fear you simply managed to make the window small enough that your
 testing was no longer able to expose it.

Does seem to be the case.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-25 Thread Thomas Gleixner
On Mon, 25 Nov 2013, Darren Hart wrote:
 On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
   which restores the ordering guarantee, which the hash bucket lock
   provided so far.
  
  Actually that's not true by design, it just happens to work.
  
  atomic_inc() on x86 is a lock incl.
  
   The LOCK prefix just guarantees that the cache line which is affected
   by the INCL is locked. And it guarantees that locked operations
   serialize all outstanding load and store operations.
  
   But Documentation/atomic_ops.txt says about atomic_inc():
  
   One very important aspect of these two routines is that they DO NOT
require any explicit memory barriers.  They need only perform the
atomic_t counter update in an SMP safe manner.
  
   So while this has a barrier on x86, it's not guaranteed.
 
 
 But it is guaranteed to be in an SMP safe manner... which I guess just
 means that two writes will not intermix bytes, but no guarantee that the
 value will be visible to other CPUs unless some kind of barrier is
 explicitly imposed.
 
 Correct?

Yep, that's what it says.
 
  So now your code melts down to:
  
   write(hb->waiters)   |   write(uaddr)
   mb                   |   read(hb->waiters)
   read(uaddr)          |
  
  I fear you simply managed to make the window small enough that your
  testing was no longer able to expose it.
 
 Does seem to be the case.

You must be aware that between the write(uaddr) and the
read(hb->waiters) there is the syscall, i.e. a user/kernel space
transition.

sysenter/syscall have no documented barriers inside, but we don't know
whether the actual hw implementation provides one, or whether the memory
ops between modifying uaddr and reaching the read(hb->waiters) point,
including the real atomic op on the waiter side, are good enough to
paper over the issue.

Thanks,

tglx





Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-23 Thread Davidlohr Bueso
On Sat, 2013-11-23 at 19:46 -0800, Linus Torvalds wrote:
> On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
> >
> > Now the question is why we queue the waiter _AFTER_ reading the user
> > space value. The comment in the code is pretty nonsensical:
> >
> >* On the other hand, we insert q and release the hash-bucket only
> >* after testing *uaddr.  This guarantees that futex_wait() will NOT
> >* absorb a wakeup if *uaddr does not match the desired values
> >* while the syscall executes.
> >
> > There is no reason why we cannot queue _BEFORE_ reading the user space
> > value. We just have to dequeue in all the error handling cases, but
> > for the fast path it does not matter at all.
> >
> > CPU 0   CPU 1
> >
> > val = *futex;
> > futex_wait(futex, val);
> >
> > spin_lock(&hb->lock);
> >
> > plist_add(hb, self);
> > smp_wmb();
> >
> > uval = *futex;
> > *futex = newval;
> > futex_wake();
> >
> > smp_rmb();
> > if (plist_empty(hb))
> >return;
> > ...
> 
> This would seem to be a nicer approach indeed, without needing the
> extra atomics.

Yep, I think we can all agree that doing this optimization without atomic
ops is a big plus.

> 
> Davidlohr, mind trying Thomas' approach?

I just took a quick look and it seems pretty straightforward, but not
without some details to consider. We basically have to redo/reorder
futex_wait_setup(), which checks that uval == val, and
futex_wait_queue_me(), which adds the task to the list and blocks. Now,
both futex_wait() and futex_wait_requeue_pi() have this logic, but since
we don't use futex_wake() to wakeup tasks on pi futex_qs, I believe it's
ok to only change futex_wait(), while the order of the uval checking
doesn't matter for futex_wait_requeue_pi() so it can stay as is.

The following is the general skeleton of what Thomas is probably
referring to (yes, it needs comments, cleanups, refactoring, etc, etc).
Futexes are easy to screw up but at least this passes a kernel build and
all futextests on a large box.

static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
  ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to = NULL;
struct restart_block *restart;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
int prio, ret = 0;
u32 uval;

if (!bitset)
return -EINVAL;
q.bitset = bitset;

        if (abs_time) {
                to = &timeout;

                hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
                                      CLOCK_REALTIME : CLOCK_MONOTONIC,
                                      HRTIMER_MODE_ABS);
                hrtimer_init_sleeper(to, current);
                hrtimer_set_expires_range_ns(&to->timer, *abs_time,
                                             current->timer_slack_ns);
        }
retry:
        ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_READ);
        if (unlikely(ret != 0))
                goto out;

retry_private:
        hb = queue_lock(&q);
        prio = min(current->normal_prio, MAX_RT_PRIO);

        plist_node_init(&q.list, prio);
        plist_add(&q.list, &hb->chain);
        q.task = current;
        /* do NOT drop the lock here */
        smp_wmb();

        ret = get_futex_value_locked(&uval, uaddr);
        if (ret) {
                plist_del(&q.list, &hb->chain);
                spin_unlock(&hb->lock);

                ret = get_user(uval, uaddr);
                if (ret)
                        goto out_put_key;

                if (!(flags & FLAGS_SHARED))
                        goto retry_private;

                put_futex_key(&q.key);
                goto retry;
        }

        if (uval != val) {
                plist_del(&q.list, &hb->chain);
                spin_unlock(&hb->lock);
                ret = -EWOULDBLOCK;
                goto out_put_key;
        }

        set_current_state(TASK_INTERRUPTIBLE);
        spin_unlock(&hb->lock);

        /* Arm the timer */
        if (to) {
                hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
                if (!hrtimer_active(&to->timer))
                        to->task = NULL;
        }

        /*
         * If we have been removed from the hash list, then another task
         * has tried to wake us, and we can skip the call to schedule().
         */
        if (likely(!plist_node_empty(&q.list))) {
                /*
                 * If the timer has already expired, current will already be
                 * flagged for rescheduling. Only call schedule if there
                 * is no timeout, or if it has yet to expire.
                 */
                if (!to || to->task)
                        freezable_schedule();
        }

Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-23 Thread Linus Torvalds
On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner  wrote:
>
> Now the question is why we queue the waiter _AFTER_ reading the user
> space value. The comment in the code is pretty nonsensical:
>
>* On the other hand, we insert q and release the hash-bucket only
>* after testing *uaddr.  This guarantees that futex_wait() will NOT
>* absorb a wakeup if *uaddr does not match the desired values
>* while the syscall executes.
>
> There is no reason why we cannot queue _BEFORE_ reading the user space
> value. We just have to dequeue in all the error handling cases, but
> for the fast path it does not matter at all.
>
> CPU 0   CPU 1
>
> val = *futex;
> futex_wait(futex, val);
>
> spin_lock(&hb->lock);
>
> plist_add(hb, self);
> smp_wmb();
>
> uval = *futex;
> *futex = newval;
> futex_wake();
>
> smp_rmb();
> if (plist_empty(hb))
>return;
> ...

This would seem to be a nicer approach indeed, without needing the
extra atomics.

Davidlohr, mind trying Thomas' approach?

 Linus


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-23 Thread Thomas Gleixner
On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> > On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
> > > In futex_wake() there is clearly no point in taking the hb->lock if
> > > we know beforehand that there are no tasks to be woken. This comes
> > > at the smaller cost of doing some atomic operations to keep track of
> > > the list's size.
> > 
> > Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> > need the serialization from locking, then afaik you can just do a
> > "plist_head_empty()" without holding the lock.
> 
> I remember this being the original approach, but after noticing some
> strange behavior we quickly decided it wasn't the path. And sure enough,
> I just double checked and tried the patch without atomic ops and can see
> things being off: one of the futextest performance cases is stuck
> blocked on a futex and I couldn't reboot the machine either -- nothing
> apparent in dmesg, just not 100% functional. The thing is, we can only
> avoid taking the lock if nobody else is trying to add itself to the
> list.

Right. The reason is:

CPU 0                                   CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

uval = *futex;
                                        *futex = newval;

if (uval != val) {                      futex_wake();
   spin_unlock(&hb->lock);              if (plist_empty(hb))
   return;                                 return;
}

plist_add(hb, self);
spin_unlock(&hb->lock);
schedule();
...

So the waiter on CPU0 gets queued after the user space value changed
and the waker on CPU1 returns because plist is empty. The waiter therefore
waits forever.

So with the atomic ops you are changing that to:

CPU 0                                   CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

atomic_inc(&hb->waiters);
uval = *futex;
                                        *futex = newval;

if (uval != val) {                      futex_wake();
   spin_unlock(&hb->lock);              if (!atomic_read(&hb->waiters))
   return;                                 return;
}
                                        spin_lock(&hb->lock);
plist_add(hb, self);
spin_unlock(&hb->lock);
schedule();                             wakeup_waiters(hb);
...

which restores the ordering guarantee, which the hash bucket lock
provided so far.

Now the question is why we queue the waiter _AFTER_ reading the user
space value. The comment in the code is pretty nonsensical:

   * On the other hand, we insert q and release the hash-bucket only
   * after testing *uaddr.  This guarantees that futex_wait() will NOT
   * absorb a wakeup if *uaddr does not match the desired values
   * while the syscall executes.

There is no reason why we cannot queue _BEFORE_ reading the user space
value. We just have to dequeue in all the error handling cases, but
for the fast path it does not matter at all.

CPU 0                                   CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

plist_add(hb, self);
smp_wmb();

uval = *futex;
                                        *futex = newval;
                                        futex_wake();

                                        smp_rmb();
                                        if (plist_empty(hb))
                                           return;

                                        spin_lock(&hb->lock);
if (uval != val) {
   plist_del(self);
   spin_unlock(&hb->lock);
   return;
}

spin_unlock(&hb->lock);
schedule();                             wakeup_waiters(hb);
...

Thanks,

tglx


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Darren Hart
On Fri, 2013-11-22 at 19:19 -0800, Davidlohr Bueso wrote:
> On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> > On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
> > > In futex_wake() there is clearly no point in taking the hb->lock if
> > > we know beforehand that there are no tasks to be woken. This comes
> > > at the smaller cost of doing some atomic operations to keep track of
> > > the list's size.
> > 
> > Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> > need the serialization from locking, then afaik you can just do a
> > "plist_head_empty()" without holding the lock.
> 
> I remember this being the original approach, but after noticing some
> strange behavior we quickly decided it wasn't the path. And sure enough,
> I just double checked and tried the patch without atomic ops and can see
> things being off: one of the futextest performance cases is stuck
> blocked on a futex and I couldn't reboot the machine either -- nothing
> apparent in dmesg, just not 100% functional. The thing is, we can only
> avoid taking the lock if nobody else is trying to add itself to the
> list.

In your usage, the worst case scenario is that you detect 0 when locking
may have blocked and found a waiter. Correct?

In this case, you return 0, instead of 1 (or more).

This suggests to me a bug in the futextest testcase. Which test
specifically hung up waiting?

Futex hangs are almost always bad userspace code (my bad userspace code
in this case ;-)

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Darren Hart
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size. Specifically, increment the counter when an element is
> added to the list, and decrement when it is removed. Of course, if the
> counter is 0, then there are no tasks blocked on a futex. Some special
> considerations:
> 
> - increment the counter at queue_lock() as we always end up calling
>   queue_me() which adds the element to the list. Upon any error,
>   queue_unlock() is called for housekeeping, for which we decrement
>   to mach the increment done in queue_lock().

  ^match

> 
> - decrement the counter at __unqueue_me() to reflect when an element is
>   removed from the queue for wakeup related purposes.

__unqueue_futex (not __unqueue_me)


> @@ -999,6 +1001,10 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>   goto out;
>  
>   hb = hash_futex(&key);
> + /* make sure we really have tasks to wakeup */

Nit, but please use proper sentence formatting for consistency with the
rest of the comments in futex.c (most of them anyway).

/* Make sure we really have tasks to wake up. */

Now... I'm not thrilled with adding atomics if we don't need to,
especially for an optimization since the atomics themselves cause enough
problems, especially across a large number of CPUs... I'll respond to
Linus's thread though.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Hart, Darren
On Fri, 2013-11-22 at 21:40 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > In futex_wake() there is clearly no point in taking the hb->lock if
> > we know beforehand that there are no tasks to be woken. This comes
> > at the smaller cost of doing some atomic operations to keep track of
> > the list's size. Specifically, increment the counter when an element is
> > added to the list, and decrement when it is removed. Of course, if the
> > counter is 0, then there are no tasks blocked on a futex. Some special
> > considerations:
> > 
> > - increment the counter at queue_lock() as we always end up calling
> >   queue_me() which adds the element to the list. Upon any error,
> >   queue_unlock() is called for housekeeping, for which we decrement
> >   to mach the increment done in queue_lock().
> > 
> > - decrement the counter at __unqueue_me() to reflect when an element is
> >   removed from the queue for wakeup related purposes.
> 
> What is the problem you are trying to solve here?

Apologies, too quick on the trigger. I see plenty of detail in 0/5. Will
spend some time reviewing that.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Darren Hart
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size. Specifically, increment the counter when an element is
> added to the list, and decrement when it is removed. Of course, if the
> counter is 0, then there are no tasks blocked on a futex. Some special
> considerations:
> 
> - increment the counter at queue_lock() as we always end up calling
>   queue_me() which adds the element to the list. Upon any error,
>   queue_unlock() is called for housekeeping, for which we decrement
>   to mach the increment done in queue_lock().
> 
> - decrement the counter at __unqueue_me() to reflect when an element is
>   removed from the queue for wakeup related purposes.

What is the problem you are trying to solve here?

The fast path in futexes is not calling into the kernel at all; if
you've made the syscall, you've already lost. What kind of improvement
are you seeing by adding this optimization? Is it worth the introduction
of atomic operations into a file known to the state of California to
increase the risk of liver failure?

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Waiman Long

On 11/22/2013 08:25 PM, Linus Torvalds wrote:

On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:

In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the smaller cost of doing some atomic operations to keep track of
the list's size.

Hmm. Why? Afaik, you only care about "empty or not". And if you don't
need the serialization from locking, then afaik you can just do a
"plist_head_empty()" without holding the lock.

NOTE!

The "list_empty()" function is very much designed to work even without
holding a lock (as long as the head itself exists reliably, of course)
BUT you have to then guarantee yourself that your algorithm doesn't
have any races wrt other CPU's adding an entry to the list at the same
time. Not holding a lock obviously means that you are not serialized
against that.. We've had problems with people doing

 if (!list_empty(waiters))
 wake_up_list(..)

because they wouldn't wake people up who just got added.

But considering that your atomic counter checking has the same lack of
serialization, at least the plist_head_empty() check shouldn't be any
worse than that counter thing.. And doesn't need any steenking atomic
ops or a new counter field.

 Linus


Before the patch, a waker will wake up one or more waiters who called 
spin_lock() before the waker. If we just use plist_head_empty(), we will 
miss waiters who have called spin_lock() but have not yet got the lock 
or inserted themselves into the wait list. By adding an atomic counter for 
checking purposes, we again have a single point where any waiter who passes 
that point (incrementing the atomic counter) can be woken up by a waker who 
checks the atomic counter after the increment. This acts sort of like 
serialization without using a lock.


-Longman


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Davidlohr Bueso
On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
> > In futex_wake() there is clearly no point in taking the hb->lock if
> > we know beforehand that there are no tasks to be woken. This comes
> > at the smaller cost of doing some atomic operations to keep track of
> > the list's size.
> 
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.

I remember this being the original approach, but after noticing some
strange behavior we quickly decided it wasn't the path. And sure enough,
I just double checked and tried the patch without atomic ops and can see
things being off: one of the futextest performance cases is stuck
blocked on a futex and I couldn't reboot the machine either -- nothing
apparent in dmesg, just not 100% functional. The thing is, we can only
avoid taking the lock if nobody else is trying to add itself to the
list.

Thanks,
Davidlohr



Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Jason Low
On Fri, Nov 22, 2013 at 5:25 PM, Linus Torvalds
 wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
>> In futex_wake() there is clearly no point in taking the hb->lock if
>> we know beforehand that there are no tasks to be woken. This comes
>> at the smaller cost of doing some atomic operations to keep track of
>> the list's size.
>
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.
>
> NOTE!
>
> The "list_empty()" function is very much designed to work even without
> holding a lock (as long as the head itself exists reliably, of course)
> BUT you have to then guarantee yourself that your algorithm doesn't
> have any races wrt other CPU's adding an entry to the list at the same
> time. Not holding a lock obviously means that you are not serialized
> against that.. We've had problems with people doing
>
> if (!list_empty(waiters))
> wake_up_list(..)
>
> because they wouldn't wake people up who just got added.
>
> But considering that your atomic counter checking has the same lack of
> serialization, at least the plist_head_empty() check shouldn't be any
> worse than that counter thing.. And doesn't need any steenking atomic
> ops or a new counter field.

Hi Linus,

In this patch, since we do the atomic increment before taking the
hb->lock in situations where we may potentially add a task to the
list, we would only skip attempting the wakeup if the list is empty
and no other thread has already taken the hb->lock and is in the
process of adding a task to the list. Would this be enough to protect
it from any race condition?

Thanks,
Jason


Re: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

2013-11-22 Thread Linus Torvalds
On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso  wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size.

Hmm. Why? Afaik, you only care about "empty or not". And if you don't
need the serialization from locking, then afaik you can just do a
"plist_head_empty()" without holding the lock.

NOTE!

The "list_empty()" function is very much designed to work even without
holding a lock (as long as the head itself exists reliably, of course)
BUT you have to then guarantee yourself that your algorithm doesn't
have any races wrt other CPU's adding an entry to the list at the same
time. Not holding a lock obviously means that you are not serialized
against that.. We've had problems with people doing

if (!list_empty(waiters))
wake_up_list(..)

because they wouldn't wake people up who just got added.

But considering that your atomic counter checking has the same lack of
serialization, at least the plist_head_empty() check shouldn't be any
worse than that counter thing.. And doesn't need any steenking atomic
ops or a new counter field.

Linus

