Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-10-02 Thread Peter Zijlstra
On Tue, Oct 02, 2018 at 02:19:53PM +0100, Will Deacon wrote:
> On Mon, Oct 01, 2018 at 10:00:28PM +0200, Peter Zijlstra wrote:

> > Let me draw a picture of that..
> > 
> > 
> >         CPU0            CPU1            CPU2            CPU3
> >
> > 0)                                                      lock
> >                                                           trylock -> (0,0,1)
> > 1)      lock
> >           trylock /* fail */
> >
> > 2)                      lock
> >                           trylock /* fail */
> >                           tas-pending -> (0,1,1)
> >                           wait-locked
> >
> > 3)                                      lock
> >                                           trylock /* fail */
> >                                           tas-pending /* fail */
> >
> > 4)                                                      unlock -> (0,1,0)
> >                           clr_pnd_set_lck -> (0,0,1)
> >                           unlock -> (0,0,0)
> >
> > 5)        tas-pending -> (0,1,0)
> >           read-val -> (0,1,0)
> > 6)        clr_pnd_set_lck -> (0,0,1)
> > 7)                                        xchg_tail -> (n,0,1)
> >                                           load_acquire <- (n,0,0) (from-4)
> > 8)                                        cmpxchg /* fail */
> >                                           set_locked()
> > 
> > > Is there something I'm missing that means this can't happen? I suppose
> > > cacheline granularity ends up giving serialisation between (4) and (7),
> > > but I'd *much* prefer not to rely on that because it feels horribly
> > > fragile.
> > 
> > Well, on x86 atomics are fully ordered, so the xchg_tail() does in
> > fact have smp_mb() in it, and that should order it sufficiently for that
> > not to happen, I think.
> 
> Hmm, does that actually help, though? I still think you're relying on the
> cache-coherence protocol to serialise the xchg() on pending before the
> xchg_tail(), which I think is fragile because they don't actually overlap.

Maybe, I suspect TSO makes it work, but see the below alternative.

---
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -6,9 +6,29 @@
 #include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 #include <asm/paravirt.h>
+#include <asm/rmwcc.h>
 
 #define _Q_PENDING_LOOPS   (1 << 9)
 
+static __always_inline bool __test_and_set_pending(struct qspinlock *lock)
+{
+   GEN_BINARY_RMWcc(LOCK_PREFIX "btsl",
+                        lock->val.counter, "Ir", _Q_PENDING_OFFSET, "%0", c);
+}
+
+#define queued_set_pending_fetch_acquire queued_set_pending_fetch_acquire
+static inline u32 queued_set_pending_fetch_acquire(struct qspinlock *lock)
+{
+   u32 val = 0;
+
+   if (__test_and_set_pending(lock))
+   val |= _Q_PENDING_VAL;
+
+   val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;
+
+   return val;
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -232,6 +232,20 @@ static __always_inline u32 xchg_tail(str
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * queued_set_pending_fetch_acquire - fetch the whole lock value and set pending
+ * @lock : Pointer to queued spinlock structure
+ * Return: The previous lock value
+ *
+ * *,*,* -> *,1,*
+ */
+#ifndef queued_set_pending_fetch_acquire
+static __always_inline u32 queued_set_pending_fetch_acquire(struct qspinlock *lock)
+{
+   return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+#endif
+
+/**
  * set_locked - Set the lock bit and own the lock
  * @lock: Pointer to queued spinlock structure
  *
@@ -328,7 +342,7 @@ void queued_spin_lock_slowpath(struct qs
 *
 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
 */
-   val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+   val = queued_set_pending_fetch_acquire(lock);
 
/*
 * If we observe contention, there is a concurrent locker.
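
[ Aside, purely illustrative: modulo the asm-goto/flag-output plumbing of
  GEN_BINARY_RMWcc(), the __test_and_set_pending() helper in the hunk above
  boils down to something like the sketch below.  The function name is made
  up and this is a simplification, not the macro's actual expansion. ]

/*
 * Sketch only: LOCK BTSL atomically sets bit _Q_PENDING_OFFSET in the
 * full 32-bit lock word and returns the bit's previous value via the
 * carry flag.  It is a whole-word RMW, but only the old *bit* comes
 * back, which is why queued_set_pending_fetch_acquire() needs the
 * separate atomic_read() to reconstruct the rest of the word.
 */
static inline bool test_and_set_pending_sketch(struct qspinlock *lock)
{
        bool old;

        asm volatile(LOCK_PREFIX "btsl %[bit], %[word]"
                     : "=@ccc" (old), [word] "+m" (lock->val.counter)
                     : [bit] "Ir" (_Q_PENDING_OFFSET)
                     : "memory");

        return old;
}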


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-10-02 Thread Andrea Parri
On Tue, Oct 02, 2018 at 02:22:09PM +0100, Will Deacon wrote:
> On Tue, Oct 02, 2018 at 02:31:52PM +0200, Andrea Parri wrote:
> > > consider this scenario with your patch:
> > > 
> > > 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
> > >pending.
> > > 
> > > 2. CPU1 comes in and sets pending, spins on locked
> > > 
> > > 3. CPU2 sees a pending and locked val, and is about to enter the head of
> > >the waitqueue (i.e. it's right before xchg_tail()).
> > > 
> > > 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
> > >it, so pending and locked are now 0.
> > > 
> > > 5. CPU0 sets pending and reads back zeroes for the other fields
> > > 
> > > 6. CPU0 clears pending and sets locked -- it now has the lock
> > > 
> > > 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
> > >for locked and pending to go clear. However, it reads a stale value
> > >from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> > > 
> > > 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
> > >point we're hosed, because both CPU2 and CPU0 have the lock.
> > 
> > Thanks for pointing this out.   I am wondering: can't we have a similar
> > scenario with the current code (i.e., w/o these patches): what prevents
> > the scenario reported below, following Peter's diagram, from happening?
> 
> The xchg_tail() in step (7) reads from the fetch_or_acquire() in step (5),
> so I don't think we can see a stale value in the subsequent (overlapping)
> acquire load.

I see, thanks for the clarification.

  Andrea


> 
> Will
> 
> >         CPU0            CPU1            CPU2            CPU3
> >
> > 0)                                                      lock
> >                                                           trylock -> (0,0,1)
> > 1)      lock
> >           trylock /* fail */
> >
> > 2)                      lock
> >                           trylock /* fail */
> >                           fetch_or_acquire -> (0,1,1)
> >                           wait-locked
> >
> > 3)                                      lock
> >                                           trylock /* fail */
> >                                           goto queue
> >
> > 4)                                                      unlock -> (0,1,0)
> >                           clr_pnd_set_lck -> (0,0,1)
> >                           unlock -> (0,0,0)
> >
> > 5)        fetch_or_acquire -> (0,1,0)
> > 6)        clr_pnd_set_lck -> (0,0,1)
> > 7)                                        xchg_tail -> (n,0,1)
> >                                           load_acquire <- (n,0,0) (from-4)
> > 8)                                        cmpxchg /* fail */
> >                                           set_locked()


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-10-02 Thread Will Deacon
On Tue, Oct 02, 2018 at 02:31:52PM +0200, Andrea Parri wrote:
> > consider this scenario with your patch:
> > 
> > 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
> >pending.
> > 
> > 2. CPU1 comes in and sets pending, spins on locked
> > 
> > 3. CPU2 sees a pending and locked val, and is about to enter the head of
> >the waitqueue (i.e. it's right before xchg_tail()).
> > 
> > 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
> >it, so pending and locked are now 0.
> > 
> > 5. CPU0 sets pending and reads back zeroes for the other fields
> > 
> > 6. CPU0 clears pending and sets locked -- it now has the lock
> > 
> > 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
> >for locked and pending to go clear. However, it reads a stale value
> >from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> > 
> > 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
> >point we're hosed, because both CPU2 and CPU0 have the lock.
> 
> Thanks for pointing this out.   I am wondering: can't we have a similar
> scenario with the current code (i.e., w/o these patches): what prevents
> the scenario reported below, following Peter's diagram, from happening?

The xchg_tail() in step (7) reads from the fetch_or_acquire() in step (5),
so I don't think we can see a stale value in the subsequent (overlapping)
acquire load.

Will

>         CPU0            CPU1            CPU2            CPU3
>
> 0)                                                      lock
>                                                           trylock -> (0,0,1)
> 1)      lock
>           trylock /* fail */
>
> 2)                      lock
>                           trylock /* fail */
>                           fetch_or_acquire -> (0,1,1)
>                           wait-locked
>
> 3)                                      lock
>                                           trylock /* fail */
>                                           goto queue
>
> 4)                                                      unlock -> (0,1,0)
>                           clr_pnd_set_lck -> (0,0,1)
>                           unlock -> (0,0,0)
>
> 5)        fetch_or_acquire -> (0,1,0)
> 6)        clr_pnd_set_lck -> (0,0,1)
> 7)                                        xchg_tail -> (n,0,1)
>                                           load_acquire <- (n,0,0) (from-4)
> 8)                                        cmpxchg /* fail */
>                                           set_locked()


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-10-02 Thread Will Deacon
On Mon, Oct 01, 2018 at 10:00:28PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 01, 2018 at 06:17:00PM +0100, Will Deacon wrote:
> > Thanks for chewing up my afternoon ;)
> 
> I'll get you a beer in EDI ;-)

Just one?!

> > But actually,
> > consider this scenario with your patch:
> > 
> > 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
> >pending.
> > 
> > 2. CPU1 comes in and sets pending, spins on locked
> > 
> > 3. CPU2 sees a pending and locked val, and is about to enter the head of
> >the waitqueue (i.e. it's right before xchg_tail()).
> > 
> > 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
> >it, so pending and locked are now 0.
> > 
> > 5. CPU0 sets pending and reads back zeroes for the other fields
> > 
> > 6. CPU0 clears pending and sets locked -- it now has the lock
> > 
> > 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
> >for locked and pending to go clear. However, it reads a stale value
> >from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> > 
> > 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
> >point we're hosed, because both CPU2 and CPU0 have the lock.
> 
> Let me draw a picture of that..
> 
> 
>         CPU0            CPU1            CPU2            CPU3
>
> 0)                                                      lock
>                                                           trylock -> (0,0,1)
> 1)      lock
>           trylock /* fail */
>
> 2)                      lock
>                           trylock /* fail */
>                           tas-pending -> (0,1,1)
>                           wait-locked
>
> 3)                                      lock
>                                           trylock /* fail */
>                                           tas-pending /* fail */
>
> 4)                                                      unlock -> (0,1,0)
>                           clr_pnd_set_lck -> (0,0,1)
>                           unlock -> (0,0,0)
>
> 5)        tas-pending -> (0,1,0)
>           read-val -> (0,1,0)
> 6)        clr_pnd_set_lck -> (0,0,1)
> 7)                                        xchg_tail -> (n,0,1)
>                                           load_acquire <- (n,0,0) (from-4)
> 8)                                        cmpxchg /* fail */
>                                           set_locked()
> 
> > Is there something I'm missing that means this can't happen? I suppose
> > cacheline granularity ends up giving serialisation between (4) and (7),
> > but I'd *much* prefer not to rely on that because it feels horribly
> > fragile.
> 
> Well, on x86 atomics are fully ordered, so the xchg_tail() does in
> fact have smp_mb() in it, and that should order it sufficiently for that
> not to happen, I think.

Hmm, does that actually help, though? I still think you're relying on the
cache-coherence protocol to serialise the xchg() on pending before the
xchg_tail(), which I think is fragile because they don't actually overlap.

> But in general, yes ick. Alternatively, making xchg_tail an ACQUIRE
> doesn't seem too far out..
> 
> > Another idea I was playing with was adding test_and_set_bit_acquire()
> > for this, because x86 has an instruction for that, right?
> 
> LOCK BTS, yes. So it can do a full 32bit RmW, but it cannot return the
> old value of the word, just the old bit (in CF).
> 
> I suppose you get rid of the whole mixed size thing, but you still have
> the whole two instruction thing.

I really think we need that set of pending to operate on the whole lock
word.

> > > + /*
> > > +  * Ensures the tail load happens after the xchg().
> > > +  *
> > > +  * lock  unlock(a)
> > > +  *   xchg ---.
> > > +  *(b)  lock  unlock  +- fetch_or
> > > +  *   load ---'
> > > +  * lock  unlock(c)
> > > +  *
> > 
> > I failed miserably at parsing this comment :(
> > 
> > I think the main issue is that I don't understand how to read the little
> > diagram you've got.
> 
> Where fetch_or() is indivisible and has happens-before (a) and
> happens-after (c), the new thing is in fact divisible and has
> happens-in-between (b).
> 
> Of the happens-in-between (b), we can either get a new concurrent
> locker, or make progress of an extant concurrent locker because an
> unlock happened.
> 
> But the rest of the text might indeed be very confused. I think I wrote
> the bulk of that when I was in fact doing a xchg16 on locked_pending,
> but that's fundamentally broken. I did edit it afterwards, but that
> might have just made it worse.

Ok, maybe just remove it :)

Will


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-10-02 Thread Andrea Parri
> consider this scenario with your patch:
> 
> 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
>pending.
> 
> 2. CPU1 comes in and sets pending, spins on locked
> 
> 3. CPU2 sees a pending and locked val, and is about to enter the head of
>the waitqueue (i.e. it's right before xchg_tail()).
> 
> 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
>it, so pending and locked are now 0.
> 
> 5. CPU0 sets pending and reads back zeroes for the other fields
> 
> 6. CPU0 clears pending and sets locked -- it now has the lock
> 
> 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
>for locked and pending to go clear. However, it reads a stale value
>from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> 
> 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
>point we're hosed, because both CPU2 and CPU0 have the lock.

Thanks for pointing this out.   I am wondering: can't we have a similar
scenario with the current code (i.e., w/o these patches): what prevents
the scenario reported below, following Peter's diagram, from happening?

  Andrea


        CPU0            CPU1            CPU2            CPU3

0)                                                      lock
                                                          trylock -> (0,0,1)
1)      lock
          trylock /* fail */

2)                      lock
                          trylock /* fail */
                          fetch_or_acquire -> (0,1,1)
                          wait-locked

3)                                      lock
                                          trylock /* fail */
                                          goto queue

4)                                                      unlock -> (0,1,0)
                          clr_pnd_set_lck -> (0,0,1)
                          unlock -> (0,0,0)

5)        fetch_or_acquire -> (0,1,0)
6)        clr_pnd_set_lck -> (0,0,1)
7)                                        xchg_tail -> (n,0,1)
                                          load_acquire <- (n,0,0) (from-4)
8)                                        cmpxchg /* fail */
                                          set_locked()


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-10-01 Thread Peter Zijlstra
On Mon, Oct 01, 2018 at 06:17:00PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> Thanks for chewing up my afternoon ;)

I'll get you a beer in EDI ;-)

> On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:

> >  /**
> > + * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
> > + * @lock : Pointer to queued spinlock structure
> > + * Return: The previous lock value
> > + *
> > + * *,*,* -> *,1,*
> > + */
> > +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> > +{
> > +#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
> > +   u32 val;
> > +
> > +   /*
> > +* For the !LL/SC archs that do not have a native atomic_fetch_or
> > +* but do have a native xchg (x86) we can do trickery and avoid the
> > +* cmpxchg-loop based fetch-or to improve determinism.
> > +*
> > +* We first xchg the pending byte to set PENDING, and then issue a load
> > +* for the rest of the word and mask out the pending byte to compute a
> > +* 'full' value.
> > +*/
> > +   val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> 
> If we make this an xchg_acquire(), can we drop the smp_mb__after_atomic()
> and use a plain atomic_read() to get the rest of the word? 

I did consider that; but I confused myself so many times I stuck with
this one. Primarily because on x86 it doesn't matter one way or the
other and smp_mb() are 'easier' to reason about.
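
[ For concreteness, the variant being discussed here (assuming
  _Q_PENDING_BITS == 8, i.e. a byte-sized pending field) would look
  something like the sketch below; the helper name is made up, and
  whether this ordering is sufficient is exactly what the rest of the
  thread worries about. ]

static __always_inline u32 set_pending_fetch_acquire_sketch(struct qspinlock *lock)
{
        u32 val;

        /* Acquire-ordered test-and-set of the pending byte ... */
        val  = xchg_acquire(&lock->pending, 1) << _Q_PENDING_OFFSET;

        /*
         * ... then a plain read for the rest of the word, with no
         * smp_mb__after_atomic() in between.
         */
        val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;

        return val;
}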

> But actually,
> consider this scenario with your patch:
> 
> 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
>pending.
> 
> 2. CPU1 comes in and sets pending, spins on locked
> 
> 3. CPU2 sees a pending and locked val, and is about to enter the head of
>the waitqueue (i.e. it's right before xchg_tail()).
> 
> 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
>it, so pending and locked are now 0.
> 
> 5. CPU0 sets pending and reads back zeroes for the other fields
> 
> 6. CPU0 clears pending and sets locked -- it now has the lock
> 
> 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
>for locked and pending to go clear. However, it reads a stale value
>from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> 
> 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
>point we're hosed, because both CPU2 and CPU0 have the lock.

Let me draw a picture of that..


        CPU0            CPU1            CPU2            CPU3

0)                                                      lock
                                                          trylock -> (0,0,1)
1)      lock
          trylock /* fail */

2)                      lock
                          trylock /* fail */
                          tas-pending -> (0,1,1)
                          wait-locked

3)                                      lock
                                          trylock /* fail */
                                          tas-pending /* fail */

4)                                                      unlock -> (0,1,0)
                          clr_pnd_set_lck -> (0,0,1)
                          unlock -> (0,0,0)

5)        tas-pending -> (0,1,0)
          read-val -> (0,1,0)
6)        clr_pnd_set_lck -> (0,0,1)
7)                                        xchg_tail -> (n,0,1)
                                          load_acquire <- (n,0,0) (from-4)
8)                                        cmpxchg /* fail */
                                          set_locked()

> Is there something I'm missing that means this can't happen? I suppose
> cacheline granularity ends up giving serialisation between (4) and (7),
> but I'd *much* prefer not to rely on that because it feels horribly
> fragile.

Well, on x86 atomics are fully ordered, so the xchg_tail() does in
fact have smp_mb() in it, and that should order it sufficiently for that
not to happen, I think.

But in general, yes ick. Alternatively, making xchg_tail an ACQUIRE
doesn't seem too far out..
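
[ A sketch of that alternative, starting from the _Q_PENDING_BITS == 8
  version of xchg_tail() in kernel/locking/qspinlock.c; the only change
  is relaxed -> acquire, and this is illustrative, not a proposed patch. ]

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
        /*
         * ACQUIRE: order the head-of-queue waiter's subsequent loads of
         * the lock word after its tail update, instead of relying on the
         * relaxed xchg plus whatever ordering the surrounding code gives.
         */
        return (u32)xchg_acquire(&lock->tail,
                                 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}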

> Another idea I was playing with was adding test_and_set_bit_acquire()
> for this, because x86 has an instruction for that, right?

LOCK BTS, yes. So it can do a full 32bit RmW, but it cannot return the
old value of the word, just the old bit (in CF).

I suppose you get rid of the whole mixed size thing, but you still have
the whole two instruction thing.

> > +   /*
> > +* Ensures the tail load happens after the xchg().
> > +*
> > +* lock  unlock(a)
> > +*   xchg ---.
> > +*(b)  lock  unlock  +- fetch_or
> > +*   load ---'
> > +* lock  unlock(c)
> > +*
> 
> I failed miserably at parsing this comment :(
> 
> I think the main issue is that I don't understand how to read the little
> diagram you've got.

Where fetch_or() is indivisible and has happens-before (a) and
happens-after (c), the new thing is in fact divisible and has
happens-in-between (b).

Of the happens-in-between (b), we can either get a new concurrent
locker, or make progress of an extant concurrent locker because an
unlock happened.

But the rest of the text might indeed be very confused. I think I wrote
the bulk of that when I was in fact doing a xchg16 on locked_pending,
but that's fundamentally broken. I did edit it afterwards, but that
might have just made it worse.

Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-10-01 Thread Will Deacon
Hi Peter,

Thanks for chewing up my afternoon ;)

On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
> 
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
> In particular the ordering between pending and tail is of interest,
> because that is where the split happens.
> 
> The claim is that if we order them, it all works out just fine. There
> are two specific cases where the pending,tail state changes:
> 
>  - when the 3rd lock(er) comes in and finds pending set, it'll queue
>    and set tail; since we set tail while pending is set, the ordering
>    of the split is not important (and not fundamentally different from
>    fetch_or). [*]
> 
>  - when the last queued lock holder acquires the lock (uncontended),
>    we clear the tail and set the lock byte. By first setting the
>    pending bit this cmpxchg will fail and the later load must then
>    see the remaining tail.
> 
> Another interesting scenario is where there are only 2 threads:
> 
>   lock := (0,0,0)
> 
>   CPU 0                             CPU 1
>
>   lock()                            lock()
>     trylock(-> 0,0,1)                 trylock() /* fail */
>     return;                           xchg_relaxed(pending, 1) (-> 0,1,1)
>                                       mb()
>                                       val = smp_load_acquire(*lock);
> 
> Where, without the mb() the load would've been allowed to return 0 for
> the locked byte.
> 
> [*] there is a fun scenario where the 3rd locker observes the pending
> bit, but before it sets the tail, the 1st lock (owner) unlocks and the
> 2nd locker acquires the lock, leaving us with a temporary 0,0,1 state.
> But this is not different between fetch_or or this new method.
> 
> Suggested-by: Will Deacon 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/include/asm/qspinlock.h |2 +
>  kernel/locking/qspinlock.c   |   56 ++-
>  2 files changed, 57 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -9,6 +9,8 @@
>  
>  #define _Q_PENDING_LOOPS (1 << 9)
>  
> +#define _Q_NO_FETCH_OR
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>  extern void __pv_init_lock_hash(void);
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -232,6 +232,60 @@ static __always_inline u32 xchg_tail(str
>  #endif /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
> + * @lock : Pointer to queued spinlock structure
> + * Return: The previous lock value
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
> + u32 val;
> +
> + /*
> +  * For the !LL/SC archs that do not have a native atomic_fetch_or
> +  * but do have a native xchg (x86) we can do trickery and avoid the
> +  * cmpxchg-loop based fetch-or to improve determinism.
> +  *
> +  * We first xchg the pending byte to set PENDING, and then issue a load
> +  * for the rest of the word and mask out the pending byte to compute a
> +  * 'full' value.
> +  */
> + val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

If we make this an xchg_acquire(), can we drop the smp_mb__after_atomic()
and use a plain atomic_read() to get the rest of the word? But actually,
consider this scenario with your patch:

1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
   pending.

2. CPU1 comes in and sets pending, spins on locked

3. CPU2 sees a pending and locked val, and is about to enter the head of
   the waitqueue (i.e. it's right before xchg_tail()).

4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
   it, so pending and locked are now 0.

5. CPU0 sets pending and reads back zeroes for the other fields

6. CPU0 clears pending and sets locked -- it now has the lock

7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
   for locked and pending to go clear. However, it reads a stale value
   from step (4) and attempts the atomic_try_cmpxchg() to take the lock.

8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
   point we're hosed, because both CPU2 and CPU0 have the lock.

Is there something I'm missing that means this can't happen? I suppose
cacheline granularity ends up giving serialisation between (4) and (7),
but I'd *much* prefer not to rely on that because it feels horribly
fragile.

Another idea I was playing with was adding test_and_set_bit_acquire()
for this, because x86 has an instruction for that, right?

RE: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-27 Thread David Laight
From: Peter Zijlstra
> Sent: 26 September 2018 12:01
> 
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
> 
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
...

IIRC the load will be 'slow' because it will have to wait for the
earlier store to actually complete - rather than being satisfied
by data from the store buffer (because the widths are different).

This may not matter for xchg ?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-27 Thread Peter Zijlstra
On Thu, Sep 27, 2018 at 10:13:15AM +0200, Andrea Parri wrote:
> On Thu, Sep 27, 2018 at 09:59:35AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 27, 2018 at 09:47:48AM +0200, Andrea Parri wrote:
> > > > LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.
> > > 
> > > True, but it is nothing conceptually new to deal with: there're Cat
> > > models that handle mixed-size accesses, just give it time.
> > 
> > Sure, but until that time I must not rely on (and thus not use) LKMM for
> > qspinlock things.
> 
> This is way too generic to be agreed ;D

Only if you know that thing well enough to know why it gives a certain
answer, and thus don't already need it.

If you need it to help you with something; it can't because it doesn't
do the mixed size thing.

> > So while your argument about coherence might be true -- I'll have to
> > think about it; litmus tests are out the window.
> 
> You trimmed the litmus test I gave you.

Because of not wanting to reverse engineer the argument from the litmus
test. But yes, I think I see your point, the earlier trylock will load
the new value and our later load cannot be earlier than that.


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-27 Thread Andrea Parri
On Thu, Sep 27, 2018 at 09:59:35AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 27, 2018 at 09:47:48AM +0200, Andrea Parri wrote:
> > > LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.
> > 
> > True, but it is nothing conceptually new to deal with: there're Cat
> > models that handle mixed-size accesses, just give it time.
> 
> Sure, but until that time I must not rely on (and thus not use) LKMM for
> qspinlock things.

This is way too generic to be agreed ;D


> 
> So while your argument about coherence might be true -- I'll have to
> think about it; litmus tests are out the window.

You trimmed the litmus test I gave you.

  Andrea


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-27 Thread Peter Zijlstra
On Thu, Sep 27, 2018 at 09:47:48AM +0200, Andrea Parri wrote:
> > LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.
> 
> True, but it is nothing conceptually new to deal with: there're Cat
> models that handle mixed-size accesses, just give it time.

Sure, but until that time I must not rely on (and thus not use) LKMM for
qspinlock things.

So while your argument about coherence might be true -- I'll have to
think about it; litmus tests are out the window.


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-27 Thread Andrea Parri
On Thu, Sep 27, 2018 at 09:17:47AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 10:52:08PM +0200, Andrea Parri wrote:
> > On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> > > On x86 we cannot do fetch_or with a single instruction and end up
> > > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > > with a very tricky composite xchg8 + load.
> > > 
> > > The basic idea is that we use xchg8 to test-and-set the pending bit
> > > (when it is a byte) and then a load to fetch the whole word. Using
> > > two instructions of course opens a window we previously did not have.
> > > In particular the ordering between pending and tail is of interest,
> > > because that is where the split happens.
> > > 
> > > The claim is that if we order them, it all works out just fine. There
> > > are two specific cases where the pending,tail state changes:
> > > 
> > >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> > >    and set tail; since we set tail while pending is set, the ordering
> > >    of the split is not important (and not fundamentally different from
> > >    fetch_or). [*]
> > > 
> > >  - when the last queued lock holder acquires the lock (uncontended),
> > >    we clear the tail and set the lock byte. By first setting the
> > >    pending bit this cmpxchg will fail and the later load must then
> > >    see the remaining tail.
> > > 
> > > Another interesting scenario is where there are only 2 threads:
> > > 
> > >   lock := (0,0,0)
> > > 
> > >   CPU 0                             CPU 1
> > >
> > >   lock()                            lock()
> > >     trylock(-> 0,0,1)                 trylock() /* fail */
> > >     return;                           xchg_relaxed(pending, 1) (-> 0,1,1)
> > >                                       mb()
> > >                                       val = smp_load_acquire(*lock);
> > > 
> > > Where, without the mb() the load would've been allowed to return 0 for
> > > the locked byte.
> > 
> > If this were true, we would have a violation of "coherence":
> 
> The thing is, this is mixed size, see:

The accesses to ->val are not, and those certainly have to meet the
"coherence" constraint (no matter the store to ->pending).


> 
>   https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
> 
> If I remember things correctly (I've not reread that paper recently) it
> is allowed for:
> 
>   old = xchg(pending,1);
>   val = smp_load_acquire(*lock);
> 
> to be re-ordered like:
> 
>   val = smp_load_acquire(*lock);
>   old = xchg(pending, 1);
> 
> with the exception that it will fwd the pending byte into the later
> load, so we get:
> 
>   val = (val & _Q_PENDING_MASK) | (old << _Q_PENDING_OFFSET);
> 
> for 'free'.
> 
> LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.

True, but it is nothing conceptually new to deal with: there're Cat
models that handle mixed-size accesses, just give it time.

  Andrea


> 
> With the addition of smp_mb__after_atomic(), we disallow the load to be
> done prior to the xchg(). It might still fwd the more recent pending
> byte from its store buffer, but at least the other bytes must not be
> earlier.


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-27 Thread Peter Zijlstra
On Wed, Sep 26, 2018 at 07:54:18PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 12:30:36PM -0400, Waiman Long wrote:
> > On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
> > > On x86 we cannot do fetch_or with a single instruction and end up
> > > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > > with a very tricky composite xchg8 + load.
> > >
> > > The basic idea is that we use xchg8 to test-and-set the pending bit
> > > (when it is a byte) and then a load to fetch the whole word. Using
> > > two instructions of course opens a window we previously did not have.
> > > In particular the ordering between pending and tail is of interest,
> > > because that is where the split happens.
> > >
> > > The claim is that if we order them, it all works out just fine. There
> > > are two specific cases where the pending,tail state changes:
> > >
> > >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> > >    and set tail; since we set tail while pending is set, the ordering
> > >    of the split is not important (and not fundamentally different from
> > >    fetch_or). [*]
> > 
> > The split can cause some changes in behavior. The 3rd locker observes
> > the pending bit and sets the tail. The split load of the 2nd locker may
> > make it observe the tail and back out of the pending loop. As a result,
> > the 2nd locker will acquire the lock after the 3rd locker in this case.
> > That won't happen with the original code.
> > 
> > I am not saying this is a problem. It is just something we should take
> > note of.
> 
> Right, good one. Yes that can happen.

I think I remember why I 'discounted' that scenario; the same can happen
with the fetch_or(), purely as a matter of timing.

Consider:

  CPU 0 CPU 1   CPU 2   CPU 3

  lock (->0,0,1)
lock
  trylock (fail)
  tas-pending (->0,1,1)
  wait-!locked
lock
  trylock (fail)
  tas-pending (fail)
  unlock (->0,1,0)
  acquire (->0,0,1)
lock
  trylock (fail)
  (A) xchg-tail
  tas-pending -> (x,1,1)
  (B) xchg-tail

And then, depending on the order of (A) and (B), x is either 0 or !0.

So the ordering of the concurrent lock attempts is always (obviously)
subject to timing; the split xchg-load thing just makes it 'worse'.
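
For reference, the "back out of the pending loop" case above corresponds to
the undo-and-queue branch of the slowpath. A minimal sketch of that fragment,
roughly following kernel/locking/qspinlock.c of this period and assuming the
set_pending_fetch_acquire() helper from this series (illustration only, not
the exact patch):

	val = set_pending_fetch_acquire(lock);	/* xchg8 + load on x86 */

	/*
	 * The (split) load showed a tail and/or an already-set pending
	 * bit, i.e. a concurrent locker: undo PENDING if we were the one
	 * to set it, and fall back to queueing.
	 */
	if (unlikely(val & ~_Q_LOCKED_MASK)) {
		if (!(val & _Q_PENDING_MASK))
			clear_pending(lock);
		goto queue;
	}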


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-27 Thread Peter Zijlstra
On Wed, Sep 26, 2018 at 10:52:08PM +0200, Andrea Parri wrote:
> On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> > On x86 we cannot do fetch_or with a single instruction and end up
> > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > with a very tricky composite xchg8 + load.
> > 
> > The basic idea is that we use xchg8 to test-and-set the pending bit
> > (when it is a byte) and then a load to fetch the whole word. Using
> > two instructions of course opens a window we previously did not have.
> > In particular the ordering between pending and tail is of interest,
> > because that is where the split happens.
> > 
> > The claim is that if we order them, it all works out just fine. There
> > are two specific cases where the pending,tail state changes:
> > 
> >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> >    and set tail; since we set tail while pending is set, the ordering
> >    of the split is not important (and not fundamentally different from
> >    fetch_or). [*]
> > 
> >  - when the last queued lock holder acquires the lock (uncontended),
> >we clear the tail and set the lock byte. By first setting the
> >pending bit this cmpxchg will fail and the later load must then
> >see the remaining tail.
> > 
> > Another interesting scenario is where there are only 2 threads:
> > 
> > lock := (0,0,0)
> > 
> > CPU 0   CPU 1
> > 
> > lock()  lock()
> >   trylock(-> 0,0,1)   trylock() /* fail */
> > return;   xchg_relaxed(pending, 1) (-> 0,1,1)
> >   mb()
> >   val = smp_load_acquire(*lock);
> > 
> > Where, without the mb() the load would've been allowed to return 0 for
> > the locked byte.
> 
> If this were true, we would have a violation of "coherence":

The thing is, this is mixed size, see:

  https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf

If I remember things correctly (I've not reread that paper recently) it
is allowed for:

old = xchg(pending,1);
val = smp_load_acquire(*lock);

to be re-ordered like:

val = smp_load_acquire(*lock);
old = xchg(pending, 1);

with the exception that it will fwd the pending byte into the later
load, so we get:

val = (val & ~_Q_PENDING_MASK) | (old << _Q_PENDING_OFFSET);

for 'free'.

LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.

With the addition of smp_mb__after_atomic(), we disallow the load to be
done prior to the xchg(). It might still fwd the more recent pending
byte from its store buffer, but at least the other bytes must not be
earlier.
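
Concretely, a sketch of the composite with that barrier added (assuming
_Q_PENDING_BITS == 8 and the set_pending_fetch_acquire() name used in this
series; the acquire load matches the litmus test elsewhere in the thread, so
treat this as an illustration rather than the exact patch):

	static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
	{
		u32 val;

		/* Byte-wide test-and-set of the pending byte (xchgb on x86). */
		val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

		/* Keep the word-wide load below ordered after the xchg above. */
		smp_mb__after_atomic();

		/* Fetch the rest of the word; mask out the (now set) pending byte. */
		val |= atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK;

		return val;
	}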


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-26 Thread Andrea Parri
On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
> 
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
> In particular the ordering between pending and tail is of interest,
> because that is where the split happens.
> 
> The claim is that if we order them, it all works out just fine. There
> are two specific cases where the pending,tail state changes:
> 
>  - when the 3rd lock(er) comes in and finds pending set, it'll queue
>    and set tail; since we set tail while pending is set, the ordering
>    of the split is not important (and not fundamentally different from
>    fetch_or). [*]
> 
>  - when the last queued lock holder acquires the lock (uncontended),
>we clear the tail and set the lock byte. By first setting the
>pending bit this cmpxchg will fail and the later load must then
>see the remaining tail.
> 
> Another interesting scenario is where there are only 2 threads:
> 
>   lock := (0,0,0)
> 
>   CPU 0   CPU 1
> 
>   lock()  lock()
> trylock(-> 0,0,1)   trylock() /* fail */
>   return;   xchg_relaxed(pending, 1) (-> 0,1,1)
> mb()
> val = smp_load_acquire(*lock);
> 
> Where, without the mb() the load would've been allowed to return 0 for
> the locked byte.

If this were true, we would have a violation of "coherence":

C peterz

{}

P0(atomic_t *val)
{
	int r0;

	/* in queued_spin_trylock() */
	r0 = atomic_cmpxchg_acquire(val, 0, 1);
}

P1(atomic_t *val)
{
	int r0;
	int r1;

	/* in queued_spin_trylock() */
	r0 = atomic_cmpxchg_acquire(val, 0, 1); /* or atomic_read(val) */

	/* in queued_spin_lock_slowpath() -- set_pending_fetch_acquire() */
	r1 = atomic_read_acquire(val);
}

exists (0:r0=0 /\ ~1:r0=0 /\ 1:r1=0)
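
(For the record, this kind of test can be run through the LKMM tooling under
tools/memory-model; assuming the test above is saved as peterz.litmus,
something like

	herd7 -conf linux-kernel.cfg peterz.litmus

should report the "exists" clause as never satisfied: a po-later read of the
same variable cannot return a value older than the one the failed cmpxchg
observed.)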

  Andrea


> 
> [*] there is a fun scenario where the 3rd locker observes the pending
> bit, but before it sets the tail, the 1st locker (owner) unlocks and the
> 2nd locker acquires the lock, leaving us with a temporary 0,0,1 state.
> But this is no different between fetch_or and this new method.
> 
> Suggested-by: Will Deacon 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/include/asm/qspinlock.h |2 +
>  kernel/locking/qspinlock.c   |   56 ++-
>  2 files changed, 57 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -9,6 +9,8 @@
>  
>  #define _Q_PENDING_LOOPS (1 << 9)
>  
> +#define _Q_NO_FETCH_OR
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>  extern void __pv_init_lock_hash(void);
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -232,6 +232,60 @@ static __always_inline u32 xchg_tail(str
>  #endif /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
> + * @lock : Pointer to queued spinlock structure
> + * Return: The previous lock value
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
> + u32 val;
> +
> + /*
> +  * For the !LL/SC archs that do not have a native atomic_fetch_or
> +  * but do have a native xchg (x86) we can do trickery and avoid the
> +  * cmpxchg-loop based fetch-or to improve determinism.
> +  *
> +  * We first xchg the pending byte to set PENDING, and then issue a load
> +  * for the rest of the word and mask out the pending byte to compute a
> +  * 'full' value.
> +  */
> + val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> + /*
> +  * Ensures the tail load happens after the xchg().
> +  *
> +  * lock  unlock(a)
> +  *   xchg ---.
> +  *(b)  lock  unlock  +- fetch_or
> +  *   load ---'
> +  * lock  unlock(c)
> +  *
> +  * For both lock and unlock, (a) and (c) are the same as fetch_or(),
> +  * since either are fully before or after. The only new case is (b),
> +  * where a concurrent lock or unlock can interleave with the composite
> +  * operation.
> +  *
> +  * In either case, it is the queueing case that is of interest, otherwise
> +  * the whole operation is covered by the xchg() and the tail will be 0.
> +  *
> +  * For lock-(b); we only care if we set the PENDING bit or not. If we lost
> + 


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-26 Thread Peter Zijlstra
On Wed, Sep 26, 2018 at 12:30:36PM -0400, Waiman Long wrote:
> On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
> > On x86 we cannot do fetch_or with a single instruction and end up
> > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > with a very tricky composite xchg8 + load.
> >
> > The basic idea is that we use xchg8 to test-and-set the pending bit
> > (when it is a byte) and then a load to fetch the whole word. Using
> > two instructions of course opens a window we previously did not have.
> > In particular the ordering between pending and tail is of interest,
> > because that is where the split happens.
> >
> > The claim is that if we order them, it all works out just fine. There
> > are two specific cases where the pending,tail state changes:
> >
> >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> >    and set tail; since we set tail while pending is set, the ordering
> >    of the split is not important (and not fundamentally different from
> >    fetch_or). [*]
> 
> The split can cause some changes in behavior. The 3rd locker observes
> the pending bit and sets the tail. The split load of the 2nd locker may
> make it observe the tail and back out of the pending loop. As a result,
> the 2nd locker will acquire the lock after the 3rd locker in this case.
> That won't happen with the original code.
> 
> I am not saying this is a problem. It is just something we should take
> note of.

Right, good one. Yes that can happen.


Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

2018-09-26 Thread Waiman Long
On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
>
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
> In particular the ordering between pending and tail is of interest,
> because that is where the split happens.
>
> The claim is that if we order them, it all works out just fine. There
> are two specific cases where the pending,tail state changes:
>
>  - when the 3rd lock(er) comes in and finds pending set, it'll queue
>    and set tail; since we set tail while pending is set, the ordering
>    of the split is not important (and not fundamentally different from
>    fetch_or). [*]

The split can cause some changes in behavior. The 3rd locker observes
the pending bit and sets the tail. The split load of the 2nd locker may
make it observe the tail and back out of the pending loop. As a result,
the 2nd locker will acquire the lock after the 3rd locker in this case.
That won't happen with the original code.

I am not saying this is a problem. It is just something we should take
note of.

Cheers,
Longman


