Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier

2016-10-06 Thread Jason Low
On Wed, Oct 5, 2016 at 10:47 PM, Davidlohr Bueso  wrote:
> On Wed, 05 Oct 2016, Waiman Long wrote:
>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 05a3785..1e6823a 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -12,6 +12,23 @@
>>  */
>> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
>> osq_node);
>>
>> +enum mbtype {
>> +acquire,
>> +release,
>> +relaxed,
>> +};
>
>
> No, please.
>
>> +
>> +static __always_inline int
>> +_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int
>> new)
>> +{
>> +if (barrier == acquire)
>> +return atomic_cmpxchg_acquire(v, old, new);
>> +else if (barrier == release)
>> +return atomic_cmpxchg_release(v, old, new);
>> +else
>> +return atomic_cmpxchg_relaxed(v, old, new);
>> +}
>
>
> Things like the above are icky. How about something like below? I'm not
> crazy about it, but there are other similar macros, i.e. lockref. We still
> provide the osq_lock/unlock to imply acquire/release and the new _relaxed
> flavor, as I agree that should be the correct naming.
>
> While I have not touched osq_wait_next(), the following are impacted:
>
> - node->locked is now completely without ordering for _relaxed() (currently
> it's under smp_load_acquire, which does not match, and the race is harmless
> to begin with as we just iterate again). For the acquire flavor, it is always
> formed with ctrl dep + smp_rmb().
>
> - If osq_lock() fails we never guarantee any ordering.
>
> What do you think?

I think that this method looks cleaner than the version with the
conditionals and enum.

My first preference though would be to document that the current code
doesn't provide the acquire semantics. The thinking is that if OSQ is
just used internally as a queuing mechanism and isn't used as a
traditional lock guarding critical sections, then providing acquire
semantics may just be misleading and add more complexity.

If we do want separate acquire and relaxed variants, then I think
David's version using macros is a bit more in line with other locking
code.

Jason
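
A sketch of the kind of comment Jason suggests adding to kernel/locking/osq_lock.c
(the wording below is purely illustrative and is not from any posted patch):

/*
 * The OSQ is an internal queuing mechanism for optimistic spinners.
 * A successful osq_lock() is NOT guaranteed to act as an ACQUIRE
 * barrier and osq_unlock() is NOT guaranteed to act as a RELEASE
 * barrier; callers must not rely on them to order accesses around a
 * critical section and should provide their own ordering on the lock
 * word they actually spin on.
 */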


Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier

2016-10-06 Thread Waiman Long

On 10/06/2016 01:47 AM, Davidlohr Bueso wrote:

On Wed, 05 Oct 2016, Waiman Long wrote:


diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..1e6823a 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -12,6 +12,23 @@
 */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);


+enum mbtype {
+acquire,
+release,
+relaxed,
+};


No, please.


+
+static __always_inline int
+_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)

+{
+if (barrier == acquire)
+return atomic_cmpxchg_acquire(v, old, new);
+else if (barrier == release)
+return atomic_cmpxchg_release(v, old, new);
+else
+return atomic_cmpxchg_relaxed(v, old, new);
+}


Things like the above are icky. How about something like below? I'm not
crazy about it, but there are other similar macros, i.e. lockref. We still
provide the osq_lock/unlock to imply acquire/release and the new _relaxed
flavor, as I agree that should be the correct naming.

While I have not touched osq_wait_next(), the following are impacted:

- node->locked is now completely without ordering for _relaxed() (currently
it's under smp_load_acquire, which does not match, and the race is harmless
to begin with as we just iterate again). For the acquire flavor, it is always
formed with ctrl dep + smp_rmb().

- If osq_lock() fails we never guarantee any ordering.

What do you think?

Thanks,
Davidlohr


Yes, I am OK with your change. However, I need some additional changes
in osq_wait_next() as well. Either it is changed to use the release
variants of atomic_cmpxchg and xchg, or it uses a macro like what you did
with osq_lock and osq_unlock. The release variant is needed for the
osq_unlock() path. As for osq_lock(), osq_wait_next() is only invoked in
its failure path, so the barrier type doesn't really matter there.


Cheers,
Longman
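
A sketch of what release-ordered primitives inside osq_wait_next() could look
like (illustrative only, not part of the posted patches; atomic_cmpxchg_release()
and xchg_release() are the stock kernel helpers):

	/*
	 * Sketch only: with release ordering here, the osq_unlock() slow path
	 * keeps its RELEASE guarantee without a full barrier, while the
	 * osq_lock() unqueue path (which promises no ordering on failure)
	 * is unaffected.
	 */
	for (;;) {
		if (atomic_read(&lock->tail) == curr &&
		    atomic_cmpxchg_release(&lock->tail, curr, old) == curr) {
			/* We were the last queued; nobody to hand off to. */
			break;
		}

		if (node->next) {
			next = xchg_release(&node->next, NULL);
			if (next)
				break;
		}

		cpu_relax_lowlatency();
	}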


Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier

2016-10-05 Thread Davidlohr Bueso

On Wed, 05 Oct 2016, Waiman Long wrote:


diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..1e6823a 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -12,6 +12,23 @@
 */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);


+enum mbtype {
+acquire,
+release,
+relaxed,
+};


No, please.


+
+static __always_inline int
+_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)
+{
+if (barrier == acquire)
+return atomic_cmpxchg_acquire(v, old, new);
+else if (barrier == release)
+return atomic_cmpxchg_release(v, old, new);
+else
+return atomic_cmpxchg_relaxed(v, old, new);
+}


Things like the above are icky. How about something like below? I'm not
crazy about it, but there are other similar macros, i.e. lockref. We still
provide the osq_lock/unlock to imply acquire/release and the new _relaxed
flavor, as I agree that should be the correct naming.

While I have not touched osq_wait_next(), the following are impacted:

- node->locked is now completely without ordering for _relaxed() (currently
it's under smp_load_acquire, which does not match, and the race is harmless
to begin with as we just iterate again). For the acquire flavor, it is always
formed with ctrl dep + smp_rmb().

- If osq_lock() fails we never guarantee any ordering.

What do you think?

Thanks,
Davidlohr


diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 703ea5c30a33..183ee51e6e54 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -29,9 +29,20 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
}

+/*
+ * Versions of osq_lock/unlock that do not imply or guarantee (load)-ACQUIRE
+ * (store)-RELEASE barrier semantics.
+ *
+ * Note that a failed call to either osq_lock() or osq_lock_relaxed() does
+ * not imply barriers... we are next to block.
+ */
+extern bool osq_lock_relaxed(struct optimistic_spin_queue *lock);
+extern void osq_unlock_relaxed(struct optimistic_spin_queue *lock);
+
extern bool osq_lock(struct optimistic_spin_queue *lock);
extern void osq_unlock(struct optimistic_spin_queue *lock);

+
static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
{
return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90db3909..b1bf1e057565 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -316,7 +316,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 * acquire the mutex all at once, the spinners need to take a
 * MCS (queued) lock first before spinning on the owner field.
 */
-   if (!osq_lock(&lock->osq))
+   if (!osq_lock_relaxed(&lock->osq))
goto done;

while (true) {
@@ -358,7 +358,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
}

mutex_set_owner(lock);
-   osq_unlock(&lock->osq);
+   osq_unlock_relaxed(&lock->osq);
return true;
}

@@ -380,7 +380,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
cpu_relax_lowlatency();
}

-   osq_unlock(&lock->osq);
+   osq_unlock_relaxed(&lock->osq);
done:
/*
 * If we fell out of the spin path because of need_resched(),
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a37857ab55..8c3d57698702 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -28,6 +28,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
return per_cpu_ptr(&osq_node, cpu_nr);
}

+static inline void set_node_locked_release(struct optimistic_spin_node *node)
+{
+   smp_store_release(&node->locked, 1);
+}
+
+static inline void set_node_locked_relaxed(struct optimistic_spin_node *node)
+{
+   WRITE_ONCE(node->locked, 1);
+
+}
+
/*
 * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
 * Can return NULL in case we were the last queued and we updated @lock instead.
@@ -81,130 +92,140 @@ osq_wait_next(struct optimistic_spin_queue *lock,
return next;
}

-bool osq_lock(struct optimistic_spin_queue *lock)
-{
-   struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
-   struct optimistic_spin_node *prev, *next;
-   int curr = encode_cpu(smp_processor_id());
-   int old;
-
-   node->locked = 0;
-   node->next = NULL;
-   node->cpu = curr;
-
-   /*
-* We need both ACQUIRE (pairs with corresponding RELEASE in
-* unlock() uncontended, or fastpath) and RELEASE (to publish
-* the node fields we just initialised) semantics when updating
-* the lock tail.
-*/
-   old = atomic_xchg(&lock->tail, curr);
-   if (old == OSQ_UNLOCKED_VAL)
-   return true;
-
-   prev = decode_cpu(old);
-   

Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier

2016-10-05 Thread Waiman Long

On 10/05/2016 08:19 AM, Waiman Long wrote:

On 10/04/2016 03:06 PM, Davidlohr Bueso wrote:

On Thu, 18 Aug 2016, Waiman Long wrote:


The osq_lock() and osq_unlock() function may not provide the necessary
acquire and release barrier in some cases. This patch makes sure
that the proper barriers are provided when osq_lock() is successful
or when osq_unlock() is called.


But why do we need these guarantees given that osq is only used internally
for lock owner spinning situations? Leaking out of the critical region will
obviously be bad if using it as a full lock, but, as is, this can only hurt
performance of two of the most popular locks in the kernel -- although yes,
using smp_acquire__after_ctrl_dep is nicer for polling.


First of all, it is not obvious from the name osq_lock() that it is
not an acquire barrier in some cases. We either need to clearly
document it or have a variant name that indicates that, e.g.
osq_lock_relaxed.


Secondly, if we look at the use cases of osq_lock(), the additional 
latency (for non-x86 archs) only matters if the master lock is 
immediately available for acquisition after osq_lock() returns.
Otherwise, it will be hidden in the spinning loop for that master 
lock. So yes, there may be a slight performance hit in some cases, but 
certainly not always.


If you need tighter osq for rwsems, could it be refactored such that mutexes
do not take a hit?



Yes, we can certainly do that by splitting it into 2 variants, one with
the acquire barrier guarantee and one without.




How about the following patch? Does that address your concern?

Cheers,
Longman

--[ Cut Here ]--

[PATCH] locking/osq: Provide 2 variants of lock/unlock functions

Despite the lock/unlock names, the osq_lock() and osq_unlock()
functions were not proper acquire and release barriers. To clarify
the situation, two different variants are now provided:

 1) osq_lock/osq_unlock - they are proper acquire (if successful)
and release barriers respectively.

 2) osq_lock_relaxed/osq_unlock_relaxed - they do not provide the
acquire/release barrier guarantee.

Both the mutex and rwsem optimistic spinning code paths are modified to
use the relaxed variants, which will be faster on some architectures as
the proper memory barrier semantics are not needed for queuing purposes.

Signed-off-by: Waiman Long 
---
 include/linux/osq_lock.h|2 +
 kernel/locking/mutex.c  |6 +-
 kernel/locking/osq_lock.c   |   89 +++---
 kernel/locking/rwsem-xadd.c |4 +-
 4 files changed, 81 insertions(+), 20 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 703ea5c..72357d0 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -30,7 +30,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)

 }

 extern bool osq_lock(struct optimistic_spin_queue *lock);
+extern bool osq_lock_relaxed(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
+extern void osq_unlock_relaxed(struct optimistic_spin_queue *lock);

 static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
 {
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..b1bf1e0 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -316,7 +316,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
  * acquire the mutex all at once, the spinners need to take a
  * MCS (queued) lock first before spinning on the owner field.
  */
-if (!osq_lock(&lock->osq))
+if (!osq_lock_relaxed(&lock->osq))
 goto done;

 while (true) {
@@ -358,7 +358,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 }

 mutex_set_owner(lock);
-osq_unlock(&lock->osq);
+osq_unlock_relaxed(&lock->osq);
 return true;
 }

@@ -380,7 +380,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 cpu_relax_lowlatency();
 }

-osq_unlock(&lock->osq);
+osq_unlock_relaxed(&lock->osq);
 done:
 /*
  * If we fell out of the spin path because of need_resched(),
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..1e6823a 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -12,6 +12,23 @@
  */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);


+enum mbtype {
+acquire,
+release,
+relaxed,
+};
+
+static __always_inline int
+_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)
+{
+if (barrier == acquire)
+return atomic_cmpxchg_acquire(v, old, new);
+else if (barrier == release)
+return atomic_cmpxchg_release(v, old, new);
+else
+return atomic_cmpxchg_relaxed(v, old, new);
+}
+
 /*
  * We use the value 0 to represent "no CPU", thus the encoded value
  * will be the CPU number incremented by 1.

Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier

2016-10-05 Thread Waiman Long

On 10/04/2016 03:06 PM, Davidlohr Bueso wrote:

On Thu, 18 Aug 2016, Waiman Long wrote:


The osq_lock() and osq_unlock() function may not provide the necessary
acquire and release barrier in some cases. This patch makes sure
that the proper barriers are provided when osq_lock() is successful
or when osq_unlock() is called.


But why do we need these guarantees given that osq is only used internally
for lock owner spinning situations? Leaking out of the critical region will
obviously be bad if using it as a full lock, but, as is, this can only hurt
performance of two of the most popular locks in the kernel -- although yes,
using smp_acquire__after_ctrl_dep is nicer for polling.


First of all, it is not obvious from the name osq_lock() that it is not
an acquire barrier in some cases. We either need to clearly document it
or have a variant name that indicates that, e.g. osq_lock_relaxed.


Secondly, if we look at the use cases of osq_lock(), the additional 
latency (for non-x86 archs) only matters if the master lock is 
immediately available for acquisition after osq_lock() returns.
Otherwise, it will be hidden in the spinning loop for that master lock. 
So yes, there may be a slight performance hit in some cases, but 
certainly not always.


If you need tighter osq for rwsems, could it be refactored such that mutexes
do not take a hit?



Yes, we can certainly do that by splitting it into 2 variants, one with
the acquire barrier guarantee and one without.




Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Waiman Long 
---
kernel/locking/osq_lock.c |   24 ++--
1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..3da0b97 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)

cpu_relax_lowlatency();
}
+/*
+ * Add an acquire memory barrier for pairing with the release barrier
+ * in unlock.
+ */
+smp_acquire__after_ctrl_dep();
return true;

unqueue:
@@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue *lock)

 * Second most likely case.
 */
node = this_cpu_ptr(&osq_node);
-next = xchg(&node->next, NULL);
-if (next) {
-WRITE_ONCE(next->locked, 1);
+next = xchg_relaxed(&node->next, NULL);
+if (next)
+goto unlock;
+
+next = osq_wait_next(lock, node, NULL);
+if (unlikely(!next)) {
+/*
+ * In the unlikely event that the OSQ is empty, we need to
+ * provide a proper release barrier.
+ */
+smp_mb();
return;
}

-next = osq_wait_next(lock, node, NULL);
-if (next)
-WRITE_ONCE(next->locked, 1);
+unlock:
+smp_store_release(&next->locked, 1);
}


As well as for the smp_acquire__after_ctrl_dep comment you have above, this
also obviously pairs with the osq_lock's smp_load_acquire while backing out
(unqueueing, step A). Given the above, for this case we might also just rely
on READ_ONCE(node->locked); if we get the conditional wrong and miss the node
becoming locked, all we do is another iteration, and while there is a
cmpxchg() there, it is mitigated with the ccas thingy.


Similar to osq_lock(), the current osq_unlock() does not have a release
barrier guarantee. I think splitting it into 2 variants - osq_unlock and
osq_unlock_relaxed - will help.


Cheers,
Longman
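
As a stand-alone illustration of the pairing being discussed (a release store
to ->locked on the unlock side, a polling loop whose control dependency is
upgraded with an acquire fence on the lock side), a user-space C11 analogue
could look like the sketch below. It only demonstrates the pattern; it is not
kernel code and the names are made up.

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;		/* stands in for the critical-section data */
static atomic_int locked;	/* stands in for node->locked */

static void *unlocker(void *arg)
{
	(void)arg;
	payload = 42;					/* publish data... */
	atomic_store_explicit(&locked, 1, memory_order_release);
	return NULL;
}

static void *spinner(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&locked, memory_order_relaxed))
		;					/* poll, relaxed */
	/* analogue of smp_acquire__after_ctrl_dep() */
	atomic_thread_fence(memory_order_acquire);
	assert(payload == 42);				/* ...safely consume it */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, spinner, NULL);
	pthread_create(&b, NULL, unlocker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}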


Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier

2016-10-04 Thread Jason Low
On Tue, Oct 4, 2016 at 12:06 PM, Davidlohr Bueso  wrote:
> On Thu, 18 Aug 2016, Waiman Long wrote:
>
>> The osq_lock() and osq_unlock() function may not provide the necessary
>> acquire and release barrier in some cases. This patch makes sure
>> that the proper barriers are provided when osq_lock() is successful
>> or when osq_unlock() is called.
>
>
> But why do we need these guarantees given that osq is only used internally
> for lock owner spinning situations? Leaking out of the critical region will
> obviously be bad if using it as a full lock, but, as is, this can only hurt
> performance of two of the most popular locks in the kernel -- although yes,
> using smp_acquire__after_ctrl_dep is nicer for polling.
>
> If you need tighter osq for rwsems, could it be refactored such that mutexes
> do not take a hit?

I agree with David; the OSQ is meant to be used internally as a
queuing mechanism rather than as something for protecting critical
regions, and so these guarantees of preventing critical section leaks
don't seem to be necessary for the OSQ. If that is the case, then it
would be better to avoid the performance hit to mutexes and rwsems.

Jason


Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier

2016-10-04 Thread Davidlohr Bueso

On Thu, 18 Aug 2016, Waiman Long wrote:


The osq_lock() and osq_unlock() function may not provide the necessary
acquire and release barrier in some cases. This patch makes sure
that the proper barriers are provided when osq_lock() is successful
or when osq_unlock() is called.


But why do we need these guarantees given that osq is only used internally
for lock owner spinning situations? Leaking out of the critical region will
obviously be bad if using it as a full lock, but, as is, this can only hurt
performance of two of the most popular locks in the kernel -- although yes,
using smp_acquire__after_ctrl_dep is nicer for polling.

If you need tighter osq for rwsems, could it be refactored such that mutexes
do not take a hit?



Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Waiman Long 
---
kernel/locking/osq_lock.c |   24 ++--
1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..3da0b97 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)

cpu_relax_lowlatency();
}
+   /*
+* Add an acquire memory barrier for pairing with the release barrier
+* in unlock.
+*/
+   smp_acquire__after_ctrl_dep();
return true;

unqueue:
@@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 * Second most likely case.
 */
node = this_cpu_ptr(&osq_node);
-   next = xchg(&node->next, NULL);
-   if (next) {
-   WRITE_ONCE(next->locked, 1);
+   next = xchg_relaxed(&node->next, NULL);
+   if (next)
+   goto unlock;
+
+   next = osq_wait_next(lock, node, NULL);
+   if (unlikely(!next)) {
+   /*
+* In the unlikely event that the OSQ is empty, we need to
+* provide a proper release barrier.
+*/
+   smp_mb();
return;
}

-   next = osq_wait_next(lock, node, NULL);
-   if (next)
-   WRITE_ONCE(next->locked, 1);
+unlock:
+   smp_store_release(&next->locked, 1);
}


As well as for the smp_acquire__after_ctrl_dep comment you have above, this
also obviously pairs with the osq_lock's smp_load_acquire while backing out
(unqueueing, step A). Given the above, for this case we might also just rely
on READ_ONCE(node->locked); if we get the conditional wrong and miss the node
becoming locked, all we do is another iteration, and while there is a
cmpxchg() there, it is mitigated with the ccas thingy.

Thanks,
Davidlohr
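
For context, the unqueue "step A" loop Davidlohr refers to looks roughly like
the sketch below (based on that era's kernel/locking/osq_lock.c), with the
smp_load_acquire() he would relax shown as READ_ONCE(). This is only an
illustration of his point, not a posted patch.

	/*
	 * Step A of unqueue: try to unlink ourselves from @prev.  Relaxing
	 * the load of node->locked only risks missing a transition, which
	 * costs one more iteration before the cmpxchg() is retried.
	 */
	for (;;) {
		if (prev->next == node &&
		    cmpxchg(&prev->next, node, NULL) == node)
			break;

		/*
		 * We can only fail the cmpxchg() racing against an unlock(),
		 * in which case we should observe @node->locked becoming true.
		 */
		if (READ_ONCE(node->locked))	/* was smp_load_acquire() */
			return true;

		cpu_relax_lowlatency();

		/*
		 * Or we race against a concurrent unqueue()'s step B, in which
		 * case its step C will write us a new @node->prev pointer.
		 */
		prev = READ_ONCE(node->prev);
	}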


[RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier

2016-08-18 Thread Waiman Long
The osq_lock() and osq_unlock() functions may not provide the necessary
acquire and release barriers in some cases. This patch makes sure
that the proper barriers are provided when osq_lock() is successful
or when osq_unlock() is called.

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Waiman Long 
---
 kernel/locking/osq_lock.c |   24 ++--
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..3da0b97 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
cpu_relax_lowlatency();
}
+   /*
+* Add an acquire memory barrier for pairing with the release barrier
+* in unlock.
+*/
+   smp_acquire__after_ctrl_dep();
return true;
 
 unqueue:
@@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 * Second most likely case.
 */
node = this_cpu_ptr(&osq_node);
-   next = xchg(&node->next, NULL);
-   if (next) {
-   WRITE_ONCE(next->locked, 1);
+   next = xchg_relaxed(&node->next, NULL);
+   if (next)
+   goto unlock;
+
+   next = osq_wait_next(lock, node, NULL);
+   if (unlikely(!next)) {
+   /*
+* In the unlikely event that the OSQ is empty, we need to
+* provide a proper release barrier.
+*/
+   smp_mb();
return;
}
 
-   next = osq_wait_next(lock, node, NULL);
-   if (next)
-   WRITE_ONCE(next->locked, 1);
+unlock:
+   smp_store_release(&next->locked, 1);
 }
-- 
1.7.1


