Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
On Wed, Oct 5, 2016 at 10:47 PM, Davidlohr Bueso wrote:
> On Wed, 05 Oct 2016, Waiman Long wrote:
>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 05a3785..1e6823a 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -12,6 +12,23 @@
>>   */
>>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>>
>> +enum mbtype {
>> +	acquire,
>> +	release,
>> +	relaxed,
>> +};
>
> No, please.
>
>> +
>> +static __always_inline int
>> +_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)
>> +{
>> +	if (barrier == acquire)
>> +		return atomic_cmpxchg_acquire(v, old, new);
>> +	else if (barrier == release)
>> +		return atomic_cmpxchg_release(v, old, new);
>> +	else
>> +		return atomic_cmpxchg_relaxed(v, old, new);
>> +}
>
> Things like the above are icky. How about something like below? I'm not
> crazy about it, but there are other similar macros, ie lockref. We still
> provide osq_lock/unlock to imply acquire/release and the new _relaxed
> flavor, as I agree that should be the correct naming.
>
> While I have not touched osq_wait_next(), the following are impacted:
>
> - node->locked is now completely without ordering for _relaxed() (currently
>   it's under smp_load_acquire, which does not match, and the race is harmless
>   to begin with as we just iterate again). For the acquire flavor, it is
>   always formed with ctrl dep + smp_rmb().
>
> - If osq_lock() fails we never guarantee any ordering.
>
> What do you think?

I think that this method looks cleaner than the version with the
conditionals and enum. My first preference though would be to document
that the current code doesn't provide the acquire semantics. The thinking
is that if OSQ is just used internally as a queuing mechanism and isn't
used as a traditional lock guarding critical sections, then it may just
be misleading and just add more complexity.
If we do want separate acquire and relaxed variants, then I think David's
version using macros is a bit more in line with other locking code.

Jason
Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
On 10/06/2016 01:47 AM, Davidlohr Bueso wrote:
> On Wed, 05 Oct 2016, Waiman Long wrote:
>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 05a3785..1e6823a 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -12,6 +12,23 @@
>>   */
>>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>>
>> +enum mbtype {
>> +	acquire,
>> +	release,
>> +	relaxed,
>> +};
>
> No, please.
>
>> +
>> +static __always_inline int
>> +_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)
>> +{
>> +	if (barrier == acquire)
>> +		return atomic_cmpxchg_acquire(v, old, new);
>> +	else if (barrier == release)
>> +		return atomic_cmpxchg_release(v, old, new);
>> +	else
>> +		return atomic_cmpxchg_relaxed(v, old, new);
>> +}
>
> Things like the above are icky. How about something like below? I'm not
> crazy about it, but there are other similar macros, ie lockref. We still
> provide osq_lock/unlock to imply acquire/release and the new _relaxed
> flavor, as I agree that should be the correct naming.
>
> While I have not touched osq_wait_next(), the following are impacted:
>
> - node->locked is now completely without ordering for _relaxed() (currently
>   it's under smp_load_acquire, which does not match, and the race is harmless
>   to begin with as we just iterate again). For the acquire flavor, it is
>   always formed with ctrl dep + smp_rmb().
>
> - If osq_lock() fails we never guarantee any ordering.
>
> What do you think?
>
> Thanks,
> Davidlohr

Yes, I am OK with your change. However, I need some additional changes in
osq_wait_next() as well: either it is changed to use the release variants
of atomic_cmpxchg and xchg, or it uses a macro like what you did with
osq_lock and osq_unlock. The release variant is needed in osq_unlock().
As osq_wait_next() is only invoked in the failure path of osq_lock(), the
barrier type doesn't really matter there.

Cheers,
Longman
Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
On Wed, 05 Oct 2016, Waiman Long wrote:

> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 05a3785..1e6823a 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -12,6 +12,23 @@
>   */
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> +enum mbtype {
> +	acquire,
> +	release,
> +	relaxed,
> +};

No, please.

> +
> +static __always_inline int
> +_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)
> +{
> +	if (barrier == acquire)
> +		return atomic_cmpxchg_acquire(v, old, new);
> +	else if (barrier == release)
> +		return atomic_cmpxchg_release(v, old, new);
> +	else
> +		return atomic_cmpxchg_relaxed(v, old, new);
> +}

Things like the above are icky. How about something like below? I'm not
crazy about it, but there are other similar macros, ie lockref. We still
provide osq_lock/unlock to imply acquire/release and the new _relaxed
flavor, as I agree that should be the correct naming.

While I have not touched osq_wait_next(), the following are impacted:

- node->locked is now completely without ordering for _relaxed() (currently
  it's under smp_load_acquire, which does not match, and the race is harmless
  to begin with as we just iterate again). For the acquire flavor, it is
  always formed with ctrl dep + smp_rmb().

- If osq_lock() fails we never guarantee any ordering.

What do you think?

Thanks,
Davidlohr

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 703ea5c30a33..183ee51e6e54 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -29,9 +29,20 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 	atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
 }
 
+/*
+ * Versions of osq_lock/unlock that do not imply or guarantee (load)-ACQUIRE
+ * (store)-RELEASE barrier semantics.
+ *
+ * Note that a failed call to either osq_lock() or osq_lock_relaxed() does
+ * not imply barriers... we are next to block.
+ */
+extern bool osq_lock_relaxed(struct optimistic_spin_queue *lock);
+extern void osq_unlock_relaxed(struct optimistic_spin_queue *lock);
+
 extern bool osq_lock(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
 
+
 static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
 {
 	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90db3909..b1bf1e057565 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -316,7 +316,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 	 * acquire the mutex all at once, the spinners need to take a
 	 * MCS (queued) lock first before spinning on the owner field.
 	 */
-	if (!osq_lock(&lock->osq))
+	if (!osq_lock_relaxed(&lock->osq))
 		goto done;
 
 	while (true) {
@@ -358,7 +358,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 		}
 
 		mutex_set_owner(lock);
-		osq_unlock(&lock->osq);
+		osq_unlock_relaxed(&lock->osq);
 		return true;
 	}
@@ -380,7 +380,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 		cpu_relax_lowlatency();
 	}
 
-	osq_unlock(&lock->osq);
+	osq_unlock_relaxed(&lock->osq);
 done:
 	/*
 	 * If we fell out of the spin path because of need_resched(),
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a37857ab55..8c3d57698702 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -28,6 +28,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 	return per_cpu_ptr(&osq_node, cpu_nr);
 }
 
+static inline void set_node_locked_release(struct optimistic_spin_node *node)
+{
+	smp_store_release(&node->locked, 1);
+}
+
+static inline void set_node_locked_relaxed(struct optimistic_spin_node *node)
+{
+	WRITE_ONCE(node->locked, 1);
+}
+
 /*
  * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
  * Can return NULL in case we were the last queued and we updated @lock instead.
@@ -81,130 +92,140 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 	return next;
 }
 
-bool osq_lock(struct optimistic_spin_queue *lock)
-{
-	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
-	struct optimistic_spin_node *prev, *next;
-	int curr = encode_cpu(smp_processor_id());
-	int old;
-
-	node->locked = 0;
-	node->next = NULL;
-	node->cpu = curr;
-
-	/*
-	 * We need both ACQUIRE (pairs with corresponding RELEASE in
-	 * unlock() uncontended, or fastpath) and RELEASE (to publish
-	 * the node fields we just initialised) semantics when updating
-	 * the lock tail.
-	 */
-	old = atomic_xchg(&lock->tail, curr);
-	if (old == OSQ_UNLOCKED_VAL)
-		return true;
-
-	prev = decode_cpu(old);
Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
On 10/05/2016 08:19 AM, Waiman Long wrote:
> On 10/04/2016 03:06 PM, Davidlohr Bueso wrote:
>> On Thu, 18 Aug 2016, Waiman Long wrote:
>>> The osq_lock() and osq_unlock() function may not provide the necessary
>>> acquire and release barrier in some cases. This patch makes sure
>>> that the proper barriers are provided when osq_lock() is successful
>>> or when osq_unlock() is called.
>>
>> But why do we need these guarantees given that osq is only used
>> internally for lock owner spinning situations? Leaking out of the
>> critical region will obviously be bad if using it as a full lock, but,
>> as is, this can only hurt performance of two of the most popular locks
>> in the kernel -- although yes, using smp_acquire__after_ctrl_dep is
>> nicer for polling.
>
> First of all, it is not obvious from the name osq_lock() that it is not
> an acquire barrier in some cases. We either need to clearly document it
> or have a variant name that indicates that, e.g. osq_lock_relaxed, for
> example.
>
> Secondly, if we look at the use cases of osq_lock(), the additional
> latency (for non-x86 archs) only matters if the master lock is
> immediately available for acquisition after osq_lock() returns.
> Otherwise, it will be hidden in the spinning loop for that master lock.
> So yes, there may be a slight performance hit in some cases, but
> certainly not always.
>
>> If you need tighter osq for rwsems, could it be refactored such that
>> mutexes do not take a hit?
>
> Yes, we can certainly do that like splitting into 2 variants, one with
> acquire barrier guarantee and one without.

How about the following patch? Does that address your concern?

Cheers,
Longman

--[ Cut Here ]--

[PATCH] locking/osq: Provide 2 variants of lock/unlock functions

Despite the lock/unlock names, the osq_lock() and osq_unlock() functions
were not proper acquire and release barriers. To clarify the situation,
two different variants are now provided:

 1) osq_lock/osq_unlock - they are proper acquire (if successful) and
    release barriers respectively.

 2) osq_lock_relaxed/osq_unlock_relaxed - they do not provide the
    acquire/release barrier guarantee.

Both the mutex and rwsem optimistic spinning codes are modified to use
the relaxed variants which will be faster in some architectures as the
proper memory barrier semantics are not needed for queuing purpose.

Signed-off-by: Waiman Long
---
 include/linux/osq_lock.h    |    2 +
 kernel/locking/mutex.c      |    6 +-
 kernel/locking/osq_lock.c   |   89 +++---
 kernel/locking/rwsem-xadd.c |    4 +-
 4 files changed, 81 insertions(+), 20 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 703ea5c..72357d0 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -30,7 +30,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 }
 
 extern bool osq_lock(struct optimistic_spin_queue *lock);
+extern bool osq_lock_relaxed(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
+extern void osq_unlock_relaxed(struct optimistic_spin_queue *lock);
 
 static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
 {
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..b1bf1e0 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -316,7 +316,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 	 * acquire the mutex all at once, the spinners need to take a
 	 * MCS (queued) lock first before spinning on the owner field.
 	 */
-	if (!osq_lock(&lock->osq))
+	if (!osq_lock_relaxed(&lock->osq))
 		goto done;
 
 	while (true) {
@@ -358,7 +358,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 	}
 
 	mutex_set_owner(lock);
-	osq_unlock(&lock->osq);
+	osq_unlock_relaxed(&lock->osq);
 	return true;
 }
@@ -380,7 +380,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 		cpu_relax_lowlatency();
 	}
 
-	osq_unlock(&lock->osq);
+	osq_unlock_relaxed(&lock->osq);
 done:
 	/*
	 * If we fell out of the spin path because of need_resched(),
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..1e6823a 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -12,6 +12,23 @@
  */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
 
+enum mbtype {
+	acquire,
+	release,
+	relaxed,
+};
+
+static __always_inline int
+_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)
+{
+	if (barrier == acquire)
+		return atomic_cmpxchg_acquire(v, old, new);
+	else if (barrier == release)
+		return atomic_cmpxchg_release(v, old, new);
+	else
+		return atomic_cmpxchg_relaxed(v, old, new);
+}
+
 /*
  * We use the value 0 to represent "no CPU", thus the encoded value
  * will be the CPU number incremented by 1.
Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
On 10/04/2016 03:06 PM, Davidlohr Bueso wrote:
> On Thu, 18 Aug 2016, Waiman Long wrote:
>> The osq_lock() and osq_unlock() function may not provide the necessary
>> acquire and release barrier in some cases. This patch makes sure
>> that the proper barriers are provided when osq_lock() is successful
>> or when osq_unlock() is called.
>
> But why do we need these guarantees given that osq is only used
> internally for lock owner spinning situations? Leaking out of the
> critical region will obviously be bad if using it as a full lock, but,
> as is, this can only hurt performance of two of the most popular locks
> in the kernel -- although yes, using smp_acquire__after_ctrl_dep is
> nicer for polling.

First of all, it is not obvious from the name osq_lock() that it is not
an acquire barrier in some cases. We either need to clearly document it
or have a variant name that indicates that, e.g. osq_lock_relaxed, for
example.

Secondly, if we look at the use cases of osq_lock(), the additional
latency (for non-x86 archs) only matters if the master lock is
immediately available for acquisition after osq_lock() returns.
Otherwise, it will be hidden in the spinning loop for that master lock.
So yes, there may be a slight performance hit in some cases, but
certainly not always.

> If you need tighter osq for rwsems, could it be refactored such that
> mutexes do not take a hit?

Yes, we can certainly do that like splitting into 2 variants, one with
acquire barrier guarantee and one without.

>> Suggested-by: Peter Zijlstra (Intel)
>> Signed-off-by: Waiman Long
>> ---
>>  kernel/locking/osq_lock.c |   24 ++--
>>  1 files changed, 18 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 05a3785..3da0b97 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>  		cpu_relax_lowlatency();
>>  	}
>>
>> +	/*
>> +	 * Add an acquire memory barrier for pairing with the release barrier
>> +	 * in unlock.
>> +	 */
>> +	smp_acquire__after_ctrl_dep();
>>  	return true;
>>
>> unqueue:
>> @@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>>  	 * Second most likely case.
>>  	 */
>>  	node = this_cpu_ptr(&osq_node);
>> -	next = xchg(&node->next, NULL);
>> -	if (next) {
>> -		WRITE_ONCE(next->locked, 1);
>> +	next = xchg_relaxed(&node->next, NULL);
>> +	if (next)
>> +		goto unlock;
>> +
>> +	next = osq_wait_next(lock, node, NULL);
>> +	if (unlikely(!next)) {
>> +		/*
>> +		 * In the unlikely event that the OSQ is empty, we need to
>> +		 * provide a proper release barrier.
>> +		 */
>> +		smp_mb();
>>  		return;
>>  	}
>> -	next = osq_wait_next(lock, node, NULL);
>> -	if (next)
>> -		WRITE_ONCE(next->locked, 1);
>> +unlock:
>> +	smp_store_release(&next->locked, 1);
>>  }
>
> As well as for the smp_acquire__after_ctrl_dep comment you have above,
> this also obviously pairs with osq_lock's smp_load_acquire while
> backing out (unqueueing, step A). Given the above, for this case we
> might also just rely on READ_ONCE(node->locked); if we get the
> conditional wrong and miss the node becoming locked, all we do is
> another iteration, and while there is a cmpxchg() there, it is
> mitigated with the ccas thingy.

Similar to osq_lock(), the current osq_unlock() does not have a release
barrier guarantee. I think splitting into 2 variants - osq_unlock,
osq_unlock_relaxed - will help.

Cheers,
Longman
Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
On Tue, Oct 4, 2016 at 12:06 PM, Davidlohr Bueso wrote: > On Thu, 18 Aug 2016, Waiman Long wrote: > >> The osq_lock() and osq_unlock() functions may not provide the necessary >> acquire and release barriers in some cases. This patch makes sure >> that the proper barriers are provided when osq_lock() is successful >> or when osq_unlock() is called. > > > But why do we need these guarantees given that osq is only used internally > for lock owner spinning situations? Leaking out of the critical region will > obviously be bad if using it as a full lock, but, as is, this can only hurt > performance of two of the most popular locks in the kernel -- although yes, > using smp_acquire__after_ctrl_dep is nicer for polling. > > If you need tighter osq for rwsems, could it be refactored such that mutexes > do not take a hit?

I agree with David: the OSQ is meant to be used internally as a queuing mechanism rather than as something for protecting critical regions, and so these guarantees of preventing critical section leaks don't seem to be necessary for the OSQ. If that is the case, then it would be better to avoid the performance hit to mutexes and rwsems.

Jason
Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
On Thu, 18 Aug 2016, Waiman Long wrote:
> The osq_lock() and osq_unlock() functions may not provide the necessary
> acquire and release barriers in some cases. This patch makes sure that
> the proper barriers are provided when osq_lock() is successful or when
> osq_unlock() is called.

But why do we need these guarantees given that osq is only used internally for lock owner spinning situations? Leaking out of the critical region will obviously be bad if using it as a full lock, but, as is, this can only hurt performance of two of the most popular locks in the kernel -- although yes, using smp_acquire__after_ctrl_dep is nicer for polling.

If you need tighter osq for rwsems, could it be refactored such that mutexes do not take a hit?

> Suggested-by: Peter Zijlstra (Intel)
> Signed-off-by: Waiman Long
> ---
>  kernel/locking/osq_lock.c | 24 ++--
>  1 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 05a3785..3da0b97 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  		cpu_relax_lowlatency();
>  	}
> +	/*
> +	 * Add an acquire memory barrier for pairing with the release barrier
> +	 * in unlock.
> +	 */
> +	smp_acquire__after_ctrl_dep();
>  	return true;
>
>  unqueue:
> @@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>  	 * Second most likely case.
>  	 */
>  	node = this_cpu_ptr(&osq_node);
> -	next = xchg(&node->next, NULL);
> -	if (next) {
> -		WRITE_ONCE(next->locked, 1);
> +	next = xchg_relaxed(&node->next, NULL);
> +	if (next)
> +		goto unlock;
> +
> +	next = osq_wait_next(lock, node, NULL);
> +	if (unlikely(!next)) {
> +		/*
> +		 * In the unlikely event that the OSQ is empty, we need to
> +		 * provide a proper release barrier.
> +		 */
> +		smp_mb();
>  		return;
>  	}
> -	next = osq_wait_next(lock, node, NULL);
> -	if (next)
> -		WRITE_ONCE(next->locked, 1);
> +unlock:
> +	smp_store_release(&next->locked, 1);
>  }

As well as for the smp_acquire__after_ctrl_dep comment you have above, this also obviously pairs with the osq_lock's smp_load_acquire while backing out (unqueueing, step A). Given the above, for this case we might also just rely on READ_ONCE(node->locked): if we get the conditional wrong and miss the node becoming locked, all we do is another iteration, and while there is a cmpxchg() there, it is mitigated with the ccas thingy.

Thanks,
Davidlohr
[RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
The osq_lock() and osq_unlock() functions may not provide the necessary acquire and release barriers in some cases. This patch makes sure that the proper barriers are provided when osq_lock() is successful or when osq_unlock() is called.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Waiman Long
---
 kernel/locking/osq_lock.c | 24 ++--
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..3da0b97 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 		cpu_relax_lowlatency();
 	}
+	/*
+	 * Add an acquire memory barrier for pairing with the release barrier
+	 * in unlock.
+	 */
+	smp_acquire__after_ctrl_dep();
 	return true;

 unqueue:
@@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	 * Second most likely case.
 	 */
 	node = this_cpu_ptr(&osq_node);
-	next = xchg(&node->next, NULL);
-	if (next) {
-		WRITE_ONCE(next->locked, 1);
+	next = xchg_relaxed(&node->next, NULL);
+	if (next)
+		goto unlock;
+
+	next = osq_wait_next(lock, node, NULL);
+	if (unlikely(!next)) {
+		/*
+		 * In the unlikely event that the OSQ is empty, we need to
+		 * provide a proper release barrier.
+		 */
+		smp_mb();
 		return;
 	}
-	next = osq_wait_next(lock, node, NULL);
-	if (next)
-		WRITE_ONCE(next->locked, 1);
+unlock:
+	smp_store_release(&next->locked, 1);
 }
--
1.7.1