Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-11-10 Thread Nicholas Piggin
On Thu Nov 10, 2022 at 4:37 PM AEST, Christophe Leroy wrote:
>
>
> On 10/11/2022 at 01:35, Jordan Niethe wrote:
> > On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> > 
> >> -#define queued_spin_lock queued_spin_lock
> >>   
> >> -static inline void queued_spin_unlock(struct qspinlock *lock)
> >> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >>   {
> >> -  if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> >> -  smp_store_release(&lock->locked, 0);
> >> -  else
> >> -  __pv_queued_spin_unlock(lock);
> >> +  if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> >> +  return 1;
> >> +  return 0;
> > 
> > optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
>
> No parentheses.
> No == 0
>
> Should be:
>
>   return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

In this case I prefer the == 0 because we're testing against the 0 old
parameter being passed in. This is the recognisable cmpxchg pattern.

The other style of cmpxchg returns true if it succeeded, so with a bare
! it's less clear that we're not using that version.
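
For illustration only (not part of the patch), here are the two idioms
side by side; the helper names below are invented, and
atomic_try_cmpxchg_acquire() is the boolean-returning variant being
referred to:

/* cmpxchg style: the return value is the old value, so success is
 * tested by comparing against the expected old value (0) passed in. */
static __always_inline int trylock_cmpxchg_style(struct qspinlock *lock)
{
	return atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0;
}

/* try_cmpxchg style: returns a boolean directly, which is what a bare
 * '!' test could be mistaken for. */
static __always_inline int trylock_try_cmpxchg_style(struct qspinlock *lock)
{
	int old = 0;

	return atomic_try_cmpxchg_acquire(&lock->val, &old, 1);
}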

Thanks,
Nick


Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-11-10 Thread Nicholas Piggin
On Thu Nov 10, 2022 at 10:35 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> 
> > -#define queued_spin_lock queued_spin_lock
> >  
> > -static inline void queued_spin_unlock(struct qspinlock *lock)
> > +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >  {
> > -   if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> > -   smp_store_release(&lock->locked, 0);
> > -   else
> > -   __pv_queued_spin_unlock(lock);
> > +   if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> > +   return 1;
> > +   return 0;
>
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
>
> [resend as utf-8, not utf-7]

Thanks for the thorough review, apologies again that it took me so long
to get back to it.

I'm not completely sold on this. I guess it already puts side-effects in
a control flow statement though... Maybe I will change it, not sure.

Thanks,
Nick


Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-11-09 Thread Christophe Leroy


On 10/11/2022 at 01:35, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> 
>> -#define queued_spin_lock queued_spin_lock
>>   
>> -static inline void queued_spin_unlock(struct qspinlock *lock)
>> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>>   {
>> -if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
>> -smp_store_release(&lock->locked, 0);
>> -else
>> -__pv_queued_spin_unlock(lock);
>> +if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
>> +return 1;
>> +return 0;
> 
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

No parentheses.
No == 0

Should be:

return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

> 
> [resend as utf-8, not utf-7]
> 

Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:

> -#define queued_spin_lock queued_spin_lock
>  
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> - smp_store_release(&lock->locked, 0);
> - else
> - __pv_queued_spin_unlock(lock);
> + if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> + return 1;
> + return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

[resend as utf-8, not utf-7]



Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-08-10 Thread Christophe Leroy


On 10/08/2022 at 03:52, Jordan NIethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> 
>> -#define queued_spin_lock queued_spin_lock
>>   
>> -static inline void queued_spin_unlock(struct qspinlock *lock)
>> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>>   {
>> -if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
>> -smp_store_release(&lock->locked, 0);
>> -else
>> -__pv_queued_spin_unlock(lock);
>> +if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
>> +return 1;
>> +return 0;
> 
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
> 

The parentheses are pointless, and ! is usually preferred to == 0, 
something like that:

return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-08-09 Thread Jordan NIethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:

> -#define queued_spin_lock queued_spin_lock
>  
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> - smp_store_release(&lock->locked, 0);
> - else
> - __pv_queued_spin_unlock(lock);
> + if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> + return 1;
> + return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);



[PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-07-28 Thread Nicholas Piggin
Add a powerpc specific implementation of queued spinlocks. This is the
build framework with a very simple (non-queued) spinlock implementation
to begin with. Later changes add queueing, and other features and
optimisations one-at-a-time. It is done this way to more easily see how
the queued spinlocks are built, and to make performance and correctness
bisects more useful.

Generic PV qspinlock code is causing latency / starvation regressions on
large systems, resulting in reported hard lockups (mostly in
pathological cases). The generic qspinlock code has a number of issues
important for powerpc hardware and hypervisors that aren't easily solved
without changing code that would impact other architectures. Follow
s390's lead and implement our own for now.

Issues for powerpc using generic qspinlocks:
- The previous lock value should not be loaded with simple loads, and
  need not be passed around from previous loads or cmpxchg results,
  because powerpc uses ll/sc-style atomics which can perform more
  complex operations that do not require this. powerpc implementations
  tend to prefer that loads use larx for improved coherency performance.
- The queueing process should absolutely minimise the number of stores
  to the lock word to reduce exclusive coherency probes, important for
  large system scalability. The pending logic is counter productive
  here.
- Non-atomic unlock for paravirt locks is important (atomic instructions
  tend to still be more expensive than on x86 CPUs).
- Yielding to the lock owner is important in the oversubscribed paravirt
  case, which requires storing the owner CPU in the lock word.
- More control of lock stealing for the paravirt case is important to
  keep latency down on large systems.
- The lock acquisition operation should always be made with a special
  variant of atomic instructions with the lock hint bit set, including
  (especially) in the queueing paths. This is more a matter of adding
  more arch lock helpers, so it is not an insurmountable problem for
  generic code (a rough sketch follows this list).
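
For a rough illustration of the larx/stcx. and lock-hint points above (a
sketch only, not this patch's code: it assumes the single atomic_t 'val'
lock word from this series, a locked value of 1, and PPC_ACQUIRE_BARRIER
from asm/synch.h), a trylock built directly on larx/stcx. tests and sets
the lock inside one reservation, never passes an old value around, and
can set the EH acquisition hint on the larx:

/* Sketch only; names and constants here are illustrative. */
static __always_inline int qspinlock_trylock_sketch(struct qspinlock *lock)
{
	u32 prev;

	asm volatile(
"1:	lwarx	%0,0,%1,%3	# load-reserve the lock word, EH hint	\n"
"	cmpwi	0,%0,0		# currently free?			\n"
"	bne-	2f		# no, fail without storing		\n"
"	stwcx.	%2,0,%1		# try to store the locked value		\n"
"	bne-	1b		# lost the reservation, retry		\n"
	PPC_ACQUIRE_BARRIER "						\n"
"2:									\n"
	: "=&r" (prev)
	: "r" (&lock->val), "r" (1), "i" (IS_ENABLED(CONFIG_PPC64))
	: "cr0", "memory");

	return prev == 0;	/* acquired iff the lock word was seen free */
}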

Since the RFC series, I tested this on a 16-socket 1920 thread POWER10
system with some microbenchmarks, which showed up significant
problems with the previous series. A high amount of up-front spinning
on the lock (lock stealing) in SPLPAR (paravirt) mode really hurts
scalability when the guest is not overcommitted. However on smaller
KVM systems with significant overcommit (e.g., 5-10%), this spinning
is very important to avoid performance tanking due to the queueing
problem. So rather than set STEAL_SPINS and HEAD_SPINS based on
SPLPAR at boot-time, I lowered them and do more to dynamically deal
with vCPU preemption. So behaviour of dedicated and shared LPAR mode
is now the same until there is vCPU preemption detected. This seems
to be leading to better results overall, but some worst-case latencies
are significantly up with the lockstorm test (latency is still better
than generic queued spinlocks, but not as good as it previously was or
as good as simple). Statistical fairness is still significantly better.

Thanks,
Nick

---
 arch/powerpc/Kconfig   |  1 -
 arch/powerpc/include/asm/qspinlock.h   | 78 ++
 arch/powerpc/include/asm/qspinlock_types.h | 13 
 arch/powerpc/include/asm/spinlock_types.h  |  2 +-
 arch/powerpc/lib/Makefile  |  4 +-
 arch/powerpc/lib/qspinlock.c   | 18 +
 6 files changed, 69 insertions(+), 47 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock_types.h
 create mode 100644 arch/powerpc/lib/qspinlock.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 7aa12e88c580..4838e6c96b20 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -154,7 +154,6 @@ config PPC
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_MEMTEST
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
-   select ARCH_USE_QUEUED_SPINLOCKS  if PPC_QUEUED_SPINLOCKS
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 39c1c7f80579..cb2b4f91e976 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -2,66 +2,56 @@
 #ifndef _ASM_POWERPC_QSPINLOCK_H
 #define _ASM_POWERPC_QSPINLOCK_H
 
-#include 
-#include 
+#include 
+#include 
+#include 
 
-#define _Q_PENDING_LOOPS   (1 << 9) /* not tuned */
-
-void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-void __pv_queued_spin_unlock(struct qspinlock *lock);
-
-static __always_inline void queued_spin_lock(struct qspinlock *lock)
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-   u32 val = 0;
+   return atomic_read(&lock->val);
+}
 
-   if