Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-23 Thread Nicholas Piggin
Excerpts from Waiman Long's message of July 24, 2020 12:29 am:
> On 7/23/20 9:30 AM, Nicholas Piggin wrote:
>>> I would prefer to extract out the pending bit handling code out into a
>>> separate helper function which can be overridden by the arch code
>>> instead of breaking the slowpath into 2 pieces.
>> You mean have the arch provide a queued_spin_lock_slowpath_pending
>> function that the slow path calls?
>>
>> I would actually prefer the pending handling can be made inline in
>> the queued_spin_lock function, especially with out-of-line locks it
>> makes sense to put it there.
>>
>> We could ifdef out queued_spin_lock_slowpath_queue if it's not used,
>> then __queued_spin_lock_slowpath_queue would be inlined into the
>> caller so there would be no split?
> 
> The pending code is an optimization for lightly contended locks. That is 
> why I think it is appropriate to extract it into a helper function and 
> mark it as such.
> 
> You can certainly put the code in the arch's spin_lock code, you just
> have to override the generic pending code with a null function.

I see what you mean. I guess that would work fine.

Thanks,
Nick
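
A structural sketch of the direction agreed above -- hypothetical helper
name, not the actual patch -- showing the generic slowpath exposing the
pending-bit handling as an overridable helper while the queueing part
stays shared (an arch that handles pending inline in queued_spin_lock()
overrides the helper with a null version):

/* kernel/locking/qspinlock.c -- sketch only */
#ifndef queued_spin_lock_slowpath_pending
static inline bool
queued_spin_lock_slowpath_pending(struct qspinlock *lock, u32 val)
{
        /*
         * The generic pending-bit handling currently at the top of
         * queued_spin_lock_slowpath() would move here; return true if
         * the lock was acquired without queueing.
         */
        return false;
}
#endif

void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
        if (queued_spin_lock_slowpath_pending(lock, val))
                return;
        /* queueing part, shared by all architectures */
        __queued_spin_lock_slowpath_queue(lock);
}

/* arch/powerpc/include/asm/qspinlock.h -- pending is handled inline in
 * queued_spin_lock(), so the arch overrides the helper with a null one: */
#define queued_spin_lock_slowpath_pending(lock, val)    (false)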


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-23 Thread Waiman Long

On 7/23/20 9:30 AM, Nicholas Piggin wrote:

I would prefer to extract out the pending bit handling code out into a
separate helper function which can be overridden by the arch code
instead of breaking the slowpath into 2 pieces.

You mean have the arch provide a queued_spin_lock_slowpath_pending
function that the slow path calls?

I would actually prefer the pending handling can be made inline in
the queued_spin_lock function, especially with out-of-line locks it
makes sense to put it there.

We could ifdef out queued_spin_lock_slowpath_queue if it's not used,
then __queued_spin_lock_slowpath_queue would be inlined into the
caller so there would be no split?


The pending code is an optimization for lightly contended locks. That is 
why I think it is appropriate to extract it into a helper function and 
mark it as such.


You can certainly put the code in the arch's spin_lock code, you just
have to override the generic pending code with a null function.


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-23 Thread Nicholas Piggin
Excerpts from Waiman Long's message of July 22, 2020 12:36 am:
> On 7/21/20 7:08 AM, Nicholas Piggin wrote:
>> diff --git a/arch/powerpc/include/asm/qspinlock.h 
>> b/arch/powerpc/include/asm/qspinlock.h
>> index b752d34517b3..26d8766a1106 100644
>> --- a/arch/powerpc/include/asm/qspinlock.h
>> +++ b/arch/powerpc/include/asm/qspinlock.h
>> @@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock 
>> *lock)
>>   
>>   #else
>>   extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>> +extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
>>   #endif
>>   
>>   static __always_inline void queued_spin_lock(struct qspinlock *lock)
>>   {
>> -        u32 val = 0;
>> -
>> -        if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
>> +        atomic_t *a = &lock->val;
>> +        u32 val;
>> +
>> +again:
>> +        asm volatile(
>> +"1:\t"  PPC_LWARX(%0,0,%1,1) "  # queued_spin_lock              \n"
>> +        : "=&r" (val)
>> +        : "r" (&a->counter)
>> +        : "memory");
>> +
>> +        if (likely(val == 0)) {
>> +                asm_volatile_goto(
>> +        "       stwcx.  %0,0,%1                                 \n"
>> +        "       bne-    %l[again]                               \n"
>> +        "\t"    PPC_ACQUIRE_BARRIER "                           \n"
>> +                :
>> +                : "r"(_Q_LOCKED_VAL), "r" (&a->counter)
>> +                : "cr0", "memory"
>> +                : again );
>>                  return;
>> -
>> -        queued_spin_lock_slowpath(lock, val);
>> +        }
>> +
>> +        if (likely(val == _Q_LOCKED_VAL)) {
>> +                asm_volatile_goto(
>> +        "       stwcx.  %0,0,%1                                 \n"
>> +        "       bne-    %l[again]                               \n"
>> +                :
>> +                : "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (&a->counter)
>> +                : "cr0", "memory"
>> +                : again );
>> +
>> +                atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK));
>> +//              clear_pending_set_locked(lock);
>> +                WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
>> +//              lockevent_inc(lock_pending);
>> +                return;
>> +        }
>> +
>> +        if (val == _Q_PENDING_VAL) {
>> +                int cnt = _Q_PENDING_LOOPS;
>> +                val = atomic_cond_read_relaxed(a,
>> +                                               (VAL != _Q_PENDING_VAL) || !cnt--);
>> +                if (!(val & ~_Q_LOCKED_MASK))
>> +                        goto again;
>> +        }
>> +        queued_spin_lock_slowpath_queue(lock);
>>   }
>>   #define queued_spin_lock queued_spin_lock
>>   
> 
> I am fine with the arch code override some part of the generic code.

Cool.

>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index b9515fcc9b29..ebcc6f5d99d5 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -287,10 +287,14 @@ static __always_inline u32  
>> __pv_wait_head_or_lock(struct qspinlock *lock,
>>   
>>   #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>   #define queued_spin_lock_slowpath  native_queued_spin_lock_slowpath
>> +#define queued_spin_lock_slowpath_queue  native_queued_spin_lock_slowpath_queue
>>   #endif
>>   
>>   #endif /* _GEN_PV_LOCK_SLOWPATH */
>>   
>> +void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
>> +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock);
>> +
>>   /**
>>* queued_spin_lock_slowpath - acquire the queued spinlock
>>* @lock: Pointer to queued spinlock structure
>> @@ -314,12 +318,6 @@ static __always_inline u32  
>> __pv_wait_head_or_lock(struct qspinlock *lock,
>>*/
>>   void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>   {
>> -struct mcs_spinlock *prev, *next, *node;
>> -u32 old, tail;
>> -int idx;
>> -
>> -BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
>> -
>>  if (pv_enabled())
>>  goto pv_queue;
>>   
>> @@ -397,6 +395,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, 
>> u32 val)
>>   queue:
>>  lockevent_inc(lock_slowpath);
>>   pv_queue:
>> +__queued_spin_lock_slowpath_queue(lock);
>> +}
>> +EXPORT_SYMBOL(queued_spin_lock_slowpath);
>> +
>> +void queued_spin_lock_slowpath_queue(struct qspinlock *lock)
>> +{
>> +lockevent_inc(lock_slowpath);
>> +__queued_spin_lock_slowpath_queue(lock);
>> +}
>> +EXPORT_SYMBOL(queued_spin_lock_slowpath_queue);
>> +
>> +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock)
>> +{
>> +struct mcs_spinlock *prev, *next, *node;
>> +u32 old, tail;
>> +u32 val;
>> +int idx;
>> +
>> +BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
>> +
>>  node = this_cpu_ptr(&qnodes[0].mcs);
>>  idx = node->count++;
>>  tail = encode_tail(smp_processor_id(), idx);
>> @@ -559,7 +577,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, 
>> u32 val)
>>   */
>>  

Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-21 Thread Waiman Long

On 7/21/20 7:08 AM, Nicholas Piggin wrote:

diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index b752d34517b3..26d8766a1106 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock 
*lock)
  
  #else

  extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
  #endif
  
  static __always_inline void queued_spin_lock(struct qspinlock *lock)

  {
-        u32 val = 0;
-
-        if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
+        atomic_t *a = &lock->val;
+        u32 val;
+
+again:
+        asm volatile(
+"1:\t"  PPC_LWARX(%0,0,%1,1) "  # queued_spin_lock              \n"
+        : "=&r" (val)
+        : "r" (&a->counter)
+        : "memory");
+
+        if (likely(val == 0)) {
+                asm_volatile_goto(
+        "       stwcx.  %0,0,%1                                 \n"
+        "       bne-    %l[again]                               \n"
+        "\t"    PPC_ACQUIRE_BARRIER "                           \n"
+                :
+                : "r"(_Q_LOCKED_VAL), "r" (&a->counter)
+                : "cr0", "memory"
+                : again );
                 return;
-
-        queued_spin_lock_slowpath(lock, val);
+        }
+
+        if (likely(val == _Q_LOCKED_VAL)) {
+                asm_volatile_goto(
+        "       stwcx.  %0,0,%1                                 \n"
+        "       bne-    %l[again]                               \n"
+                :
+                : "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (&a->counter)
+                : "cr0", "memory"
+                : again );
+
+                atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK));
+//              clear_pending_set_locked(lock);
+                WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+//              lockevent_inc(lock_pending);
+                return;
+        }
+
+        if (val == _Q_PENDING_VAL) {
+                int cnt = _Q_PENDING_LOOPS;
+                val = atomic_cond_read_relaxed(a,
+                                               (VAL != _Q_PENDING_VAL) || !cnt--);
+                if (!(val & ~_Q_LOCKED_MASK))
+                        goto again;
+        }
+        queued_spin_lock_slowpath_queue(lock);
  }
  #define queued_spin_lock queued_spin_lock
  


I am fine with the arch code override some part of the generic code.



diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b9515fcc9b29..ebcc6f5d99d5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -287,10 +287,14 @@ static __always_inline u32  __pv_wait_head_or_lock(struct 
qspinlock *lock,
  
  #ifdef CONFIG_PARAVIRT_SPINLOCKS

  #define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
+#define queued_spin_lock_slowpath_queue  native_queued_spin_lock_slowpath_queue
  #endif
  
  #endif /* _GEN_PV_LOCK_SLOWPATH */
  
+void queued_spin_lock_slowpath_queue(struct qspinlock *lock);

+static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock);
+
  /**
   * queued_spin_lock_slowpath - acquire the queued spinlock
   * @lock: Pointer to queued spinlock structure
@@ -314,12 +318,6 @@ static __always_inline u32  __pv_wait_head_or_lock(struct 
qspinlock *lock,
   */
  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
  {
-   struct mcs_spinlock *prev, *next, *node;
-   u32 old, tail;
-   int idx;
-
-   BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
-
if (pv_enabled())
goto pv_queue;
  
@@ -397,6 +395,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)

  queue:
lockevent_inc(lock_slowpath);
  pv_queue:
+   __queued_spin_lock_slowpath_queue(lock);
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath);
+
+void queued_spin_lock_slowpath_queue(struct qspinlock *lock)
+{
+   lockevent_inc(lock_slowpath);
+   __queued_spin_lock_slowpath_queue(lock);
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath_queue);
+
+static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock)
+{
+   struct mcs_spinlock *prev, *next, *node;
+   u32 old, tail;
+   u32 val;
+   int idx;
+
+   BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -559,7 +577,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 */
__this_cpu_dec(qnodes[0].mcs.count);
  }
-EXPORT_SYMBOL(queued_spin_lock_slowpath);
  
  /*

   * Generate the paravirt code for queued_spin_unlock_slowpath().

I would prefer to extract out the pending bit handling code out into a 
separate helper function which can be overridden by the arch code 
instead of breaking the slowpath into 2 pieces.

Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-21 Thread Nicholas Piggin
Excerpts from Peter Zijlstra's message of July 9, 2020 6:31 pm:
> On Wed, Jul 08, 2020 at 07:54:34PM -0400, Waiman Long wrote:
>> On 7/8/20 4:41 AM, Peter Zijlstra wrote:
>> > On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
>> > > Yes, powerpc could certainly get more performance out of the slow
>> > > paths, and then there are a few parameters to tune.
>> > Can you clarify? The slow path is already in use on ARM64 which is weak,
>> > so I doubt there's superfluous serialization present. And Will spent a
>> > fair amount of time on making that thing guarantee forward progress, so
>> > there just isn't too much room to play.
>> > 
>> > > We don't have a good alternate patching for function calls yet, but
>> > > that would be something to do for native vs pv.
>> > Going by your jump_label implementation, support for static_call should
>> > be fairly straight forward too, no?
>> > 
>> >https://lkml.kernel.org/r/20200624153024.794671...@infradead.org
>> > 
>> Speaking of static_call, I am also looking forward to it. Do you have an
>> idea when that will be merged?
> 
> 0day had one crash on the last round, I think Steve send a fix for that
> last night and I'll go look at it.
> 
> That said, the last posting got 0 feedback, so either everybody is
> really happy with it, or not interested. So let us know in the thread,
> with some review feedback.
> 
> Once I get through enough of the inbox to actually find the fix and test
> it, I'll also update the thread, and maybe threaten to merge it if
> everybody stays silent :-)

I'd like to use it in powerpc. We have code now for example that patches 
a branch immediately at the top of memcpy which branches to a different 
version of the function. pv queued spinlock selection obviously, and
there's a bunch of platform ops struct things that get filled in at boot 
time, etc.

So +1 here if you can get them through. I'm not 100% sure we can do
it with existing toolchain and no ugly hacks, but there's no way to
structure things that can get around that AFAIKS. We'd eventually
use it though, I'd say.

Thanks,
Nick
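
A rough sketch of how the native vs pv slowpath selection discussed here
could look with static_call, based on the interface from the series linked
above as it later landed upstream; the call-site wrapper and init hook
names are made up:

#include <linux/static_call.h>

/* Default to the native slowpath; boot code retargets it when running
 * under a hypervisor. */
DEFINE_STATIC_CALL(qspl_slowpath, native_queued_spin_lock_slowpath);

static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
        static_call(qspl_slowpath)(lock, val);
}

void __init pv_qspinlock_init(void)     /* hypothetical init hook */
{
        static_call_update(qspl_slowpath, __pv_queued_spin_lock_slowpath);
}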


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-21 Thread Nicholas Piggin
Excerpts from Peter Zijlstra's message of July 8, 2020 6:41 pm:
> On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
>> Yes, powerpc could certainly get more performance out of the slow
>> paths, and then there are a few parameters to tune.
> 

Sorry for the delay, got bogged down and distracted by other things :(

> Can you clarify? The slow path is already in use on ARM64 which is weak,
> so I doubt there's superfluous serialization present. And Will spent a
> fair amount of time on making that thing guarantee forward progress, so
> there just isn't too much room to play.

Sure, the way the pending not-queued slowpath (which I guess is the
medium-path) is implemented is just poorly structured for LL/SC. It
has one more atomic than necessary (queued_fetch_set_pending_acquire),
and a lot of branches in suboptimal order.

Attached patch (completely untested just compiled and looked at asm
so far) is a way we can fix this on powerpc I think. It's actually
very little generic code change which is good, duplicated medium-path
logic unfortunately but that's no worse than something like x86
really.

>> We don't have a good alternate patching for function calls yet, but
>> that would be something to do for native vs pv.
> 
> Going by your jump_label implementation, support for static_call should
> be fairly straight forward too, no?
> 
>   https://lkml.kernel.org/r/20200624153024.794671...@infradead.org

Nice, yeah it should be. I've wanted this for ages!

powerpc is kind of annoying to implement that with its limited call range.
Hmm, not sure if we'd need a new linker feature to support it. We'd
provide call site patch space for indirect branches for those out of
range of direct call, so that should work fine. The trick would be 
patching in the TOC lookup for the function... should be doable somehow.

Thanks,
Nick

---

diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index b752d34517b3..26d8766a1106 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock 
*lock)
 
 #else
 extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
 #endif
 
 static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
-        u32 val = 0;
-
-        if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
+        atomic_t *a = &lock->val;
+        u32 val;
+
+again:
+        asm volatile(
+"1:\t"  PPC_LWARX(%0,0,%1,1) "  # queued_spin_lock              \n"
+        : "=&r" (val)
+        : "r" (&a->counter)
+        : "memory");
+
+        if (likely(val == 0)) {
+                asm_volatile_goto(
+        "       stwcx.  %0,0,%1                                 \n"
+        "       bne-    %l[again]                               \n"
+        "\t"    PPC_ACQUIRE_BARRIER "                           \n"
+                :
+                : "r"(_Q_LOCKED_VAL), "r" (&a->counter)
+                : "cr0", "memory"
+                : again );
                 return;
-
-        queued_spin_lock_slowpath(lock, val);
+        }
+
+        if (likely(val == _Q_LOCKED_VAL)) {
+                asm_volatile_goto(
+        "       stwcx.  %0,0,%1                                 \n"
+        "       bne-    %l[again]                               \n"
+                :
+                : "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (&a->counter)
+                : "cr0", "memory"
+                : again );
+
+                atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK));
+//              clear_pending_set_locked(lock);
+                WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+//              lockevent_inc(lock_pending);
+                return;
+        }
+
+        if (val == _Q_PENDING_VAL) {
+                int cnt = _Q_PENDING_LOOPS;
+                val = atomic_cond_read_relaxed(a,
+                                               (VAL != _Q_PENDING_VAL) || !cnt--);
+                if (!(val & ~_Q_LOCKED_MASK))
+                        goto again;
+        }
+        queued_spin_lock_slowpath_queue(lock);
 }
 #define queued_spin_lock queued_spin_lock
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b9515fcc9b29..ebcc6f5d99d5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -287,10 +287,14 @@ static __always_inline u32  __pv_wait_head_or_lock(struct 
qspinlock *lock,
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define queued_spin_lock_slowpath  native_queued_spin_lock_slowpath
+#define queued_spin_lock_slowpath_queue  native_queued_spin_lock_slowpath_queue
 #endif
 
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
+void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
+static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock);
+
 /**
  * queued_spin_lock_slowpath 

Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-09 Thread Peter Zijlstra
On Wed, Jul 08, 2020 at 07:54:34PM -0400, Waiman Long wrote:
> On 7/8/20 4:41 AM, Peter Zijlstra wrote:
> > On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
> > > Yes, powerpc could certainly get more performance out of the slow
> > > paths, and then there are a few parameters to tune.
> > Can you clarify? The slow path is already in use on ARM64 which is weak,
> > so I doubt there's superfluous serialization present. And Will spent a
> > fair amount of time on making that thing guarantee forward progress, so
> > there just isn't too much room to play.
> > 
> > > We don't have a good alternate patching for function calls yet, but
> > > that would be something to do for native vs pv.
> > Going by your jump_label implementation, support for static_call should
> > be fairly straight forward too, no?
> > 
> >https://lkml.kernel.org/r/20200624153024.794671...@infradead.org
> > 
> Speaking of static_call, I am also looking forward to it. Do you have an
> idea when that will be merged?

0day had one crash on the last round, I think Steve send a fix for that
last night and I'll go look at it.

That said, the last posting got 0 feedback, so either everybody is
really happy with it, or not interested. So let us know in the thread,
with some review feedback.

Once I get through enough of the inbox to actually find the fix and test
it, I'll also update the thread, and maybe threaten to merge it if
everybody stays silent :-)


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Waiman Long

On 7/8/20 7:50 PM, Waiman Long wrote:

On 7/8/20 1:10 AM, Nicholas Piggin wrote:

Excerpts from Waiman Long's message of July 8, 2020 1:33 pm:

On 7/7/20 1:57 AM, Nicholas Piggin wrote:

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

And then there seem to be one or two tunable parameters we could
experiment with.

The paravirt locks may need a bit more tuning. Some simple testing
under KVM shows we might be a bit slower in some cases. Whether this
is fairness or something else I'm not sure. The current simple pv
spinlock code can do a directed yield to the lock holder CPU, whereas
the pv qspl here just does a general yield. I think we might actually
be able to change that to also support directed yield. Though I'm
not sure if this is actually the cause of the slowdown yet.

Regarding the paravirt lock, I have taken a further look into the
current PPC spinlock code. There is an equivalent of pv_wait() but no
pv_kick(). Maybe PPC doesn't really need that.

So powerpc has two types of wait, either undirected "all processors" or
directed to a specific processor which has been preempted by the
hypervisor.

The simple spinlock code does a directed wait, because it knows the CPU
which is holding the lock. In this case, there is a sequence that is
used to ensure we don't wait if the condition has become true, and the
target CPU does not need to kick the waiter; it will happen automatically
(see splpar_spin_yield). This is preferable because we only wait as
needed and don't require the kick operation.

Thanks for the explanation.


The pv spinlock code I did uses the undirected wait, because we don't
know the CPU number which we are waiting on. This is undesirable because
it's higher overhead and the wait is not so accurate.

I think perhaps we could change things so we wait on the correct CPU
when queued, which might be good enough (we could also put the lock
owner CPU in the spinlock word, if we add another format).


The LS byte of the lock word is used to indicate locking status. If we 
have less than 255 cpus, we can put the (cpu_nr + 1) into the lock 
byte. The special 0xff value can be used to indicate a cpu number >= 
255 for indirect yield. The required change to the qspinlock code will 
be minimal, I think. 


BTW, we can also keep track of the previous cpu in the waiting queue. 
Due to lock stealing, that may not be the cpu that is holding the lock. 
Maybe we can use this, if available, in case the cpu number is >= 255.


Regards,
Longman
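
A self-contained sketch of the lock-byte encoding described above; the
helper names are made up, 0 keeps meaning "unlocked", CPUs 0..253 encode
as cpu_nr + 1, and 0xff stands for "owner CPU number too large, fall back
to an undirected yield":

#include <assert.h>
#include <stdint.h>

#define OWNER_NONE      (-1)    /* lock byte 0: lock not held */
#define OWNER_UNKNOWN   (-2)    /* lock byte 0xff: cpu too large for the byte */

static uint8_t lock_byte_encode(unsigned int cpu_nr)
{
        return (cpu_nr >= 0xfe) ? 0xff : (uint8_t)(cpu_nr + 1);
}

static int lock_byte_decode(uint8_t locked)
{
        if (locked == 0)
                return OWNER_NONE;
        if (locked == 0xff)
                return OWNER_UNKNOWN;
        return (int)locked - 1;
}

int main(void)
{
        assert(lock_byte_decode(0) == OWNER_NONE);
        assert(lock_byte_decode(lock_byte_encode(0)) == 0);
        assert(lock_byte_decode(lock_byte_encode(200)) == 200);
        assert(lock_byte_decode(lock_byte_encode(300)) == OWNER_UNKNOWN);
        return 0;
}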



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Waiman Long

On 7/8/20 4:41 AM, Peter Zijlstra wrote:

On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

Can you clarify? The slow path is already in use on ARM64 which is weak,
so I doubt there's superfluous serialization present. And Will spent a
fair amount of time on making that thing guarantee forward progress, so
there just isn't too much room to play.


We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

Going by your jump_label implementation, support for static_call should
be fairly straight forward too, no?

   https://lkml.kernel.org/r/20200624153024.794671...@infradead.org

Speaking of static_call, I am also looking forward to it. Do you have an 
idea when that will be merged?


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Waiman Long

On 7/8/20 4:32 AM, Peter Zijlstra wrote:

On Tue, Jul 07, 2020 at 11:33:45PM -0400, Waiman Long wrote:

 From 5d7941a498935fb225b2c7a3108cbf590114c3db Mon Sep 17 00:00:00 2001
From: Waiman Long 
Date: Tue, 7 Jul 2020 22:29:16 -0400
Subject: [PATCH 2/9] locking/pvqspinlock: Introduce
  CONFIG_PARAVIRT_QSPINLOCKS_LITE

Add a new PARAVIRT_QSPINLOCKS_LITE config option that allows
architectures to use the PV qspinlock code without the need to use or
implement a pv_kick() function, thus eliminating the atomic unlock
overhead. The non-atomic queued_spin_unlock() can be used instead.
The pv_wait() function will still be needed, but it can be a dummy
function.

With that option set, the hybrid PV queued/unfair locking code should
still be able to make it performant enough in a paravirtualized

How is this supposed to work? If there is no kick, you have no control
over who wakes up and fairness goes out the window entirely.

You don't even begin to explain...

I don't have a full understanding of how the PPC hypervisor works myself. 
Apparently, a cpu kick may not be needed.


This is just a test patch to see if it yields a better result. It is 
subject to further modification.


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Waiman Long

On 7/8/20 1:10 AM, Nicholas Piggin wrote:

Excerpts from Waiman Long's message of July 8, 2020 1:33 pm:

On 7/7/20 1:57 AM, Nicholas Piggin wrote:

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

And then there seem to be one or two tunable parameters we could
experiment with.

The paravirt locks may need a bit more tuning. Some simple testing
under KVM shows we might be a bit slower in some cases. Whether this
is fairness or something else I'm not sure. The current simple pv
spinlock code can do a directed yield to the lock holder CPU, whereas
the pv qspl here just does a general yield. I think we might actually
be able to change that to also support directed yield. Though I'm
not sure if this is actually the cause of the slowdown yet.

Regarding the paravirt lock, I have taken a further look into the
current PPC spinlock code. There is an equivalent of pv_wait() but no
pv_kick(). Maybe PPC doesn't really need that.

So powerpc has two types of wait, either undirected "all processors" or
directed to a specific processor which has been preempted by the
hypervisor.

The simple spinlock code does a directed wait, because it knows the CPU
which is holding the lock. In this case, there is a sequence that is
used to ensure we don't wait if the condition has become true, and the
target CPU does not need to kick the waiter; it will happen automatically
(see splpar_spin_yield). This is preferable because we only wait as
needed and don't require the kick operation.

Thanks for the explanation.


The pv spinlock code I did uses the undirected wait, because we don't
know the CPU number which we are waiting on. This is undesirable because
it's higher overhead and the wait is not so accurate.

I think perhaps we could change things so we wait on the correct CPU
when queued, which might be good enough (we could also put the lock
owner CPU in the spinlock word, if we add another format).


The LS byte of the lock word is used to indicate locking status. If we 
have less than 255 cpus, we can put the (cpu_nr + 1) into the lock byte. 
The special 0xff value can be used to indicate a cpu number >= 255 for 
indirect yield. The required change to the qspinlock code will be 
minimal, I think.




Attached are two
additional qspinlock patches that add a CONFIG_PARAVIRT_QSPINLOCKS_LITE 
option to not require pv_kick(). There is also a fixup patch to be
applied after your patchset.

I don't have access to a PPC LPAR with shared processor at the moment,
so I can't test the performance of the paravirt code. Would you mind
adding my patches and doing some performance tests on your end to see if it 
gives a better result?

Great, I'll do some tests. Any suggestions for what to try?


I would just like to see if it produces a better performance 
result compared with your current version.


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Peter Zijlstra
On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
> Yes, powerpc could certainly get more performance out of the slow
> paths, and then there are a few parameters to tune.

Can you clarify? The slow path is already in use on ARM64 which is weak,
so I doubt there's superfluous serialization present. And Will spent a
fair amount of time on making that thing guarantee forward progress, so
there just isn't too much room to play.

> We don't have a good alternate patching for function calls yet, but
> that would be something to do for native vs pv.

Going by your jump_label implementation, support for static_call should
be fairly straight forward too, no?

  https://lkml.kernel.org/r/20200624153024.794671...@infradead.org


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Peter Zijlstra
On Tue, Jul 07, 2020 at 11:33:45PM -0400, Waiman Long wrote:
> From 5d7941a498935fb225b2c7a3108cbf590114c3db Mon Sep 17 00:00:00 2001
> From: Waiman Long 
> Date: Tue, 7 Jul 2020 22:29:16 -0400
> Subject: [PATCH 2/9] locking/pvqspinlock: Introduce
>  CONFIG_PARAVIRT_QSPINLOCKS_LITE
> 
> Add a new PARAVIRT_QSPINLOCKS_LITE config option that allows
> architectures to use the PV qspinlock code without the need to use or
> implement a pv_kick() function, thus eliminating the atomic unlock
> overhead. The non-atomic queued_spin_unlock() can be used instead.
> The pv_wait() function will still be needed, but it can be a dummy
> function.
> 
> With that option set, the hybrid PV queued/unfair locking code should
> still be able to make it performant enough in a paravirtualized

How is this supposed to work? If there is no kick, you have no control
over who wakes up and fairness goes out the window entirely.

You don't even begin to explain...


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-07 Thread Nicholas Piggin
Excerpts from Waiman Long's message of July 8, 2020 1:33 pm:
> On 7/7/20 1:57 AM, Nicholas Piggin wrote:
>> Yes, powerpc could certainly get more performance out of the slow
>> paths, and then there are a few parameters to tune.
>>
>> We don't have a good alternate patching for function calls yet, but
>> that would be something to do for native vs pv.
>>
>> And then there seem to be one or two tunable parameters we could
>> experiment with.
>>
>> The paravirt locks may need a bit more tuning. Some simple testing
>> under KVM shows we might be a bit slower in some cases. Whether this
>> is fairness or something else I'm not sure. The current simple pv
>> spinlock code can do a directed yield to the lock holder CPU, whereas
>> the pv qspl here just does a general yield. I think we might actually
>> be able to change that to also support directed yield. Though I'm
>> not sure if this is actually the cause of the slowdown yet.
> 
> Regarding the paravirt lock, I have taken a further look into the 
> current PPC spinlock code. There is an equivalent of pv_wait() but no 
> pv_kick(). Maybe PPC doesn't really need that.

So powerpc has two types of wait, either undirected "all processors" or 
directed to a specific processor which has been preempted by the 
hypervisor.

The simple spinlock code does a directed wait, because it knows the CPU 
which is holding the lock. In this case, there is a sequence that is 
used to ensure we don't wait if the condition has become true, and the
target CPU does not need to kick the waiter; it will happen automatically
(see splpar_spin_yield). This is preferable because we only wait as 
needed and don't require the kick operation.

The pv spinlock code I did uses the undirected wait, because we don't
know the CPU number which we are waiting on. This is undesirable because 
it's higher overhead and the wait is not so accurate.

I think perhaps we could change things so we wait on the correct CPU 
when queued, which might be good enough (we could also put the lock
owner CPU in the spinlock word, if we add another format).
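
A rough sketch of the directed-wait sequence described above, with made-up
names for the hypervisor helpers (hv_read_yield_count(), hv_confer()); the
real code is around splpar_spin_yield(), and this only illustrates the
"re-check the condition, then confer" ordering:

static void directed_yield_to_holder(arch_spinlock_t *lock, u32 lock_value)
{
        unsigned int holder_cpu = lock_value & 0xffff;  /* assume the lock word encodes the holder */
        unsigned int yield_count;

        yield_count = hv_read_yield_count(holder_cpu);  /* hypothetical */
        if ((yield_count & 1) == 0)
                return;         /* holder vCPU is running: keep spinning */

        smp_rmb();
        if (READ_ONCE(lock->slock) != lock_value)
                return;         /* condition changed: don't wait after all */

        hv_confer(holder_cpu, yield_count);             /* hypothetical directed wait */
}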

> Attached are two 
> additional qspinlock patches that add a CONFIG_PARAVIRT_QSPINLOCKS_LITE 
> option to not require pv_kick(). There is also a fixup patch to be 
> applied after your patchset.
> 
> I don't have access to a PPC LPAR with shared processor at the moment, 
> so I can't test the performance of the paravirt code. Would you mind 
> adding my patches and doing some performance tests on your end to see if it 
> gives a better result?

Great, I'll do some tests. Any suggestions for what to try?

Thanks,
Nick


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-07 Thread Waiman Long

On 7/7/20 1:57 AM, Nicholas Piggin wrote:

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

And then there seem to be one or two tunable parameters we could
experiment with.

The paravirt locks may need a bit more tuning. Some simple testing
under KVM shows we might be a bit slower in some cases. Whether this
is fairness or something else I'm not sure. The current simple pv
spinlock code can do a directed yield to the lock holder CPU, whereas
the pv qspl here just does a general yield. I think we might actually
be able to change that to also support directed yield. Though I'm
not sure if this is actually the cause of the slowdown yet.


Regarding the paravirt lock, I have taken a further look into the 
current PPC spinlock code. There is an equivalent of pv_wait() but no 
pv_kick(). Maybe PPC doesn't really need that. Attached are two 
additional qspinlock patches that add a CONFIG_PARAVIRT_QSPINLOCKS_LITE 
option to not require pv_kick(). There is also a fixup patch to be 
applied after your patchset.


I don't have access to a PPC LPAR with shared processor at the moment, 
so I can't test the performance of the paravirt code. Would you mind 
adding my patches and doing some performance tests on your end to see if it 
gives a better result?


Thanks,
Longman

>From 161e545523a7eb4c42c145c04e9a5a15903ba3d9 Mon Sep 17 00:00:00 2001
From: Waiman Long 
Date: Tue, 7 Jul 2020 20:46:51 -0400
Subject: [PATCH 1/9] locking/pvqspinlock: Code relocation and extraction

Move pv_kick_node() and the unlock functions up and extract out the hash
and lock code from pv_wait_head_or_lock() into pv_hash_lock(). There
is no functional change.

Signed-off-by: Waiman Long 
---
 kernel/locking/qspinlock_paravirt.h | 302 ++--
 1 file changed, 156 insertions(+), 146 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..8eec58320b85 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -55,6 +55,7 @@ struct pv_node {
 
 /*
  * Hybrid PV queued/unfair lock
+ * 
  *
  * By replacing the regular queued_spin_trylock() with the function below,
  * it will be called once when a lock waiter enter the PV slowpath before
@@ -259,6 +260,156 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 	BUG();
 }
 
+/*
+ * Insert lock into hash and set _Q_SLOW_VAL.
+ * Return true if lock acquired.
+ */
+static inline bool pv_hash_lock(struct qspinlock *lock, struct pv_node *node)
+{
+	struct qspinlock **lp = pv_hash(lock, node);
+
+	/*
+	 * We must hash before setting _Q_SLOW_VAL, such that
+	 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
+	 * we'll be sure to be able to observe our hash entry.
+	 *
+	 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
+	 *       MB                           RMB
+	 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
+	 *
+	 * Matches the smp_rmb() in __pv_queued_spin_unlock().
+	 */
+	if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
+		/*
+		 * The lock was free and now we own the lock.
+		 * Change the lock value back to _Q_LOCKED_VAL
+		 * and unhash the table.
+		 */
+		WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
+		WRITE_ONCE(*lp, NULL);
+		return true;
+	}
+	return false;
+}
+
+/*
+ * Called after setting next->locked = 1 when we're the lock owner.
+ *
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state
+ * such that they're waiting in pv_wait_head_or_lock(), this avoids a
+ * wake/sleep cycle.
+ */
+static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
+{
+	struct pv_node *pn = (struct pv_node *)node;
+
+	/*
+	 * If the vCPU is indeed halted, advance its state to match that of
+	 * pv_wait_node(). If OTOH this fails, the vCPU was running and will
+	 * observe its next->locked value and advance itself.
+	 *
+	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+	 *
+	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * must be ordered before the read of pn->state in the cmpxchg()
+	 * below for the code to work correctly. To guarantee full ordering
+	 * irrespective of the success or failure of the cmpxchg(),
+	 * a relaxed version with explicit barrier is used. The control
+	 * dependency will order the reading of pn->state before any
+	 * subsequent writes.
+	 */
+	smp_mb__before_atomic();
+	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed)
+	!= vcpu_halted)
+		return;
+
+	/*
+	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
+	 *
+	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
+	 * the hash table later on at unlock time, no atomic instruction is
+	 * needed.
+	 */
+	WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
+	(void)pv_hash(lock, pn);
+}
+
+/*
+ * PV versions 

Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-06 Thread Nicholas Piggin
Excerpts from Waiman Long's message of July 7, 2020 4:39 am:
> On 7/6/20 12:35 AM, Nicholas Piggin wrote:
>> v3 is updated to use __pv_queued_spin_unlock, noticed by Waiman (thank you).
>>
>> Thanks,
>> Nick
>>
>> Nicholas Piggin (6):
>>powerpc/powernv: must include hvcall.h to get PAPR defines
>>powerpc/pseries: move some PAPR paravirt functions to their own file
>>powerpc: move spinlock implementation to simple_spinlock
>>powerpc/64s: implement queued spinlocks and rwlocks
>>powerpc/pseries: implement paravirt qspinlocks for SPLPAR
>>powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the
>>  lock hint
>>
>>   arch/powerpc/Kconfig  |  13 +
>>   arch/powerpc/include/asm/Kbuild   |   2 +
>>   arch/powerpc/include/asm/atomic.h |  28 ++
>>   arch/powerpc/include/asm/paravirt.h   |  89 +
>>   arch/powerpc/include/asm/qspinlock.h  |  91 ++
>>   arch/powerpc/include/asm/qspinlock_paravirt.h |   7 +
>>   arch/powerpc/include/asm/simple_spinlock.h| 292 +
>>   .../include/asm/simple_spinlock_types.h   |  21 ++
>>   arch/powerpc/include/asm/spinlock.h   | 308 +-
>>   arch/powerpc/include/asm/spinlock_types.h |  17 +-
>>   arch/powerpc/lib/Makefile |   3 +
>>   arch/powerpc/lib/locks.c  |  12 +-
>>   arch/powerpc/platforms/powernv/pci-ioda-tce.c |   1 +
>>   arch/powerpc/platforms/pseries/Kconfig|   5 +
>>   arch/powerpc/platforms/pseries/setup.c|   6 +-
>>   include/asm-generic/qspinlock.h   |   4 +
>>   16 files changed, 577 insertions(+), 322 deletions(-)
>>   create mode 100644 arch/powerpc/include/asm/paravirt.h
>>   create mode 100644 arch/powerpc/include/asm/qspinlock.h
>>   create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
>>   create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
>>   create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h
>>
> This patch looks OK to me.

Thanks for reviewing and testing.

> I had run some microbenchmark on powerpc system with or w/o the patch.
> 
> On a 2-socket 160-thread SMT4 POWER9 system (not virtualized):
> 
> 5.8.0-rc4
> =
> 
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 160, Min/Mean/Max = 77,665/90,153/106,895
> Threads = 160, Total Rate = 1,441,759 op/s; Percpu Rate = 9,011 op/s
> 
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 160, Min/Mean/Max = 47,879/53,807/63,689
> Threads = 160, Total Rate = 860,192 op/s; Percpu Rate = 5,376 op/s
> 
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 80, Min/Mean/Max = 242,907/319,514/463,161
> Threads = 80, Total Rate = 2,555 kop/s; Percpu Rate = 32 kop/s
> 
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 80, Min/Mean/Max = 146,161/187,474/259,270
> Threads = 80, Total Rate = 1,498 kop/s; Percpu Rate = 19 kop/s
> 
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 40, Min/Mean/Max = 646,639/1,000,817/1,455,205
> Threads = 40, Total Rate = 4,001 kop/s; Percpu Rate = 100 kop/s
> 
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 40, Min/Mean/Max = 402,165/597,132/814,555
> Threads = 40, Total Rate = 2,388 kop/s; Percpu Rate = 60 kop/s
> 
> 5.8.0-rc4-qlock+
> 
> 
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 160, Min/Mean/Max = 123,835/124,580/124,587
> Threads = 160, Total Rate = 1,992 kop/s; Percpu Rate = 12 kop/s
> 
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 160, Min/Mean/Max = 254,210/264,714/276,784
> Threads = 160, Total Rate = 4,231 kop/s; Percpu Rate = 26 kop/s
> 
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 80, Min/Mean/Max = 599,715/603,397/603,450
> Threads = 80, Total Rate = 4,825 kop/s; Percpu Rate = 60 kop/s
> 
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 80, Min/Mean/Max = 492,687/525,224/567,456
> Threads = 80, Total Rate = 4,199 kop/s; Percpu Rate = 52 kop/s
> 
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 40, Min/Mean/Max = 1,325,623/1,325,628/1,325,636
> Threads = 40, Total Rate = 5,299 kop/s; Percpu Rate = 132 kop/s
> 
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 40, Min/Mean/Max = 1,249,731/1,292,977/1,342,815
> Threads = 40, Total Rate = 5,168 kop/s; Percpu Rate = 129 kop/s
> 
> On systems with a large number of CPUs, qspinlock is faster and more fair.
> 
> With some tuning, we may be able to squeeze out more performance.

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

And then there seem to be one or two tunable parameters we could experiment with.

Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-06 Thread Waiman Long

On 7/6/20 12:35 AM, Nicholas Piggin wrote:

v3 is updated to use __pv_queued_spin_unlock, noticed by Waiman (thank you).

Thanks,
Nick

Nicholas Piggin (6):
   powerpc/powernv: must include hvcall.h to get PAPR defines
   powerpc/pseries: move some PAPR paravirt functions to their own file
   powerpc: move spinlock implementation to simple_spinlock
   powerpc/64s: implement queued spinlocks and rwlocks
   powerpc/pseries: implement paravirt qspinlocks for SPLPAR
   powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the
 lock hint

  arch/powerpc/Kconfig  |  13 +
  arch/powerpc/include/asm/Kbuild   |   2 +
  arch/powerpc/include/asm/atomic.h |  28 ++
  arch/powerpc/include/asm/paravirt.h   |  89 +
  arch/powerpc/include/asm/qspinlock.h  |  91 ++
  arch/powerpc/include/asm/qspinlock_paravirt.h |   7 +
  arch/powerpc/include/asm/simple_spinlock.h| 292 +
  .../include/asm/simple_spinlock_types.h   |  21 ++
  arch/powerpc/include/asm/spinlock.h   | 308 +-
  arch/powerpc/include/asm/spinlock_types.h |  17 +-
  arch/powerpc/lib/Makefile |   3 +
  arch/powerpc/lib/locks.c  |  12 +-
  arch/powerpc/platforms/powernv/pci-ioda-tce.c |   1 +
  arch/powerpc/platforms/pseries/Kconfig|   5 +
  arch/powerpc/platforms/pseries/setup.c|   6 +-
  include/asm-generic/qspinlock.h   |   4 +
  16 files changed, 577 insertions(+), 322 deletions(-)
  create mode 100644 arch/powerpc/include/asm/paravirt.h
  create mode 100644 arch/powerpc/include/asm/qspinlock.h
  create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
  create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
  create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h


This patch looks OK to me.

I had run some microbenchmark on powerpc system with or w/o the patch.

On a 2-socket 160-thread SMT4 POWER9 system (not virtualized):

5.8.0-rc4
=

Running locktest with spinlock [runtime = 10s, load = 1]
Threads = 160, Min/Mean/Max = 77,665/90,153/106,895
Threads = 160, Total Rate = 1,441,759 op/s; Percpu Rate = 9,011 op/s

Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
Threads = 160, Min/Mean/Max = 47,879/53,807/63,689
Threads = 160, Total Rate = 860,192 op/s; Percpu Rate = 5,376 op/s

Running locktest with spinlock [runtime = 10s, load = 1]
Threads = 80, Min/Mean/Max = 242,907/319,514/463,161
Threads = 80, Total Rate = 2,555 kop/s; Percpu Rate = 32 kop/s

Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
Threads = 80, Min/Mean/Max = 146,161/187,474/259,270
Threads = 80, Total Rate = 1,498 kop/s; Percpu Rate = 19 kop/s

Running locktest with spinlock [runtime = 10s, load = 1]
Threads = 40, Min/Mean/Max = 646,639/1,000,817/1,455,205
Threads = 40, Total Rate = 4,001 kop/s; Percpu Rate = 100 kop/s

Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
Threads = 40, Min/Mean/Max = 402,165/597,132/814,555
Threads = 40, Total Rate = 2,388 kop/s; Percpu Rate = 60 kop/s

5.8.0-rc4-qlock+


Running locktest with spinlock [runtime = 10s, load = 1]
Threads = 160, Min/Mean/Max = 123,835/124,580/124,587
Threads = 160, Total Rate = 1,992 kop/s; Percpu Rate = 12 kop/s

Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
Threads = 160, Min/Mean/Max = 254,210/264,714/276,784
Threads = 160, Total Rate = 4,231 kop/s; Percpu Rate = 26 kop/s

Running locktest with spinlock [runtime = 10s, load = 1]
Threads = 80, Min/Mean/Max = 599,715/603,397/603,450
Threads = 80, Total Rate = 4,825 kop/s; Percpu Rate = 60 kop/s

Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
Threads = 80, Min/Mean/Max = 492,687/525,224/567,456
Threads = 80, Total Rate = 4,199 kop/s; Percpu Rate = 52 kop/s

Running locktest with spinlock [runtime = 10s, load = 1]
Threads = 40, Min/Mean/Max = 1,325,623/1,325,628/1,325,636
Threads = 40, Total Rate = 5,299 kop/s; Percpu Rate = 132 kop/s

Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
Threads = 40, Min/Mean/Max = 1,249,731/1,292,977/1,342,815
Threads = 40, Total Rate = 5,168 kop/s; Percpu Rate = 129 kop/s

On systems with a large number of CPUs, qspinlock is faster and more fair.

With some tuning, we may be able to squeeze out more performance.
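
As a quick check of "faster and more fair" against the 160-thread spinlock
numbers above, the per-thread Max/Min spread and the total-rate gain are
roughly:

\[
\frac{106{,}895}{77{,}665} \approx 1.38 \;(\text{5.8.0-rc4})
\quad\longrightarrow\quad
\frac{124{,}587}{123{,}835} \approx 1.01 \;(\text{qlock+}),
\qquad
\frac{1{,}992\ \text{kop/s}}{1{,}442\ \text{kop/s}} \approx 1.38\times .
\]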

Cheers,
Longman



[PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-05 Thread Nicholas Piggin
v3 is updated to use __pv_queued_spin_unlock, noticed by Waiman (thank you).

Thanks,
Nick

Nicholas Piggin (6):
  powerpc/powernv: must include hvcall.h to get PAPR defines
  powerpc/pseries: move some PAPR paravirt functions to their own file
  powerpc: move spinlock implementation to simple_spinlock
  powerpc/64s: implement queued spinlocks and rwlocks
  powerpc/pseries: implement paravirt qspinlocks for SPLPAR
  powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the
lock hint

 arch/powerpc/Kconfig  |  13 +
 arch/powerpc/include/asm/Kbuild   |   2 +
 arch/powerpc/include/asm/atomic.h |  28 ++
 arch/powerpc/include/asm/paravirt.h   |  89 +
 arch/powerpc/include/asm/qspinlock.h  |  91 ++
 arch/powerpc/include/asm/qspinlock_paravirt.h |   7 +
 arch/powerpc/include/asm/simple_spinlock.h| 292 +
 .../include/asm/simple_spinlock_types.h   |  21 ++
 arch/powerpc/include/asm/spinlock.h   | 308 +-
 arch/powerpc/include/asm/spinlock_types.h |  17 +-
 arch/powerpc/lib/Makefile |   3 +
 arch/powerpc/lib/locks.c  |  12 +-
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |   1 +
 arch/powerpc/platforms/pseries/Kconfig|   5 +
 arch/powerpc/platforms/pseries/setup.c|   6 +-
 include/asm-generic/qspinlock.h   |   4 +
 16 files changed, 577 insertions(+), 322 deletions(-)
 create mode 100644 arch/powerpc/include/asm/paravirt.h
 create mode 100644 arch/powerpc/include/asm/qspinlock.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
 create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
 create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h

-- 
2.23.0