Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-15 Thread Will Deacon
On Thu, Feb 14, 2019 at 10:09:44AM -0800, Linus Torvalds wrote:
> On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds
>  wrote:
> >
> > The arm64 numbers scaled horribly even before, and that's because
> > there is too much ping-pong, and it's probably because there is no
> > "stickiness" to the cacheline to the core, and thus adding the extra
> > loop can make the ping-pong issue even worse because now there is more
> > of it.
> 
> Actually, if it's using the ll/sc, then I don't see why arm64 should
> even change. It doesn't really even change the pattern: the initial
> load of the value is just replaced with a "ll" that gets a non-zero
> value, and then we re-try without even doing the "sc" part.

So our cmpxchg() has a prefetch-with-intent-to-modify instruction before the
'll' part, which will attempt to grab the line unique the first time round.
The 'll' also has acquire semantics, so there's the chance for the
micro-architecture to handle that badly too.

I think that the problem with the proposed changed change is that whenever a
reader tries to acquire an rwsem that is already held for read, it will
always fail the first cmpxchg(), so in this situation the read path is
considerably slower than before.

> End result: exact same "load once, then do ll/sc to update". Just
> using a slightly different instruction pattern.
> 
> But maybe "ll" does something different to the cacheline than a regular "ld"?
> 
> Alternatively, the machine you used is using LSE, and the "swp" thing
> has some horrid behavior when it fails.

Depending on where the data is, the LSE instructions may execute outside of
the CPU (e.g. in a cache controller) and so could add latency to a failing
CAS.

Will


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Waiman Long
On 02/14/2019 01:02 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>>> it generate slightly better code.
>>>
>>> Before this patch, down_read_trylock:
>>>
>>>0x <+0>: callq  0x5 
>>>0x0005 <+5>: jmp0x18 
>>>0x0007 <+7>: lea0x1(%rdx),%rcx
>>>0x000b <+11>:mov%rdx,%rax
>>>0x000e <+14>:lock cmpxchg %rcx,(%rdi)
>>>0x0013 <+19>:cmp%rax,%rdx
>>>0x0016 <+22>:je 0x23 
>>>0x0018 <+24>:mov(%rdi),%rdx
>>>0x001b <+27>:test   %rdx,%rdx
>>>0x001e <+30>:jns0x7 
>>>0x0020 <+32>:xor%eax,%eax
>>>0x0022 <+34>:retq
>>>0x0023 <+35>:mov%gs:0x0,%rax
>>>0x002c <+44>:or $0x3,%rax
>>>0x0030 <+48>:mov%rax,0x20(%rdi)
>>>0x0034 <+52>:mov$0x1,%eax
>>>0x0039 <+57>:retq
>>>
>>> After patch, down_read_trylock:
>>>
>>>0x <+0>: callq  0x5 
>>>0x0005 <+5>: xor%eax,%eax
>>>0x0007 <+7>: lea0x1(%rax),%rdx
>>>0x000b <+11>:lock cmpxchg %rdx,(%rdi)
>>>0x0010 <+16>:jne0x29 
>>>0x0012 <+18>:mov%gs:0x0,%rax
>>>0x001b <+27>:or $0x3,%rax
>>>0x001f <+31>:mov%rax,0x20(%rdi)
>>>0x0023 <+35>:mov$0x1,%eax
>>>0x0028 <+40>:retq
>>>0x0029 <+41>:test   %rax,%rax
>>>0x002c <+44>:jns0x7 
>>>0x002e <+46>:xor%eax,%eax
>>>0x0030 <+48>:retq
>>>
>>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>>> load of 10 to lengthen the lock critical section) on a x86-64 system
>>> before and after the patch were:
>>>
>>>  Before PatchAfter Patch
>>># of Threads rlock   rlock
>>> -   -
>>> 1   14,496  14,716
>>> 28,644   8,453
>>> 46,799   6,983
>>> 85,664   7,190
>>>
>>> On a ARM64 system, the performance results were:
>>>
>>>  Before PatchAfter Patch
>>># of Threads rlock   rlock
>>> -   -
>>> 1   23,676  24,488
>>> 27,697   9,502
>>> 44,945   3,440
>>> 82,641   1,603
>> Urgh, yes LL/SC is the obvious exception that can actually do better
>> here :/
>>
>> Will, what say you?
> What machine were these numbers generated on and is it using LL/SC or LSE
> atomics for arm64? If you stick the microbenchmark somewhere, I can go play
> with a broader variety of h/w.
>
> Will

The machine is a 2-socket Cavium ThunderX2 99xx system with 64 cores and
256 threads. I was just using threads from the first socket for this
test. The microbenchmark that I used is attached. I used the command
"./run-locktest -ltryrwsem -r100 -i-10 -c10 -n" to generate the
locking rates.

The lscpu flags were:

fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

Cheers,
Longman



locktest.tar.gz
Description: application/gzip


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Linus Torvalds
On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds
 wrote:
>
> The arm64 numbers scaled horribly even before, and that's because
> there is too much ping-pong, and it's probably because there is no
> "stickiness" to the cacheline to the core, and thus adding the extra
> loop can make the ping-pong issue even worse because now there is more
> of it.

Actually, if it's using the ll/sc, then I don't see why arm64 should
even change. It doesn't really even change the pattern: the initial
load of the value is just replaced with a "ll" that gets a non-zero
value, and then we re-try without even doing the "sc" part.

End result: exact same "load once, then do ll/sc to update". Just
using a slightly different instruction pattern.

But maybe "ll" does something different to the cacheline than a regular "ld"?

Alternatively, the machine you used is using LSE, and the "swp" thing
has some horrid behavior when it fails.

So I take it back. I'm actually surprised that arm64 performs worse. I
don't think it should. But numbers walk, bullshit talks, and it
clearly does make for worse numbers on arm64.

   Linus


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Will Deacon
On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> > Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> > it generate slightly better code.
> > 
> > Before this patch, down_read_trylock:
> > 
> >0x <+0>: callq  0x5 
> >0x0005 <+5>: jmp0x18 
> >0x0007 <+7>: lea0x1(%rdx),%rcx
> >0x000b <+11>:mov%rdx,%rax
> >0x000e <+14>:lock cmpxchg %rcx,(%rdi)
> >0x0013 <+19>:cmp%rax,%rdx
> >0x0016 <+22>:je 0x23 
> >0x0018 <+24>:mov(%rdi),%rdx
> >0x001b <+27>:test   %rdx,%rdx
> >0x001e <+30>:jns0x7 
> >0x0020 <+32>:xor%eax,%eax
> >0x0022 <+34>:retq
> >0x0023 <+35>:mov%gs:0x0,%rax
> >0x002c <+44>:or $0x3,%rax
> >0x0030 <+48>:mov%rax,0x20(%rdi)
> >0x0034 <+52>:mov$0x1,%eax
> >0x0039 <+57>:retq
> > 
> > After patch, down_read_trylock:
> > 
> >0x <+0>: callq  0x5 
> >0x0005 <+5>: xor%eax,%eax
> >0x0007 <+7>: lea0x1(%rax),%rdx
> >0x000b <+11>:lock cmpxchg %rdx,(%rdi)
> >0x0010 <+16>:jne0x29 
> >0x0012 <+18>:mov%gs:0x0,%rax
> >0x001b <+27>:or $0x3,%rax
> >0x001f <+31>:mov%rax,0x20(%rdi)
> >0x0023 <+35>:mov$0x1,%eax
> >0x0028 <+40>:retq
> >0x0029 <+41>:test   %rax,%rax
> >0x002c <+44>:jns0x7 
> >0x002e <+46>:xor%eax,%eax
> >0x0030 <+48>:retq
> > 
> > By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> > load of 10 to lengthen the lock critical section) on a x86-64 system
> > before and after the patch were:
> > 
> >  Before PatchAfter Patch
> ># of Threads rlock   rlock
> > -   -
> > 1   14,496  14,716
> > 28,644   8,453
> > 46,799   6,983
> > 85,664   7,190
> > 
> > On a ARM64 system, the performance results were:
> > 
> >  Before PatchAfter Patch
> ># of Threads rlock   rlock
> > -   -
> > 1   23,676  24,488
> > 27,697   9,502
> > 44,945   3,440
> > 82,641   1,603
> 
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
> 
> Will, what say you?

What machine were these numbers generated on and is it using LL/SC or LSE
atomics for arm64? If you stick the microbenchmark somewhere, I can go play
with a broader variety of h/w.

Will


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Linus Torvalds
On Thu, Feb 14, 2019 at 6:53 AM Waiman Long  wrote:
>
> The ARM64 result is what I would have expected given that the change was
> to optimize for the uncontended case. The x86-64 result is kind of an
> anomaly to me, but I haven't bothered to dig into that.

I would say that the ARM result is what I'd expect from something that
scales badly to begin with.

The x86-64 result is the expected one: yes, the cmpxchg is done one
extra time, but it results in fewer cache transitions (the cacheline
never goes into "shared" state), and cache transitions are what
matter.

The cost of re-doing the instruction should be low. The cacheline
ping-pong and the cache coherency messages is what hurts.

So I actually think both are very easily explained.

The x86-64 number improves, because there is less cache coherency traffic.

The arm64 numbers scaled horribly even before, and that's because
there is too much ping-pong, and it's probably because there is no
"stickiness" to the cacheline to the core, and thus adding the extra
loop can make the ping-pong issue even worse because now there is more
of it.

The cachelines not sticking at all to a core probably is good for
fairness issues (in particular, sticking *too* much can cause horrible
issues), but it's absolutely horrible if it means that you lose the
cacheline even before you get to complete the second cmpxchg.

  Linus


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Waiman Long
On 02/14/2019 05:33 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> Before this patch, down_read_trylock:
>>
>>0x <+0>: callq  0x5 
>>0x0005 <+5>: jmp0x18 
>>0x0007 <+7>: lea0x1(%rdx),%rcx
>>0x000b <+11>:mov%rdx,%rax
>>0x000e <+14>:lock cmpxchg %rcx,(%rdi)
>>0x0013 <+19>:cmp%rax,%rdx
>>0x0016 <+22>:je 0x23 
>>0x0018 <+24>:mov(%rdi),%rdx
>>0x001b <+27>:test   %rdx,%rdx
>>0x001e <+30>:jns0x7 
>>0x0020 <+32>:xor%eax,%eax
>>0x0022 <+34>:retq
>>0x0023 <+35>:mov%gs:0x0,%rax
>>0x002c <+44>:or $0x3,%rax
>>0x0030 <+48>:mov%rax,0x20(%rdi)
>>0x0034 <+52>:mov$0x1,%eax
>>0x0039 <+57>:retq
>>
>> After patch, down_read_trylock:
>>
>>0x <+0>:  callq  0x5 
>>0x0005 <+5>:  xor%eax,%eax
>>0x0007 <+7>:  lea0x1(%rax),%rdx
>>0x000b <+11>: lock cmpxchg %rdx,(%rdi)
>>0x0010 <+16>: jne0x29 
>>0x0012 <+18>: mov%gs:0x0,%rax
>>0x001b <+27>: or $0x3,%rax
>>0x001f <+31>: mov%rax,0x20(%rdi)
>>0x0023 <+35>: mov$0x1,%eax
>>0x0028 <+40>: retq
>>0x0029 <+41>: test   %rax,%rax
>>0x002c <+44>: jns0x7 
>>0x002e <+46>: xor%eax,%eax
>>0x0030 <+48>: retq
>>
>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>> load of 10 to lengthen the lock critical section) on a x86-64 system
>> before and after the patch were:
>>
>>  Before PatchAfter Patch
>># of Threads rlock   rlock
>> -   -
>> 1   14,496  14,716
>> 28,644   8,453
>>  46,799   6,983
>>  85,664   7,190
>>
>> On a ARM64 system, the performance results were:
>>
>>  Before PatchAfter Patch
>># of Threads rlock   rlock
>> -   -
>> 1   23,676  24,488
>> 27,697   9,502
>> 44,945   3,440
>> 82,641   1,603
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
>
> Will, what say you?

The ARM64 result is what I would have expected given that the change was
to optimize for the uncontended case. The x86-64 result is kind of an
anomaly to me, but I haven't bothered to dig into that.

Cheers,
Longman



Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Peter Zijlstra
On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
> 
> Before this patch, down_read_trylock:
> 
>0x <+0>: callq  0x5 
>0x0005 <+5>: jmp0x18 
>0x0007 <+7>: lea0x1(%rdx),%rcx
>0x000b <+11>:mov%rdx,%rax
>0x000e <+14>:lock cmpxchg %rcx,(%rdi)
>0x0013 <+19>:cmp%rax,%rdx
>0x0016 <+22>:je 0x23 
>0x0018 <+24>:mov(%rdi),%rdx
>0x001b <+27>:test   %rdx,%rdx
>0x001e <+30>:jns0x7 
>0x0020 <+32>:xor%eax,%eax
>0x0022 <+34>:retq
>0x0023 <+35>:mov%gs:0x0,%rax
>0x002c <+44>:or $0x3,%rax
>0x0030 <+48>:mov%rax,0x20(%rdi)
>0x0034 <+52>:mov$0x1,%eax
>0x0039 <+57>:retq
> 
> After patch, down_read_trylock:
> 
>0x <+0>:   callq  0x5 
>0x0005 <+5>:   xor%eax,%eax
>0x0007 <+7>:   lea0x1(%rax),%rdx
>0x000b <+11>:  lock cmpxchg %rdx,(%rdi)
>0x0010 <+16>:  jne0x29 
>0x0012 <+18>:  mov%gs:0x0,%rax
>0x001b <+27>:  or $0x3,%rax
>0x001f <+31>:  mov%rax,0x20(%rdi)
>0x0023 <+35>:  mov$0x1,%eax
>0x0028 <+40>:  retq
>0x0029 <+41>:  test   %rax,%rax
>0x002c <+44>:  jns0x7 
>0x002e <+46>:  xor%eax,%eax
>0x0030 <+48>:  retq
> 
> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> load of 10 to lengthen the lock critical section) on a x86-64 system
> before and after the patch were:
> 
>  Before PatchAfter Patch
># of Threads rlock   rlock
> -   -
> 1   14,496  14,716
> 28,644   8,453
>   46,799   6,983
>   85,664   7,190
> 
> On a ARM64 system, the performance results were:
> 
>  Before PatchAfter Patch
># of Threads rlock   rlock
> -   -
> 1   23,676  24,488
> 27,697   9,502
> 44,945   3,440
> 82,641   1,603

Urgh, yes LL/SC is the obvious exception that can actually do better
here :/

Will, what say you?


[PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-13 Thread Waiman Long
Modify __down_read_trylock() to optimize for an unlocked rwsem and make
it generate slightly better code.

Before this patch, down_read_trylock:

   0x <+0>: callq  0x5 
   0x0005 <+5>: jmp0x18 
   0x0007 <+7>: lea0x1(%rdx),%rcx
   0x000b <+11>:mov%rdx,%rax
   0x000e <+14>:lock cmpxchg %rcx,(%rdi)
   0x0013 <+19>:cmp%rax,%rdx
   0x0016 <+22>:je 0x23 
   0x0018 <+24>:mov(%rdi),%rdx
   0x001b <+27>:test   %rdx,%rdx
   0x001e <+30>:jns0x7 
   0x0020 <+32>:xor%eax,%eax
   0x0022 <+34>:retq
   0x0023 <+35>:mov%gs:0x0,%rax
   0x002c <+44>:or $0x3,%rax
   0x0030 <+48>:mov%rax,0x20(%rdi)
   0x0034 <+52>:mov$0x1,%eax
   0x0039 <+57>:retq

After patch, down_read_trylock:

   0x <+0>: callq  0x5 
   0x0005 <+5>: xor%eax,%eax
   0x0007 <+7>: lea0x1(%rax),%rdx
   0x000b <+11>:lock cmpxchg %rdx,(%rdi)
   0x0010 <+16>:jne0x29 
   0x0012 <+18>:mov%gs:0x0,%rax
   0x001b <+27>:or $0x3,%rax
   0x001f <+31>:mov%rax,0x20(%rdi)
   0x0023 <+35>:mov$0x1,%eax
   0x0028 <+40>:retq
   0x0029 <+41>:test   %rax,%rax
   0x002c <+44>:jns0x7 
   0x002e <+46>:xor%eax,%eax
   0x0030 <+48>:retq

By using a rwsem microbenchmark, the down_read_trylock() rate (with a
load of 10 to lengthen the lock critical section) on a x86-64 system
before and after the patch were:

 Before PatchAfter Patch
   # of Threads rlock   rlock
    -   -
1   14,496  14,716
28,644   8,453
46,799   6,983
85,664   7,190

On a ARM64 system, the performance results were:

 Before PatchAfter Patch
   # of Threads rlock   rlock
    -   -
1   23,676  24,488
27,697   9,502
44,945   3,440
82,641   1,603

For the uncontended case (1 thread), the new down_read_trylock() is a
little bit faster. For the contended cases, the new down_read_trylock()
perform pretty well in x86-64, but performance degrades at high
contention level on ARM64.

Suggested-by: Linus Torvalds 
Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 067e265..e0bcc11 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -175,14 +175,17 @@ static inline int __down_read_killable(struct 
rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-   long tmp;
+   /*
+* Optimize for the case when the rwsem is not locked at all.
+*/
+   long tmp = RWSEM_UNLOCKED_VALUE;
 
-   while ((tmp = atomic_long_read(>count)) >= 0) {
-   if (tmp == atomic_long_cmpxchg_acquire(>count, tmp,
-  tmp + RWSEM_ACTIVE_READ_BIAS)) {
+   do {
+   if (atomic_long_try_cmpxchg_acquire(>count, ,
+   tmp + RWSEM_ACTIVE_READ_BIAS)) {
return 1;
}
-   }
+   } while (tmp >= 0);
return 0;
 }
 
-- 
1.8.3.1