Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Thu, Feb 14, 2019 at 10:09:44AM -0800, Linus Torvalds wrote:
> On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds wrote:
> >
> > The arm64 numbers scaled horribly even before, and that's because
> > there is too much ping-pong, and it's probably because there is no
> > "stickiness" to the cacheline to the core, and thus adding the extra
> > loop can make the ping-pong issue even worse because now there is more
> > of it.
>
> Actually, if it's using the ll/sc, then I don't see why arm64 should
> even change. It doesn't really even change the pattern: the initial
> load of the value is just replaced with a "ll" that gets a non-zero
> value, and then we re-try without even doing the "sc" part.

So our cmpxchg() has a prefetch-with-intent-to-modify instruction before the
'll' part, which will attempt to grab the line unique the first time round.
The 'll' also has acquire semantics, so there's the chance for the
micro-architecture to handle that badly too.

I think that the problem with the proposed change is that whenever a
reader tries to acquire an rwsem that is already held for read, it will
always fail the first cmpxchg(), so in this situation the read path is
considerably slower than before.

> End result: exact same "load once, then do ll/sc to update". Just
> using a slightly different instruction pattern.
>
> But maybe "ll" does something different to the cacheline than a regular "ld"?
>
> Alternatively, the machine you used is using LSE, and the "swp" thing
> has some horrid behavior when it fails.

Depending on where the data is, the LSE instructions may execute outside of
the CPU (e.g. in a cache controller) and so could add latency to a failing
CAS.

Will
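The observation that the first cmpxchg() always fails on a read-held rwsem can be checked with a small single-threaded user-space sketch in C11. It only counts compare-and-swap attempts and says nothing about cache behaviour or timing. The names counted_read_trylock(), attempts_from() and cas_attempts are hypothetical, and the count encoding (0 unlocked, positive for readers, negative for a writer) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical instrumentation: number of CAS attempts (single-threaded only). */
static int cas_attempts;

/* Same loop shape as the patched __down_read_trylock(): guess "unlocked" first. */
static int counted_read_trylock(atomic_long *count)
{
	long tmp = 0;	/* assumed unlocked value */

	do {
		cas_attempts++;
		/* On failure, tmp is overwritten with the value actually observed. */
		if (atomic_compare_exchange_strong_explicit(count, &tmp, tmp + 1,
							    memory_order_acquire,
							    memory_order_relaxed))
			return 1;
	} while (tmp >= 0);
	return 0;
}

/* How many CAS attempts does one trylock need from a given starting count? */
static int attempts_from(long start)
{
	atomic_long count;

	atomic_init(&count, start);
	cas_attempts = 0;
	counted_read_trylock(&count);
	return cas_attempts;
}

/* Usage: one attempt when unlocked, two when already read-held. */
static void check_attempts(void)
{
	assert(attempts_from(0) == 1);	/* unlocked: the guess is right */
	assert(attempts_from(1) == 2);	/* one reader: first CAS fails, retry succeeds */
	assert(attempts_from(-1) == 1);	/* writer (negative): one failed CAS, then give up */
}
```

With no concurrent updates, a read-held count always costs a second attempt, which is the extra work described above.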
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On 02/14/2019 01:02 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>>> it generate slightly better code.
>>>
>>> Before this patch, down_read_trylock:
>>>
>>>    0x0000 <+0>:  callq  0x5
>>>    0x0005 <+5>:  jmp    0x18
>>>    0x0007 <+7>:  lea    0x1(%rdx),%rcx
>>>    0x000b <+11>: mov    %rdx,%rax
>>>    0x000e <+14>: lock cmpxchg %rcx,(%rdi)
>>>    0x0013 <+19>: cmp    %rax,%rdx
>>>    0x0016 <+22>: je     0x23
>>>    0x0018 <+24>: mov    (%rdi),%rdx
>>>    0x001b <+27>: test   %rdx,%rdx
>>>    0x001e <+30>: jns    0x7
>>>    0x0020 <+32>: xor    %eax,%eax
>>>    0x0022 <+34>: retq
>>>    0x0023 <+35>: mov    %gs:0x0,%rax
>>>    0x002c <+44>: or     $0x3,%rax
>>>    0x0030 <+48>: mov    %rax,0x20(%rdi)
>>>    0x0034 <+52>: mov    $0x1,%eax
>>>    0x0039 <+57>: retq
>>>
>>> After patch, down_read_trylock:
>>>
>>>    0x0000 <+0>:  callq  0x5
>>>    0x0005 <+5>:  xor    %eax,%eax
>>>    0x0007 <+7>:  lea    0x1(%rax),%rdx
>>>    0x000b <+11>: lock cmpxchg %rdx,(%rdi)
>>>    0x0010 <+16>: jne    0x29
>>>    0x0012 <+18>: mov    %gs:0x0,%rax
>>>    0x001b <+27>: or     $0x3,%rax
>>>    0x001f <+31>: mov    %rax,0x20(%rdi)
>>>    0x0023 <+35>: mov    $0x1,%eax
>>>    0x0028 <+40>: retq
>>>    0x0029 <+41>: test   %rax,%rax
>>>    0x002c <+44>: jns    0x7
>>>    0x002e <+46>: xor    %eax,%eax
>>>    0x0030 <+48>: retq
>>>
>>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>>> load of 10 to lengthen the lock critical section) on a x86-64 system
>>> before and after the patch were:
>>>
>>>                  Before Patch    After Patch
>>>   # of Threads      rlock           rlock
>>>   ------------      -----           -----
>>>        1           14,496          14,716
>>>        2            8,644           8,453
>>>        4            6,799           6,983
>>>        8            5,664           7,190
>>>
>>> On a ARM64 system, the performance results were:
>>>
>>>                  Before Patch    After Patch
>>>   # of Threads      rlock           rlock
>>>   ------------      -----           -----
>>>        1           23,676          24,488
>>>        2            7,697           9,502
>>>        4            4,945           3,440
>>>        8            2,641           1,603
>> Urgh, yes LL/SC is the obvious exception that can actually do better
>> here :/
>>
>> Will, what say you?
> What machine were these numbers generated on and is it using LL/SC or LSE
> atomics for arm64?
> If you stick the microbenchmark somewhere, I can go play
> with a broader variety of h/w.
>
> Will

The machine is a 2-socket Cavium ThunderX2 99xx system with 64 cores and
256 threads. I was just using threads from the first socket for this test.

The microbenchmark that I used is attached. I used the command
"./run-locktest -ltryrwsem -r100 -i-10 -c10 -n" to generate the locking
rates.

The lscpu flags were:

  fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

Cheers,
Longman

locktest.tar.gz
Description: application/gzip
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds wrote:
>
> The arm64 numbers scaled horribly even before, and that's because
> there is too much ping-pong, and it's probably because there is no
> "stickiness" to the cacheline to the core, and thus adding the extra
> loop can make the ping-pong issue even worse because now there is more
> of it.

Actually, if it's using the ll/sc, then I don't see why arm64 should
even change. It doesn't really even change the pattern: the initial
load of the value is just replaced with a "ll" that gets a non-zero
value, and then we re-try without even doing the "sc" part.

End result: exact same "load once, then do ll/sc to update". Just
using a slightly different instruction pattern.

But maybe "ll" does something different to the cacheline than a regular "ld"?

Alternatively, the machine you used is using LSE, and the "swp" thing
has some horrid behavior when it fails.

So I take it back. I'm actually surprised that arm64 performs worse. I
don't think it should. But numbers walk, bullshit talks, and it
clearly does make for worse numbers on arm64.

Linus
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> > Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> > it generate slightly better code.
> >
> > Before this patch, down_read_trylock:
> >
> >    0x0000 <+0>:  callq  0x5
> >    0x0005 <+5>:  jmp    0x18
> >    0x0007 <+7>:  lea    0x1(%rdx),%rcx
> >    0x000b <+11>: mov    %rdx,%rax
> >    0x000e <+14>: lock cmpxchg %rcx,(%rdi)
> >    0x0013 <+19>: cmp    %rax,%rdx
> >    0x0016 <+22>: je     0x23
> >    0x0018 <+24>: mov    (%rdi),%rdx
> >    0x001b <+27>: test   %rdx,%rdx
> >    0x001e <+30>: jns    0x7
> >    0x0020 <+32>: xor    %eax,%eax
> >    0x0022 <+34>: retq
> >    0x0023 <+35>: mov    %gs:0x0,%rax
> >    0x002c <+44>: or     $0x3,%rax
> >    0x0030 <+48>: mov    %rax,0x20(%rdi)
> >    0x0034 <+52>: mov    $0x1,%eax
> >    0x0039 <+57>: retq
> >
> > After patch, down_read_trylock:
> >
> >    0x0000 <+0>:  callq  0x5
> >    0x0005 <+5>:  xor    %eax,%eax
> >    0x0007 <+7>:  lea    0x1(%rax),%rdx
> >    0x000b <+11>: lock cmpxchg %rdx,(%rdi)
> >    0x0010 <+16>: jne    0x29
> >    0x0012 <+18>: mov    %gs:0x0,%rax
> >    0x001b <+27>: or     $0x3,%rax
> >    0x001f <+31>: mov    %rax,0x20(%rdi)
> >    0x0023 <+35>: mov    $0x1,%eax
> >    0x0028 <+40>: retq
> >    0x0029 <+41>: test   %rax,%rax
> >    0x002c <+44>: jns    0x7
> >    0x002e <+46>: xor    %eax,%eax
> >    0x0030 <+48>: retq
> >
> > By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> > load of 10 to lengthen the lock critical section) on a x86-64 system
> > before and after the patch were:
> >
> >                  Before Patch    After Patch
> >   # of Threads      rlock           rlock
> >   ------------      -----           -----
> >        1           14,496          14,716
> >        2            8,644           8,453
> >        4            6,799           6,983
> >        8            5,664           7,190
> >
> > On a ARM64 system, the performance results were:
> >
> >                  Before Patch    After Patch
> >   # of Threads      rlock           rlock
> >   ------------      -----           -----
> >        1           23,676          24,488
> >        2            7,697           9,502
> >        4            4,945           3,440
> >        8            2,641           1,603
>
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
>
> Will, what say you?

What machine were these numbers generated on and is it using LL/SC or LSE
atomics for arm64?

If you stick the microbenchmark somewhere, I can go play with a broader
variety of h/w.

Will
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Thu, Feb 14, 2019 at 6:53 AM Waiman Long wrote:
>
> The ARM64 result is what I would have expected given that the change was
> to optimize for the uncontended case. The x86-64 result is kind of an
> anomaly to me, but I haven't bothered to dig into that.

I would say that the ARM result is what I'd expect from something that
scales badly to begin with.

The x86-64 result is the expected one: yes, the cmpxchg is done one
extra time, but it results in fewer cache transitions (the cacheline
never goes into "shared" state), and cache transitions are what
matter.

The cost of re-doing the instruction should be low. The cacheline
ping-pong and the cache coherency messages are what hurt.

So I actually think both are very easily explained.

The x86-64 number improves, because there is less cache coherency traffic.

The arm64 numbers scaled horribly even before, and that's because
there is too much ping-pong, and it's probably because there is no
"stickiness" to the cacheline to the core, and thus adding the extra
loop can make the ping-pong issue even worse because now there is more
of it.

The cachelines not sticking at all to a core probably is good for
fairness issues (in particular, sticking *too* much can cause horrible
issues), but it's absolutely horrible if it means that you lose the
cacheline even before you get to complete the second cmpxchg.

Linus
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On 02/14/2019 05:33 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> Before this patch, down_read_trylock:
>>
>>    0x0000 <+0>:  callq  0x5
>>    0x0005 <+5>:  jmp    0x18
>>    0x0007 <+7>:  lea    0x1(%rdx),%rcx
>>    0x000b <+11>: mov    %rdx,%rax
>>    0x000e <+14>: lock cmpxchg %rcx,(%rdi)
>>    0x0013 <+19>: cmp    %rax,%rdx
>>    0x0016 <+22>: je     0x23
>>    0x0018 <+24>: mov    (%rdi),%rdx
>>    0x001b <+27>: test   %rdx,%rdx
>>    0x001e <+30>: jns    0x7
>>    0x0020 <+32>: xor    %eax,%eax
>>    0x0022 <+34>: retq
>>    0x0023 <+35>: mov    %gs:0x0,%rax
>>    0x002c <+44>: or     $0x3,%rax
>>    0x0030 <+48>: mov    %rax,0x20(%rdi)
>>    0x0034 <+52>: mov    $0x1,%eax
>>    0x0039 <+57>: retq
>>
>> After patch, down_read_trylock:
>>
>>    0x0000 <+0>:  callq  0x5
>>    0x0005 <+5>:  xor    %eax,%eax
>>    0x0007 <+7>:  lea    0x1(%rax),%rdx
>>    0x000b <+11>: lock cmpxchg %rdx,(%rdi)
>>    0x0010 <+16>: jne    0x29
>>    0x0012 <+18>: mov    %gs:0x0,%rax
>>    0x001b <+27>: or     $0x3,%rax
>>    0x001f <+31>: mov    %rax,0x20(%rdi)
>>    0x0023 <+35>: mov    $0x1,%eax
>>    0x0028 <+40>: retq
>>    0x0029 <+41>: test   %rax,%rax
>>    0x002c <+44>: jns    0x7
>>    0x002e <+46>: xor    %eax,%eax
>>    0x0030 <+48>: retq
>>
>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>> load of 10 to lengthen the lock critical section) on a x86-64 system
>> before and after the patch were:
>>
>>                  Before Patch    After Patch
>>   # of Threads      rlock           rlock
>>   ------------      -----           -----
>>        1           14,496          14,716
>>        2            8,644           8,453
>>        4            6,799           6,983
>>        8            5,664           7,190
>>
>> On a ARM64 system, the performance results were:
>>
>>                  Before Patch    After Patch
>>   # of Threads      rlock           rlock
>>   ------------      -----           -----
>>        1           23,676          24,488
>>        2            7,697           9,502
>>        4            4,945           3,440
>>        8            2,641           1,603
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
>
> Will, what say you?

The ARM64 result is what I would have expected given that the change was
to optimize for the uncontended case. The x86-64 result is kind of an
anomaly to me, but I haven't bothered to dig into that.

Cheers,
Longman
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
>
> Before this patch, down_read_trylock:
>
>    0x0000 <+0>:  callq  0x5
>    0x0005 <+5>:  jmp    0x18
>    0x0007 <+7>:  lea    0x1(%rdx),%rcx
>    0x000b <+11>: mov    %rdx,%rax
>    0x000e <+14>: lock cmpxchg %rcx,(%rdi)
>    0x0013 <+19>: cmp    %rax,%rdx
>    0x0016 <+22>: je     0x23
>    0x0018 <+24>: mov    (%rdi),%rdx
>    0x001b <+27>: test   %rdx,%rdx
>    0x001e <+30>: jns    0x7
>    0x0020 <+32>: xor    %eax,%eax
>    0x0022 <+34>: retq
>    0x0023 <+35>: mov    %gs:0x0,%rax
>    0x002c <+44>: or     $0x3,%rax
>    0x0030 <+48>: mov    %rax,0x20(%rdi)
>    0x0034 <+52>: mov    $0x1,%eax
>    0x0039 <+57>: retq
>
> After patch, down_read_trylock:
>
>    0x0000 <+0>:  callq  0x5
>    0x0005 <+5>:  xor    %eax,%eax
>    0x0007 <+7>:  lea    0x1(%rax),%rdx
>    0x000b <+11>: lock cmpxchg %rdx,(%rdi)
>    0x0010 <+16>: jne    0x29
>    0x0012 <+18>: mov    %gs:0x0,%rax
>    0x001b <+27>: or     $0x3,%rax
>    0x001f <+31>: mov    %rax,0x20(%rdi)
>    0x0023 <+35>: mov    $0x1,%eax
>    0x0028 <+40>: retq
>    0x0029 <+41>: test   %rax,%rax
>    0x002c <+44>: jns    0x7
>    0x002e <+46>: xor    %eax,%eax
>    0x0030 <+48>: retq
>
> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> load of 10 to lengthen the lock critical section) on a x86-64 system
> before and after the patch were:
>
>                  Before Patch    After Patch
>   # of Threads      rlock           rlock
>   ------------      -----           -----
>        1           14,496          14,716
>        2            8,644           8,453
>        4            6,799           6,983
>        8            5,664           7,190
>
> On a ARM64 system, the performance results were:
>
>                  Before Patch    After Patch
>   # of Threads      rlock           rlock
>   ------------      -----           -----
>        1           23,676          24,488
>        2            7,697           9,502
>        4            4,945           3,440
>        8            2,641           1,603

Urgh, yes LL/SC is the obvious exception that can actually do better
here :/

Will, what say you?
[PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
Modify __down_read_trylock() to optimize for an unlocked rwsem and make
it generate slightly better code.

Before this patch, down_read_trylock:

   0x0000 <+0>:  callq  0x5
   0x0005 <+5>:  jmp    0x18
   0x0007 <+7>:  lea    0x1(%rdx),%rcx
   0x000b <+11>: mov    %rdx,%rax
   0x000e <+14>: lock cmpxchg %rcx,(%rdi)
   0x0013 <+19>: cmp    %rax,%rdx
   0x0016 <+22>: je     0x23
   0x0018 <+24>: mov    (%rdi),%rdx
   0x001b <+27>: test   %rdx,%rdx
   0x001e <+30>: jns    0x7
   0x0020 <+32>: xor    %eax,%eax
   0x0022 <+34>: retq
   0x0023 <+35>: mov    %gs:0x0,%rax
   0x002c <+44>: or     $0x3,%rax
   0x0030 <+48>: mov    %rax,0x20(%rdi)
   0x0034 <+52>: mov    $0x1,%eax
   0x0039 <+57>: retq

After patch, down_read_trylock:

   0x0000 <+0>:  callq  0x5
   0x0005 <+5>:  xor    %eax,%eax
   0x0007 <+7>:  lea    0x1(%rax),%rdx
   0x000b <+11>: lock cmpxchg %rdx,(%rdi)
   0x0010 <+16>: jne    0x29
   0x0012 <+18>: mov    %gs:0x0,%rax
   0x001b <+27>: or     $0x3,%rax
   0x001f <+31>: mov    %rax,0x20(%rdi)
   0x0023 <+35>: mov    $0x1,%eax
   0x0028 <+40>: retq
   0x0029 <+41>: test   %rax,%rax
   0x002c <+44>: jns    0x7
   0x002e <+46>: xor    %eax,%eax
   0x0030 <+48>: retq

By using a rwsem microbenchmark, the down_read_trylock() rate (with a
load of 10 to lengthen the lock critical section) on an x86-64 system
before and after the patch were:

                 Before Patch    After Patch
  # of Threads      rlock           rlock
  ------------      -----           -----
       1           14,496          14,716
       2            8,644           8,453
       4            6,799           6,983
       8            5,664           7,190

On an ARM64 system, the performance results were:

                 Before Patch    After Patch
  # of Threads      rlock           rlock
  ------------      -----           -----
       1           23,676          24,488
       2            7,697           9,502
       4            4,945           3,440
       8            2,641           1,603

For the uncontended case (1 thread), the new down_read_trylock() is a
little bit faster. For the contended cases, the new down_read_trylock()
performs pretty well on x86-64, but performance degrades at high
contention levels on ARM64.

Suggested-by: Linus Torvalds
Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.h | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 067e265..e0bcc11 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -175,14 +175,17 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-				   tmp + RWSEM_ACTIVE_READ_BIAS)) {
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_ACTIVE_READ_BIAS)) {
 			return 1;
 		}
-	}
+	} while (tmp >= 0);
 	return 0;
 }
 
-- 
1.8.3.1
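For readers who want to try the two loop shapes outside the kernel, here is a minimal user-space sketch using C11 <stdatomic.h>. It is not the kernel code: struct toy_rwsem, the toy_* function names, and the constant values (0 for unlocked, 1 for the reader bias, -1 standing in for any negative write-locked count) are assumptions made for illustration. atomic_compare_exchange_strong_explicit() stands in for atomic_long_try_cmpxchg_acquire(), since both write the observed value back into the expected variable on failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel constants; the values are assumed. */
#define TOY_UNLOCKED_VALUE	0L
#define TOY_READ_BIAS		1L
#define TOY_WRITE_LOCKED	(-1L)	/* hypothetical: any negative count */

struct toy_rwsem {
	atomic_long count;
};

/* Old shape: plain load first, then cmpxchg from the loaded value. */
static bool toy_read_trylock_old(struct toy_rwsem *sem)
{
	long tmp;

	while ((tmp = atomic_load_explicit(&sem->count,
					   memory_order_relaxed)) >= 0) {
		long expected = tmp;

		if (atomic_compare_exchange_strong_explicit(&sem->count,
				&expected, tmp + TOY_READ_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;
	}
	return false;
}

/* New shape: guess "unlocked"; a failed cmpxchg supplies the real value. */
static bool toy_read_trylock_new(struct toy_rwsem *sem)
{
	long tmp = TOY_UNLOCKED_VALUE;

	do {
		if (atomic_compare_exchange_strong_explicit(&sem->count,
				&tmp, tmp + TOY_READ_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;
	} while (tmp >= 0);
	return false;
}

/* Run both shapes from the same starting count and check that they agree. */
static void toy_check_agree(long start)
{
	struct toy_rwsem a, b;

	atomic_init(&a.count, start);
	atomic_init(&b.count, start);
	assert(toy_read_trylock_old(&a) == toy_read_trylock_new(&b));
	assert(atomic_load(&a.count) == atomic_load(&b.count));
}
```

Calling toy_check_agree() with 0, a positive count, and TOY_WRITE_LOCKED exercises the unlocked, read-held and write-held cases. Without concurrent updates both shapes return the same result and leave the same count, so the difference discussed in this thread is only in how many atomic operations and cacheline transitions each one needs.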