Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0
On 02/14/2019 05:37 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
>> v4:
>>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
>>
>> v3:
>>  - Optimize __down_read_trylock() for the uncontended case as suggested
>>    by Linus.
>>
>> v2:
>>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
>>  - Update performance test data in patch 1.
>>
>> The goal of this patchset is to remove the architecture specific files
>> for rwsem-xadd to make it easier to add enhancements in the later rwsem
>> patches. It also removes the legacy rwsem-spinlock.c file and makes all
>> the architectures use one single implementation of rwsem - rwsem-xadd.c.
>>
>> Waiman Long (3):
>>   locking/rwsem: Remove arch specific rwsem files
>>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
>>     archs
>>   locking/rwsem: Optimize down_read_trylock()
> Acked-by: Peter Zijlstra (Intel)
>
> with the caveat that I'm happy to exchange patch 3 back to my earlier
> suggestion in case Will expresses concerns wrt the ARM64 performance of
> Linus' suggestion.

I inserted a few lock event counters into the rwsem trylock code:

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
	/*
	 * Optimize for the case when the rwsem is not locked at all.
	 */
	long tmp = RWSEM_UNLOCKED_VALUE;

	lockevent_inc(rwsem_rtrylock);
	do {
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
					tmp + RWSEM_ACTIVE_READ_BIAS)) {
			rwsem_set_reader_owned(sem);
			return 1;
		}
		lockevent_inc(rwsem_rtrylock_retry);
	} while (tmp >= 0);
	lockevent_inc(rwsem_rtrylock_fail);
	return 0;
}

static inline int __down_write_trylock(struct rw_semaphore *sem)
{
	long tmp;

	lockevent_inc(rwsem_wtrylock);
	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
					  RWSEM_ACTIVE_WRITE_BIAS);
	if (tmp == RWSEM_UNLOCKED_VALUE) {
		rwsem_set_owner(sem);
		return true;
	}
	lockevent_inc(rwsem_wtrylock_fail);
	return false;
}

I booted the new kernel on a 4-socket 56-core 112-thread Broadwell
system. The counter values were:

1) After bootup:

rwsem_rtrylock=784029
rwsem_rtrylock_fail=59
rwsem_rtrylock_retry=394
rwsem_wtrylock=18284
rwsem_wtrylock_fail=230

2) After parallel kernel build (-j112):

rwsem_rtrylock=338667559
rwsem_rtrylock_fail=18
rwsem_rtrylock_retry=51
rwsem_wtrylock=17016332
rwsem_wtrylock_fail=98058

At least for these two use cases, try-for-ownership as suggested by
Linus is the right choice.

Cheers,
Longman
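
To put the counters above in perspective, here is a small userspace sketch that turns them into retry and failure percentages. It is not part of the patch: the struct, the helper names (rate_pct, report) and the two constant tables are made up for illustration, and only the numbers themselves come from the mail.

```c
#include <stdio.h>

/* One snapshot of the lock event counters quoted in the mail above. */
struct trylock_counts {
	unsigned long rtrylock;        /* read trylock attempts */
	unsigned long rtrylock_fail;   /* read trylocks that gave up */
	unsigned long rtrylock_retry;  /* failed cmpxchg attempts in the read loop */
	unsigned long wtrylock;        /* write trylock attempts */
	unsigned long wtrylock_fail;   /* write trylocks that failed */
};

/* Share of `events` in `attempts`, as a percentage (0 when attempts == 0). */
static double rate_pct(unsigned long events, unsigned long attempts)
{
	return attempts ? 100.0 * (double)events / (double)attempts : 0.0;
}

/* Print the retry and failure rates for one snapshot. */
static void report(const char *label, const struct trylock_counts *c)
{
	printf("%s: read retry %.5f%%, read fail %.5f%%, write fail %.5f%%\n",
	       label,
	       rate_pct(c->rtrylock_retry, c->rtrylock),
	       rate_pct(c->rtrylock_fail, c->rtrylock),
	       rate_pct(c->wtrylock_fail, c->wtrylock));
}

/* Counter values copied from the two runs reported above. */
static const struct trylock_counts after_boot = {
	.rtrylock = 784029, .rtrylock_fail = 59, .rtrylock_retry = 394,
	.wtrylock = 18284, .wtrylock_fail = 230,
};
static const struct trylock_counts after_build = {
	.rtrylock = 338667559, .rtrylock_fail = 18, .rtrylock_retry = 51,
	.wtrylock = 17016332, .wtrylock_fail = 98058,
};
```

Calling report("boot", &after_boot) and report("build", &after_build) prints one line per run. In both runs the read-side retry and failure counts are a tiny fraction of the attempts, which is consistent with the conclusion that starting from the unlocked value rarely costs an extra cmpxchg on these workloads.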
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On 02/14/2019 01:02 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>>> it generate slightly better code.
>>>
>>> Before this patch, down_read_trylock:
>>>
>>>    0x0000 <+0>:     callq  0x5
>>>    0x0005 <+5>:     jmp    0x18
>>>    0x0007 <+7>:     lea    0x1(%rdx),%rcx
>>>    0x000b <+11>:    mov    %rdx,%rax
>>>    0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
>>>    0x0013 <+19>:    cmp    %rax,%rdx
>>>    0x0016 <+22>:    je     0x23
>>>    0x0018 <+24>:    mov    (%rdi),%rdx
>>>    0x001b <+27>:    test   %rdx,%rdx
>>>    0x001e <+30>:    jns    0x7
>>>    0x0020 <+32>:    xor    %eax,%eax
>>>    0x0022 <+34>:    retq
>>>    0x0023 <+35>:    mov    %gs:0x0,%rax
>>>    0x002c <+44>:    or     $0x3,%rax
>>>    0x0030 <+48>:    mov    %rax,0x20(%rdi)
>>>    0x0034 <+52>:    mov    $0x1,%eax
>>>    0x0039 <+57>:    retq
>>>
>>> After patch, down_read_trylock:
>>>
>>>    0x0000 <+0>:     callq  0x5
>>>    0x0005 <+5>:     xor    %eax,%eax
>>>    0x0007 <+7>:     lea    0x1(%rax),%rdx
>>>    0x000b <+11>:    lock cmpxchg %rdx,(%rdi)
>>>    0x0010 <+16>:    jne    0x29
>>>    0x0012 <+18>:    mov    %gs:0x0,%rax
>>>    0x001b <+27>:    or     $0x3,%rax
>>>    0x001f <+31>:    mov    %rax,0x20(%rdi)
>>>    0x0023 <+35>:    mov    $0x1,%eax
>>>    0x0028 <+40>:    retq
>>>    0x0029 <+41>:    test   %rax,%rax
>>>    0x002c <+44>:    jns    0x7
>>>    0x002e <+46>:    xor    %eax,%eax
>>>    0x0030 <+48>:    retq
>>>
>>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>>> load of 10 to lengthen the lock critical section) on a x86-64 system
>>> before and after the patch were:
>>>
>>>                  Before Patch    After Patch
>>>    # of Threads      rlock          rlock
>>>    ------------      -----          -----
>>>         1           14,496         14,716
>>>         2            8,644          8,453
>>>         4            6,799          6,983
>>>         8            5,664          7,190
>>>
>>> On a ARM64 system, the performance results were:
>>>
>>>                  Before Patch    After Patch
>>>    # of Threads      rlock          rlock
>>>    ------------      -----          -----
>>>         1           23,676         24,488
>>>         2            7,697          9,502
>>>         4            4,945          3,440
>>>         8            2,641          1,603
>> Urgh, yes LL/SC is the obvious exception that can actually do better
>> here :/
>>
>> Will, what say you?
> What machine were these numbers generated on and is it using LL/SC or LSE
> atomics for arm64? If you stick the microbenchmark somewhere, I can go play
> with a broader variety of h/w.
>
> Will

The machine is a 2-socket Cavium ThunderX2 99xx system with 64 cores and
256 threads. I was just using threads from the first socket for this
test.

The microbenchmark that I used is attached. I used the command
"./run-locktest -ltryrwsem -r100 -i-10 -c10 -n" to generate the locking
rates.

The lscpu flags were: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
cpuid asimdrdm

Cheers,
Longman

locktest.tar.gz
Description: application/gzip
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds wrote:
>
> The arm64 numbers scaled horribly even before, and that's because
> there is too much ping-pong, and it's probably because there is no
> "stickiness" to the cacheline to the core, and thus adding the extra
> loop can make the ping-pong issue even worse because now there is more
> of it.

Actually, if it's using the ll/sc, then I don't see why arm64 should
even change. It doesn't really even change the pattern: the initial
load of the value is just replaced with a "ll" that gets a non-zero
value, and then we re-try without even doing the "sc" part.

End result: exact same "load once, then do ll/sc to update". Just
using a slightly different instruction pattern.

But maybe "ll" does something different to the cacheline than a regular "ld"?

Alternatively, the machine you used is using LSE, and the "swp" thing
has some horrid behavior when it fails.

So I take it back. I'm actually surprised that arm64 performs worse. I
don't think it should. But numbers walk, bullshit talks, and it
clearly does make for worse numbers on arm64.

             Linus
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> > Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> > it generate slightly better code.
> >
> > Before this patch, down_read_trylock:
> >
> >    0x0000 <+0>:     callq  0x5
> >    0x0005 <+5>:     jmp    0x18
> >    0x0007 <+7>:     lea    0x1(%rdx),%rcx
> >    0x000b <+11>:    mov    %rdx,%rax
> >    0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
> >    0x0013 <+19>:    cmp    %rax,%rdx
> >    0x0016 <+22>:    je     0x23
> >    0x0018 <+24>:    mov    (%rdi),%rdx
> >    0x001b <+27>:    test   %rdx,%rdx
> >    0x001e <+30>:    jns    0x7
> >    0x0020 <+32>:    xor    %eax,%eax
> >    0x0022 <+34>:    retq
> >    0x0023 <+35>:    mov    %gs:0x0,%rax
> >    0x002c <+44>:    or     $0x3,%rax
> >    0x0030 <+48>:    mov    %rax,0x20(%rdi)
> >    0x0034 <+52>:    mov    $0x1,%eax
> >    0x0039 <+57>:    retq
> >
> > After patch, down_read_trylock:
> >
> >    0x0000 <+0>:     callq  0x5
> >    0x0005 <+5>:     xor    %eax,%eax
> >    0x0007 <+7>:     lea    0x1(%rax),%rdx
> >    0x000b <+11>:    lock cmpxchg %rdx,(%rdi)
> >    0x0010 <+16>:    jne    0x29
> >    0x0012 <+18>:    mov    %gs:0x0,%rax
> >    0x001b <+27>:    or     $0x3,%rax
> >    0x001f <+31>:    mov    %rax,0x20(%rdi)
> >    0x0023 <+35>:    mov    $0x1,%eax
> >    0x0028 <+40>:    retq
> >    0x0029 <+41>:    test   %rax,%rax
> >    0x002c <+44>:    jns    0x7
> >    0x002e <+46>:    xor    %eax,%eax
> >    0x0030 <+48>:    retq
> >
> > By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> > load of 10 to lengthen the lock critical section) on a x86-64 system
> > before and after the patch were:
> >
> >                  Before Patch    After Patch
> >    # of Threads      rlock          rlock
> >    ------------      -----          -----
> >         1           14,496         14,716
> >         2            8,644          8,453
> >         4            6,799          6,983
> >         8            5,664          7,190
> >
> > On a ARM64 system, the performance results were:
> >
> >                  Before Patch    After Patch
> >    # of Threads      rlock          rlock
> >    ------------      -----          -----
> >         1           23,676         24,488
> >         2            7,697          9,502
> >         4            4,945          3,440
> >         8            2,641          1,603
>
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
>
> Will, what say you?

What machine were these numbers generated on and is it using LL/SC or LSE
atomics for arm64? If you stick the microbenchmark somewhere, I can go play
with a broader variety of h/w.

Will
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Thu, Feb 14, 2019 at 6:53 AM Waiman Long wrote:
>
> The ARM64 result is what I would have expected given that the change was
> to optimize for the uncontended case. The x86-64 result is kind of an
> anomaly to me, but I haven't bothered to dig into that.

I would say that the ARM result is what I'd expect from something that
scales badly to begin with.

The x86-64 result is the expected one: yes, the cmpxchg is done one
extra time, but it results in fewer cache transitions (the cacheline
never goes into "shared" state), and cache transitions are what
matter.

The cost of re-doing the instruction should be low. The cacheline
ping-pong and the cache coherency messages are what hurt.

So I actually think both are very easily explained.

The x86-64 number improves, because there is less cache coherency traffic.

The arm64 numbers scaled horribly even before, and that's because
there is too much ping-pong, and it's probably because there is no
"stickiness" to the cacheline to the core, and thus adding the extra
loop can make the ping-pong issue even worse because now there is more
of it.

The cachelines not sticking at all to a core probably is good for
fairness issues (in particular, sticking *too* much can cause horrible
issues), but it's absolutely horrible if it means that you lose the
cacheline even before you get to complete the second cmpxchg.

              Linus
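
The two access patterns being compared in this mail can be sketched in userspace C11 roughly as follows. This is only an illustration under simplifying assumptions, not the kernel code: the function names and the two macros are hypothetical stand-ins, and the count is reduced to "negative means a writer holds it, otherwise it is the number of readers".

```c
#include <stdatomic.h>
#include <stdbool.h>

#define UNLOCKED_VALUE   0L  /* hypothetical stand-in for RWSEM_UNLOCKED_VALUE */
#define ACTIVE_READ_BIAS 1L  /* hypothetical stand-in for RWSEM_ACTIVE_READ_BIAS */

/*
 * Load-first pattern: read the count with a plain load (which may pull
 * the cacheline in shared state), then cmpxchg against the value read.
 */
static bool read_trylock_load_first(atomic_long *count)
{
	for (;;) {
		long tmp = atomic_load_explicit(count, memory_order_relaxed);

		if (tmp < 0)	/* simplified: negative means write-locked */
			return false;
		if (atomic_compare_exchange_strong_explicit(count, &tmp,
				tmp + ACTIVE_READ_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;
	}
}

/*
 * Try-for-ownership pattern: assume the lock is unlocked and go straight
 * to the cmpxchg. A failed cmpxchg writes the current count into tmp, so
 * the retry uses that value without a separate load.
 */
static bool read_trylock_cmpxchg_first(atomic_long *count)
{
	long tmp = UNLOCKED_VALUE;

	do {
		if (atomic_compare_exchange_strong_explicit(count, &tmp,
				tmp + ACTIVE_READ_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;
	} while (tmp >= 0);
	return false;
}
```

Both functions give the same result for a given count. The difference is only in which memory operation touches the cacheline first, which is the part the cache-transition argument above is about; the sketch says nothing about how a particular CPU implements the compare-and-exchange.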
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On 02/14/2019 08:23 AM, Davidlohr Bueso wrote:
> On Fri, 08 Feb 2019, Waiman Long wrote:
>> I am planning to run more performance tests and post the data sometime
>> next week. Davidlohr is also going to run some of his rwsem performance
>> tests on this patchset.
>
> So I ran this series on a 40-core IB 2 socket with various workloads in
> mmtests. Below are some of the interesting ones; full numbers and curves
> at https://linux-scalability.org/rwsem-reader-spinner/
>
> All workloads are with increasing number of threads.
>
> -- pagefault timings: pft is an artificial pf benchmark (thus reader
> stress).
> metric is faults/cpu and faults/sec
>
>                            v5.0-rc6                 v5.0-rc6
>                                                        dirty
> Hmean  faults/cpu-1     624224.9815 (   0.00%)    618847.5201 *  -0.86%*
> Hmean  faults/cpu-4     539550.3509 (   0.00%)    547407.5738 *   1.46%*
> Hmean  faults/cpu-7     401470.3461 (   0.00%)    381157.9830 *  -5.06%*
> Hmean  faults/cpu-12    267617.0353 (   0.00%)    271098.5441 *   1.30%*
> Hmean  faults/cpu-21    176194.4641 (   0.00%)    175151.3256 *  -0.59%*
> Hmean  faults/cpu-30    119927.3862 (   0.00%)    120610.1348 *   0.57%*
> Hmean  faults/cpu-40     91203.6820 (   0.00%)     91832.7489 *   0.69%*
> Hmean  faults/sec-1     623292.3467 (   0.00%)    617992.0795 *  -0.85%*
> Hmean  faults/sec-4    2113364.6898 (   0.00%)   2140254.8238 *   1.27%*
> Hmean  faults/sec-7    2557378.4385 (   0.00%)   2450945.7060 *  -4.16%*
> Hmean  faults/sec-12   2696509.8975 (   0.00%)   2747968.9819 *   1.91%*
> Hmean  faults/sec-21   2902892.5639 (   0.00%)   2905923.3881 *   0.10%*
> Hmean  faults/sec-30   2956696.5793 (   0.00%)   2990583.5147 *   1.15%*
> Hmean  faults/sec-40   3422806.4806 (   0.00%)   3352970.3082 *  -2.04%*
> Stddev faults/cpu-1       2949.5159 (   0.00%)      2802.2712 (   4.99%)
> Stddev faults/cpu-4      24165.9454 (   0.00%)     15841.1232 (  34.45%)
> Stddev faults/cpu-7      20914.8351 (   0.00%)     22744.3294 (  -8.75%)
> Stddev faults/cpu-12     11274.3490 (   0.00%)     14733.3152 ( -30.68%)
> Stddev faults/cpu-21      2500.1950 (   0.00%)      2200.9518 (  11.97%)
> Stddev faults/cpu-30      1599.5346 (   0.00%)      1414.0339 (  11.60%)
> Stddev faults/cpu-40      1473.0181 (   0.00%)      3004.1209 (-103.94%)
> Stddev faults/sec-1       2655.2581 (   0.00%)      2405.1625 (   9.42%)
> Stddev faults/sec-4      84042.7234 (   0.00%)     57996.7158 (  30.99%)
> Stddev faults/sec-7     123656.7901 (   0.00%)    135591.1087 (  -9.65%)
> Stddev faults/sec-12     97135.6091 (   0.00%)    127054.4926 ( -30.80%)
> Stddev faults/sec-21     69564.6264 (   0.00%)     65922.6381 (   5.24%)
> Stddev faults/sec-30     51524.4027 (   0.00%)     56109.4159 (  -8.90%)
> Stddev faults/sec-40    101927.5280 (   0.00%)    160117.0093 ( -57.09%)
>
> With the exception of the hiccup at 7 threads, things are pretty much in
> the noise region for both metrics.
>
> -- git checkout
>
> First metric is total runtime for runs with incremental threads.
>
>              v5.0-rc6    v5.0-rc6
>                             dirty
> User           218.95      219.07
> System         149.29      146.82
> Elapsed       1574.10     1427.08
>
> In this case there's a non trivial improvement (11%) in overall
> elapsed time.
>
> -- reaim (which is always susceptible to rwsem changes for both
> mmap_sem and
> i_mmap)
>                         v5.0-rc6              v5.0-rc6
>                                                  dirty
> Hmean  compute-1         6674.01 (   0.00%)     6544.28 *  -1.94%*
> Hmean  compute-21       85294.91 (   0.00%)    85524.20 *   0.27%*
> Hmean  compute-41      149674.70 (   0.00%)   149494.58 *  -0.12%*
> Hmean  compute-61      177721.15 (   0.00%)   170507.38 *  -4.06%*
> Hmean  compute-81      181531.07 (   0.00%)   180463.24 *  -0.59%*
> Hmean  compute-101     189024.09 (   0.00%)   187288.86 *  -0.92%*
> Hmean  compute-121     200673.24 (   0.00%)   195327.65 *  -2.66%*
> Hmean  compute-141     213082.29 (   0.00%)   211290.80 *  -0.84%*
> Hmean  compute-161     207764.06 (   0.00%)   204626.68 *  -1.51%*
>
> The 'compute' workload overall takes a small hit.
>
> Hmean  new_dbase-1         60.48 (   0.00%)       60.63 *   0.25%*
> Hmean  new_dbase-21      6590.49 (   0.00%)     6671.81 *   1.23%*
> Hmean  new_dbase-41     14202.91 (   0.00%)    14470.59 *   1.88%*
> Hmean  new_dbase-61     21207.24 (   0.00%)    21067.40 *  -0.66%*
> Hmean  new_dbase-81     25542.40 (   0.00%)    25542.40 *   0.00%*
> Hmean  new_dbase-101    30165.28 (   0.00%)    30046.21 *  -0.39%*
> Hmean  new_dbase-121    33638.33 (   0.00%)    33219.90 *  -1.24%*
> Hmean  new_dbase-141    36723.70 (   0.00%)    37504.52 *   2.13%*
> Hmean  new_dbase-161    42242.51 (   0.00%)    42117.34 *  -0.30%*
> Hmean  shared-1
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On 02/14/2019 05:33 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> Before this patch, down_read_trylock:
>>
>>    0x0000 <+0>:     callq  0x5
>>    0x0005 <+5>:     jmp    0x18
>>    0x0007 <+7>:     lea    0x1(%rdx),%rcx
>>    0x000b <+11>:    mov    %rdx,%rax
>>    0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
>>    0x0013 <+19>:    cmp    %rax,%rdx
>>    0x0016 <+22>:    je     0x23
>>    0x0018 <+24>:    mov    (%rdi),%rdx
>>    0x001b <+27>:    test   %rdx,%rdx
>>    0x001e <+30>:    jns    0x7
>>    0x0020 <+32>:    xor    %eax,%eax
>>    0x0022 <+34>:    retq
>>    0x0023 <+35>:    mov    %gs:0x0,%rax
>>    0x002c <+44>:    or     $0x3,%rax
>>    0x0030 <+48>:    mov    %rax,0x20(%rdi)
>>    0x0034 <+52>:    mov    $0x1,%eax
>>    0x0039 <+57>:    retq
>>
>> After patch, down_read_trylock:
>>
>>    0x0000 <+0>:     callq  0x5
>>    0x0005 <+5>:     xor    %eax,%eax
>>    0x0007 <+7>:     lea    0x1(%rax),%rdx
>>    0x000b <+11>:    lock cmpxchg %rdx,(%rdi)
>>    0x0010 <+16>:    jne    0x29
>>    0x0012 <+18>:    mov    %gs:0x0,%rax
>>    0x001b <+27>:    or     $0x3,%rax
>>    0x001f <+31>:    mov    %rax,0x20(%rdi)
>>    0x0023 <+35>:    mov    $0x1,%eax
>>    0x0028 <+40>:    retq
>>    0x0029 <+41>:    test   %rax,%rax
>>    0x002c <+44>:    jns    0x7
>>    0x002e <+46>:    xor    %eax,%eax
>>    0x0030 <+48>:    retq
>>
>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>> load of 10 to lengthen the lock critical section) on a x86-64 system
>> before and after the patch were:
>>
>>                  Before Patch    After Patch
>>    # of Threads      rlock          rlock
>>    ------------      -----          -----
>>         1           14,496         14,716
>>         2            8,644          8,453
>>         4            6,799          6,983
>>         8            5,664          7,190
>>
>> On a ARM64 system, the performance results were:
>>
>>                  Before Patch    After Patch
>>    # of Threads      rlock          rlock
>>    ------------      -----          -----
>>         1           23,676         24,488
>>         2            7,697          9,502
>>         4            4,945          3,440
>>         8            2,641          1,603
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
>
> Will, what say you?

The ARM64 result is what I would have expected given that the change was
to optimize for the uncontended case. The x86-64 result is kind of an
anomaly to me, but I haven't bothered to dig into that.

Cheers,
Longman
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Fri, 08 Feb 2019, Waiman Long wrote:
> I am planning to run more performance tests and post the data sometime
> next week. Davidlohr is also going to run some of his rwsem performance
> tests on this patchset.

So I ran this series on a 40-core IB 2 socket with various workloads in
mmtests. Below are some of the interesting ones; full numbers and curves
at https://linux-scalability.org/rwsem-reader-spinner/

All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial pf benchmark (thus reader stress).
metric is faults/cpu and faults/sec

                           v5.0-rc6                 v5.0-rc6
                                                       dirty
Hmean  faults/cpu-1     624224.9815 (   0.00%)    618847.5201 *  -0.86%*
Hmean  faults/cpu-4     539550.3509 (   0.00%)    547407.5738 *   1.46%*
Hmean  faults/cpu-7     401470.3461 (   0.00%)    381157.9830 *  -5.06%*
Hmean  faults/cpu-12    267617.0353 (   0.00%)    271098.5441 *   1.30%*
Hmean  faults/cpu-21    176194.4641 (   0.00%)    175151.3256 *  -0.59%*
Hmean  faults/cpu-30    119927.3862 (   0.00%)    120610.1348 *   0.57%*
Hmean  faults/cpu-40     91203.6820 (   0.00%)     91832.7489 *   0.69%*
Hmean  faults/sec-1     623292.3467 (   0.00%)    617992.0795 *  -0.85%*
Hmean  faults/sec-4    2113364.6898 (   0.00%)   2140254.8238 *   1.27%*
Hmean  faults/sec-7    2557378.4385 (   0.00%)   2450945.7060 *  -4.16%*
Hmean  faults/sec-12   2696509.8975 (   0.00%)   2747968.9819 *   1.91%*
Hmean  faults/sec-21   2902892.5639 (   0.00%)   2905923.3881 *   0.10%*
Hmean  faults/sec-30   2956696.5793 (   0.00%)   2990583.5147 *   1.15%*
Hmean  faults/sec-40   3422806.4806 (   0.00%)   3352970.3082 *  -2.04%*
Stddev faults/cpu-1       2949.5159 (   0.00%)      2802.2712 (   4.99%)
Stddev faults/cpu-4      24165.9454 (   0.00%)     15841.1232 (  34.45%)
Stddev faults/cpu-7      20914.8351 (   0.00%)     22744.3294 (  -8.75%)
Stddev faults/cpu-12     11274.3490 (   0.00%)     14733.3152 ( -30.68%)
Stddev faults/cpu-21      2500.1950 (   0.00%)      2200.9518 (  11.97%)
Stddev faults/cpu-30      1599.5346 (   0.00%)      1414.0339 (  11.60%)
Stddev faults/cpu-40      1473.0181 (   0.00%)      3004.1209 (-103.94%)
Stddev faults/sec-1       2655.2581 (   0.00%)      2405.1625 (   9.42%)
Stddev faults/sec-4      84042.7234 (   0.00%)     57996.7158 (  30.99%)
Stddev faults/sec-7     123656.7901 (   0.00%)    135591.1087 (  -9.65%)
Stddev faults/sec-12     97135.6091 (   0.00%)    127054.4926 ( -30.80%)
Stddev faults/sec-21     69564.6264 (   0.00%)     65922.6381 (   5.24%)
Stddev faults/sec-30     51524.4027 (   0.00%)     56109.4159 (  -8.90%)
Stddev faults/sec-40    101927.5280 (   0.00%)    160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in
the noise region for both metrics.

-- git checkout

First metric is total runtime for runs with incremental threads.

             v5.0-rc6    v5.0-rc6
                            dirty
User           218.95      219.07
System         149.29      146.82
Elapsed       1574.10     1427.08

In this case there's a non trivial improvement (11%) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and
i_mmap)

                        v5.0-rc6              v5.0-rc6
                                                 dirty
Hmean  compute-1         6674.01 (   0.00%)     6544.28 *  -1.94%*
Hmean  compute-21       85294.91 (   0.00%)    85524.20 *   0.27%*
Hmean  compute-41      149674.70 (   0.00%)   149494.58 *  -0.12%*
Hmean  compute-61      177721.15 (   0.00%)   170507.38 *  -4.06%*
Hmean  compute-81      181531.07 (   0.00%)   180463.24 *  -0.59%*
Hmean  compute-101     189024.09 (   0.00%)   187288.86 *  -0.92%*
Hmean  compute-121     200673.24 (   0.00%)   195327.65 *  -2.66%*
Hmean  compute-141     213082.29 (   0.00%)   211290.80 *  -0.84%*
Hmean  compute-161     207764.06 (   0.00%)   204626.68 *  -1.51%*

The 'compute' workload overall takes a small hit.

Hmean  new_dbase-1         60.48 (   0.00%)       60.63 *   0.25%*
Hmean  new_dbase-21      6590.49 (   0.00%)     6671.81 *   1.23%*
Hmean  new_dbase-41     14202.91 (   0.00%)    14470.59 *   1.88%*
Hmean  new_dbase-61     21207.24 (   0.00%)    21067.40 *  -0.66%*
Hmean  new_dbase-81     25542.40 (   0.00%)    25542.40 *   0.00%*
Hmean  new_dbase-101    30165.28 (   0.00%)    30046.21 *  -0.39%*
Hmean  new_dbase-121    33638.33 (   0.00%)    33219.90 *  -1.24%*
Hmean  new_dbase-141    36723.70 (   0.00%)    37504.52 *   2.13%*
Hmean  new_dbase-161    42242.51 (   0.00%)    42117.34 *  -0.30%*
Hmean  shared-1            76.54 (   0.00%)       76.09 *  -0.59%*
Hmean  shared-21         7535.51 (   0.00%)     5518.75 * -26.76%*
Hmean  shared-41        17207.81 (   0.00%)    14651.94 * -14.85%*
Hmean  shared-61
Re: [PATCH v4 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs
On Thu, Feb 14, 2019 at 11:54:47AM +0100, Geert Uytterhoeven wrote:
> On Wed, Feb 13, 2019 at 11:01 PM Waiman Long wrote:
> > Currently, we have two different implementations of rwsem:
> >  1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c)
> >  2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c)
> >
> > As we are going to use a single generic implementation for rwsem-xadd.c
> > and no architecture-specific code will be needed, there is no point
> > in keeping two different implementations of rwsem. In most cases, the
> > performance of rwsem-spinlock.c will be worse. It also doesn't get all
> > the performance tuning and optimizations that had been implemented in
> > rwsem-xadd.c over the years.
> >
> > For simplification, we are going to remove rwsem-spinlock.c and make all
> > architectures use a single implementation of rwsem - rwsem-xadd.c.
> >
> > All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM
> > in the code are removed.
> >
> > Suggested-by: Peter Zijlstra
> > Signed-off-by: Waiman Long
>
> Note that this conflicts with "[PATCH 03/11] kernel/locks: consolidate
> RWSEM_GENERIC_* options"
> https://lore.kernel.org/lkml/20190213174005.28785-4-...@lst.de/

*sigh*.. of course that never was Cc'ed to locking people :/
Re: [PATCH v4 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs
On Wed, Feb 13, 2019 at 11:01 PM Waiman Long wrote:
> Currently, we have two different implementations of rwsem:
>  1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c)
>  2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c)
>
> As we are going to use a single generic implementation for rwsem-xadd.c
> and no architecture-specific code will be needed, there is no point
> in keeping two different implementations of rwsem. In most cases, the
> performance of rwsem-spinlock.c will be worse. It also doesn't get all
> the performance tuning and optimizations that had been implemented in
> rwsem-xadd.c over the years.
>
> For simplification, we are going to remove rwsem-spinlock.c and make all
> architectures use a single implementation of rwsem - rwsem-xadd.c.
>
> All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM
> in the code are removed.
>
> Suggested-by: Peter Zijlstra
> Signed-off-by: Waiman Long

Note that this conflicts with "[PATCH 03/11] kernel/locks: consolidate
RWSEM_GENERIC_* options"
https://lore.kernel.org/lkml/20190213174005.28785-4-...@lst.de/

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0
On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
> v4:
>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
>
> v3:
>  - Optimize __down_read_trylock() for the uncontended case as suggested
>    by Linus.
>
> v2:
>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
>  - Update performance test data in patch 1.
>
> The goal of this patchset is to remove the architecture specific files
> for rwsem-xadd to make it easier to add enhancements in the later rwsem
> patches. It also removes the legacy rwsem-spinlock.c file and makes all
> the architectures use one single implementation of rwsem - rwsem-xadd.c.
>
> Waiman Long (3):
>   locking/rwsem: Remove arch specific rwsem files
>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
>     archs
>   locking/rwsem: Optimize down_read_trylock()

Acked-by: Peter Zijlstra (Intel)

with the caveat that I'm happy to exchange patch 3 back to my earlier
suggestion in case Will expresses concerns wrt the ARM64 performance of
Linus' suggestion.
Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()
On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
>
> Before this patch, down_read_trylock:
>
>    0x0000 <+0>:     callq  0x5
>    0x0005 <+5>:     jmp    0x18
>    0x0007 <+7>:     lea    0x1(%rdx),%rcx
>    0x000b <+11>:    mov    %rdx,%rax
>    0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
>    0x0013 <+19>:    cmp    %rax,%rdx
>    0x0016 <+22>:    je     0x23
>    0x0018 <+24>:    mov    (%rdi),%rdx
>    0x001b <+27>:    test   %rdx,%rdx
>    0x001e <+30>:    jns    0x7
>    0x0020 <+32>:    xor    %eax,%eax
>    0x0022 <+34>:    retq
>    0x0023 <+35>:    mov    %gs:0x0,%rax
>    0x002c <+44>:    or     $0x3,%rax
>    0x0030 <+48>:    mov    %rax,0x20(%rdi)
>    0x0034 <+52>:    mov    $0x1,%eax
>    0x0039 <+57>:    retq
>
> After patch, down_read_trylock:
>
>    0x0000 <+0>:     callq  0x5
>    0x0005 <+5>:     xor    %eax,%eax
>    0x0007 <+7>:     lea    0x1(%rax),%rdx
>    0x000b <+11>:    lock cmpxchg %rdx,(%rdi)
>    0x0010 <+16>:    jne    0x29
>    0x0012 <+18>:    mov    %gs:0x0,%rax
>    0x001b <+27>:    or     $0x3,%rax
>    0x001f <+31>:    mov    %rax,0x20(%rdi)
>    0x0023 <+35>:    mov    $0x1,%eax
>    0x0028 <+40>:    retq
>    0x0029 <+41>:    test   %rax,%rax
>    0x002c <+44>:    jns    0x7
>    0x002e <+46>:    xor    %eax,%eax
>    0x0030 <+48>:    retq
>
> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> load of 10 to lengthen the lock critical section) on a x86-64 system
> before and after the patch were:
>
>                  Before Patch    After Patch
>    # of Threads      rlock          rlock
>    ------------      -----          -----
>         1           14,496         14,716
>         2            8,644          8,453
>         4            6,799          6,983
>         8            5,664          7,190
>
> On a ARM64 system, the performance results were:
>
>                  Before Patch    After Patch
>    # of Threads      rlock          rlock
>    ------------      -----          -----
>         1           23,676         24,488
>         2            7,697          9,502
>         4            4,945          3,440
>         8            2,641          1,603

Urgh, yes LL/SC is the obvious exception that can actually do better
here :/

Will, what say you?