Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Thu, Apr 11, 2019 at 12:08 AM Waiman Long wrote:
>
> On 04/10/2019 04:15 AM, huang ying wrote:
> > Hi, Waiman,
> >
> > What's the status of this patchset? And its merging plan?
> >
> > Best Regards,
> > Huang, Ying
>
> I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
> Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Thanks! Please keep me updated!

Best Regards,
Huang, Ying

> Cheers,
> Longman
>
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On 04/10/2019 04:15 AM, huang ying wrote:
> Hi, Waiman,
>
> What's the status of this patchset? And its merging plan?
>
> Best Regards,
> Huang, Ying

I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Cheers,
Longman
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
Hi, Waiman,

What's the status of this patchset? And its merging plan?

Best Regards,
Huang, Ying
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On 02/14/2019 08:23 AM, Davidlohr Bueso wrote:
> On Fri, 08 Feb 2019, Waiman Long wrote:
>> I am planning to run more performance tests and post the data sometime
>> next week. Davidlohr is also going to run some of his rwsem performance
>> tests on this patchset.
>
> So I ran this series on a 40-core IB 2 socket with various workloads in
> mmtests. Below are some of the interesting ones; full numbers and curves
> at https://linux-scalability.org/rwsem-reader-spinner/
>
> All workloads are with increasing number of threads.
>
> -- pagefault timings: pft is an artificial pf benchmark (thus reader
> stress).
> metric is faults/cpu and faults/sec
>                          v5.0-rc6               v5.0-rc6
>                                                    dirty
> Hmean  faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
> Hmean  faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
> Hmean  faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
> Hmean  faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
> Hmean  faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
> Hmean  faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
> Hmean  faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
> Hmean  faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
> Hmean  faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
> Hmean  faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
> Hmean  faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
> Hmean  faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
> Hmean  faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
> Hmean  faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
> Stddev faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
> Stddev faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
> Stddev faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
> Stddev faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
> Stddev faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
> Stddev faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
> Stddev faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
> Stddev faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
> Stddev faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
> Stddev faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
> Stddev faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
> Stddev faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
> Stddev faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
> Stddev faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)
>
> With the exception of the hiccup at 7 threads, things are pretty much in
> the noise region for both metrics.
>
> -- git checkout
>
> First metric is total runtime for runs with incremental threads.
>
>                   v5.0-rc6    v5.0-rc6
>                                  dirty
> User                218.95      219.07
> System              149.29      146.82
> Elapsed            1574.10     1427.08
>
> In this case there's a non trivial improvement (11%) in overall
> elapsed time.
>
> -- reaim (which is always susceptible to rwsem changes for both
> mmap_sem and
> i_mmap)
>                          v5.0-rc6               v5.0-rc6
>                                                    dirty
> Hmean  compute-1         6674.01 (   0.00%)      6544.28 *  -1.94%*
> Hmean  compute-21       85294.91 (   0.00%)     85524.20 *   0.27%*
> Hmean  compute-41      149674.70 (   0.00%)    149494.58 *  -0.12%*
> Hmean  compute-61      177721.15 (   0.00%)    170507.38 *  -4.06%*
> Hmean  compute-81      181531.07 (   0.00%)    180463.24 *  -0.59%*
> Hmean  compute-101     189024.09 (   0.00%)    187288.86 *  -0.92%*
> Hmean  compute-121     200673.24 (   0.00%)    195327.65 *  -2.66%*
> Hmean  compute-141     213082.29 (   0.00%)    211290.80 *  -0.84%*
> Hmean  compute-161     207764.06 (   0.00%)    204626.68 *  -1.51%*
>
> The 'compute' workload overall takes a small hit.
>
> Hmean  new_dbase-1         60.48 (   0.00%)        60.63 *   0.25%*
> Hmean  new_dbase-21      6590.49 (   0.00%)      6671.81 *   1.23%*
> Hmean  new_dbase-41     14202.91 (   0.00%)     14470.59 *   1.88%*
> Hmean  new_dbase-61     21207.24 (   0.00%)     21067.40 *  -0.66%*
> Hmean  new_dbase-81     25542.40 (   0.00%)     25542.40 *   0.00%*
> Hmean  new_dbase-101    30165.28 (   0.00%)     30046.21 *  -0.39%*
> Hmean  new_dbase-121    33638.33 (   0.00%)     33219.90 *  -1.24%*
> Hmean  new_dbase-141    36723.70 (   0.00%)     37504.52 *   2.13%*
> Hmean  new_dbase-161    42242.51 (   0.00%)     42117.34 *  -0.30%*
> Hmean  shared-1
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Fri, 08 Feb 2019, Waiman Long wrote:
> I am planning to run more performance tests and post the data sometime
> next week. Davidlohr is also going to run some of his rwsem performance
> tests on this patchset.

So I ran this series on a 40-core IB 2 socket with various workloads in
mmtests. Below are some of the interesting ones; full numbers and curves
at https://linux-scalability.org/rwsem-reader-spinner/

All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial pf benchmark (thus reader stress).
metric is faults/cpu and faults/sec
                         v5.0-rc6               v5.0-rc6
                                                   dirty
Hmean  faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
Hmean  faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
Hmean  faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
Hmean  faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
Hmean  faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
Hmean  faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
Hmean  faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
Hmean  faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
Hmean  faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
Hmean  faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
Hmean  faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
Hmean  faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
Hmean  faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
Hmean  faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
Stddev faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
Stddev faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
Stddev faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
Stddev faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
Stddev faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
Stddev faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
Stddev faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
Stddev faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
Stddev faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
Stddev faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
Stddev faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
Stddev faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
Stddev faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
Stddev faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in
the noise region for both metrics.

-- git checkout

First metric is total runtime for runs with incremental threads.

                  v5.0-rc6    v5.0-rc6
                                 dirty
User                218.95      219.07
System              149.29      146.82
Elapsed            1574.10     1427.08

In this case there's a non trivial improvement (11%) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and
i_mmap)
                         v5.0-rc6               v5.0-rc6
                                                   dirty
Hmean  compute-1         6674.01 (   0.00%)      6544.28 *  -1.94%*
Hmean  compute-21       85294.91 (   0.00%)     85524.20 *   0.27%*
Hmean  compute-41      149674.70 (   0.00%)    149494.58 *  -0.12%*
Hmean  compute-61      177721.15 (   0.00%)    170507.38 *  -4.06%*
Hmean  compute-81      181531.07 (   0.00%)    180463.24 *  -0.59%*
Hmean  compute-101     189024.09 (   0.00%)    187288.86 *  -0.92%*
Hmean  compute-121     200673.24 (   0.00%)    195327.65 *  -2.66%*
Hmean  compute-141     213082.29 (   0.00%)    211290.80 *  -0.84%*
Hmean  compute-161     207764.06 (   0.00%)    204626.68 *  -1.51%*

The 'compute' workload overall takes a small hit.

Hmean  new_dbase-1         60.48 (   0.00%)        60.63 *   0.25%*
Hmean  new_dbase-21      6590.49 (   0.00%)      6671.81 *   1.23%*
Hmean  new_dbase-41     14202.91 (   0.00%)     14470.59 *   1.88%*
Hmean  new_dbase-61     21207.24 (   0.00%)     21067.40 *  -0.66%*
Hmean  new_dbase-81     25542.40 (   0.00%)     25542.40 *   0.00%*
Hmean  new_dbase-101    30165.28 (   0.00%)     30046.21 *  -0.39%*
Hmean  new_dbase-121    33638.33 (   0.00%)     33219.90 *  -1.24%*
Hmean  new_dbase-141    36723.70 (   0.00%)     37504.52 *   2.13%*
Hmean  new_dbase-161    42242.51 (   0.00%)     42117.34 *  -0.30%*
Hmean  shared-1            76.54 (   0.00%)        76.09 *  -0.59%*
Hmean  shared-21         7535.51 (   0.00%)      5518.75 * -26.76%*
Hmean  shared-41        17207.81 (   0.00%)     14651.94 * -14.85%*
Hmean  shared-61
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
Ok, those test robot reports are hard to read, but trying to distill it down:

On Wed, Feb 13, 2019 at 1:19 AM Chen Rong wrote:
>
>          %stddev     %change          %stddev
>              \          |                 \
>     196250 ±  8%     -64.1%       70494        will-it-scale.per_thread_ops

That's the original 64% regression..

And then with the patch set:

>          %stddev      change          %stddev
>              \          |                 \
>      71190             180%      199232 ±  4%  will-it-scale.per_thread_ops

looks like it's back up where it used to be.

So I guess we have numbers for the regression now. Thanks.

And that closes my biggest question for the new model, and with the new
organization that gets rid of the arch-specific asm separately first
and makes it a bit more legible that way, I guess I'll just Ack the
whole series.

              Linus
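The two headline figures quoted above can be checked with a few lines of C. This is only a sketch of the arithmetic: pct_change() is a hypothetical helper introduced here for illustration, not part of the test robot tooling, and the inputs are the per_thread_ops values from the two quoted rows.

```c
#include <assert.h>

/*
 * Relative change from a baseline to a new value, in percent.
 * Hypothetical helper for illustration only.
 */
static double pct_change(double base, double value)
{
	assert(base != 0.0);	/* a zero baseline has no relative change */
	return (value - base) / base * 100.0;
}
```

With these inputs, pct_change(196250, 70494) comes out at roughly -64.1 and pct_change(71190, 199232) at roughly +180, matching the two reports. The patched result of 199232 is also within about 1.5% of the original 196250 baseline, which is inside the reported standard deviations.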
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
Hi all,

Kernel test robot reported a will-it-scale.per_thread_ops -64.1% regression
on IVB-desktop for v4.20-rc1.

The first bad commit is: 9bc8039e715da3b53dbac89525323a9f2f69b7b5, Yang Shi:
mm: brk: downgrade mmap_sem to read when shrinking
(https://lists.01.org/pipermail/lkp/2018-November/009335.html).

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-ivb-d01/brk1/will-it-scale/0x20

commit:
  85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
  9bc8039e71 ("mm: brk: downgrade mmap_sem to read when shrinking")

85a06835f6f1ba79 9bc8039e715da3b53dbac89525
---------------- --------------------------
         %stddev     %change          %stddev
             \          |                 \
    196250 ±  8%     -64.1%       70494        will-it-scale.per_thread_ops
    127330 ± 19%     -98.0%        2525 ± 24%  will-it-scale.time.involuntary_context_switches
    727.50 ±  2%     -77.0%      167.25        will-it-scale.time.percent_of_cpu_this_job_got
      2141 ±  2%     -77.6%      479.12        will-it-scale.time.system_time
     50.48 ±  7%     -48.5%       25.98        will-it-scale.time.user_time
  34925294 ± 18%    +270.3%   1.293e+08 ±  4%  will-it-scale.time.voluntary_context_switches
   1570007 ±  8%     -64.1%      563958        will-it-scale.workload
      6435 ±  2%      -6.4%        6024        proc-vmstat.nr_shmem
      1298 ± 16%     -44.5%      721.00 ± 18%  proc-vmstat.pgactivate
      2341           +16.4%        2724        slabinfo.kmalloc-96.active_objs
      2341           +16.4%        2724        slabinfo.kmalloc-96.num_objs
      6346 ±150%     -87.8%      776.25 ±  9%  softirqs.NET_RX
    160107 ±  8%    +151.9%      403273        softirqs.SCHED
   1097999           -13.0%      955526        softirqs.TIMER
      5.50 ±  9%     -81.8%        1.00        vmstat.procs.r
    230700 ± 19%    +269.9%      853292 ±  4%  vmstat.system.cs
     26706 ±  3%     +15.7%       30910 ±  5%  vmstat.system.in
     11.24 ± 23%     +72.2        83.39        mpstat.cpu.idle%
      0.00 ±131%      +0.0         0.04 ± 99%  mpstat.cpu.iowait%
     86.32 ±  2%     -70.8        15.54        mpstat.cpu.sys%
      2.44 ±  7%      -1.4         1.04 ±  8%  mpstat.cpu.usr%
  20610709 ± 15%   +2376.0%   5.103e+08 ± 34%  cpuidle.C1.time
   3233399 ±  8%    +241.5%    11042785 ± 25%  cpuidle.C1.usage
  36172040 ±  6%    +931.3%    3.73e+08 ± 15%  cpuidle.C1E.time
    783605 ±  4%    +548.7%     5083041 ± 18%  cpuidle.C1E.usage
  28753819 ± 39%   +1054.5%   3.319e+08 ± 49%  cpuidle.C3.time
    283912 ± 25%    +688.4%     2238225 ± 34%  cpuidle.C3.usage
 1.507e+08 ± 47%    +292.3%   5.913e+08 ± 28%  cpuidle.C6.time
    339861 ± 37%    +549.7%     2208222 ± 24%  cpuidle.C6.usage
   2709719 ±  5%    +824.2%    25043444        cpuidle.POLL.time
  28602864 ± 18%    +173.7%    78276116 ± 10%  cpuidle.POLL.usage

We found that the patchset could fix the regression.

tests: 1
testcase/path_params/tbox_group/run:
  will-it-scale/performance-thread-100%-brk1-ucode=0x20/lkp-ivb-d01

commit:
  85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
  fb835fe7f0 ("locking/rwsem: Ensure an RT task will not spin on reader")

85a06835f6f1ba79 fb835fe7f0adbd7c2c074b98ec
---------------- --------------------------
         %stddev      change          %stddev
             \          |                 \
    120736 ± 22%        56%      188019 ±  6%  will-it-scale.time.involuntary_context_switches
      2126 ±  3%         4%        2215        will-it-scale.time.system_time
       722 ±  3%         4%         752        will-it-scale.time.percent_of_cpu_this_job_got
  36256485 ± 27%       -35%    23682989 ±  3%  will-it-scale.time.voluntary_context_switches
      3151 ±  9%        11%        3504        turbostat.Avg_MHz
    229285 ± 32%       -30%      160660 ±  3%  vmstat.system.cs
    120736 ± 22%        56%      188019 ±  6%  time.involuntary_context_switches
      2126 ±  3%         4%        2215        time.system_time
       722 ±  3%         4%         752        time.percent_of_cpu_this_job_got
  36256485 ± 27%       -35%    23682989 ±  3%  time.voluntary_context_switches
        23             643%         171 ±  3%  proc-vmstat.nr_zone_inactive_file
        23             643%         171 ±  3%  proc-vmstat.nr_inactive_file
      3664              12%        4121        proc-vmstat.nr_kernel_stack
      6392               6%        6785        proc-vmstat.nr_slab_unreclaimable
      9991                        10176        proc-vmstat.nr_slab_reclaimable
     63938                        62394        proc-vmstat.nr_zone_active_anon
     63938                        62394        proc-vmstat.nr_active_anon
    386388 ±  9%        -6%      362272        proc-vmstat.pgfree
    368296 ±  9%       -10%      333074
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
* Waiman Long wrote:

> On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> > On Thu, 07 Feb 2019, Waiman Long wrote:
> >> 30 files changed, 1197 insertions(+), 1594 deletions(-)
> >
> > Performance numbers on numerous workloads, pretty please.
> >
> > I'll go and throw this at my mmap_sem intensive workloads
> > I've collected.
> >
> > Thanks,
> > Davidlohr
>
> Thanks for getting some of the performance numbers. This is the initial
> draft after more than a year of hibernation. I will also get other
> performance numbers in subsequent revisions of the patch.

If you could sort all the invariant preparatory patches to the head of the
series I can merge them to reduce overall complexity and simplify
performance testing and review of the rest.

Thanks,

	Ingo
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Fri, Feb 8, 2019 at 12:31 PM Waiman Long wrote:
>
> > (b) what's the new fastpath case
>
> The only change in the fastpath is the use of cmpxchg for writer lock.

.. since a big deal here was about using the generic atomic accessor
functions, I really was looking forward to seeing the *actual* fast
path code generation.

In other words, right now I have very little visibility in how it
actually affects the code. Looking at the patches themselves doesn't
make it obvious.

I was hoping for the overview to really explain the whole "before and
after" situation, and it didn't. Not at the high level, and not at a
low level.

And no performance numbers in the overview either. And yes, I see the
numbers in the patches, but what I really hoped for was some real load
numbers.

In particular, I would have loved to see numbers from the kernel test
robot "will-it-scale.per_thread_ops" case, which is the one that had a
65% regression due to the lack of reader spinning.

So I was kind of hoping to hear whether that regression is basically
entirely gone with this patch series, or if we still have a regression
due to the extra downgrade, or what?

              Linus
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On 02/08/2019 02:50 PM, Linus Torvalds wrote:
> On Thu, Feb 7, 2019 at 11:08 AM Waiman Long wrote:
>> This patchset revamps the current rwsem-xadd implementation to make
>> it saner and easier to work with. This patchset removes all the
>> architecture specific assembly code and uses generic C code for all
>> architectures. This eases maintenance and enables us to enhance the
>> code more easily.
>>
>> This patchset also implements the following 3 new features:
>>
>>  1) Waiter lock handoff
>>  2) Reader optimistic spinning
>>  3) Store write-lock owner in the atomic count (x86-64 only)
> The patches are kind of hard to read, with most of them just doing
> prep-work that doesn't necessarily matter to the big picture.
>
> What I'd really like to see is
>
>  (a) an overview of the new locking logic

The new locking logic is similar to qrwlock (see patch 11). Cmpxchg is
used to acquire the write lock, while xadd is still used for the read
lock. Some of the bits in the count are also reserved for special
purposes, such as flagging waiters or a lock handoff. Patch 15 tries to
compress the write-lock owner task pointer and put it into the count
field for x86-64, at the expense of fewer bits being available for the
reader count. I have sent out an additional patch this morning to make
sure that the reader count won't overflow.

In terms of performance, there isn't much change with respect to
read-lock performance. For write-lock, I saw a slight drop in some
cases, but nothing significant. The merging of the owner task pointer
into the count field does impose a slightly bigger drop than I would
have liked, which I am going to look into a bit more.

>
>  (b) what's the new fastpath case

The only change in the fastpath is the use of cmpxchg for writer lock.

>
>  (c) some performance numbers

There are performance data in patches 11, 12, 15, 19, 20 and 21. There
was performance data for patch 4 as well, for eliminating the
arch-specific files. Apparently, I might have deleted it accidentally.
Anyway, no noticeable performance difference was observed when
switching to generic C code for x86, ppc and ARM64.

The major gain in performance is due to the reader optimistic spinning
patches. The microbenchmark that I used showed an order of magnitude of
performance improvement for mixed reader-writer workloads. Of course, we
will see less performance gain with real world benchmarks.

I am planning to run more performance tests and post the data sometime
next week. Davidlohr is also going to run some of his rwsem performance
tests on this patchset.

>
> to explain the changes from a "this is the point of the whole
> exercise" standpoint.
>
> And yes, I realize that the lock handoff and optimistic spinning is a
> big deal, since I've seen the same regression numbers that presumably
> caused this effort to be resurrected. So it's not that I don't find
> this intriguing and worthwhile, it's literally that I'd like a summary
> not so much of the individual patches, but of the new model.
>
> Please?

Maybe I should break this patchset into a few smaller ones to make it
easier to review. Any suggestion is welcome.

Cheers,
Longman
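To make the description of the new locking logic a bit more concrete, here is a minimal userspace sketch of the two fast paths: cmpxchg for the write lock and xadd for the read lock, with a few low bits of the count reserved as flags. Everything named SKETCH_* or sketch_* is hypothetical, the bit layout is an assumption made for illustration (the actual assignment in the patches may differ), and C11 atomics stand in for the kernel's atomic_long_t helpers. There is no wait queue or slow path here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical count layout, for illustration only. */
#define SKETCH_WRITER_LOCKED	(1UL << 0)	/* a writer holds the lock */
#define SKETCH_FLAG_WAITERS	(1UL << 1)	/* wait queue is not empty */
#define SKETCH_FLAG_HANDOFF	(1UL << 2)	/* lock reserved for a waiter */
#define SKETCH_READER_BIAS	(1UL << 8)	/* one reader, in the upper bits */
#define SKETCH_READ_FAILED_MASK \
	(SKETCH_WRITER_LOCKED | SKETCH_FLAG_WAITERS | SKETCH_FLAG_HANDOFF)

struct sketch_rwsem {
	atomic_ulong count;
};

/* Writer fast path: one cmpxchg from "completely free" to "writer locked". */
static bool sketch_write_trylock(struct sketch_rwsem *sem)
{
	unsigned long expected = 0;

	return atomic_compare_exchange_strong_explicit(&sem->count, &expected,
			SKETCH_WRITER_LOCKED,
			memory_order_acquire, memory_order_relaxed);
}

/* Reader fast path: unconditional xadd, then inspect the old value. */
static bool sketch_read_trylock(struct sketch_rwsem *sem)
{
	unsigned long old = atomic_fetch_add_explicit(&sem->count,
			SKETCH_READER_BIAS, memory_order_acquire);

	if (!(old & SKETCH_READ_FAILED_MASK))
		return true;

	/* Simplification: back out instead of entering a slow path. */
	atomic_fetch_sub_explicit(&sem->count, SKETCH_READER_BIAS,
				  memory_order_relaxed);
	return false;
}

static void sketch_read_unlock(struct sketch_rwsem *sem)
{
	unsigned long old = atomic_fetch_sub_explicit(&sem->count,
			SKETCH_READER_BIAS, memory_order_release);

	assert(old >= SKETCH_READER_BIAS);	/* at least one reader held it */
}

static void sketch_write_unlock(struct sketch_rwsem *sem)
{
	unsigned long old = atomic_fetch_and_explicit(&sem->count,
			~SKETCH_WRITER_LOCKED, memory_order_release);

	assert(old & SKETCH_WRITER_LOCKED);	/* must have been write-locked */
}
```

The asymmetry is the point of the sketch: readers always add their bias and only then look at what was there, so several readers can get in with one atomic operation each, while a writer succeeds only if the whole count was zero.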
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Thu, Feb 7, 2019 at 11:08 AM Waiman Long wrote:
>
> This patchset revamps the current rwsem-xadd implementation to make
> it saner and easier to work with. This patchset removes all the
> architecture specific assembly code and uses generic C code for all
> architectures. This eases maintenance and enables us to enhance the
> code more easily.
>
> This patchset also implements the following 3 new features:
>
>  1) Waiter lock handoff
>  2) Reader optimistic spinning
>  3) Store write-lock owner in the atomic count (x86-64 only)

The patches are kind of hard to read, with most of them just doing
prep-work that doesn't necessarily matter to the big picture.

What I'd really like to see is

 (a) an overview of the new locking logic

 (b) what's the new fastpath case

 (c) some performance numbers

to explain the changes from a "this is the point of the whole
exercise" standpoint.

And yes, I realize that the lock handoff and optimistic spinning is a
big deal, since I've seen the same regression numbers that presumably
caused this effort to be resurrected. So it's not that I don't find
this intriguing and worthwhile, it's literally that I'd like a summary
not so much of the individual patches, but of the new model.

Please?

              Linus
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> On Thu, 07 Feb 2019, Waiman Long wrote:
>> 30 files changed, 1197 insertions(+), 1594 deletions(-)
>
> Performance numbers on numerous workloads, pretty please.
>
> I'll go and throw this at my mmap_sem intensive workloads
> I've collected.
>
> Thanks,
> Davidlohr

Thanks for getting some of the performance numbers. This is the initial
draft after more than a year of hibernation. I will also get other
performance numbers in subsequent revisions of the patch.

Cheers,
Longman
[PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. This patchset removes all the
architecture specific assembly code and uses generic C code for all
architectures. This eases maintenance and enables us to enhance the
code more easily.

This patchset also implements the following 3 new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning
 3) Store write-lock owner in the atomic count (x86-64 only)

Waiter lock handoff is similar to the mechanism currently in the mutex
code. This ensures that lock starvation won't happen.

Reader optimistic spinning enables readers to acquire the lock more
quickly. So workloads that use a mix of readers and writers should
see an increase in performance.

Finally, storing the write-lock owner in the count will allow
optimistic spinners to get to the lock holder's task structure more
quickly and eliminates the timing gap where the write lock is acquired
but the owner isn't known yet. This is important for RT tasks where
spinning on a lock with an unknown owner is not allowed.

Because multiple readers can share the same lock, there is a natural
preference for readers when measuring in terms of locking throughput,
as more readers are likely to get into the locking fast path than
writers. With waiter lock handoff, we are not going to starve the
writers.

Patches 1-2 rework the qspinlock_stat code to make it generic (lock
event counting) so that it can be used by all architectures and all
locking code.

Patch 3 relocates rwsem_down_read_failed() and associated functions
to below the optimistic spinning functions.

Patch 4 eliminates all architecture specific code and uses generic C
code for all.

Patch 5 moves code that manages the owner field closer to the rwsem
lock fast path as it is not needed by the rwsem-spinlock code.

Patch 6 renames rwsem.h to rwsem-xadd.h as it is now specific to
rwsem-xadd.c only.

Patch 7 hides the internal rwsem-xadd functions from the public.

Patch 8 moves the DEBUG_RWSEMS_WARN_ON checks from rwsem.c to
kernel/locking/rwsem-xadd.h and adds some new ones.

Patch 9 enhances the DEBUG_RWSEMS_WARN_ON macro to print out rwsem
internal states that can be useful for debugging purposes.

Patch 10 enables lock event counting in the rwsem code.

Patch 11 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. Write lock is done by atomic_cmpxchg() while read
lock is still being done by atomic_add().

Patch 12 implements lock handoff to prevent lock starvation.

Patch 13 removes the rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.

Patch 14 adds some new rwsem owner access helper functions.

Patch 15 merges the write-lock owner task pointer into the count.
Only a 64-bit count has enough space to provide a reasonable number of
bits for the reader count. ARM64 seems to have problems with the current
encoding scheme, so this owner merging is currently limited to
x86-64 only.

Patch 16 eliminates redundant computation of the merged owner-count.

Patch 17 reduces the chance of missed optimistic spinning opportunities
because of some race conditions.

Patch 18 makes rwsem_spin_on_owner() return a tri-state value.

Patch 19 enables readers to spin on a writer-owned rwsem.

Patch 20 enables lock waiters to spin on a reader-owned rwsem with a
limited number of tries.

Patch 21 makes reader wakeup wake all the readers in the wait queue
instead of just those in the front.

Patch 22 disallows RT tasks from spinning on a rwsem with unknown owner.

In terms of performance, eliminating architecture specific assembly code
and using generic code doesn't seem to have any impact on performance.

Supporting lock handoff does have a minor performance impact on highly
contended rwsems, but it is a price worth paying for preventing lock
starvation.

Reader optimistic spinning is generally good for performance. Of course,
there will be some corner cases where performance may suffer.

Merging owner into count does have a minor performance impact. We can
discuss if this is a feature we want to have in the rwsem code.

There are also some performance data scattered in some of the patches.

Waiman Long (22):
  locking/qspinlock_stat: Introduce a generic lockevent counting APIs
  locking/lock_events: Make lock_events available for all archs & other locks
  locking/rwsem: Relocate rwsem_down_read_failed()
  locking/rwsem: Remove arch specific rwsem files
  locking/rwsem: Move owner setting code from rwsem.c to rwsem.h
  locking/rwsem: Rename kernel/locking/rwsem.h
  locking/rwsem: Move rwsem internal function declarations to rwsem-xadd.h
  locking/rwsem: Add debug check for __down_read*()
  locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro
  locking/rwsem: Enable lock event counting
  locking/rwsem: Implement a new locking scheme
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Remove rwsem_wake() wakeup optimization
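The waiter lock handoff idea from this cover letter can be sketched in userspace C as follows. This is not the code from patch 12: the names, the bit positions and the exact rule are assumptions for illustration, modelled on how the mutex handoff flag behaves. Once the waiter at the head of the queue has requested a handoff, nobody else is allowed to grab the lock when it becomes free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical count layout, for illustration only. */
#define SKETCH_WRITER_LOCKED	(1UL << 0)	/* a writer holds the lock */
#define SKETCH_FLAG_HANDOFF	(1UL << 2)	/* lock reserved for first waiter */
#define SKETCH_READER_MASK	(~0xffUL)	/* reader count in the upper bits */

static_assert((SKETCH_FLAG_HANDOFF & SKETCH_READER_MASK) == 0,
	      "flag bits must not overlap the reader count");

/* Called by the waiter at the head of the queue once it has waited too long. */
static void sketch_request_handoff(atomic_ulong *count)
{
	atomic_fetch_or_explicit(count, SKETCH_FLAG_HANDOFF,
				 memory_order_relaxed);
}

/*
 * Writer acquisition attempt. first_waiter is true only for the task at the
 * head of the wait queue; everybody else is a would-be lock stealer.
 */
static bool sketch_write_trylock_handoff(atomic_ulong *count, bool first_waiter)
{
	unsigned long old = atomic_load_explicit(count, memory_order_relaxed);
	unsigned long new;

	do {
		if (old & (SKETCH_WRITER_LOCKED | SKETCH_READER_MASK))
			return false;	/* still owned by a writer or readers */
		if ((old & SKETCH_FLAG_HANDOFF) && !first_waiter)
			return false;	/* reserved: stealing is not allowed */
		/* Take the lock and consume any pending handoff request. */
		new = (old | SKETCH_WRITER_LOCKED) & ~SKETCH_FLAG_HANDOFF;
	} while (!atomic_compare_exchange_weak_explicit(count, &old, new,
			memory_order_acquire, memory_order_relaxed));

	return true;
}
```

Without the handoff bit, a steady stream of fast-path acquirers could keep beating the queued waiter to the lock indefinitely; with it, the waiter is guaranteed the next acquisition, at the cost of some throughput under heavy contention.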
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Thu, 07 Feb 2019, Waiman Long wrote:
> 30 files changed, 1197 insertions(+), 1594 deletions(-)

Performance numbers on numerous workloads, pretty please.

I'll go and throw this at my mmap_sem intensive workloads
I've collected.

Thanks,
Davidlohr