Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-04-11 Thread huang ying
On Thu, Apr 11, 2019 at 12:08 AM Waiman Long wrote:
>
> On 04/10/2019 04:15 AM, huang ying wrote:
> > Hi, Waiman,
> >
> > What's the status of this patchset?  And its merging plan?
> >
> > Best Regards,
> > Huang, Ying
>
> I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
> Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Thanks!  Please keep me updated!

Best Regards,
Huang, Ying

> Cheers,
> Longman
>


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-04-10 Thread Waiman Long
On 04/10/2019 04:15 AM, huang ying wrote:
> Hi, Waiman,
>
> What's the status of this patchset?  And its merging plan?
>
> Best Regards,
> Huang, Ying

I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Cheers,
Longman



Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-04-10 Thread huang ying
Hi, Waiman,

What's the status of this patchset?  And its merging plan?

Best Regards,
Huang, Ying


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-14 Thread Waiman Long
On 02/14/2019 08:23 AM, Davidlohr Bueso wrote:
> On Fri, 08 Feb 2019, Waiman Long wrote:
>> I am planning to run more performance tests and post the data sometime
>> next week. Davidlohr is also going to run some of his rwsem performance
>> tests on this patchset.
>
> So I ran this series on a 40-core IB 2 socket with various workloads in
> mmtests. Below are some of the interesting ones; full numbers and curves
> at https://linux-scalability.org/rwsem-reader-spinner/
>
> All workloads are with increasing number of threads.
>
> -- pagefault timings: pft is an artificial pf benchmark (thus reader
> stress).
> metric is faults/cpu and faults/sec
>                           v5.0-rc6                 v5.0-rc6
>                                                       dirty
> Hmean  faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
> Hmean  faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
> Hmean  faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
> Hmean  faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
> Hmean  faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
> Hmean  faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
> Hmean  faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
> Hmean  faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
> Hmean  faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
> Hmean  faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
> Hmean  faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
> Hmean  faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
> Hmean  faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
> Hmean  faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
> Stddev faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
> Stddev faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
> Stddev faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
> Stddev faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
> Stddev faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
> Stddev faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
> Stddev faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
> Stddev faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
> Stddev faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
> Stddev faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
> Stddev faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
> Stddev faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
> Stddev faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
> Stddev faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)
>
> With the exception of the hiccup at 7 threads, things are pretty much in
> the noise region for both metrics.
>
> -- git checkout
>
> First metric is total runtime for runs with incremental threads.
>
>            v5.0-rc6    v5.0-rc6
>                           dirty
> User         218.95      219.07
> System       149.29      146.82
> Elapsed     1574.10     1427.08
>
> In this case there's a non-trivial improvement (11%) in overall
> elapsed time.
>
> -- reaim (which is always susceptible to rwsem changes for both
> mmap_sem and i_mmap)
>                        v5.0-rc6                v5.0-rc6
>                                                   dirty
> Hmean  compute-1        6674.01 (   0.00%)      6544.28 *  -1.94%*
> Hmean  compute-21      85294.91 (   0.00%)     85524.20 *   0.27%*
> Hmean  compute-41     149674.70 (   0.00%)    149494.58 *  -0.12%*
> Hmean  compute-61     177721.15 (   0.00%)    170507.38 *  -4.06%*
> Hmean  compute-81     181531.07 (   0.00%)    180463.24 *  -0.59%*
> Hmean  compute-101    189024.09 (   0.00%)    187288.86 *  -0.92%*
> Hmean  compute-121    200673.24 (   0.00%)    195327.65 *  -2.66%*
> Hmean  compute-141    213082.29 (   0.00%)    211290.80 *  -0.84%*
> Hmean  compute-161    207764.06 (   0.00%)    204626.68 *  -1.51%*
>
> The 'compute' workload overall takes a small hit.
>
> Hmean  new_dbase-1        60.48 (   0.00%)        60.63 *   0.25%*
> Hmean  new_dbase-21     6590.49 (   0.00%)      6671.81 *   1.23%*
> Hmean  new_dbase-41    14202.91 (   0.00%)     14470.59 *   1.88%*
> Hmean  new_dbase-61    21207.24 (   0.00%)     21067.40 *  -0.66%*
> Hmean  new_dbase-81    25542.40 (   0.00%)     25542.40 *   0.00%*
> Hmean  new_dbase-101   30165.28 (   0.00%)     30046.21 *  -0.39%*
> Hmean  new_dbase-121   33638.33 (   0.00%)     33219.90 *  -1.24%*
> Hmean  new_dbase-141   36723.70 (   0.00%)     37504.52 *   2.13%*
> Hmean  new_dbase-161   42242.51 (   0.00%)     42117.34 *  -0.30%*
> Hmean  shared-1

Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-14 Thread Davidlohr Bueso

On Fri, 08 Feb 2019, Waiman Long wrote:
> I am planning to run more performance tests and post the data sometime
> next week. Davidlohr is also going to run some of his rwsem performance
> tests on this patchset.


So I ran this series on a 40-core IB 2 socket with various workloads in
mmtests. Below are some of the interesting ones; full numbers and curves
at https://linux-scalability.org/rwsem-reader-spinner/

All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial pf benchmark (thus reader stress).
metric is faults/cpu and faults/sec
                          v5.0-rc6                 v5.0-rc6
                                                      dirty
Hmean  faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
Hmean  faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
Hmean  faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
Hmean  faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
Hmean  faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
Hmean  faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
Hmean  faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
Hmean  faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
Hmean  faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
Hmean  faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
Hmean  faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
Hmean  faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
Hmean  faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
Hmean  faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
Stddev faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
Stddev faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
Stddev faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
Stddev faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
Stddev faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
Stddev faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
Stddev faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
Stddev faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
Stddev faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
Stddev faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
Stddev faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
Stddev faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
Stddev faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
Stddev faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in
the noise region for both metrics.

-- git checkout

First metric is total runtime for runs with incremental threads.

           v5.0-rc6    v5.0-rc6
                          dirty
User         218.95      219.07
System       149.29      146.82
Elapsed     1574.10     1427.08

In this case there's a non-trivial improvement (11%) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and
i_mmap)
                       v5.0-rc6                v5.0-rc6
                                                  dirty
Hmean  compute-1        6674.01 (   0.00%)      6544.28 *  -1.94%*
Hmean  compute-21      85294.91 (   0.00%)     85524.20 *   0.27%*
Hmean  compute-41     149674.70 (   0.00%)    149494.58 *  -0.12%*
Hmean  compute-61     177721.15 (   0.00%)    170507.38 *  -4.06%*
Hmean  compute-81     181531.07 (   0.00%)    180463.24 *  -0.59%*
Hmean  compute-101    189024.09 (   0.00%)    187288.86 *  -0.92%*
Hmean  compute-121    200673.24 (   0.00%)    195327.65 *  -2.66%*
Hmean  compute-141    213082.29 (   0.00%)    211290.80 *  -0.84%*
Hmean  compute-161    207764.06 (   0.00%)    204626.68 *  -1.51%*

The 'compute' workload overall takes a small hit.

Hmean  new_dbase-1        60.48 (   0.00%)        60.63 *   0.25%*
Hmean  new_dbase-21     6590.49 (   0.00%)      6671.81 *   1.23%*
Hmean  new_dbase-41    14202.91 (   0.00%)     14470.59 *   1.88%*
Hmean  new_dbase-61    21207.24 (   0.00%)     21067.40 *  -0.66%*
Hmean  new_dbase-81    25542.40 (   0.00%)     25542.40 *   0.00%*
Hmean  new_dbase-101   30165.28 (   0.00%)     30046.21 *  -0.39%*
Hmean  new_dbase-121   33638.33 (   0.00%)     33219.90 *  -1.24%*
Hmean  new_dbase-141   36723.70 (   0.00%)     37504.52 *   2.13%*
Hmean  new_dbase-161   42242.51 (   0.00%)     42117.34 *  -0.30%*
Hmean  shared-1           76.54 (   0.00%)        76.09 *  -0.59%*
Hmean  shared-21        7535.51 (   0.00%)      5518.75 * -26.76%*
Hmean  shared-41       17207.81 (   0.00%)     14651.94 * -14.85%*
Hmean  shared-61

Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-13 Thread Linus Torvalds
Ok, those test robot reports are hard to read, but trying to distill it down:

On Wed, Feb 13, 2019 at 1:19 AM Chen Rong wrote:
>
>          %stddev     %change         %stddev
>              \          |                \
>     196250 ±  8%    -64.1%      70494        will-it-scale.per_thread_ops

That's the original 64% regression..

And then with the patch set:

>          %stddev      change         %stddev
>              \          |                \
>      71190            180%     199232 ±  4%  will-it-scale.per_thread_ops

looks like it's back up where it used to be.
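
For reference, both percentages in the robot output follow from the
usual relative-change formula. A minimal sketch in C; pct_change() is a
hypothetical helper written for this note, not part of the test robot
tooling:

```c
#include <assert.h>

/*
 * Relative change from a baseline value to a new value, in percent.
 * Hypothetical helper for illustration only.
 */
static inline double pct_change(double base, double now)
{
	assert(base != 0.0);
	return (now - base) / base * 100.0;
}
```

Plugging in the rows quoted above, pct_change(196250, 70494) comes out
at roughly -64.1 and pct_change(71190, 199232) at roughly 180. The
patched result of 199232 is within about 1.5% of the original 196250,
which is inside the reported stddev.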

So I guess we have numbers for the regression now. Thanks.

And that closes my biggest question for the new model, and with the
new organization that gets rid of the arch-specific asm separately
first and makes it a bit more legible that way, I guess I'll just Ack
the whole series.

 Linus


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-13 Thread Chen Rong
Hi all,

Kernel test robot reported a -64.1% regression in
will-it-scale.per_thread_ops on IVB-desktop for v4.20-rc1.
The first bad commit is 9bc8039e715da3b53dbac89525323a9f2f69b7b5 by Yang Shi,
"mm: brk: downgrade mmap_sem to read when shrinking"
(https://lists.01.org/pipermail/lkp/2018-November/009335.html).

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-ivb-d01/brk1/will-it-scale/0x20

commit: 
  85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
  9bc8039e71 ("mm: brk: downgrade mmap_sem to read when shrinking")

85a06835f6f1ba79 9bc8039e715da3b53dbac89525
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
    196250 ±  8%    -64.1%      70494        will-it-scale.per_thread_ops
    127330 ± 19%    -98.0%       2525 ± 24%  will-it-scale.time.involuntary_context_switches
    727.50 ±  2%    -77.0%     167.25        will-it-scale.time.percent_of_cpu_this_job_got
      2141 ±  2%    -77.6%     479.12        will-it-scale.time.system_time
     50.48 ±  7%    -48.5%      25.98        will-it-scale.time.user_time
  34925294 ± 18%   +270.3%  1.293e+08 ±  4%  will-it-scale.time.voluntary_context_switches
   1570007 ±  8%    -64.1%     563958        will-it-scale.workload
      6435 ±  2%     -6.4%       6024        proc-vmstat.nr_shmem
      1298 ± 16%    -44.5%     721.00 ± 18%  proc-vmstat.pgactivate
      2341          +16.4%       2724        slabinfo.kmalloc-96.active_objs
      2341          +16.4%       2724        slabinfo.kmalloc-96.num_objs
      6346 ±150%    -87.8%     776.25 ±  9%  softirqs.NET_RX
    160107 ±  8%   +151.9%     403273        softirqs.SCHED
   1097999          -13.0%     955526        softirqs.TIMER
      5.50 ±  9%    -81.8%       1.00        vmstat.procs.r
    230700 ± 19%   +269.9%     853292 ±  4%  vmstat.system.cs
     26706 ±  3%    +15.7%      30910 ±  5%  vmstat.system.in
     11.24 ± 23%     +72.2       83.39        mpstat.cpu.idle%
      0.00 ±131%      +0.0        0.04 ± 99%  mpstat.cpu.iowait%
     86.32 ±  2%     -70.8       15.54        mpstat.cpu.sys%
      2.44 ±  7%      -1.4        1.04 ±  8%  mpstat.cpu.usr%
  20610709 ± 15%  +2376.0%  5.103e+08 ± 34%  cpuidle.C1.time
   3233399 ±  8%   +241.5%   11042785 ± 25%  cpuidle.C1.usage
  36172040 ±  6%   +931.3%   3.73e+08 ± 15%  cpuidle.C1E.time
    783605 ±  4%   +548.7%    5083041 ± 18%  cpuidle.C1E.usage
  28753819 ± 39%  +1054.5%  3.319e+08 ± 49%  cpuidle.C3.time
    283912 ± 25%   +688.4%    2238225 ± 34%  cpuidle.C3.usage
 1.507e+08 ± 47%   +292.3%  5.913e+08 ± 28%  cpuidle.C6.time
    339861 ± 37%   +549.7%    2208222 ± 24%  cpuidle.C6.usage
   2709719 ±  5%   +824.2%   25043444        cpuidle.POLL.time
  28602864 ± 18%   +173.7%   78276116 ± 10%  cpuidle.POLL.usage


We found that the patchset could fix the regression.

tests: 1
testcase/path_params/tbox_group/run: 
will-it-scale/performance-thread-100%-brk1-ucode=0x20/lkp-ivb-d01

commit: 
  85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
  fb835fe7f0 ("locking/rwsem: Ensure an RT task will not spin on reader")

85a06835f6f1ba79 fb835fe7f0adbd7c2c074b98ec
---------------- --------------------------
         %stddev      change         %stddev
             \          |                \
    120736 ± 22%       56%     188019 ±  6%  will-it-scale.time.involuntary_context_switches
      2126 ±  3%        4%       2215        will-it-scale.time.system_time
       722 ±  3%        4%        752        will-it-scale.time.percent_of_cpu_this_job_got
  36256485 ± 27%      -35%   23682989 ±  3%  will-it-scale.time.voluntary_context_switches
      3151 ±  9%       11%       3504        turbostat.Avg_MHz
    229285 ± 32%      -30%     160660 ±  3%  vmstat.system.cs
    120736 ± 22%       56%     188019 ±  6%  time.involuntary_context_switches
      2126 ±  3%        4%       2215        time.system_time
       722 ±  3%        4%        752        time.percent_of_cpu_this_job_got
  36256485 ± 27%      -35%   23682989 ±  3%  time.voluntary_context_switches
        23            643%        171 ±  3%  proc-vmstat.nr_zone_inactive_file
        23            643%        171 ±  3%  proc-vmstat.nr_inactive_file
      3664             12%       4121        proc-vmstat.nr_kernel_stack
      6392              6%       6785        proc-vmstat.nr_slab_unreclaimable
      9991                      10176        proc-vmstat.nr_slab_reclaimable
     63938                      62394        proc-vmstat.nr_zone_active_anon
     63938                      62394        proc-vmstat.nr_active_anon
    386388 ±  9%       -6%     362272        proc-vmstat.pgfree
    368296 ±  9%      -10%     333074

Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-10 Thread Ingo Molnar


* Waiman Long wrote:

> On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> > On Thu, 07 Feb 2019, Waiman Long wrote:
> >> 30 files changed, 1197 insertions(+), 1594 deletions(-)
> >
> > Performance numbers on numerous workloads, pretty please.
> >
> > I'll go and throw this at my mmap_sem intensive workloads
> > I've collected.
> >
> > Thanks,
> > Davidlohr
> 
> Thanks for getting some of the performance numbers. This is the initial
> draft after more than a year of hibernation. I will also get other
> performance numbers in a subsequent revision of the patch.

If you could sort all the invariant preparatory patches to the head of 
the series I can merge them to reduce overall complexity and simplify 
performance testing and review of the rest.

Thanks,

Ingo


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-08 Thread Linus Torvalds
On Fri, Feb 8, 2019 at 12:31 PM Waiman Long wrote:
>
> >  (b) what's the new fastpath case
>
> The only change in the fastpath is the use of cmpxchg for writer lock.

.. since a big deal here was about using the generic atomic accessor
functions, I really was looking forward to seeing the *actual* fast
path code generation.

In other words, right now I have very little visibility in how it
actually affects the code. Looking at the patches themselves doesn't
make it obvious. I was hoping for the overview to really explain the
whole "before and after" situation, and it didn't. Not at the high
level, and not at a low level. And no performance numbers in the
overview either.

And yes, I see the numbers in the patches, but what I really hoped for
was some real load numbers. In particular, I would have loved to see
numbers from the kernel test robot "will-it-scale.per_thread_ops"
case, which is the one that had a 65% regression due to the lack of
reader spinning.

So I was kind of hoping to hear whether that regression is basically
entirely gone with this patch series, or if we still have a regression
due to the extra downgrade, or what?

 Linus


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-08 Thread Waiman Long
On 02/08/2019 02:50 PM, Linus Torvalds wrote:
> On Thu, Feb 7, 2019 at 11:08 AM Waiman Long wrote:
>> This patchset revamps the current rwsem-xadd implementation to make
>> it saner and easier to work with. This patchset removes all the
>> architecture specific assembly code and uses generic C code for all
>> architectures. This eases maintenance and enables us to enhance the
>> code more easily.
>>
>> This patchset also implements the following 3 new features:
>>
>>  1) Waiter lock handoff
>>  2) Reader optimistic spinning
>>  3) Store write-lock owner in the atomic count (x86-64 only)
> The patches are kind of hard to read, with most of them just doing
> prep-work that doesn't necessarily matter to the big picture.
>
> What I'd really like to see is
>
>  (a) an overview of the new locking logic

The new locking logic is similar to qrwlock (see patch 11). Cmpxchg is
used to acquire the write lock, while xadd is still used for the read
lock. Some of the bits in the count are also reserved for special
purposes, such as marking that there are waiters or requesting a lock
handoff. Patch 15 tries to compress the write-lock owner task pointer
and put it into the count field for x86-64 at the expense of fewer bits
available for the reader count. I have sent out an additional patch this
morning to make sure that the reader count won't overflow.
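
As a rough userspace illustration of that scheme (not the code from
patch 11), the two fast paths could look like the sketch below. It uses
C11 atomics in place of the kernel atomic helpers, and the sk_ names and
the bit positions are assumptions made up for this example:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical count layout, chosen only for this sketch. */
#define SK_WRITER_LOCKED	(1L << 0)	/* a writer holds the lock */
#define SK_FLAG_WAITERS		(1L << 1)	/* wait queue is not empty */
#define SK_FLAG_HANDOFF		(1L << 2)	/* lock reserved for first waiter */
#define SK_READER_SHIFT		8
#define SK_READER_BIAS		(1L << SK_READER_SHIFT)	/* one reader */
#define SK_READ_FAILED_MASK	(SK_WRITER_LOCKED | SK_FLAG_WAITERS | \
				 SK_FLAG_HANDOFF)

struct sk_rwsem {
	atomic_long count;
};

/* Write-lock fast path: cmpxchg the count from 0 (free) to locked. */
static inline bool sk_write_trylock(struct sk_rwsem *sem)
{
	long expected = 0;

	return atomic_compare_exchange_strong_explicit(&sem->count,
			&expected, SK_WRITER_LOCKED,
			memory_order_acquire, memory_order_relaxed);
}

static inline void sk_write_unlock(struct sk_rwsem *sem)
{
	long old = atomic_fetch_and_explicit(&sem->count, ~SK_WRITER_LOCKED,
					     memory_order_release);

	assert(old & SK_WRITER_LOCKED);
	(void)old;
}

/*
 * Read-lock fast path: xadd one reader bias, then check the old value.
 * A real slow path would queue the reader; this sketch just backs the
 * bias out again and reports failure.
 */
static inline bool sk_read_trylock(struct sk_rwsem *sem)
{
	long old = atomic_fetch_add_explicit(&sem->count, SK_READER_BIAS,
					     memory_order_acquire);

	if (!(old & SK_READ_FAILED_MASK))
		return true;

	atomic_fetch_sub_explicit(&sem->count, SK_READER_BIAS,
				  memory_order_relaxed);
	return false;
}

static inline void sk_read_unlock(struct sk_rwsem *sem)
{
	atomic_fetch_sub_explicit(&sem->count, SK_READER_BIAS,
				  memory_order_release);
}

/* Number of readers currently accounted for in the count. */
static inline long sk_reader_count(struct sk_rwsem *sem)
{
	return atomic_load_explicit(&sem->count, memory_order_relaxed) >>
	       SK_READER_SHIFT;
}
```

The point of the split is that readers never need a retry loop on the
fast path, while a writer can only take the lock when every bit in the
count is clear.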

In terms of performance, there isn't much change with respect to
read-lock performance. For write-lock, I saw a slight drop in some
cases, but nothing significant. The merging of the owner task pointer
into the count field does impose a slightly bigger drop than I would
have liked, which I am going to look into a bit more.

>
>  (b) what's the new fastpath case

The only change in the fastpath is the use of cmpxchg for writer lock.

>
>  (c) some performance numbers

There is performance data in patches 11, 12, 15, 19, 20 and 21. There
was also performance data for patch 4, on eliminating the arch-specific
files, but apparently I deleted it accidentally. Anyway, no noticeable
performance difference was observed when switching to generic C code on
x86, ppc and ARM64.

The major gain in performance comes from the reader optimistic spinning
patches. The microbenchmark that I used showed an order-of-magnitude
performance improvement for mixed reader-writer workloads. Of course, we
will see a smaller gain with real-world benchmarks.

I am planning to run more performance tests and post the data sometime
next week. Davidlohr is also going to run some of his rwsem performance
tests on this patchset.

>
> to explain the changes from a "this is the point of the whole
> exercise" standpoint.
>
> And yes, I realize that the lock handoff and optimistic spinning is a
> big deal, since I've seen the same regression numbers that presumably
> caused this effort to be resurrected. So it's not that I don't find
> this intriguing and worthwhile, it's literally that I'd like a summary
> not so much of the individual patches, but of the new model.
>
> Please?

Maybe I should break this patchset into a few smaller ones to make it
easier to review. Any suggestion is welcome.

Cheers,
Longman



Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-08 Thread Linus Torvalds
On Thu, Feb 7, 2019 at 11:08 AM Waiman Long wrote:
>
> This patchset revamps the current rwsem-xadd implementation to make
> it saner and easier to work with. This patchset removes all the
> architecture specific assembly code and uses generic C code for all
> architectures. This eases maintenance and enables us to enhance the
> code more easily.
>
> This patchset also implements the following 3 new features:
>
>  1) Waiter lock handoff
>  2) Reader optimistic spinning
>  3) Store write-lock owner in the atomic count (x86-64 only)

The patches are kind of hard to read, with most of them just doing
prep-work that doesn't necessarily matter to the big picture.

What I'd really like to see is

 (a) an overview of the new locking logic

 (b) what's the new fastpath case

 (c) some performance numbers

to explain the changes from a "this is the point of the whole
exercise" standpoint.

And yes, I realize that the lock handoff and optimistic spinning is a
big deal, since I've seen the same regression numbers that presumably
caused this effort to be resurrected. So it's not that I don't find
this intriguing and worthwhile, it's literally that I'd like a summary
not so much of the individual patches, but of the new model.

Please?

 Linus


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-07 Thread Waiman Long
On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> On Thu, 07 Feb 2019, Waiman Long wrote:
>> 30 files changed, 1197 insertions(+), 1594 deletions(-)
>
> Performance numbers on numerous workloads, pretty please.
>
> I'll go and throw this at my mmap_sem intensive workloads
> I've collected.
>
> Thanks,
> Davidlohr

Thanks for getting some of the performance numbers. This is the initial
draft after more than a year of hibernation. I will also get other
performance numbers in a subsequent revision of the patch.

Cheers,
Longman



[PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-07 Thread Waiman Long
This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. This patchset removes all the
architecture specific assembly code and uses generic C code for all
architectures. This eases maintenance and enables us to enhance the
code more easily.

This patchset also implements the following 3 new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning
 3) Store write-lock owner in the atomic count (x86-64 only)

Waiter lock handoff is similar to the mechanism currently in the mutex
code. This ensures that lock starvation won't happen.

Reader optimistic spinning enables readers to acquire the lock more
quickly.  So workloads that use a mix of readers and writers should
see an increase in performance.

Finally, storing the write-lock owner into the count will allow
optimistic spinners to get to the lock holder's task structure more
quickly and eliminate the timing gap where the write lock is acquired
but the owner isn't known yet. This is important for RT tasks where
spinning on a lock with an unknown owner is not allowed.
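
To illustrate why merging the owner into the count closes that gap, here
is a small userspace sketch. It is not the encoding used in patch 15,
which also has to leave room for the reader count; it only assumes that
the owner pointer is 8-byte aligned, so the low bits are free for flags.
All sk_ names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct task_struct; aligned so the low 3 bits are zero. */
struct sk_task {
	_Alignas(8) int pid;
};

#define SK_WRITER_LOCKED	((uintptr_t)0x1)
#define SK_FLAG_MASK		((uintptr_t)0x7)

struct sk_rwsem {
	atomic_uintptr_t count;
};

/*
 * One cmpxchg publishes both "write locked" and "who owns it", so a
 * spinner can never observe a locked rwsem with no owner recorded.
 */
static inline bool sk_write_trylock_owner(struct sk_rwsem *sem,
					  struct sk_task *me)
{
	uintptr_t expected = 0;
	uintptr_t locked = (uintptr_t)me | SK_WRITER_LOCKED;

	assert(((uintptr_t)me & SK_FLAG_MASK) == 0);
	return atomic_compare_exchange_strong(&sem->count, &expected, locked);
}

/* Recover the owner from the count, or NULL if not write-locked. */
static inline struct sk_task *sk_write_owner(struct sk_rwsem *sem)
{
	uintptr_t c = atomic_load(&sem->count);

	if (!(c & SK_WRITER_LOCKED))
		return NULL;
	return (struct sk_task *)(c & ~SK_FLAG_MASK);
}
```

With a separate owner field, the lock word and the owner are written in
two steps, and a spinner that looks in between sees a held lock with an
unknown owner. That is the window this feature is meant to remove.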

Because multiple readers can share the same lock, there is a natural
preference for readers when measuring in terms of locking throughput,
as more readers are likely to get into the locking fast path than
writers. With waiter lock handoff, we are not going to starve the
writers.
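
The handoff idea can be sketched in userspace as follows. This is one
plausible shape of the mechanism, modelled loosely on the mutex handoff
flag, and not the code from patch 12. Readers are left out to keep it
short, and all sk_ names and bit values are assumptions:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_WRITER_LOCKED	1L
#define SK_FLAG_HANDOFF		4L	/* first waiter has waited too long */

/* Opportunistic trylock: backs off whenever a handoff is pending. */
static inline bool sk_trylock_steal(atomic_long *count)
{
	long old = atomic_load_explicit(count, memory_order_relaxed);

	do {
		if (old & (SK_WRITER_LOCKED | SK_FLAG_HANDOFF))
			return false;
	} while (!atomic_compare_exchange_weak_explicit(count, &old,
			old | SK_WRITER_LOCKED,
			memory_order_acquire, memory_order_relaxed));
	return true;
}

/* Called by the first waiter once it decides it is being starved. */
static inline void sk_request_handoff(atomic_long *count)
{
	atomic_fetch_or_explicit(count, SK_FLAG_HANDOFF,
				 memory_order_relaxed);
}

/* Only the first waiter may take the lock while handoff is pending. */
static inline bool sk_trylock_first_waiter(atomic_long *count)
{
	long old = atomic_load_explicit(count, memory_order_relaxed);

	do {
		if (old & SK_WRITER_LOCKED)
			return false;
	} while (!atomic_compare_exchange_weak_explicit(count, &old,
			(old & ~SK_FLAG_HANDOFF) | SK_WRITER_LOCKED,
			memory_order_acquire, memory_order_relaxed));
	return true;
}

static inline void sk_unlock(atomic_long *count)
{
	atomic_fetch_and_explicit(count, ~SK_WRITER_LOCKED,
				  memory_order_release);
}
```

Once the handoff bit is set, newcomers on the fast path can no longer
steal the lock after an unlock, so the waiter at the head of the queue
is guaranteed to get it next.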

Patches 1-2 rework the qspinlock_stat code to make it generic (lock
event counting) so that it can be used by all architectures and all
locking code.

Patch 3 relocates rwsem_down_read_failed() and associated functions
to below the optimistic spinning functions.

Patch 4 eliminates all architecture specific code and uses generic C
code for all.

Patch 5 moves code that manages the owner field closer to the rwsem
lock fast path as it is not needed by the rwsem-spinlock code.

Patch 6 renames rwsem.h to rwsem-xadd.h as it is now specific to
rwsem-xadd.c only.

Patch 7 hides the internal rwsem-xadd functions from the public.

Patch 8 moves the DEBUG_RWSEMS_WARN_ON checks from rwsem.c to
kernel/locking/rwsem-xadd.h and adds some new ones.

Patch 9 enhances the DEBUG_RWSEMS_WARN_ON macro to print out rwsem
internal states that can be useful for debugging purposes.

Patch 10 enables lock event counting in the rwsem code.

Patch 11 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. Write lock is done by atomic_cmpxchg() while read
lock is still being done by atomic_add().

Patch 12 implements lock handoff to prevent lock starvation.

Patch 13 removes rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.

Patch 14 adds some new rwsem owner access helper functions.

Patch 15 merges the write-lock owner task pointer into the count.
Only a 64-bit count has enough space to provide a reasonable number of
bits for the reader count. ARM64 seems to have a problem with the current
encoding scheme, so this owner merging is currently limited to x86-64.

Patch 16 eliminates redundant computation of the merged owner-count.

Patch 17 reduces the chance of missed optimistic spinning opportunity
because of some race conditions.

Patch 18 makes rwsem_spin_on_owner() return a tri-state value.

Patch 19 enables readers to spin on a writer-owned rwsem.

Patch 20 enables lock waiters to spin on a reader-owned rwsem with
a limited number of tries.

Patch 21 makes reader wakeup wake all the readers in the wait queue
instead of just those in the front.

Patch 22 disallows RT tasks from spinning on an rwsem with an unknown
owner.

In terms of performance, eliminating architecture specific assembly code
and using generic code doesn't seem to have any impact on performance.

Supporting lock handoff does have a minor performance impact on highly
contended rwsems, but it is a price worth paying for preventing lock
starvation.

Reader optimistic spinning is generally good for performance. Of course,
there will be some corner cases where performance may suffer.

Merging owner into count does have a minor performance impact. We can
discuss if this is a feature we want to have in the rwsem code.

There is also some performance data scattered in some of the patches.


Waiman Long (22):
  locking/qspinlock_stat: Introduce a generic lockevent counting APIs
  locking/lock_events: Make lock_events available for all archs & other
    locks
  locking/rwsem: Relocate rwsem_down_read_failed()
  locking/rwsem: Remove arch specific rwsem files
  locking/rwsem: Move owner setting code from rwsem.c to rwsem.h
  locking/rwsem: Rename kernel/locking/rwsem.h
  locking/rwsem: Move rwsem internal function declarations to
    rwsem-xadd.h
  locking/rwsem: Add debug check for __down_read*()
  locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro
  locking/rwsem: Enable lock event counting
  locking/rwsem: Implement a new locking scheme
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Remove rwsem_wake() wakeup optimization
  

Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-07 Thread Davidlohr Bueso

On Thu, 07 Feb 2019, Waiman Long wrote:
> 30 files changed, 1197 insertions(+), 1594 deletions(-)


Performance numbers on numerous workloads, pretty please.

I'll go and throw this at my mmap_sem intensive workloads
I've collected.

Thanks,
Davidlohr