Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-02-14 Thread Waiman Long
On 02/14/2019 05:37 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
>> v4:
>>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
>>
>> v3:
>>  - Optimize __down_read_trylock() for the uncontended case as suggested
>>by Linus.
>>
>> v2:
>>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
>>  - Update performance test data in patch 1.
>>
>> The goal of this patchset is to remove the architecture specific files
>> for rwsem-xadd to make it easier to add enhancements in the later rwsem
>> patches. It also removes the legacy rwsem-spinlock.c file and makes all
>> the architectures use one single implementation of rwsem - rwsem-xadd.c.
>>
>> Waiman Long (3):
>>   locking/rwsem: Remove arch specific rwsem files
>>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
>> archs
>>   locking/rwsem: Optimize down_read_trylock()
> Acked-by: Peter Zijlstra (Intel) 
>
> with the caveat that I'm happy to exchange patch 3 back to my earlier
> suggestion in case Will expresses concerns wrt the ARM64 performance of
> Linus' suggestion.

I inserted a few lock event counters into the rwsem trylock code:

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
    /*
     * Optimize for the case when the rwsem is not locked at all.
     */
    long tmp = RWSEM_UNLOCKED_VALUE;

    lockevent_inc(rwsem_rtrylock);
    do {
        if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                    tmp + RWSEM_ACTIVE_READ_BIAS)) {
            rwsem_set_reader_owned(sem);
            return 1;
        }
        lockevent_inc(rwsem_rtrylock_retry);
    } while (tmp >= 0);
    lockevent_inc(rwsem_rtrylock_fail);
    return 0;
}

static inline int __down_write_trylock(struct rw_semaphore *sem)
{
    long tmp;

    lockevent_inc(rwsem_wtrylock);
    tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
                      RWSEM_ACTIVE_WRITE_BIAS);
    if (tmp == RWSEM_UNLOCKED_VALUE) {
        rwsem_set_owner(sem);
        return true;
    }
    lockevent_inc(rwsem_wtrylock_fail);
    return false;
}

I booted the new kernel on a 4-socket 56-core 112-thread Broadwell
system. The counter values were:

1) After bootup:

rwsem_rtrylock=784029
rwsem_rtrylock_fail=59
rwsem_rtrylock_retry=394
rwsem_wtrylock=18284
rwsem_wtrylock_fail=230

2) After parallel kernel build (-j112):

rwsem_rtrylock=338667559
rwsem_rtrylock_fail=18
rwsem_rtrylock_retry=51
rwsem_wtrylock=17016332
rwsem_wtrylock_fail=98058

At least for these two use cases, try-for-ownership as suggested by
Linus is the right choice.

Cheers,
Longman



Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Waiman Long
On 02/14/2019 01:02 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>>> it generate slightly better code.
>>>
>>> Before this patch, down_read_trylock:
>>>
>>>    0x0000 <+0>:   callq  0x5
>>>    0x0005 <+5>:   jmp    0x18
>>>    0x0007 <+7>:   lea    0x1(%rdx),%rcx
>>>    0x000b <+11>:  mov    %rdx,%rax
>>>    0x000e <+14>:  lock cmpxchg %rcx,(%rdi)
>>>    0x0013 <+19>:  cmp    %rax,%rdx
>>>    0x0016 <+22>:  je     0x23
>>>    0x0018 <+24>:  mov    (%rdi),%rdx
>>>    0x001b <+27>:  test   %rdx,%rdx
>>>    0x001e <+30>:  jns    0x7
>>>    0x0020 <+32>:  xor    %eax,%eax
>>>    0x0022 <+34>:  retq
>>>    0x0023 <+35>:  mov    %gs:0x0,%rax
>>>    0x002c <+44>:  or     $0x3,%rax
>>>    0x0030 <+48>:  mov    %rax,0x20(%rdi)
>>>    0x0034 <+52>:  mov    $0x1,%eax
>>>    0x0039 <+57>:  retq
>>>
>>> After patch, down_read_trylock:
>>>
>>>    0x0000 <+0>:   callq  0x5
>>>    0x0005 <+5>:   xor    %eax,%eax
>>>    0x0007 <+7>:   lea    0x1(%rax),%rdx
>>>    0x000b <+11>:  lock cmpxchg %rdx,(%rdi)
>>>    0x0010 <+16>:  jne    0x29
>>>    0x0012 <+18>:  mov    %gs:0x0,%rax
>>>    0x001b <+27>:  or     $0x3,%rax
>>>    0x001f <+31>:  mov    %rax,0x20(%rdi)
>>>    0x0023 <+35>:  mov    $0x1,%eax
>>>    0x0028 <+40>:  retq
>>>    0x0029 <+41>:  test   %rax,%rax
>>>    0x002c <+44>:  jns    0x7
>>>    0x002e <+46>:  xor    %eax,%eax
>>>    0x0030 <+48>:  retq
>>>
>>> Using a rwsem microbenchmark, the down_read_trylock() rates (with a
>>> load of 10 to lengthen the lock critical section) on an x86-64 system
>>> before and after the patch were:
>>>
>>>                  Before Patch    After Patch
>>>    # of Threads      rlock          rlock
>>>    ------------      -----          -----
>>>         1           14,496         14,716
>>>         2            8,644          8,453
>>>         4            6,799          6,983
>>>         8            5,664          7,190
>>>
>>> On an ARM64 system, the performance results were:
>>>
>>>                  Before Patch    After Patch
>>>    # of Threads      rlock          rlock
>>>    ------------      -----          -----
>>>         1           23,676         24,488
>>>         2            7,697          9,502
>>>         4            4,945          3,440
>>>         8            2,641          1,603
>> Urgh, yes LL/SC is the obvious exception that can actually do better
>> here :/
>>
>> Will, what say you?
> What machine were these numbers generated on and is it using LL/SC or LSE
> atomics for arm64? If you stick the microbenchmark somewhere, I can go play
> with a broader variety of h/w.
>
> Will

The machine is a 2-socket Cavium ThunderX2 99xx system with 64 cores and
256 threads. I was just using threads from the first socket for this
test. The microbenchmark that I used is attached. I used the command
"./run-locktest -ltryrwsem -r100 -i-10 -c10 -n" to generate the
locking rates.

The lscpu flags were:

fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

Cheers,
Longman



locktest.tar.gz
Description: application/gzip


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Linus Torvalds
On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds
 wrote:
>
> The arm64 numbers scaled horribly even before, and that's because
> there is too much ping-pong, and it's probably because there is no
> "stickiness" to the cacheline to the core, and thus adding the extra
> loop can make the ping-pong issue even worse because now there is more
> of it.

Actually, if it's using the ll/sc, then I don't see why arm64 should
even change. It doesn't really even change the pattern: the initial
load of the value is just replaced with a "ll" that gets a non-zero
value, and then we re-try without even doing the "sc" part.

End result: exact same "load once, then do ll/sc to update". Just
using a slightly different instruction pattern.

But maybe "ll" does something different to the cacheline than a regular "ld"?

Alternatively, the machine you used is using LSE, and the "swp" thing
has some horrid behavior when it fails.

So I take it back. I'm actually surprised that arm64 performs worse. I
don't think it should. But numbers talk, bullshit walks, and it
clearly does make for worse numbers on arm64.

   Linus


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Will Deacon
On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> > Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> > it generate slightly better code.
> > 
> > Before this patch, down_read_trylock:
> > 
> >    0x0000 <+0>:   callq  0x5
> >    0x0005 <+5>:   jmp    0x18
> >    0x0007 <+7>:   lea    0x1(%rdx),%rcx
> >    0x000b <+11>:  mov    %rdx,%rax
> >    0x000e <+14>:  lock cmpxchg %rcx,(%rdi)
> >    0x0013 <+19>:  cmp    %rax,%rdx
> >    0x0016 <+22>:  je     0x23
> >    0x0018 <+24>:  mov    (%rdi),%rdx
> >    0x001b <+27>:  test   %rdx,%rdx
> >    0x001e <+30>:  jns    0x7
> >    0x0020 <+32>:  xor    %eax,%eax
> >    0x0022 <+34>:  retq
> >    0x0023 <+35>:  mov    %gs:0x0,%rax
> >    0x002c <+44>:  or     $0x3,%rax
> >    0x0030 <+48>:  mov    %rax,0x20(%rdi)
> >    0x0034 <+52>:  mov    $0x1,%eax
> >    0x0039 <+57>:  retq
> >
> > After patch, down_read_trylock:
> >
> >    0x0000 <+0>:   callq  0x5
> >    0x0005 <+5>:   xor    %eax,%eax
> >    0x0007 <+7>:   lea    0x1(%rax),%rdx
> >    0x000b <+11>:  lock cmpxchg %rdx,(%rdi)
> >    0x0010 <+16>:  jne    0x29
> >    0x0012 <+18>:  mov    %gs:0x0,%rax
> >    0x001b <+27>:  or     $0x3,%rax
> >    0x001f <+31>:  mov    %rax,0x20(%rdi)
> >    0x0023 <+35>:  mov    $0x1,%eax
> >    0x0028 <+40>:  retq
> >    0x0029 <+41>:  test   %rax,%rax
> >    0x002c <+44>:  jns    0x7
> >    0x002e <+46>:  xor    %eax,%eax
> >    0x0030 <+48>:  retq
> > 
> > Using a rwsem microbenchmark, the down_read_trylock() rates (with a
> > load of 10 to lengthen the lock critical section) on an x86-64 system
> > before and after the patch were:
> >
> >                  Before Patch    After Patch
> >    # of Threads      rlock          rlock
> >    ------------      -----          -----
> >         1           14,496         14,716
> >         2            8,644          8,453
> >         4            6,799          6,983
> >         8            5,664          7,190
> >
> > On an ARM64 system, the performance results were:
> >
> >                  Before Patch    After Patch
> >    # of Threads      rlock          rlock
> >    ------------      -----          -----
> >         1           23,676         24,488
> >         2            7,697          9,502
> >         4            4,945          3,440
> >         8            2,641          1,603
> 
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
> 
> Will, what say you?

What machine were these numbers generated on and is it using LL/SC or LSE
atomics for arm64? If you stick the microbenchmark somewhere, I can go play
with a broader variety of h/w.

Will


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Linus Torvalds
On Thu, Feb 14, 2019 at 6:53 AM Waiman Long  wrote:
>
> The ARM64 result is what I would have expected given that the change was
> to optimize for the uncontended case. The x86-64 result is kind of an
> anomaly to me, but I haven't bothered to dig into that.

I would say that the ARM result is what I'd expect from something that
scales badly to begin with.

The x86-64 result is the expected one: yes, the cmpxchg is done one
extra time, but it results in fewer cache transitions (the cacheline
never goes into "shared" state), and cache transitions are what
matter.

The cost of re-doing the instruction should be low. The cacheline
ping-pong and the cache coherency messages are what hurt.

So I actually think both are very easily explained.

The x86-64 number improves, because there is less cache coherency traffic.

The arm64 numbers scaled horribly even before, and that's because
there is too much ping-pong, and it's probably because there is no
"stickiness" to the cacheline to the core, and thus adding the extra
loop can make the ping-pong issue even worse because now there is more
of it.

The cachelines not sticking at all to a core probably is good for
fairness issues (in particular, sticking *too* much can cause horrible
issues), but it's absolutely horrible if it means that you lose the
cacheline even before you get to complete the second cmpxchg.

  Linus


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-14 Thread Waiman Long
On 02/14/2019 08:23 AM, Davidlohr Bueso wrote:
> On Fri, 08 Feb 2019, Waiman Long wrote:
>> I am planning to run more performance tests and post the data sometime
>> next week. Davidlohr is also going to run some of his rwsem performance
>> tests on this patchset.
>
> So I ran this series on a 40-core IB 2 socket with various workloads in
> mmtests. Below are some of the interesting ones; full numbers and curves
> at https://linux-scalability.org/rwsem-reader-spinner/
>
> All workloads are with increasing number of threads.
>
> -- pagefault timings: pft is an artificial pf benchmark (thus reader
> stress).
> metric is faults/cpu and faults/sec
>   v5.0-rc6 v5.0-rc6
>    dirty
> Hmean faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
> Hmean faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
> Hmean faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
> Hmean faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
> Hmean faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
> Hmean faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
> Hmean faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
> Hmean faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
> Hmean faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
> Hmean faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
> Hmean faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
> Hmean faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
> Hmean faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
> Hmean faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
> Stddev    faults/cpu-1  2949.5159 (   0.00%) 2802.2712 (   4.99%)
> Stddev    faults/cpu-4 24165.9454 (   0.00%)    15841.1232 (  34.45%)
> Stddev    faults/cpu-7 20914.8351 (   0.00%)    22744.3294 (  -8.75%)
> Stddev    faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
> Stddev    faults/cpu-21 2500.1950 (   0.00%) 2200.9518 (  11.97%)
> Stddev    faults/cpu-30 1599.5346 (   0.00%) 1414.0339 (  11.60%)
> Stddev    faults/cpu-40 1473.0181 (   0.00%) 3004.1209 (-103.94%)
> Stddev    faults/sec-1  2655.2581 (   0.00%) 2405.1625 (   9.42%)
> Stddev    faults/sec-4 84042.7234 (   0.00%)    57996.7158 (  30.99%)
> Stddev    faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
> Stddev    faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
> Stddev    faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
> Stddev    faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
> Stddev    faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)
>
> With the exception of the hiccup at 7 threads, things are pretty much in
> the noise region for both metrics.
>
> -- git checkout
>
> First metric is total runtime for runs with incremental threads.
>
>   v5.0-rc6    v5.0-rc6
>  dirty
> User 218.95  219.07
> System   149.29  146.82
> Elapsed 1574.10 1427.08
>
> In this case there's a non trivial improvement (11%) in overall
> elapsed time.
>
> -- reaim (which is always susceptible to rwsem changes for both
> mmap_sem and
> i_mmap)
>     v5.0-rc6   v5.0-rc6
>    dirty
> Hmean compute-1 6674.01 (   0.00%) 6544.28 *  -1.94%*
> Hmean compute-21   85294.91 (   0.00%)    85524.20 *   0.27%*
> Hmean compute-41  149674.70 (   0.00%)   149494.58 *  -0.12%*
> Hmean compute-61  177721.15 (   0.00%)   170507.38 *  -4.06%*
> Hmean compute-81  181531.07 (   0.00%)   180463.24 *  -0.59%*
> Hmean compute-101 189024.09 (   0.00%)   187288.86 *  -0.92%*
> Hmean compute-121 200673.24 (   0.00%)   195327.65 *  -2.66%*
> Hmean compute-141 213082.29 (   0.00%)   211290.80 *  -0.84%*
> Hmean compute-161 207764.06 (   0.00%)   204626.68 *  -1.51%*
>
> The 'compute' workload overall takes a small hit.
>
> Hmean new_dbase-1 60.48 (   0.00%)   60.63 *   0.25%*
> Hmean new_dbase-21  6590.49 (   0.00%) 6671.81 *   1.23%*
> Hmean new_dbase-41 14202.91 (   0.00%)    14470.59 *   1.88%*
> Hmean new_dbase-61 21207.24 (   0.00%)    21067.40 *  -0.66%*
> Hmean new_dbase-81 25542.40 (   0.00%)    25542.40 *   0.00%*
> Hmean new_dbase-101    30165.28 (   0.00%)    30046.21 *  -0.39%*
> Hmean new_dbase-121    33638.33 (   0.00%)    33219.90 *  -1.24%*
> Hmean new_dbase-141    36723.70 (   0.00%)    37504.52 *   2.13%*
> Hmean new_dbase-161    42242.51 (   0.00%)    42117.34 *  -0.30%*
> Hmean shared-1  

Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Waiman Long
On 02/14/2019 05:33 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> Before this patch, down_read_trylock:
>>
>>    0x0000 <+0>:   callq  0x5
>>    0x0005 <+5>:   jmp    0x18
>>    0x0007 <+7>:   lea    0x1(%rdx),%rcx
>>    0x000b <+11>:  mov    %rdx,%rax
>>    0x000e <+14>:  lock cmpxchg %rcx,(%rdi)
>>    0x0013 <+19>:  cmp    %rax,%rdx
>>    0x0016 <+22>:  je     0x23
>>    0x0018 <+24>:  mov    (%rdi),%rdx
>>    0x001b <+27>:  test   %rdx,%rdx
>>    0x001e <+30>:  jns    0x7
>>    0x0020 <+32>:  xor    %eax,%eax
>>    0x0022 <+34>:  retq
>>    0x0023 <+35>:  mov    %gs:0x0,%rax
>>    0x002c <+44>:  or     $0x3,%rax
>>    0x0030 <+48>:  mov    %rax,0x20(%rdi)
>>    0x0034 <+52>:  mov    $0x1,%eax
>>    0x0039 <+57>:  retq
>>
>> After patch, down_read_trylock:
>>
>>    0x0000 <+0>:   callq  0x5
>>    0x0005 <+5>:   xor    %eax,%eax
>>    0x0007 <+7>:   lea    0x1(%rax),%rdx
>>    0x000b <+11>:  lock cmpxchg %rdx,(%rdi)
>>    0x0010 <+16>:  jne    0x29
>>    0x0012 <+18>:  mov    %gs:0x0,%rax
>>    0x001b <+27>:  or     $0x3,%rax
>>    0x001f <+31>:  mov    %rax,0x20(%rdi)
>>    0x0023 <+35>:  mov    $0x1,%eax
>>    0x0028 <+40>:  retq
>>    0x0029 <+41>:  test   %rax,%rax
>>    0x002c <+44>:  jns    0x7
>>    0x002e <+46>:  xor    %eax,%eax
>>    0x0030 <+48>:  retq
>>
>> Using a rwsem microbenchmark, the down_read_trylock() rates (with a
>> load of 10 to lengthen the lock critical section) on an x86-64 system
>> before and after the patch were:
>>
>>                  Before Patch    After Patch
>>    # of Threads      rlock          rlock
>>    ------------      -----          -----
>>         1           14,496         14,716
>>         2            8,644          8,453
>>         4            6,799          6,983
>>         8            5,664          7,190
>>
>> On an ARM64 system, the performance results were:
>>
>>                  Before Patch    After Patch
>>    # of Threads      rlock          rlock
>>    ------------      -----          -----
>>         1           23,676         24,488
>>         2            7,697          9,502
>>         4            4,945          3,440
>>         8            2,641          1,603
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
>
> Will, what say you?

The ARM64 result is what I would have expected given that the change was
to optimize for the uncontended case. The x86-64 result is kind of an
anomaly to me, but I haven't bothered to dig into that.

Cheers,
Longman



Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-14 Thread Davidlohr Bueso

On Fri, 08 Feb 2019, Waiman Long wrote:

> I am planning to run more performance tests and post the data sometime
> next week. Davidlohr is also going to run some of his rwsem performance
> tests on this patchset.


So I ran this series on a 40-core IB 2 socket with various workloads in
mmtests. Below are some of the interesting ones; full numbers and curves
at https://linux-scalability.org/rwsem-reader-spinner/

All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial pf benchmark (thus reader stress).
metric is faults/cpu and faults/sec
                             v5.0-rc6                 v5.0-rc6
                                                         dirty
Hmean     faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
Hmean     faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
Hmean     faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
Hmean     faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
Hmean     faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
Hmean     faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
Hmean     faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
Hmean     faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
Hmean     faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
Hmean     faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
Hmean     faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
Hmean     faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
Hmean     faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
Hmean     faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
Stddev    faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
Stddev    faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
Stddev    faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
Stddev    faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
Stddev    faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
Stddev    faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
Stddev    faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
Stddev    faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
Stddev    faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
Stddev    faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
Stddev    faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
Stddev    faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
Stddev    faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
Stddev    faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in
the noise region for both metrics.

-- git checkout

First metric is total runtime for runs with incremental threads.

            v5.0-rc6    v5.0-rc6
                           dirty
User          218.95      219.07
System        149.29      146.82
Elapsed      1574.10     1427.08

In this case there's a non trivial improvement (11%) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and
i_mmap)
                          v5.0-rc6               v5.0-rc6
                                                    dirty
Hmean     compute-1       6674.01 (   0.00%)     6544.28 *  -1.94%*
Hmean     compute-21     85294.91 (   0.00%)    85524.20 *   0.27%*
Hmean     compute-41    149674.70 (   0.00%)   149494.58 *  -0.12%*
Hmean     compute-61    177721.15 (   0.00%)   170507.38 *  -4.06%*
Hmean     compute-81    181531.07 (   0.00%)   180463.24 *  -0.59%*
Hmean     compute-101   189024.09 (   0.00%)   187288.86 *  -0.92%*
Hmean     compute-121   200673.24 (   0.00%)   195327.65 *  -2.66%*
Hmean     compute-141   213082.29 (   0.00%)   211290.80 *  -0.84%*
Hmean     compute-161   207764.06 (   0.00%)   204626.68 *  -1.51%*

The 'compute' workload overall takes a small hit.

Hmean     new_dbase-1        60.48 (   0.00%)       60.63 *   0.25%*
Hmean     new_dbase-21     6590.49 (   0.00%)     6671.81 *   1.23%*
Hmean     new_dbase-41    14202.91 (   0.00%)    14470.59 *   1.88%*
Hmean     new_dbase-61    21207.24 (   0.00%)    21067.40 *  -0.66%*
Hmean     new_dbase-81    25542.40 (   0.00%)    25542.40 *   0.00%*
Hmean     new_dbase-101   30165.28 (   0.00%)    30046.21 *  -0.39%*
Hmean     new_dbase-121   33638.33 (   0.00%)    33219.90 *  -1.24%*
Hmean     new_dbase-141   36723.70 (   0.00%)    37504.52 *   2.13%*
Hmean     new_dbase-161   42242.51 (   0.00%)    42117.34 *  -0.30%*
Hmean     shared-1           76.54 (   0.00%)       76.09 *  -0.59%*
Hmean     shared-21        7535.51 (   0.00%)     5518.75 * -26.76%*
Hmean     shared-41       17207.81 (   0.00%)    14651.94 * -14.85%*
Hmean shared-61

Re: [PATCH v4 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs

2019-02-14 Thread Peter Zijlstra
On Thu, Feb 14, 2019 at 11:54:47AM +0100, Geert Uytterhoeven wrote:
> On Wed, Feb 13, 2019 at 11:01 PM Waiman Long  wrote:
> > Currently, we have two different implementations of rwsem:
> >  1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c)
> >  2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c)
> >
> > As we are going to use a single generic implementation for rwsem-xadd.c
> > and no architecture-specific code will be needed, there is no point
> > in keeping two different implementations of rwsem. In most cases, the
> > performance of rwsem-spinlock.c will be worse. It also doesn't get all
> > the performance tuning and optimizations that had been implemented in
> > rwsem-xadd.c over the years.
> >
> > For simplification, we are going to remove rwsem-spinlock.c and make all
> > architectures use a single implementation of rwsem - rwsem-xadd.c.
> >
> > All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM
> > in the code are removed.
> >
> > Suggested-by: Peter Zijlstra 
> > Signed-off-by: Waiman Long 
> 
> Note that this conflicts with "[PATCH 03/11] kernel/locks: consolidate
> RWSEM_GENERIC_* options"
> https://lore.kernel.org/lkml/20190213174005.28785-4-...@lst.de/

*sigh*.. of course that never was Cc'ed to locking people :/


Re: [PATCH v4 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs

2019-02-14 Thread Geert Uytterhoeven
On Wed, Feb 13, 2019 at 11:01 PM Waiman Long  wrote:
> Currently, we have two different implementations of rwsem:
>  1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c)
>  2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c)
>
> As we are going to use a single generic implementation for rwsem-xadd.c
> and no architecture-specific code will be needed, there is no point
> in keeping two different implementations of rwsem. In most cases, the
> performance of rwsem-spinlock.c will be worse. It also doesn't get all
> the performance tuning and optimizations that had been implemented in
> rwsem-xadd.c over the years.
>
> For simplification, we are going to remove rwsem-spinlock.c and make all
> architectures use a single implementation of rwsem - rwsem-xadd.c.
>
> All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM
> in the code are removed.
>
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Waiman Long 

Note that this conflicts with "[PATCH 03/11] kernel/locks: consolidate
RWSEM_GENERIC_* options"
https://lore.kernel.org/lkml/20190213174005.28785-4-...@lst.de/

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-02-14 Thread Peter Zijlstra
On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
> v4:
>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
> 
> v3:
>  - Optimize __down_read_trylock() for the uncontended case as suggested
>by Linus.
> 
> v2:
>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
>  - Update performance test data in patch 1.
> 
> The goal of this patchset is to remove the architecture specific files
> for rwsem-xadd to make it easier to add enhancements in the later rwsem
> patches. It also removes the legacy rwsem-spinlock.c file and makes all
> the architectures use one single implementation of rwsem - rwsem-xadd.c.
> 
> Waiman Long (3):
>   locking/rwsem: Remove arch specific rwsem files
>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
> archs
>   locking/rwsem: Optimize down_read_trylock()

Acked-by: Peter Zijlstra (Intel) 

with the caveat that I'm happy to exchange patch 3 back to my earlier
suggestion in case Will expresses concerns wrt the ARM64 performance of
Linus' suggestion.


Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-14 Thread Peter Zijlstra
On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
> 
> Before this patch, down_read_trylock:
> 
>    0x0000 <+0>:   callq  0x5
>    0x0005 <+5>:   jmp    0x18
>    0x0007 <+7>:   lea    0x1(%rdx),%rcx
>    0x000b <+11>:  mov    %rdx,%rax
>    0x000e <+14>:  lock cmpxchg %rcx,(%rdi)
>    0x0013 <+19>:  cmp    %rax,%rdx
>    0x0016 <+22>:  je     0x23
>    0x0018 <+24>:  mov    (%rdi),%rdx
>    0x001b <+27>:  test   %rdx,%rdx
>    0x001e <+30>:  jns    0x7
>    0x0020 <+32>:  xor    %eax,%eax
>    0x0022 <+34>:  retq
>    0x0023 <+35>:  mov    %gs:0x0,%rax
>    0x002c <+44>:  or     $0x3,%rax
>    0x0030 <+48>:  mov    %rax,0x20(%rdi)
>    0x0034 <+52>:  mov    $0x1,%eax
>    0x0039 <+57>:  retq
>
> After patch, down_read_trylock:
>
>    0x0000 <+0>:   callq  0x5
>    0x0005 <+5>:   xor    %eax,%eax
>    0x0007 <+7>:   lea    0x1(%rax),%rdx
>    0x000b <+11>:  lock cmpxchg %rdx,(%rdi)
>    0x0010 <+16>:  jne    0x29
>    0x0012 <+18>:  mov    %gs:0x0,%rax
>    0x001b <+27>:  or     $0x3,%rax
>    0x001f <+31>:  mov    %rax,0x20(%rdi)
>    0x0023 <+35>:  mov    $0x1,%eax
>    0x0028 <+40>:  retq
>    0x0029 <+41>:  test   %rax,%rax
>    0x002c <+44>:  jns    0x7
>    0x002e <+46>:  xor    %eax,%eax
>    0x0030 <+48>:  retq
> 
> Using a rwsem microbenchmark, the down_read_trylock() rates (with a
> load of 10 to lengthen the lock critical section) on an x86-64 system
> before and after the patch were:
>
>                  Before Patch    After Patch
>    # of Threads      rlock          rlock
>    ------------      -----          -----
>         1           14,496         14,716
>         2            8,644          8,453
>         4            6,799          6,983
>         8            5,664          7,190
>
> On an ARM64 system, the performance results were:
>
>                  Before Patch    After Patch
>    # of Threads      rlock          rlock
>    ------------      -----          -----
>         1           23,676         24,488
>         2            7,697          9,502
>         4            4,945          3,440
>         8            2,641          1,603

Urgh, yes LL/SC is the obvious exception that can actually do better
here :/

Will, what say you?