Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Waiman Long
On 03/22/2019 03:30 PM, Davidlohr Bueso wrote:
> On Fri, 22 Mar 2019, Linus Torvalds wrote:
>> Some of them _might_ be performance-critical. There's the one on
>> mmap_sem in the fault handling path, for example. And yes, I'd expect
>> the normal case to very much be "no other readers or writers" for that
>> one.
>
> Yeah, the mmap_sem case in the fault path is really expecting an unlocked
> state. To the point that four archs have added branch predictions, ie:
>
> 92181f190b6 (x86: optimise x86's do_page_fault (C entry point for the page fault path))
> b15021d994f (powerpc/mm: Add a bunch of (un)likely annotations to do_page_fault)
>
> And using PROFILE_ANNOTATED_BRANCHES shows pretty clearly:
> (without resetting the counters)
>
>  correct    incorrect  %  Function            File     Line
>  ---------  ---------  -  ------------------  -------  ----
>  4603685           34  0  do_user_addr_fault  fault.c  1416  (bootup)
>  382327745         449  0  do_user_addr_fault  fault.c  1416  (kernel build)
>  399446159         461  0  do_user_addr_fault  fault.c  1416  (redis benchmark)
>
> It probably wouldn't harm to do the unlikely() for all archs, or
> alternatively, add likely() to the atomic_long_try_cmpxchg_acquire in
> patch 3 and do it implicitly, but maybe that would be less flexible(?)
>
> Thanks,
> Davidlohr

I had used my lock event counting code to count the number of
contended and uncontended trylocks. I tested both bootup and kernel
build. I saw less than 1% contended; the rest were all uncontended.
That is similar to what you got. I thought I had sent the data out
previously, but I couldn't find the email. That was the main reason
why I took Linus' suggestion to optimize it for the uncontended case.

Thanks,
Longman



Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Davidlohr Bueso

On Fri, 22 Mar 2019, Linus Torvalds wrote:

Some of them _might_ be performance-critical. There's the one on
mmap_sem in the fault handling path, for example. And yes, I'd expect
the normal case to very much be "no other readers or writers" for that
one.


Yeah, the mmap_sem case in the fault path is really expecting an unlocked
state. To the point that four archs have added branch predictions, ie:

92181f190b6 (x86: optimise x86's do_page_fault (C entry point for the page fault path))
b15021d994f (powerpc/mm: Add a bunch of (un)likely annotations to do_page_fault)

And using PROFILE_ANNOTATED_BRANCHES shows pretty clearly:
(without resetting the counters)

 correct    incorrect  %  Function            File     Line
 ---------  ---------  -  ------------------  -------  ----
 4603685           34  0  do_user_addr_fault  fault.c  1416  (bootup)
 382327745         449  0  do_user_addr_fault  fault.c  1416  (kernel build)
 399446159         461  0  do_user_addr_fault  fault.c  1416  (redis benchmark)

It probably wouldn't harm to do the unlikely() for all archs, or
alternatively, add likely() to the atomic_long_try_cmpxchg_acquire in
patch 3 and do it implicitly, but maybe that would be less flexible(?)

Thanks,
Davidlohr
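
For readers not familiar with the annotations being discussed: likely()/unlikely()
are thin wrappers around GCC's __builtin_expect. Below is a minimal user-space
sketch of what an unlikely() on the trylock failure path looks like; the
rwsem_try_read() helper is made up for illustration and stands in for the real
down_read_trylock(&mm->mmap_sem) call in the fault path.

    #include <stdbool.h>
    #include <stdio.h>

    /* Essentially the kernel's definitions when branch profiling is off. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Stand-in for down_read_trylock(): succeeds only when the count is 0. */
    static bool rwsem_try_read(long *count)
    {
        return __sync_bool_compare_and_swap(count, 0, 1);
    }

    int main(void)
    {
        long mmap_sem_count = 0;

        /*
         * Marking the failure branch unlikely lets the compiler lay out the
         * uncontended path as the fall-through, which is what the branch
         * profiling numbers above justify for the fault path.
         */
        if (unlikely(!rwsem_try_read(&mmap_sem_count))) {
            puts("contended: would take the slow path");
            return 1;
        }
        puts("uncontended: fault handling proceeds");
        return 0;
    }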


Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Waiman Long
On 03/22/2019 01:25 PM, Russell King - ARM Linux admin wrote:
> On Fri, Mar 22, 2019 at 10:30:08AM -0400, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> Before this patch, down_read_trylock:
>>
>>    0x <+0>:     callq  0x5
>>    0x0005 <+5>:     jmp    0x18
>>    0x0007 <+7>:     lea    0x1(%rdx),%rcx
>>    0x000b <+11>:    mov    %rdx,%rax
>>    0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
>>    0x0013 <+19>:    cmp    %rax,%rdx
>>    0x0016 <+22>:    je     0x23
>>    0x0018 <+24>:    mov    (%rdi),%rdx
>>    0x001b <+27>:    test   %rdx,%rdx
>>    0x001e <+30>:    jns    0x7
>>    0x0020 <+32>:    xor    %eax,%eax
>>    0x0022 <+34>:    retq
>>    0x0023 <+35>:    mov    %gs:0x0,%rax
>>    0x002c <+44>:    or     $0x3,%rax
>>    0x0030 <+48>:    mov    %rax,0x20(%rdi)
>>    0x0034 <+52>:    mov    $0x1,%eax
>>    0x0039 <+57>:    retq
>>
>> After patch, down_read_trylock:
>>
>>    0x <+0>:     callq  0x5
>>    0x0005 <+5>:     xor    %eax,%eax
>>    0x0007 <+7>:     lea    0x1(%rax),%rdx
>>    0x000b <+11>:    lock cmpxchg %rdx,(%rdi)
>>    0x0010 <+16>:    jne    0x29
>>    0x0012 <+18>:    mov    %gs:0x0,%rax
>>    0x001b <+27>:    or     $0x3,%rax
>>    0x001f <+31>:    mov    %rax,0x20(%rdi)
>>    0x0023 <+35>:    mov    $0x1,%eax
>>    0x0028 <+40>:    retq
>>    0x0029 <+41>:    test   %rax,%rax
>>    0x002c <+44>:    jns    0x7
>>    0x002e <+46>:    xor    %eax,%eax
>>    0x0030 <+48>:    retq
>>
>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>> load of 10 to lengthen the lock critical section) on a x86-64 system
>> before and after the patch were:
>>
>>                  Before Patch    After Patch
>>   # of Threads      rlock           rlock
>>   ------------      ------          ------
>>        1            14,496          14,716
>>        2             8,644           8,453
>>        4             6,799           6,983
>>        8             5,664           7,190
>>
>> On a ARM64 system, the performance results were:
>>
>>                  Before Patch    After Patch
>>   # of Threads      rlock           rlock
>>   ------------      ------          ------
>>        1            23,676          24,488
>>        2             7,697           9,502
>>        4             4,945           3,440
>>        8             2,641           1,603
>>
>> For the uncontended case (1 thread), the new down_read_trylock() is a
>> little bit faster. For the contended cases, the new down_read_trylock()
>> performs pretty well on x86-64, but performance degrades at high
>> contention levels on ARM64.
> So, 70% for 4 threads, 61% for 8 threads - does this trend
> continue tailing off as the number of threads (and cores)
> increases?
>
I didn't try a higher number of contending threads. I wouldn't worry too
much about contention, as a trylock is a one-off event. The chance of
having more than one trylock happening simultaneously is very small.

Cheers,
Longman



Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Russell King - ARM Linux admin
On Fri, Mar 22, 2019 at 10:30:08AM -0400, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
> 
> Before this patch, down_read_trylock:
> 
>    0x <+0>:     callq  0x5
>    0x0005 <+5>:     jmp    0x18
>    0x0007 <+7>:     lea    0x1(%rdx),%rcx
>    0x000b <+11>:    mov    %rdx,%rax
>    0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
>    0x0013 <+19>:    cmp    %rax,%rdx
>    0x0016 <+22>:    je     0x23
>    0x0018 <+24>:    mov    (%rdi),%rdx
>    0x001b <+27>:    test   %rdx,%rdx
>    0x001e <+30>:    jns    0x7
>    0x0020 <+32>:    xor    %eax,%eax
>    0x0022 <+34>:    retq
>    0x0023 <+35>:    mov    %gs:0x0,%rax
>    0x002c <+44>:    or     $0x3,%rax
>    0x0030 <+48>:    mov    %rax,0x20(%rdi)
>    0x0034 <+52>:    mov    $0x1,%eax
>    0x0039 <+57>:    retq
> 
> After patch, down_read_trylock:
> 
>    0x <+0>:     callq  0x5
>    0x0005 <+5>:     xor    %eax,%eax
>    0x0007 <+7>:     lea    0x1(%rax),%rdx
>    0x000b <+11>:    lock cmpxchg %rdx,(%rdi)
>    0x0010 <+16>:    jne    0x29
>    0x0012 <+18>:    mov    %gs:0x0,%rax
>    0x001b <+27>:    or     $0x3,%rax
>    0x001f <+31>:    mov    %rax,0x20(%rdi)
>    0x0023 <+35>:    mov    $0x1,%eax
>    0x0028 <+40>:    retq
>    0x0029 <+41>:    test   %rax,%rax
>    0x002c <+44>:    jns    0x7
>    0x002e <+46>:    xor    %eax,%eax
>    0x0030 <+48>:    retq
> 
> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> load of 10 to lengthen the lock critical section) on a x86-64 system
> before and after the patch were:
> 
>                  Before Patch    After Patch
>    # of Threads     rlock           rlock
>    ------------     ------          ------
>         1           14,496          14,716
>         2            8,644           8,453
>         4            6,799           6,983
>         8            5,664           7,190
> 
> On a ARM64 system, the performance results were:
> 
>                  Before Patch    After Patch
>    # of Threads     rlock           rlock
>    ------------     ------          ------
>         1           23,676          24,488
>         2            7,697           9,502
>         4            4,945           3,440
>         8            2,641           1,603
> 
> For the uncontended case (1 thread), the new down_read_trylock() is a
> little bit faster. For the contended cases, the new down_read_trylock()
> performs pretty well on x86-64, but performance degrades at high
> contention levels on ARM64.

So, 70% for 4 threads, 61% for 8 threads - does this trend
continue tailing off as the number of threads (and cores)
increases?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Waiman Long
On 03/22/2019 01:01 PM, Linus Torvalds wrote:
> On Fri, Mar 22, 2019 at 7:30 AM Waiman Long  wrote:
>>  19 files changed, 133 insertions(+), 930 deletions(-)
> Lovely. And it all looks sane to me.
>
> So ack.
>
> The only comment I have is about __down_read_trylock(), which probably
> isn't critical enough to actually care about, but:
>
>> +static inline int __down_read_trylock(struct rw_semaphore *sem)
>> +{
>> +   long tmp;
>> +
>> +   while ((tmp = atomic_long_read(&sem->count)) >= 0) {
>> +       if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
>> +                  tmp + RWSEM_ACTIVE_READ_BIAS)) {
>> +   return 1;
>> +   }
>> +   }
>> +   return 0;
>> +}
> So this seems to
>
>  (a) read the line early (the whole cacheline in shared state issue)
>
>  (b) read the line again unnecessarily in the while loop
>
> Now, (a) might be explained by "well, maybe we do trylock even with
> existing readers", although I continue to think that the case we
> should optimize for is simply the uncontended one, where we don't even
> have multiple readers.
>
> But (b) just seems silly.
>
> So I wonder if it shouldn't just be
>
>         long tmp = 0;
>
>         do {
>                 long new = atomic_long_cmpxchg_acquire(&sem->count, tmp,
>                                 tmp + RWSEM_ACTIVE_READ_BIAS);
>                 if (likely(new == tmp))
>                         return 1;
>                 tmp = new;
>         } while (tmp >= 0);
>         return 0;
>
> which would seem simpler and solve both issues. Hmm?
>
> But honestly, I didn't check what our uses of down_read_trylock() look
> like. We have more of them than I expected, and I _think_ the normal
> case is the "nobody else holds the lock", but that's just a gut
> feeling.
>
> Some of them _might_ be performance-critical. There's the one on
> mmap_sem in the fault handling path, for example. And yes, I'd expect
> the normal case to very much be "no other readers or writers" for that
> one.
>
> NOTE! The above code snippet is absolutely untested, and might be
> completely wrong. Take it as a "something like this" rather than
> anything else.
>
>Linus

As you have noticed already, this patch is just for moving code around
without changing it. I optimize __down_read_trylock() in patch 3.

Cheers,
Longman
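
As a side note, the shape of the loop Linus sketches above can be checked in
user space. Below is a rough translation that swaps the kernel's
atomic_long_cmpxchg_acquire() for C11 atomics and shrinks the read bias to 1;
it only shows the idea (one cmpxchg up front, then reuse of the value the
cmpxchg returned), not the code that was merged - see patch 3 for that.

    #include <stdatomic.h>
    #include <stdio.h>

    #define READ_BIAS 1L    /* toy value; the real RWSEM_ACTIVE_READ_BIAS differs */

    /* Rough analogue of atomic_long_cmpxchg_acquire(): returns the value seen. */
    static long cmpxchg_acquire(_Atomic long *p, long old, long new)
    {
        atomic_compare_exchange_strong_explicit(p, &old, new,
                memory_order_acquire, memory_order_relaxed);
        return old;     /* on failure, C11 writes the observed value into old */
    }

    static int down_read_trylock_sketch(_Atomic long *count)
    {
        long tmp = 0;   /* optimistically assume an unlocked rwsem */

        do {
            long new = cmpxchg_acquire(count, tmp, tmp + READ_BIAS);
            if (new == tmp)
                return 1;       /* cmpxchg succeeded, read lock taken */
            tmp = new;          /* reuse the value the cmpxchg returned */
        } while (tmp >= 0);     /* negative count means a writer is active */
        return 0;
    }

    int main(void)
    {
        _Atomic long unlocked = 0, writer_held = -1;

        printf("unlocked sem:    %d\n", down_read_trylock_sketch(&unlocked));     /* 1 */
        printf("writer-held sem: %d\n", down_read_trylock_sketch(&writer_held));  /* 0 */
        return 0;
    }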



Re: [PATCH v5 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs

2019-03-22 Thread Linus Torvalds
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long  wrote:
>
> For simplification, we are going to remove rwsem-spinlock.c and make all
> architectures use a single implementation of rwsem - rwsem-xadd.c.

Ack.

   Linus


Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Linus Torvalds
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long  wrote:
>
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.

Oh, that should teach me to read all patches in the series before
starting to comment on them.

So ignore my comment on #1.

Linus


[PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Waiman Long
As the generic rwsem-xadd code is using the appropriate acquire and
release versions of the atomic operations, the arch specific rwsem.h
files will not be that much faster than the generic code as long as the
atomic functions are properly implemented. So we can remove those arch
specific rwsem.h and stop building asm/rwsem.h to reduce maintenance
effort.
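
In C11 terms, the point of the paragraph above is that an acquire operation on
lock and a release operation on unlock are all the ordering a lock needs. A
user-space sketch of a reader fast path with those qualifiers follows; the
reader bias is simplified to 1 and waiter handling and slow paths are omitted,
so this is an illustration of the ordering argument, not the rwsem-xadd code.

    #include <stdatomic.h>

    /* Take a read lock on the fast path: one atomic add with acquire ordering. */
    static int read_lock_fast(_Atomic long *count)
    {
        long old = atomic_fetch_add_explicit(count, 1, memory_order_acquire);

        return old >= 0;    /* negative: a writer is involved, go to the slow path */
    }

    /* Drop the read lock: one atomic sub with release ordering. */
    static void read_unlock_fast(_Atomic long *count)
    {
        atomic_fetch_sub_explicit(count, 1, memory_order_release);
    }

    int main(void)
    {
        _Atomic long count = 0;

        if (read_lock_fast(&count)) {
            /* critical section would run here */
            read_unlock_fast(&count);
        }
        return 0;
    }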

Currently, only x86, alpha and ia64 have implemented architecture
specific fast paths. I don't have access to alpha and ia64 systems for
testing, but they are legacy systems that are not likely to be updated
to the latest kernel anyway.

By using a rwsem microbenchmark, the total locking rates on a 4-socket
56-core 112-thread x86-64 system before and after the patch were as
follows (mixed means equal # of read and write locks):

                        Before Patch                  After Patch
   # of Threads    wlock    rlock    mixed       wlock    rlock    mixed
   ------------    ------   ------   ------      ------   ------   ------
        1          29,201   30,143   29,458      28,615   30,172   29,201
        2           6,807   13,299    1,171       7,725   15,025    1,804
        4           6,504   12,755    1,520       7,127   14,286    1,345
        8           6,762   13,412      764       6,826   13,652      726
       16           6,693   15,408      662       6,599   15,938      626
       32           6,145   15,286      496       5,549   15,487      511
       64           5,812   15,495       60       5,858   15,572       60

There were some run-to-run variations for the multi-thread tests. For
x86-64, using the generic C code fast path seems to be a little bit
faster than the assembly version with low lock contention.  Looking at
the assembly version of the fast paths, there are assembly to/from C
code wrappers that save and restore all the callee-clobbered registers
(7 registers on x86-64). The assembly generated from the generic C
code doesn't need to do that. That may explain the slight performance
gain here.

The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h
with no code change, as nothing outside kernel/locking needs to access
the internal rwsem macros and functions.

Signed-off-by: Waiman Long 
---
 MAINTAINERS |   1 -
 arch/alpha/include/asm/rwsem.h  | 211 
 arch/arm/include/asm/Kbuild |   1 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/include/asm/rwsem.h   | 172 ---
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/s390/include/asm/Kbuild|   1 -
 arch/sh/include/asm/Kbuild  |   1 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/x86/include/asm/rwsem.h| 237 
 arch/x86/lib/Makefile   |   1 -
 arch/x86/lib/rwsem.S| 156 -
 arch/x86/um/Makefile|   1 -
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h | 140 ---
 include/linux/rwsem.h   |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem.h  | 130 ++
 19 files changed, 133 insertions(+), 930 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e17ebf70b548..6bfd5a94c08e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9089,7 +9089,6 @@ F:arch/*/include/asm/spinlock*.h
 F: include/linux/rwlock*.h
 F: include/linux/mutex*.h
 F: include/linux/rwsem*.h
-F: arch/*/include/asm/rwsem.h
 F: include/linux/seqlock.h
 F: lib/locking*.[ch]
 F: kernel/locking/
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
deleted file mode 100644
index cf8fc8f9a2ed..
--- a/arch/alpha/include/asm/rwsem.h
+++ /dev/null
@@ -1,211 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky , 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include 
-
-#define RWSEM_UNLOCKED_VALUE   0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS  0x0000000000000001L
-#define RWSEM_ACTIVE_MASK  0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS (-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS    (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-static inline int ___down_read(struct rw_semaphore *sem)
-{
-   long oldcount;
-#ifndef CONFIG_SMP
-   oldcount = sem->count.counter;
-   sem->count.counter += RWSEM_ACTIVE_READ_BIAS;
-#else
-   long temp;
-   __asm__ __volatile__(
-   "1: ldq_l 

[PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Waiman Long
Modify __down_read_trylock() to optimize for an unlocked rwsem and make
it generate slightly better code.

Before this patch, down_read_trylock:

   0x <+0>:     callq  0x5
   0x0005 <+5>:     jmp    0x18
   0x0007 <+7>:     lea    0x1(%rdx),%rcx
   0x000b <+11>:    mov    %rdx,%rax
   0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
   0x0013 <+19>:    cmp    %rax,%rdx
   0x0016 <+22>:    je     0x23
   0x0018 <+24>:    mov    (%rdi),%rdx
   0x001b <+27>:    test   %rdx,%rdx
   0x001e <+30>:    jns    0x7
   0x0020 <+32>:    xor    %eax,%eax
   0x0022 <+34>:    retq
   0x0023 <+35>:    mov    %gs:0x0,%rax
   0x002c <+44>:    or     $0x3,%rax
   0x0030 <+48>:    mov    %rax,0x20(%rdi)
   0x0034 <+52>:    mov    $0x1,%eax
   0x0039 <+57>:    retq

After patch, down_read_trylock:

   0x <+0>:     callq  0x5
   0x0005 <+5>:     xor    %eax,%eax
   0x0007 <+7>:     lea    0x1(%rax),%rdx
   0x000b <+11>:    lock cmpxchg %rdx,(%rdi)
   0x0010 <+16>:    jne    0x29
   0x0012 <+18>:    mov    %gs:0x0,%rax
   0x001b <+27>:    or     $0x3,%rax
   0x001f <+31>:    mov    %rax,0x20(%rdi)
   0x0023 <+35>:    mov    $0x1,%eax
   0x0028 <+40>:    retq
   0x0029 <+41>:    test   %rax,%rax
   0x002c <+44>:    jns    0x7
   0x002e <+46>:    xor    %eax,%eax
   0x0030 <+48>:    retq

By using a rwsem microbenchmark, the down_read_trylock() rate (with a
load of 10 to lengthen the lock critical section) on an x86-64 system
before and after the patch was:

                 Before Patch    After Patch
   # of Threads     rlock           rlock
   ------------     ------          ------
        1           14,496          14,716
        2            8,644           8,453
        4            6,799           6,983
        8            5,664           7,190

On an ARM64 system, the performance results were:

                 Before Patch    After Patch
   # of Threads     rlock           rlock
   ------------     ------          ------
        1           23,676          24,488
        2            7,697           9,502
        4            4,945           3,440
        8            2,641           1,603

For the uncontended case (1 thread), the new down_read_trylock() is a
little bit faster. For the contended cases, the new down_read_trylock()
performs pretty well on x86-64, but performance degrades at high
contention levels on ARM64.

Suggested-by: Linus Torvalds 
Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 45ee00236e03..1f5775aa6a1d 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -174,14 +174,17 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-				tmp + RWSEM_ACTIVE_READ_BIAS)) {
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_ACTIVE_READ_BIAS)) {
 			return 1;
 		}
-	}
+	} while (tmp >= 0);
return 0;
 }
 
-- 
2.18.1
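
The microbenchmark itself is not part of the patch. As a rough idea of its
general shape (a guess, not the tool actually used for the numbers above),
a user-space harness could spawn N threads, hammer a toy trylock/unlock pair
with a small load inside the critical section, and report an attempts-per-second
rate, along these lines:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NSECS   2       /* measurement window in seconds */
    #define LOAD    10      /* dummy work to lengthen the critical section */

    static _Atomic long sem_count;  /* 0 = unlocked, > 0 = reader count */
    static _Atomic int stop;
    static _Atomic long total_ops;

    static int trylock_read(void)
    {
        long tmp = 0;       /* optimize for the unlocked case */

        do {
            if (atomic_compare_exchange_weak_explicit(&sem_count, &tmp, tmp + 1,
                    memory_order_acquire, memory_order_relaxed))
                return 1;
        } while (tmp >= 0);
        return 0;
    }

    static void unlock_read(void)
    {
        atomic_fetch_sub_explicit(&sem_count, 1, memory_order_release);
    }

    static void *worker(void *arg)
    {
        long ops = 0;
        volatile long sink = 0;

        (void)arg;
        while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
            if (trylock_read()) {
                for (int i = 0; i < LOAD; i++)
                    sink += i;          /* pretend to do work under the lock */
                unlock_read();
            }
            ops++;
        }
        atomic_fetch_add(&total_ops, ops);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = argc > 1 ? atoi(argv[1]) : 1;
        pthread_t tid[64];

        if (nthreads < 1 || nthreads > 64)
            nthreads = 1;
        for (int i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        sleep(NSECS);
        atomic_store(&stop, 1);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);

        printf("%d threads: %ld trylock attempts/sec\n",
               nthreads, atomic_load(&total_ops) / NSECS);
        return 0;
    }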



[PATCH v5 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs

2019-03-22 Thread Waiman Long
Currently, we have two different implementations of rwsem:
 1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c)
 2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c)

As we are moving to a single generic implementation of rwsem
(rwsem-xadd.c) with no architecture-specific code needed, there is no
point in keeping two different implementations. In most cases the
performance of rwsem-spinlock.c is worse, and it does not get the
performance tuning and optimizations that have gone into rwsem-xadd.c
over the years.

For simplification, we are going to remove rwsem-spinlock.c and make all
architectures use a single implementation of rwsem - rwsem-xadd.c.

All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM
in the code are removed.
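
For readers who have not looked at the two files, the contrast in a nutshell:
rwsem-spinlock.c funnels every lock operation through an internal spinlock
protecting a plain counter, while rwsem-xadd.c keeps the fast path to a single
atomic operation on the count. A rough user-space illustration follows, with
the reader bias simplified to 1 and waiter handling omitted; this is not the
kernel code, only a sketch of the difference in structure.

    #include <pthread.h>
    #include <stdatomic.h>

    /* rwsem-spinlock.c style: the count is only touched under an internal lock. */
    struct rwsem_slow {
        pthread_spinlock_t wait_lock;
        long count;
    };

    static int slow_down_read_trylock(struct rwsem_slow *sem)
    {
        int ret = 0;

        pthread_spin_lock(&sem->wait_lock);
        if (sem->count >= 0) {          /* no writer: take a reader reference */
            sem->count++;
            ret = 1;
        }
        pthread_spin_unlock(&sem->wait_lock);
        return ret;
    }

    /* rwsem-xadd.c style: the fast path is one lock-free atomic operation. */
    static int xadd_down_read_trylock(_Atomic long *count)
    {
        long tmp = 0;

        do {
            if (atomic_compare_exchange_strong_explicit(count, &tmp, tmp + 1,
                    memory_order_acquire, memory_order_relaxed))
                return 1;
        } while (tmp >= 0);
        return 0;
    }

    int main(void)
    {
        struct rwsem_slow s;
        _Atomic long c = 0;

        pthread_spin_init(&s.wait_lock, PTHREAD_PROCESS_PRIVATE);
        s.count = 0;
        return !(slow_down_read_trylock(&s) && xadd_down_read_trylock(&c));
    }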

Suggested-by: Peter Zijlstra 
Signed-off-by: Waiman Long 
---
 arch/alpha/Kconfig  |   7 -
 arch/arc/Kconfig|   3 -
 arch/arm/Kconfig|   4 -
 arch/arm64/Kconfig  |   3 -
 arch/c6x/Kconfig|   3 -
 arch/csky/Kconfig   |   3 -
 arch/h8300/Kconfig  |   3 -
 arch/hexagon/Kconfig|   6 -
 arch/ia64/Kconfig   |   4 -
 arch/m68k/Kconfig   |   7 -
 arch/microblaze/Kconfig |   6 -
 arch/mips/Kconfig   |   7 -
 arch/nds32/Kconfig  |   3 -
 arch/nios2/Kconfig  |   3 -
 arch/openrisc/Kconfig   |   6 -
 arch/parisc/Kconfig |   6 -
 arch/powerpc/Kconfig|   7 -
 arch/riscv/Kconfig  |   3 -
 arch/s390/Kconfig   |   6 -
 arch/sh/Kconfig |   6 -
 arch/sparc/Kconfig  |   8 -
 arch/unicore32/Kconfig  |   6 -
 arch/x86/Kconfig|   3 -
 arch/x86/um/Kconfig |   6 -
 arch/xtensa/Kconfig |   3 -
 include/linux/rwsem-spinlock.h  |  47 -
 include/linux/rwsem.h   |   5 -
 kernel/Kconfig.locks|   2 +-
 kernel/locking/Makefile |   4 +-
 kernel/locking/rwsem-spinlock.c | 339 
 kernel/locking/rwsem.h  |   3 -
 31 files changed, 2 insertions(+), 520 deletions(-)
 delete mode 100644 include/linux/rwsem-spinlock.h
 delete mode 100644 kernel/locking/rwsem-spinlock.c

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 584a6e114853..27c871227eee 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -49,13 +49,6 @@ config MMU
bool
default y
 
-config RWSEM_GENERIC_SPINLOCK
-   bool
-
-config RWSEM_XCHGADD_ALGORITHM
-   bool
-   default y
-
 config ARCH_HAS_ILOG2_U32
bool
default n
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index c781e45d1d99..23e063df5d2c 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -63,9 +63,6 @@ config SCHED_OMIT_FRAME_POINTER
 config GENERIC_CSUM
def_bool y
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool y
-
 config ARCH_DISCONTIGMEM_ENABLE
def_bool n
 
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 054ead960f98..c11c61093c6c 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -178,10 +178,6 @@ config TRACE_IRQFLAGS_SUPPORT
bool
default !CPU_V7M
 
-config RWSEM_XCHGADD_ALGORITHM
-   bool
-   default y
-
 config ARCH_HAS_ILOG2_U32
bool
 
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7e34b9eba5de..c62b9db2b5e8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -237,9 +237,6 @@ config LOCKDEP_SUPPORT
 config TRACE_IRQFLAGS_SUPPORT
def_bool y
 
-config RWSEM_XCHGADD_ALGORITHM
-   def_bool y
-
 config GENERIC_BUG
def_bool y
depends on BUG
diff --git a/arch/c6x/Kconfig b/arch/c6x/Kconfig
index e5cd3c5f8399..ed92b5840c0a 100644
--- a/arch/c6x/Kconfig
+++ b/arch/c6x/Kconfig
@@ -27,9 +27,6 @@ config MMU
 config FPU
def_bool n
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool y
-
 config GENERIC_CALIBRATE_DELAY
def_bool y
 
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 725a115759c9..6555d1781132 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -92,9 +92,6 @@ config GENERIC_HWEIGHT
 config MMU
def_bool y
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool y
-
 config STACKTRACE_SUPPORT
def_bool y
 
diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig
index c071da34e081..61c01db6c292 100644
--- a/arch/h8300/Kconfig
+++ b/arch/h8300/Kconfig
@@ -27,9 +27,6 @@ config H8300
 config CPU_BIG_ENDIAN
def_bool y
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool y
-
 config GENERIC_HWEIGHT
def_bool y
 
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index ac441680dcc0..3e54a53208d5 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -65,12 +65,6 @@ config GENERIC_CSUM
 config GENERIC_IRQ_PROBE
def_bool y
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool n
-
-config RWSEM_XCHGADD_ALGORITHM
-   def_bool y
-
 config GENERIC_HWEIGHT
 

[PATCH v5 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-03-22 Thread Waiman Long
v5:
 - Rebase to the latest v5.1 tree and fix conflicts in 
   arch/{xtensa,s390}/include/asm/Kbuild.

v4:
 - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.

v3:
 - Optimize __down_read_trylock() for the uncontended case as suggested
   by Linus.

v2:
 - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
 - Update performance test data in patch 1.

The goal of this patchset is to remove the architecture specific files
for rwsem-xadd to make it easier to add enhancements in the later rwsem
patches. It also removes the legacy rwsem-spinlock.c file and makes all
the architectures use a single implementation of rwsem - rwsem-xadd.c.

Waiman Long (3):
  locking/rwsem: Remove arch specific rwsem files
  locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
archs
  locking/rwsem: Optimize down_read_trylock()

 MAINTAINERS |   1 -
 arch/alpha/Kconfig  |   7 -
 arch/alpha/include/asm/rwsem.h  | 211 
 arch/arc/Kconfig|   3 -
 arch/arm/Kconfig|   4 -
 arch/arm/include/asm/Kbuild |   1 -
 arch/arm64/Kconfig  |   3 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/c6x/Kconfig|   3 -
 arch/csky/Kconfig   |   3 -
 arch/h8300/Kconfig  |   3 -
 arch/hexagon/Kconfig|   6 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/Kconfig   |   4 -
 arch/ia64/include/asm/rwsem.h   | 172 
 arch/m68k/Kconfig   |   7 -
 arch/microblaze/Kconfig |   6 -
 arch/mips/Kconfig   |   7 -
 arch/nds32/Kconfig  |   3 -
 arch/nios2/Kconfig  |   3 -
 arch/openrisc/Kconfig   |   6 -
 arch/parisc/Kconfig |   6 -
 arch/powerpc/Kconfig|   7 -
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/riscv/Kconfig  |   3 -
 arch/s390/Kconfig   |   6 -
 arch/s390/include/asm/Kbuild|   1 -
 arch/sh/Kconfig |   6 -
 arch/sh/include/asm/Kbuild  |   1 -
 arch/sparc/Kconfig  |   8 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/unicore32/Kconfig  |   6 -
 arch/x86/Kconfig|   3 -
 arch/x86/include/asm/rwsem.h| 237 --
 arch/x86/lib/Makefile   |   1 -
 arch/x86/lib/rwsem.S| 156 ---
 arch/x86/um/Kconfig |   6 -
 arch/x86/um/Makefile|   1 -
 arch/xtensa/Kconfig |   3 -
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h | 140 -
 include/linux/rwsem-spinlock.h  |  47 -
 include/linux/rwsem.h   |   9 +-
 kernel/Kconfig.locks|   2 +-
 kernel/locking/Makefile |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem-spinlock.c | 339 
 kernel/locking/rwsem.h  | 130 
 48 files changed, 135 insertions(+), 1447 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h
 delete mode 100644 include/linux/rwsem-spinlock.h
 delete mode 100644 kernel/locking/rwsem-spinlock.c

-- 
2.18.1