Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Mon, Feb 11, 2019 at 11:35:24AM -0500, Waiman Long wrote:
> On 02/11/2019 06:58 AM, Peter Zijlstra wrote:
> > Which is clearly worse. Now we can write that as:
> >
> >   int __down_read_trylock2(unsigned long *l)
> >   {
> >   long tmp = READ_ONCE(*l);
> >
> >   while (tmp >= 0) {
> >   if (try_cmpxchg(l, , tmp + 1))
> >   return 1;
> >   }
> >
> >   return 0;
> >   }
> >
> > which generates:
> >
> >   0030 <__down_read_trylock2>:
> >   30:   48 8b 07mov(%rdi),%rax
> >   33:   48 85 c0test   %rax,%rax
> >   36:   78 18   js 50 <__down_read_trylock2+0x20>
> >   38:   48 8d 50 01 lea0x1(%rax),%rdx
> >   3c:   f0 48 0f b1 17  lock cmpxchg %rdx,(%rdi)
> >   41:   75 f0   jne33 <__down_read_trylock2+0x3>
> >   43:   b8 01 00 00 00  mov$0x1,%eax
> >   48:   c3  retq
> >   49:   0f 1f 80 00 00 00 00nopl   0x0(%rax)
> >   50:   31 c0   xor%eax,%eax
> >   52:   c3  retq
> >
> > Which is a lot better; but not quite there yet.
> >
> >
> > I've tried quite a bit, but I can't seem to get GCC to generate the:
> >
> > add $1,%rdx
> > jle
> >
> > required; stuff like:
> >
> > new = old + 1;
> > if (new <= 0)
> >
> > generates:
> >
> > lea 0x1(%rax),%rdx
> > test %rdx, %rdx
> > jle
> 
> Thanks for the suggested code snippet. So you want to replace "lea
> 0x1(%rax), %rdx" by "add $1,%rdx"?
> 
> I think the compiler is doing that so as to use the address generation
> unit for addition instead of using the ALU. That will leave the ALU
> available for doing other arithmetic operation in parallel. I don't
> think it is a good idea to override the compiler and force it to use
> ALU. So I am not going to try doing that. It is only 1 or 2 more of
> codes anyway.

Yeah, I was trying to see what I could make it do.. #2 really should be
good enough, but you know how it is once you're poking at it :-)


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Waiman Long
On 02/11/2019 06:58 AM, Peter Zijlstra wrote:
> Which is clearly worse. Now we can write that as:
>
>   int __down_read_trylock2(unsigned long *l)
>   {
> long tmp = READ_ONCE(*l);
>
> while (tmp >= 0) {
> if (try_cmpxchg(l, , tmp + 1))
> return 1;
> }
>
> return 0;
>   }
>
> which generates:
>
>   0030 <__down_read_trylock2>:
>   30:   48 8b 07mov(%rdi),%rax
>   33:   48 85 c0test   %rax,%rax
>   36:   78 18   js 50 <__down_read_trylock2+0x20>
>   38:   48 8d 50 01 lea0x1(%rax),%rdx
>   3c:   f0 48 0f b1 17  lock cmpxchg %rdx,(%rdi)
>   41:   75 f0   jne33 <__down_read_trylock2+0x3>
>   43:   b8 01 00 00 00  mov$0x1,%eax
>   48:   c3  retq
>   49:   0f 1f 80 00 00 00 00nopl   0x0(%rax)
>   50:   31 c0   xor%eax,%eax
>   52:   c3  retq
>
> Which is a lot better; but not quite there yet.
>
>
> I've tried quite a bit, but I can't seem to get GCC to generate the:
>
>   add $1,%rdx
>   jle
>
> required; stuff like:
>
>   new = old + 1;
>   if (new <= 0)
>
> generates:
>
>   lea 0x1(%rax),%rdx
>   test %rdx, %rdx
>   jle

Thanks for the suggested code snippet. So you want to replace "lea
0x1(%rax), %rdx" by "add $1,%rdx"?

I think the compiler is doing that so as to use the address generation
unit for addition instead of using the ALU. That will leave the ALU
available for doing other arithmetic operation in parallel. I don't
think it is a good idea to override the compiler and force it to use
ALU. So I am not going to try doing that. It is only 1 or 2 more of
codes anyway.

Cheers,
Longman



Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Waiman Long
On 02/11/2019 05:39 AM, Ingo Molnar wrote:
> * Ingo Molnar  wrote:
>
>> Sounds good to me - I've merged this patch, will push it out after 
>> testing.
> Based on Peter's feedback I'm delaying this - performance testing on at 
> least one key ll/sc arch would be nice indeed.
>
> Thanks,
>
>   Ingo

Yes, I will twist the generic code to generate better code.

As I said in the commit log, only x86, ia64 and alpha provide assembly
code to replace the generic C code. The ll/sc archs that I have access
to (ARM64, ppc) are all using the generic C code anyway. I actually had
done some performance measurement on both those platforms and didn't see
any performance difference. I didn't include them as they were using
generic code before. I will rerun the tests after I twisted the generic
C code.

Thanks,
Longman




Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:

> +static inline int __down_read_trylock(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + while ((tmp = atomic_long_read(>count)) >= 0) {
> + if (tmp == atomic_long_cmpxchg_acquire(>count, tmp,
> +tmp + RWSEM_ACTIVE_READ_BIAS)) {
> + return 1;
> + }
> + }
> + return 0;
> +}

So the orignal x86 implementation reads:

  static inline bool __down_read_trylock(struct rw_semaphore *sem)
  {
  long result, tmp;
  asm volatile("# beginning __down_read_trylock\n\t"
   "  mov  %[count],%[result]\n\t"
   "1:\n\t"
   "  mov  %[result],%[tmp]\n\t"
   "  add  %[inc],%[tmp]\n\t"
   "  jle2f\n\t"
   LOCK_PREFIX "  cmpxchg  %[tmp],%[count]\n\t"
   "  jnz1b\n\t"
   "2:\n\t"
   "# ending __down_read_trylock\n\t"
   : [count] "+m" (sem->count), [result] "=" (result),
 [tmp] "=" (tmp)
   : [inc] "i" (RWSEM_ACTIVE_READ_BIAS)
   : "memory", "cc");
  return result >= 0;
  }

you replace that with:

  int __down_read_trylock1(unsigned long *l)
  {
  long tmp;

  while ((tmp = READ_ONCE(*l)) >= 0) {
  if (tmp == cmpxchg(l, tmp, tmp + 1))
  return 1;
  }

  return 0;
  }

which generates:

   <__down_read_trylock1>:
   0:   eb 17   jmp19 <__down_read_trylock1+0x19>
   2:   66 0f 1f 44 00 00   nopw   0x0(%rax,%rax,1)
   8:   48 8d 4a 01 lea0x1(%rdx),%rcx
   c:   48 89 d0mov%rdx,%rax
   f:   f0 48 0f b1 0f  lock cmpxchg %rcx,(%rdi)
  14:   48 39 c2cmp%rax,%rdx
  17:   74 0f   je 28 <__down_read_trylock1+0x28>
  19:   48 8b 17mov(%rdi),%rdx
  1c:   48 85 d2test   %rdx,%rdx
  1f:   79 e7   jns8 <__down_read_trylock1+0x8>
  21:   31 c0   xor%eax,%eax
  23:   c3  retq
  24:   0f 1f 40 00 nopl   0x0(%rax)
  28:   b8 01 00 00 00  mov$0x1,%eax
  2d:   c3  retq


Which is clearly worse. Now we can write that as:

  int __down_read_trylock2(unsigned long *l)
  {
  long tmp = READ_ONCE(*l);

  while (tmp >= 0) {
  if (try_cmpxchg(l, , tmp + 1))
  return 1;
  }

  return 0;
  }

which generates:

  0030 <__down_read_trylock2>:
  30:   48 8b 07mov(%rdi),%rax
  33:   48 85 c0test   %rax,%rax
  36:   78 18   js 50 <__down_read_trylock2+0x20>
  38:   48 8d 50 01 lea0x1(%rax),%rdx
  3c:   f0 48 0f b1 17  lock cmpxchg %rdx,(%rdi)
  41:   75 f0   jne33 <__down_read_trylock2+0x3>
  43:   b8 01 00 00 00  mov$0x1,%eax
  48:   c3  retq
  49:   0f 1f 80 00 00 00 00nopl   0x0(%rax)
  50:   31 c0   xor%eax,%eax
  52:   c3  retq

Which is a lot better; but not quite there yet.


I've tried quite a bit, but I can't seem to get GCC to generate the:

add $1,%rdx
jle

required; stuff like:

new = old + 1;
if (new <= 0)

generates:

lea 0x1(%rax),%rdx
test %rdx, %rdx
jle


Ah well, have fun :-)

typedef unsigned char u8;
typedef unsigned short u16;
typedef unsigned int u32;
typedef unsigned long long u64;
typedef signed char s8;
typedef signed short s16;
typedef signed int s32;
typedef signed long long s64;
typedef _Bool bool;

# define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
# define CC_OUT(c) "=@cc" #c

#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x)   __builtin_expect(!!(x), 0)

extern void __cmpxchg_wrong_size(void);

#define __raw_cmpxchg(ptr, old, new, size, lock)			\
({	\
	__typeof__(*(ptr)) __ret;	\
	__typeof__(*(ptr)) __old = (old);\
	__typeof__(*(ptr)) __new = (new);\
	switch (size) {			\
	case 1:		\
	{\
		volatile u8 *__ptr = (volatile u8 *)(ptr);		\
		asm volatile(lock "cmpxchgb %2,%1"			\
			 : "=a" (__ret), "+m" (*__ptr)		\
			 : "q" (__new), "0" (__old)			\
			 : "memory");\
		break;			\
	}\
	case 2:		\
	{\
		volatile u16 *__ptr = (volatile u16 *)(ptr);		\
		asm volatile(lock "cmpxchgw %2,%1"			\
			 : "=a" (__ret), "+m" (*__ptr)		\
			 : "r" (__new), "0" (__old)			\
			 : "memory");\
		break;			\
	}\
	case 4:		\
	{\
		volatile u32 *__ptr = (volatile u32 *)(ptr);		\
		asm volatile(lock 

Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Mon, Feb 11, 2019 at 10:40:44AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 11, 2019 at 10:36:01AM +0100, Peter Zijlstra wrote:
> > On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> > > +static inline int __down_read_trylock(struct rw_semaphore *sem)
> > > +{
> > > + long tmp;
> > > +
> > > + while ((tmp = atomic_long_read(>count)) >= 0) {
> > > + if (tmp == atomic_long_cmpxchg_acquire(>count, tmp,
> > > +tmp + RWSEM_ACTIVE_READ_BIAS)) {
> > > + return 1;
> > 
> > That really wants to be:
> > 
> > if (atomic_long_try_cmpxchg_acquire(>count, ,
> > tmp + 
> > RWSEM_ACTIVE_READ_BIAS))
> > 
> > > + }
> > > + }
> > > + return 0;
> > > +}
> 
> Also, the is the one case where LL/SC can actually do 'better'. Do you
> have benchmarks for say PowerPC or ARM64 ?

Ah, I see they already used asm-generic/rwsem.h which has similar code
to the above.


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Ingo Molnar


* Will Deacon  wrote:

> On Mon, Feb 11, 2019 at 11:39:27AM +0100, Ingo Molnar wrote:
> > 
> > * Ingo Molnar  wrote:
> > 
> > > Sounds good to me - I've merged this patch, will push it out after 
> > > testing.
> > 
> > Based on Peter's feedback I'm delaying this - performance testing on at 
> > least one key ll/sc arch would be nice indeed.
> 
> Once Waiman has posted a new version, I can take it for a spin on some
> arm64 boxen if he shares his workload.

Cool, thanks!

Ingo


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Will Deacon
On Mon, Feb 11, 2019 at 11:39:27AM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar  wrote:
> 
> > Sounds good to me - I've merged this patch, will push it out after 
> > testing.
> 
> Based on Peter's feedback I'm delaying this - performance testing on at 
> least one key ll/sc arch would be nice indeed.

Once Waiman has posted a new version, I can take it for a spin on some
arm64 boxen if he shares his workload.

Will


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Ingo Molnar


* Ingo Molnar  wrote:

> Sounds good to me - I've merged this patch, will push it out after 
> testing.

Based on Peter's feedback I'm delaying this - performance testing on at 
least one key ll/sc arch would be nice indeed.

Thanks,

Ingo


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Mon, Feb 11, 2019 at 10:36:01AM +0100, Peter Zijlstra wrote:
> On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> > +static inline int __down_read_trylock(struct rw_semaphore *sem)
> > +{
> > +   long tmp;
> > +
> > +   while ((tmp = atomic_long_read(>count)) >= 0) {
> > +   if (tmp == atomic_long_cmpxchg_acquire(>count, tmp,
> > +  tmp + RWSEM_ACTIVE_READ_BIAS)) {
> > +   return 1;
> 
> That really wants to be:
> 
>   if (atomic_long_try_cmpxchg_acquire(>count, ,
>   tmp + 
> RWSEM_ACTIVE_READ_BIAS))
> 
> > +   }
> > +   }
> > +   return 0;
> > +}

Also, the is the one case where LL/SC can actually do 'better'. Do you
have benchmarks for say PowerPC or ARM64 ?


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
> index bad2bca..067e265 100644
> --- a/kernel/locking/rwsem.h
> +++ b/kernel/locking/rwsem.h
> @@ -32,6 +32,26 @@
>  # define DEBUG_RWSEMS_WARN_ON(c)
>  #endif
>  
> +/*
> + * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
> + * Adapted largely from include/asm-i386/rwsem.h
> + * by Paul Mackerras .
> + */
> +
> +/*
> + * the semaphore definition
> + */
> +#ifdef CONFIG_64BIT
> +# define RWSEM_ACTIVE_MASK   0xL
> +#else
> +# define RWSEM_ACTIVE_MASK   0xL
> +#endif
> +
> +#define RWSEM_ACTIVE_BIAS0x0001L
> +#define RWSEM_WAITING_BIAS   (-RWSEM_ACTIVE_MASK-1)
> +#define RWSEM_ACTIVE_READ_BIAS   RWSEM_ACTIVE_BIAS
> +#define RWSEM_ACTIVE_WRITE_BIAS  (RWSEM_WAITING_BIAS + 
> RWSEM_ACTIVE_BIAS)
> +
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>  /*
>   * All writes to owner are protected by WRITE_ONCE() to make sure that
> @@ -132,3 +152,113 @@ static inline void rwsem_clear_reader_owned(struct 
> rw_semaphore *sem)
>  {
>  }
>  #endif
> +
> +#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
> +/*
> + * lock for reading
> + */
> +static inline void __down_read(struct rw_semaphore *sem)
> +{
> + if (unlikely(atomic_long_inc_return_acquire(>count) <= 0))
> + rwsem_down_read_failed(sem);
> +}
> +
> +static inline int __down_read_killable(struct rw_semaphore *sem)
> +{
> + if (unlikely(atomic_long_inc_return_acquire(>count) <= 0)) {
> + if (IS_ERR(rwsem_down_read_failed_killable(sem)))
> + return -EINTR;
> + }
> +
> + return 0;
> +}
> +
> +static inline int __down_read_trylock(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + while ((tmp = atomic_long_read(>count)) >= 0) {
> + if (tmp == atomic_long_cmpxchg_acquire(>count, tmp,
> +tmp + RWSEM_ACTIVE_READ_BIAS)) {
> + return 1;

That really wants to be:

if (atomic_long_try_cmpxchg_acquire(>count, ,
tmp + 
RWSEM_ACTIVE_READ_BIAS))

> + }
> + }
> + return 0;
> +}
> +
> +/*
> + * lock for writing
> + */
> +static inline void __down_write(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
> +  >count);
> + if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
> + rwsem_down_write_failed(sem);
> +}
> +
> +static inline int __down_write_killable(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
> +  >count);
> + if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
> + if (IS_ERR(rwsem_down_write_failed_killable(sem)))
> + return -EINTR;
> + return 0;
> +}
> +
> +static inline int __down_write_trylock(struct rw_semaphore *sem)
> +{
> + long tmp;

tmp = RWSEM_UNLOCKED_VALUE;

> +
> + tmp = atomic_long_cmpxchg_acquire(>count, RWSEM_UNLOCKED_VALUE,
> +   RWSEM_ACTIVE_WRITE_BIAS);
> + return tmp == RWSEM_UNLOCKED_VALUE;

return atomic_long_try_cmpxchg_acquire(>count, ,
   RWSEM_ACTIVE_WRITE_BIAS);

> +}
> +
> +/*
> + * unlock after reading
> + */
> +static inline void __up_read(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + tmp = atomic_long_dec_return_release(>count);
> + if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
> + rwsem_wake(sem);
> +}
> +
> +/*
> + * unlock after writing
> + */
> +static inline void __up_write(struct rw_semaphore *sem)
> +{
> + if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
> + >count) < 0))
> + rwsem_wake(sem);
> +}
> +
> +/*
> + * downgrade write lock to read lock
> + */
> +static inline void __downgrade_write(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + /*
> +  * When downgrading from exclusive to shared ownership,
> +  * anything inside the write-locked region cannot leak
> +  * into the read side. In contrast, anything in the
> +  * read-locked region is ok to be re-ordered into the
> +  * write side. As such, rely on RELEASE semantics.
> +  */
> + tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, >count);
> + if (tmp < 0)
> + rwsem_downgrade_wake(sem);
> +}
> +
> +#endif /* CONFIG_RWSEM_XCHGADD_ALGORITHM */


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-10 Thread Ingo Molnar


* Waiman Long  wrote:

> On 02/10/2019 09:00 PM, Waiman Long wrote:
> > As the generic rwsem-xadd code is using the appropriate acquire and
> > release versions of the atomic operations, the arch specific rwsem.h
> > files will not be that much faster than the generic code as long as the
> > atomic functions are properly implemented. So we can remove those arch
> > specific rwsem.h and stop building asm/rwsem.h to reduce maintenance
> > effort.
> >
> > Currently, only x86, alpha and ia64 have implemented architecture
> > specific fast paths. I don't have access to alpha and ia64 systems for
> > testing, but they are legacy systems that are not likely to be updated
> > to the latest kernel anyway.
> >
> > By using a rwsem microbenchmark, the total locking rates on a 4-socket
> > 56-core 112-thread x86-64 system before and after the patch were as
> > follows (mixed means equal # of read and write locks):
> >
> >   Before Patch  After Patch
> ># of Threads  wlock   rlock   mixed wlock   rlock   mixed
> >  -   -   - -   -   -
> > 127,373  29,409  28,17028,773  30,164  29,276
> > 2 7,697  14,922   1,703 7,435  15,167   1,729
> > 4 6,987  14,285   1,490 7,181  14,438   1,330
> > 8 6,650  13,652 761 6,918  13,796 718
> >16 6,434  15,729 713 6,554  16,030 625
> >32 5,590  15,312 552 6,124  15,344 471
> >64 5,980  15,478  61 5,668  15,509  58
> >
> > There were some run-to-run variations for the multi-thread tests. For
> > x86-64, using the generic C code fast path seems to be a liitle bit
> > faster than the assembly version especially for read-lock and when lock
> > contention is low.  Looking at the assembly version of the fast paths,
> > there are assembly to/from C code wrappers that save and restore all
> > the callee-clobbered registers (7 registers on x86-64). The assembly
> > generated from the generic C code doesn't need to do that. That may
> > explain the slight performance gain here.
> >
> > The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h
> > as no other code other than those under kernel/locking needs to access
> > the internal rwsem macros and functions.
> >
> > Signed-off-by: Waiman Long 
> 
> I have decided to break the rwsem patchset that I sent out on last
> Thursday into 3 parts. This patch is part 0 as it touches a number of
> arch specific files and so have the widest distribution. I would like to
> get it merged first. Part 1 will be patches 1-10 (except 4) of my
> original rwsem patchset. This part moves things around, adds more
> debugging capability and lays the ground work for the next part. Part 2
> will contains the remaining patches which are the real beef of the whole
> patchset.

Sounds good to me - I've merged this patch, will push it out after 
testing.

Thanks,

Ingo


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-10 Thread Waiman Long
On 02/10/2019 09:00 PM, Waiman Long wrote:
> As the generic rwsem-xadd code is using the appropriate acquire and
> release versions of the atomic operations, the arch specific rwsem.h
> files will not be that much faster than the generic code as long as the
> atomic functions are properly implemented. So we can remove those arch
> specific rwsem.h and stop building asm/rwsem.h to reduce maintenance
> effort.
>
> Currently, only x86, alpha and ia64 have implemented architecture
> specific fast paths. I don't have access to alpha and ia64 systems for
> testing, but they are legacy systems that are not likely to be updated
> to the latest kernel anyway.
>
> By using a rwsem microbenchmark, the total locking rates on a 4-socket
> 56-core 112-thread x86-64 system before and after the patch were as
> follows (mixed means equal # of read and write locks):
>
>   Before Patch  After Patch
># of Threads  wlock   rlock   mixed wlock   rlock   mixed
>  -   -   - -   -   -
> 127,373  29,409  28,17028,773  30,164  29,276
> 2 7,697  14,922   1,703 7,435  15,167   1,729
> 4 6,987  14,285   1,490 7,181  14,438   1,330
> 8 6,650  13,652 761 6,918  13,796 718
>16 6,434  15,729 713 6,554  16,030 625
>32 5,590  15,312 552 6,124  15,344 471
>64 5,980  15,478  61 5,668  15,509  58
>
> There were some run-to-run variations for the multi-thread tests. For
> x86-64, using the generic C code fast path seems to be a liitle bit
> faster than the assembly version especially for read-lock and when lock
> contention is low.  Looking at the assembly version of the fast paths,
> there are assembly to/from C code wrappers that save and restore all
> the callee-clobbered registers (7 registers on x86-64). The assembly
> generated from the generic C code doesn't need to do that. That may
> explain the slight performance gain here.
>
> The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h
> as no other code other than those under kernel/locking needs to access
> the internal rwsem macros and functions.
>
> Signed-off-by: Waiman Long 

I have decided to break the rwsem patchset that I sent out on last
Thursday into 3 parts. This patch is part 0 as it touches a number of
arch specific files and so have the widest distribution. I would like to
get it merged first. Part 1 will be patches 1-10 (except 4) of my
original rwsem patchset. This part moves things around, adds more
debugging capability and lays the ground work for the next part. Part 2
will contains the remaining patches which are the real beef of the whole
patchset.

Cheers,
Longman



[PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-10 Thread Waiman Long
As the generic rwsem-xadd code is using the appropriate acquire and
release versions of the atomic operations, the arch specific rwsem.h
files will not be that much faster than the generic code as long as the
atomic functions are properly implemented. So we can remove those arch
specific rwsem.h and stop building asm/rwsem.h to reduce maintenance
effort.

Currently, only x86, alpha and ia64 have implemented architecture
specific fast paths. I don't have access to alpha and ia64 systems for
testing, but they are legacy systems that are not likely to be updated
to the latest kernel anyway.

By using a rwsem microbenchmark, the total locking rates on a 4-socket
56-core 112-thread x86-64 system before and after the patch were as
follows (mixed means equal # of read and write locks):

  Before Patch  After Patch
   # of Threads  wlock   rlock   mixed wlock   rlock   mixed
     -   -   - -   -   -
127,373  29,409  28,17028,773  30,164  29,276
2 7,697  14,922   1,703 7,435  15,167   1,729
4 6,987  14,285   1,490 7,181  14,438   1,330
8 6,650  13,652 761 6,918  13,796 718
   16 6,434  15,729 713 6,554  16,030 625
   32 5,590  15,312 552 6,124  15,344 471
   64 5,980  15,478  61 5,668  15,509  58

There were some run-to-run variations for the multi-thread tests. For
x86-64, using the generic C code fast path seems to be a liitle bit
faster than the assembly version especially for read-lock and when lock
contention is low.  Looking at the assembly version of the fast paths,
there are assembly to/from C code wrappers that save and restore all
the callee-clobbered registers (7 registers on x86-64). The assembly
generated from the generic C code doesn't need to do that. That may
explain the slight performance gain here.

The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h
as no other code other than those under kernel/locking needs to access
the internal rwsem macros and functions.

Signed-off-by: Waiman Long 
---
 MAINTAINERS |   1 -
 arch/alpha/include/asm/rwsem.h  | 211 ---
 arch/arm/include/asm/Kbuild |   1 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/include/asm/rwsem.h   | 172 -
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/s390/include/asm/Kbuild|   1 -
 arch/sh/include/asm/Kbuild  |   1 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/x86/include/asm/rwsem.h| 237 
 arch/x86/lib/Makefile   |   1 -
 arch/x86/lib/rwsem.S| 156 --
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h | 140 
 include/linux/rwsem.h   |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem.h  | 130 ++
 18 files changed, 133 insertions(+), 929 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 9919840..053f536 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8926,7 +8926,6 @@ F:arch/*/include/asm/spinlock*.h
 F: include/linux/rwlock*.h
 F: include/linux/mutex*.h
 F: include/linux/rwsem*.h
-F: arch/*/include/asm/rwsem.h
 F: include/linux/seqlock.h
 F: lib/locking*.[ch]
 F: kernel/locking/
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
deleted file mode 100644
index cf8fc8f9..000
--- a/arch/alpha/include/asm/rwsem.h
+++ /dev/null
@@ -1,211 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky , 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include 
-
-#define RWSEM_UNLOCKED_VALUE   0xL
-#define RWSEM_ACTIVE_BIAS  0x0001L
-#define RWSEM_ACTIVE_MASK  0xL
-#define RWSEM_WAITING_BIAS (-0x0001L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS(RWSEM_WAITING_BIAS + 
RWSEM_ACTIVE_BIAS)
-
-static inline int ___down_read(struct rw_semaphore *sem)
-{
-   long oldcount;
-#ifndefCONFIG_SMP
-   oldcount = sem->count.counter;
-   sem->count.counter += RWSEM_ACTIVE_READ_BIAS;
-#else
-   long temp;
-   __asm__ __volatile__(
-   "1: ldq_l   %0,%1\n"
-