Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-02-18 Thread Will Deacon
On Fri, Feb 15, 2019 at 01:58:34PM -0500, Waiman Long wrote:
> On 02/15/2019 01:40 PM, Will Deacon wrote:
> > On Thu, Feb 14, 2019 at 11:37:15AM +0100, Peter Zijlstra wrote:
> >> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
> >>> v4:
> >>>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
> >>>
> >>> v3:
> >>>  - Optimize __down_read_trylock() for the uncontended case as suggested
> >>>    by Linus.
> >>>
> >>> v2:
> >>>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
> >>>  - Update performance test data in patch 1.
> >>>
> >>> The goal of this patchset is to remove the architecture-specific files
> >>> for rwsem-xadd to make it easier to add enhancements in the later rwsem
> >>> patches. It also removes the legacy rwsem-spinlock.c file and makes all
> >>> the architectures use one single implementation of rwsem - rwsem-xadd.c.
> >>>
> >>> Waiman Long (3):
> >>>   locking/rwsem: Remove arch specific rwsem files
> >>>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
> >>> archs
> >>>   locking/rwsem: Optimize down_read_trylock()
> >> Acked-by: Peter Zijlstra (Intel) 
> >>
> >> with the caveat that I'm happy to exchange patch 3 back to my earlier
> >> suggestion in case Will expresses concerns wrt the ARM64 performance of
> >> Linus' suggestion.
> > Right, the current proposal doesn't work well for us, unfortunately. Which
> > was your earlier suggestion?
> >
> > Will
> 
> In my posting yesterday, I showed that most of the trylocks done were
> actually uncontended. Assuming that pattern holds for most of the
> workloads, it will not be that bad after all.

That's fair enough; if you're going to sit in a tight trylock() loop like the
benchmark does, then you're much better off just calling lock() if you care
at all about scalability.

Will


Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-02-15 Thread Will Deacon
On Thu, Feb 14, 2019 at 11:37:15AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
> > v4:
> >  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
> > 
> > v3:
> >  - Optimize __down_read_trylock() for the uncontended case as suggested
> >    by Linus.
> > 
> > v2:
> >  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
> >  - Update performance test data in patch 1.
> > 
> > The goal of this patchset is to remove the architecture-specific files
> > for rwsem-xadd to make it easier to add enhancements in the later rwsem
> > patches. It also removes the legacy rwsem-spinlock.c file and makes all
> > the architectures use one single implementation of rwsem - rwsem-xadd.c.
> > 
> > Waiman Long (3):
> >   locking/rwsem: Remove arch specific rwsem files
> >   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
> > archs
> >   locking/rwsem: Optimize down_read_trylock()
> 
> Acked-by: Peter Zijlstra (Intel) 
> 
> with the caveat that I'm happy to exchange patch 3 back to my earlier
> suggestion in case Will expresses concerns wrt the ARM64 performance of
> Linus' suggestion.

Right, the current proposal doesn't work well for us, unfortunately. Which
was your earlier suggestion?

Will


Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-02-15 Thread Waiman Long
On 02/15/2019 01:40 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:37:15AM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
>>> v4:
>>>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
>>>
>>> v3:
>>>  - Optimize __down_read_trylock() for the uncontended case as suggested
>>>    by Linus.
>>>
>>> v2:
>>>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
>>>  - Update performance test data in patch 1.
>>>
>>> The goal of this patchset is to remove the architecture-specific files
>>> for rwsem-xadd to make it easier to add enhancements in the later rwsem
>>> patches. It also removes the legacy rwsem-spinlock.c file and makes all
>>> the architectures use one single implementation of rwsem - rwsem-xadd.c.
>>>
>>> Waiman Long (3):
>>>   locking/rwsem: Remove arch specific rwsem files
>>>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
>>> archs
>>>   locking/rwsem: Optimize down_read_trylock()
>> Acked-by: Peter Zijlstra (Intel) 
>>
>> with the caveat that I'm happy to exchange patch 3 back to my earlier
>> suggestion in case Will expresses concerns wrt the ARM64 performance of
>> Linus' suggestion.
> Right, the current proposal doesn't work well for us, unfortunately. Which
> was your earlier suggestion?
>
> Will

In my posting yesterday, I showed that most of the trylocks done were
actually uncontended. Assuming that pattern holds for most of the
workloads, it will not be that bad after all.

-Longman



Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-02-14 Thread Waiman Long
On 02/14/2019 05:37 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
>> v4:
>>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
>>
>> v3:
>>  - Optimize __down_read_trylock() for the uncontended case as suggested
>>    by Linus.
>>
>> v2:
>>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
>>  - Update performance test data in patch 1.
>>
>> The goal of this patchset is to remove the architecture-specific files
>> for rwsem-xadd to make it easier to add enhancements in the later rwsem
>> patches. It also removes the legacy rwsem-spinlock.c file and makes all
>> the architectures use one single implementation of rwsem - rwsem-xadd.c.
>>
>> Waiman Long (3):
>>   locking/rwsem: Remove arch specific rwsem files
>>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
>> archs
>>   locking/rwsem: Optimize down_read_trylock()
> Acked-by: Peter Zijlstra (Intel) 
>
> with the caveat that I'm happy to exchange patch 3 back to my earlier
> suggestion in case Will expresses concerns wrt the ARM64 performance of
> Linus' suggestion.

I inserted a few lock event counters into the rwsem trylock code:

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
        /*
         * Optimize for the case when the rwsem is not locked at all.
         */
        long tmp = RWSEM_UNLOCKED_VALUE;

        lockevent_inc(rwsem_rtrylock);
        do {
                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                                        tmp + RWSEM_ACTIVE_READ_BIAS)) {
                        rwsem_set_reader_owned(sem);
                        return 1;
                }
                lockevent_inc(rwsem_rtrylock_retry);
        } while (tmp >= 0);
        lockevent_inc(rwsem_rtrylock_fail);
        return 0;
}

static inline int __down_write_trylock(struct rw_semaphore *sem)
{
        long tmp;

        lockevent_inc(rwsem_wtrylock);
        tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
                                          RWSEM_ACTIVE_WRITE_BIAS);
        if (tmp == RWSEM_UNLOCKED_VALUE) {
                rwsem_set_owner(sem);
                return true;
        }
        lockevent_inc(rwsem_wtrylock_fail);
        return false;
}
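For anyone who wants to poke at the fast-path shape outside the kernel, the
uncontended-first loop above can be modeled in userspace C11 (the names and
bias values below are made-up stand-ins for illustration, not the kernel's):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's rwsem bias constants. */
#define TOY_UNLOCKED    0L
#define TOY_READ_BIAS   1L
#define TOY_WRITE_BIAS  (-65536L)

static atomic_long toy_count = TOY_UNLOCKED;

/*
 * Mirrors the shape of __down_read_trylock() above: optimistically
 * assume the count is unlocked, and only retry while the observed
 * value still permits readers (count >= 0).
 */
static bool toy_down_read_trylock(void)
{
        long tmp = TOY_UNLOCKED;

        do {
                if (atomic_compare_exchange_strong(&toy_count, &tmp,
                                                   tmp + TOY_READ_BIAS))
                        return true;
                /* tmp now holds the count value we actually observed */
        } while (tmp >= 0);
        return false;
}
```

In the uncontended case this is a single cmpxchg, which is exactly why the
lock event counters below matter: they show how often that first attempt is
all that runs.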

I booted the new kernel on a 4-socket 56-core 112-thread Broadwell
system. The counter values were as follows:

1) After bootup:

rwsem_rtrylock=784029
rwsem_rtrylock_fail=59
rwsem_rtrylock_retry=394
rwsem_wtrylock=18284
rwsem_wtrylock_fail=230

2) After parallel kernel build (-j112):

rwsem_rtrylock=338667559
rwsem_rtrylock_fail=18
rwsem_rtrylock_retry=51
rwsem_wtrylock=17016332
rwsem_wtrylock_fail=98058

At least for these two use cases, try-for-ownership as suggested by
Linus is the right choice.
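For what it's worth, the failure rates those build-run counters imply can be
computed directly (plain arithmetic on the numbers above; the identifiers
here are mine):

```c
/* Counter values copied from the parallel kernel build (-j112) run above. */
static const long build_rtry = 338667559, build_rfail = 18;
static const long build_wtry = 17016332, build_wfail = 98058;

/* Percentage of trylock attempts that fell off the uncontended fast path. */
static double fail_pct(long fails, long tries)
{
        return 100.0 * (double)fails / (double)tries;
}

/*
 * fail_pct(build_rfail, build_rtry) is roughly 0.000005% and
 * fail_pct(build_wfail, build_wtry) is roughly 0.58%, i.e. the single
 * optimistic cmpxchg succeeds almost every time for readers.
 */
```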

Cheers,
Longman



Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-02-14 Thread Peter Zijlstra
On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
> v4:
>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
> 
> v3:
>  - Optimize __down_read_trylock() for the uncontended case as suggested
>    by Linus.
> 
> v2:
>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
>  - Update performance test data in patch 1.
> 
> The goal of this patchset is to remove the architecture-specific files
> for rwsem-xadd to make it easier to add enhancements in the later rwsem
> patches. It also removes the legacy rwsem-spinlock.c file and makes all
> the architectures use one single implementation of rwsem - rwsem-xadd.c.
> 
> Waiman Long (3):
>   locking/rwsem: Remove arch specific rwsem files
>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
> archs
>   locking/rwsem: Optimize down_read_trylock()

Acked-by: Peter Zijlstra (Intel) 

with the caveat that I'm happy to exchange patch 3 back to my earlier
suggestion in case Will expresses concerns wrt the ARM64 performance of
Linus' suggestion.