Re: [v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-31 Thread Guenter Roeck

On 08/31/2016 06:41 AM, Guenter Roeck wrote:

On 08/31/2016 01:09 AM, Peter Zijlstra wrote:

On Tue, Aug 30, 2016 at 10:21:02PM -0700, Guenter Roeck wrote:

Peter,

The call to rcu_sync_is_idle() causes the following build error when building
x86_64:allmodconfig.

ERROR: "rcu_sync_lockdep_assert" [kernel/locking/locktorture.ko] undefined!
ERROR: "rcu_sync_lockdep_assert" [fs/ext4/ext4.ko] undefined!

I think this was also reported by the 0-day build bot.

The simple fix would of course be to export rcu_sync_lockdep_assert. Before I
apply that change to the Android code (where the patch has been applied and
the problem is seen) - do you by any chance have a better solution in mind?


Nope, that's exactly what Ingo did when he ran into this. If you look at
commit 80127a39681b ("locking/percpu-rwsem: Optimize readers and reduce
global impact"), you'll notice the last hunk:


Ah, I missed that, and didn't realize it was already upstream. Thanks a lot!



Wait, it isn't ... I must have fetched a branch with 80127a39681b. Ignore the
noise I am making until I've had another coffee.

Guenter



Re: [v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-31 Thread Guenter Roeck

On 08/31/2016 01:09 AM, Peter Zijlstra wrote:

On Tue, Aug 30, 2016 at 10:21:02PM -0700, Guenter Roeck wrote:

Peter,

The call to rcu_sync_is_idle() causes the following build error when building
x86_64:allmodconfig.

ERROR: "rcu_sync_lockdep_assert" [kernel/locking/locktorture.ko] undefined!
ERROR: "rcu_sync_lockdep_assert" [fs/ext4/ext4.ko] undefined!

I think this was also reported by the 0-day build bot.

The simple fix would of course be to export rcu_sync_lockdep_assert. Before I
apply that change to the Android code (where the patch has been applied and
the problem is seen) - do you by any chance have a better solution in mind?


Nope, that's exactly what Ingo did when he ran into this. If you look at
commit 80127a39681b ("locking/percpu-rwsem: Optimize readers and reduce
global impact"), you'll notice the last hunk:


Ah, I missed that, and didn't realize it was already upstream. Thanks a lot!

Guenter



Re: [v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-31 Thread Peter Zijlstra
On Tue, Aug 30, 2016 at 10:21:02PM -0700, Guenter Roeck wrote:
> Peter,
> 
> The call to rcu_sync_is_idle() causes the following build error when building
> x86_64:allmodconfig.
> 
> ERROR: "rcu_sync_lockdep_assert" [kernel/locking/locktorture.ko] undefined!
> ERROR: "rcu_sync_lockdep_assert" [fs/ext4/ext4.ko] undefined!
> 
> I think this was also reported by the 0-day build bot.
> 
> The simple fix would of course be to export rcu_sync_lockdep_assert. Before I
> apply that change to the Android code (where the patch has been applied and
> the problem is seen) - do you by any chance have a better solution in mind?

Nope, that's exactly what Ingo did when he ran into this. If you look at
commit 80127a39681b ("locking/percpu-rwsem: Optimize readers and reduce
global impact"), you'll notice the last hunk:


diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index be922c9f3d37..198473d90f81 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -68,6 +68,8 @@ void rcu_sync_lockdep_assert(struct rcu_sync *rsp)
RCU_LOCKDEP_WARN(!gp_ops[rsp->gp_type].held(),
 "suspicious rcu_sync_is_idle() usage");
 }
+
+EXPORT_SYMBOL_GPL(rcu_sync_lockdep_assert);
 #endif
 
 /**
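
For readers hitting the same thing: rcu_sync_is_idle() is a static inline, so
every caller - including modules like ext4 and locktorture - compiles in a
direct call to rcu_sync_lockdep_assert() when CONFIG_PROVE_RCU is set, which is
why the export is needed. Roughly, as a sketch abridged from the v4.8-era
include/linux/rcu_sync.h:

/* Sketch, abridged: the inline below is what pulls the assertion helper
 * into modular code, so it stays unresolved in .ko files without an
 * EXPORT_SYMBOL_GPL(). */
#ifdef CONFIG_PROVE_RCU
extern void rcu_sync_lockdep_assert(struct rcu_sync *);
#endif

static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
{
#ifdef CONFIG_PROVE_RCU
	rcu_sync_lockdep_assert(rsp);
#endif
	return !rsp->gp_state; /* GP_IDLE */
}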


Re: [v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-30 Thread Guenter Roeck
Peter,

On Tue, Aug 09, 2016 at 11:51:12AM +0200, Peter Zijlstra wrote:
> Currently the percpu-rwsem switches to (global) atomic ops while a
> writer is waiting; which could be quite a while and slows down
> releasing the readers.
> 
> This patch cures this problem by ordering the reader-state vs
> reader-count (see the comments in __percpu_down_read() and
> percpu_down_write()). This changes a global atomic op into a full
> memory barrier, which doesn't have the global cacheline contention.
> 
> This also enables using the percpu-rwsem with rcu_sync disabled in order
> to bias the implementation differently, reducing the writer latency by
> adding some cost to readers.
> 
> Cc: Paul McKenney 
> Reviewed-by: Oleg Nesterov 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/percpu-rwsem.h  |   84 +--
>  kernel/locking/percpu-rwsem.c |  228 --
>  2 files changed, 206 insertions(+), 106 deletions(-)
> 
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -10,30 +10,96 @@
>  
>  struct percpu_rw_semaphore {
>   struct rcu_sync rss;
> - unsigned int __percpu   *fast_read_ctr;
> + unsigned int __percpu   *read_count;
>   struct rw_semaphore rw_sem;
> - atomic_tslow_read_ctr;
> - wait_queue_head_t   write_waitq;
> + wait_queue_head_t   writer;
> + int readers_block;
>  };
>  
> -extern void percpu_down_read(struct percpu_rw_semaphore *);
> -extern int  percpu_down_read_trylock(struct percpu_rw_semaphore *);
> -extern void percpu_up_read(struct percpu_rw_semaphore *);
> +extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
> +extern void __percpu_up_read(struct percpu_rw_semaphore *);
> +
> +static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
> +{
> + might_sleep();
> +
> + rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_);
> +
> + preempt_disable();
> + /*
> +  * We are in an RCU-sched read-side critical section, so the writer
> +  * cannot both change sem->state from readers_fast and start checking
> +  * counters while we are here. So if we see !sem->state, we know that
> +  * the writer won't be checking until we're past the preempt_enable()
> +  * and that once the synchronize_sched() is done, the writer will see
> +  * anything we did within this RCU-sched read-side critical section.
> +  */
> + __this_cpu_inc(*sem->read_count);
> + if (unlikely(!rcu_sync_is_idle(&sem->rss)))

The call to rcu_sync_is_idle() causes the following build error when building
x86_64:allmodconfig.

ERROR: "rcu_sync_lockdep_assert" [kernel/locking/locktorture.ko] undefined!
ERROR: "rcu_sync_lockdep_assert" [fs/ext4/ext4.ko] undefined!

I think this was also reported by the 0-day build bot.

The simple fix would of course be to export rcu_sync_lockdep_assert. Before I
apply that change to the Android code (where the patch has been applied and
the problem is seen) - do you by any chance have a better solution in mind?

Thanks,
Guenter


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-26 Thread Om Dhyade



On 8/26/2016 9:47 AM, Dmitry Shmidt wrote:

On Fri, Aug 26, 2016 at 5:51 AM, Tejun Heo  wrote:

Hello, John.

On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:

Hey! Good news. This patch along with Peter's locking changes pushes
the latencies down to an apparently acceptable level!


Ah, that's good to hear.  Please feel free to ping me if you guys
wanna talk about cgroup usage in android.


Thanks a lot for all your help resolving this issue!


Thank you for the help.

Thanks!

--
tejun


--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-26 Thread Dmitry Shmidt
On Fri, Aug 26, 2016 at 5:51 AM, Tejun Heo  wrote:
> Hello, John.
>
> On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:
>> Hey! Good news. This patch along with Peter's locking changes pushes
>> the latencies down to an apparently acceptable level!
>
> Ah, that's good to hear.  Please feel free to ping me if you guys
> wanna talk about cgroup usage in android.

Thanks a lot for all your help resolving this issue!

> Thanks!
>
> --
> tejun


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-26 Thread Tejun Heo
Hello, John.

On Thu, Aug 25, 2016 at 07:14:07PM -0700, John Stultz wrote:
> Hey! Good news. This patch along with Peter's locking changes pushes
> the latencies down to an apparently acceptable level!

Ah, that's good to hear.  Please feel free to ping me if you guys
wanna talk about cgroup usage in android.

Thanks!

-- 
tejun


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-25 Thread John Stultz
On Wed, Aug 24, 2016 at 2:30 PM, Tejun Heo  wrote:
> Hello, John.
>
> On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
>> Hey Peter, Tejun, Oleg,
>>   So while your tweaks for the percpu-rwsem have greatly helped the
>> regression folks were seeing (many thanks, by the way), as noted
>> above, the performance regression with the global lock compared to
>> earlier kernels is still ~3x slower (though again, much better than
>> the 80x slower that was seen earlier).
>>
>> So I was wondering if patches to go back to the per signal_struct
>> locking would still be considered? Or is the global lock approach the
>> only way forward?
>
> We can't simply revert but we can make the lock per signal_struct
> again.  It's just that it'd be quite a bit more complex (but, again,
> if we need it...) and for cases where migrations aren't as frequent
> percpu-rwsem would be at least a bit lower overhead.  Can you please
> test with the following patch applied just in case?
>
>  
> https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440


Hey! Good news. This patch along with Peter's locking changes pushes
the latencies down to an apparently acceptable level!

Many thanks for the pointer!

thanks
-john


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-24 Thread John Stultz
On Wed, Aug 24, 2016 at 2:30 PM, Tejun Heo  wrote:
> On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
>>
>> So I was wondering if patches to go back to the per signal_struct
>> locking would still be considered? Or is the global lock approach the
>> only way forward?
>
> We can't simply revert but we can make the lock per signal_struct
> again.  It's just that it'd be quite a bit more complex (but, again,
> if we need it...) and for cases where migrations aren't as frequent
> percpu-rwsem would be at least a bit lower overhead.  Can you please
> test with the following patch applied just in case?
>
>  
> https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440

That does provide a reasonable improvement in my testing!  But, I'll
pass it along to folks who are doing more in-depth performance
analysis to make sure.

If that doesn't work, I'll talk w/ Dmitry about submitting his patch
reworking things back to the per signal_struct locking.

>> At a higher level, I'm worried that Android's use of cgroups as a
>> priority enforcement mechanism is at odds with developers focusing on
>> it as a container enforcement mechanism, as in the latter it's not
>> common for tasks to change between cgroups, but with the former
>> priority adjustments are quite common.
>
> It has been at odds as long as android has existed.  cgroup used to have
> synchronous synchronize_rcu() in the migration path, which the android
> kernel simply deleted (didn't break android's use case), re-labeling
> every page's memcg ownership on foreground and background switches
> (does it still do that? it's simply unworkable) and so on.

Yea. Android folks can correct me, but I think the memcg
foreground/background bits have been dropped. They still use an apps
memcg group for low-memory notification in their
userland low-memory-killer daemon (however, my understanding is that it
has yet to sufficiently replace the in-kernel low-memory-killer).

> I find it difficult to believe that android's requirements can be
> satisfied only through moving processes around.  cgroup usage model is
> affected by container use cases but isn't limited to it.  If android
> is willing to change, I'd be more than happy to work with you and
> solve obstacles.  If not, we'll surely try not to break it anymore
> than it has always been broken.

I think folks are open to ideas, but going from idea-from-the-list to
something reliable enough to ship is always a bit fraught, so it often
takes some time and effort to really validate if those ideas are
workable.

I have on my list to re-submit the can_attach logic Android uses to
provide permission checks on processes migrating other tasks between
cgroups, so at least we can try to reduce the separate domains of
focus and get folks sharing the same code bases.

thanks
-john


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-24 Thread Tejun Heo
Hello, John.

On Wed, Aug 24, 2016 at 02:16:52PM -0700, John Stultz wrote:
> Hey Peter, Tejun, Oleg,
>   So while your tweaks for the percpu-rwsem have greatly helped the
> regression folks were seeing (many thanks, by the way), as noted
> above, the performance regression with the global lock compared to
> earlier kernels is still ~3x slower (though again, much better than
> the 80x slower that was seen earlier).
> 
> So I was wondering if patches to go back to the per signal_struct
> locking would still be considered? Or is the global lock approach the
> only way forward?

We can't simply revert but we can make the lock per signal_struct
again.  It's just that it'd be quite a bit more complex (but, again,
if we need it...) and for cases where migrations aren't as frequent
percpu-rwsem would be at least a bit lower overhead.  Can you please
test with the following patch applied just in case?

 
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=for-4.8-fixes&id=568ac888215c7fb2fabe8ea739b00ec3c1f5d440

> At a higher level, I'm worried that Android's use of cgroups as a
> priority enforcement mechanism is at odds with developers focusing on
> it as a container enforcement mechanism, as in the latter it's not
> common for tasks to change between cgroups, but with the former
> priority adjustments are quite common.

It has been at odds as long as android has existed.  cgroup used to have
synchronous synchronize_rcu() in the migration path, which the android
kernel simply deleted (didn't break android's use case), re-labeling
every page's memcg ownership on foreground and background switches
(does it still do that? it's simply unworkable) and so on.

I find it difficult to believe that android's requirements can be
satisfied only through moving processes around.  cgroup usage model is
affected by container use cases but isn't limited to it.  If android
is willing to change, I'd be more than happy to work with you and
solve obstacles.  If not, we'll surely try not to break it anymore
than it has always been broken.

Thanks.

-- 
tejun


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-24 Thread John Stultz
On Fri, Aug 12, 2016 at 6:44 PM, Om Dhyade  wrote:
> Update from my tests:
> Use-case: Android application launches.
>
> I tested the patches on android N build, i see max latency ~7ms.
> In my tests, the wait is due to: copy_process(fork.c) blocks all threads in
> __cgroup_procs_write including threads which are not part of the forking
> process's thread-group.
>
> Dmitry had provided a hack patch which reverts to the per-process rw-sem,
> which had max latency of ~2ms.
>
> android user-space binder library does 2 cgroup write operations per
> transaction, apart from the copy_process(fork.c) wait, i see pre-emption in
> _cgroup_procs_write causing waits.


Hey Peter, Tejun, Oleg,
  So while your tweaks for the percpu-rwsem have greatly helped the
regression folks were seeing (many thanks, by the way), as noted
above, the performance regression with the global lock compared to
earlier kernels is still ~3x slower (though again, much better than
the 80x slower that was seen earlier).

So I was wondering if patches to go back to the per signal_struct
locking would still be considered? Or is the global lock approach the
only way forward?

At a higher level, I'm worried that Android's use of cgroups as a
priority enforcement mechanism is at odds with developers focusing on
it as a container enforcement mechanism, as in the latter it's not
common for tasks to change between cgroups, but with the former
priority adjustments are quite common.

thanks
-john


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-12 Thread Om Dhyade

Thank you Dmitry for sharing the patches.

Update from my tests:
Use-case: Android application launches.

I tested the patches on an Android N build; I see max latency ~7ms.
In my tests, the wait is due to copy_process(fork.c) blocking all threads
in __cgroup_procs_write, including threads which are not part of the
forking process's thread-group.


Dmitry had provided a hack patch which reverts to the per-process rw-sem,
which had max latency of ~2ms.


The Android user-space binder library does 2 cgroup write operations per
transaction; apart from the copy_process(fork.c) wait, I see preemption
in __cgroup_procs_write causing waits.
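
For concreteness, a rough sketch of what one such userspace cgroup write looks
like; the function name and path handling below are illustrative assumptions,
not taken from the Android sources:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative sketch only: moving a thread between cgroups is a single
 * write of its TID to the destination cgroup's "tasks" file. On the
 * kernel side, each such write funnels into __cgroup_procs_write() under
 * the global cgroup_threadgroup_rwsem, which is why two writes per binder
 * transaction add up. */
static int move_tid_to_cgroup(pid_t tid, const char *cgroup_dir)
{
	char path[256], buf[32];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);
	fd = open(path, O_WRONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	snprintf(buf, sizeof(buf), "%d", (int)tid);
	if (write(fd, buf, strlen(buf)) < 0)
		ret = -1;
	close(fd);
	return ret;
}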


Thanks.

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-10 Thread Peter Zijlstra
On Tue, Aug 09, 2016 at 04:47:38PM -0700, John Stultz wrote:
> On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra  wrote:
> >
> > Currently the percpu-rwsem switches to (global) atomic ops while a
> > writer is waiting; which could be quite a while and slows down
> > releasing the readers.
> >
> > This patch cures this problem by ordering the reader-state vs
> > reader-count (see the comments in __percpu_down_read() and
> > percpu_down_write()). This changes a global atomic op into a full
> > memory barrier, which doesn't have the global cacheline contention.
> >
> > This also enables using the percpu-rwsem with rcu_sync disabled in order
> > to bias the implementation differently, reducing the writer latency by
> > adding some cost to readers.
> 
> So this by itself doesn't help us much, but including the following
> from Oleg does help quite a bit:

Correct, Oleg was going to send his rcu_sync rework on top of this. But
since it's holiday season, things might be a tad delayed.


Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-10 Thread Oleg Nesterov
On 08/10, Peter Zijlstra wrote:
>
> On Tue, Aug 09, 2016 at 04:47:38PM -0700, John Stultz wrote:
> > On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra  wrote:
> > >
> > > Currently the percpu-rwsem switches to (global) atomic ops while a
> > > writer is waiting; which could be quite a while and slows down
> > > releasing the readers.
> > >
> > > This patch cures this problem by ordering the reader-state vs
> > > reader-count (see the comments in __percpu_down_read() and
> > > percpu_down_write()). This changes a global atomic op into a full
> > > memory barrier, which doesn't have the global cacheline contention.
> > >
> > > This also enables using the percpu-rwsem with rcu_sync disabled in order
> > > to bias the implementation differently, reducing the writer latency by
> > > adding some cost to readers.
> >
> > So this by itself doesn't help us much, but including the following
> > from Oleg does help quite a bit:
>
> Correct, Oleg was going to send his rcu_sync rework on top of this. But
> since it's holiday season, things might be a tad delayed.

Yeees. The patch is ready and even seems to work, but I still can't (re)write
the comments. Will try to finish tomorrow.

But. We need something simple and backportable, so I am going to send the
patch below first. As you can see this is your "sabotage" change, just with
the new helper renamed plus the s/!GP_IDLE/GP_PASSED/ fix. And the only reason
I can't send it today is that somehow I can't write the changelog ;)

So I would be really happy if you send this change instead of me; I am going
to keep "From: peterz" anyway.

Oleg.


diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
index a63a33e..ece7ed9 100644
--- a/include/linux/rcu_sync.h
+++ b/include/linux/rcu_sync.h
@@ -59,6 +59,7 @@ static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
 }
 
 extern void rcu_sync_init(struct rcu_sync *, enum rcu_sync_type);
+extern void rcu_sync_enter_start(struct rcu_sync *);
 extern void rcu_sync_enter(struct rcu_sync *);
 extern void rcu_sync_exit(struct rcu_sync *);
 extern void rcu_sync_dtor(struct rcu_sync *);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 75c0ff0..614f9cd 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5609,6 +5609,8 @@ int __init cgroup_init(void)
BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
+   rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
+
get_user_ns(init_cgroup_ns.user_ns);
 
mutex_lock(&cgroup_mutex);
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index be922c9..c9b7bc8 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -83,6 +83,18 @@ void rcu_sync_init(struct rcu_sync *rsp, enum rcu_sync_type type)
 }
 
 /**
+ * Must be called after rcu_sync_init() and before first use.
+ *
+ * Ensures rcu_sync_is_idle() returns false and rcu_sync_{enter,exit}() pairs
+ * turn into NO-OPs.
+ */
+void rcu_sync_enter_start(struct rcu_sync *rsp)
+{
+   rsp->gp_count++;
+   rsp->gp_state = GP_PASSED;
+}
+
+/**
  * rcu_sync_enter() - Force readers onto slowpath
  * @rsp: Pointer to rcu_sync structure to use for synchronization
  *



Re: [PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-09 Thread John Stultz
On Tue, Aug 9, 2016 at 2:51 AM, Peter Zijlstra  wrote:
>
> Currently the percpu-rwsem switches to (global) atomic ops while a
> writer is waiting; which could be quite a while and slows down
> releasing the readers.
>
> This patch cures this problem by ordering the reader-state vs
> reader-count (see the comments in __percpu_down_read() and
> percpu_down_write()). This changes a global atomic op into a full
> memory barrier, which doesn't have the global cacheline contention.
>
> This also enables using the percpu-rwsem with rcu_sync disabled in order
> to bias the implementation differently, reducing the writer latency by
> adding some cost to readers.

So this by itself doesn't help us much, but including the following
from Oleg does help quite a bit:

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index db27804..9e9200b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5394,6 +5394,8 @@ int __init cgroup_init(void)
BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

+   rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
+
mutex_lock(&cgroup_mutex);

/* Add init_css_set to the hash table */


thanks
-john


[PATCH v2] locking/percpu-rwsem: Optimize readers and reduce global impact

2016-08-09 Thread Peter Zijlstra

Currently the percpu-rwsem switches to (global) atomic ops while a
writer is waiting; which could be quite a while and slows down
releasing the readers.

This patch cures this problem by ordering the reader-state vs
reader-count (see the comments in __percpu_down_read() and
percpu_down_write()). This changes a global atomic op into a full
memory barrier, which doesn't have the global cacheline contention.

This also enables using the percpu-rwsem with rcu_sync disabled in order
to bias the implementation differently, reducing the writer latency by
adding some cost to readers.

Cc: Paul McKenney 
Reviewed-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/percpu-rwsem.h  |   84 +--
 kernel/locking/percpu-rwsem.c |  228 --
 2 files changed, 206 insertions(+), 106 deletions(-)

--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -10,30 +10,96 @@
 
 struct percpu_rw_semaphore {
struct rcu_sync rss;
-   unsigned int __percpu   *fast_read_ctr;
+   unsigned int __percpu   *read_count;
struct rw_semaphore rw_sem;
-   atomic_tslow_read_ctr;
-   wait_queue_head_t   write_waitq;
+   wait_queue_head_t   writer;
+   int readers_block;
 };
 
-extern void percpu_down_read(struct percpu_rw_semaphore *);
-extern int  percpu_down_read_trylock(struct percpu_rw_semaphore *);
-extern void percpu_up_read(struct percpu_rw_semaphore *);
+extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
+extern void __percpu_up_read(struct percpu_rw_semaphore *);
+
+static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+{
+   might_sleep();
+
+   rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_);
+
+   preempt_disable();
+   /*
+* We are in an RCU-sched read-side critical section, so the writer
+* cannot both change sem->state from readers_fast and start checking
+* counters while we are here. So if we see !sem->state, we know that
+* the writer won't be checking until we're past the preempt_enable()
+* and that once the synchronize_sched() is done, the writer will see
+* anything we did within this RCU-sched read-side critical section.
+*/
+   __this_cpu_inc(*sem->read_count);
+   if (unlikely(!rcu_sync_is_idle(&sem->rss)))
+   __percpu_down_read(sem, false); /* Unconditional memory barrier */
+   preempt_enable();
+   /*
+* The barrier() from preempt_enable() prevents the compiler from
+* bleeding the critical section out.
+*/
+}
+
+static inline int percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
+{
+   int ret = 1;
+
+   preempt_disable();
+   /*
+* Same as in percpu_down_read().
+*/
+   __this_cpu_inc(*sem->read_count);
+   if (unlikely(!rcu_sync_is_idle(&sem->rss)))
+   ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
+   preempt_enable();
+   /*
+* The barrier() from preempt_enable() prevents the compiler from
+* bleeding the critical section out.
+*/
+
+   if (ret)
+   rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 1, _RET_IP_);
+
+   return ret;
+}
+
+static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
+{
+   /*
+* The barrier() in preempt_disable() prevents the compiler from
+* bleeding the critical section out.
+*/
+   preempt_disable();
+   /*
+* Same as in percpu_down_read().
+*/
+   if (likely(rcu_sync_is_idle(&sem->rss)))
+   __this_cpu_dec(*sem->read_count);
+   else
+   __percpu_up_read(sem); /* Unconditional memory barrier */
+   preempt_enable();
+
+   rwsem_release(&sem->rw_sem.dep_map, 1, _RET_IP_);
+}
 
 extern void percpu_down_write(struct percpu_rw_semaphore *);
 extern void percpu_up_write(struct percpu_rw_semaphore *);
 
 extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
const char *, struct lock_class_key *);
+
 extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
-#define percpu_init_rwsem(brw) \
+#define percpu_init_rwsem(sem) \
 ({ \
static struct lock_class_key rwsem_key; \
-   __percpu_init_rwsem(brw, #brw, &rwsem_key); \
+   __percpu_init_rwsem(sem, #sem, &rwsem_key); \
 })
 
-
 #define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
 
 static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -8,152 +8,186 @@
 #include 
 #include 
 
-int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
+int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
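
For context, a minimal usage sketch of the reader/writer API the patch
optimizes; the semaphore and function names here are illustrative, not from
the patch:

#include <linux/percpu-rwsem.h>

static struct percpu_rw_semaphore my_sem;

static int __init my_init(void)
{
	/* Allocates the per-CPU read_count and sets up the rcu_sync state. */
	return percpu_init_rwsem(&my_sem);
}

static void my_read_path(void)
{
	percpu_down_read(&my_sem);	/* fast path: per-CPU inc, no global atomic */
	/* ... read-side critical section ... */
	percpu_up_read(&my_sem);	/* fast path: per-CPU dec */
}

static void my_write_path(void)
{
	percpu_down_write(&my_sem);	/* flips readers to the slow path, waits them out */
	/* ... writer-exclusive section ... */
	percpu_up_write(&my_sem);
}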