Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-19 Thread Wanpeng Li
2016-09-20 2:08 GMT+08:00 Joonwoo Park :
> On Sun, Sep 18, 2016 at 07:28:32AM +0800, Wanpeng Li wrote:
>> 2016-09-17 9:28 GMT+08:00 Joonwoo Park :
>> > From: Srivatsa Vaddagiri 
>> >
>> > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
>> > (just when they would have exceeded their ideal_runtime). It makes use
>> > of a per-cpu hrtimer resource, and hence the arming of that hrtimer should
>> > be based on the total number of SCHED_FAIR tasks a CPU has across its
>> > various cfs_rqs, rather than on the number of tasks in a particular cfs_rq
>> > (as implemented currently). As a result, with the current code, it's possible for
>> > a running task (which is the sole task in its cfs_rq) to be preempted
>>
>> not be preempted much, right?
>
> I don't think so.
> By saying 'to be preempted much after its ideal_runtime has elapsed' I
> wanted to describe the current suboptimal behaviour.

I was also describing the current suboptimal behaviour. A running task
(which is the sole task in its cfs_rq), as in the test case 2 you
mentioned in another reply, is only preempted after 10ms by the scheduler
tick, whereas in test case 1 it is preempted after 1.5ms when the hrtick
fires. That's what I meant by "not be preempted much" for test case 2.

Regards,
Wanpeng Li


Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-19 Thread Joonwoo Park
On Mon, Sep 19, 2016 at 11:04:49AM -0700, Joonwoo Park wrote:
> On Mon, Sep 19, 2016 at 10:21:58AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 16, 2016 at 06:28:51PM -0700, Joonwoo Park wrote:
> > > From: Srivatsa Vaddagiri 
> > > 
> > > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> > 
> > Right, but I always found the overhead of the thing too high to be
> > really useful.
> > 
> > How come you're using this?
> 
> This patch has been in our internal tree for ages, so unfortunately I cannot
> find the actual use case or history.
> But I guess it was about excessive latency when there are a number of CPU-bound
> tasks running on a CPU but on different cfs_rqs, with CONFIG_HZ = 100.
> 
> Here is how I recreated it:
> 
> * run 4 CPU hogs in the same cgroup [1]:
>  dd-960   [000] d..3   110.651060: sched_switch: prev_comm=dd prev_pid=960 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=959 next_prio=120
>  dd-959   [000] d..3   110.652566: sched_switch: prev_comm=dd prev_pid=959 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=961 next_prio=120
>  dd-961   [000] d..3   110.654072: sched_switch: prev_comm=dd prev_pid=961 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=962 next_prio=120
>  dd-962   [000] d..3   110.655578: sched_switch: prev_comm=dd prev_pid=962 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=960 next_prio=120
>   preempted every 1.5ms slice by the hrtick.
> 
> * run 4 CPU hogs in 4 different cgroups [2]:
>  dd-964   [000] d..3    24.169873: sched_switch: prev_comm=dd prev_pid=964
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=966 next_prio=120
>  dd-966   [000] d..3    24.179873: sched_switch: prev_comm=dd prev_pid=966
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=965 next_prio=120
>  dd-965   [000] d..3    24.189873: sched_switch: prev_comm=dd prev_pid=965
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=967 next_prio=120
>  dd-967   [000] d..3    24.199873: sched_switch: prev_comm=dd prev_pid=967
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=964 next_prio=120
>   preempted every 10ms by the scheduler tick, so all tasks suffer from 40ms
> preemption latency.
> 
> [1] : 
>  dd if=/dev/zero of=/dev/zero &
Ugh..  of=/dev/null instead.

>  dd if=/dev/zero of=/dev/zero &
>  dd if=/dev/zero of=/dev/zero &
>  dd if=/dev/zero of=/dev/zero &
> 
> [2] :
>  mount -t cgroup -o cpu cpu /sys/fs/cgroup
>  mkdir /sys/fs/cgroup/grp1
>  mkdir /sys/fs/cgroup/grp2
>  mkdir /sys/fs/cgroup/grp3
>  mkdir /sys/fs/cgroup/grp4
>  dd if=/dev/zero of=/dev/zero &
>  echo $! > /sys/fs/cgroup/grp1/tasks 
>  dd if=/dev/zero of=/dev/zero &
>  echo $! > /sys/fs/cgroup/grp2/tasks 
>  dd if=/dev/zero of=/dev/zero &
>  echo $! > /sys/fs/cgroup/grp3/tasks 
>  dd if=/dev/zero of=/dev/zero &
>  echo $! > /sys/fs/cgroup/grp4/tasks 
> 
> I could confirm this patch makes the latter behave the same as the former in
> terms of preemption latency.
> 
> > 
> > 
> > >  joonwoop: Do we also need to update or remove the if-statement inside
> > >  hrtick_update()?
> > 
> > >  I guess not, because hrtick_update() doesn't want to start the hrtick when
> > >  the cfs_rq has a large nr_running, where the slice is longer than
> > >  sched_latency.
> > 
> > Right, you want that to match with whatever sched_slice() does.
> 
> Cool.  Thank you!
> 
> Thanks,
> Joonwoo
> 
> > 
> > > +++ b/kernel/sched/fair.c
> > > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> > >  
> > >   WARN_ON(task_rq(p) != rq);
> > >  
> > > - if (cfs_rq->nr_running > 1) {
> > > + if (rq->cfs.h_nr_running > 1) {
> > >   u64 slice = sched_slice(cfs_rq, se);
> > >   u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > >   s64 delta = slice - ran;
> > 
> > Yeah, that looks right. I don't think I've ever tried hrtick with
> > cgroups enabled...


Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-19 Thread Joonwoo Park
On Sun, Sep 18, 2016 at 07:28:32AM +0800, Wanpeng Li wrote:
> 2016-09-17 9:28 GMT+08:00 Joonwoo Park :
> > From: Srivatsa Vaddagiri 
> >
> > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> > (just when they would have exceeded their ideal_runtime). It makes use
> > of a per-cpu hrtimer resource, and hence the arming of that hrtimer should
> > be based on the total number of SCHED_FAIR tasks a CPU has across its
> > various cfs_rqs, rather than on the number of tasks in a particular cfs_rq
> > (as implemented currently). As a result, with the current code, it's possible for
> > a running task (which is the sole task in its cfs_rq) to be preempted
> 
> not be preempted much, right?

I don't think so.
By saying 'to be preempted much after its ideal_runtime has elapsed' I
wanted to describe the current suboptimal behaviour.

Thanks,
Joonwoo

> 
> > much after its ideal_runtime has elapsed, resulting in increased latency
> > for tasks in other cfs_rqs on the same CPU.
> >
> > Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR
> > tasks a CPU has across its various cfs_rqs.
> >
> > Cc: Ingo Molnar 
> > Cc: Peter Zijlstra 
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Srivatsa Vaddagiri 
> > Signed-off-by: Joonwoo Park 
> > ---
> >
> >  joonwoop: Do we also need to update or remove the if-statement inside
> >  hrtick_update()?
> >  I guess not, because hrtick_update() doesn't want to start the hrtick when
> >  the cfs_rq has a large nr_running, where the slice is longer than
> >  sched_latency.
> >
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4088eed..c55c566 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> >
> > WARN_ON(task_rq(p) != rq);
> >
> > -   if (cfs_rq->nr_running > 1) {
> > +   if (rq->cfs.h_nr_running > 1) {
> > u64 slice = sched_slice(cfs_rq, se);
> > u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > s64 delta = slice - ran;
> > --
> > The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> > hosted by The Linux Foundation
> >


Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-19 Thread Joonwoo Park
On Mon, Sep 19, 2016 at 10:21:58AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 16, 2016 at 06:28:51PM -0700, Joonwoo Park wrote:
> > From: Srivatsa Vaddagiri 
> > 
> > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> 
> Right, but I always found the overhead of the thing too high to be
> really useful.
> 
> How come you're using this?

This patch has been in our internal tree for ages, so unfortunately I cannot
find the actual use case or history.
But I guess it was about excessive latency when there are a number of CPU-bound
tasks running on a CPU but on different cfs_rqs, with CONFIG_HZ = 100.

Here is how I recreated it:

* run 4 CPU hogs in the same cgroup [1]:
 dd-960   [000] d..3   110.651060: sched_switch: prev_comm=dd prev_pid=960 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=959 next_prio=120
 dd-959   [000] d..3   110.652566: sched_switch: prev_comm=dd prev_pid=959 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=961 next_prio=120
 dd-961   [000] d..3   110.654072: sched_switch: prev_comm=dd prev_pid=961 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=962 next_prio=120
 dd-962   [000] d..3   110.655578: sched_switch: prev_comm=dd prev_pid=962 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=960 next_prio=120
  preempted every 1.5ms slice by the hrtick.

* run 4 CPU hogs in 4 different cgroups [2]:
 dd-964   [000] d..3    24.169873: sched_switch: prev_comm=dd prev_pid=964
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=966 next_prio=120
 dd-966   [000] d..3    24.179873: sched_switch: prev_comm=dd prev_pid=966
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=965 next_prio=120
 dd-965   [000] d..3    24.189873: sched_switch: prev_comm=dd prev_pid=965
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=967 next_prio=120
 dd-967   [000] d..3    24.199873: sched_switch: prev_comm=dd prev_pid=967
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=964 next_prio=120
  preempted every 10ms by the scheduler tick, so all tasks suffer from 40ms
preemption latency.

[1] : 
 dd if=/dev/zero of=/dev/zero &
 dd if=/dev/zero of=/dev/zero &
 dd if=/dev/zero of=/dev/zero &
 dd if=/dev/zero of=/dev/zero &

[2] :
 mount -t cgroup -o cpu cpu /sys/fs/cgroup
 mkdir /sys/fs/cgroup/grp1
 mkdir /sys/fs/cgroup/grp2
 mkdir /sys/fs/cgroup/grp3
 mkdir /sys/fs/cgroup/grp4
 dd if=/dev/zero of=/dev/zero &
 echo $! > /sys/fs/cgroup/grp1/tasks 
 dd if=/dev/zero of=/dev/zero &
 echo $! > /sys/fs/cgroup/grp2/tasks 
 dd if=/dev/zero of=/dev/zero &
 echo $! > /sys/fs/cgroup/grp3/tasks 
 dd if=/dev/zero of=/dev/zero &
 echo $! > /sys/fs/cgroup/grp4/tasks 

I could confirm this patch makes the latter behave the same as the former in
terms of preemption latency.
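
To illustrate why the one-line change covers the cgroup case: with one hog
per group, each group cfs_rq only ever sees a single runnable task, while
the root rq's hierarchical count still sees all four. A small user-space
toy (hypothetical, not kernel code) of that difference:

#include <stdio.h>

int main(void)
{
	/* One dd hog in each of grp1..grp4: every group cfs_rq has nr_running == 1. */
	unsigned int group_nr_running[4] = { 1, 1, 1, 1 };
	unsigned int h_nr_running = 0;
	int i;

	for (i = 0; i < 4; i++)
		h_nr_running += group_nr_running[i];

	/* The old check looked at one cfs_rq, so "> 1" never fired in test [2]. */
	printf("per-cfs_rq nr_running = %u -> hrtick not armed\n",
	       group_nr_running[0]);
	/* The new check uses the CPU-wide hierarchical count instead. */
	printf("rq->cfs.h_nr_running  = %u -> hrtick armed\n", h_nr_running);
	return 0;
}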

> 
> 
> >  joonwoop: Do we also need to update or remove the if-statement inside
> >  hrtick_update()?
> 
> >  I guess not, because hrtick_update() doesn't want to start the hrtick when
> >  the cfs_rq has a large nr_running, where the slice is longer than
> >  sched_latency.
> 
> Right, you want that to match with whatever sched_slice() does.

Cool.  Thank you!

Thanks,
Joonwoo

> 
> > +++ b/kernel/sched/fair.c
> > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> >  
> > WARN_ON(task_rq(p) != rq);
> >  
> > -   if (cfs_rq->nr_running > 1) {
> > +   if (rq->cfs.h_nr_running > 1) {
> > u64 slice = sched_slice(cfs_rq, se);
> > u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > s64 delta = slice - ran;
> 
> Yeah, that looks right. I don't think I've ever tried hrtick with
> cgroups enabled...


Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-19 Thread Peter Zijlstra
On Fri, Sep 16, 2016 at 06:28:51PM -0700, Joonwoo Park wrote:
> From: Srivatsa Vaddagiri 
> 
> SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot

Right, but I always found the overhead of the thing too high to be
really useful.

How come you're using this?


>  joonwoop: Do we also need to update or remove the if-statement inside
>  hrtick_update()?

>  I guess not, because hrtick_update() doesn't want to start the hrtick when the
>  cfs_rq has a large nr_running, where the slice is longer than sched_latency.

Right, you want that to match with whatever sched_slice() does.
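
For reference, the check being discussed looks roughly like this
(paraphrased from the fair.c of that era, not a verbatim copy):
hrtick_update() only arms the hrtick while nr_running stays below
sched_nr_latency, i.e. while sched_slice() still carves slices out of
sched_latency instead of stretching the period.

/* Paraphrased sketch, kernel context; see the actual fair.c for details. */
static void hrtick_update(struct rq *rq)
{
	struct task_struct *curr = rq->curr;

	if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
		return;

	/* Only worth arming while slices are still derived from sched_latency. */
	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
		hrtick_start_fair(rq, curr);
}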

> +++ b/kernel/sched/fair.c
> @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>  
>   WARN_ON(task_rq(p) != rq);
>  
> - if (cfs_rq->nr_running > 1) {
> + if (rq->cfs.h_nr_running > 1) {
>   u64 slice = sched_slice(cfs_rq, se);
>   u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
>   s64 delta = slice - ran;

Yeah, that looks right. I don't think I've ever tried hrtick with
cgroups enabled...


Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-17 Thread Wanpeng Li
2016-09-17 9:28 GMT+08:00 Joonwoo Park :
> From: Srivatsa Vaddagiri 
>
> SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> (just when they would have exceeded their ideal_runtime). It makes use
> of a per-cpu hrtimer resource, and hence the arming of that hrtimer should
> be based on the total number of SCHED_FAIR tasks a CPU has across its
> various cfs_rqs, rather than on the number of tasks in a particular cfs_rq
> (as implemented currently). As a result, with the current code, it's possible for
> a running task (which is the sole task in its cfs_rq) to be preempted

not be preempted much, right?

> much after its ideal_runtime has elapsed, resulting in increased latency
> for tasks in other cfs_rqs on the same CPU.
>
> Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR
> tasks a CPU has across its various cfs_rqs.
>
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Srivatsa Vaddagiri 
> Signed-off-by: Joonwoo Park 
> ---
>
>  joonwoop: Do we also need to update or remove the if-statement inside
>  hrtick_update()?
>  I guess not, because hrtick_update() doesn't want to start the hrtick when the
>  cfs_rq has a large nr_running, where the slice is longer than sched_latency.
>
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4088eed..c55c566 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>
> WARN_ON(task_rq(p) != rq);
>
> -   if (cfs_rq->nr_running > 1) {
> +   if (rq->cfs.h_nr_running > 1) {
> u64 slice = sched_slice(cfs_rq, se);
> u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> s64 delta = slice - ran;
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> hosted by The Linux Foundation
>


[PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-16 Thread Joonwoo Park
From: Srivatsa Vaddagiri 

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence the arming of that hrtimer should
be based on the total number of SCHED_FAIR tasks a CPU has across its
various cfs_rqs, rather than on the number of tasks in a particular cfs_rq
(as implemented currently). As a result, with the current code, it's possible for
a running task (which is the sole task in its cfs_rq) to be preempted
much after its ideal_runtime has elapsed, resulting in increased latency
for tasks in other cfs_rqs on the same CPU.

Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.

Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Joonwoo Park 
---

 joonwoop: Do we also need to update or remove the if-statement inside
 hrtick_update()?
 I guess not, because hrtick_update() doesn't want to start the hrtick when the
 cfs_rq has a large nr_running, where the slice is longer than sched_latency.

 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4088eed..c55c566 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 
WARN_ON(task_rq(p) != rq);
 
-   if (cfs_rq->nr_running > 1) {
+   if (rq->cfs.h_nr_running > 1) {
u64 slice = sched_slice(cfs_rq, se);
u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
s64 delta = slice - ran;
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation
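
For readers without the tree at hand, an approximate sketch of
hrtick_start_fair() with the change applied, reconstructed from the hunk
above and the fair.c of that era (details outside the hunk may differ
slightly from the actual source):

static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	WARN_ON(task_rq(p) != rq);

	/* Arm only when more than one fair task is runnable anywhere on this CPU. */
	if (rq->cfs.h_nr_running > 1) {
		u64 slice = sched_slice(cfs_rq, se);
		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
		s64 delta = slice - ran;

		if (delta < 0) {
			/* The slice is already used up: reschedule right away. */
			if (rq->curr == p)
				resched_curr(rq);
			return;
		}
		/* Otherwise fire the per-cpu hrtimer when the slice would expire. */
		hrtick_start(rq, delta);
	}
}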