Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks
2016-09-20 2:08 GMT+08:00 Joonwoo Park :
> On Sun, Sep 18, 2016 at 07:28:32AM +0800, Wanpeng Li wrote:
>> 2016-09-17 9:28 GMT+08:00 Joonwoo Park :
>> > From: Srivatsa Vaddagiri
>> >
>> > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
>> > (just when they would have exceeded their ideal_runtime). It makes use
>> > of a per-cpu hrtimer resource, and hence arming that hrtimer should
>> > be based on the total number of SCHED_FAIR tasks a cpu has across its
>> > various cfs_rqs, rather than on the number of tasks in a particular
>> > cfs_rq (as implemented currently). As a result, with the current code,
>> > it's possible for a running task (which is the sole task in its
>> > cfs_rq) to be preempted
>>
>> not be preempted much, right?
>
> I don't think so.
> By saying 'to be preempted much after its ideal_runtime has elapsed' I
> wanted to describe the current suboptimal behaviour.

I also described the current suboptimal behaviour. A running task (which
is the sole task in its cfs_rq), as in test case 2 from your other reply,
is preempted only at the next 10ms scheduler tick, whereas in test case 1
it is preempted after 1.5ms when the hrtick fires. That's what I meant by
"not be preempted much" for test case 2.

Regards,
Wanpeng Li
Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks
On Mon, Sep 19, 2016 at 11:04:49AM -0700, Joonwoo Park wrote:
> On Mon, Sep 19, 2016 at 10:21:58AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 16, 2016 at 06:28:51PM -0700, Joonwoo Park wrote:
> > > From: Srivatsa Vaddagiri
> > >
> > > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> >
> > Right, but I always found the overhead of the thing too high to be
> > really useful.
> >
> > How come you're using this?
>
> This patch has been in our internal tree for a long time, so I
> unfortunately cannot find the actual use case or history.
> But I guess it was about excessive latency when there are a number of
> CPU-bound tasks running on a CPU but on different cfs_rqs and
> CONFIG_HZ = 100.
>
> See how I recreated it:
>
> * run 4 CPU hogs in the same cgroup [1]:
>     dd-960  [000] d..3  110.651060: sched_switch: prev_comm=dd prev_pid=960 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=959 next_prio=120
>     dd-959  [000] d..3  110.652566: sched_switch: prev_comm=dd prev_pid=959 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=961 next_prio=120
>     dd-961  [000] d..3  110.654072: sched_switch: prev_comm=dd prev_pid=961 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=962 next_prio=120
>     dd-962  [000] d..3  110.655578: sched_switch: prev_comm=dd prev_pid=962 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=960 next_prio=120
>
> preempt every 1.5ms slice by hrtick.
>
> * run 4 CPU hogs in 4 different cgroups [2]:
>     dd-964  [000] d..3  24.169873: sched_switch: prev_comm=dd prev_pid=964 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=966 next_prio=120
>     dd-966  [000] d..3  24.179873: sched_switch: prev_comm=dd prev_pid=966 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=965 next_prio=120
>     dd-965  [000] d..3  24.189873: sched_switch: prev_comm=dd prev_pid=965 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=967 next_prio=120
>     dd-967  [000] d..3  24.199873: sched_switch: prev_comm=dd prev_pid=967 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=964 next_prio=120
>
> preempt every 10ms by scheduler tick, so that all tasks suffer up to
> 40ms of preemption latency.
>
> [1] :
> dd if=/dev/zero of=/dev/zero &

Ugh.. of=/dev/null instead.

> dd if=/dev/zero of=/dev/zero &
> dd if=/dev/zero of=/dev/zero &
> dd if=/dev/zero of=/dev/zero &
>
> [2] :
> mount -t cgroup -o cpu cpu /sys/fs/cgroup
> mkdir /sys/fs/cgroup/grp1
> mkdir /sys/fs/cgroup/grp2
> mkdir /sys/fs/cgroup/grp3
> mkdir /sys/fs/cgroup/grp4
> dd if=/dev/zero of=/dev/zero &
> echo $! > /sys/fs/cgroup/grp1/tasks
> dd if=/dev/zero of=/dev/zero &
> echo $! > /sys/fs/cgroup/grp2/tasks
> dd if=/dev/zero of=/dev/zero &
> echo $! > /sys/fs/cgroup/grp3/tasks
> dd if=/dev/zero of=/dev/zero &
> echo $! > /sys/fs/cgroup/grp4/tasks
>
> I could confirm this patch makes the latter behave the same as the
> former in terms of preemption latency.
>
> > > joonwoop: Do we also need to update or remove the if-statement inside
> > > hrtick_update()?
> > > I guess not, because hrtick_update() doesn't want to start the hrtick
> > > when the cfs_rq has a large nr_running where the slice is longer than
> > > sched_latency.
> >
> > Right, you want that to match with whatever sched_slice() does.
>
> Cool. Thank you!
>
> Thanks,
> Joonwoo
>
> > > +++ b/kernel/sched/fair.c
> > > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> > >
> > >  	WARN_ON(task_rq(p) != rq);
> > >
> > > -	if (cfs_rq->nr_running > 1) {
> > > +	if (rq->cfs.h_nr_running > 1) {
> > >  		u64 slice = sched_slice(cfs_rq, se);
> > >  		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > >  		s64 delta = slice - ran;
> >
> > Yeah, that looks right. I don't think I've ever tried hrtick with
> > cgroups enabled...
Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks
On Sun, Sep 18, 2016 at 07:28:32AM +0800, Wanpeng Li wrote:
> 2016-09-17 9:28 GMT+08:00 Joonwoo Park :
> > From: Srivatsa Vaddagiri
> >
> > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> > (just when they would have exceeded their ideal_runtime). It makes use
> > of a per-cpu hrtimer resource, and hence arming that hrtimer should
> > be based on the total number of SCHED_FAIR tasks a cpu has across its
> > various cfs_rqs, rather than on the number of tasks in a particular
> > cfs_rq (as implemented currently). As a result, with the current code,
> > it's possible for a running task (which is the sole task in its
> > cfs_rq) to be preempted
>
> not be preempted much, right?

I don't think so.
By saying 'to be preempted much after its ideal_runtime has elapsed' I
wanted to describe the current suboptimal behaviour.

Thanks,
Joonwoo

> > much after its ideal_runtime has elapsed, resulting in increased
> > latency for tasks in other cfs_rqs on the same cpu.
> >
> > Fix this by arming the sched hrtimer based on the total number of
> > SCHED_FAIR tasks a CPU has across its various cfs_rqs.
> >
> > Cc: Ingo Molnar
> > Cc: Peter Zijlstra
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Srivatsa Vaddagiri
> > Signed-off-by: Joonwoo Park
> > ---
> >
> > joonwoop: Do we also need to update or remove the if-statement inside
> > hrtick_update()?
> > I guess not, because hrtick_update() doesn't want to start the hrtick
> > when the cfs_rq has a large nr_running where the slice is longer than
> > sched_latency.
> >
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4088eed..c55c566 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> >
> >  	WARN_ON(task_rq(p) != rq);
> >
> > -	if (cfs_rq->nr_running > 1) {
> > +	if (rq->cfs.h_nr_running > 1) {
> >  		u64 slice = sched_slice(cfs_rq, se);
> >  		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> >  		s64 delta = slice - ran;
Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks
On Mon, Sep 19, 2016 at 10:21:58AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 16, 2016 at 06:28:51PM -0700, Joonwoo Park wrote:
> > From: Srivatsa Vaddagiri
> >
> > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
>
> Right, but I always found the overhead of the thing too high to be
> really useful.
>
> How come you're using this?

This patch has been in our internal tree for a long time, so I
unfortunately cannot find the actual use case or history.
But I guess it was about excessive latency when there are a number of
CPU-bound tasks running on a CPU but on different cfs_rqs and
CONFIG_HZ = 100.

See how I recreated it:

* run 4 CPU hogs in the same cgroup [1]:
    dd-960  [000] d..3  110.651060: sched_switch: prev_comm=dd prev_pid=960 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=959 next_prio=120
    dd-959  [000] d..3  110.652566: sched_switch: prev_comm=dd prev_pid=959 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=961 next_prio=120
    dd-961  [000] d..3  110.654072: sched_switch: prev_comm=dd prev_pid=961 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=962 next_prio=120
    dd-962  [000] d..3  110.655578: sched_switch: prev_comm=dd prev_pid=962 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=960 next_prio=120

preempt every 1.5ms slice by hrtick.

* run 4 CPU hogs in 4 different cgroups [2]:
    dd-964  [000] d..3  24.169873: sched_switch: prev_comm=dd prev_pid=964 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=966 next_prio=120
    dd-966  [000] d..3  24.179873: sched_switch: prev_comm=dd prev_pid=966 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=965 next_prio=120
    dd-965  [000] d..3  24.189873: sched_switch: prev_comm=dd prev_pid=965 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=967 next_prio=120
    dd-967  [000] d..3  24.199873: sched_switch: prev_comm=dd prev_pid=967 prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=964 next_prio=120

preempt every 10ms by scheduler tick, so that all tasks suffer up to
40ms of preemption latency.

[1] :
dd if=/dev/zero of=/dev/zero &
dd if=/dev/zero of=/dev/zero &
dd if=/dev/zero of=/dev/zero &
dd if=/dev/zero of=/dev/zero &

[2] :
mount -t cgroup -o cpu cpu /sys/fs/cgroup
mkdir /sys/fs/cgroup/grp1
mkdir /sys/fs/cgroup/grp2
mkdir /sys/fs/cgroup/grp3
mkdir /sys/fs/cgroup/grp4
dd if=/dev/zero of=/dev/zero &
echo $! > /sys/fs/cgroup/grp1/tasks
dd if=/dev/zero of=/dev/zero &
echo $! > /sys/fs/cgroup/grp2/tasks
dd if=/dev/zero of=/dev/zero &
echo $! > /sys/fs/cgroup/grp3/tasks
dd if=/dev/zero of=/dev/zero &
echo $! > /sys/fs/cgroup/grp4/tasks

I could confirm this patch makes the latter behave the same as the
former in terms of preemption latency.

> > joonwoop: Do we also need to update or remove the if-statement inside
> > hrtick_update()?
> > I guess not, because hrtick_update() doesn't want to start the hrtick
> > when the cfs_rq has a large nr_running where the slice is longer than
> > sched_latency.
>
> Right, you want that to match with whatever sched_slice() does.

Cool. Thank you!

Thanks,
Joonwoo

> > +++ b/kernel/sched/fair.c
> > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> >
> >  	WARN_ON(task_rq(p) != rq);
> >
> > -	if (cfs_rq->nr_running > 1) {
> > +	if (rq->cfs.h_nr_running > 1) {
> >  		u64 slice = sched_slice(cfs_rq, se);
> >  		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> >  		s64 delta = slice - ran;
>
> Yeah, that looks right. I don't think I've ever tried hrtick with
> cgroups enabled...
Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks
On Fri, Sep 16, 2016 at 06:28:51PM -0700, Joonwoo Park wrote:
> From: Srivatsa Vaddagiri
>
> SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot

Right, but I always found the overhead of the thing too high to be
really useful.

How come you're using this?

> joonwoop: Do we also need to update or remove the if-statement inside
> hrtick_update()?
> I guess not, because hrtick_update() doesn't want to start the hrtick
> when the cfs_rq has a large nr_running where the slice is longer than
> sched_latency.

Right, you want that to match with whatever sched_slice() does.

> +++ b/kernel/sched/fair.c
> @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>
>  	WARN_ON(task_rq(p) != rq);
>
> -	if (cfs_rq->nr_running > 1) {
> +	if (rq->cfs.h_nr_running > 1) {
>  		u64 slice = sched_slice(cfs_rq, se);
>  		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
>  		s64 delta = slice - ran;

Yeah, that looks right. I don't think I've ever tried hrtick with
cgroups enabled...
Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks
2016-09-17 9:28 GMT+08:00 Joonwoo Park :
> From: Srivatsa Vaddagiri
>
> SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> (just when they would have exceeded their ideal_runtime). It makes use
> of a per-cpu hrtimer resource, and hence arming that hrtimer should
> be based on the total number of SCHED_FAIR tasks a cpu has across its
> various cfs_rqs, rather than on the number of tasks in a particular
> cfs_rq (as implemented currently). As a result, with the current code,
> it's possible for a running task (which is the sole task in its
> cfs_rq) to be preempted

not be preempted much, right?

> much after its ideal_runtime has elapsed, resulting in increased
> latency for tasks in other cfs_rqs on the same cpu.
>
> Fix this by arming the sched hrtimer based on the total number of
> SCHED_FAIR tasks a CPU has across its various cfs_rqs.
>
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Srivatsa Vaddagiri
> Signed-off-by: Joonwoo Park
> ---
>
> joonwoop: Do we also need to update or remove the if-statement inside
> hrtick_update()?
> I guess not, because hrtick_update() doesn't want to start the hrtick
> when the cfs_rq has a large nr_running where the slice is longer than
> sched_latency.
>
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4088eed..c55c566 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>
>  	WARN_ON(task_rq(p) != rq);
>
> -	if (cfs_rq->nr_running > 1) {
> +	if (rq->cfs.h_nr_running > 1) {
>  		u64 slice = sched_slice(cfs_rq, se);
>  		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
>  		s64 delta = slice - ran;
[PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks
From: Srivatsa Vaddagiri

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence arming that hrtimer should
be based on the total number of SCHED_FAIR tasks a cpu has across its
various cfs_rqs, rather than on the number of tasks in a particular
cfs_rq (as implemented currently). As a result, with the current code,
it's possible for a running task (which is the sole task in its cfs_rq)
to be preempted much after its ideal_runtime has elapsed, resulting in
increased latency for tasks in other cfs_rqs on the same cpu.

Fix this by arming the sched hrtimer based on the total number of
SCHED_FAIR tasks a CPU has across its various cfs_rqs.

Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Srivatsa Vaddagiri
Signed-off-by: Joonwoo Park
---

joonwoop: Do we also need to update or remove the if-statement inside
hrtick_update()?
I guess not, because hrtick_update() doesn't want to start the hrtick
when the cfs_rq has a large nr_running where the slice is longer than
sched_latency.

 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4088eed..c55c566 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 
 	WARN_ON(task_rq(p) != rq);
 
-	if (cfs_rq->nr_running > 1) {
+	if (rq->cfs.h_nr_running > 1) {
 		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
 		s64 delta = slice - ran;
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, hosted by The Linux Foundation