[PATCH] selftests: x86: add version check in test_syscall_vdso
Since 4.17, registers r8-r11 are not clobbered/zeroed by a 64-bit kernel
handling a 32-bit syscall, and this behavior is enforced by the
test_syscall_vdso testcase. See commit 8bb2610bc496 ("x86/entry/64/compat:
Preserve r8-r11 in int $0x80"). Permit the old behavior in the testcase
for kernels prior to 4.17.

Signed-off-by: Steve Muckle
---
 .../testing/selftests/x86/test_syscall_vdso.c | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/x86/test_syscall_vdso.c b/tools/testing/selftests/x86/test_syscall_vdso.c
index c9c3281077bc..f7284dc4c32b 100644
--- a/tools/testing/selftests/x86/test_syscall_vdso.c
+++ b/tools/testing/selftests/x86/test_syscall_vdso.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include <sys/utsname.h>
 #include
 #if !defined(__i386__)
@@ -71,6 +72,7 @@ struct regs64 {
 };
 struct regs64 regs64;
 int kernel_is_64bit;
+int clobber_ok;

 asm (
 "	.pushsection .text\n"
@@ -130,6 +132,28 @@ void print_regs64(void)
 	printf("12:%016llx 13:%016llx 14:%016llx 15:%016llx\n",
 		regs64.r12, regs64.r13, regs64.r14, regs64.r15);
 }
+
+static void get_kernel_version(int *version, int *patchlevel)
+{
+	int ret, sublevel;
+	struct utsname utsname;
+
+	ret = uname(&utsname);
+	if (ret) {
+		perror("uname");
+		exit(1);
+	}
+
+	ret = sscanf(utsname.release, "%d.%d.%d", version, patchlevel,
+		     &sublevel);
+	if (ret < 0) {
+		perror("sscanf");
+		exit(1);
+	} else if (ret != 3) {
+		printf("Malformed kernel version %s\n", utsname.release);
+		exit(1);
+	}
+}
+
 int check_regs64(void)
 {
 	int err = 0;
@@ -166,6 +190,8 @@ int check_regs64(void)
 		 * Historically (and probably unintentionally), they
 		 * were clobbered or zeroed.
 		 */
+			if (clobber_ok && *r64 == 0 && num <= 11)
+				continue;
 		}
 		printf("[FAIL]\tR%d has changed:%016llx\n", num, *r64);
 		err++;
@@ -385,6 +411,7 @@ int main(int argc, char **argv, char **envp)
 {
 	int exitcode = 0;
 	int cs;
+	int version, patchlevel;

 	asm("\n"
 	    "	movl	%%cs, %%eax\n"
@@ -394,6 +421,9 @@ int main(int argc, char **argv, char **envp)
 	if (!kernel_is_64bit)
 		printf("[NOTE]\tNot a 64-bit kernel, won't test R8..R15 leaks\n");

+	get_kernel_version(&version, &patchlevel);
+	clobber_ok = version < 4 || (version == 4 && patchlevel < 17);
+
 	/* This only works for non-static builds:
 	 * syscall_addr = dlsym(dlopen("linux-gate.so.1", RTLD_NOW), "__kernel_vsyscall");
 	 */
-- 
2.21.0.352.gf09ad66450-goog
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 09/27/2018 05:43 PM, Wanpeng Li wrote:
>>> On your CPU4:
>>>
>>>   scheduler_ipi()
>>>     -> sched_ttwu_pending()
>>>       -> ttwu_do_activate()  => p->sched_remote_wakeup should be false,
>>>                                 so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED
>>>                                 is not
>>>         -> ttwu_activate()
>>>           -> activate_task()
>>>             -> enqueue_task()
>>>               -> enqueue_task_fair()
>>>                 -> enqueue_entity()
>>>
>>>   bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
>>>
>>> so renorm is false in enqueue_entity(); why did you mention that
>>> cfs_rq->min_vruntime is still added to se->vruntime in
>>> enqueue_task_fair()?
>>
>> Maybe this is a misunderstanding on my side, but didn't you ask me to
>> '... Could you point out when the fair rq's min_vruntime is added to the
>> task's vruntime in your *later* scenario? ...'
>
> Yeah, if the calltrace above and my analysis are correct, then the fair
> rq's min_vruntime will not be added to the task's vruntime in your
> *later* scenario, which means that your patch is not necessary.

In the scenario I observed, the task is not waking - it is running and
being deboosted from priority inheritance, transitioning from RT to CFS.
Dietmar and I both were able to reproduce the issue with the testcase I
posted earlier in this thread.

thanks,
Steve
[tip:sched/core] sched/fair: Fix vruntime_normalized() for remote non-migration wakeup
Commit-ID:  d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe
Gitweb:     https://git.kernel.org/tip/d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe
Author:     Steve Muckle
AuthorDate: Fri, 31 Aug 2018 15:42:17 -0700
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 10:13:47 +0200

sched/fair: Fix vruntime_normalized() for remote non-migration wakeup

When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time.

For example if it is boosted to RT priority via rt_mutex_setprio(),
rq->min_vruntime will not be subtracted from the task's vruntime but
it will be added again when the task returns to the fair class. The
task's vruntime will have been erroneously doubled and the effective
priority of the task will be reduced.

Note this will also lead to inflation of all vruntimes since the
doubled vruntime value will become the rq's min_vruntime when other
tasks leave the rq. This leads to repeated doubling of the vruntime
and priority penalty.

Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().

Based on a similar patch from John Dias.

Suggested-by: Peter Zijlstra
Tested-by: Dietmar Eggemann
Signed-off-by: Steve Muckle
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Redpath
Cc: John Dias
Cc: Linus Torvalds
Cc: Miguel de Dios
Cc: Morten Rasmussen
Cc: Patrick Bellasi
Cc: Paul Turner
Cc: Quentin Perret
Cc: Thomas Gleixner
Cc: Todd Kjos
Cc: kernel-t...@android.com
Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuc...@google.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cff8d55ee95..c6b7d6daab20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9644,7 +9644,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+	if (!se->sum_exec_runtime ||
+	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;

 	return false;
[PATCH] sched/fair: fix vruntime_normalized for remote non-migration wakeup
When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time, for
example if it is boosted to RT priority via rt_mutex_setprio. The rq's
min_vruntime will not be subtracted from the task's vruntime but it
will be added again when the task returns to the fair class. The task's
vruntime will have been erroneously doubled and the effective priority
of the task will be reduced.

Note this will also lead to inflation of all vruntimes since the
doubled vruntime value will become the rq's min_vruntime when other
tasks leave the rq. This leads to repeated doubling of the vruntime
and priority penalty.

Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().

Based on a similar patch from joaod...@google.com.

Suggested-by: Peter Zijlstra
Signed-off-by: Steve Muckle
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..b3b62cf37fb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+	if (!se->sum_exec_runtime ||
+	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;

 	return false;
-- 
2.19.0.rc1.350.ge57e33dbd1-goog
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/29/2018 08:33 AM, Dietmar Eggemann wrote:
> Yes, this solves the issue for the case I described. Using
> 'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using
> 'p->sched_class == &fair_sched_class'.

It's confirmed that this patch solves the original issue we saw (and my
test case passes for me as well). I will send this version out shortly.
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>> When rt_mutex_setprio changes a task's scheduling class to RT, we're
>> seeing cases where the task's vruntime is not updated correctly upon
>> return to the fair class. Specifically, the following is being observed:
>> - task is deactivated while still in the fair class
>> - task is boosted to RT via rt_mutex_setprio, which changes the task
>>   to RT and calls check_class_changed.
>> - check_class_changed leads to detach_task_cfs_rq, at which point the
>>   vruntime_normalized check sees that the task's state is TASK_WAKING,
>>   which results in skipping the subtraction of the rq's min_vruntime
>>   from the task's vruntime
>> - later, when the prio is deboosted and the task is moved back to the
>>   fair class, the fair rq's min_vruntime is added to the task's
>>   vruntime, even though it wasn't subtracted earlier.
>
> I'm thinking that is an incomplete scenario; where do we get to
> TASK_WAKING?

Yes, there's a missing bit of context here: at the beginning, the task
to be boosted had already been put into TASK_WAKING.
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/23/2018 11:54 PM, Juri Lelli wrote:
>> I tried to catch this issue on my Arm64 Juno board using pi_test (and
>> a slightly adapted pip_test (usleep_val = 1500 and keep low as cfs))
>> from rt-tests but wasn't able to do so.
>>
>> # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
>> Starting PI Stress Test
>> Number of thread groups: 1
>> Duration of test run: 1 seconds
>> Number of inversions per group: 1
>> Admin thread SCHED_FIFO priority 4
>> 1 groups of 3 threads will be created
>> High thread SCHED_FIFO priority 3
>> Med thread SCHED_FIFO priority 2
>> Low thread SCHED_OTHER nice 0
>>
>> # ./pip_stress
>>
>> In both cases, the cfs task entering rt_mutex_setprio() is queued, so
>> dequeue_task_fair()->dequeue_entity(), which subtracts
>> cfs_rq->min_vruntime from se->vruntime, is called on it before it gets
>> the rt prio.
>>
>> Maybe it requires a very specific use of the pthread library to
>> provoke this issue by making sure that the cfs task really
>> blocks/sleeps?
>
> Maybe one could play with rt-app to recreate such specific use case?
>
> https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459

This was reproduced for me on tip of mainline by using the program at
the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant
annotated bits of the trace:

low-prio thread vruntime is 752ms

  pi-vruntime-tes-598 [001] d... 520.572459: sched_stat_runtime: comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns]

low-prio thread waits on a_sem

  pi-vruntime-tes-598 [001] d... 520.572465: sched_switch: prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120

high prio thread finishes wakeup, then sleeps for 1ms

  <idle>-0 [000] dNh. 520.572483: sched_wakeup: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
  <idle>-0 [000] d... 520.572486: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597 [000] d... 520.572498: sched_switch: prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120

high prio thread wakes up after 1ms sleep, posts a_sem which starts to
wake low-prio thread, then tries to grab pi_mutex, which low-prio
thread has

  <idle>-0 [000] d.h. 520.573876: sched_waking: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
  <idle>-0 [000] dNh. 520.573879: sched_wakeup: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
  <idle>-0 [000] d... 520.573887: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597 [000] d... 520.573895: sched_waking: comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001

low-prio thread pid 598 gets pi_mutex priority inheritance, this
happens while low-prio thread is in waking state

  pi-vruntime-tes-597 [000] d... 520.573911: sched_pi_setprio: comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19

high-prio thread sleeps on pi_mutex

  pi-vruntime-tes-597 [000] d... 520.573919: sched_switch: prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120

low-prio thread finishes wakeup

  <idle>-0 [001] dNh. 520.573932: sched_wakeup: comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001
  <idle>-0 [001] d... 520.573936: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=598 next_prio=19

low-prio thread releases pi-mutex, loses pi boost, high-prio thread
wakes for pi-mutex

  pi-vruntime-tes-598 [001] d... 520.573946: sched_pi_setprio: comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120
  pi-vruntime-tes-598 [001] dN.. 520.573954: sched_waking: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000

low-prio thread vruntime is 1505ms

  pi-vruntime-tes-598 [001] dN.. 520.573966: sched_stat_runtime: comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns]

The timing is quite sensitive since the task being boosted has to be
caught in the TASK_WAKING state.

The program:

/*
 * Test case for vruntime management during rtmutex priority inheritance
 * promotion and demotion.
 *
 * build with -lpthread
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>

#define ERROR_CHECK(x) \
	if (x) \
		fprintf(stderr, "Error at line %d", __LINE__);

pthread_mutex_t pi_mutex;
sem_t a_sem;
sem_t b_sem;

void *rt_thread_func(void *arg)
{
	int policy;
	int i = 0;
	cpu_set_t cpuset;

	CPU_ZERO(&cpuset);
	CPU_SET(0, &cpuset);
	ERROR_CHECK(pthread_setaffinity_np(pthread_self(),
					   sizeof(cpu_set_t), &cpuset));

	while (i++ < 100) {
		sem_wait(&b_sem);
[PATCH] sched/fair: vruntime should normalize when switching from fair
From: John Dias

When rt_mutex_setprio changes a task's scheduling class to RT, we're
seeing cases where the task's vruntime is not updated correctly upon
return to the fair class. Specifically, the following is being observed:

- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes the task to
  RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point the
  vruntime_normalized check sees that the task's state is TASK_WAKING,
  which results in skipping the subtraction of the rq's min_vruntime
  from the task's vruntime
- later, when the prio is deboosted and the task is moved back to the
  fair class, the fair rq's min_vruntime is added to the task's
  vruntime, even though it wasn't subtracted earlier.

The immediate result is inflation of the task's vruntime, giving it
lower priority (starving it if there's enough available work). The
longer-term effect is inflation of all vruntimes because the task's
vruntime becomes the rq's min_vruntime when the higher priority tasks
go idle. That leads to a vicious cycle, where the vruntime inflation is
repeatedly doubled.

The change here is to detect when vruntime_normalized is being called
when the task is waking but is waking in another class, and to conclude
that this is a case where vruntime has not been normalized.

Signed-off-by: John Dias
Signed-off-by: Steve Muckle
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..14011d7929d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+	if (!se->sum_exec_runtime ||
+	    (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
 		return true;

 	return false;
-- 
2.18.0.865.gffc8e1a3cd6-goog
Re: [RFC] vruntime updated incorrectly when rt_mutex boots prio?
On 08/07/2018 10:40 AM, 'Todd Kjos' via kernel-team wrote: This issue was discovered on a 4.9-based android device, but the relevant mainline code appears to be the same. The symptom is that over time the some workloads become sluggish resulting in missed frames or sluggishness. It appears to be the same issue described in http://lists.infradead.org/pipermail/linux-arm-kernel/2018-March/567836.html. Here is the scenario: A task is deactivated while still in the fair class. The task is then boosted to RT, so rt_mutex_setprio() is called. This changes the task to RT and calls check_class_changed(), which eventually calls detach_task_cfs_rq(), which is where vruntime_normalized() sees that the task's state is TASK_WAKING, which results in skipping the subtraction of the rq's min_vruntime from the task's vruntime. Later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, resulting in vruntime inflation. This was reproduced for me on tip of mainline by using the program at the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant annotated bits of the trace: low-prio thread vruntime is 752ms pi-vruntime-tes-598 [001] d... 520.572459: sched_stat_runtime: comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns] low-prio thread waits on a_sem pi-vruntime-tes-598 [001] d... 520.572465: sched_switch: prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120 high prio thread finishes wakeup, then sleeps for 1ms -0 [000] dNh. 520.572483: sched_wakeup: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000 -0 [000] d... 520.572486: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=597 next_prio=19 pi-vruntime-tes-597 [000] d... 
520.572498: sched_switch: prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120

high-prio thread wakes up after the 1ms sleep, posts a_sem which starts to wake the low-prio thread, then tries to grab pi_mutex, which the low-prio thread holds
  <idle>-0 [000] d.h. 520.573876: sched_waking: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
  <idle>-0 [000] dNh. 520.573879: sched_wakeup: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
  <idle>-0 [000] d... 520.573887: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597 [000] d... 520.573895: sched_waking: comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001

low-prio thread pid 598 gets pi_mutex priority inheritance; this happens while the low-prio thread is in the waking state
  pi-vruntime-tes-597 [000] d... 520.573911: sched_pi_setprio: comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19

high-prio thread sleeps on pi_mutex
  pi-vruntime-tes-597 [000] d... 520.573919: sched_switch: prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120

low-prio thread finishes wakeup
  <idle>-0 [001] dNh. 520.573932: sched_wakeup: comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001
  <idle>-0 [001] d... 520.573936: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=598 next_prio=19

low-prio thread releases pi_mutex, loses the PI boost; high-prio thread wakes for pi_mutex
  pi-vruntime-tes-598 [001] d... 520.573946: sched_pi_setprio: comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120
  pi-vruntime-tes-598 [001] dN.. 520.573954: sched_waking: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000

low-prio thread vruntime is 1505ms
  pi-vruntime-tes-598 [001] dN..
520.573966: sched_stat_runtime: comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns]

The program:

/*
 * Test case for vruntime management during rtmutex priority inheritance
 * promotion and demotion.
 *
 * build with -lpthread
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define ERROR_CHECK(x) \
	if (x) \
		fprintf(stderr, "Error at line %d", __LINE__);

pthread_mutex_t pi_mutex;
sem_t a_sem;
sem_t b_sem;

void *rt_thread_func(void *arg)
{
	int policy;
	int i = 0;
	cpu_set_t cpuset;

	CPU_ZERO(&cpuset);
	CPU_SET(0, &cpuset);
	ERROR_CHECK(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
					   &cpuset));

	while (i++ < 100) {
		sem_wait(&b_sem);
		usleep(1000);
		sem_post(&a_sem);
		pthread_mutex_lock(&pi_mutex);
		pthread_mutex_unlock(&pi_mutex);
	}
}

void *low_prio_thread_func(void *arg)
{
	int i = 0;
	cpu_set_t cpuset;

	CPU_ZERO(&cpuset);
Re: [PATCH 06/24] selftests: exec: return Kselftest Skip code for skipped tests
On 05/04/2018 06:13 PM, Shuah Khan (Samsung OSG) wrote:
> When execveat test is skipped because of unmet dependencies and/or
> unsupported configuration, it exits with error which is treated as
> a fail by the Kselftest framework. This leads to false negative
> result even when the test could not be run.
>
> Change it to return kselftest skip code when a test gets skipped to
> clearly report that the test could not be run. Change it to use
> ksft_exit_skip() when kernel doesn't support execveat.
>
> Signed-off-by: Shuah Khan (Samsung OSG)
> ---
>  tools/testing/selftests/exec/execveat.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
> index 67cd4597db2b..47cbf54d0801 100644
> --- a/tools/testing/selftests/exec/execveat.c
> +++ b/tools/testing/selftests/exec/execveat.c
> @@ -20,6 +20,8 @@
>  #include <sys/types.h>
>  #include <unistd.h>
>
> +#include "../kselftest.h"
> +
>  static char longpath[2 * PATH_MAX] = "";
>  static char *envp[] = { "IN_TEST=yes", NULL, NULL };
>  static char *argv[] = { "execveat", "99", NULL };
> @@ -249,8 +251,8 @@ static int run_tests(void)
>  	errno = 0;
>  	execveat_(-1, NULL, NULL, NULL, 0);
>  	if (errno == ENOSYS) {
> -		printf("[FAIL] ENOSYS calling execveat - no kernel support?\n");
> -		return 1;
> +		ksft_exit_skip(
> +			"ENOSYS calling execveat - no kernel support?\n");
>  	}
>
>  	/* Change file position to confirm it doesn't affect anything */

LGTM.

thanks,
Steve
Re: [PATCH RFC 2/5] sched/fair: Skip frequency update if CPU about to idle
On 10/30/2017 12:02 PM, Joel Fernandes wrote:
> Also, this looks more like a policy decision. Would it be better to put
> that directly into schedutil? Like this:
>
>     if (cpu_idle())
>         "Don't change the freq";
>
> Will something like that work?

I thought about this and I think it won't work very well. In the dequeue path we're still running the task being dequeued, so the CPU is not yet idle. What is needed here IMO is a notion that the CPU is possibly about to idle, and we can predict that from the dequeue path of the CFS class.

Also, just looking at whether the CPU is currently idle or not in the governor doesn't help to differentiate between, say, the dequeue path and the tick path, both of which can occur when the CPU is not idle. Any thoughts about this?

Also, if it really is the case that this bit of policy is universally desirable, I'd think it is better to do this in the scheduler and avoid a needless trip through a fn pointer out to schedutil for performance reasons.
[PATCH v2] selftests/exec: include cwd in long path calculation
When creating a pathname close to PATH_MAX to test execveat, factor in the current working directory path, otherwise we end up with an absolute path that is longer than PATH_MAX. While execveat() may succeed, subsequent calls to the kernel from the runtime environment which are required to successfully execute the test binary/script may fail because of this.

To keep the semantics of the test the same, rework the relative pathname part of the test to be relative to the root directory so it isn't decreased by the length of the current working directory path.

Signed-off-by: Steve Muckle <smuckle.li...@gmail.com>
---
Differences from v1:
 - free allocation from getcwd()
 - update comment in check_execveat_pathmax()

 tools/testing/selftests/exec/execveat.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
index 8d5d1d2ee7c1..67cd4597db2b 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
 }
 
 #define XX_DIR_LEN 200
-static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
 {
 	int fail = 0;
 	int ii, count, len;
@@ -156,20 +156,30 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 
 	if (*longpath == '\0') {
 		/* Create a filename close to PATH_MAX in length */
+		char *cwd = getcwd(NULL, 0);
+
+		if (!cwd) {
+			printf("Failed to getcwd(), errno=%d (%s)\n",
+			       errno, strerror(errno));
+			return 2;
+		}
+		strcpy(longpath, cwd);
+		strcat(longpath, "/");
 		memset(longname, 'x', XX_DIR_LEN - 1);
 		longname[XX_DIR_LEN - 1] = '/';
 		longname[XX_DIR_LEN] = '\0';
-		count = (PATH_MAX - 3) / XX_DIR_LEN;
+		count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
 		for (ii = 0; ii < count; ii++) {
 			strcat(longpath, longname);
 			mkdir(longpath, 0755);
 		}
-		len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+		len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
 		if (len <= 0)
 			len = 1;
 		memset(longname, 'y', len);
 		longname[len] = '\0';
 		strcat(longpath, longname);
+		free(cwd);
 	}
 	exe_cp(src, longpath);
 
@@ -190,7 +200,7 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 	}
 
 	/*
-	 * Execute as a long pathname relative to ".". If this is a script,
+	 * Execute as a long pathname relative to "/". If this is a script,
 	 * the interpreter will launch but fail to open the script because its
 	 * name ("/dev/fd/5/xxx") is bigger than PATH_MAX.
 	 *
@@ -200,10 +210,10 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 	 * the exit status shall be 126."), so allow either.
 	 */
 	if (is_script)
-		fail += check_execveat_invoked_rc(dot_dfd, longpath, 0,
+		fail += check_execveat_invoked_rc(root_dfd, longpath + 1, 0,
 						  127, 126);
 	else
-		fail += check_execveat(dot_dfd, longpath, 0);
+		fail += check_execveat(root_dfd, longpath + 1, 0);
 
 	return fail;
 }
@@ -218,6 +228,7 @@ static int run_tests(void)
 	int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
 					       O_DIRECTORY|O_RDONLY);
 	int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+	int root_dfd = open_or_die("/", O_DIRECTORY|O_RDONLY);
 	int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
 	int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
 	int fd = open_or_die("execveat", O_RDONLY);
@@ -353,8 +364,8 @@ static int run_tests(void)
 	/* Attempt to execute relative to non-directory => ENOTDIR */
 	fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
 
-	fail += check_execveat_pathmax(dot_dfd, "execveat", 0);
-	fail += check_execveat_pathmax(dot_dfd, "script", 1);
+	fail += check_execveat_pathmax(root_dfd, "execveat", 0);
+	fail += check_execveat_pathmax(root_dfd, "script", 1);
 	return fail;
 }
-- 
2.11.0
Re: [PATCH] selftests/exec: include cwd in long path calculation
Thanks David for the review. Replies inline.

On 10/12/2017 05:22 AM, David Drysdale wrote:
> Modulo the minor comment below:
> Reviewed-by: David Drysdale <drysd...@google.com>
>
> On Thu, Oct 12, 2017 at 1:40 AM, Steve Muckle <smuckle.li...@gmail.com> wrote:
>> When creating a pathname close to PATH_MAX to test execveat, factor in
>> the current working directory path otherwise we end up with an absolute
>> path that is longer than PATH_MAX. While execveat() may succeed,
>> subsequent calls to the kernel from the runtime environment which are
>> required to successfully execute the test binary/script may fail
>> because of this.
>>
>> To keep the semantics of the test the same, rework the relative
>> pathname part of the test to be relative to the root directory so it
>> isn't decreased by the length of the current working directory path.
>>
>> Signed-off-by: Steve Muckle <smuckle.li...@gmail.com>
>> ---
>>  tools/testing/selftests/exec/execveat.c | 24 +++++++++++++++++-------
>>  1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
>> index 8d5d1d2ee7c1..5edc609c778b 100644
>> --- a/tools/testing/selftests/exec/execveat.c
>> +++ b/tools/testing/selftests/exec/execveat.c
>> @@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
>>  }
>>
>>  #define XX_DIR_LEN 200
>> -static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
>> +static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
>>  {
>>         int fail = 0;
>>         int ii, count, len;
>> @@ -156,15 +156,24 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
>>
>>         if (*longpath == '\0') {
>>                 /* Create a filename close to PATH_MAX in length */
>> +               char *cwd = getcwd(NULL, 0);
>
> cwd never gets freed, but that's presumably not a problem unless these
> tests ever get run under some kind of leak checker.

Yeah I found some other instances of this in the test (specifically the calls to realpath() at the top of run_tests() which alloc their own buffer) and figured for a small unit test like this it had been decided it wasn't worth freeing stuff. I'll tidy this up anyway though.

>> +
>> +               if (!cwd) {
>> +                       printf("Failed to getcwd(), errno=%d (%s)\n",
>> +                              errno, strerror(errno));
>> +                       return 2;
>> +               }
>> +               strcpy(longpath, cwd);
>> +               strcat(longpath, "/");
>>                 memset(longname, 'x', XX_DIR_LEN - 1);
>>                 longname[XX_DIR_LEN - 1] = '/';
>>                 longname[XX_DIR_LEN] = '\0';
>> -               count = (PATH_MAX - 3) / XX_DIR_LEN;
>> +               count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
>>                 for (ii = 0; ii < count; ii++) {
>>                         strcat(longpath, longname);
>>                         mkdir(longpath, 0755);
>>                 }
>> -               len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
>> +               len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
>>                 if (len <= 0)
>>                         len = 1;
>>                 memset(longname, 'y', len);
>
> There's a missing comment update around here-ish:
>
>  * Execute as a long pathname relative to ".".
>
> (should now be 'relative to "/"' I think).

Will fix.

thanks,
Steve
[PATCH] selftests/exec: include cwd in long path calculation
When creating a pathname close to PATH_MAX to test execveat, factor in the current working directory path, otherwise we end up with an absolute path that is longer than PATH_MAX. While execveat() may succeed, subsequent calls to the kernel from the runtime environment which are required to successfully execute the test binary/script may fail because of this.

To keep the semantics of the test the same, rework the relative pathname part of the test to be relative to the root directory so it isn't decreased by the length of the current working directory path.

Signed-off-by: Steve Muckle <smuckle.li...@gmail.com>
---
 tools/testing/selftests/exec/execveat.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
index 8d5d1d2ee7c1..5edc609c778b 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
 }
 
 #define XX_DIR_LEN 200
-static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
 {
 	int fail = 0;
 	int ii, count, len;
@@ -156,15 +156,24 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 
 	if (*longpath == '\0') {
 		/* Create a filename close to PATH_MAX in length */
+		char *cwd = getcwd(NULL, 0);
+
+		if (!cwd) {
+			printf("Failed to getcwd(), errno=%d (%s)\n",
+			       errno, strerror(errno));
+			return 2;
+		}
+		strcpy(longpath, cwd);
+		strcat(longpath, "/");
 		memset(longname, 'x', XX_DIR_LEN - 1);
 		longname[XX_DIR_LEN - 1] = '/';
 		longname[XX_DIR_LEN] = '\0';
-		count = (PATH_MAX - 3) / XX_DIR_LEN;
+		count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
 		for (ii = 0; ii < count; ii++) {
 			strcat(longpath, longname);
 			mkdir(longpath, 0755);
 		}
-		len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+		len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
 		if (len <= 0)
 			len = 1;
 		memset(longname, 'y', len);
@@ -200,10 +209,10 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 	 * the exit status shall be 126."), so allow either.
 	 */
 	if (is_script)
-		fail += check_execveat_invoked_rc(dot_dfd, longpath, 0,
+		fail += check_execveat_invoked_rc(root_dfd, longpath + 1, 0,
 						  127, 126);
 	else
-		fail += check_execveat(dot_dfd, longpath, 0);
+		fail += check_execveat(root_dfd, longpath + 1, 0);
 
 	return fail;
 }
@@ -218,6 +227,7 @@ static int run_tests(void)
 	int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
 					       O_DIRECTORY|O_RDONLY);
 	int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+	int root_dfd = open_or_die("/", O_DIRECTORY|O_RDONLY);
 	int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
 	int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
 	int fd = open_or_die("execveat", O_RDONLY);
@@ -353,8 +363,8 @@ static int run_tests(void)
 	/* Attempt to execute relative to non-directory => ENOTDIR */
 	fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
 
-	fail += check_execveat_pathmax(dot_dfd, "execveat", 0);
-	fail += check_execveat_pathmax(dot_dfd, "script", 1);
+	fail += check_execveat_pathmax(root_dfd, "execveat", 0);
+	fail += check_execveat_pathmax(root_dfd, "script", 1);
 	return fail;
 }
-- 
2.15.0.rc0.271.g36b669edcc-goog
Android kernel configs (was Re: configs)
On 09/19/2017 10:08 AM, John Stultz wrote:
> So what I was thinking of to improve the developer usability might be
> the following:
>
> 1) Leave the upstream configs in place. We can try to keep maintaining
> them, but it's not something the Android team is likely to care greatly
> about, and hopefully can be ignored. Meanwhile for the folks tinkering
> with upstream, there's something that helps them figure things out.
> When the next android-x.y tree comes out, you can simply remove it.
>
> 2) To the android-x.y branches, add a script to the kernel/configs/
> dir to fetch the current config directly from the config repo. This
> way it's simple for folks to get the right thing w/o looking anything
> up. This won't fly upstream, but should be fine for common.git
>
> Thoughts?

LGTM. If anyone has concerns let me know, otherwise I'll proceed with #2.

thanks,
Steve
Re: [PATCH RFC RESEND v2 0/2] Prevent cpufreq update for only task on rq that sleeps
On 09/07/2017 09:14 AM, Joel Fernandes wrote:
> I'm planning to rebase this series on Linus's master and post it
> again, but just checking any thoughts about it? Just to add more
> context, the reason for not updating the frequency:
>
> - When a last dequeue of a sleeping task happens, it is sufficient to
>   update utilization without updating the frequency because if other
>   CPUs are busy then their updates will consider the utilization of
>   the idle CPU in the shared policy unless sufficient time has passed.
>
> - If the last dequeue of a sleeping task happens while all other CPUs
>   in the cluster are idle, then the cluster will likely enter
>   cluster-idle soon.

To clarify - when you say "last dequeue of a sleeping task happens"
above, you're referring to the dequeue of the last task running on the
CPU, correct? I.e. the CPU is about to go idle?

It's been a while since I've looked at this area so would like to hold
off for a rebased version to review in further detail. But I think the
concept is valid.

thanks,
steve
Re: [PATCH V2 0/4] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
Hi Viresh,

On Tue, Nov 15, 2016 at 01:53:19PM +0530, Viresh Kumar wrote:
> This work was started by Steve Muckle, where he used a simple kthread
> instead of kthread-worker and that wasn't sufficient as some guarantees
> weren't met.

I know this has already gone in, but can you expand on the unmet
guarantees mentioned here just for my own (and perhaps others')
understanding?

thanks,
Steve
Re: [PATCH 2/3] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
On Fri, Nov 11, 2016 at 11:16:59PM +0100, Rafael J. Wysocki wrote:
> > +	struct sched_param param = { .sched_priority = 50 };
>
> I'd define a symbol for the 50. It's just one extra line of code ...

A minor point for sure, but in general what's the motivation for
defining symbols for things which are only used once? It makes it
harder to untangle and learn a piece of code IMO, having to jump to
the definitions of them to see what they are.

thanks,
Steve
Re: [PATCH 2/3] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
On Sun, Nov 13, 2016 at 03:37:18PM +0100, Rafael J. Wysocki wrote:
> > Hold on a sec. I thought during LPC someone (Peter?) made a point
> > that when RT thread run, we should bump the frequency to max? So,
> > schedutil is going to trigger schedutil to bump up the frequency to
> > max, right?
>
> No, it isn't, or at least that is unlikely.
>
> sugov_update_commit() sets sg_policy->work_in_progress before queuing
> the IRQ work and it is not cleared until the frequency changes in
> sugov_work().
>
> OTOH, sugov_should_update_freq() checks sg_policy->work_in_progress
> upfront and returns false when it is set, so the governor won't see
> its own worker threads run, unless I'm overlooking something highly
> non-obvious.

FWIW my intention with the original version of this patch (which I
neglected to communicate to Viresh) was that it would depend on
changing the frequency policy for RT. I had been using rt_avg. It
sounds like during LPC there were talks of using another metric.

It does appear things would work okay without that, but it also seems
a bit fragile. There's the window between when work_in_progress gets
cleared and the RT kthread yields. I have not thought through the
various scenarios there to see what is possible, nor tested to see if
it is significant enough to impact power-sensitive platforms.

thanks,
Steve
Re: [RFC/RFT][PATCH 0/4] cpufreq / sched: iowait boost in intel_pstate and schedutil
On Wed, Sep 07, 2016 at 05:35:50PM -0700, Srinivas Pandruvada wrote:
> Did you see any performance regression on Android workloads?

I did a few AnTuTu runs and did not observe a regression.

thanks,
Steve
Re: [RFC/RFT][PATCH 0/4] cpufreq / sched: iowait boost in intel_pstate and schedutil
On Sat, Sep 03, 2016 at 02:56:48AM +0200, Rafael J. Wysocki wrote:
> Please let me know what you think and if you can run some benchmarks
> you care about and see if the changes make any difference (this way or
> another), please do that and let me know what you've found.

LGTM (I just reviewed the first and last patch, skipping the
intel_pstate ones).

I was unable to see a conclusive power regression in Android audio,
video or idle usecases on my hikey 96board.

thanks,
Steve
Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
On Wed, Aug 31, 2016 at 06:00:02PM +0100, Juri Lelli wrote:
> > Another problem is that we have many semi related knobs; we have the
> > global RT runtime limit knob, but that doesn't affect cpufreq (maybe
> > it should)
>
> Maybe we could create this sort of link when using the cgroup RT
> throttling interface as well? It should still then fit well once we
> replace the underlying mechanism with DL reservations. And, AFAIK, the
> interface is used by Android folks already.

I'm not sure how the upper bounds can be used to infer CPU frequency...

On my Nexus 6p (an Android device), the global RT runtime limit seems
to be set at 950ms/1sec, the root cgroup is set to 800ms/1sec, and
bg_non_interactive is set at 700ms/1sec.
Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
On Wed, Aug 31, 2016 at 04:39:07PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 11:40:48AM -0700, Steve Muckle wrote:
> > A policy of going to fmax on any RT activity will be detrimental
> > for power on many platforms. Often RT accounts for only a small
> > amount of CPU activity so sending the CPU frequency to fmax is
> > overkill. Worse still, some platforms may not be able to even
> > complete the CPU frequency change before the RT activity has
> > already completed.
> >
> > Cpufreq governors have not treated RT activity this way in the past
> > so it is not part of the expected semantics of the RT scheduling
> > class. The DL class offers guarantees about task completion and
> > could be used for this purpose.
>
> Not entirely true. People have simply disabled cpufreq because of this.
>
> Yes, RR/FIFO are a pain, but they should still be deterministic, and
> variable cpufreq destroys that.

That is the way it's been with cpufreq, and many systems (including all
mobile devices) rely on that to not destroy power. RT + variable
cpufreq is not deterministic.

Given we don't have good constraints on RT tasks I don't think we
should try to strengthen the semantics there. Folks should either move
to DL if they want determinism *and* not-sucky power, or continue
disabling cpufreq if they are able to do so.

> I realize that the fmax thing is annoying, but I'm not seeing how
> rt_avg is much better.

rt_avg is much closer to the current behavior offered by the most
commonly used cpufreq governors since it tracks actual CPU utilization.
Power is not impacted by minimal RT activity and the frequency is
raised if RT activity is high.
Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
On Wed, Aug 31, 2016 at 03:31:07AM +0200, Rafael J. Wysocki wrote:
> On Friday, August 26, 2016 11:40:48 AM Steve Muckle wrote:
> > A policy of going to fmax on any RT activity will be detrimental
> > for power on many platforms. Often RT accounts for only a small
> > amount of CPU activity so sending the CPU frequency to fmax is
> > overkill. Worse still, some platforms may not be able to even
> > complete the CPU frequency change before the RT activity has
> > already completed.
> >
> > Cpufreq governors have not treated RT activity this way in the past
> > so it is not part of the expected semantics of the RT scheduling
> > class. The DL class offers guarantees about task completion and
> > could be used for this purpose.
> >
> > Modify the schedutil algorithm to instead use rt_avg as an estimate
> > of RT utilization of the CPU.
> >
> > Based on previous work by Vincent Guittot <vincent.guit...@linaro.org>.
>
> If we do it for RT, why not to do a similar thing for DL? As in the
> original patch from Peter, for example?

Agreed DL should have a similar change. I think that could be done in a
separate patch. I also would need to discuss it with the deadline sched
devs to fully understand the metric used there.

> > Signed-off-by: Steve Muckle <smuc...@linaro.org>
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 26 +-
> >  1 file changed, 17 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index cb8a77b1ef1b..89094a466250 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -146,13 +146,21 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
> >
> >  static void sugov_get_util(unsigned long *util, unsigned long *max)
> >  {
> > -	struct rq *rq = this_rq();
> > -	unsigned long cfs_max;
> > +	int cpu = smp_processor_id();
> > +	struct rq *rq = cpu_rq(cpu);
> > +	unsigned long max_cap, rt;
> > +	s64 delta;
> >
> > -	cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> > +	max_cap = arch_scale_cpu_capacity(NULL, cpu);
> >
> > -	*util = min(rq->cfs.avg.util_avg, cfs_max);
> > -	*max = cfs_max;
> > +	delta = rq_clock(rq) - rq->age_stamp;
> > +	if (unlikely(delta < 0))
> > +		delta = 0;
> > +	rt = div64_u64(rq->rt_avg, sched_avg_period() + delta);
> > +	rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;
>
> These computations are rather heavy, so I wonder if they are avoidable
> based on the flags, for example?

Yeah the div is bad. I don't know that we can avoid it based on the
flags because rt_avg will decay during CFS activity and you'd want to
take note of that.

One way to make this a little better is to assume that the divisor,
sched_avg_period() + delta, fits into 32 bits so that div_u64 can be
used, which I believe is less bad. Doing that means placing a
restriction on how large sysctl_sched_time_avg (which determines
sched_avg_period()) can be, a max of 4.2 seconds I think. I don't know
that anyone uses a value that large anyway but there's currently no
limit on it.

Another option would be just adding another separate metric to track RT
activity that is more mathematically favorable to deal with.

Both these seemed potentially heavy handed so I figured I'd just start
with the obvious, if suboptimal, solution...

> Plus is SCHED_CAPACITY_SHIFT actually defined for all architectures?

Yes.

> One more ugly thing is about using rq_clock(rq) directly from here
> whereas we pass it around as the 'time' argument elsewhere.

Sure I'll clean this up.

> > +
> > +	*util = min(rq->cfs.avg.util_avg + rt, max_cap);
> > +	*max = max_cap;
> >  }
[PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
A policy of going to fmax on any RT activity will be detrimental
for power on many platforms. Often RT accounts for only a small amount
of CPU activity so sending the CPU frequency to fmax is overkill. Worse
still, some platforms may not be able to even complete the CPU frequency
change before the RT activity has already completed.

Cpufreq governors have not treated RT activity this way in the past so
it is not part of the expected semantics of the RT scheduling class. The
DL class offers guarantees about task completion and could be used for
this purpose.

Modify the schedutil algorithm to instead use rt_avg as an estimate of
RT utilization of the CPU.

Based on previous work by Vincent Guittot <vincent.guit...@linaro.org>.

Signed-off-by: Steve Muckle <smuc...@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index cb8a77b1ef1b..89094a466250 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -146,13 +146,21 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
 
 static void sugov_get_util(unsigned long *util, unsigned long *max)
 {
-	struct rq *rq = this_rq();
-	unsigned long cfs_max;
+	int cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long max_cap, rt;
+	s64 delta;
 
-	cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
+	max_cap = arch_scale_cpu_capacity(NULL, cpu);
 
-	*util = min(rq->cfs.avg.util_avg, cfs_max);
-	*max = cfs_max;
+	delta = rq_clock(rq) - rq->age_stamp;
+	if (unlikely(delta < 0))
+		delta = 0;
+	rt = div64_u64(rq->rt_avg, sched_avg_period() + delta);
+	rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;
+
+	*util = min(rq->cfs.avg.util_avg + rt, max_cap);
+	*max = max_cap;
 }
 
 static void sugov_update_single(struct update_util_data *hook, u64 time,
@@ -167,7 +175,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
-	if (flags & SCHED_CPUFREQ_RT_DL) {
+	if (flags & SCHED_CPUFREQ_DL) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
 		sugov_get_util(&util, &max);
@@ -186,7 +194,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
 	u64 last_freq_update_time = sg_policy->last_freq_update_time;
 	unsigned int j;
 
-	if (flags & SCHED_CPUFREQ_RT_DL)
+	if (flags & SCHED_CPUFREQ_DL)
 		return max_f;
 
 	for_each_cpu(j, policy->cpus) {
@@ -209,7 +217,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
 		if (delta_ns > TICK_NSEC)
 			continue;
 
-		if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+		if (j_sg_cpu->flags & SCHED_CPUFREQ_DL)
 			return max_f;
 
 		j_util = j_sg_cpu->util;
@@ -467,7 +475,7 @@ static int sugov_start(struct cpufreq_policy *policy)
 	if (policy_is_shared(policy)) {
 		sg_cpu->util = 0;
 		sg_cpu->max = 0;
-		sg_cpu->flags = SCHED_CPUFREQ_RT;
+		sg_cpu->flags = SCHED_CPUFREQ_DL;
 		sg_cpu->last_update = 0;
 		sg_cpu->cached_raw_freq = 0;
 		cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
-- 
2.7.3
[PATCH 0/2] utilization changes for schedutil
The first patch (which is a re-send) corrects an issue with utilization
on SMT platforms. Currently CFS utilization is being roughly doubled on
these platforms due to differences in the way PELT and overall max CPU
capacity are calculated.

The second patch moves away from the policy of going to fmax during RT
task activity, instead using rt_avg as an estimate of RT utilization.

Steve Muckle (2):
  sched: cpufreq: ignore SMT when determining max cpu capacity
  sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

 kernel/sched/cpufreq_schedutil.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

-- 
2.7.3
[PATCH 1/2] sched: cpufreq: ignore SMT when determining max cpu capacity
PELT does not consider SMT when scaling its utilization values via
arch_scale_cpu_capacity(). The value in rq->cpu_capacity_orig does take
SMT into consideration though and therefore may be smaller than the
utilization reported by PELT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024. This means that a 50% utilized CPU will
show up in schedutil as ~86% busy.

Fix this by using the same CPU scaling value in schedutil as that which
is used by PELT.

Signed-off-by: Steve Muckle <smuc...@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 60d985f4dc47..cb8a77b1ef1b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -147,7 +147,9 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
 static void sugov_get_util(unsigned long *util, unsigned long *max)
 {
 	struct rq *rq = this_rq();
-	unsigned long cfs_max = rq->cpu_capacity_orig;
+	unsigned long cfs_max;
+
+	cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
 
 	*util = min(rq->cfs.avg.util_avg, cfs_max);
 	*max = cfs_max;
-- 
2.7.3
Re: [PATCH] sched: fix incorrect PELT values on SMT
On Fri, Aug 19, 2016 at 04:00:57PM +0100, Dietmar Eggemann wrote: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 61d485421bed..95d34b337152 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg > > *sa, > > sa->last_update_time = now; > > > > scale_freq = arch_scale_freq_capacity(NULL, cpu); > > - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); > > + scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu); > > Wouldn't you have to subscribe to this rcu pointer rq->sd w/ something > like 'rcu_dereference(cpu_rq(cpu)->sd)'? > > IMHO, __update_load_avg() is called outside existing RCU read-side > critical sections as well so there would be a pair of > rcu_read_lock()/rcu_read_unlock() required in this case. Thanks Dietmar for the review. Yeah I didn't consider that this was protected with rcu. It looks like I'm abandoning this approach anyway though and doing something limited just to schedutil. thanks, Steve
Re: [PATCH] sched: fix incorrect PELT values on SMT
On Fri, Aug 19, 2016 at 04:30:39PM +0100, Morten Rasmussen wrote: > Hi Steve, > > On Thu, Aug 18, 2016 at 06:55:41PM -0700, Steve Muckle wrote: > > PELT scales its util_sum and util_avg values via > > arch_scale_cpu_capacity(). If that function is passed the CPU's sched > > domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY > > is set. PELT does not pass in the sd however. The other caller of > > arch_scale_cpu_capacity, update_cpu_capacity(), does. This means > > util_sum and util_avg scale beyond the CPU capacity on SMT. > > > > On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but > > util_avg scales up to 1024. > > I can't convince myself whether this is the right thing to do. SMT is a > bit 'special' and it depends on how you model SMT capacity. > > I'm no SMT expert, but the way I understand the current SMT capacity > model is that capacity_orig represents the capacity of the SMT-thread > when all its thread-siblings are busy. The true capacity of an > SMT-thread where all thread-siblings are idle is actually 1024, but we > don't model this (it would be nightmare to track when the capacity > should change). The capacity of a core with two or more SMT-threads is > chosen to be 1024 + smt_gain, where smt_gain is supposed represent the > additional throughput we gain for the additional SMT-threads. The reason > why we don't have 1024 per thread is that we would prefer to have only > one task per core if possible. > > With util_avg scaling to 1024 a core (capacity = 2*589) would be nearly > 'full' with just one always-running task. If we change util_avg to max > out at 589, it would take two always-running tasks for the combined > utilization to match the core capacity. So we may loose some bias > towards spreading for SMT systems. > > AFAICT, group_is_overloaded() and group_has_capacity() would both be > affected by this patch. 
> > Interestingly, Vincent recently proposed to set the SMT-thread capacity > to 1024 which would affectively make all the current SMT code redundant. > It would make things a lot simpler, but I'm not sure if we can get away > with it. It would need discussion at least. > > Opinions? Thanks for having a look. The reason I pushed this patch was to address an issue with the schedutil governor - demand is effectively doubled on SMT systems due to the above scheme. But this can just be fixed for schedutil by using a max value there consistent with what __update_load_avg() is using. I'll send another patch. It looks like there's a good reason for the current PELT scaling w.r.t. SMT in the scheduler/load balancer. thanks, Steve
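The SMT capacity model Morten describes can be sketched as follows. Assumption: the numbers mirror the kernel defaults of that era, where a two-thread core's total capacity (1024 plus the smt_gain throughput bonus) works out to 1178, split evenly across the sibling threads; the function name is illustrative, not the kernel's:

```c
#include <assert.h>

/* Each SMT thread is credited with an equal share of the core's
 * capacity, so a thread's capacity_orig is below 1024 even though a
 * lone thread on an otherwise-idle core could reach the full scale. */
static unsigned long smt_thread_capacity(unsigned long core_capacity,
					 unsigned int nr_siblings)
{
	return core_capacity / nr_siblings;
}
```

With the assumed 1178 core capacity and two siblings, this yields 589 per thread — the cpu_capacity_orig value quoted for the i7-3630QM.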
Re: [PATCH] sched: fix incorrect PELT values on SMT
On Fri, Aug 19, 2016 at 10:30:36AM +0800, Wanpeng Li wrote: > 2016-08-19 9:55 GMT+08:00 Steve Muckle <steve.muc...@linaro.org>: > > PELT scales its util_sum and util_avg values via > > arch_scale_cpu_capacity(). If that function is passed the CPU's sched > > domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY > > is set. PELT does not pass in the sd however. The other caller of > > arch_scale_cpu_capacity, update_cpu_capacity(), does. This means > > util_sum and util_avg scale beyond the CPU capacity on SMT. > > > > On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but > > util_avg scales up to 1024. > > > > Fix this by passing in the sd in __update_load_avg() as well. > > I believe we notice this at least several months ago. > https://lkml.org/lkml/2016/5/25/228 Glad to see I'm not alone in thinking this is an issue. It causes an issue with schedutil, effectively doubling the apparent demand on SMT. I don't know the load balance code well enough offhand to say whether it's an issue there. cheers, Steve
[PATCH] sched: fix incorrect PELT values on SMT
PELT scales its util_sum and util_avg values via arch_scale_cpu_capacity(). If that function is passed the CPU's sched domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY is set. PELT does not pass in the sd however. The other caller of arch_scale_cpu_capacity, update_cpu_capacity(), does. This means util_sum and util_avg scale beyond the CPU capacity on SMT. On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but util_avg scales up to 1024. Fix this by passing in the sd in __update_load_avg() as well. Signed-off-by: Steve Muckle <smuc...@linaro.org> --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 61d485421bed..95d34b337152 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, sa->last_update_time = now; scale_freq = arch_scale_freq_capacity(NULL, cpu); - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); + scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu); /* delta_w is the amount already accumulated against our next period */ delta_w = sa->period_contrib; -- 2.7.3
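The asymmetry the changelog describes can be modeled in isolation. This is an illustrative sketch, not the kernel's implementation: the SMT reduction only applies when the caller hands in a sched domain carrying SD_SHARE_CPUCAPACITY, and PELT passed NULL, so it always got the full scale:

```c
#include <assert.h>
#include <stddef.h>

#define SD_SHARE_CPUCAPACITY	0x1
#define SCHED_CAPACITY_SCALE	1024UL

/* trimmed stand-in for the kernel's struct; fields are illustrative */
struct sched_domain {
	int flags;
	unsigned long smt_gain;		/* total capacity of the core */
	unsigned int span_weight;	/* number of sibling threads */
};

static unsigned long scale_cpu_capacity(const struct sched_domain *sd)
{
	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && sd->span_weight > 1)
		return sd->smt_gain / sd->span_weight;
	return SCHED_CAPACITY_SCALE;	/* the NULL-sd path PELT was using */
}
```

Passing the domain (as update_cpu_capacity() does) yields the reduced per-thread capacity; passing NULL (as __update_load_avg() did) yields 1024, which is why util_avg could exceed cpu_capacity_orig.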
Re: [PATCH v2 2/2] cpufreq / sched: Pass runqueue pointer to cpufreq_update_util()
LGTM On Fri, Aug 12, 2016 at 02:06:44AM +0200, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > All of the callers of cpufreq_update_util() pass rq_clock(rq) to it > as the time argument and some of them check whether or not cpu_of(rq) > is equal to smp_processor_id() before calling it, so rework it to > take a runqueue pointer as the argument and move the rq_clock(rq) > evaluation into it. > > Additionally, provide a wrapper checking cpu_of(rq) against > smp_processor_id() for the cpufreq_update_util() callers that > need it. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a new patch based on the [2/2] from v1. > > --- > kernel/sched/deadline.c |3 +-- > kernel/sched/fair.c |4 +--- > kernel/sched/rt.c |3 +-- > kernel/sched/sched.h| 15 +++ > 4 files changed, 14 insertions(+), 11 deletions(-) > > Index: linux-pm/kernel/sched/deadline.c > === > --- linux-pm.orig/kernel/sched/deadline.c > +++ linux-pm/kernel/sched/deadline.c > @@ -733,8 +733,7 @@ static void update_curr_dl(struct rq *rq > } > > /* kick cpufreq (see the comment in kernel/sched/sched.h). */ > - if (cpu_of(rq) == smp_processor_id()) > - cpufreq_update_util(rq_clock(rq), SCHED_CPUFREQ_DL); > + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_DL); > > schedstat_set(curr->se.statistics.exec_max, > max(curr->se.statistics.exec_max, delta_exec)); > Index: linux-pm/kernel/sched/fair.c > === > --- linux-pm.orig/kernel/sched/fair.c > +++ linux-pm/kernel/sched/fair.c > @@ -2876,8 +2876,6 @@ static inline void update_tg_load_avg(st > static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) > { > if (&this_rq()->cfs == cfs_rq) { > - struct rq *rq = rq_of(cfs_rq); > - > /* >* There are a few boundary cases this might miss but it should >* get called often enough that that should (hopefully) not be > @@ -2894,7 +2892,7 @@ static inline void cfs_rq_util_change(st >* >* See cpu_util().
>*/ > - cpufreq_update_util(rq_clock(rq), 0); > + cpufreq_update_util(rq_of(cfs_rq), 0); > } > } > > Index: linux-pm/kernel/sched/rt.c > === > --- linux-pm.orig/kernel/sched/rt.c > +++ linux-pm/kernel/sched/rt.c > @@ -958,8 +958,7 @@ static void update_curr_rt(struct rq *rq > return; > > /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ > - if (cpu_of(rq) == smp_processor_id()) > - cpufreq_update_util(rq_clock(rq), SCHED_CPUFREQ_RT); > + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT); > > schedstat_set(curr->se.statistics.exec_max, > max(curr->se.statistics.exec_max, delta_exec)); > Index: linux-pm/kernel/sched/sched.h > === > --- linux-pm.orig/kernel/sched/sched.h > +++ linux-pm/kernel/sched/sched.h > @@ -1763,7 +1763,7 @@ DECLARE_PER_CPU(struct update_util_data > > /** > * cpufreq_update_util - Take a note about CPU utilization changes. > - * @time: Current time. > + * @rq: Runqueue to carry out the update for. > * @flags: Update reason flags. > * > * This function is called by the scheduler on the CPU whose utilization is > @@ -1783,16 +1783,23 @@ DECLARE_PER_CPU(struct update_util_data > * but that really is a band-aid. Going forward it should be replaced with > * solutions targeted more specifically at RT and DL tasks. 
> */ > -static inline void cpufreq_update_util(u64 time, unsigned int flags) > +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) > { > struct update_util_data *data; > > data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); > if (data) > - data->func(data, time, flags); > + data->func(data, rq_clock(rq), flags); > +} > + > +static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags) > +{ > + if (cpu_of(rq) == smp_processor_id()) > + cpufreq_update_util(rq, flags); > } > #else > -static inline void cpufreq_update_util(u64 time, unsigned int flags) {} > +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} > +static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int > flags) {} > #endif /* CONFIG_CPU_FREQ */ > > #ifdef arch_scale_freq_capacity >
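The refactoring pattern in the patch above — hoisting the cpu_of(rq) == smp_processor_id() check out of the call sites and into a thin wrapper — can be sketched standalone. All names here are simplified stand-ins for the kernel ones:

```c
#include <assert.h>

struct rq { int cpu; };

static int g_this_cpu;	/* models smp_processor_id() */
static int g_updates;	/* counts how often the hook fired */

/* models cpufreq_update_util(): always fires when called */
static void update_util(struct rq *rq, unsigned int flags)
{
	(void)rq; (void)flags;
	g_updates++;	/* stands in for data->func(data, rq_clock(rq), flags) */
}

/* models cpufreq_update_this_cpu(): filters out remote runqueues */
static void update_util_this_cpu(struct rq *rq, unsigned int flags)
{
	if (rq->cpu == g_this_cpu)	/* the hoisted check */
		update_util(rq, flags);
}
```

Callers that must only act on the local CPU (RT and DL in the patch) use the wrapper; callers that have already established locality call the unchecked variant directly.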
Re: [PATCH v2 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()
LGTM On Fri, Aug 12, 2016 at 02:04:42AM +0200, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki> > It is useful to know the reason why cpufreq_update_util() has just > been called and that can be passed as flags to cpufreq_update_util() > and to the ->func() callback in struct update_util_data. However, > doing that in addition to passing the util and max arguments they > already take would be clumsy, so avoid it. > > Instead, use the observation that the schedutil governor is part > of the scheduler proper, so it can access scheduler data directly. > This allows the util and max arguments of cpufreq_update_util() > and the ->func() callback in struct update_util_data to be replaced > with a flags one, but schedutil has to be modified to follow. > > Thus make the schedutil governor obtain the CFS utilization > information from the scheduler and use the "RT" and "DL" flags > instead of the special utilization value of ULONG_MAX to track > updates from the RT and DL sched classes. Make it non-modular > too to avoid having to export scheduler variables to modules at > large. > > Next, update all of the other users of cpufreq_update_util() > and the ->func() callback in struct update_util_data accordingly. > > Suggested-by: Peter Zijlstra > Signed-off-by: Rafael J. Wysocki > --- > > v1 -> v2: Do not check cpu_of(rq) against smp_processor_id() in > cfs_rq_util_change(). 
> > --- > drivers/cpufreq/Kconfig|5 -- > drivers/cpufreq/cpufreq_governor.c |2 - > drivers/cpufreq/intel_pstate.c |2 - > include/linux/sched.h | 12 -- > kernel/sched/cpufreq.c |2 - > kernel/sched/cpufreq_schedutil.c | 67 > - > kernel/sched/deadline.c|4 +- > kernel/sched/fair.c| 10 + > kernel/sched/rt.c |4 +- > kernel/sched/sched.h | 31 + > 10 files changed, 66 insertions(+), 73 deletions(-) > > Index: linux-pm/drivers/cpufreq/cpufreq_governor.c > === > --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c > +++ linux-pm/drivers/cpufreq/cpufreq_governor.c > @@ -260,7 +260,7 @@ static void dbs_irq_work(struct irq_work > } > > static void dbs_update_util_handler(struct update_util_data *data, u64 time, > - unsigned long util, unsigned long max) > + unsigned int flags) > { > struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, > update_util); > struct policy_dbs_info *policy_dbs = cdbs->policy_dbs; > Index: linux-pm/drivers/cpufreq/intel_pstate.c > === > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c > +++ linux-pm/drivers/cpufreq/intel_pstate.c > @@ -1329,7 +1329,7 @@ static inline void intel_pstate_adjust_b > } > > static void intel_pstate_update_util(struct update_util_data *data, u64 time, > - unsigned long util, unsigned long max) > + unsigned int flags) > { > struct cpudata *cpu = container_of(data, struct cpudata, update_util); > u64 delta_ns = time - cpu->sample.time; > Index: linux-pm/include/linux/sched.h > === > --- linux-pm.orig/include/linux/sched.h > +++ linux-pm/include/linux/sched.h > @@ -3469,15 +3469,19 @@ static inline unsigned long rlimit_max(u > return task_rlimit_max(current, limit); > } > > +#define SCHED_CPUFREQ_RT (1U << 0) > +#define SCHED_CPUFREQ_DL (1U << 1) > + > +#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL) > + > #ifdef CONFIG_CPU_FREQ > struct update_util_data { > - void (*func)(struct update_util_data *data, > - u64 time, unsigned long util, unsigned long max); > + void (*func)(struct update_util_data 
*data, u64 time, unsigned int > flags); > }; > > void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data, > - void (*func)(struct update_util_data *data, u64 time, > - unsigned long util, unsigned long max)); > + void (*func)(struct update_util_data *data, u64 time, > + unsigned int flags)); > void cpufreq_remove_update_util_hook(int cpu); > #endif /* CONFIG_CPU_FREQ */ > > Index: linux-pm/kernel/sched/cpufreq.c > === > --- linux-pm.orig/kernel/sched/cpufreq.c > +++ linux-pm/kernel/sched/cpufreq.c > @@ -33,7 +33,7 @@ DEFINE_PER_CPU(struct update_util_data * > */ > void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data, > void (*func)(struct
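The flag encoding introduced by the quoted hunk (the three macro values are copied from the patch itself) lets a governor test which sched class triggered the update instead of decoding a magic ULONG_MAX utilization value; the helper below is an illustrative consumer, not kernel code:

```c
#include <assert.h>

#define SCHED_CPUFREQ_RT	(1U << 0)
#define SCHED_CPUFREQ_DL	(1U << 1)
#define SCHED_CPUFREQ_RT_DL	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)

/* did an RT or DL task trigger this cpufreq update? */
static int update_from_rt_or_dl(unsigned int flags)
{
	return (flags & SCHED_CPUFREQ_RT_DL) != 0;
}
```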
Re: [PATCH 2/2] cpufreq / sched: Check cpu_of(rq) in cpufreq_update_util()
On Wed, Aug 10, 2016 at 03:11:17AM +0200, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/fair.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -2876,8 +2876,6 @@ static inline void update_tg_load_avg(st
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
>  	if (&this_rq()->cfs == cfs_rq) {
> -		struct rq *rq = rq_of(cfs_rq);
> -
>  		/*
>  		 * There are a few boundary cases this might miss but it should
>  		 * get called often enough that that should (hopefully) not be
> @@ -2894,8 +2892,7 @@ static inline void cfs_rq_util_change(st
>  		 *
>  		 * See cpu_util().
>  		 */
> -		if (cpu_of(rq) == smp_processor_id())
> -			cpufreq_update_util(rq_clock(rq), 0);
> +		cpufreq_update_util(rq_of(cfs_rq), 0);
>  	}

...

> +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
>  {
>  	struct update_util_data *data;
>
> +	if (cpu_of(rq) != smp_processor_id())
> +		return;

This test is unnecessary in the CFS (most common) path due to the check
on this_rq in cfs_rq_util_change(). I think instead of bringing the test
into cpufreq_update_util() it should be left at the call sites for RT and
DL, and removed from CFS as part of the first patch.

thanks,
Steve
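The trade-off being debated — centralizing the cross-CPU check inside cpufreq_update_util() versus leaving it at the call sites — can be illustrated with a small userspace sketch. The types and counters here are made up; this is not the kernel code, just the shape of the centralized check.

```c
#include <assert.h>

/* Toy runqueue: only the owning CPU id matters for this sketch. */
struct rq { int cpu; };

static int current_cpu;          /* stands in for smp_processor_id() */
static int updates_delivered;    /* counts updates that reach the governor */

/* Centralized variant: every caller funnels through this one check,
 * so updates for a remote CPU's runqueue are silently dropped. */
static void cpufreq_update_util(struct rq *rq, unsigned int flags)
{
    (void)flags;
    if (rq->cpu != current_cpu)
        return;
    updates_delivered++;
}
```

The objection in the mail above is that in the CFS path this check is already implied by the `&this_rq()->cfs == cfs_rq` test in the caller, so funneling every update through it only adds a redundant comparison on the hottest path.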
Re: [Update][PATCH 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()
On Thu, Aug 11, 2016 at 11:03:47AM -0700, Steve Muckle wrote:
> On Wed, Aug 10, 2016 at 03:49:07AM +0200, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/sched/fair.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/fair.c
> > +++ linux-pm/kernel/sched/fair.c
> > @@ -2875,11 +2875,8 @@ static inline void update_tg_load_avg(st
> >  
> >  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
> >  {
> > -	struct rq *rq = rq_of(cfs_rq);
> > -	int cpu = cpu_of(rq);
> > -
> > -	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> > -		unsigned long max = rq->cpu_capacity_orig;
> > +	if (&this_rq()->cfs == cfs_rq) {
> > +		struct rq *rq = rq_of(cfs_rq);
> >  
> >  		/*
> >  		 * There are a few boundary cases this might miss but it should
> > @@ -2897,8 +2894,8 @@ static inline void cfs_rq_util_change(st
> >  		 *
> >  		 * See cpu_util().
> >  		 */
> > -		cpufreq_update_util(rq_clock(rq),
> > -				    min(cfs_rq->avg.util_avg, max), max);
> > +		if (cpu_of(rq) == smp_processor_id())
>
> Isn't this test against smp_processor_id() redundant since
> this_rq()->cfs == cfs_rq?

Sorry, I see this is modified in the next patch.
Re: [Update][PATCH 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()
On Wed, Aug 10, 2016 at 03:49:07AM +0200, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/fair.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -2875,11 +2875,8 @@ static inline void update_tg_load_avg(st
>  
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
> -	struct rq *rq = rq_of(cfs_rq);
> -	int cpu = cpu_of(rq);
> -
> -	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> -		unsigned long max = rq->cpu_capacity_orig;
> +	if (&this_rq()->cfs == cfs_rq) {
> +		struct rq *rq = rq_of(cfs_rq);
>  
>  		/*
>  		 * There are a few boundary cases this might miss but it should
> @@ -2897,8 +2894,8 @@ static inline void cfs_rq_util_change(st
>  		 *
>  		 * See cpu_util().
>  		 */
> -		cpufreq_update_util(rq_clock(rq),
> -				    min(cfs_rq->avg.util_avg, max), max);
> +		if (cpu_of(rq) == smp_processor_id())

Isn't this test against smp_processor_id() redundant since
this_rq()->cfs == cfs_rq?

> +			cpufreq_update_util(rq_clock(rq), 0);

All else looked good to me.

thanks,
Steve
Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition
On Thu, Aug 04, 2016 at 11:19:00PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, August 03, 2016 07:24:18 PM Steve Muckle wrote:
> > On Wed, Aug 03, 2016 at 12:38:20AM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Aug 3, 2016 at 12:02 AM, Steve Muckle <steve.muc...@linaro.org> wrote:
> > > > On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> > > >> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle <steve.muc...@linaro.org> wrote:
> > > >> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> > > >> > ...
> > > >> >> For this purpose, define a new cpufreq_update_util() flag
> > > >> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> > > >> >> cpufreq_update_util() in the in_iowait case. That generally
> > > >> >> requires cpufreq_update_util() to be called directly from there,
> > > >> >> because update_load_avg() is not likely to be invoked in that
> > > >> >> case.
> > > >> >
> > > >> > I didn't follow why the cpufreq hook won't likely be called if
> > > >> > in_iowait is set? AFAICS update_load_avg() gets called in the second loop
> > > >> > and calls update_cfs_rq_load_avg (triggers the hook).
> > > >>
> > > >> In practice it turns out that in the majority of cases when in_iowait
> > > >> is set the second loop will not run.
> > > >
> > > > My understanding of enqueue_task_fair() is that the first loop walks up
> > > > the portion of the sched_entity hierarchy that needs to be enqueued, and
> > > > the second loop updates the rest of the hierarchy that was already
> > > > enqueued.
> > > >
> > > > Even if the se corresponding to the root cfs_rq needs to be enqueued
> > > > (meaning the whole hierarchy is traversed in the first loop and the
> > > > second loop does nothing), enqueue_entity() on the root cfs_rq should
> > > > result in the cpufreq hook being called, via enqueue_entity() ->
> > > > enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
> > >
> > > But then it's rather difficult to pass the IO flag to this one, isn't it?
> > >
> > > Essentially, the problem is to pass "IO" to cpufreq_update_util() when
> > > p->in_iowait is set.
> > >
> > > If you can find a clever way to do it without adding an extra call
> > > site, that's fine by me, but in any case the extra
> > > cpufreq_update_util() invocation should not be too expensive.
> >
> > I was under the impression that function pointer calls were more
> > expensive, and in the shared policy case there is a nontrivial amount of
> > code that is run in schedutil (including taking a spinlock) before we'd
> > see sugov_should_update_freq() return false and bail.
>
> That's correct in principle, but we only do that if p->in_iowait is set,
> which is somewhat special anyway and doesn't happen every time for sure.
>
> So while there is overhead theoretically, I'm not even sure if it is
> measurable.

Ok my worry was if there were IO-heavy workloads that would hammer this
path, but I don't know of any specifically or how often this path can be
taken.

> > Agreed that getting knowledge of p->in_iowait down to the existing hook
> > is not easy. I spent some time fiddling with that. It seemed doable but
> > somewhat gross due to the required flag passing and modifications
> > to enqueue_entity, update_load_avg, etc. If it is decided that it is worth
> > pursuing I can keep working on it and post a draft.
>
> Well, that's a Peter's call. :-)
>
> > But I also wonder if the hooks are in the best location. They are
> > currently deep in the PELT code. This may make sense from a theoretical
> > standpoint, calling them whenever a root cfs_rq utilization changes, but
> > it also makes the hooks difficult to correlate (for policy purposes such
> > as this iowait change) with higher level logical events like a task
> > wakeup. Or load balance where we probably want to call the hook just
> > once after a load balance is complete.
>
> I generally agree. We still need to ensure that the hooks will be invoked
> frequently enough, though, even if HZ is 100.
>
> > This is also an issue for the remote wakeup case where I currently have
> > another invocation of the hook in check_preempt_curr(), so I can know if
> > preemption was triggered and skip a remote schedutil update in that case
> > to avoid a duplicate IPI.
> >
> > It seems to me worth evaluating if a higher level set of hook locations
> > could be used. One possibility is higher up in CFS:
> > - enqueue_task_fair, dequeue_task_fair
> > - scheduler_tick
> > - active_load_balance_cpu_stop, load_balance
>
> Agreed, that's worth checking.
>
> > Though this wouldn't solve my issue with check_preempt_curr. That would
> > probably require going further up the stack to try_to_wake_up() etc. Not
> > yet sure what the other hook locations would be at that level.
>
> That's probably too far away from the root cfs_rq utilization changes IMO.

Is your concern that the rate of hook calls would be decreased?

thanks,
Steve
Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition
On Wed, Aug 03, 2016 at 12:38:20AM +0200, Rafael J. Wysocki wrote:
> On Wed, Aug 3, 2016 at 12:02 AM, Steve Muckle <steve.muc...@linaro.org> wrote:
> > On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> >> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle <steve.muc...@linaro.org> wrote:
> >> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> >> > ...
> >> >> For this purpose, define a new cpufreq_update_util() flag
> >> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> >> >> cpufreq_update_util() in the in_iowait case. That generally
> >> >> requires cpufreq_update_util() to be called directly from there,
> >> >> because update_load_avg() is not likely to be invoked in that
> >> >> case.
> >> >
> >> > I didn't follow why the cpufreq hook won't likely be called if
> >> > in_iowait is set? AFAICS update_load_avg() gets called in the second loop
> >> > and calls update_cfs_rq_load_avg (triggers the hook).
> >>
> >> In practice it turns out that in the majority of cases when in_iowait
> >> is set the second loop will not run.
> >
> > My understanding of enqueue_task_fair() is that the first loop walks up
> > the portion of the sched_entity hierarchy that needs to be enqueued, and
> > the second loop updates the rest of the hierarchy that was already
> > enqueued.
> >
> > Even if the se corresponding to the root cfs_rq needs to be enqueued
> > (meaning the whole hierarchy is traversed in the first loop and the
> > second loop does nothing), enqueue_entity() on the root cfs_rq should
> > result in the cpufreq hook being called, via enqueue_entity() ->
> > enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
>
> But then it's rather difficult to pass the IO flag to this one, isn't it?
>
> Essentially, the problem is to pass "IO" to cpufreq_update_util() when
> p->in_iowait is set.
>
> If you can find a clever way to do it without adding an extra call
> site, that's fine by me, but in any case the extra
> cpufreq_update_util() invocation should not be too expensive.

I was under the impression that function pointer calls were more
expensive, and in the shared policy case there is a nontrivial amount of
code that is run in schedutil (including taking a spinlock) before we'd
see sugov_should_update_freq() return false and bail.

Agreed that getting knowledge of p->in_iowait down to the existing hook
is not easy. I spent some time fiddling with that. It seemed doable but
somewhat gross due to the required flag passing and modifications
to enqueue_entity, update_load_avg, etc. If it is decided that it is worth
pursuing I can keep working on it and post a draft.

But I also wonder if the hooks are in the best location. They are
currently deep in the PELT code. This may make sense from a theoretical
standpoint, calling them whenever a root cfs_rq utilization changes, but
it also makes the hooks difficult to correlate (for policy purposes such
as this iowait change) with higher level logical events like a task
wakeup. Or load balance where we probably want to call the hook just
once after a load balance is complete.

This is also an issue for the remote wakeup case where I currently have
another invocation of the hook in check_preempt_curr(), so I can know if
preemption was triggered and skip a remote schedutil update in that case
to avoid a duplicate IPI.

It seems to me worth evaluating if a higher level set of hook locations
could be used. One possibility is higher up in CFS:
- enqueue_task_fair, dequeue_task_fair
- scheduler_tick
- active_load_balance_cpu_stop, load_balance

Though this wouldn't solve my issue with check_preempt_curr. That would
probably require going further up the stack to try_to_wake_up() etc. Not
yet sure what the other hook locations would be at that level.

thanks,
Steve
Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition
On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle <steve.muc...@linaro.org> wrote:
> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> > ...
> >> For this purpose, define a new cpufreq_update_util() flag
> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> >> cpufreq_update_util() in the in_iowait case. That generally
> >> requires cpufreq_update_util() to be called directly from there,
> >> because update_load_avg() is not likely to be invoked in that
> >> case.
> >
> > I didn't follow why the cpufreq hook won't likely be called if
> > in_iowait is set? AFAICS update_load_avg() gets called in the second loop
> > and calls update_cfs_rq_load_avg (triggers the hook).
>
> In practice it turns out that in the majority of cases when in_iowait
> is set the second loop will not run.

My understanding of enqueue_task_fair() is that the first loop walks up
the portion of the sched_entity hierarchy that needs to be enqueued, and
the second loop updates the rest of the hierarchy that was already
enqueued.

Even if the se corresponding to the root cfs_rq needs to be enqueued
(meaning the whole hierarchy is traversed in the first loop and the
second loop does nothing), enqueue_entity() on the root cfs_rq should
result in the cpufreq hook being called, via enqueue_entity() ->
enqueue_entity_load_avg() -> update_cfs_rq_load_avg().

I'll keep looking to see if I've misunderstood something in here.

thanks,
Steve
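The two-loop structure of enqueue_task_fair() described above can be sketched in a few lines of userspace C. This is a deliberately simplified toy (no cfs_rq, no load tracking, names invented for the sketch): the first loop enqueues entities until it hits an already-queued ancestor, and the second loop only updates whatever part of the hierarchy remains above that point.

```c
#include <assert.h>
#include <stddef.h>

/* Toy scheduling entity: just a parent link and an on-runqueue flag. */
struct sched_entity {
    struct sched_entity *parent;
    int on_rq;
};

static int enqueued;   /* entities enqueued by loop 1 */
static int updated;    /* already-queued entities updated by loop 2 */

static void toy_enqueue_task(struct sched_entity *se)
{
    /* Loop 1: walk up, enqueueing, until an ancestor is already queued. */
    for (; se; se = se->parent) {
        if (se->on_rq)
            break;
        se->on_rq = 1;
        enqueued++;
    }

    /* Loop 2: update the rest of the (already-enqueued) hierarchy.
     * If loop 1 reached the root, se is NULL and this body never runs —
     * the "second loop will not run" case from the discussion. */
    for (; se; se = se->parent)
        updated++;
}
```

With the child unqueued and the root already queued, loop 1 enqueues the child and loop 2 updates the root; if the whole chain were unqueued, loop 2's body would not execute at all.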
Re: [RFC][PATCH 1/7] cpufreq / sched: Make schedutil access utilization data directly
On Tue, Aug 02, 2016 at 11:38:17AM +0100, Juri Lelli wrote:
> > > Anyway one way that my patch differed was that I had used the flags
> > > field to keep the behavior the same for both RT and DL.
>
> Do you mean "go to max" policy for both, until proper policies will be
> implemented in the future?

Yep.

> > > That happens later on in this series for RT but the DL policy is
> > > modified as above. Can the DL policy be left as-is and
> > > discussed/modified in a separate series?
>
> Not that we want to start discussing this point now, if we postpone the
> change for later, but I just wanted to point out a difference w.r.t.
> what the schedfreq thing was doing: it used to sum contributions from
> the different classes, instead of taking the max. We probably never
> really discussed on the list what is the right thing to do, though.

Yeah I figured that was worth deferring into its own patchset/thread.

cheers,
Steve
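The two aggregation policies contrasted above — schedfreq summing per-class utilization contributions versus taking the max — can be shown side by side. This is a hypothetical userspace illustration; the function names, the capacity clamp, and the numbers are all invented for the sketch, not taken from either implementation.

```c
#include <assert.h>

/* schedfreq-style: sum the per-class contributions, clamped to capacity. */
static unsigned long agg_sum(unsigned long cfs, unsigned long rt,
                             unsigned long dl, unsigned long cap)
{
    unsigned long u = cfs + rt + dl;
    return u < cap ? u : cap;
}

/* Alternative: take the largest single-class contribution. */
static unsigned long agg_max(unsigned long cfs, unsigned long rt,
                             unsigned long dl)
{
    unsigned long u = cfs > rt ? cfs : rt;
    return u > dl ? u : dl;
}
```

The difference only matters when several classes are active at once: summing requests a higher frequency for mixed loads, while max ignores all but the busiest class.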
Re: [RFC][PATCH 4/7] cpufreq / sched: Add flags argument to cpufreq_update_util()
On Tue, Aug 02, 2016 at 01:44:41AM +0200, Rafael J. Wysocki wrote:
> On Monday, August 01, 2016 12:59:30 PM Steve Muckle wrote:
> > On Mon, Aug 01, 2016 at 04:57:18PM +0200, Rafael J. Wysocki wrote:
> > > On Monday, August 01, 2016 09:33:12 AM Dominik Brodowski wrote:
> > > > On Mon, Aug 01, 2016 at 01:36:46AM +0200, Rafael J. Wysocki wrote:
> > > > > +#define UUF_RT 0x01
> > > >
> > > > What does UUF stand for?
> > >
> > > "Utilization update flag".
> >
> > I had wondered the same - in my patchset I used CPUFREQ_SCHED_UPDATE_* for the
> > prefixes, though I guess some may object to the length.
>
> Well, OK.
>
> I guess something like SCHED_CPUFREQ_RT etc would be sufficient?

Yeah I think that would work.

thanks,
Steve
Re: [RFC][PATCH 6/7] cpufreq: schedutil: Add iowait boosting
On Mon, Aug 01, 2016 at 01:37:59AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki
>
> Modify the schedutil cpufreq governor to boost the CPU frequency
> if the UUF_IO flag is passed to it via cpufreq_update_util().
>
> If that happens, the frequency is set to the maximum during
> the first update after receiving the UUF_IO flag and then the
> boost is reduced by half during each following update.

Were these changes to schedutil part of the positive test results
mentioned in patch 5? Or are those just from intel_pstate?

I was nervous about the effect of this on power and tested a couple low
power usecases. The platform is the Hikey 96board (8 core ARM A53,
single CPUfreq domain) running AOSP Android and schedutil backported to
kernel 4.4. These tests run mp3 and mpeg4 playback for a little while,
recording total energy consumption during the test along with frequency
residency.

As the results below show I did not measure an appreciable effect - if
anything things may be slightly better with the patches. The hardcoding
of a non-tunable boosting scheme makes me nervous but perhaps it could
be revisited if some platform or configuration shows a noticeable
regression?

Testcase        Energy   /------- CPU frequency residency -------\
                (J)      208000   432000   729000   96       120

mp3-before-1    26.822   47.27%   24.79%   16.23%   5.20%    6.52%
mp3-before-2    26.817   41.70%   28.75%   17.62%   5.17%    6.75%
mp3-before-3    26.65    42.48%   28.65%   17.25%   5.07%    6.55%
mp3-after-1     26.667   42.51%   27.38%   18.00%   5.40%    6.71%
mp3-after-2     26.777   48.37%   24.15%   15.68%   4.55%    7.25%
mp3-after-3     26.806   41.93%   27.71%   18.35%   4.78%    7.35%
mpeg4-before-1  26.024   18.41%   60.09%   13.16%   0.49%    7.85%
mpeg4-before-2  25.147   20.47%   64.80%    8.44%   1.37%    4.91%
mpeg4-before-3  25.007   19.18%   66.08%   10.01%   0.59%    4.22%
mpeg4-after-1   25.598   19.77%   61.33%   11.63%   0.79%    6.48%
mpeg4-after-2   25.18    22.31%   62.78%    8.83%   1.18%    4.90%
mpeg4-after-3   25.162   21.59%   64.88%    8.29%   0.49%    4.71%

thanks,
Steve
Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition
On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
...
> For this purpose, define a new cpufreq_update_util() flag
> UUF_IO and modify enqueue_task_fair() to pass that flag to
> cpufreq_update_util() in the in_iowait case. That generally
> requires cpufreq_update_util() to be called directly from there,
> because update_load_avg() is not likely to be invoked in that
> case.

I didn't follow why the cpufreq hook won't likely be called if in_iowait
is set? AFAICS update_load_avg() gets called in the second loop and
calls update_cfs_rq_load_avg() (which triggers the hook).

>
> Signed-off-by: Rafael J. Wysocki
> ---
>  include/linux/sched.h | 1 +
>  kernel/sched/fair.c   | 8 ++++++++
>  2 files changed, 9 insertions(+)
>
> Index: linux-pm/kernel/sched/fair.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -4459,6 +4459,14 @@ enqueue_task_fair(struct rq *rq, struct
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
>
> +	/*
> +	 * If in_iowait is set, it is likely that the loops below will not
> +	 * trigger any cpufreq utilization updates, so do it here explicitly
> +	 * with the IO flag passed.
> +	 */
> +	if (p->in_iowait)
> +		cpufreq_update_util(rq, UUF_IO);
> +
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
>  			break;
> Index: linux-pm/include/linux/sched.h
> ===================================================================
> --- linux-pm.orig/include/linux/sched.h
> +++ linux-pm/include/linux/sched.h
> @@ -3376,6 +3376,7 @@ static inline unsigned long rlimit_max(u
>  }
>
>  #define UUF_RT 0x01
> +#define UUF_IO 0x02
>
>  #ifdef CONFIG_CPU_FREQ
>  struct update_util_data {

thanks,
Steve
Re: [RFC][PATCH 4/7] cpufreq / sched: Add flags argument to cpufreq_update_util()
On Mon, Aug 01, 2016 at 04:57:18PM +0200, Rafael J. Wysocki wrote:
> On Monday, August 01, 2016 09:33:12 AM Dominik Brodowski wrote:
> > On Mon, Aug 01, 2016 at 01:36:46AM +0200, Rafael J. Wysocki wrote:
> > > +#define UUF_RT 0x01
> >
> > What does UUF stand for?
>
> "Utilization update flag".

I had wondered the same - in my patchset I used CPUFREQ_SCHED_UPDATE_*
for the prefixes, though I guess some may object to the length.

thanks,
Steve
Re: [RFC][PATCH 3/7] cpufreq / sched: Check cpu_of(rq) in cpufreq_update_util()
On Mon, Aug 01, 2016 at 09:29:57AM +0200, Dominik Brodowski wrote:
> A small nitpick:
>
> On Mon, Aug 01, 2016 at 01:36:01AM +0200, Rafael J. Wysocki wrote:
> > --- linux-pm.orig/kernel/sched/sched.h
> > +++ linux-pm/kernel/sched/sched.h
> > @@ -1760,7 +1760,7 @@ DECLARE_PER_CPU(struct update_util_data
> >
> >  /**
> >   * cpufreq_update_util - Take a note about CPU utilization changes.
> > - * @time: Current time.
> > + * @rq: Runqueue to carry out the update for.
> >   *
> >   * This function is called by the scheduler on every invocation of
> >   * update_load_avg() on the CPU whose utilization is being updated.
>
> This comment seems to need an update due to the smp_processor_id() check
> being moved into this function.

The callers of this have also changed - it is no longer called directly
by update_load_avg(), rather via cfs_rq_util_change() from several other
locations (I believe it was my patch that failed to update this
comment).

Could this be replaced with a more generic statement such as "called by
CFS in various paths"?

thanks,
Steve
Re: [RFC][PATCH 1/7] cpufreq / sched: Make schedutil access utilization data directly
On Mon, Aug 01, 2016 at 01:34:36AM +0200, Rafael J. Wysocki wrote:
...
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -144,17 +144,47 @@ static unsigned int get_next_freq(struct
>  	return cpufreq_driver_resolve_freq(policy, freq);
>  }
>
> -static void sugov_update_single(struct update_util_data *hook, u64 time,
> -				unsigned long util, unsigned long max)
> +static void sugov_get_util(unsigned long *util, unsigned long *max)
> +{
> +	unsigned long dl_util, dl_max;
> +	unsigned long cfs_util, cfs_max;
> +	int cpu = smp_processor_id();
> +	struct dl_bw *dl_bw = dl_bw_of(cpu);
> +	struct rq *rq = this_rq();
> +
> +	if (rt_prio(current->prio)) {
> +		*util = ULONG_MAX;
> +		return;
> +	}
> +
> +	dl_max = dl_bw_cpus(cpu) << 20;
> +	dl_util = dl_bw->total_bw;
> +
> +	cfs_max = rq->cpu_capacity_orig;
> +	cfs_util = min(rq->cfs.avg.util_avg, cfs_max);
> +
> +	if (cfs_util * dl_max > dl_util * cfs_max) {
> +		*util = cfs_util;
> +		*max = cfs_max;
> +	} else {
> +		*util = dl_util;
> +		*max = dl_max;
> +	}
> +}

Last Friday I had put together a similar patch based on Peter's. I need
the flags field for the remote wakeup support. My previous plan,
installing a late callback in check_preempt_curr that gets requested
from the earlier existing CFS callback, was not working out since those
two events don't always match up 1:1.

Anyway one way that my patch differed was that I had used the flags
field to keep the behavior the same for both RT and DL. That happens
later on in this series for RT but the DL policy is modified as above.
Can the DL policy be left as-is and discussed/modified in a separate
series?

thanks,
Steve
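As an aside for readers of the quoted hunk: the final if/else picks whichever (util, max) pair has the larger util/max ratio, using cross-multiplication to avoid a division. A standalone sketch of just that comparison (the helper name is hypothetical):

```c
/* Pick the (util, max) pair with the larger util/max ratio.
 * cfs_util/cfs_max > dl_util/dl_max  <=>  cfs_util*dl_max > dl_util*cfs_max
 * when both max values are positive; this mirrors the comparison at
 * the end of the quoted sugov_get_util() without dividing. */
static void pick_larger_ratio(unsigned long cfs_util, unsigned long cfs_max,
			      unsigned long dl_util, unsigned long dl_max,
			      unsigned long *util, unsigned long *max)
{
	if (cfs_util * dl_max > dl_util * cfs_max) {
		*util = cfs_util;
		*max = cfs_max;
	} else {
		*util = dl_util;
		*max = dl_max;
	}
}
```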
Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()
On Fri, Jul 22, 2016 at 08:16:42AM -0700, Viresh Kumar wrote:
> > Long term as I was mentioning in the other thread I think it'd be good
> > if the current target() drivers were modified to supply resolve_freq(),
> > and that cpufreq_register_driver() were again changed to require it for
> > those drivers.
>
> There is no need for us to force this, its really optional for such
> platforms. Worst case, schedutil wouldn't work at the best, so what?
> Its a platform driver's choice, isn't it ?

This would be in the context of then being able to remove the additional
if statement from the hot path.

To reply to the suggestion of using likely() here, I'm partial to
solving it without that because I'm guessing likely() has to be an
architecture-dependent thing?

It seems cleaner to me if the existing few target() drivers were
augmented to provide the resolve_freq() callback and its presence
required. But it's certainly not a big deal and won't affect any
platforms I'm involved with, at least for now. Maybe I could work on
those target() drivers if things change.
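On the likely() question above: in the kernel it is not per-architecture code but a thin wrapper around a compiler builtin. The common definition (from include/linux/compiler.h; requires GCC or Clang) is:

```c
/* The kernel's branch-prediction hints: purely a compiler hint via
 * __builtin_expect, identical on all architectures. The !! forces
 * the expression to 0 or 1 before comparing with the expected value. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)
```

The hint only influences code layout and static branch prediction; it does not change the truth value of the condition.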
Re: linux-next: build failure after merge of the pm tree
On Fri, Jul 22, 2016 at 11:56:20AM +1000, Stephen Rothwell wrote:
> Hi Rafael,
>
> After merging the pm tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
>
> ERROR: "cpufreq_driver_resolve_freq" [kernel/sched/cpufreq_schedutil.ko]
> undefined!
>
> Caused by commit
>
>   5cbea46984d6 ("cpufreq: schedutil: map raw required frequency to driver
>   frequency")
>
> I used the pm tree from next-20160721 for today.

Sorry - I have just sent a patch to address this.

thanks,
Steve
[PATCH] cpufreq: export cpufreq_driver_resolve_freq()
Export cpufreq_driver_resolve_freq() since governors may be compiled as
modules.

Reported-by: Stephen Rothwell <s...@canb.auug.org.au>
Signed-off-by: Steve Muckle <smuc...@linaro.org>
---
 drivers/cpufreq/cpufreq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b696baeb249d..fa0a92d6121e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -514,6 +514,7 @@ unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
 					 CPUFREQ_RELATION_L);
 	return policy->freq_table[policy->cached_resolved_idx].frequency;
 }
+EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq);
 
 /*********************************************************************
  *                          SYSFS INTERFACE                          *
-- 
2.7.3
Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()
On Thu, Jul 21, 2016 at 04:36:48PM -0700, Steve Muckle wrote:
> As another alternative, this could be caught in cpufreq driver
> initialization? I believe you suggested that originally, but I avoided
> it as I didn't want to have to implement resolve_freq() for every
> target() style driver. It sounds like there aren't many though.

Going back and checking I see I was thinking of your suggestion that
cpufreq_register_driver() check that only target() drivers offer a
resolve_freq() callback. I put a comment for this in cpufreq.h but not a
check - I could add a check in another patch if you like.

Long term as I was mentioning in the other thread I think it'd be good
if the current target() drivers were modified to supply resolve_freq(),
and that cpufreq_register_driver() were again changed to require it for
those drivers.
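The registration-time check being discussed could be sketched as below. This is a standalone mock under assumed names (fake_cpufreq_driver, check_driver), not the real cpufreq internals, and the policy shown (target() implies resolve_freq()) is the long-term proposal rather than current behavior:

```c
#include <errno.h>
#include <stddef.h>

/* Standalone mock: only the fields relevant to the check. The real
 * struct cpufreq_driver has many more members, and this helper name
 * is hypothetical. */
struct fake_cpufreq_driver {
	int (*target)(void);			/* old-style target callback */
	unsigned int (*resolve_freq)(void);	/* raw-to-driver freq mapping */
};

/* Reject a target() driver that does not also supply resolve_freq(),
 * so governors like schedutil never hit the NULL-callback branch in
 * the hot path. */
static int check_driver(const struct fake_cpufreq_driver *drv)
{
	if (drv->target && !drv->resolve_freq)
		return -EINVAL;
	return 0;
}

/* Dummy callbacks for demonstration only. */
static int dummy_target(void) { return 0; }
static unsigned int dummy_resolve(void) { return 0; }
```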