[PATCH] selftests: x86: add version check in test_syscall_vdso

2019-02-28 Thread Steve Muckle
Since 4.17 registers r8-r11 are not clobbered/zeroed by a 64-bit kernel
handling a 32-bit syscall and this behavior is enforced by the
test_syscall_vdso testcase. See commit 8bb2610bc496
("x86/entry/64/compat: Preserve r8-r11 in int $0x80").

Permit the old behavior in the testcase for kernels prior to 4.17.

Signed-off-by: Steve Muckle 
---
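A minimal standalone sketch of the runtime gate this patch adds, for reference
only (the helper name old_clobber_ok() is illustrative and not part of the
diff below):

/* Parse "major.minor" from uname() and report whether the pre-4.17
 * clobbering/zeroing of r8-r11 on 32-bit syscalls should be tolerated. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>

static int old_clobber_ok(void)
{
	struct utsname uts;
	int major, minor;

	if (uname(&uts)) {
		perror("uname");
		exit(1);
	}
	if (sscanf(uts.release, "%d.%d", &major, &minor) != 2) {
		fprintf(stderr, "Malformed kernel version %s\n", uts.release);
		exit(1);
	}
	/* r8-r11 are only preserved across int $0x80 since 4.17 */
	return major < 4 || (major == 4 && minor < 17);
}
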
 .../testing/selftests/x86/test_syscall_vdso.c | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/x86/test_syscall_vdso.c 
b/tools/testing/selftests/x86/test_syscall_vdso.c
index c9c3281077bc..f7284dc4c32b 100644
--- a/tools/testing/selftests/x86/test_syscall_vdso.c
+++ b/tools/testing/selftests/x86/test_syscall_vdso.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include <sys/utsname.h>
 #include 
 
 #if !defined(__i386__)
@@ -71,6 +72,7 @@ struct regs64 {
 };
 struct regs64 regs64;
 int kernel_is_64bit;
+int clobber_ok;
 
 asm (
"   .pushsection .text\n"
@@ -130,6 +132,28 @@ void print_regs64(void)
printf("12:%016llx 13:%016llx 14:%016llx 15:%016llx\n", regs64.r12,  
regs64.r13,  regs64.r14,  regs64.r15);
 }
 
+static void get_kernel_version(int *version, int *patchlevel)
+{
+   int ret, sublevel;
+   struct utsname utsname;
+
+   ret = uname(&utsname);
+   if (ret) {
+   perror("uname");
+   exit(1);
+   }
+
+   ret = sscanf(utsname.release, "%d.%d.%d", version, patchlevel,
+   &sublevel);
+   if (ret < 0) {
+   perror("sscanf");
+   exit(1);
+   } else if (ret != 3) {
+   printf("Malformed kernel version %s\n", utsname.release);
+   exit(1);
+   }
+}
+
 int check_regs64(void)
 {
int err = 0;
@@ -166,6 +190,8 @@ int check_regs64(void)
 * Historically (and probably unintentionally), they
 * were clobbered or zeroed.
 */
+   if (clobber_ok && *r64 == 0 && num <= 11)
+   continue;
}
printf("[FAIL]\tR%d has changed:%016llx\n", num, *r64);
err++;
@@ -385,6 +411,7 @@ int main(int argc, char **argv, char **envp)
 {
int exitcode = 0;
int cs;
+   int version, patchlevel;
 
asm("\n"
"   movl%%cs, %%eax\n"
@@ -394,6 +421,9 @@ int main(int argc, char **argv, char **envp)
if (!kernel_is_64bit)
printf("[NOTE]\tNot a 64-bit kernel, won't test R8..R15 
leaks\n");
 
+   get_kernel_version(&version, &patchlevel);
+   clobber_ok = version < 4 || (version == 4 && patchlevel < 17);
+
/* This only works for non-static builds:
 * syscall_addr = dlsym(dlopen("linux-gate.so.1", RTLD_NOW), 
"__kernel_vsyscall");
 */
-- 
2.21.0.352.gf09ad66450-goog



Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-28 Thread Steve Muckle

On 09/27/2018 05:43 PM, Wanpeng Li wrote:

On your CPU4:
scheduler_ipi()
  -> sched_ttwu_pending()
    -> ttwu_do_activate()   => p->sched_remote_wakeup should be false,
                               so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
      -> ttwu_activate()
        -> activate_task()
          -> enqueue_task()
            -> enqueue_task_fair()
              -> enqueue_entity()
                   bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
so renorm is false in enqueue_entity(), why you mentioned that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?


Maybe this is a misunderstanding on my side but didn't you ask me to
'... Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? ...'


Yeah, if the calltrace above and my analysis is correct, then the fair
rq's min_vruntime will not be added to the task's vruntime in your
*later* scenario, which means that your patch is not necessary.


In the scenario I observed, the task is not waking - it is running and 
being deboosted from priority inheritance, transitioning from RT to CFS.


Dietmar and I both were able to reproduce the issue with the testcase I 
posted earlier in this thread.


thanks,
Steve


[tip:sched/core] sched/fair: Fix vruntime_normalized() for remote non-migration wakeup

2018-09-10 Thread tip-bot for Steve Muckle
Commit-ID:  d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe
Gitweb: https://git.kernel.org/tip/d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe
Author: Steve Muckle 
AuthorDate: Fri, 31 Aug 2018 15:42:17 -0700
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 10:13:47 +0200

sched/fair: Fix vruntime_normalized() for remote non-migration wakeup

When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time.

For example if it is boosted to RT priority via rt_mutex_setprio(),
rq->min_vruntime will not be subtracted from the task's vruntime but
it will be added again when the task returns to the fair class. The
task's vruntime will have been erroneously doubled and the effective
priority of the task will be reduced.

Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.

Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().

Based on a similar patch from John Dias .

Suggested-by: Peter Zijlstra 
Tested-by: Dietmar Eggemann 
Signed-off-by: Steve Muckle 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Redpath 
Cc: John Dias 
Cc: Linus Torvalds 
Cc: Miguel de Dios 
Cc: Morten Rasmussen 
Cc: Patrick Bellasi 
Cc: Paul Turner 
Cc: Quentin Perret 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: kernel-t...@android.com
Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on 
migration")
Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuc...@google.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cff8d55ee95..c6b7d6daab20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9644,7 +9644,8 @@ static inline bool vruntime_normalized(struct task_struct 
*p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_remote_wakeup))
return true;
 
return false;


[PATCH] sched/fair: fix vruntime_normalized for remote non-migration wakeup

2018-08-31 Thread Steve Muckle
When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an
error in the task's vruntime if it is switched from the fair class
during this time, for example if it is boosted to RT priority via
rt_mutex_setprio. The rq's min_vruntime will not be subtracted from the
task's vruntime but it will be added again when the task returns to the
fair class. The task's vruntime will have been erroneously doubled and
the effective priority of the task will be reduced.

Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.

Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().

Based on a similar patch from joaod...@google.com.

Suggested-by: Peter Zijlstra 
Signed-off-by: Steve Muckle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..b3b62cf37fb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct 
*p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_remote_wakeup))
return true;
 
return false;
-- 
2.19.0.rc1.350.ge57e33dbd1-goog



Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-31 Thread Steve Muckle

On 08/29/2018 08:33 AM, Dietmar Eggemann wrote:

Yes, this solves the issue for the case I described. Using
'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using
'p->sched_class == &fair_sched_class'.


It's confirmed that this patch solves the original issue we saw (and my 
test case passes for me as well). I will send this version out shortly.


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-24 Thread Steve Muckle

On 08/24/2018 02:47 AM, Peter Zijlstra wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:



When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.



Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
the vruntime_normalized check sees that the task's state is TASK_WAKING,
which results in skipping the subtraction of the rq's min_vruntime
from the task's vruntime
- later, when the prio is deboosted and the task is moved back
to the fair class, the fair rq's min_vruntime is added to
the task's vruntime, even though it wasn't subtracted earlier.


I'm thinking that is an incomplete scenario; where do we get to
TASK_WAKING.


Yes there's a missing bit of context here at the beginning that the task 
to be boosted had already been put into TASK_WAKING.


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-24 Thread Steve Muckle

On 08/23/2018 11:54 PM, Juri Lelli wrote:

I tried to catch this issue on my Arm64 Juno board using pi_test (and a
slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
rt-tests but wasn't able to do so.

# pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs

Starting PI Stress Test
Number of thread groups: 1
Duration of test run: 1 seconds
Number of inversions per group: 1
  Admin thread SCHED_FIFO priority 4
1 groups of 3 threads will be created
   High thread SCHED_FIFO priority 3
Med thread SCHED_FIFO priority 2
Low thread SCHED_OTHER nice 0

# ./pip_stress

In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
from se->vruntime, is called on it before it gets the rt prio.

Maybe it requires a very specific use of the pthread library to provoke this
issue by making sure that the cfs tasks really blocks/sleeps?


Maybe one could play with rt-app to recreate such specific use case?

https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459


This was reproduced for me on tip of mainline by using the program at 
the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant 
annotated bits of the trace:


low-prio thread vruntime is 752ms
  pi-vruntime-tes-598   [001] d...   520.572459: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns]


low-prio thread waits on a_sem
  pi-vruntime-tes-598   [001] d...   520.572465: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> 
next_comm=swapper/1 next_pid=0 next_prio=120


high prio thread finishes wakeup, then sleeps for 1ms
<idle>-0 [000] dNh.   520.572483: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
<idle>-0 [000] d...   520.572486: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597   [000] d...   520.572498: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


high prio thread wakes up after 1ms sleep, posts a_sem which starts to 
wake low-prio thread, then tries to grab pi_mutex, which low-prio thread has
<idle>-0 [000] d.h.   520.573876: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
<idle>-0 [000] dNh.   520.573879: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
<idle>-0 [000] d...   520.573887: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597   [000] d...   520.573895: sched_waking: 
comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001


low-prio thread pid 598 gets pi_mutex priority inheritance, this happens 
while low-prio thread is in waking state
  pi-vruntime-tes-597   [000] d...   520.573911: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19


high-prio thread sleeps on pi_mutex
  pi-vruntime-tes-597   [000] d...   520.573919: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


low-prio thread finishes wakeup
<idle>-0 [001] dNh.   520.573932: sched_wakeup: 
comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001
<idle>-0 [001] d...   520.573936: sched_switch: 
prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=598 next_prio=19


low-prio thread releases pi-mutex, loses pi boost, high-prio thread 
wakes for pi-mutex
  pi-vruntime-tes-598   [001] d...   520.573946: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120
  pi-vruntime-tes-598   [001] dN..   520.573954: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000


low-prio thread vruntime is 1505ms
  pi-vruntime-tes-598   [001] dN..   520.573966: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns]


The timing is quite sensitive since the task being boosted has to be 
caught in the TASK_WAKING state. The program:


/*
  * Test case for vruntime management during rtmutex priority inheritance
  * promotion and demotion.
  *
  * build with -lpthread
  */

#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define ERROR_CHECK(x) \
 if (x) \
 fprintf(stderr, "Error at line %d", __LINE__);

pthread_mutex_t pi_mutex;
sem_t a_sem;
sem_t b_sem;

void *rt_thread_func(void *arg) {
 int policy;
 int i = 0;
 cpu_set_t cpuset;

 CPU_ZERO(&cpuset);
 CPU_SET(0, &cpuset);
 ERROR_CHECK(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
    &cpuset));

 while(i++ < 100) {
 sem_wait(&b_sem);
 

[PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-17 Thread Steve Muckle
From: John Dias 

When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.
Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
  the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
  the vruntime_normalized check sees that the task's state is TASK_WAKING,
  which results in skipping the subtraction of the rq's min_vruntime
  from the task's vruntime
- later, when the prio is deboosted and the task is moved back
  to the fair class, the fair rq's min_vruntime is added to
  the task's vruntime, even though it wasn't subtracted earlier.
The immediate result is inflation of the task's vruntime, giving
it lower priority (starving it if there's enough available work).
The longer-term effect is inflation of all vruntimes because the
task's vruntime becomes the rq's min_vruntime when the higher
priority tasks go idle. That leads to a vicious cycle, where
the vruntime inflation repeatedly doubled.
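
As a rough worked example, using the numbers from the trace posted elsewhere
in this thread: the low-priority task is at vruntime ~752888705 ns, which is
also roughly the cfs_rq's min_vruntime. The boost path skips the min_vruntime
subtraction, the deboost path adds min_vruntime back anyway, and the task
re-enters CFS at ~1505797560 ns, as if it had already consumed an extra
~750 ms of CPU time.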

The change here is to detect when vruntime_normalized is being
called when the task is waking but is waking in another class,
and to conclude that this is a case where vruntime has not
been normalized.

Signed-off-by: John Dias 
Signed-off-by: Steve Muckle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..14011d7929d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct 
*p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
return true;
 
return false;
-- 
2.18.0.865.gffc8e1a3cd6-goog



Re: [RFC] vruntime updated incorrectly when rt_mutex boots prio?

2018-08-13 Thread Steve Muckle

On 08/07/2018 10:40 AM, 'Todd Kjos' via kernel-team wrote:

This issue was discovered on a 4.9-based android device, but the
relevant mainline code appears to be the same. The symptom is that
over time some workloads become sluggish, resulting in missed
frames or sluggishness. It appears to be the same issue described in
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-March/567836.html.

Here is the scenario: A task is deactivated while still in the fair
class. The task is then boosted to RT, so rt_mutex_setprio() is
called. This changes the task to RT and calls check_class_changed(),
which eventually calls detach_task_cfs_rq(), which is where
vruntime_normalized() sees that the task's state is TASK_WAKING, which
results in skipping the subtraction of the rq's min_vruntime from the
task's vruntime. Later, when the prio is deboosted and the task is
moved back to the fair class, the fair rq's min_vruntime is added to
the task's vruntime, resulting in vruntime inflation.


This was reproduced for me on tip of mainline by using the program at 
the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant 
annotated bits of the trace:


low-prio thread vruntime is 752ms
 pi-vruntime-tes-598   [001] d...   520.572459: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns]


low-prio thread waits on a_sem
 pi-vruntime-tes-598   [001] d...   520.572465: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> 
next_comm=swapper/1 next_pid=0 next_prio=120


high prio thread finishes wakeup, then sleeps for 1ms
<idle>-0 [000] dNh.   520.572483: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
<idle>-0 [000] d...   520.572486: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
 pi-vruntime-tes-597   [000] d...   520.572498: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


high prio thread wakes up after 1ms sleep, posts a_sem which starts to 
wake low-prio thread, then tries to grab pi_mutex, which low-prio thread has
<idle>-0 [000] d.h.   520.573876: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
<idle>-0 [000] dNh.   520.573879: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
<idle>-0 [000] d...   520.573887: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
 pi-vruntime-tes-597   [000] d...   520.573895: sched_waking: 
comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001


low-prio thread pid 598 gets pi_mutex priority inheritance, this happens 
while low-prio thread is in waking state
 pi-vruntime-tes-597   [000] d...   520.573911: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19


high-prio thread sleeps on pi_mutex
 pi-vruntime-tes-597   [000] d...   520.573919: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


low-prio thread finishes wakeup
<idle>-0 [001] dNh.   520.573932: sched_wakeup: 
comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001
<idle>-0 [001] d...   520.573936: sched_switch: 
prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=598 next_prio=19


low-prio thread releases pi-mutex, loses pi boost, high-prio thread 
wakes for pi-mutex
 pi-vruntime-tes-598   [001] d...   520.573946: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120
 pi-vruntime-tes-598   [001] dN..   520.573954: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000


low-prio thread vruntime is 1505ms
 pi-vruntime-tes-598   [001] dN..   520.573966: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns]


The program:

/*
 * Test case for vruntime management during rtmutex priority inheritance
 * promotion and demotion.
 *
 * build with -lpthread
 */

#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define ERROR_CHECK(x) \
if (x) \
fprintf(stderr, "Error at line %d", __LINE__);

pthread_mutex_t pi_mutex;
sem_t a_sem;
sem_t b_sem;

void *rt_thread_func(void *arg) {
int policy;
int i = 0;
cpu_set_t cpuset;

CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
ERROR_CHECK(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
   &cpuset));

while(i++ < 100) {
sem_wait(&b_sem);
usleep(1000);
sem_post(&a_sem);
pthread_mutex_lock(&pi_mutex);
pthread_mutex_unlock(&pi_mutex);
}
}

void *low_prio_thread_func(void *arg) {
int i = 0;
cpu_set_t cpuset;

CPU_ZERO(&cpuset);

Re: [PATCH 06/24] selftests: exec: return Kselftest Skip code for skipped tests

2018-05-07 Thread Steve Muckle

On 05/04/2018 06:13 PM, Shuah Khan (Samsung OSG) wrote:

When execveat test is skipped because of unmet dependencies and/or
unsupported configuration, it exits with error which is treated as
a fail by the Kselftest framework. This leads to false negative
result even when the test could not be run.

Change it to return kselftest skip code when a test gets skipped to
clearly report that the test could not be run.

Change it to use ksft_exit_skip() when kernel doesn't support execveat.

Signed-off-by: Shuah Khan (Samsung OSG) 
---
  tools/testing/selftests/exec/execveat.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c 
b/tools/testing/selftests/exec/execveat.c
index 67cd4597db2b..47cbf54d0801 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -20,6 +20,8 @@
  #include 
  #include 
  
+#include "../kselftest.h"
+
  static char longpath[2 * PATH_MAX] = "";
  static char *envp[] = { "IN_TEST=yes", NULL, NULL };
  static char *argv[] = { "execveat", "99", NULL };
@@ -249,8 +251,8 @@ static int run_tests(void)
errno = 0;
execveat_(-1, NULL, NULL, NULL, 0);
if (errno == ENOSYS) {
-   printf("[FAIL] ENOSYS calling execveat - no kernel support?\n");
-   return 1;
+   ksft_exit_skip(
+   "ENOSYS calling execveat - no kernel support?\n");
}
  
  	/* Change file position to confirm it doesn't affect anything */



LGTM.

thanks,
Steve


Re: [PATCH RFC 2/5] sched/fair: Skip frequency update if CPU about to idle

2017-11-01 Thread Steve Muckle

On 10/30/2017 12:02 PM, Joel Fernandes wrote:

Also, this more looks like a policy decision. Will it be better to
put that directly into schedutil? Like this:

 if (cpu_idle())
 "Don't change the freq";

Will something like that work?


I thought about this and I think it won't work very well. In the
dequeue path we're still running the task being dequeued so the CPU is
not yet idle. What is needed here IMO is a notion that the CPU is
possibly about to idle, and we can predict that from the dequeue
path of the CFS class.

Also just looking at whether the CPU is currently idle or not in the
governor doesn't help to differentiate between say the dequeue path /
tick path. Both of which can occur when the CPU is not idle.

Any thoughts about this?


Also if it really is the case that this bit of policy is universally 
desirable, I'd think it is better to do this in the scheduler and avoid 
a needless trip through a fn pointer out to schedutil for performance 
reasons.


[PATCH v2] selftests/exec: include cwd in long path calculation

2017-10-12 Thread Steve Muckle
When creating a pathname close to PATH_MAX to test execveat, factor in
the current working directory path otherwise we end up with an absolute
path that is longer than PATH_MAX. While execveat() may succeed, subsequent
calls to the kernel from the runtime environment which are required to
successfully execute the test binary/script may fail because of this.

To keep the semantics of the test the same, rework the relative pathname
part of the test to be relative to the root directory so it isn't
decreased by the length of the current working directory path.

Signed-off-by: Steve Muckle <smuckle.li...@gmail.com>
---
Differences from v1:
 - free allocation from getcwd()
 - update comment in check_execveat_pathmax()
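
A worked example of the length arithmetic (assuming PATH_MAX = 4096 and a
20-byte cwd): previously the relative name was built out to (PATH_MAX - 3) =
4093 bytes, so the absolute path came to 20 + 1 + 4093 = 4114 bytes and
exceeded PATH_MAX even when execveat() itself succeeded. With this patch,
count and len are derived from (PATH_MAX - 3 - strlen(cwd)) = 4073, so the
cwd-prefixed longpath is 20 + 1 + 4073 = 4094 bytes and stays within PATH_MAX.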

 tools/testing/selftests/exec/execveat.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c 
b/tools/testing/selftests/exec/execveat.c
index 8d5d1d2ee7c1..67cd4597db2b 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
 }
 
 #define XX_DIR_LEN 200
-static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
 {
int fail = 0;
int ii, count, len;
@@ -156,20 +156,30 @@ static int check_execveat_pathmax(int dot_dfd, const char 
*src, int is_script)
 
if (*longpath == '\0') {
/* Create a filename close to PATH_MAX in length */
+   char *cwd = getcwd(NULL, 0);
+
+   if (!cwd) {
+   printf("Failed to getcwd(), errno=%d (%s)\n",
+  errno, strerror(errno));
+   return 2;
+   }
+   strcpy(longpath, cwd);
+   strcat(longpath, "/");
memset(longname, 'x', XX_DIR_LEN - 1);
longname[XX_DIR_LEN - 1] = '/';
longname[XX_DIR_LEN] = '\0';
-   count = (PATH_MAX - 3) / XX_DIR_LEN;
+   count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
for (ii = 0; ii < count; ii++) {
strcat(longpath, longname);
mkdir(longpath, 0755);
}
-   len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+   len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
if (len <= 0)
len = 1;
memset(longname, 'y', len);
longname[len] = '\0';
strcat(longpath, longname);
+   free(cwd);
}
exe_cp(src, longpath);
 
@@ -190,7 +200,7 @@ static int check_execveat_pathmax(int dot_dfd, const char 
*src, int is_script)
}
 
/*
-* Execute as a long pathname relative to ".".  If this is a script,
+* Execute as a long pathname relative to "/".  If this is a script,
 * the interpreter will launch but fail to open the script because its
 * name ("/dev/fd/5/xxx") is bigger than PATH_MAX.
 *
@@ -200,10 +210,10 @@ static int check_execveat_pathmax(int dot_dfd, const char 
*src, int is_script)
 * the exit status shall be 126."), so allow either.
 */
if (is_script)
-   fail += check_execveat_invoked_rc(dot_dfd, longpath, 0,
+   fail += check_execveat_invoked_rc(root_dfd, longpath + 1, 0,
  127, 126);
else
-   fail += check_execveat(dot_dfd, longpath, 0);
+   fail += check_execveat(root_dfd, longpath + 1, 0);
 
return fail;
 }
@@ -218,6 +228,7 @@ static int run_tests(void)
int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
   O_DIRECTORY|O_RDONLY);
int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+   int root_dfd = open_or_die("/", O_DIRECTORY|O_RDONLY);
int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
int fd = open_or_die("execveat", O_RDONLY);
@@ -353,8 +364,8 @@ static int run_tests(void)
/* Attempt to execute relative to non-directory => ENOTDIR */
fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
 
-   fail += check_execveat_pathmax(dot_dfd, "execveat", 0);
-   fail += check_execveat_pathmax(dot_dfd, "script", 1);
+   fail += check_execveat_pathmax(root_dfd, "execveat", 0);
+   fail += check_execveat_pathmax(root_dfd, "script", 1);
return fail;
 }
 
-- 
2.11.0



Re: [PATCH] selftests/exec: include cwd in long path calculation

2017-10-12 Thread Steve Muckle

Thanks David for the review. Replies inline.

On 10/12/2017 05:22 AM, David Drysdale wrote:

Modulo the minor comment below:

Reviewed-by: David Drysdale <drysd...@google.com>

On Thu, Oct 12, 2017 at 1:40 AM, Steve Muckle <smuckle.li...@gmail.com> wrote:

When creating a pathname close to PATH_MAX to test execveat, factor in
the current working directory path otherwise we end up with an absolute
path that is longer than PATH_MAX. While execveat() may succeed, subsequent
calls to the kernel from the runtime environment which are required to
successfully execute the test binary/script may fail because of this.

To keep the semantics of the test the same, rework the relative pathname
part of the test to be relative to the root directory so it isn't
decreased by the length of the current working directory path.

Signed-off-by: Steve Muckle <smuckle.li...@gmail.com>
---
  tools/testing/selftests/exec/execveat.c | 24 +---
  1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c 
b/tools/testing/selftests/exec/execveat.c
index 8d5d1d2ee7c1..5edc609c778b 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
  }

  #define XX_DIR_LEN 200
-static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
  {
 int fail = 0;
 int ii, count, len;
@@ -156,15 +156,24 @@ static int check_execveat_pathmax(int dot_dfd, const char 
*src, int is_script)

 if (*longpath == '\0') {
 /* Create a filename close to PATH_MAX in length */
+   char *cwd = getcwd(NULL, 0);


cwd never gets freed, but that's presumably not a problem unless these
tests ever get run under some kind of leak checker.


Yeah I found some other instances of this in the test (specifically the 
calls to realpath() at the top of run_tests() which alloc their own 
buffer) and figured for a small unit test like this it had been decided 
it wasn't worth freeing stuff. I'll tidy this up anyway though.
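
Concretely, the tidy-up is just to free the buffer once the last
strlen(cwd) use in this block is done, e.g. (sketch only, not what is in
the posted patch):

		len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
		free(cwd);	/* longpath already holds its own copy of cwd */
		if (len <= 0)
			len = 1;
		memset(longname, 'y', len);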





+
+   if (!cwd) {
+   printf("Failed to getcwd(), errno=%d (%s)\n",
+  errno, strerror(errno));
+   return 2;
+   }
+   strcpy(longpath, cwd);
+   strcat(longpath, "/");
 memset(longname, 'x', XX_DIR_LEN - 1);
 longname[XX_DIR_LEN - 1] = '/';
 longname[XX_DIR_LEN] = '\0';
-   count = (PATH_MAX - 3) / XX_DIR_LEN;
+   count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
 for (ii = 0; ii < count; ii++) {
 strcat(longpath, longname);
 mkdir(longpath, 0755);
 }
-   len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+   len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
 if (len <= 0)
 len = 1;
 memset(longname, 'y', len);


There's a missing comment update around here-ish:
   * Execute as a long pathname relative to ".".
(should now be 'relative to "/"' I think).


Will fix.

thanks,
Steve


[PATCH] selftests/exec: include cwd in long path calculation

2017-10-11 Thread Steve Muckle
When creating a pathname close to PATH_MAX to test execveat, factor in
the current working directory path otherwise we end up with an absolute
path that is longer than PATH_MAX. While execveat() may succeed, subsequent
calls to the kernel from the runtime environment which are required to
successfully execute the test binary/script may fail because of this.

To keep the semantics of the test the same, rework the relative pathname
part of the test to be relative to the root directory so it isn't
decreased by the length of the current working directory path.

Signed-off-by: Steve Muckle <smuckle.li...@gmail.com>
---
 tools/testing/selftests/exec/execveat.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c 
b/tools/testing/selftests/exec/execveat.c
index 8d5d1d2ee7c1..5edc609c778b 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
 }
 
 #define XX_DIR_LEN 200
-static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
 {
int fail = 0;
int ii, count, len;
@@ -156,15 +156,24 @@ static int check_execveat_pathmax(int dot_dfd, const char 
*src, int is_script)
 
if (*longpath == '\0') {
/* Create a filename close to PATH_MAX in length */
+   char *cwd = getcwd(NULL, 0);
+
+   if (!cwd) {
+   printf("Failed to getcwd(), errno=%d (%s)\n",
+  errno, strerror(errno));
+   return 2;
+   }
+   strcpy(longpath, cwd);
+   strcat(longpath, "/");
memset(longname, 'x', XX_DIR_LEN - 1);
longname[XX_DIR_LEN - 1] = '/';
longname[XX_DIR_LEN] = '\0';
-   count = (PATH_MAX - 3) / XX_DIR_LEN;
+   count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
for (ii = 0; ii < count; ii++) {
strcat(longpath, longname);
mkdir(longpath, 0755);
}
-   len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+   len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
if (len <= 0)
len = 1;
memset(longname, 'y', len);
@@ -200,10 +209,10 @@ static int check_execveat_pathmax(int dot_dfd, const char 
*src, int is_script)
 * the exit status shall be 126."), so allow either.
 */
if (is_script)
-   fail += check_execveat_invoked_rc(dot_dfd, longpath, 0,
+   fail += check_execveat_invoked_rc(root_dfd, longpath + 1, 0,
  127, 126);
else
-   fail += check_execveat(dot_dfd, longpath, 0);
+   fail += check_execveat(root_dfd, longpath + 1, 0);
 
return fail;
 }
@@ -218,6 +227,7 @@ static int run_tests(void)
int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
   O_DIRECTORY|O_RDONLY);
int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+   int root_dfd = open_or_die("/", O_DIRECTORY|O_RDONLY);
int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
int fd = open_or_die("execveat", O_RDONLY);
@@ -353,8 +363,8 @@ static int run_tests(void)
/* Attempt to execute relative to non-directory => ENOTDIR */
fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
 
-   fail += check_execveat_pathmax(dot_dfd, "execveat", 0);
-   fail += check_execveat_pathmax(dot_dfd, "script", 1);
+   fail += check_execveat_pathmax(root_dfd, "execveat", 0);
+   fail += check_execveat_pathmax(root_dfd, "script", 1);
return fail;
 }
 
-- 
2.15.0.rc0.271.g36b669edcc-goog



Android kernel configs (was Re: configs)

2017-09-20 Thread Steve Muckle

On 09/19/2017 10:08 AM, John Stultz wrote:

So what I was thinking of to improve the developer usability might be
the following:

1) Leave the upstream configs in place.  We can try to keep
maintaining them, but it's not something the Android team is likely to
care greatly about, and hopefully can be ignored. Meanwhile for the
folks tinkering with upstream, there's something that helps them
figure things out. When the next android-x.y tree comes out, you can
simply remove it.

2) To the android-x.y branches, add a script to the kernel/configs/
dir to fetch the current config directly from the config repo. This
way it's simple for folks to get the right thing w/o looking anything
up. This won't fly upstream, but should be fine for common.git

Thoughts?


LGTM. If anyone has concerns let me know, otherwise I'll proceed with #2.

thanks,
Steve


Re: [PATCH RFC RESEND v2 0/2] Prevent cpufreq update for only task on rq that sleeps

2017-09-07 Thread Steve Muckle

On 09/07/2017 09:14 AM, Joel Fernandes wrote:

I'm planning to rebase this series on Linus's master and post it
again, but just checking any thoughts about it?

Just to add more context, the reason for not updating the frequency:

- When a last dequeue of a sleeping task happens, it is sufficient to
update utilization without updating the frequency because if other
CPUs are busy then their updates will consider the utilization of the
idle CPU in the shared policy unless sufficient time has passed.

- If the last dequeue of a sleeping task happens while all other CPUs
in the cluster are idle, then the cluster will likely enter
cluster-idle soon.


To clarify - when you say "last dequeue of a sleeping task happens" 
above, you're referring to the dequeue of the last task running on the 
CPU, correct? I.e. the CPU is about to go idle?


It's been a while since I've looked at this area so I'd like to hold 
off for a rebased version to review in further detail. But I think the 
concept is valid.


thanks,
steve


Re: [PATCH V2 0/4] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task

2016-11-23 Thread Steve Muckle
Hi Viresh,

On Tue, Nov 15, 2016 at 01:53:19PM +0530, Viresh Kumar wrote:
> This work was started by Steve Muckle, where he used a simple kthread
> instead of kthread-worker and that wasn't sufficient as some guarantees
> weren't met.

I know this has already gone in, but can you expand on the unmet
guarantees mentioned here just for my own (and perhaps others')
understanding?

thanks,
Steve


Re: [PATCH 2/3] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task

2016-11-13 Thread Steve Muckle
On Fri, Nov 11, 2016 at 11:16:59PM +0100, Rafael J. Wysocki wrote:
> > +   struct sched_param param = { .sched_priority = 50 };
> 
> I'd define a symbol for the 50.  It's just one extra line of code ...

A minor point for sure, but in general what's the motivation for
defining symbols for things which are only used once? It makes it harder
to untangle and learn a piece of code IMO, having to jump to the
definitions of them to see what they are.

thanks,
Steve


Re: [PATCH 2/3] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task

2016-11-13 Thread Steve Muckle
On Sun, Nov 13, 2016 at 03:37:18PM +0100, Rafael J. Wysocki wrote:
> > Hold on a sec. I thought during LPC someone (Peter?) made a point that when
> > RT thread run, we should bump the frequency to max? So, schedutil is going
> > to trigger schedutil to bump up the frequency to max, right?
> 
> No, it isn't, or at least that is unlikely.
> 
> sugov_update_commit() sets sg_policy->work_in_progress before queuing
> the IRQ work and it is not cleared until the frequency changes in
> sugov_work().
> 
> OTOH, sugov_should_update_freq() checks sg_policy->work_in_progress
> upfront and returns false when it is set, so the governor won't see
> its own worker threads run, unless I'm overlooking something highly
> non-obvious.

FWIW my intention with the original version of this patch (which I
neglected to communicate to Viresh) was that it would depend on changing
the frequency policy for RT. I had been using rt_avg. It sounds like
during LPC there were talks of using another metric.

It does appear things would work okay without that, but it also seems
a bit fragile. There's a window between when work_in_progress gets
cleared and when the RT kthread yields. I have not thought through the
various scenarios there to see what is possible, or tested whether the
effect is significant enough to impact power-sensitive platforms.

thanks,
Steve


Re: [RFC/RFT][PATCH 0/4] cpufreq / sched: iowait boost in intel_pstate and schedutil

2016-09-08 Thread Steve Muckle
On Wed, Sep 07, 2016 at 05:35:50PM -0700, Srinivas Pandruvada wrote:
> Did you see any performance regression on Android workloads?

I did a few AnTuTU runs and did not observe a regression.

thanks,
Steve


Re: [RFC/RFT][PATCH 0/4] cpufreq / sched: iowait boost in intel_pstate and schedutil

2016-09-07 Thread Steve Muckle
On Sat, Sep 03, 2016 at 02:56:48AM +0200, Rafael J. Wysocki wrote:
> Please let me know what you think and if you can run some benchmarks you
> care about and see if the changes make any difference (this way or another),
> please do that and let me know what you've found.

LGTM (I just reviewed the first and last patch, skipping the
intel_pstate ones).

I was unable to see a conclusive power regression in Android audio, video or
idle usecases on my hikey 96board.

thanks,
Steve


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-09-01 Thread Steve Muckle
On Wed, Aug 31, 2016 at 06:00:02PM +0100, Juri Lelli wrote:
> > Another problem is that we have many semi related knobs; we have the
> > global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> > should)
> 
> Maybe we could create this sort of link when using the cgroup RT
> throttling interface as well? It should still then fit well once we
> replace the underlying mechanism with DL reservations. And, AFAIK, the
> interface is used by Android folks already.

I'm not sure how the upper bounds can be used to infer CPU frequency...
On my Nexus 6p (an Android device), the global RT runtime limit
seems to be set at 950ms/1sec, the root cgroup is set to 800ms/1sec, and
bg_non_interactive is set at 700ms/1sec.



Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Steve Muckle
On Wed, Aug 31, 2016 at 04:39:07PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 11:40:48AM -0700, Steve Muckle wrote:
> > A policy of going to fmax on any RT activity will be detrimental
> > for power on many platforms. Often RT accounts for only a small amount
> > of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> > still, some platforms may not be able to even complete the CPU frequency
> > change before the RT activity has already completed.
> > 
> > Cpufreq governors have not treated RT activity this way in the past so
> > it is not part of the expected semantics of the RT scheduling class. The
> > DL class offers guarantees about task completion and could be used for
> > this purpose.
> 
> Not entirely true. People have simply disabled cpufreq because of this.
>
> Yes, RR/FIFO are a pain, but they should still be deterministic, and
> variable cpufreq destroys that.

That is the way it's been with cpufreq and many systems (including all
mobile devices) rely on that to not destroy power. RT + variable cpufreq
is not deterministic.

Given we don't have good constraints on RT tasks I don't think we should
try to strengthen the semantics there. Folks should either move to DL if
they want determinism *and* not-sucky power, or continue disabling
cpufreq if they are able to do so.

> I realize that the fmax thing is annoying, but I'm not seeing how rt_avg
> is much better.

Rt_avg is much closer to the current behavior offered by the most
commonly used cpufreq governors since it tracks actual CPU utilization.
Power is not impacted by minimal RT activity and the frequency is raised
if RT activity is high.



Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Steve Muckle
On Wed, Aug 31, 2016 at 03:31:07AM +0200, Rafael J. Wysocki wrote:
> On Friday, August 26, 2016 11:40:48 AM Steve Muckle wrote:
> > A policy of going to fmax on any RT activity will be detrimental
> > for power on many platforms. Often RT accounts for only a small amount
> > of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> > still, some platforms may not be able to even complete the CPU frequency
> > change before the RT activity has already completed.
> > 
> > Cpufreq governors have not treated RT activity this way in the past so
> > it is not part of the expected semantics of the RT scheduling class. The
> > DL class offers guarantees about task completion and could be used for
> > this purpose.
> > 
> > Modify the schedutil algorithm to instead use rt_avg as an estimate of
> > RT utilization of the CPU.
> > 
> > Based on previous work by Vincent Guittot <vincent.guit...@linaro.org>.
> 
> If we do it for RT, why not to do a similar thing for DL?  As in the
> original patch from Peter, for example?

Agreed DL should have a similar change. I think that could be done in a
separate patch. I also would need to discuss it with the deadline sched
devs to fully understand the metric used there.

> 
> > Signed-off-by: Steve Muckle <smuc...@linaro.org>
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 26 +-
> >  1 file changed, 17 insertions(+), 9 deletions(-)
> > 
> > diff --git a/kernel/sched/cpufreq_schedutil.c 
> > b/kernel/sched/cpufreq_schedutil.c
> > index cb8a77b1ef1b..89094a466250 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -146,13 +146,21 @@ static unsigned int get_next_freq(struct sugov_cpu 
> > *sg_cpu, unsigned long util,
> >  
> >  static void sugov_get_util(unsigned long *util, unsigned long *max)
> >  {
> > -   struct rq *rq = this_rq();
> > -   unsigned long cfs_max;
> > +   int cpu = smp_processor_id();
> > +   struct rq *rq = cpu_rq(cpu);
> > +   unsigned long max_cap, rt;
> > +   s64 delta;
> >  
> > -   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> > +   max_cap = arch_scale_cpu_capacity(NULL, cpu);
> >  
> > -   *util = min(rq->cfs.avg.util_avg, cfs_max);
> > -   *max = cfs_max;
> > +   delta = rq_clock(rq) - rq->age_stamp;
> > +   if (unlikely(delta < 0))
> > +   delta = 0;
> > +   rt = div64_u64(rq->rt_avg, sched_avg_period() + delta);
> > +   rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;
> 
> These computations are rather heavy, so I wonder if they are avoidable based
> on the flags, for example?

Yeah the div is bad. I don't know that we can avoid it based on the
flags because rt_avg will decay during CFS activity and you'd want to
take note of that.

One way to make this a little better is to assume that the divisor,
sched_avg_period() + delta, fits into 32 bits so that div_u64 can be
used, which I believe is less bad. Doing that means placing a
restriction on how large sysctl_sched_time_avg (which determines
sched_avg_period()) can be, a max of 4.2 seconds I think. I don't know
that anyone uses a value that large anyway but there's currently no
limit on it.

Another option would be just adding another separate metric to track rt
activity that is more mathematically favorable to deal with.

Both these seemed potentially heavy handed so I figured I'd just start
with the obvious, if suboptimal, solution...
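
For reference, the div_u64 variant would look something like the sketch
below. It's untested and leans on the assumption that sched_avg_period()
+ delta always fits in 32 bits, i.e. that sysctl_sched_time_avg gets
capped at roughly 4.2 seconds:

static void sugov_get_util(unsigned long *util, unsigned long *max)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	unsigned long max_cap, rt;
	u32 period;
	s64 delta;

	max_cap = arch_scale_cpu_capacity(NULL, cpu);

	delta = rq_clock(rq) - rq->age_stamp;
	if (unlikely(delta < 0))
		delta = 0;
	/* assumed to fit in 32 bits so the cheaper div_u64() can be used */
	period = sched_avg_period() + delta;
	rt = div_u64(rq->rt_avg, period);
	rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;

	*util = min(rq->cfs.avg.util_avg + rt, max_cap);
	*max = max_cap;
}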

> Plus is SCHED_CAPACITY_SHIFT actually defined for all architectures?

Yes.

> One more ugly thing is about using rq_clock(rq) directly from here whereas we
> pass it around as the 'time' argument elsewhere.

Sure I'll clean this up.

> 
> > +
> > +   *util = min(rq->cfs.avg.util_avg + rt, max_cap);
> > +   *max = max_cap;
> >  }


[PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-26 Thread Steve Muckle
A policy of going to fmax on any RT activity will be detrimental
for power on many platforms. Often RT accounts for only a small amount
of CPU activity so sending the CPU frequency to fmax is overkill. Worse
still, some platforms may not be able to even complete the CPU frequency
change before the RT activity has already completed.

Cpufreq governors have not treated RT activity this way in the past so
it is not part of the expected semantics of the RT scheduling class. The
DL class offers guarantees about task completion and could be used for
this purpose.

Modify the schedutil algorithm to instead use rt_avg as an estimate of
RT utilization of the CPU.

Based on previous work by Vincent Guittot <vincent.guit...@linaro.org>.

Signed-off-by: Steve Muckle <smuc...@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index cb8a77b1ef1b..89094a466250 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -146,13 +146,21 @@ static unsigned int get_next_freq(struct sugov_cpu 
*sg_cpu, unsigned long util,
 
 static void sugov_get_util(unsigned long *util, unsigned long *max)
 {
-   struct rq *rq = this_rq();
-   unsigned long cfs_max;
+   int cpu = smp_processor_id();
+   struct rq *rq = cpu_rq(cpu);
+   unsigned long max_cap, rt;
+   s64 delta;
 
-   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
+   max_cap = arch_scale_cpu_capacity(NULL, cpu);
 
-   *util = min(rq->cfs.avg.util_avg, cfs_max);
-   *max = cfs_max;
+   delta = rq_clock(rq) - rq->age_stamp;
+   if (unlikely(delta < 0))
+   delta = 0;
+   rt = div64_u64(rq->rt_avg, sched_avg_period() + delta);
+   rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;
+
+   *util = min(rq->cfs.avg.util_avg + rt, max_cap);
+   *max = max_cap;
 }
 
 static void sugov_update_single(struct update_util_data *hook, u64 time,
@@ -167,7 +175,7 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
if (!sugov_should_update_freq(sg_policy, time))
return;
 
-   if (flags & SCHED_CPUFREQ_RT_DL) {
+   if (flags & SCHED_CPUFREQ_DL) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
@@ -186,7 +194,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu 
*sg_cpu,
u64 last_freq_update_time = sg_policy->last_freq_update_time;
unsigned int j;
 
-   if (flags & SCHED_CPUFREQ_RT_DL)
+   if (flags & SCHED_CPUFREQ_DL)
return max_f;
 
for_each_cpu(j, policy->cpus) {
@@ -209,7 +217,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu 
*sg_cpu,
if (delta_ns > TICK_NSEC)
continue;
 
-   if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+   if (j_sg_cpu->flags & SCHED_CPUFREQ_DL)
return max_f;
 
j_util = j_sg_cpu->util;
@@ -467,7 +475,7 @@ static int sugov_start(struct cpufreq_policy *policy)
if (policy_is_shared(policy)) {
sg_cpu->util = 0;
sg_cpu->max = 0;
-   sg_cpu->flags = SCHED_CPUFREQ_RT;
+   sg_cpu->flags = SCHED_CPUFREQ_DL;
sg_cpu->last_update = 0;
sg_cpu->cached_raw_freq = 0;
cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
-- 
2.7.3



[PATCH 0/2] utilization changes for schedutil

2016-08-26 Thread Steve Muckle
The first patch (which is a re-send) corrects an issue with utilization on SMT
platforms. Currently CFS utilization is being roughly doubled on these
platforms due to differences in the way PELT and overall max CPU capacity are
calculated.

The second patch moves away from the policy of going to fmax during RT task
activity, instead using rt_avg as an estimate of RT utilization.

Steve Muckle (2):
  sched: cpufreq: ignore SMT when determining max cpu capacity
  sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

 kernel/sched/cpufreq_schedutil.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

-- 
2.7.3



[PATCH 1/2] sched: cpufreq: ignore SMT when determining max cpu capacity

2016-08-26 Thread Steve Muckle
PELT does not consider SMT when scaling its utilization values via
arch_scale_cpu_capacity(). The value in rq->cpu_capacity_orig does
take SMT into consideration though and therefore may be smaller than
the utilization reported by PELT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024. This means that a 50% utilized CPU will show
up in schedutil as ~86% busy.

Fix this by using the same CPU scaling value in schedutil as that which
is used by PELT.

Signed-off-by: Steve Muckle <smuc...@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 60d985f4dc47..cb8a77b1ef1b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -147,7 +147,9 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, 
unsigned long util,
 static void sugov_get_util(unsigned long *util, unsigned long *max)
 {
struct rq *rq = this_rq();
-   unsigned long cfs_max = rq->cpu_capacity_orig;
+   unsigned long cfs_max;
+
+   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
 
*util = min(rq->cfs.avg.util_avg, cfs_max);
*max = cfs_max;
-- 
2.7.3



[PATCH] sched: cpufreq: ignore SMT when determining max cpu capacity

2016-08-19 Thread Steve Muckle
PELT does not consider SMT when scaling its utilization values via
arch_scale_cpu_capacity(). The value in rq->cpu_capacity_orig does
take SMT into consideration though and therefore may be smaller than
the utilization reported by PELT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024. This means that a 50% utilized CPU will show
up in schedutil as ~86% busy.

Fix this by using the same CPU scaling value in schedutil as that which
is used by PELT.

Signed-off-by: Steve Muckle <smuc...@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 60d985f4dc47..cb8a77b1ef1b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -147,7 +147,9 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, 
unsigned long util,
 static void sugov_get_util(unsigned long *util, unsigned long *max)
 {
struct rq *rq = this_rq();
-   unsigned long cfs_max = rq->cpu_capacity_orig;
+   unsigned long cfs_max;
+
+   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
 
*util = min(rq->cfs.avg.util_avg, cfs_max);
*max = cfs_max;
-- 
2.7.3



Re: [PATCH] sched: fix incorrect PELT values on SMT

2016-08-19 Thread Steve Muckle
On Fri, Aug 19, 2016 at 04:00:57PM +0100, Dietmar Eggemann wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 61d485421bed..95d34b337152 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg 
> > *sa,
> > sa->last_update_time = now;
> >  
> > scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > -   scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +   scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu);
> 
> Wouldn't you have to subscribe to this rcu pointer rq->sd w/ something
> like 'rcu_dereference(cpu_rq(cpu)->sd)'?
> 
> IMHO, __update_load_avg() is called outside existing RCU read-side
> critical sections as well so there would be a pair of
> rcu_read_lock()/rcu_read_unlock() required in this case.

Thanks Dietmar for the review.

Yeah I didn't consider that this was protected with rcu. It looks like
I'm abandoning this approach anyway though and doing something limited
just to schedutil.
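
For the record, my understanding of the RCU-safe version being suggested
is roughly the below (hypothetical only, since I'm not pursuing this
approach):

	struct sched_domain *sd;

	rcu_read_lock();
	sd = rcu_dereference(cpu_rq(cpu)->sd);
	scale_cpu = arch_scale_cpu_capacity(sd, cpu);
	rcu_read_unlock();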

thanks,
Steve


Re: [PATCH] sched: fix incorrect PELT values on SMT

2016-08-19 Thread Steve Muckle
On Fri, Aug 19, 2016 at 04:30:39PM +0100, Morten Rasmussen wrote:
> Hi Steve,
> 
> On Thu, Aug 18, 2016 at 06:55:41PM -0700, Steve Muckle wrote:
> > PELT scales its util_sum and util_avg values via
> > arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> > domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> > is set. PELT does not pass in the sd however. The other caller of
> > arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> > util_sum and util_avg scale beyond the CPU capacity on SMT.
> > 
> > On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> > util_avg scales up to 1024.
> 
> I can't convince myself whether this is the right thing to do. SMT is a
> bit 'special' and it depends on how you model SMT capacity.
> 
> I'm no SMT expert, but the way I understand the current SMT capacity
> model is that capacity_orig represents the capacity of the SMT-thread
> when all its thread-siblings are busy. The true capacity of an
> SMT-thread where all thread-siblings are idle is actually 1024, but we
> don't model this (it would be a nightmare to track when the capacity
> should change). The capacity of a core with two or more SMT-threads is
> chosen to be 1024 + smt_gain, where smt_gain is supposed to represent the
> additional throughput we gain for the additional SMT-threads. The reason
> why we don't have 1024 per thread is that we would prefer to have only
> one task per core if possible.
> 
> With util_avg scaling to 1024 a core (capacity = 2*589) would be nearly
> 'full' with just one always-running task. If we change util_avg to max
> out at 589, it would take two always-running tasks for the combined
> utilization to match the core capacity. So we may lose some bias
> towards spreading for SMT systems.
> 
> AFAICT, group_is_overloaded() and group_has_capacity() would both be
> affected by this patch.
> 
> Interestingly, Vincent recently proposed to set the SMT-thread capacity
> to 1024 which would effectively make all the current SMT code redundant.
> It would make things a lot simpler, but I'm not sure if we can get away
> with it. It would need discussion at least.
> 
> Opinions?

Thanks for having a look.

The reason I pushed this patch was to address an issue with the
schedutil governor - demand is effectively doubled on SMT systems due to
the above scheme. But this can just be fixed for schedutil by using a
max value there consistent with what __update_load_avg() is using. I'll send
another patch. It looks like there's a good reason for the current PELT
scaling w.r.t. SMT in the scheduler/load balancer.
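
Concretely, the schedutil-only change amounts to deriving the max from
the same scale that __update_load_avg() uses, i.e. in sugov_get_util():

	unsigned long cfs_max;

	cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());

	*util = min(rq->cfs.avg.util_avg, cfs_max);
	*max = cfs_max;

(This is what the separate "sched: cpufreq: ignore SMT when determining
max cpu capacity" patch does.)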

thanks,
Steve



Re: [PATCH] sched: fix incorrect PELT values on SMT

2016-08-18 Thread Steve Muckle
On Fri, Aug 19, 2016 at 10:30:36AM +0800, Wanpeng Li wrote:
> 2016-08-19 9:55 GMT+08:00 Steve Muckle <steve.muc...@linaro.org>:
> > PELT scales its util_sum and util_avg values via
> > arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> > domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> > is set. PELT does not pass in the sd however. The other caller of
> > arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> > util_sum and util_avg scale beyond the CPU capacity on SMT.
> >
> > On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> > util_avg scales up to 1024.
> >
> > Fix this by passing in the sd in __update_load_avg() as well.
> 
> I believe we notice this at least several months ago.
> https://lkml.org/lkml/2016/5/25/228

Glad to see I'm not alone in thinking this is an issue.

It causes an issue with schedutil, effectively doubling the apparent
demand on SMT. I don't know the load balance code well enough offhand to
say whether it's an issue there.
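
Roughly quantified with the numbers from the changelog above: util_avg can
reach 1024 while cpu_capacity_orig is 589, so a single busy SMT thread can
look like 1024/589 ~= 1.74 times what the core can actually deliver, i.e.
close to double the real demand.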

cheers,
Steve


[PATCH] sched: fix incorrect PELT values on SMT

2016-08-18 Thread Steve Muckle
PELT scales its util_sum and util_avg values via
arch_scale_cpu_capacity(). If that function is passed the CPU's sched
domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
is set. PELT does not pass in the sd however. The other caller of
arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
util_sum and util_avg scale beyond the CPU capacity on SMT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024.

Fix this by passing in the sd in __update_load_avg() as well.

Signed-off-by: Steve Muckle <smuc...@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d485421bed..95d34b337152 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
sa->last_update_time = now;
 
scale_freq = arch_scale_freq_capacity(NULL, cpu);
-   scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+   scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu);
 
/* delta_w is the amount already accumulated against our next period */
delta_w = sa->period_contrib;
-- 
2.7.3
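
For reference, the generic arch_scale_cpu_capacity() fallback that this patch
starts passing a sched_domain to behaves roughly as sketched below
(reconstructed from memory of that kernel version, so treat it as approximate):
with SD_SHARE_CPUCAPACITY set it divides smt_gain across the SMT siblings,
otherwise it returns the full SCHED_CAPACITY_SCALE of 1024.

static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
        /* SMT level: split the per-core smt_gain across the siblings. */
        if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
                return sd->smt_gain / sd->span_weight;

        /* Otherwise a CPU is worth the full scale (1024). */
        return SCHED_CAPACITY_SCALE;
}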



Re: [PATCH v2 2/2] cpufreq / sched: Pass runqueue pointer to cpufreq_update_util()

2016-08-15 Thread Steve Muckle
LGTM

On Fri, Aug 12, 2016 at 02:06:44AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> All of the callers of cpufreq_update_util() pass rq_clock(rq) to it
> as the time argument and some of them check whether or not cpu_of(rq)
> is equal to smp_processor_id() before calling it, so rework it to
> take a runqueue pointer as the argument and move the rq_clock(rq)
> evaluation into it.
> 
> Additionally, provide a wrapper checking cpu_of(rq) against
> smp_processor_id() for the cpufreq_update_util() callers that
> need it.
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> This is a new patch based on the [2/2] from v1.
> 
> ---
>  kernel/sched/deadline.c |3 +--
>  kernel/sched/fair.c |4 +---
>  kernel/sched/rt.c   |3 +--
>  kernel/sched/sched.h|   15 +++
>  4 files changed, 14 insertions(+), 11 deletions(-)
> 
> Index: linux-pm/kernel/sched/deadline.c
> ===
> --- linux-pm.orig/kernel/sched/deadline.c
> +++ linux-pm/kernel/sched/deadline.c
> @@ -733,8 +733,7 @@ static void update_curr_dl(struct rq *rq
>   }
>  
>   /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> - if (cpu_of(rq) == smp_processor_id())
> - cpufreq_update_util(rq_clock(rq), SCHED_CPUFREQ_DL);
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_DL);
>  
>   schedstat_set(curr->se.statistics.exec_max,
> max(curr->se.statistics.exec_max, delta_exec));
> Index: linux-pm/kernel/sched/fair.c
> ===
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -2876,8 +2876,6 @@ static inline void update_tg_load_avg(st
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
>   if (&this_rq()->cfs == cfs_rq) {
> - struct rq *rq = rq_of(cfs_rq);
> -
>   /*
>* There are a few boundary cases this might miss but it should
>* get called often enough that that should (hopefully) not be
> @@ -2894,7 +2892,7 @@ static inline void cfs_rq_util_change(st
>*
>* See cpu_util().
>*/
> - cpufreq_update_util(rq_clock(rq), 0);
> + cpufreq_update_util(rq_of(cfs_rq), 0);
>   }
>  }
>  
> Index: linux-pm/kernel/sched/rt.c
> ===
> --- linux-pm.orig/kernel/sched/rt.c
> +++ linux-pm/kernel/sched/rt.c
> @@ -958,8 +958,7 @@ static void update_curr_rt(struct rq *rq
>   return;
>  
>   /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> - if (cpu_of(rq) == smp_processor_id())
> - cpufreq_update_util(rq_clock(rq), SCHED_CPUFREQ_RT);
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
>  
>   schedstat_set(curr->se.statistics.exec_max,
> max(curr->se.statistics.exec_max, delta_exec));
> Index: linux-pm/kernel/sched/sched.h
> ===
> --- linux-pm.orig/kernel/sched/sched.h
> +++ linux-pm/kernel/sched/sched.h
> @@ -1763,7 +1763,7 @@ DECLARE_PER_CPU(struct update_util_data
>  
>  /**
>   * cpufreq_update_util - Take a note about CPU utilization changes.
> - * @time: Current time.
> + * @rq: Runqueue to carry out the update for.
>   * @flags: Update reason flags.
>   *
>   * This function is called by the scheduler on the CPU whose utilization is
> @@ -1783,16 +1783,23 @@ DECLARE_PER_CPU(struct update_util_data
>   * but that really is a band-aid.  Going forward it should be replaced with
>   * solutions targeted more specifically at RT and DL tasks.
>   */
> -static inline void cpufreq_update_util(u64 time, unsigned int flags)
> +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
>  {
>   struct update_util_data *data;
>  
>   data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
>   if (data)
> - data->func(data, time, flags);
> + data->func(data, rq_clock(rq), flags);
> +}
> +
> +static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags)
> +{
> + if (cpu_of(rq) == smp_processor_id())
> + cpufreq_update_util(rq, flags);
>  }
>  #else
> -static inline void cpufreq_update_util(u64 time, unsigned int flags) {}
> +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> +static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags) {}
>  #endif /* CONFIG_CPU_FREQ */
>  
>  #ifdef arch_scale_freq_capacity
> 


Re: [PATCH v2 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()

2016-08-15 Thread Steve Muckle
LGTM

On Fri, Aug 12, 2016 at 02:04:42AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> It is useful to know the reason why cpufreq_update_util() has just
> been called and that can be passed as flags to cpufreq_update_util()
> and to the ->func() callback in struct update_util_data.  However,
> doing that in addition to passing the util and max arguments they
> already take would be clumsy, so avoid it.
> 
> Instead, use the observation that the schedutil governor is part
> of the scheduler proper, so it can access scheduler data directly.
> This allows the util and max arguments of cpufreq_update_util()
> and the ->func() callback in struct update_util_data to be replaced
> with a flags one, but schedutil has to be modified to follow.
> 
> Thus make the schedutil governor obtain the CFS utilization
> information from the scheduler and use the "RT" and "DL" flags
> instead of the special utilization value of ULONG_MAX to track
> updates from the RT and DL sched classes.  Make it non-modular
> too to avoid having to export scheduler variables to modules at
> large.
> 
> Next, update all of the other users of cpufreq_update_util()
> and the ->func() callback in struct update_util_data accordingly.
> 
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> v1 -> v2: Do not check cpu_of(rq) against smp_processor_id() in
>   cfs_rq_util_change().
> 
> ---
>  drivers/cpufreq/Kconfig|5 --
>  drivers/cpufreq/cpufreq_governor.c |2 -
>  drivers/cpufreq/intel_pstate.c |2 -
>  include/linux/sched.h  |   12 --
>  kernel/sched/cpufreq.c |2 -
>  kernel/sched/cpufreq_schedutil.c   |   67 
> -
>  kernel/sched/deadline.c|4 +-
>  kernel/sched/fair.c|   10 +
>  kernel/sched/rt.c  |4 +-
>  kernel/sched/sched.h   |   31 +
>  10 files changed, 66 insertions(+), 73 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> ===
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> @@ -260,7 +260,7 @@ static void dbs_irq_work(struct irq_work
>  }
>  
>  static void dbs_update_util_handler(struct update_util_data *data, u64 time,
> - unsigned long util, unsigned long max)
> + unsigned int flags)
>  {
>   struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
>   struct policy_dbs_info *policy_dbs = cdbs->policy_dbs;
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -1329,7 +1329,7 @@ static inline void intel_pstate_adjust_b
>  }
>  
>  static void intel_pstate_update_util(struct update_util_data *data, u64 time,
> -  unsigned long util, unsigned long max)
> +  unsigned int flags)
>  {
>   struct cpudata *cpu = container_of(data, struct cpudata, update_util);
>   u64 delta_ns = time - cpu->sample.time;
> Index: linux-pm/include/linux/sched.h
> ===
> --- linux-pm.orig/include/linux/sched.h
> +++ linux-pm/include/linux/sched.h
> @@ -3469,15 +3469,19 @@ static inline unsigned long rlimit_max(u
>   return task_rlimit_max(current, limit);
>  }
>  
> +#define SCHED_CPUFREQ_RT (1U << 0)
> +#define SCHED_CPUFREQ_DL (1U << 1)
> +
> +#define SCHED_CPUFREQ_RT_DL  (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> +
>  #ifdef CONFIG_CPU_FREQ
>  struct update_util_data {
> - void (*func)(struct update_util_data *data,
> -  u64 time, unsigned long util, unsigned long max);
> +   void (*func)(struct update_util_data *data, u64 time, unsigned int flags);
>  };
>  
>  void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
> - void (*func)(struct update_util_data *data, u64 time,
> -  unsigned long util, unsigned long max));
> +   void (*func)(struct update_util_data *data, u64 time,
> + unsigned int flags));
>  void cpufreq_remove_update_util_hook(int cpu);
>  #endif /* CONFIG_CPU_FREQ */
>  
> Index: linux-pm/kernel/sched/cpufreq.c
> ===
> --- linux-pm.orig/kernel/sched/cpufreq.c
> +++ linux-pm/kernel/sched/cpufreq.c
> @@ -33,7 +33,7 @@ DEFINE_PER_CPU(struct update_util_data *
>   */
>  void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
>   void (*func)(struct 


Re: [PATCH 2/2] cpufreq / sched: Check cpu_of(rq) in cpufreq_update_util()

2016-08-11 Thread Steve Muckle
On Wed, Aug 10, 2016 at 03:11:17AM +0200, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/fair.c
> ===
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -2876,8 +2876,6 @@ static inline void update_tg_load_avg(st
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
>   if (&this_rq()->cfs == cfs_rq) {
> - struct rq *rq = rq_of(cfs_rq);
> -
>   /*
>* There are a few boundary cases this might miss but it should
>* get called often enough that that should (hopefully) not be
> @@ -2894,8 +2892,7 @@ static inline void cfs_rq_util_change(st
>*
>* See cpu_util().
>*/
> - if (cpu_of(rq) == smp_processor_id())
> - cpufreq_update_util(rq_clock(rq), 0);
> + cpufreq_update_util(rq_of(cfs_rq), 0);
>   }
...
> +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
>  {
>   struct update_util_data *data;
>  
> + if (cpu_of(rq) != smp_processor_id())
> + return;

This test is unnecessary in the CFS (most common) path due to the check
on this_rq in cfs_rq_util_change().

I think instead of bringing the test into cpufreq_update_util it should
be left at the call sites for RT and DL, and removed from CFS as part of
the first patch.
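
A rough sketch of that shape (illustrative only, not a posted patch): the
helper stays unconditional, and only the RT/DL call sites keep the CPU check,
since the CFS path already filters on this_rq()->cfs == cfs_rq.

static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
{
        struct update_util_data *data;

        data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
        if (data)
                data->func(data, rq_clock(rq), flags);
}

/* e.g. in update_curr_rt() / update_curr_dl(): */
if (cpu_of(rq) == smp_processor_id())
        cpufreq_update_util(rq, SCHED_CPUFREQ_RT);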

thanks,
Steve


Re: [Update][PATCH 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()

2016-08-11 Thread Steve Muckle
On Thu, Aug 11, 2016 at 11:03:47AM -0700, Steve Muckle wrote:
> On Wed, Aug 10, 2016 at 03:49:07AM +0200, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/sched/fair.c
> > ===
> > --- linux-pm.orig/kernel/sched/fair.c
> > +++ linux-pm/kernel/sched/fair.c
> > @@ -2875,11 +2875,8 @@ static inline void update_tg_load_avg(st
> >  
> >  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
> >  {
> > -   struct rq *rq = rq_of(cfs_rq);
> > -   int cpu = cpu_of(rq);
> > -
> > -   if (cpu == smp_processor_id() && >cfs == cfs_rq) {
> > -   unsigned long max = rq->cpu_capacity_orig;
> > +   if (_rq()->cfs == cfs_rq) {
> > +   struct rq *rq = rq_of(cfs_rq);
> >  
> > /*
> >  * There are a few boundary cases this might miss but it should
> > @@ -2897,8 +2894,8 @@ static inline void cfs_rq_util_change(st
> >  *
> >  * See cpu_util().
> >  */
> > -   cpufreq_update_util(rq_clock(rq),
> > -   min(cfs_rq->avg.util_avg, max), max);
> > +   if (cpu_of(rq) == smp_processor_id())
> 
> Isn't this test against smp_processor_id() redundant since
> this_rq()->cfs == cfs_rq?

Sorry, I see this is modified in the next patch.


Re: [Update][PATCH 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()

2016-08-11 Thread Steve Muckle
On Wed, Aug 10, 2016 at 03:49:07AM +0200, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/fair.c
> ===
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -2875,11 +2875,8 @@ static inline void update_tg_load_avg(st
>  
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
> - struct rq *rq = rq_of(cfs_rq);
> - int cpu = cpu_of(rq);
> -
> - if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> - unsigned long max = rq->cpu_capacity_orig;
> + if (&this_rq()->cfs == cfs_rq) {
> + struct rq *rq = rq_of(cfs_rq);
>  
>   /*
>* There are a few boundary cases this might miss but it should
> @@ -2897,8 +2894,8 @@ static inline void cfs_rq_util_change(st
>*
>* See cpu_util().
>*/
> - cpufreq_update_util(rq_clock(rq),
> - min(cfs_rq->avg.util_avg, max), max);
> + if (cpu_of(rq) == smp_processor_id())

Isn't this test against smp_processor_id() redundant since
this_rq()->cfs == cfs_rq?

> + cpufreq_update_util(rq_clock(rq), 0);

All else looked good to me.

thanks,
Steve


Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

2016-08-04 Thread Steve Muckle
On Thu, Aug 04, 2016 at 11:19:00PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, August 03, 2016 07:24:18 PM Steve Muckle wrote:
> > On Wed, Aug 03, 2016 at 12:38:20AM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Aug 3, 2016 at 12:02 AM, Steve Muckle <steve.muc...@linaro.org> 
> > > wrote:
> > > > On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> > > >> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle <steve.muc...@linaro.org> 
> > > >> wrote:
> > > >> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> > > >> > ...
> > > >> >> For this purpose, define a new cpufreq_update_util() flag
> > > >> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> > > >> >> cpufreq_update_util() in the in_iowait case.  That generally
> > > >> >> requires cpufreq_update_util() to be called directly from there,
> > > >> >> because update_load_avg() is not likely to be invoked in that
> > > >> >> case.
> > > >> >
> > > >> > I didn't follow why the cpufreq hook won't likely be called if
> > > >> > in_iowait is set? AFAICS update_load_avg() gets called in the second 
> > > >> > loop
> > > >> > and calls update_cfs_rq_load_avg (triggers the hook).
> > > >>
> > > >> In practice it turns out that in the majority of cases when in_iowait
> > > >> is set the second loop will not run.
> > > >
> > > > My understanding of enqueue_task_fair() is that the first loop walks up
> > > > the portion of the sched_entity hierarchy that needs to be enqueued, and
> > > > the second loop updates the rest of the hierarchy that was already
> > > > enqueued.
> > > >
> > > > Even if the se corresponding to the root cfs_rq needs to be enqueued
> > > > (meaning the whole hierarchy is traversed in the first loop and the
> > > > second loop does nothing), enqueue_entity() on the root cfs_rq should
> > > > result in the cpufreq hook being called, via enqueue_entity() ->
> > > > enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
> > > 
> > > But then it's rather difficult to pass the IO flag to this one, isn't it?
> > > 
> > > Essentially, the problem is to pass "IO" to cpufreq_update_util() when
> > > p->in_iowait is set.
> > > 
> > > If you can find a clever way to do it without adding an extra call
> > > site, that's fine by me, but in any case the extra
> > > cpufreq_update_util() invocation should not be too expensive.
> > 
> > I was under the impression that function pointer calls were more
> > expensive, and in the shared policy case there is a nontrivial amount of
> > code that is run in schedutil (including taking a spinlock) before we'd
> > see sugov_should_update_freq() return false and bail.
> 
> That's correct in principle, but we only do that if p->in_iowait is set,
> which is somewhat special anyway and doesn't happen every time for sure.
> 
> So while there is overhead theoretically, I'm not even sure if it is 
> measurable.

OK, my worry was that IO-heavy workloads might hammer this path, but I
don't know of any specifically, or how often this path can be taken.

> 
> > Agreed that getting knowledge of p->in_iowait down to the existing hook
> > is not easy. I spent some time fiddling with that. It seemed doable but
> > somewhat gross due to the required flag passing and modifications
> > to enqueue_entity, update_load_avg, etc. If it is decided that it is worth
> > pursuing I can keep working on it and post a draft.
> 
> Well, that's a Peter's call. :-)
> 
> > But I also wonder if the hooks are in the best location.  They are
> > currently deep in the PELT code. This may make sense from a theoretical
> > standpoint, calling them whenever a root cfs_rq utilization changes, but
> > it also makes the hooks difficult to correlate (for policy purposes such
> > as this iowait change) with higher level logical events like a task
> > wakeup. Or load balance where we probably want to call the hook just
> > once after a load balance is complete.
> 
> I generally agree.  We still need to ensure that the hooks will be invoked
> frequently enough, though, even if HZ is 100.
> 
> > This is also an issue for the remote wakeup case where I currently have
> > another invocation of the hook in check_preempt_curr(), so I can know if
> > preemption was triggered and skip a remote schedutil update in that case
> > to avoid a duplicate IPI.
> > 
> > It seems to me worth evaluating if a higher level set of hook locations
> > could be used. One possibility is higher up in CFS:
> > - enqueue_task_fair, dequeue_task_fair
> > - scheduler_tick
> > - active_load_balance_cpu_stop, load_balance
> 
> Agreed, that's worth checking.
> 
> > Though this wouldn't solve my issue with check_preempt_curr. That would
> > probably require going further up the stack to try_to_wake_up() etc. Not
> > yet sure what the other hook locations would be at that level.
> 
> That's probably too far away from the root cfs_rq utilization changes IMO.

Is your concern that the rate of hook calls would be decreased?

thanks,
Steve


Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

2016-08-03 Thread Steve Muckle
On Wed, Aug 03, 2016 at 12:38:20AM +0200, Rafael J. Wysocki wrote:
> On Wed, Aug 3, 2016 at 12:02 AM, Steve Muckle <steve.muc...@linaro.org> wrote:
> > On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> >> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle <steve.muc...@linaro.org> 
> >> wrote:
> >> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> >> > ...
> >> >> For this purpose, define a new cpufreq_update_util() flag
> >> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> >> >> cpufreq_update_util() in the in_iowait case.  That generally
> >> >> requires cpufreq_update_util() to be called directly from there,
> >> >> because update_load_avg() is not likely to be invoked in that
> >> >> case.
> >> >
> >> > I didn't follow why the cpufreq hook won't likely be called if
> >> > in_iowait is set? AFAICS update_load_avg() gets called in the second loop
> >> > and calls update_cfs_rq_load_avg (triggers the hook).
> >>
> >> In practice it turns out that in the majority of cases when in_iowait
> >> is set the second loop will not run.
> >
> > My understanding of enqueue_task_fair() is that the first loop walks up
> > the portion of the sched_entity hierarchy that needs to be enqueued, and
> > the second loop updates the rest of the hierarchy that was already
> > enqueued.
> >
> > Even if the se corresponding to the root cfs_rq needs to be enqueued
> > (meaning the whole hierarchy is traversed in the first loop and the
> > second loop does nothing), enqueue_entity() on the root cfs_rq should
> > result in the cpufreq hook being called, via enqueue_entity() ->
> > enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
> 
> But then it's rather difficult to pass the IO flag to this one, isn't it?
> 
> Essentially, the problem is to pass "IO" to cpufreq_update_util() when
> p->in_iowait is set.
> 
> If you can find a clever way to do it without adding an extra call
> site, that's fine by me, but in any case the extra
> cpufreq_update_util() invocation should not be too expensive.

I was under the impression that function pointer calls were more
expensive, and in the shared policy case there is a nontrivial amount of
code that is run in schedutil (including taking a spinlock) before we'd
see sugov_should_update_freq() return false and bail.

Agreed that getting knowledge of p->in_iowait down to the existing hook
is not easy. I spent some time fiddling with that. It seemed doable but
somewhat gross due to the required flag passing and modifications
to enqueue_entity, update_load_avg, etc. If it is decided that it is worth
pursuing I can keep working on it and post a draft.

But I also wonder if the hooks are in the best location.  They are
currently deep in the PELT code. This may make sense from a theoretical
standpoint, calling them whenever a root cfs_rq utilization changes, but
it also makes the hooks difficult to correlate (for policy purposes such
as this iowait change) with higher level logical events like a task
wakeup. Or load balance where we probably want to call the hook just
once after a load balance is complete.

This is also an issue for the remote wakeup case where I currently have
another invocation of the hook in check_preempt_curr(), so I can know if
preemption was triggered and skip a remote schedutil update in that case
to avoid a duplicate IPI.

It seems to me worth evaluating if a higher level set of hook locations
could be used. One possibility is higher up in CFS:
- enqueue_task_fair, dequeue_task_fair
- scheduler_tick
- active_load_balance_cpu_stop, load_balance

Though this wouldn't solve my issue with check_preempt_curr. That would
probably require going further up the stack to try_to_wake_up() etc. Not
yet sure what the other hook locations would be at that level.
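
Purely as an illustration of the enqueue/dequeue placement listed above (not
a posted patch), the hook call would move out of the PELT internals and up to
the task-level event, roughly:

static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        /* ... existing enqueue work and PELT updates ... */

        /* One cpufreq update per logical event, with event context. */
        cpufreq_update_util(rq, p->in_iowait ? UUF_IO : 0);
}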

thanks,
Steve


Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

2016-08-02 Thread Steve Muckle
On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle <steve.muc...@linaro.org> wrote:
> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> > ...
> >> For this purpose, define a new cpufreq_update_util() flag
> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> >> cpufreq_update_util() in the in_iowait case.  That generally
> >> requires cpufreq_update_util() to be called directly from there,
> >> because update_load_avg() is not likely to be invoked in that
> >> case.
> >
> > I didn't follow why the cpufreq hook won't likely be called if
> > in_iowait is set? AFAICS update_load_avg() gets called in the second loop
> > and calls update_cfs_rq_load_avg (triggers the hook).
> 
> In practice it turns out that in the majority of cases when in_iowait
> is set the second loop will not run.

My understanding of enqueue_task_fair() is that the first loop walks up
the portion of the sched_entity hierarchy that needs to be enqueued, and
the second loop updates the rest of the hierarchy that was already
enqueued.

Even if the se corresponding to the root cfs_rq needs to be enqueued
(meaning the whole hierarchy is traversed in the first loop and the
second loop does nothing), enqueue_entity() on the root cfs_rq should
result in the cpufreq hook being called, via enqueue_entity() ->
enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
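
For readers without the source at hand, here is a heavily trimmed skeleton of
enqueue_task_fair() from that era, kept only to show the two loops (throttling,
group accounting and the rest are omitted, so treat it as a sketch rather than
the real function):

static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;

        /* Loop 1: enqueue the part of the hierarchy not yet on a runqueue. */
        for_each_sched_entity(se) {
                if (se->on_rq)
                        break;
                cfs_rq = cfs_rq_of(se);
                enqueue_entity(cfs_rq, se, flags);      /* can reach the cpufreq hook */
                flags = ENQUEUE_WAKEUP;
        }

        /* Loop 2: update the already-enqueued remainder of the hierarchy. */
        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                update_load_avg(se, 1);                 /* can reach the cpufreq hook */
        }

        add_nr_running(rq, 1);
}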

I'll keep looking to see if I've misunderstood something in here.

thanks,
Steve


Re: [RFC][PATCH 1/7] cpufreq / sched: Make schedutil access utilization data directly

2016-08-02 Thread Steve Muckle
On Tue, Aug 02, 2016 at 11:38:17AM +0100, Juri Lelli wrote:
> > > Anyway one way that my patch differed was that I had used the flags
> > > field to keep the behavior the same for both RT and DL.
> 
> Do you mean "go to max" policy for both, until proper policies will be
> implemented in the future?

Yep.

> > That happens
> > > later on in this series for RT but the DL policy is modified as above.
> > > Can the DL policy be left as-is and discussed/modified in a separate
> > > series?
> 
> Not that we want to start discussing this point now, if we postpone the
> change for later, but I just wanted to point out a difference w.r.t.
> what the schedfreq thing was doing: it used to sum contributions from
> the different classes, instead of taking the max. We probably never
> really discussed on the list what is the right thing to do, though.

Yeah I figured that was worth deferring into its own patchset/thread.
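
The behavioural difference Juri mentions boils down to something like this
(illustrative only, neither helper is real kernel code):

/* Old schedfreq-style aggregation: class contributions add up. */
static unsigned long util_sum_classes(unsigned long cfs, unsigned long rt,
                                      unsigned long dl)
{
        return cfs + rt + dl;
}

/* Max-style aggregation: the busiest class alone sets the request. */
static unsigned long util_max_classes(unsigned long cfs, unsigned long rt,
                                      unsigned long dl)
{
        unsigned long m = cfs > rt ? cfs : rt;

        return m > dl ? m : dl;
}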

cheers,
Steve


Re: [RFC][PATCH 4/7] cpufreq / sched: Add flags argument to cpufreq_update_util()

2016-08-01 Thread Steve Muckle
On Tue, Aug 02, 2016 at 01:44:41AM +0200, Rafael J. Wysocki wrote:
> On Monday, August 01, 2016 12:59:30 PM Steve Muckle wrote:
> > On Mon, Aug 01, 2016 at 04:57:18PM +0200, Rafael J. Wysocki wrote:
> > > On Monday, August 01, 2016 09:33:12 AM Dominik Brodowski wrote:
> > > > On Mon, Aug 01, 2016 at 01:36:46AM +0200, Rafael J. Wysocki wrote:
> > > > > +#define UUF_RT   0x01
> > > > 
> > > > What does UUF stand for?
> > > 
> > > "Utilization upadte flag".
> > 
> > I had wondered the same - in my patchset I used CPUFREQ_SCHED_UPDATE_* for 
> > the
> > prefixes, though I guess some may object to the length.
> 
> Well, OK.
> 
> I guess something like SCHED_CPUFREQ_RT etc would be sufficient?

Yeah I think that would work.

thanks,
Steve


Re: [RFC][PATCH 6/7] cpufreq: schedutil: Add iowait boosting

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 01:37:59AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> Modify the schedutil cpufreq governor to boost the CPU frequency
> if the UUF_IO flag is passed to it via cpufreq_update_util().
> 
> If that happens, the frequency is set to the maximum during
> the first update after receiving the UUF_IO flag and then the
> boost is reduced by half during each following update.
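
The boost/decay scheme described above amounts to roughly the following (a
minimal sketch with made-up names, not the actual governor code):

static unsigned long iowait_boost;      /* per-policy state in a real governor */

static unsigned long apply_iowait_boost(unsigned long util, unsigned long max,
                                        unsigned int flags)
{
        if (flags & UUF_IO)
                iowait_boost = max;     /* first update after an IO wakeup: go to max */
        else if (iowait_boost)
                iowait_boost >>= 1;     /* then decay by half on each following update */

        return util > iowait_boost ? util : iowait_boost;
}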

Were these changes to schedutil part of the positive test results
mentioned in patch 5? Or are those just from intel pstate?

I was nervous about the effect of this on power and tested a couple of
low-power use cases. The platform is the Hikey 96board (8 core ARM A53,
single CPUfreq domain) running AOSP Android and schedutil backported to
kernel 4.4. These tests run mp3 and mpeg4 playback for a little while,
recording total energy consumption during the test along with frequency
residency.

As the results below show, I did not measure an appreciable effect; if
anything, things may be slightly better with the patches.

The hardcoding of a non-tunable boosting scheme makes me nervous but
perhaps it could be revisited if some platform or configuration shows
a noticeable regression?

Testcase        Energy  /------ CPU frequency residency ------\
                (J)     208000  432000  729000  960000  1200000
mp3-before-1    26.822  47.27%  24.79%  16.23%  5.20%   6.52%
mp3-before-2    26.817  41.70%  28.75%  17.62%  5.17%   6.75%
mp3-before-3    26.65   42.48%  28.65%  17.25%  5.07%   6.55%
mp3-after-1     26.667  42.51%  27.38%  18.00%  5.40%   6.71%
mp3-after-2     26.777  48.37%  24.15%  15.68%  4.55%   7.25%
mp3-after-3     26.806  41.93%  27.71%  18.35%  4.78%   7.35%

mpeg4-before-1  26.024  18.41%  60.09%  13.16%  0.49%   7.85%
mpeg4-before-2  25.147  20.47%  64.80%  8.44%   1.37%   4.91%
mpeg4-before-3  25.007  19.18%  66.08%  10.01%  0.59%   4.22%
mpeg4-after-1   25.598  19.77%  61.33%  11.63%  0.79%   6.48%
mpeg4-after-2   25.18   22.31%  62.78%  8.83%   1.18%   4.90%
mpeg4-after-3   25.162  21.59%  64.88%  8.29%   0.49%   4.71%

thanks,
Steve


Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
...
> For this purpose, define a new cpufreq_update_util() flag
> UUF_IO and modify enqueue_task_fair() to pass that flag to
> cpufreq_update_util() in the in_iowait case.  That generally
> requires cpufreq_update_util() to be called directly from there,
> because update_load_avg() is not likely to be invoked in that
> case.

I didn't follow why the cpufreq hook won't likely be called if
in_iowait is set? AFAICS update_load_avg() gets called in the second loop
and calls update_cfs_rq_load_avg (triggers the hook).

> 
> Signed-off-by: Rafael J. Wysocki 
> ---
>  include/linux/sched.h |1 +
>  kernel/sched/fair.c   |8 
>  2 files changed, 9 insertions(+)
> 
> Index: linux-pm/kernel/sched/fair.c
> ===
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -4459,6 +4459,14 @@ enqueue_task_fair(struct rq *rq, struct
>   struct cfs_rq *cfs_rq;
>   struct sched_entity *se = &p->se;
>  
> + /*
> +  * If in_iowait is set, it is likely that the loops below will not
> +  * trigger any cpufreq utilization updates, so do it here explicitly
> +  * with the IO flag passed.
> +  */
> + if (p->in_iowait)
> + cpufreq_update_util(rq, UUF_IO);
> +
>   for_each_sched_entity(se) {
>   if (se->on_rq)
>   break;
> Index: linux-pm/include/linux/sched.h
> ===
> --- linux-pm.orig/include/linux/sched.h
> +++ linux-pm/include/linux/sched.h
> @@ -3376,6 +3376,7 @@ static inline unsigned long rlimit_max(u
>  }
>  
>  #define UUF_RT   0x01
> +#define UUF_IO   0x02
>  
>  #ifdef CONFIG_CPU_FREQ
>  struct update_util_data {

thanks,
Steve


Re: [RFC][PATCH 4/7] cpufreq / sched: Add flags argument to cpufreq_update_util()

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 04:57:18PM +0200, Rafael J. Wysocki wrote:
> On Monday, August 01, 2016 09:33:12 AM Dominik Brodowski wrote:
> > On Mon, Aug 01, 2016 at 01:36:46AM +0200, Rafael J. Wysocki wrote:
> > > +#define UUF_RT   0x01
> > 
> > What does UUF stand for?
> 
> "Utilization update flag".

I had wondered the same - in my patchset I used CPUFREQ_SCHED_UPDATE_* for the
prefixes, though I guess some may object to the length.

thanks,
Steve


Re: [RFC][PATCH 3/7] cpufreq / sched: Check cpu_of(rq) in cpufreq_update_util()

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 09:29:57AM +0200, Dominik Brodowski wrote:
> A small nitpick:
> 
> On Mon, Aug 01, 2016 at 01:36:01AM +0200, Rafael J. Wysocki wrote:
> > --- linux-pm.orig/kernel/sched/sched.h
> > +++ linux-pm/kernel/sched/sched.h
> > @@ -1760,7 +1760,7 @@ DECLARE_PER_CPU(struct update_util_data
> >  
> >  /**
> >   * cpufreq_update_util - Take a note about CPU utilization changes.
> > - * @time: Current time.
> > + * @rq: Runqueue to carry out the update for.
> >   *
> >   * This function is called by the scheduler on every invocation of
> >   * update_load_avg() on the CPU whose utilization is being updated.
> 
> This comment seems to need an update due to the smp_processor_id() check
> being moved into this function.

The callers of this have also changed - it is no longer called directly
by update_load_avg(), rather via cfs_rq_util_change() from several other
locations (I believe it was my patch that failed to update this
comment).

Could this be replaced with a more generic statement such as "called by
CFS in various paths?"
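
E.g. something along these lines (just a wording suggestion for the
kernel-doc, keeping the rest of the comment as it is):

/**
 * cpufreq_update_util - Take a note about CPU utilization changes.
 * @rq: Runqueue to carry out the update for.
 *
 * This function is called by the scheduler from various paths (in CFS
 * via cfs_rq_util_change()) whenever the utilization of @rq may have
 * changed.
 */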

thanks,
Steve


Re: [RFC][PATCH 1/7] cpufreq / sched: Make schedutil access utilization data directly

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 01:34:36AM +0200, Rafael J. Wysocki wrote:
...
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -144,17 +144,47 @@ static unsigned int get_next_freq(struct
>   return cpufreq_driver_resolve_freq(policy, freq);
>  }
>  
> -static void sugov_update_single(struct update_util_data *hook, u64 time,
> - unsigned long util, unsigned long max)
> +static void sugov_get_util(unsigned long *util, unsigned long *max)
> +{
> + unsigned long dl_util, dl_max;
> + unsigned long cfs_util, cfs_max;
> + int cpu = smp_processor_id();
> + struct dl_bw *dl_bw = dl_bw_of(cpu);
> + struct rq *rq = this_rq();
> +
> + if (rt_prio(current->prio)) {
> + *util = ULONG_MAX;
> + return;
> + }
> +
> + dl_max = dl_bw_cpus(cpu) << 20;
> + dl_util = dl_bw->total_bw;
> +
> + cfs_max = rq->cpu_capacity_orig;
> + cfs_util = min(rq->cfs.avg.util_avg, cfs_max);
> +
> + if (cfs_util * dl_max > dl_util * cfs_max) {
> + *util = cfs_util;
> + *max  = cfs_max;
> + } else {
> + *util = dl_util;
> + *max  = dl_max;
> + }
> +}

Last Friday I had put together a similar patch based on Peter's. I need
the flags field for the remote wakeup support. My previous plan,
installing a late callback in check_preempt_curr that gets requested
from the earlier existing CFS callback, was not working out since those
two events don't always match up 1:1.

Anyway one way that my patch differed was that I had used the flags
field to keep the behavior the same for both RT and DL. That happens
later on in this series for RT but the DL policy is modified as above.
Can the DL policy be left as-is and discussed/modified in a separate
series?
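
To be concrete, what I mean is something along these lines (a rough
sketch, not my actual patch - UUF_DL is a placeholder flag name here):

static void sugov_get_util(unsigned long *util, unsigned long *max,
			   unsigned int flags)
{
	struct rq *rq = this_rq();

	/*
	 * Treat RT and DL the same way via the flags argument - both go
	 * straight to the maximum - and only estimate utilization for CFS.
	 */
	if (flags & (UUF_RT | UUF_DL)) {
		*util = ULONG_MAX;
		return;
	}

	*max  = rq->cpu_capacity_orig;
	*util = min(rq->cfs.avg.util_avg, *max);
}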

thanks,
Steve


Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-22 Thread Steve Muckle
On Fri, Jul 22, 2016 at 08:16:42AM -0700, Viresh Kumar wrote:
> > Long term as I was mentioning in the other thread I think it'd be good
> > if the current target() drivers were modified to supply resolve_freq(),
> > and that cpufreq_register_driver() were again changed to require it for
> > those drivers.
> 
> There is no need for us to force this, its really optional for such
> platforms. Worst case, schedutil wouldn't work at the best, so what?
> Its a platform driver's choice, isn't it ?

This would be in the context of then being able to remove the additional if
statement from the hot path.

To reply to the suggestion of using likely() here, I'm partial to
solving it without that because I'm guessing likely() has to be an
architecture-dependent thing? It seems cleaner to me if the existing
few target() drivers were augmented to provide the resolve_freq() callback
and its presence required. But it's certainly not a big deal and won't
affect any platforms I'm involved with, at least for now. Maybe I could
work on those target() drivers if things change.
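
For context, the hot path in question has essentially this shape (a
simplified sketch of cpufreq_driver_resolve_freq(), not the exact code):

unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
					 unsigned int target_freq)
{
	target_freq = clamp_val(target_freq, policy->min, policy->max);

	/* Common case: drivers with a frequency table. */
	if (cpufreq_driver->target_index) {
		policy->cached_resolved_idx =
			cpufreq_frequency_table_target(policy, target_freq,
						       CPUFREQ_RELATION_L);
		return policy->freq_table[policy->cached_resolved_idx].frequency;
	}

	/*
	 * The additional check under discussion: if every ->target()
	 * driver supplied ->resolve_freq(), this NULL test (or a likely()
	 * hint around the branch above) would not be needed.
	 */
	return cpufreq_driver->resolve_freq ?
		cpufreq_driver->resolve_freq(policy, target_freq) : target_freq;
}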


Re: linux-next: build failure after merge of the pm tree

2016-07-21 Thread Steve Muckle
On Fri, Jul 22, 2016 at 11:56:20AM +1000, Stephen Rothwell wrote:
> Hi Rafael,
> 
> After merging the pm tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
> 
> ERROR: "cpufreq_driver_resolve_freq" [kernel/sched/cpufreq_schedutil.ko] 
> undefined!
> 
> Caused by commit
> 
>   5cbea46984d6 ("cpufreq: schedutil: map raw required frequency to driver 
> frequency")
> 
> I used the pm tree from next-20160721 for today.

Sorry - I have just sent a patch to address this.

thanks,
Steve


[PATCH] cpufreq: export cpufreq_driver_resolve_freq()

2016-07-21 Thread Steve Muckle
Export cpufreq_driver_resolve_freq() since governors may be compiled as
modules.

Reported-by: Stephen Rothwell <s...@canb.auug.org.au>
Signed-off-by: Steve Muckle <smuc...@linaro.org>
---
 drivers/cpufreq/cpufreq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b696baeb249d..fa0a92d6121e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -514,6 +514,7 @@ unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
   CPUFREQ_RELATION_L);
return policy->freq_table[policy->cached_resolved_idx].frequency;
 }
+EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq);
 
 /*
  *  SYSFS INTERFACE  *
-- 
2.7.3



Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-21 Thread Steve Muckle
On Thu, Jul 21, 2016 at 04:36:48PM -0700, Steve Muckle wrote:
> As another alternative, this could be caught in cpufreq driver
> initialization? I believe you suggested that originally, but I avoided
> it as I didn't want to have to implement resolve_freq() for every
> target() style driver. It sounds like there aren't many though.

Going back and checking I see I was thinking of your suggestion that
cpufreq_register_driver() check that only target() drivers offer a
resolve_freq() callback. I put a comment for this in cpufreq.h but not a
check - I could add a check in another patch if you like.

Long term as I was mentioning in the other thread I think it'd be good
if the current target() drivers were modified to supply resolve_freq(),
and that cpufreq_register_driver() were again changed to require it for
those drivers.
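
If it helps, the registration-time check I could add would be something
like this in cpufreq_register_driver() (sketch only, not a tested patch):

	/*
	 * ->resolve_freq() is only meant for ->target() drivers;
	 * ->target_index() drivers already resolve exact frequencies
	 * through the frequency table.
	 */
	if (driver_data->resolve_freq && !driver_data->target)
		return -EINVAL;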


