Re: [PATCH] sched/fair: add support to tune PELT ramp/decay timings

2018-04-12 Thread Peter Zijlstra
On Mon, Apr 09, 2018 at 05:51:34PM +0100, Patrick Bellasi wrote:
> The PELT half-life is the time [ms] required by the PELT signal to build
> up a 50% load/utilization, starting from zero. This time is currently
> hardcoded to be 32ms, a value which seems to make sense for most
> workloads.
> 
> However, 32ms has been verified to be too long for certain classes of
> workloads. For example, in the mobile space many tasks affecting the
> user-experience run with a 16ms or 8ms cadence, since they need to match
> the common 60Hz or 120Hz refresh rate of the graphics pipeline.
> This has contributed so far to the idea that "PELT is too slow" to properly
> track the utilization of interactive mobile workloads, especially
> compared to alternative load tracking solutions which provide a
> better representation of task demand in the range of 10-20ms.

Initially the 32 was chosen to more or less correspond to the effective
scheduling period (sysctl_sched_latency based). The thinking was that if
you pick a PELT window shorter than the period, the result becomes
unstable due to not all tasks getting an equal go at things.

(of course, stuffing enough tasks on a rq will break this, but at that
point you have worse problems to deal with)

Should we retain this? Esp. with the lower end (8ms) I worry we'll see
more of those effects.


> Fortunately, since the integration of the utilization estimation
> support in mainline kernel:
> 
>    commit 7f65ea42eb00 ("sched/fair: Add util_est on top of PELT")
> 
> a fast decay time is no longer an issue for task utilization estimation.
> Although estimated utilization does not slow down the decay of blocked
> utilization on idle CPUs, for mobile workloads this seems not to be a
> major concern compared to the benefits in interactive responsiveness.

By picking a smaller PELT window, the util_est window shrinks
correspondingly; is that intentional or do we want to modify
UTIL_EST_WEIGHT_SHIFT to negate the PELT window changes?


[PATCH] sched/fair: add support to tune PELT ramp/decay timings

2018-04-09 Thread Patrick Bellasi
The PELT half-life is the time [ms] required by the PELT signal to build
up a 50% load/utilization, starting from zero. This time is currently
hardcoded to be 32ms, a value which seems to make sense for most
workloads.

However, 32ms has been verified to be too long for certain classes of
workloads. For example, in the mobile space many tasks affecting the
user-experience run with a 16ms or 8ms cadence, since they need to match
the common 60Hz or 120Hz refresh rate of the graphics pipeline.
This has contributed so far to the idea that "PELT is too slow" to properly
track the utilization of interactive mobile workloads, especially
compared to alternative load tracking solutions which provide a
better representation of task demand in the range of 10-20ms.

A faster PELT ramp-up could shorten the time required for the signal to
stabilize and thus better represent task demands in the mobile space.
As a downside, it also shortens the decay time, so we forget the
load/utilization of sleeping tasks (or idle CPUs) faster.

Fortunately, since the integration of the utilization estimation
support in mainline kernel:

   commit 7f65ea42eb00 ("sched/fair: Add util_est on top of PELT")

a fast decay time is no longer an issue for task utilization estimation.
Although estimated utilization does not slow down the decay of blocked
utilization on idle CPUs, for mobile workloads this seems not to be a
major concern compared to the benefits in interactive responsiveness.

Let's add a compile-time option to choose the PELT speed which best
fits a specific system. By default the current 32ms half-life is
used, but we can also compile a kernel to use a faster ramp-up time of
either 16ms or 8ms. These two configurations have been verified to give
PELT a further improvement in performance, compared to other out-of-tree
load tracking solutions, when it comes to tracking interactive workloads,
thus better supporting both task placement and frequency selection.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Jonathan Corbet 
Cc: Paul Turner 
Cc: Vincent Guittot 
Cc: Joel Fernandes 
Cc: Morten Rasmussen 
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 Documentation/scheduler/sched-pelt.c | 45 ++--
 init/Kconfig                         | 44 +++
 kernel/sched/sched-pelt.h            | 39 ++-
 3 files changed, 105 insertions(+), 23 deletions(-)

diff --git a/Documentation/scheduler/sched-pelt.c b/Documentation/scheduler/sched-pelt.c
index e4219139386a..e0ae21616188 100644
--- a/Documentation/scheduler/sched-pelt.c
+++ b/Documentation/scheduler/sched-pelt.c
@@ -10,34 +10,35 @@
 #include <math.h>
 #include <stdio.h>
 
-#define HALFLIFE 32
+#define HALFLIFE { 32, 16, 8 }
 #define SHIFT 32
 
 double y;
 
-void calc_runnable_avg_yN_inv(void)
+void calc_runnable_avg_yN_inv(const int halflife)
 {
 	int i;
 	unsigned int x;
 
 	printf("static const u32 runnable_avg_yN_inv[] = {");
-	for (i = 0; i < HALFLIFE; i++) {
+	for (i = 0; i < halflife; i++) {
 		x = ((1UL<<32)-1)*pow(y, i);
 
-		if (i % 6 == 0) printf("\n\t");
-		printf("0x%8x, ", x);
+		if (i % 4 == 0)
+			printf("\n\t");
+		printf("0x%8x,", x);
 	}
 	printf("\n};\n\n");
 }
 
 int sum = 1024;
 
-void calc_runnable_avg_yN_sum(void)
+void calc_runnable_avg_yN_sum(const int halflife)
 {
 	int i;
 
 	printf("static const u32 runnable_avg_yN_sum[] = {\n\t0,");
-	for (i = 1; i <= HALFLIFE; i++) {
+	for (i = 1; i <= halflife; i++) {
 		if (i == 1)
 			sum *= y;
 		else
@@ -55,7 +56,7 @@ int n = -1;
 /* first period */
 long max = 1024;
 
-void calc_converged_max(void)
+void calc_converged_max(const int halflife)
 {
 	long last = 0, y_inv = ((1UL<<32)-1)*y;
 
@@ -73,17 +74,17 @@ void calc_converged_max(void)
 		last = max;
 	}
 	n--;
-	printf("#define LOAD_AVG_PERIOD %d\n", HALFLIFE);
+	printf("#define LOAD_AVG_PERIOD %d\n", halflife);
 	printf("#define LOAD_AVG_MAX %ld\n", max);
-//	printf("#define LOAD_AVG_MAX_N %d\n\n", n);
+	/* printf("#define LOAD_AVG_MAX_N %d\n\n", n); */
 }
 
-void calc_accumulated_sum_32(void)
+void calc_accumulated_sum_32(const int halflife)
 {
 	int i, x = sum;
 
 	printf("static const u32 __accumulated_sum_N32[] = {\n\t 0,");
-	for (i = 1; i <= n/HALFLIFE+1; i++) {
+	for (i = 1; i <= n / halflife + 1; i++) {
 		if (i > 1)
 			x = x/2 + sum;
 
@@ -97,12 +98,22 @@ void calc_accumulated_sum_32(void)
 
 void