Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2019-01-29 Thread Peter Zijlstra
On Thu, Jan 24, 2019 at 02:04:32PM +, Patrick Bellasi wrote:

> > So if I'm not mistaken we then have 3 cases:
> > 
> >  1) runnable == util <= capacity
> > 
> > no contention, idle
> > 
> >  2) runnable == util > capacity
> > 
> > no contention, no idle
> > 
> >  3) runnable > util
> > 
> > contention, no idle
> > 
> > For 1) we can use: 'util'
> > For 2) we can use: 'capacity'
> > For 3) we can use: 'util * capacity >> 10'
> > 
> > (note that 2 is a special case of 3 when u=1)
> > 
> > This should work right?
> 
> I think there is a case, similar to 2, in which the new 'util' could
> potentially be used. That's the case for example of a 20% (estimated)
> utilization task running alone on a 15% capacity CPU, for a single
> activation. In that case such a task will complete and be dequeued
> with:
> 
>runnable == util > capacity
> 
> The problem is that we need to be sure there was no contention... and
> that seems to be difficult to detect.

When there is contention, runnable and util should diverge. Given this is
all discrete stuff, there's a few funnies, but who cares about those :-)

> > Now, instead of doing complicated things like that, you instead figure
> > that when there's no idle there's also no dequeue happening and we can
> > simply short-cut by skipping the entire thing, forgetting everything
> > about 2,3.
> > 
> > Did I get that right?
> 
> More or less... just saying that 1 is the only easy-to-detect scenario
> in which we are guaranteed that the utilization represents an actual
> bandwidth request, and thus the only safe value to sample for estimated
> utilization. For the other cases, since anyway:
> 
>util_est := max(max(ewma, last_util), util_avg)
> 
> util_est will just keep representing a safe and actually measured
> lower bound for the expected utilization of a task, without
> side effects on the EWMA, which has a "slow" update dynamic.

Right, so maybe we should expound the comment here a bit; but otherwise
I'm inclined to merge that v9 Vincent posted.
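
Something like this, as a sketch of the expanded comment over that check
(same helpers as in the hunk quoted above; the wording just collects the
reasoning from this thread):

        /*
         * To avoid overestimating the task's actual utilization, skip the
         * update if we cannot guarantee there was idle time on this CPU.
         * When the CPU is saturated, util_avg no longer tracks the
         * bandwidth the task asked for: it converges towards
         * SCHED_CAPACITY_SCALE / nr_running, whatever the real demand is.
         * Only samples taken while idle time was available are known to
         * be an actual, measured utilization.
         */
        if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
                return;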


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2019-01-24 Thread Patrick Bellasi
On 24-Jan 10:07, Peter Zijlstra wrote:
> 
> Sorry; trying to get back to this and re-reading the old conversations.
> 
> On Thu, Nov 29, 2018 at 03:13:16PM +, Patrick Bellasi wrote:
> > On 29-Nov 13:53, Peter Zijlstra wrote:
> > > On Wed, Nov 28, 2018 at 11:53:36AM +, Patrick Bellasi wrote:
> > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index ac855b2f4774..93e0cf5d8a76 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > > > if (!task_sleep)
> > > > return;
> > > > 
> > > > +   /* Skip samples which do not represent an actual utilization */
> > > > +   if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
> > > > +   return;
> > > > +
> > > > /*
> > > >  * If the PELT values haven't changed since enqueue time,
> > > >  * skip the util_est update.
> > > 
> > > Would you not want something like:
> > > 
> > >   min(task_util(p), capacity_of(task_cpu(p)))
> > > 
> > > And is this the only place where we need this?
> > 
> > Mmm... even this could be an over-estimation:
> > 
> > I've just posted an example in my last reply to Vincent, end of:
> > 
> >Message-ID: <20181129150020.GF23094@e110439-lin>
> >https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/
> 
> In particular this bit:
> 
>  | Seems we agree that, when there is no idle time:
>  | - the two 15% tasks will be overestimated
>  | - their utilization will reach 50% after a while
> 
> Right?
> 
> > > OTOH, if the task is always running, it will be always running
> > > irrespective of where it runs.
> > 
> > That's not what I'm concerned about. I'm concerned about small tasks
> > which are running on limited capacity (e.g. due to thermal capping)
> > without idle time. In this case, the new "utilization" signal could
> > overestimate the real task needs.
> > 
> > > Not storing these samples seems weird though; this is the exact
> > > condition you want to record -- the task is very active, if we skip
> > > these, we'll come back at a low frequency on the next wakeup.
> > 
> > When there is no idle time, we don't know if the reported
> > utilization, above the cpu capacity, is due to the task being bigger...
> > or just the new utilization signal converging towards:
> > 
> > 100% / RUNNABLE_TASKS_COUNT
> 
> So if I'm not mistaken we then have 3 cases:
> 
>  1) runnable == util <= capacity
> 
> no contention, idle
> 
>  2) runnable == util > capacity
> 
> no contention, no idle
> 
>  3) runnable > util
> 
> contention, no idle
> 
> For 1) we can use: 'util'
> For 2) we can use: 'capacity'
> For 3) we can use: 'util * capacity >> 10'
> 
> (note that 2 is a special case of 3 when u=1)
> 
> This should work right?

I think there is a case, similar to 2, in which the new 'util' could
potentially be used. That's the case for example of a 20% (estimated)
utilization task running alone on a 15% capacity CPU, for a single
activation. In that case such a task will complete and be dequeued
with:

   runnable == util > capacity

The problem is that we need to be sure there was no contention... and
that seems to be difficult to detect.

> Now, instead of doing complicated things like that, you instead figure
> that when there's no idle there's also no dequeue happening and we can
> simply short-cut by skipping the entire thing, forgetting everything
> about 2,3.
> 
> Did I get that right?

More or less... just saying that 1 is the only easy-to-detect scenario
in which we are guaranteed that the utilization represents an actual
bandwidth request, and thus the only safe value to sample for estimated
utilization. For the other cases, since anyway:

   util_est := max(max(ewma, last_util), util_avg)

util_est will just keep representing a safe and actually measured
lower bound for the expected utilization of a task, without
side effects on the EWMA, which has a "slow" update dynamic.
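
A stand-alone sketch of that selection (plain integers instead of the
kernel's util_est fields; the 1/4 EWMA weight is only an example):

#include <stdio.h>

#define max(a, b)       ((a) > (b) ? (a) : (b))

/* Estimated utilization as described above: never below the last
 * measured (dequeue-time) sample, never below the current util_avg. */
static long util_est(long ewma, long last_util, long util_avg)
{
        return max(max(ewma, last_util), util_avg);
}

/* The EWMA itself only moves slowly, one step per dequeue;
 * the 1/4 weight here is purely illustrative. */
static long ewma_update(long ewma, long last_util)
{
        return ewma + (last_util - ewma) / 4;
}

int main(void)
{
        /* ewma = 300, last dequeue-time sample = 450, current util_avg = 380 */
        printf("util_est = %ld\n", util_est(300, 450, 380));    /* 450 */
        printf("new ewma = %ld\n", ewma_update(300, 450));      /* 337 */
        return 0;
}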

-- 
#include 

Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2019-01-24 Thread Peter Zijlstra


Sorry; trying to get back to this and re-reading the old conversations.

On Thu, Nov 29, 2018 at 03:13:16PM +, Patrick Bellasi wrote:
> On 29-Nov 13:53, Peter Zijlstra wrote:
> > On Wed, Nov 28, 2018 at 11:53:36AM +, Patrick Bellasi wrote:
> > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index ac855b2f4774..93e0cf5d8a76 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > >   if (!task_sleep)
> > >   return;
> > > 
> > > + /* Skip samples which do not represent an actual utilization */
> > > + if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
> > > + return;
> > > +
> > >   /*
> > >* If the PELT values haven't changed since enqueue time,
> > >* skip the util_est update.
> > 
> > Would you not want something like:
> > 
> > min(task_util(p), capacity_of(task_cpu(p)))
> > 
> > And is this the only place where we need this?
> 
> Mmm... even this could be an over-estimation:
> 
> I've just posted an example in my last reply to Vincent, end of:
> 
>Message-ID: <20181129150020.GF23094@e110439-lin>
>https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/

In particular this bit:

 | Seems we agree that, when there is no idle time:
 | - the two 15% tasks will be overestimated
 | - their utilization will reach 50% after a while

Right?

> > OTOH, if the task is always running, it will be always running
> > irrespective of where it runs.
> 
> That's not what I'm concerned about. I'm concerned about small tasks
> which are running on limited capacity (e.g. due to thermal capping)
> without idle time. In this case, the new "utilization" signal could
> overestimate the real task needs.
> 
> > Not storing these samples seems weird though; this is the exact
> > condition you want to record -- the task is very active, if we skip
> > these, we'll come back at a low frequency on the next wakeup.
> 
> When there is no idle time, we don't know if the reported
> utilization, above the cpu capacity, is due to the task being bigger...
> or just the new utilization signal converging towards:
> 
> 100% / RUNNABLE_TASKS_COUNT

So if I'm not mistaken we then have 3 cases:

 1) runnable == util <= capacity

no contention, idle

 2) runnable == util > capacity

no contention, no idle

 3) runnable > util

contention, no idle

For 1) we can use: 'util'
For 2) we can use: 'capacity'
For 3) we can use: 'util * capacity >> 10'

(note that 2 is a special case of 3 when u=1)

This should work right?
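
Spelled out as a hypothetical helper (all values in the usual 0..1024
fixed-point range, SCHED_CAPACITY_SHIFT == 10):

static unsigned long util_est_sample(unsigned long util,
                                     unsigned long runnable,
                                     unsigned long capacity)
{
        if (runnable == util) {
                /* no contention */
                if (util <= capacity)
                        return util;            /* 1) idle time: real utilization */
                return capacity;                /* 2) no idle: clamp to capacity */
        }

        /*
         * 3) contention, no idle: scale by the capacity actually granted;
         *    2) is this same formula with u = 1.
         */
        return (util * capacity) >> SCHED_CAPACITY_SHIFT;
}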

Now, instead of doing complicated things like that, you instead figure
that when there's no idle there's also no dequeue happening and we can
simply short-cut by skipping the entire thing, forgetting everything
about 2,3.

Did I get that right?




Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2019-01-11 Thread Vincent Guittot
On Thu, 10 Jan 2019 at 16:30, Patrick Bellasi  wrote:
>
> On 29-Nov 17:19, Vincent Guittot wrote:
> > On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi  
> > wrote:
> > > On 29-Nov 11:43, Vincent Guittot wrote:
>
> [...]
>
> > > Seems we agree that, when there is no idle time:
> > > - the two 15% tasks will be overestimated
> > > - their utilization will reach 50% after a while
> > >
> > > If I'm not wrong, we will have:
> > > - 30% CPU util in  ~16ms @1024 capacity
> > >~64ms  @256 capacity
> > >
> > > Thus, the tasks will be certainly over-estimated after ~64ms.
> > > Is that correct ?
> >
> > From a pure util_avg pov it's correct
> > But i'd like to weight that a bit with the example below
> >
> > > Now, we can argue that 64ms is a pretty long time and thus it's quite
> > > unlikely we will have no idle for such a long time.
> > >
> > > Still, I'm wondering if we should keep collecting those samples or
> > > better find a way to detect that and skip the sampling.
> >
> > The problem is that you can have util_avg above capacity even with idle time
> > In the 1st example of this thread, the 39ms/80ms task will reach 709
> > which is the value saved by util_est on a big core
> > But on core with half capacity, there is still idle time so 709 is a
> > correct value although above 512
>
> Right, I see your point and (in principle) I like the idea of
> collecting samples for tasks which happen to run at a lower capacity
> than required and whose utilization value makes sense...
>
> > In fact, max will be always above the linear ratio because it's based
> > on geometric series
> >
> > And this is true even with 15.6ms/32ms (same ratio as above) task
> > although the impact is smaller (max value, which should be saved by
> > util est, becomes  587 in this case).
>
> However that's not always the case... as per my example above.
>
> Moreover, we should also consider that util_est is mainly meant to be
> a lower-bound for tasks utilization.
> That's why task_util_est() already returns the actual util_avg when
> it's higher than the estimated utilization.

I can imagine that the fact that we use max(util_avg, util_est) helps
the scheduler keep using a correct utilization when util_avg goes
above the cpu capacity while there is still idle time
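
Roughly what the existing helpers already do (a simplified sketch of the
fair.c code at the time, flag bits and the like omitted):

static unsigned long task_util(struct task_struct *p)
{
        return READ_ONCE(p->se.avg.util_avg);
}

static unsigned long _task_util_est(struct task_struct *p)
{
        struct util_est ue = READ_ONCE(p->se.avg.util_est);

        return max(ue.ewma, ue.enqueued);
}

/* The scheduler consumes the max of both: when util_avg overshoots the
 * estimation (e.g. above the current capacity but with idle time), the
 * higher value wins. */
static unsigned long task_util_est(struct task_struct *p)
{
        return max(task_util(p), _task_util_est(p));
}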

>
> With your new signal and without any special check on samples
> collection, if a task is limited because of thermal capping for
> example, we could end up overestimating its utilization and thus
> perhaps generating an unwanted frequency spike when the capping is
> relaxed... and (even worse) it will take some more activations for the
> estimated utilization to converge back to the actual utilization.
>
> Since we cannot easily know if there is idle time in a CPU when a task
> completes an activation with a utilization higher than the CPU
> capacity, I would prefer to just skip the sampling with
> something like:
>
> ---8<---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9332863d122a..485053026533 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
>  {
> long last_ewma_diff;
> struct util_est ue;
> +   int cpu;
>
> if (!sched_feat(UTIL_EST))
> return;
> @@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> return;
>
> +   /*
> +* To avoid overestimation of actual task utilization, skip updates if
> +* we cannot grant there is idle time in this CPU.
> +*/
> +   cpu = cpu_of(rq_of(cfs_rq));
> +   if (task_util(p) > cpu_capacity(cpu))
> +   return;
> +
> /*
>  * Update Task's estimated utilization
>  *
> ---8<---
>
> At least this will ensure that util_est always provides an actual
> measured lower bound for a task utilization.
>
> If you think this makes sense, feel free to add such a patch on
> top of your series.

ok. I'm going to add it when rebasing the series

Thanks
Vincent
>
> Cheers Patrick
>
> --
> #include 
>
> Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2019-01-10 Thread Patrick Bellasi
On 29-Nov 17:19, Vincent Guittot wrote:
> On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi  wrote:
> > On 29-Nov 11:43, Vincent Guittot wrote:

[...]

> > Seems we agree that, when there is no idle time:
> > - the two 15% tasks will be overestimated
> > - their utilization will reach 50% after a while
> >
> > If I'm not wrong, we will have:
> > - 30% CPU util in  ~16ms @1024 capacity
> >~64ms  @256 capacity
> >
> > Thus, the tasks will be certainly over-estimated after ~64ms.
> > Is that correct ?
> 
> From a pure util_avg pov it's correct
> But i'd like to weight that a bit with the example below
> 
> > Now, we can argue that 64ms is a pretty long time and thus it's quite
> > unlikely we will have no idle for such a long time.
> >
> > Still, I'm wondering if we should keep collecting those samples or
> > better find a way to detect that and skip the sampling.
> 
> The problem is that you can have util_avg above capacity even with idle time
> In the 1st example of this thread, the 39ms/80ms task will reach 709
> which is the value saved by util_est on a big core
> But on core with half capacity, there is still idle time so 709 is a
> correct value although above 512

Right, I see your point and (in principle) I like the idea of
collecting samples for tasks which happen to run at a lower capacity
than required and whose utilization value makes sense...

> In fact, max will be always above the linear ratio because it's based
> on geometric series
> 
> And this is true even with 15.6ms/32ms (same ratio as above) task
> although the impact is smaller (max value, which should be saved by
> util est, becomes  587 in this case).
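
Both figures can be reproduced from the PELT geometric series alone; a
small stand-alone sketch, assuming the standard 32ms half-life and
ignoring the 1ms discretisation:

#include <stdio.h>
#include <math.h>

/*
 * Steady-state peak of the PELT signal for a task running 'run' ms out
 * of every 'period' ms:  u_max = 1024 * (1 - y^run) / (1 - y^period),
 * with y = 0.5^(1/32) for the 32ms half-life.
 */
static double pelt_peak(double run, double period)
{
        double y = pow(0.5, 1.0 / 32.0);

        return 1024.0 * (1.0 - pow(y, run)) / (1.0 - pow(y, period));
}

int main(void)
{
        printf("39ms/80ms    -> %.0f\n", pelt_peak(39.0, 80.0));    /* ~709 */
        printf("15.6ms/32ms  -> %.0f\n", pelt_peak(15.6, 32.0));    /* ~587 */
        /* the 100ms/1s case mentioned later in the thread: ~907,
         * i.e. nowhere near the 10% duty cycle */
        printf("100ms/1000ms -> %.0f\n", pelt_peak(100.0, 1000.0));
        return 0;
}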

However that's not always the case... as per my example above.

Moreover, we should also consider that util_est is mainly meant to be
a lower-bound for tasks utilization.
That's why task_util_est() already returns the actual util_avg when
it's higher than the estimated utilization.

With your new signal and without any special check on samples
collection, if a task is limited because of thermal capping for
example, we could end up overestimating its utilization and thus
perhaps generating an unwanted frequency spike when the capping is
relaxed... and (even worse) it will take some more activations for the
estimated utilization to converge back to the actual utilization.

Since we cannot easily know if there is idle time in a CPU when a task
completes an activation with a utilization higher than the CPU
capacity, I would prefer to just skip the sampling with
something like:

---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9332863d122a..485053026533 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 {
long last_ewma_diff;
struct util_est ue;
+   int cpu;
 
if (!sched_feat(UTIL_EST))
return;
@@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
return;
 
+   /*
+* To avoid overestimation of actual task utilization, skip updates if
+* we cannot grant there is idle time in this CPU.
+*/
+   cpu = cpu_of(rq_of(cfs_rq));
+   if (task_util(p) > cpu_capacity(cpu))
+   return;
+
/*
 * Update Task's estimated utilization
 *
---8<---

At least this will ensure that util_est always provides an actual
measured lower bound for a task utilization.

If you think this makes sense, feel free to add such a patch on
top of your series.

Cheers Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-29 Thread Vincent Guittot
On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi  wrote:
>
> On 29-Nov 11:43, Vincent Guittot wrote:
> > On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi  
> > wrote:
> > > On 28-Nov 16:42, Vincent Guittot wrote:
> > > > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi  
> > > > wrote:
> > > > > On 28-Nov 15:55, Vincent Guittot wrote:
> > > > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi 
> > > > > >  wrote:
> > > > > > > On 28-Nov 14:33, Vincent Guittot wrote:
> > > > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi 
> > > > > > > >  wrote:
> > > > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot 
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Is there anything else that I should do for these patches 
> > > > > > > > > > > ?
> > > > > > > > > >
> > > > > > > > > > IIRC, Morten mention they break util_est; Patrick was going 
> > > > > > > > > > to explain.
> > > > > > > > >
> > > > > > > > > I guess the problem is that, once we cross the current 
> > > > > > > > > capacity,
> > > > > > > > > strictly speaking util_avg does not represent anymore a 
> > > > > > > > > utilization.
> > > > > > > > >
> > > > > > > > > With the new signal this could happen and we end up storing 
> > > > > > > > > estimated
> > > > > > > > > utilization samples which will overestimate the task 
> > > > > > > > > requirements.
> > > > > > > > >
> > > > > > > > > We will have a spike in estimated utilization at next wakeup, 
> > > > > > > > > since we
> > > > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also 
> > > > > > > > > inflate the EWMA in
> > > > > > > > > case we collect multiple samples above the current capacity.
> > > > > > > >
> > > > > > > > TBH I don't see how it's different from current implementation 
> > > > > > > > with a
> > > > > > > > task that was scheduled on big core and now wakes up on little 
> > > > > > > > core.
> > > > > > > > The util_est is overestimated as well.
> > > > > > >
> > > > > > > While running below the capacity of a CPU, either big or LITTLE, 
> > > > > > > we
> > > > > > > can still measure the actual used bandwidth as long as we have 
> > > > > > > idle
> > > > > > > time. If the task is then moved into a lower capacity core, I 
> > > > > > > think
> > > > > > > it's still safe to assume that, likely, it would need more 
> > > > > > > capacity.
> > > > > > >
> > > > > > > Why do you say it's the same ?
> > > > > >
> > > > > > In the example of a task that runs 39ms in period of 80ms that we 
> > > > > > used
> > > > > > during previous version,
> > > > > > the utilization on the big core will reach 709 so will util_est too
> > > > > > When the task migrates on little core (512), util_est is higher than
> > > > > > current cpu capacity
> > > > >
> > > > > Right, and what's the problem ?
> > > >
> > > > you worry about an util_est being higher than capacity which is the 
> > > > case there
> > >
> > > I worry about util_est being higher then the capacity the task WAS
> > > running... not the capacity the task IS running... if that value does
> > > not correspond to what the task really need... (more on that at the
> > > end).
> > >
> > > > > 1) We know that PELT is calibrated to 32ms period task and in your
> > > > >example, since the runtime is higher then the half-life, it's
> > > > >correct to estimate a utilization higher then 50%.
> > > > >
> > > > >PELT utilization is defined _based on the half-life_: thus
> > > > >your task having a 50% duty cycle does not mean we are not correct
> > > > >if report a utilization != 50%.
> > > > >It would be as broken as reporting 10% utilization for a task
> > > > >running 100ms every 1s.
> > > > >
> > > > > 2) If it was a 70% task on a previous activation, once it's moved into
> > > > >a lower capacity CPU it's still correct to assume that it's likely
> > > > >going to require the same bandwidth and thus will be
> > > > >under-provisioned.
> > > > >
> > > > > I still don't see where we are wrong in this case :/
> > > > >
> > > > > To me it looks different then the problem I described.
> > > > >
> > > > > > > With your new signal instead, once we cross the current capacity,
> > > > > > > utilization is just not anymore utilization. Thus, IMHO it make 
> > > > > > > sense
> > > > > > > avoid to accumulate a sample for what we call "estimated 
> > > > > > > utilization".
> > > >
> > > > This is not true. With the example above, the util_est will be exactly 
> > > > the same
> > > >  on big and little cores with the new signal
> > >
> > > ... AFAIU only if we have idle time...
> > >
> > > > > > > I would also say that, with the current implementation which caps
> > > > > > > utilization to the current capacity, we get better estimation in
> > > > > > > general. At least we can say with absolute precision:
> > > > > > >
> > > > > > >"the task needs _at least_ that amount of capacity".
> > > > 

Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-29 Thread Patrick Bellasi
On 29-Nov 13:53, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 11:53:36AM +, Patrick Bellasi wrote:
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac855b2f4774..93e0cf5d8a76 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > if (!task_sleep)
> > return;
> > 
> > +   /* Skip samples which do not represent an actual utilization */
> > +   if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
> > +   return;
> > +
> > /*
> >  * If the PELT values haven't changed since enqueue time,
> >  * skip the util_est update.
> 
> Would you not want something like:
> 
>   min(task_util(p), capacity_of(task_cpu(p)))
> 
> And is this the only place where we need this?

Mmm... even this could be an over-estimation:

I've just posted an example in my last reply to Vincent, end of:

   Message-ID: <20181129150020.GF23094@e110439-lin>
   https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/

> OTOH, if the task is always running, it will be always running
> irrespective of where it runs.

That's not what I'm concerned about. I'm concerned about small tasks
which are running on limited capacity (e.g. due to thermal capping)
without idle time. In this case, the new "utilization" signal could
overestimate the real task needs.

> Not storing these samples seems weird though; this is the exact
> condition you want to record -- the task is very active, if we skip
> these, we'll come back at a low frequency on the next wakeup.

When there is no idle time, we don't know if the reported
utilization, above the cpu capacity, is due to the task being bigger...
or just the new utilization signal converging towards:

100% / RUNNABLE_TASKS_COUNT
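
A stand-alone sketch of that convergence, for two equal tasks
time-sharing a saturated CPU in 1ms slices (32ms half-life assumed):

#include <stdio.h>
#include <math.h>

int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0); /* per-ms decay, 32ms half-life */
        double util = 0.15 * 1024.0;           /* the task's true demand: ~154 */
        int cycle;

        /* Two runnable tasks, no idle time: each one runs 1ms out of
         * every 2ms.  Decay every ms, accumulate only in our own ms. */
        for (cycle = 0; cycle < 200; cycle++) {         /* 200 x 2ms = 400ms */
                util = util * y;                        /* the other task's ms */
                util = util * y + (1.0 - y) * 1024.0;   /* our running ms */
        }

        /* prints ~517: the signal has drifted from ~154 towards
         * 1024 / RUNNABLE_TASKS_COUNT = 512, whatever the real demand */
        printf("util with no idle time: %.0f\n", util);
        return 0;
}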

-- 
#include 

Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-29 Thread Patrick Bellasi
On 29-Nov 11:43, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi  wrote:
> > On 28-Nov 16:42, Vincent Guittot wrote:
> > > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi  
> > > wrote:
> > > > On 28-Nov 15:55, Vincent Guittot wrote:
> > > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi 
> > > > >  wrote:
> > > > > > On 28-Nov 14:33, Vincent Guittot wrote:
> > > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi 
> > > > > > >  wrote:
> > > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot 
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Is there anything else that I should do for these patches ?
> > > > > > > > >
> > > > > > > > > IIRC, Morten mention they break util_est; Patrick was going 
> > > > > > > > > to explain.
> > > > > > > >
> > > > > > > > I guess the problem is that, once we cross the current capacity,
> > > > > > > > strictly speaking util_avg does not represent anymore a 
> > > > > > > > utilization.
> > > > > > > >
> > > > > > > > With the new signal this could happen and we end up storing 
> > > > > > > > estimated
> > > > > > > > utilization samples which will overestimate the task 
> > > > > > > > requirements.
> > > > > > > >
> > > > > > > > We will have a spike in estimated utilization at next wakeup, 
> > > > > > > > since we
> > > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also 
> > > > > > > > inflate the EWMA in
> > > > > > > > case we collect multiple samples above the current capacity.
> > > > > > >
> > > > > > > TBH I don't see how it's different from current implementation 
> > > > > > > with a
> > > > > > > task that was scheduled on big core and now wakes up on little 
> > > > > > > core.
> > > > > > > The util_est is overestimated as well.
> > > > > >
> > > > > > While running below the capacity of a CPU, either big or LITTLE, we
> > > > > > can still measure the actual used bandwidth as long as we have idle
> > > > > > time. If the task is then moved into a lower capacity core, I think
> > > > > > it's still safe to assume that, likely, it would need more capacity.
> > > > > >
> > > > > > Why do you say it's the same ?
> > > > >
> > > > > In the example of a task that runs 39ms in period of 80ms that we used
> > > > > during previous version,
> > > > > the utilization on the big core will reach 709 so will util_est too
> > > > > When the task migrates on little core (512), util_est is higher than
> > > > > current cpu capacity
> > > >
> > > > Right, and what's the problem ?
> > >
> > > you worry about an util_est being higher than capacity which is the case 
> > > there
> >
> > I worry about util_est being higher then the capacity the task WAS
> > running... not the capacity the task IS running... if that value does
> > not correspond to what the task really need... (more on that at the
> > end).
> >
> > > > 1) We know that PELT is calibrated to 32ms period task and in your
> > > >example, since the runtime is higher then the half-life, it's
> > > >correct to estimate a utilization higher then 50%.
> > > >
> > > >PELT utilization is defined _based on the half-life_: thus
> > > >your task having a 50% duty cycle does not mean we are not correct
> > > >if report a utilization != 50%.
> > > >It would be as broken as reporting 10% utilization for a task
> > > >running 100ms every 1s.
> > > >
> > > > 2) If it was a 70% task on a previous activation, once it's moved into
> > > >a lower capacity CPU it's still correct to assume that it's likely
> > > >going to require the same bandwidth and thus will be
> > > >under-provisioned.
> > > >
> > > > I still don't see where we are wrong in this case :/
> > > >
> > > > To me it looks different then the problem I described.
> > > >
> > > > > > With your new signal instead, once we cross the current capacity,
> > > > > > utilization is just not anymore utilization. Thus, IMHO it make 
> > > > > > sense
> > > > > > avoid to accumulate a sample for what we call "estimated 
> > > > > > utilization".
> > >
> > > This is not true. With the example above, the util_est will be exactly 
> > > the same
> > >  on big and little cores with the new signal
> >
> > ... AFAIU only if we have idle time...
> >
> > > > > > I would also say that, with the current implementation which caps
> > > > > > utilization to the current capacity, we get better estimation in
> > > > > > general. At least we can say with absolute precision:
> > > > > >
> > > > > >"the task needs _at least_ that amount of capacity".
> > > > > >
> > > > > > Potentially we can also flag the task as being under-provisioned, in
> > > > > > case there was not idle time, and _let a policy_ decide what to do
> > > > > > with it and the granted information we have.
> > > > > >
> > > > > > While, with your new signal, once we are over the current capacity,
> > > > > > the "utilization" is just a sort of "random" 

Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-29 Thread Peter Zijlstra
On Wed, Nov 28, 2018 at 11:53:36AM +, Patrick Bellasi wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac855b2f4774..93e0cf5d8a76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
>   if (!task_sleep)
>   return;
>  
> + /* Skip samples which do not represent an actual utilization */
> + if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
> + return;
> +
>   /*
>* If the PELT values haven't changed since enqueue time,
>* skip the util_est update.

Would you not want something like:

min(task_util(p), capacity_of(task_cpu(p)))

And is this the only place where we need this?

OTOH, if the task is always running, it will be always running
irrespective of where it runs.

Not storing these samples seems weird though; this is the exact
condition you want to record -- the task is very active, if we skip
these, we'll come back at a low frequency on the next wakeup.
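
For comparison, the clamped variant suggested above would keep the sample
but cap it to the capacity the task was actually granted, something like
(same spot in util_est_dequeue() as the hunk quoted, ue.enqueued being
the dequeue-time sample in that function):

        /* Record a clamped sample instead of dropping it: the stored value
         * can then never exceed what the CPU could actually provide. */
        ue.enqueued = min(task_util(p), capacity_of(task_cpu(p)));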




Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-29 Thread Vincent Guittot
On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi  wrote:
>
> On 28-Nov 16:42, Vincent Guittot wrote:
> > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi  
> > wrote:
> > >
> > > On 28-Nov 15:55, Vincent Guittot wrote:
> > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi  
> > > > wrote:
> > > > >
> > > > > On 28-Nov 14:33, Vincent Guittot wrote:
> > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> > > > > > > >
> > > > > > > > > Is there anything else that I should do for these patches ?
> > > > > > > >
> > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to 
> > > > > > > > explain.
> > > > > > >
> > > > > > > I guess the problem is that, once we cross the current capacity,
> > > > > > > strictly speaking util_avg does not represent anymore a 
> > > > > > > utilization.
> > > > > > >
> > > > > > > With the new signal this could happen and we end up storing 
> > > > > > > estimated
> > > > > > > utilization samples which will overestimate the task requirements.
> > > > > > >
> > > > > > > We will have a spike in estimated utilization at next wakeup, 
> > > > > > > since we
> > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate 
> > > > > > > the EWMA in
> > > > > > > case we collect multiple samples above the current capacity.
> > > > > >
> > > > > > TBH I don't see how it's different from current implementation with 
> > > > > > a
> > > > > > task that was scheduled on big core and now wakes up on little core.
> > > > > > The util_est is overestimated as well.
> > > > >
> > > > > While running below the capacity of a CPU, either big or LITTLE, we
> > > > > can still measure the actual used bandwidth as long as we have idle
> > > > > time. If the task is then moved into a lower capacity core, I think
> > > > > it's still safe to assume that, likely, it would need more capacity.
> > > > >
> > > > > Why do you say it's the same ?
> > > >
> > > > In the example of a task that runs 39ms in period of 80ms that we used
> > > > during previous version,
> > > > the utilization on the big core will reach 709 so will util_est too
> > > > When the task migrates on little core (512), util_est is higher than
> > > > current cpu capacity
> > >
> > > Right, and what's the problem ?
> >
> > you worry about an util_est being higher than capacity which is the case 
> > there
>
> I worry about util_est being higher then the capacity the task WAS
> running... not the capacity the task IS running... if that value does
> not correspond to what the task really need... (more on that at the
> end).
>
> > > 1) We know that PELT is calibrated to 32ms period task and in your
> > >example, since the runtime is higher then the half-life, it's
> > >correct to estimate a utilization higher then 50%.
> > >
> > >PELT utilization is defined _based on the half-life_: thus
> > >your task having a 50% duty cycle does not mean we are not correct
> > >if report a utilization != 50%.
> > >It would be as broken as reporting 10% utilization for a task
> > >running 100ms every 1s.
> > >
> > > 2) If it was a 70% task on a previous activation, once it's moved into
> > >a lower capacity CPU it's still correct to assume that it's likely
> > >going to require the same bandwidth and thus will be
> > >under-provisioned.
> > >
> > > I still don't see where we are wrong in this case :/
> > >
> > > To me it looks different then the problem I described.
> > >
> > > > > With your new signal instead, once we cross the current capacity,
> > > > > utilization is just not anymore utilization. Thus, IMHO it make sense
> > > > > avoid to accumulate a sample for what we call "estimated utilization".
> >
> > This is not true. With the example above, the util_est will be exactly the 
> > same
> >  on big and little cores with the new signal
>
> ... AFAIU only if we have idle time...
>
> > > > > I would also say that, with the current implementation which caps
> > > > > utilization to the current capacity, we get better estimation in
> > > > > general. At least we can say with absolute precision:
> > > > >
> > > > >"the task needs _at least_ that amount of capacity".
> > > > >
> > > > > Potentially we can also flag the task as being under-provisioned, in
> > > > > case there was not idle time, and _let a policy_ decide what to do
> > > > > with it and the granted information we have.
> > > > >
> > > > > While, with your new signal, once we are over the current capacity,
> > > > > the "utilization" is just a sort of "random" number at best useful to
> > > > > drive some conclusions about how long the task has been delayed.
> >
> > see my comment above
> >
> > > > >
> > > > > IOW, I fear that we are embedding a policy within a signal which is
> > > > > currently representing something very well defined: how much 

Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Patrick Bellasi
On 28-Nov 16:42, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi  wrote:
> >
> > On 28-Nov 15:55, Vincent Guittot wrote:
> > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi  
> > > wrote:
> > > >
> > > > On 28-Nov 14:33, Vincent Guittot wrote:
> > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi 
> > > > >  wrote:
> > > > > >
> > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> > > > > > >
> > > > > > > > Is there anything else that I should do for these patches ?
> > > > > > >
> > > > > > > IIRC, Morten mention they break util_est; Patrick was going to 
> > > > > > > explain.
> > > > > >
> > > > > > I guess the problem is that, once we cross the current capacity,
> > > > > > strictly speaking util_avg does not represent anymore a utilization.
> > > > > >
> > > > > > With the new signal this could happen and we end up storing 
> > > > > > estimated
> > > > > > utilization samples which will overestimate the task requirements.
> > > > > >
> > > > > > We will have a spike in estimated utilization at next wakeup, since 
> > > > > > we
> > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate 
> > > > > > the EWMA in
> > > > > > case we collect multiple samples above the current capacity.
> > > > >
> > > > > TBH I don't see how it's different from current implementation with a
> > > > > task that was scheduled on big core and now wakes up on little core.
> > > > > The util_est is overestimated as well.
> > > >
> > > > While running below the capacity of a CPU, either big or LITTLE, we
> > > > can still measure the actual used bandwidth as long as we have idle
> > > > time. If the task is then moved into a lower capacity core, I think
> > > > it's still safe to assume that, likely, it would need more capacity.
> > > >
> > > > Why do you say it's the same ?
> > >
> > > In the example of a task that runs 39ms in period of 80ms that we used
> > > during previous version,
> > > the utilization on the big core will reach 709 so will util_est too
> > > When the task migrates on little core (512), util_est is higher than
> > > current cpu capacity
> >
> > Right, and what's the problem ?
> 
> you worry about an util_est being higher than capacity which is the case there

I worry about util_est being higher than the capacity the task WAS
running at... not the capacity the task IS running at... if that value
does not correspond to what the task really needs... (more on that at
the end).

> > 1) We know that PELT is calibrated to 32ms period task and in your
> >example, since the runtime is higher then the half-life, it's
> >correct to estimate a utilization higher then 50%.
> >
> >PELT utilization is defined _based on the half-life_: thus
> >your task having a 50% duty cycle does not mean we are not correct
> >if report a utilization != 50%.
> >It would be as broken as reporting 10% utilization for a task
> >running 100ms every 1s.
> >
> > 2) If it was a 70% task on a previous activation, once it's moved into
> >a lower capacity CPU it's still correct to assume that it's likely
> >going to require the same bandwidth and thus will be
> >under-provisioned.
> >
> > I still don't see where we are wrong in this case :/
> >
> > To me it looks different then the problem I described.
> >
> > > > With your new signal instead, once we cross the current capacity,
> > > > utilization is just not anymore utilization. Thus, IMHO it make sense
> > > > avoid to accumulate a sample for what we call "estimated utilization".
> 
> This is not true. With the example above, the util_est will be exactly the 
> same
>  on big and little cores with the new signal

... AFAIU only if we have idle time...

> > > > I would also say that, with the current implementation which caps
> > > > utilization to the current capacity, we get better estimation in
> > > > general. At least we can say with absolute precision:
> > > >
> > > >"the task needs _at least_ that amount of capacity".
> > > >
> > > > Potentially we can also flag the task as being under-provisioned, in
> > > > case there was not idle time, and _let a policy_ decide what to do
> > > > with it and the granted information we have.
> > > >
> > > > While, with your new signal, once we are over the current capacity,
> > > > the "utilization" is just a sort of "random" number at best useful to
> > > > drive some conclusions about how long the task has been delayed.
> 
> see my comment above
> 
> > > >
> > > > IOW, I fear that we are embedding a policy within a signal which is
> > > > currently representing something very well defined: how much cpu
> > > > bandwidth a task used. While, latency/under-provisioning policies
> > > > perhaps should be better placed somewhere else.
> > > >
> > > > Perhaps I've missed it in some of the previous discussions:
> > > > have we have considered/discussed this signal-vs-policy aspect ?
> >
> > What's 

Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Vincent Guittot
On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi  wrote:
>
> On 28-Nov 15:55, Vincent Guittot wrote:
> > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi  
> > wrote:
> > >
> > > On 28-Nov 14:33, Vincent Guittot wrote:
> > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi  
> > > > wrote:
> > > > >
> > > > > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> > > > > >
> > > > > > > Is there anything else that I should do for these patches ?
> > > > > >
> > > > > > IIRC, Morten mention they break util_est; Patrick was going to 
> > > > > > explain.
> > > > >
> > > > > I guess the problem is that, once we cross the current capacity,
> > > > > strictly speaking util_avg does not represent anymore a utilization.
> > > > >
> > > > > With the new signal this could happen and we end up storing estimated
> > > > > utilization samples which will overestimate the task requirements.
> > > > >
> > > > > We will have a spike in estimated utilization at next wakeup, since we
> > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the 
> > > > > EWMA in
> > > > > case we collect multiple samples above the current capacity.
> > > >
> > > > TBH I don't see how it's different from current implementation with a
> > > > task that was scheduled on big core and now wakes up on little core.
> > > > The util_est is overestimated as well.
> > >
> > > While running below the capacity of a CPU, either big or LITTLE, we
> > > can still measure the actual used bandwidth as long as we have idle
> > > time. If the task is then moved into a lower capacity core, I think
> > > it's still safe to assume that, likely, it would need more capacity.
> > >
> > > Why do you say it's the same ?
> >
> > In the example of a task that runs 39ms in period of 80ms that we used
> > during previous version,
> > the utilization on the big core will reach 709 so will util_est too
> > When the task migrates on little core (512), util_est is higher than
> > current cpu capacity
>
> Right, and what's the problem ?

you worry about util_est being higher than the capacity, which is the case there

>
> 1) We know that PELT is calibrated to 32ms period task and in your
>example, since the runtime is higher then the half-life, it's
>correct to estimate a utilization higher then 50%.
>
>PELT utilization is defined _based on the half-life_: thus
>your task having a 50% duty cycle does not mean we are not correct
>if report a utilization != 50%.
>It would be as broken as reporting 10% utilization for a task
>running 100ms every 1s.
>
> 2) If it was a 70% task on a previous activation, once it's moved into
>a lower capacity CPU it's still correct to assume that it's likely
>going to require the same bandwidth and thus will be
>under-provisioned.
>
> I still don't see where we are wrong in this case :/
>
> To me it looks different then the problem I described.
>
> > > With your new signal instead, once we cross the current capacity,
> > > utilization is just not anymore utilization. Thus, IMHO it make sense
> > > avoid to accumulate a sample for what we call "estimated utilization".

This is not true. With the example above, the util_est will be exactly
the same on big and little cores with the new signal.

> > >
> > > I would also say that, with the current implementation which caps
> > > utilization to the current capacity, we get better estimation in
> > > general. At least we can say with absolute precision:
> > >
> > >"the task needs _at least_ that amount of capacity".
> > >
> > > Potentially we can also flag the task as being under-provisioned, in
> > > case there was not idle time, and _let a policy_ decide what to do
> > > with it and the granted information we have.
> > >
> > > While, with your new signal, once we are over the current capacity,
> > > the "utilization" is just a sort of "random" number at best useful to
> > > drive some conclusions about how long the task has been delayed.

see my comment above

> > >
> > > IOW, I fear that we are embedding a policy within a signal which is
> > > currently representing something very well defined: how much cpu
> > > bandwidth a task used. While, latency/under-provisioning policies
> > > perhaps should be better placed somewhere else.
> > >
> > > Perhaps I've missed it in some of the previous discussions:
> > > have we have considered/discussed this signal-vs-policy aspect ?
>
> What's your opinion on the above instead ?

It's not a policy but it gives better knowledge about the amount of work done.
I have put below the discussion on this subject from the previous version.

> >
> > With contribution scaling the PELT utilization of a task is a _minimum_
> > utilization. Regardless of where the task is currently/was running (and
> > provided that it doesn't change behaviour) its PELT utilization will
> > approximate its _minimum_ utilization on an idle 1024 capacity CPU.
>
> The main drawback is that the 

Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Patrick Bellasi
On 28-Nov 15:55, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi  wrote:
> >
> > On 28-Nov 14:33, Vincent Guittot wrote:
> > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi  
> > > wrote:
> > > >
> > > > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> > > > >
> > > > > > Is there anything else that I should do for these patches ?
> > > > >
> > > > > IIRC, Morten mention they break util_est; Patrick was going to 
> > > > > explain.
> > > >
> > > > I guess the problem is that, once we cross the current capacity,
> > > > strictly speaking util_avg does not represent anymore a utilization.
> > > >
> > > > With the new signal this could happen and we end up storing estimated
> > > > utilization samples which will overestimate the task requirements.
> > > >
> > > > We will have a spike in estimated utilization at next wakeup, since we
> > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the 
> > > > EWMA in
> > > > case we collect multiple samples above the current capacity.
> > >
> > > TBH I don't see how it's different from current implementation with a
> > > task that was scheduled on big core and now wakes up on little core.
> > > The util_est is overestimated as well.
> >
> > While running below the capacity of a CPU, either big or LITTLE, we
> > can still measure the actual used bandwidth as long as we have idle
> > time. If the task is then moved into a lower capacity core, I think
> > it's still safe to assume that, likely, it would need more capacity.
> >
> > Why do you say it's the same ?
> 
> In the example of a task that runs 39ms in period of 80ms that we used
> during previous version,
> the utilization on the big core will reach 709 so will util_est too
> When the task migrates on little core (512), util_est is higher than
> current cpu capacity

Right, and what's the problem ?

1) We know that PELT is calibrated to a 32ms period task and in your
   example, since the runtime is higher than the half-life, it's
   correct to estimate a utilization higher than 50%.

   PELT utilization is defined _based on the half-life_: thus
   your task having a 50% duty cycle does not mean we are not correct
   if we report a utilization != 50%.
   It would be as broken as reporting 10% utilization for a task
   running 100ms every 1s.

2) If it was a 70% task on a previous activation, once it's moved into
   a lower capacity CPU it's still correct to assume that it's likely
   going to require the same bandwidth and thus will be
   under-provisioned.

I still don't see where we are wrong in this case :/

To me it looks different from the problem I described.

> > With your new signal instead, once we cross the current capacity,
> > utilization is just not anymore utilization. Thus, IMHO it make sense
> > avoid to accumulate a sample for what we call "estimated utilization".
> >
> > I would also say that, with the current implementation which caps
> > utilization to the current capacity, we get better estimation in
> > general. At least we can say with absolute precision:
> >
> >"the task needs _at least_ that amount of capacity".
> >
> > Potentially we can also flag the task as being under-provisioned, in
> > case there was not idle time, and _let a policy_ decide what to do
> > with it and the granted information we have.
> >
> > While, with your new signal, once we are over the current capacity,
> > the "utilization" is just a sort of "random" number at best useful to
> > drive some conclusions about how long the task has been delayed.
> >
> > IOW, I fear that we are embedding a policy within a signal which is
> > currently representing something very well defined: how much cpu
> > bandwidth a task used. While, latency/under-provisioning policies
> > perhaps should be better placed somewhere else.
> >
> > Perhaps I've missed it in some of the previous discussions:
> > have we have considered/discussed this signal-vs-policy aspect ?

What's your opinion on the above instead ?

-- 
#include 

Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Vincent Guittot
On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi  wrote:
>
> On 28-Nov 14:33, Vincent Guittot wrote:
> > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi  
> > wrote:
> > >
> > > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> > > >
> > > > > Is there anything else that I should do for these patches ?
> > > >
> > > > IIRC, Morten mention they break util_est; Patrick was going to explain.
> > >
> > > I guess the problem is that, once we cross the current capacity,
> > > strictly speaking util_avg does not represent anymore a utilization.
> > >
> > > With the new signal this could happen and we end up storing estimated
> > > utilization samples which will overestimate the task requirements.
> > >
> > > We will have a spike in estimated utilization at next wakeup, since we
> > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the 
> > > EWMA in
> > > case we collect multiple samples above the current capacity.
> >
> > TBH I don't see how it's different from current implementation with a
> > task that was scheduled on big core and now wakes up on little core.
> > The util_est is overestimated as well.
>
> While running below the capacity of a CPU, either big or LITTLE, we
> can still measure the actual used bandwidth as long as we have idle
> time. If the task is then moved into a lower capacity core, I think
> it's still safe to assume that, likely, it would need more capacity.
>
> Why do you say it's the same ?

In the example of a task that runs 39ms in a period of 80ms that we used
during the previous version,
the utilization on the big core will reach 709 and so will util_est.
When the task migrates to the little core (512), util_est is higher than
the current cpu capacity.

>
> With your new signal instead, once we cross the current capacity,
> utilization is just not anymore utilization. Thus, IMHO it make sense
> avoid to accumulate a sample for what we call "estimated utilization".
>
> I would also say that, with the current implementation which caps
> utilization to the current capacity, we get better estimation in
> general. At least we can say with absolute precision:
>
>"the task needs _at least_ that amount of capacity".
>
> Potentially we can also flag the task as being under-provisioned, in
> case there was not idle time, and _let a policy_ decide what to do
> with it and the granted information we have.
>
> While, with your new signal, once we are over the current capacity,
> the "utilization" is just a sort of "random" number at best useful to
> drive some conclusions about how long the task has been delayed.
>
> IOW, I fear that we are embedding a policy within a signal which is
> currently representing something very well defined: how much cpu
> bandwidth a task used. While, latency/under-provisioning policies
> perhaps should be better placed somewhere else.
>
> Perhaps I've missed it in some of the previous discussions:
> have we have considered/discussed this signal-vs-policy aspect ?
>
> --
> #include 
>
> Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Patrick Bellasi
On 28-Nov 14:33, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi  wrote:
> >
> > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> > >
> > > > Is there anything else that I should do for these patches ?
> > >
> > > IIRC, Morten mention they break util_est; Patrick was going to explain.
> >
> > I guess the problem is that, once we cross the current capacity,
> > strictly speaking util_avg does not represent anymore a utilization.
> >
> > With the new signal this could happen and we end up storing estimated
> > utilization samples which will overestimate the task requirements.
> >
> > We will have a spike in estimated utilization at next wakeup, since we
> > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA 
> > in
> > case we collect multiple samples above the current capacity.
> 
> TBH I don't see how it's different from current implementation with a
> task that was scheduled on big core and now wakes up on little core.
> The util_est is overestimated as well.

While running below the capacity of a CPU, either big or LITTLE, we
can still measure the actual used bandwidth as long as we have idle
time. If the task is then moved into a lower capacity core, I think
it's still safe to assume that, likely, it would need more capacity.

Why do you say it's the same ?

With your new signal instead, once we cross the current capacity,
utilization is just not a utilization anymore. Thus, IMHO it makes sense
to avoid accumulating a sample for what we call "estimated utilization".

I would also say that, with the current implementation which caps
utilization to the current capacity, we get better estimation in
general. At least we can say with absolute precision:

   "the task needs _at least_ that amount of capacity".

Potentially we can also flag the task as being under-provisioned, in
case there was no idle time, and _let a policy_ decide what to do
with it and the granted information we have.

While, with your new signal, once we are over the current capacity,
the "utilization" is just a sort of "random" number at best useful to
drive some conclusions about how long the task has been delayed.

IOW, I fear that we are embedding a policy within a signal which is
currently representing something very well defined: how much cpu
bandwidth a task used. While, latency/under-provisioning policies
perhaps should be better placed somewhere else.

Perhaps I've missed it in some of the previous discussions:
have we considered/discussed this signal-vs-policy aspect ?

-- 
#include 

Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Vincent Guittot
On Wed, 28 Nov 2018 at 14:33, Vincent Guittot
 wrote:
>
> On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi  wrote:
> >
> > On 28-Nov 11:02, Peter Zijlstra wrote:
> > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> > >
> > > > Is there anything else that I should do for these patches ?
> > >
> > > IIRC, Morten mention they break util_est; Patrick was going to explain.
> >
> > I guess the problem is that, once we cross the current capacity,
> > strictly speaking util_avg does not represent anymore a utilization.
> >
> > With the new signal this could happen and we end up storing estimated
> > utilization samples which will overestimate the task requirements.
> >
> > We will have a spike in estimated utilization at next wakeup, since we
> > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA 
> > in
> > case we collect multiple samples above the current capacity.
>
> TBH I don't see how it's different from current implementation with a
> task that was scheduled on big core and now wakes up on little core.
> The util_est is overestimated as well.
>
> But I'm fine with adding your proposal on to on the patchset
s/on to on/on top of/

>
> >
> > So, a possible fix could be to avoid storing util_est samples if we
> > end up with a utilization above the current capacity.
> >
> > Something like:
> >
> > 8<---
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac855b2f4774..93e0cf5d8a76 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct 
> > task_struct *p, bool task_sleep)
> > if (!task_sleep)
> > return;
> >
> > +   /* Skip samples which do not represent an actual utilization */
> > +   if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
> > +   return;
> > +
> > /*
> >  * If the PELT values haven't changed since enqueue time,
> >  * skip the util_est update.
> > ---8<---
> >
> > Could that work ?
> >
> > Maybe using a new utility function to wrap the new check.
> >
> > --
> > #include 
> >
> > Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Vincent Guittot
On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi  wrote:
>
> On 28-Nov 11:02, Peter Zijlstra wrote:
> > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> >
> > > Is there anything else that I should do for these patches ?
> >
> > IIRC, Morten mention they break util_est; Patrick was going to explain.
>
> I guess the problem is that, once we cross the current capacity,
> strictly speaking util_avg does not represent anymore a utilization.
>
> With the new signal this could happen and we end up storing estimated
> utilization samples which will overestimate the task requirements.
>
> We will have a spike in estimated utilization at next wakeup, since we
> use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in
> case we collect multiple samples above the current capacity.

TBH I don't see how it's different from current implementation with a
task that was scheduled on big core and now wakes up on little core.
The util_est is overestimated as well.

But I'm fine with adding your proposal on to on the patchset

>
> So, a possible fix could be to avoid storing util_est samples if we
> end up with a utilization above the current capacity.
>
> Something like:
>
> 8<---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac855b2f4774..93e0cf5d8a76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct 
> task_struct *p, bool task_sleep)
> if (!task_sleep)
> return;
>
> +   /* Skip samples which do not represent an actual utilization */
> +   if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
> +   return;
> +
> /*
>  * If the PELT values haven't changed since enqueue time,
>  * skip the util_est update.
> ---8<---
>
> Could that work ?
>
> Maybe using a new utility function to wrap the new check.
>
> --
> #include 
>
> Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Patrick Bellasi
On 28-Nov 11:02, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> 
> > Is there anything else that I should do for these patches ?
> 
> IIRC, Morten mention they break util_est; Patrick was going to explain.

I guess the problem is that, once we cross the current capacity,
strictly speaking util_avg no longer represents a utilization.

With the new signal this could happen and we end up storing estimated
utilization samples which will overestimate the task requirements.

We will have a spike in estimated utilization at next wakeup, since we
use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in
case we collect multiple samples above the current capacity.
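
For illustration, a toy user-space sketch of that spike/inflation (the
MAX() at wakeup is from the description above; the 1/4 EWMA weight and
the numbers are assumptions made only for this example):

#include <stdio.h>

#define MAX(a, b)	((a) > (b) ? (a) : (b))

int main(void)
{
	unsigned int ewma = 400;		/* past estimate of a ~40% task */
	const unsigned int capacity = 512;	/* little CPU */

	/* a few dequeues where util_avg ran above the CPU capacity */
	for (int i = 0; i < 4; i++) {
		unsigned int sample = 700;	/* > capacity: not a real utilization */

		printf("next wakeup sees util_est = %u\n", MAX(sample, ewma));
		ewma += (sample - ewma) / 4;	/* the EWMA slowly absorbs the samples */
	}
	printf("ewma ends up at %u, above the %u the CPU can provide\n",
	       ewma, capacity);
	return 0;
}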

So, a possible fix could be to avoid storing util_est samples if we
end up with a utilization above the current capacity.

Something like:

8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac855b2f4774..93e0cf5d8a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	if (!task_sleep)
 		return;
 
+	/* Skip samples which do not represent an actual utilization */
+	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
+		return;
+
 	/*
 	 * If the PELT values haven't changed since enqueue time,
 	 * skip the util_est update.
---8<---

Could that work ?

Maybe using a new utility function to wrap the new check.
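
Something like this, perhaps (the helper name is made up here just to
sketch the idea, it is not part of the posted diff):

	/*
	 * True only when util_avg can still be trusted as a utilization,
	 * i.e. the task had a chance to see idle time on its CPU.
	 */
	static inline bool task_util_is_actual(struct task_struct *p)
	{
		return task_util(p) <= capacity_of(task_cpu(p));
	}

with util_est_dequeue() bailing out early when it returns false.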

-- 
#include 

Patrick Bellasi


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Peter Zijlstra
On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> Is there anything else that I should do for these patches ?

IIRC, Morten mentioned they break util_est; Patrick was going to explain.

But yes, I like them.


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Vincent Guittot
Hi,

On Tue, 20 Nov 2018 at 11:55, Vincent Guittot
 wrote:
>
> The current implementation of load tracking invariance scales the
> contribution with current frequency and uarch performance (only for
> utilization) of the CPU. One main result of this formula is that the
> figures are capped by current capacity of CPU. Another one is that the
> load_avg is not invariant because not scaled with uarch.
>
> The util_avg of a periodic task that runs r time slots every p time slots
> varies in the range :
>
> U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
>
> with U is the max util_avg value = SCHED_CAPACITY_SCALE
>
> At a lower capacity, the range becomes:
>
> U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)
>
> with C reflecting the compute capacity ratio between current capacity and
> max capacity.
>
> so C tries to compensate changes in (1-y^r') but it can't be accurate.
>
> Instead of scaling the contribution value of PELT algo, we should scale the
> running time. The PELT signal aims to track the amount of computation of
> tasks and/or rq so it seems more correct to scale the running time to
> reflect the effective amount of computation done since the last update.
>
> In order to be fully invariant, we need to apply the same amount of
> running time and idle time whatever the current capacity. Because running
> at lower capacity implies that the task will run longer, we have to ensure
> that the same amount of idle time will be applied when system becomes idle
> and no idle time has been "stolen". But reaching the maximum utilization
> value (SCHED_CAPACITY_SCALE) means that the task is seen as an
> always-running task whatever the capacity of the CPU (even at max compute
> capacity). In this case, we can discard this "stolen" idle times which
> becomes meaningless.
>
> In order to achieve this time scaling, a new clock_pelt is created per rq.
> The increase of this clock scales with current capacity when something
> is running on rq and synchronizes with clock_task when rq is idle. With
> this mechanism, we ensure the same running and idle time whatever the
> current capacity. This also enables to simplify the pelt algorithm by
> removing all references of uarch and frequency and applying the same
> contribution to utilization and loads. Furthermore, the scaling is done
> only once per update of clock (update_rq_clock_task()) instead of during
> each update of sched_entities and cfs/rt/dl_rq of the rq like the current
> implementation. This is interesting when cgroup are involved as shown in
> the results below:
>
> On a hikey (octo Arm64 platform).
> Performance cpufreq governor and only shallowest c-state to remove variance
> generated by those power features so we only track the impact of pelt algo.
>
> each test runs 16 times
>
> ./perf bench sched pipe
> (higher is better)
> kernel  tip/sched/core + patch
> ops/secondsops/seconds diff
> cgroup
> root59652(+/- 0.18%)   59876(+/- 0.24%)+0.38%
> level1  55608(+/- 0.27%)   55923(+/- 0.24%)+0.57%
> level2  52115(+/- 0.29%)   52564(+/- 0.22%)+0.86%
>
> hackbench -l 1000
> (lower is better)
> kernel  tip/sched/core + patch
> duration(sec)  duration(sec)diff
> cgroup
> root4.453(+/- 2.37%)   4.383(+/- 2.88%) -1.57%
> level1  4.859(+/- 8.50%)   4.830(+/- 7.07%) -0.60%
> level2  5.063(+/- 9.83%)   4.928(+/- 9.66%) -2.66%
>
> Then, the responsiveness of PELT is improved when CPU is not running at max
> capacity with this new algorithm. I have put below some examples of
> duration to reach some typical load values according to the capacity of the
> CPU with current implementation and with this patch. These values has been
> computed based on the geometric series and the half period value:
>
> Util (%) max capacity  half capacity(mainline)  half capacity(w/ patch)
> 972 (95%)138ms not reachable276ms
> 486 (47.5%)  30ms  138ms 60ms
> 256 (25%)13ms   32ms 26ms
>
> On my hikey (octo Arm64 platform) with schedutil governor, the time to
> reach max OPP when starting from a null utilization, decreases from 223ms
> with current scale invariance down to 121ms with the new algorithm.
>
> Signed-off-by: Vincent Guittot 

Is there anything else that I should do for these patches ?

Regards,
Vincent


Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-28 Thread Vincent Guittot
Hi,

On Tue, 20 Nov 2018 at 11:55, Vincent Guittot
 wrote:
>
> The current implementation of load tracking invariance scales the
> contribution with current frequency and uarch performance (only for
> utilization) of the CPU. One main result of this formula is that the
> figures are capped by current capacity of CPU. Another one is that the
> load_avg is not invariant because not scaled with uarch.
>
> The util_avg of a periodic task that runs r time slots every p time slots
> varies in the range :
>
> U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
>
> with U is the max util_avg value = SCHED_CAPACITY_SCALE
>
> At a lower capacity, the range becomes:
>
> U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)
>
> with C reflecting the compute capacity ratio between current capacity and
> max capacity.
>
> so C tries to compensate changes in (1-y^r') but it can't be accurate.
>
> Instead of scaling the contribution value of PELT algo, we should scale the
> running time. The PELT signal aims to track the amount of computation of
> tasks and/or rq so it seems more correct to scale the running time to
> reflect the effective amount of computation done since the last update.
>
> In order to be fully invariant, we need to apply the same amount of
> running time and idle time whatever the current capacity. Because running
> at lower capacity implies that the task will run longer, we have to ensure
> that the same amount of idle time will be applied when system becomes idle
> and no idle time has been "stolen". But reaching the maximum utilization
> value (SCHED_CAPACITY_SCALE) means that the task is seen as an
> always-running task whatever the capacity of the CPU (even at max compute
> capacity). In this case, we can discard this "stolen" idle times which
> becomes meaningless.
>
> In order to achieve this time scaling, a new clock_pelt is created per rq.
> The increase of this clock scales with current capacity when something
> is running on rq and synchronizes with clock_task when rq is idle. With
> this mechanism, we ensure the same running and idle time whatever the
> current capacity. This also enables to simplify the pelt algorithm by
> removing all references of uarch and frequency and applying the same
> contribution to utilization and loads. Furthermore, the scaling is done
> only once per update of clock (update_rq_clock_task()) instead of during
> each update of sched_entities and cfs/rt/dl_rq of the rq like the current
> implementation. This is interesting when cgroup are involved as shown in
> the results below:
>
> On a hikey (octo Arm64 platform).
> Performance cpufreq governor and only shallowest c-state to remove variance
> generated by those power features so we only track the impact of pelt algo.
>
> each test runs 16 times
>
> ./perf bench sched pipe
> (higher is better)
> kernel  tip/sched/core + patch
> ops/secondsops/seconds diff
> cgroup
> root59652(+/- 0.18%)   59876(+/- 0.24%)+0.38%
> level1  55608(+/- 0.27%)   55923(+/- 0.24%)+0.57%
> level2  52115(+/- 0.29%)   52564(+/- 0.22%)+0.86%
>
> hackbench -l 1000
> (lower is better)
> kernel  tip/sched/core + patch
> duration(sec)  duration(sec)diff
> cgroup
> root4.453(+/- 2.37%)   4.383(+/- 2.88%) -1.57%
> level1  4.859(+/- 8.50%)   4.830(+/- 7.07%) -0.60%
> level2  5.063(+/- 9.83%)   4.928(+/- 9.66%) -2.66%
>
> In addition, the responsiveness of PELT is improved with this new
> algorithm when the CPU is not running at max capacity. I have put below
> some examples of the time needed to reach some typical load values,
> according to the capacity of the CPU, with the current implementation and
> with this patch. These values have been computed based on the geometric
> series and the half-period value:
>
> Util (%)       max capacity    half capacity (mainline)    half capacity (w/ patch)
> 972 (95%)      138ms           not reachable               276ms
> 486 (47.5%)    30ms            138ms                       60ms
> 256 (25%)      13ms            32ms                        26ms
>
> On my hikey (octo Arm64 platform) with the schedutil governor, the time to
> reach the max OPP when starting from a null utilization decreases from
> 223ms with the current scale invariance down to 121ms with the new
> algorithm.
>
> Signed-off-by: Vincent Guittot 

Is there anything else that I should do for these patches?

Regards,
Vincent


[PATCH v7 2/2] sched/fair: update scale invariance of PELT

2018-11-20 Thread Vincent Guittot
The current implementation of load tracking invariance scales the
contribution with the current frequency and uarch performance (the latter
only for utilization) of the CPU. One main result of this formula is that
the figures are capped by the current capacity of the CPU. Another is that
load_avg is not invariant because it is not scaled with uarch.

The util_avg of a periodic task that runs r time slots every p time slots
varies in the range:

U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

where U is the max util_avg value = SCHED_CAPACITY_SCALE

At a lower capacity, the range becomes:

U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)

where C reflects the compute capacity ratio between the current capacity
and the max capacity.

So C tries to compensate for the changes in (1-y^r'), but it cannot be
accurate.
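
For reference, these bounds are just the geometric series summed over
successive activations of the task (y is the per-window PELT decay factor,
with y^32 = 1/2). Written out in LaTeX:

\[
  u_{\max} \;=\; U\,(1 - y^{r}) \sum_{k \ge 0} y^{kp} \;=\; U\,\frac{1 - y^{r}}{1 - y^{p}}
\]
\[
  u(i) \;=\; u_{\max}\, y^{i}, \qquad 0 \le i \le p - r
\]

i.e. the signal peaks at u_max right after a burst of r running windows and
then decays by y per idle window until the next activation.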

Instead of scaling the contribution value of the PELT algorithm, we should
scale the running time. The PELT signal aims to track the amount of
computation of tasks and/or rqs, so it seems more correct to scale the
running time to reflect the effective amount of computation done since the
last update.

In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at a lower capacity implies that the task will run longer, we have to
ensure that the same amount of idle time will be applied when the system
becomes idle and no idle time has been "stolen". But reaching the maximum
utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard this "stolen" idle time, which
becomes meaningless.

In order to achieve this time scaling, a new clock_pelt is created per rq.
This clock advances scaled by the current capacity when something is
running on the rq and synchronizes with clock_task when the rq is idle.
With this mechanism, we ensure the same running and idle time whatever the
current capacity. This also makes it possible to simplify the PELT
algorithm by removing all references to uarch and frequency and by
applying the same contribution to utilization and loads. Furthermore, the
scaling is done only once per clock update (update_rq_clock_task())
instead of during each update of the sched_entities and cfs/rt/dl_rq of
the rq, as in the current implementation. This is interesting when cgroups
are involved, as shown in the results below:

On a hikey (octo Arm64 platform), with the performance cpufreq governor
and only the shallowest c-state enabled to remove the variance generated
by those power features, so we only track the impact of the PELT
algorithm.

Each test runs 16 times.

./perf bench sched pipe
(higher is better)
kernel          tip/sched/core       + patch
                ops/seconds          ops/seconds          diff
cgroup
root            59652 (+/- 0.18%)    59876 (+/- 0.24%)    +0.38%
level1          55608 (+/- 0.27%)    55923 (+/- 0.24%)    +0.57%
level2          52115 (+/- 0.29%)    52564 (+/- 0.22%)    +0.86%

hackbench -l 1000
(lower is better)
kernel          tip/sched/core       + patch
                duration (sec)       duration (sec)       diff
cgroup
root            4.453 (+/- 2.37%)    4.383 (+/- 2.88%)    -1.57%
level1          4.859 (+/- 8.50%)    4.830 (+/- 7.07%)    -0.60%
level2          5.063 (+/- 9.83%)    4.928 (+/- 9.66%)    -2.66%

In addition, the responsiveness of PELT is improved with this new
algorithm when the CPU is not running at max capacity. I have put below
some examples of the time needed to reach some typical load values,
according to the capacity of the CPU, with the current implementation and
with this patch. These values have been computed based on the geometric
series and the half-period value:

Util (%)       max capacity    half capacity (mainline)    half capacity (w/ patch)
972 (95%)      138ms           not reachable               276ms
486 (47.5%)    30ms            138ms                       60ms
256 (25%)      13ms            32ms                        26ms
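
As a quick sanity check, numbers of this kind can be recomputed from the
32ms half-life alone. The snippet below is a plain userspace sketch (not
kernel code); it assumes ~1ms PELT windows and simply inverts
U_eff * (1 - y^t), and its output agrees with the table above to within a
millisecond of rounding:

/* Rough check of the table above: an always-running task follows
 * u(t) = U_eff * (1 - 0.5^(t/32)) with t counted in ~1ms PELT windows.
 * Build with: gcc -O2 -o pelt_check pelt_check.c -lm
 */
#include <math.h>
#include <stdio.h>

#define HALF_LIFE_MS	32.0
#define SCALE		1024.0	/* SCHED_CAPACITY_SCALE */

/* Milliseconds needed for U_eff * (1 - y^t) to reach 'target'. */
static double ms_to_reach(double target, double u_eff)
{
	if (target >= u_eff)
		return INFINITY;	/* "not reachable" */
	return HALF_LIFE_MS * log2(u_eff / (u_eff - target));
}

int main(void)
{
	const double targets[] = { 972, 486, 256 };

	for (int i = 0; i < 3; i++) {
		double u = targets[i];

		printf("%4.0f: max %6.1fms  half/mainline %6.1fms  half/patch %6.1fms\n",
		       u,
		       /* max capacity: mainline and patch behave the same */
		       ms_to_reach(u, SCALE),
		       /* mainline, half capacity: contributions capped at 512 */
		       ms_to_reach(u, SCALE / 2),
		       /* patch, half capacity: PELT time runs at half speed */
		       2 * ms_to_reach(u, SCALE));
	}
	return 0;
}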

On my hikey (octo Arm64 platform) with the schedutil governor, the time to
reach the max OPP when starting from a null utilization decreases from
223ms with the current scale invariance down to 121ms with the new
algorithm.

Signed-off-by: Vincent Guittot 
---
 include/linux/sched.h   |  23 +++---
 kernel/sched/core.c |   1 +
 kernel/sched/deadline.c |   6 +--
 kernel/sched/fair.c |  45 +++-
 kernel/sched/pelt.c |  45 +++-
 kernel/sched/pelt.h | 111 ++--
 kernel/sched/rt.c   |   6 +--
 kernel/sched/sched.h|   5 ++-
 8 files changed, 174 insertions(+), 68 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f8a541..d61a523 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -356,12 +356,6 @@ struct util_est {
  * For cfs_rq, it is the aggregated load_avg of all runnable and
  * blocked sched_entities.
  *
- * load_avg may also take frequency scaling into account:
- *
- *   load_avg = runnable% * scale_load_down(load) * freq%
- *
- * where freq% is the CPU frequency 
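
The clock_pelt mechanism described in the changelog can be reduced to a
small standalone toy model, shown below in plain C. This is only an
illustration: the struct and helper names are invented for the sketch and
the real kernel hunks differ, but it demonstrates the two ideas above,
i.e. busy time is scaled by capacity (and frequency) before being fed to
PELT, and the scaled clock is resynchronized with the task clock when the
rq goes idle.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024ULL

/* Toy rq: only the two clocks we care about, in microseconds. */
struct toy_rq {
	uint64_t clock_task;	/* unscaled task clock */
	uint64_t clock_pelt;	/* scaled clock fed to the PELT decay */
};

/* Account delta_us of busy wall time at the given capacity and frequency. */
static void toy_clock_pelt_busy(struct toy_rq *rq, uint64_t delta_us,
				uint64_t capacity, uint64_t freq)
{
	rq->clock_task += delta_us;

	/* Scale the running time instead of the PELT contribution. */
	delta_us = delta_us * capacity / SCHED_CAPACITY_SCALE;
	delta_us = delta_us * freq / SCHED_CAPACITY_SCALE;

	rq->clock_pelt += delta_us;
}

/* The rq went idle: resync so the whole idle period gets decayed. */
static void toy_clock_pelt_idle(struct toy_rq *rq)
{
	rq->clock_pelt = rq->clock_task;
}

int main(void)
{
	struct toy_rq rq = { 0, 0 };

	/* 10ms of work at half capacity accrues only 5ms of PELT time... */
	toy_clock_pelt_busy(&rq, 10000, 512, 1024);
	printf("busy: clock_task=%" PRIu64 "us clock_pelt=%" PRIu64 "us\n",
	       rq.clock_task, rq.clock_pelt);

	/* ...and the lag is jumped over when the rq goes idle. */
	toy_clock_pelt_idle(&rq);
	printf("idle: clock_task=%" PRIu64 "us clock_pelt=%" PRIu64 "us\n",
	       rq.clock_task, rq.clock_pelt);

	return 0;
}

As the changelog notes, the resync step loses its meaning once the signal
has already saturated at SCHED_CAPACITY_SCALE, which is why the "stolen"
idle time can be discarded in that case.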
