Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Peter Zijlstra
On Wed, Mar 07, 2018 at 03:24:58PM +, Patrick Bellasi wrote:

> Maybe that's the tricky bit: we remove the value we enqueued, _not_
> the current util_avg. Notice we use _task_util_est(p)... with the
> leading "_".

ARGH, ok let me try that again ;-)


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Patrick Bellasi
On 07-Mar 10:39, Peter Zijlstra wrote:
> On Tue, Mar 06, 2018 at 07:58:51PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> > > +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > > + struct task_struct *p)
> > > +{
> > > + unsigned int enqueued;
> > > +
> > > + if (!sched_feat(UTIL_EST))
> > > + return;
> > > +
> > > + /* Update root cfs_rq's estimated utilization */
> > > + enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > > + enqueued += _task_util_est(p);
> > > + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > > +}
> 
> > It appears to me this isn't a stable situation and completely relies on
> > the !nr_running case to recalibrate. If we ensure that doesn't happen
> > for a significant while the sum can run-away, right?
> > 
> > Should we put a max in enqueue to avoid this?
> 
> Thinking about this a bit more; would it make sense to adjust the
> running sum/avg on migration? Something along the lines of:
> 
>   util_avg = se->load_avg / (cfs_rq->load_avg + se->load_avg);
> 
> (which disregards cgroups), because that should more or less be the time
> it ends up running, given the WFQ rule.

I would say it makes sense from a purely mechanistic standpoint, but I'm
not entirely convinced it is useful from a practical standpoint.

First of all, that should be applied only when we migrate to a more
saturated CPU. Otherwise, when migrating to an empty CPU we would set
util_avg = 100%.

Secondly, when we migrate to a saturated CPU, this adjustment will
contribute to under-estimating the task's utilization.
Let's say the task was running on a completely empty CPU, and thus was
able to ramp up without being preempted. That value represents a good
estimate of the task's (most recent) CPU demand.

Now, if on a following activation we wake up the task on an IDLE CPU
with a lot of blocked load, then we will scale down its util_avg and
assume the task is smaller than it really is.
But:

a) if the blocked load does not turn into some task waking up again,
   underestimating the task only introduces further ramp-up latencies

b) if the load is due to really active tasks, the task will be
   preempted and its utilization will indeed be smaller... but we are
   already in a domain where utilization does not tell us anything
   useful for a task... so why bother making it converge sooner?


> That way the disparity between tasks migrating into the CPU at u=1 and
> them going to sleep at u<1 is much smaller and the above sum doesn't run
> away nearly as wild (it still needs some upper bound though).

-- 
#include 

Patrick Bellasi


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Patrick Bellasi
On 07-Mar 13:24, Peter Zijlstra wrote:
> On Wed, Mar 07, 2018 at 11:31:49AM +, Patrick Bellasi wrote:
> > > It appears to me this isn't a stable situation and completely relies on
> > > the !nr_running case to recalibrate. If we ensure that doesn't happen
> > > for a significant while the sum can run-away, right?
> > 
> > By "away" do you mean go over 1024 or overflow the unsigned int storage?
> 
> The latter; I think you can make it arbitrarily large. Have a busy task
> on CPU0, this ensures !nr_running never happens.
> 
> Start a busy task on CPU1, wait for it to hit u=1, then migrate it to
> CPU0, 

At this point util_est(CPU0) = 2048, which is:

   +1024 for the busy running task,
     assuming it has been enqueued with that utilization since the
     beginning
   +1024 for the task newly migrated from CPU1,
     which is enqueued with the value it reached at dequeue time on
     CPU1

> then wait for it to hit u=.5 then kill it,

... but when we kill it, the task is dequeued, and thus we remove
1024.

Maybe that's the tricky bit: we remove the value we enqueued, _not_
the current util_avg. Notice we use _task_util_est(p)... with the
leading "_".

> this effectively adds
> .5 to the enqueued value, repeat indefinitely.

Thus this should not happen.

Basically, the RQ's util_est is the sum of the RUNNABLE tasks' util_est
at their enqueue time... which has been updated at their last dequeue
time, hence the use of the name "dequeued" for both tasks and rqs.
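
In code terms, the pairing looks like this (a condensed sketch of the
v5 helpers quoted above; the local variable names are mine):

    /* util_est_enqueue(): add the task's last-dequeue snapshot */
    enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
    enqueued += _task_util_est(p);
    WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);

    /* util_est_dequeue(): remove that very same snapshot, not task_util(p) */
    dequeued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
    dequeued -= min_t(unsigned int, dequeued, _task_util_est(p));
    WRITE_ONCE(cfs_rq->avg.util_est.enqueued, dequeued);

Since the task-side util_est is refreshed only after this removal, and
only when the task actually sleeps, the value subtracted always matches
the value previously added, so the rq-level sum cannot drift across
migrations.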

Does it make sense now?

-- 
#include 

Patrick Bellasi


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Patrick Bellasi
On 07-Mar 13:26, Peter Zijlstra wrote:
> On Wed, Mar 07, 2018 at 11:47:11AM +, Patrick Bellasi wrote:
> > On 06-Mar 20:02, Peter Zijlstra wrote:
> > > On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> > > > +struct util_est {
> > > > +   unsigned int enqueued;
> > > > +   unsigned int ewma;
> > > > +#define UTIL_EST_WEIGHT_SHIFT  2
> > > > +};
> > > 
> > > > +   ue = READ_ONCE(p->se.avg.util_est);
> > > 
> > > > +   WRITE_ONCE(p->se.avg.util_est, ue);
> > > 
> > > That is actually quite dodgy... and relies on the fact that we have the
> > > 8 byte case in __write_once_size() and __read_once_size()
> > > unconditionally. It then further relies on the compiler DTRT for 32bit
> > > platforms, which is generating 2 32bit loads/stores.
> > >
> > > The advantage is of course that it will use single u64 loads/stores
> > > where available.
> > 
> > Yes, that's mainly an "optimization" for 64bit targets... but perhaps
> > the benefits are negligible.
> > 
> > Do you prefer to keep more "under control" the generated code by using
> > two {READ,WRITE}_ONCEs?

Any specific preference on this previous point?

> > IMO here we can also go with just the WRITE_ONCEs: I don't see a case
> > for the compiler to mangle the load/store, while the WRITE_ONCEs are
> > still required to sync with non rq-lock serialized code.
> > But... maybe I'm missing something... ?
> 
> I'm not sure we rely on READ/WRITE_ONCE() of 64bit variables on 32bit
> targets to be sane anywhere else (we could be, I just don't know).

My understanding is that, since here we are in an rq-lock protected
section, and these vars can only be written within such a section, the
load is a dependency for the store and the compiler cannot screw it up...

> I suspect it all works as expected... but its a tad tricky.

Then let's keep them for the time being... meanwhile I'll try to get
some more "internal" feedback before the next posting.

-- 
#include 

Patrick Bellasi


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Patrick Bellasi
On 06-Mar 19:56, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> > +/**
> > + * Estimation Utilization for FAIR tasks.
> > + *
> > + * Support data structure to track an Exponential Weighted Moving Average
> > + * (EWMA) of a FAIR task's utilization. New samples are added to the moving
> > + * average each time a task completes an activation. Sample's weight is
> > + * chosen so that the EWMA will be relatively insensitive to transient
> > + * changes to the task's workload.
> > + *
> > + * @enqueued: instantaneous estimated utilization of a task/cpu
> > + *  task: the task's util_avg at last task dequeue time
> > + *  cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
> > + *
> > + *  Thus, the util_est.enqueued of a task represents the contribution on the
> > + *  estimated utilization of the CPU where that task is currently enqueued.
> > + *
> > + * @ewma: the Exponential Weighted Moving Average (EWMA) utilization of a task
> > + *  Only for tasks we track a moving average of the past instantaneous
> > + *  estimated utilization. This allows to absorb sporadic drops in
> > + *  utilization of an otherwise almost periodic task.
> > + *
> > + */
> 
> The above comment appears to have whitespace issues; the paragraph
> starting with "Thus" looks indented by one character, for example.

That was actually intentional... I wanted to keep it aligned after the
"@" to better mark paragraphs describing the struct members.

However, I've just noticed that the overall format is not valid sphinx.
Thus, I'll also update it to ensure that the documentation is properly
generated.

-- 
#include 

Patrick Bellasi


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Peter Zijlstra
On Wed, Mar 07, 2018 at 11:47:11AM +, Patrick Bellasi wrote:
> On 06-Mar 20:02, Peter Zijlstra wrote:
> > On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> > > +struct util_est {
> > > + unsigned int enqueued;
> > > + unsigned int ewma;
> > > +#define UTIL_EST_WEIGHT_SHIFT 2
> > > +};
> > 
> > > + ue = READ_ONCE(p->se.avg.util_est);
> > 
> > > + WRITE_ONCE(p->se.avg.util_est, ue);
> > 
> > That is actually quite dodgy... and relies on the fact that we have the
> > 8 byte case in __write_once_size() and __read_once_size()
> > unconditionally. It then further relies on the compiler DTRT for 32bit
> > platforms, which is generating 2 32bit loads/stores.
> >
> > The advantage is of course that it will use single u64 loads/stores
> > where available.
> 
> Yes, that's mainly an "optimization" for 64bit targets... but perhaps
> the benefits are negligible.
> 
> Do you prefer to keep more "under control" the generated code by using
> two {READ,WRITE}_ONCEs?
> 
> IMO here we can also go with just the WRITE_ONCEs: I don't see a case
> for the compiler to mangle the load/store, while the WRITE_ONCEs are
> still required to sync with non rq-lock serialized code.
> But... maybe I'm missing something... ?

I'm not sure we rely on READ/WRITE_ONCE() of 64bit variables on 32bit
targets to be sane anywhere else (we could be, I just don't know).

I suspect it all works as expected... but its a tad tricky.


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Peter Zijlstra
On Wed, Mar 07, 2018 at 11:31:49AM +, Patrick Bellasi wrote:
> > It appears to me this isn't a stable situation and completely relies on
> > the !nr_running case to recalibrate. If we ensure that doesn't happen
> > for a significant while the sum can run-away, right?
> 
> By "away" do you mean go over 1024 or overflow the unsigned int storage?

The latter; I think you can make it arbitrarily large. Have a busy task
on CPU0, this ensures !nr_running never happens.

Start a busy task on CPU1, wait for it to hit u=1, then migrate it to
CPU0, then wait for it to hit u=.5 then kill it, this effectively adds
.5 to the enqueued value, repeat indefinitely.


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Patrick Bellasi
On 06-Mar 20:02, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> > +struct util_est {
> > +   unsigned int enqueued;
> > +   unsigned int ewma;
> > +#define UTIL_EST_WEIGHT_SHIFT  2
> > +};
> 
> > +   ue = READ_ONCE(p->se.avg.util_est);
> 
> > +   WRITE_ONCE(p->se.avg.util_est, ue);
> 
> That is actually quite dodgy... and relies on the fact that we have the
> 8 byte case in __write_once_size() and __read_once_size()
> unconditionally. It then further relies on the compiler DTRT for 32bit
> platforms, which is generating 2 32bit loads/stores.
>
> The advantage is of course that it will use single u64 loads/stores
> where available.

Yes, that's mainly an "optimization" for 64bit targets... but perhaps
the benefits are negligible.

Do you prefer to keep more "under control" the generated code by using
two {READ,WRITE}_ONCEs?

IMO here we can also go with just the WRITE_ONCEs: I don't see a case
for the compiler to mangle the load/store, while the WRITE_ONCEs are
still required to sync with non rq-lock serialized code.
But... maybe I'm missing something... ?
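
For comparison, the two-accesses variant would look roughly like the
following sketch (illustration only, not the posted patch):

    struct util_est ue;

    /*
     * Per-field accesses: each 32bit load/store is well defined on both
     * 32bit and 64bit targets, at the cost of losing the single u64
     * load/store that the whole-struct {READ,WRITE}_ONCE() gives us on
     * 64bit targets.
     */
    ue.enqueued = READ_ONCE(p->se.avg.util_est.enqueued);
    ue.ewma     = READ_ONCE(p->se.avg.util_est.ewma);

    /* ... update ue.enqueued and ue.ewma as in util_est_dequeue() ... */

    WRITE_ONCE(p->se.avg.util_est.enqueued, ue.enqueued);
    WRITE_ONCE(p->se.avg.util_est.ewma, ue.ewma);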

-- 
#include 

Patrick Bellasi


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Patrick Bellasi
On 06-Mar 19:58, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> > +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > +   struct task_struct *p)
> > +{
> > +   unsigned int enqueued;
> > +
> > +   if (!sched_feat(UTIL_EST))
> > +   return;
> > +
> > +   /* Update root cfs_rq's estimated utilization */
> > +   enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > +   enqueued += _task_util_est(p);
> > +   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > +}
> 
> > +static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> > +   struct task_struct *p,
> > +   bool task_sleep)
> > +{
> > +   long last_ewma_diff;
> > +   struct util_est ue;
> > +
> > +   if (!sched_feat(UTIL_EST))
> > +   return;
> > +
> > +   /*
> > +* Update root cfs_rq's estimated utilization
> > +*
> > +* If *p is the last task then the root cfs_rq's estimated utilization
> > +* of a CPU is 0 by definition.
> > +*/
> > +   ue.enqueued = 0;
> > +   if (cfs_rq->nr_running) {
> > +   ue.enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > +   ue.enqueued -= min_t(unsigned int, ue.enqueued,
> > +_task_util_est(p));
> > +   }
> > +   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
> 
> It appears to me this isn't a stable situation and completely relies on
> the !nr_running case to recalibrate. If we ensure that doesn't happen
> for a significant while the sum can run-away, right?

By "away" do you mean go over 1024 or overflow the unsigned int storage?

In the first case, I think we don't care about exceeding 1024 since:
- we cap to capacity_orig_of in cpu_util_est
- by directly reading the cfs_rq->avg.util_est.enqueued we can
  actually detect conditions in which a CPU is over-saturated.

In the second case, with an unsigned int we can enqueue up to a few
million 100% tasks on a single CPU without overflowing.
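
A quick back-of-the-envelope check of that headroom, with each 100%
task contributing SCHED_CAPACITY_SCALE = 1024 to the sum:

    UINT_MAX / 1024 = 4294967295 / 1024 ~= 4.19e6 tasks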

> Should we put a max in enqueue to avoid this?

IMO the capping from the cpu_util_est getter should be enough...

Maybe I'm missing your point here?

-- 
#include 

Patrick Bellasi


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-07 Thread Peter Zijlstra
On Tue, Mar 06, 2018 at 07:58:51PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> > +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > +   struct task_struct *p)
> > +{
> > +   unsigned int enqueued;
> > +
> > +   if (!sched_feat(UTIL_EST))
> > +   return;
> > +
> > +   /* Update root cfs_rq's estimated utilization */
> > +   enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > +   enqueued += _task_util_est(p);
> > +   WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > +}

> It appears to me this isn't a stable situation and completely relies on
> the !nr_running case to recalibrate. If we ensure that doesn't happen
> for a significant while the sum can run-away, right?
> 
> Should we put a max in enqueue to avoid this?

Thinking about this a bit more; would it make sense to adjust the
running sum/avg on migration? Something along the lines of:

  util_avg = se->load_avg / (cfs_rq->load_avg + se->load_avg);

(which disregards cgroups), because that should more or less be the time
it ends up running, given the WFQ rule.

That way the disparity between tasks migrating into the CPU at u=1 and
them going to sleep at u<1 is much smaller and the above sum doesn't run
away nearly as wild (it still needs some upper bound though).
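
In pseudo-code, one possible reading of that adjustment at migration
time could be the following sketch (scaling the share up to
SCHED_CAPACITY_SCALE is my interpretation, and cgroups are ignored as
noted above):

    /*
     * Rough sketch only: on migration, set the task's util_avg to the
     * CPU-time share it would get on the destination rq under the WFQ
     * rule (assumes se->avg.load_avg > 0).
     */
    se->avg.util_avg = (SCHED_CAPACITY_SCALE * se->avg.load_avg) /
                       (cfs_rq->avg.load_avg + se->avg.load_avg);

On an otherwise empty destination rq this degenerates to roughly
SCHED_CAPACITY_SCALE, which is exactly the "migrating to an empty CPU
would set util_avg = 100%" corner case Patrick points out in his reply.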


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-06 Thread Peter Zijlstra
On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> +struct util_est {
> + unsigned int enqueued;
> + unsigned int ewma;
> +#define UTIL_EST_WEIGHT_SHIFT 2
> +};

> + ue = READ_ONCE(p->se.avg.util_est);

> + WRITE_ONCE(p->se.avg.util_est, ue);

That is actually quite dodgy... and relies on the fact that we have the
8 byte case in __write_once_size() and __read_once_size()
unconditionally. It then further relies on the compiler DTRT for 32bit
platforms, which is generating 2 32bit loads/stores.

The advantage is of course that it will use single u64 loads/stores
where available.



Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-06 Thread Peter Zijlstra
On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> + struct task_struct *p)
> +{
> + unsigned int enqueued;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /* Update root cfs_rq's estimated utilization */
> + enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + enqueued += _task_util_est(p);
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> +}

> +static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> + struct task_struct *p,
> + bool task_sleep)
> +{
> + long last_ewma_diff;
> + struct util_est ue;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /*
> +  * Update root cfs_rq's estimated utilization
> +  *
> +  * If *p is the last task then the root cfs_rq's estimated utilization
> +  * of a CPU is 0 by definition.
> +  */
> + ue.enqueued = 0;
> + if (cfs_rq->nr_running) {
> + ue.enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + ue.enqueued -= min_t(unsigned int, ue.enqueued,
> +  _task_util_est(p));
> + }
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);

It appears to me this isn't a stable situation and completely relies on
the !nr_running case to recalibrate. If we ensure that doesn't happen
for a significant while the sum can run-away, right?

Should we put a max in enqueue to avoid this?


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-06 Thread Peter Zijlstra
On Thu, Feb 22, 2018 at 05:01:50PM +, Patrick Bellasi wrote:
> +/**
> + * Estimation Utilization for FAIR tasks.
> + *
> + * Support data structure to track an Exponential Weighted Moving Average
> + * (EWMA) of a FAIR task's utilization. New samples are added to the moving
> + * average each time a task completes an activation. Sample's weight is
> + * chosen so that the EWMA will be relatively insensitive to transient
> + * changes to the task's workload.
> + *
> + * @enqueued: instantaneous estimated utilization of a task/cpu
> + *  task: the task's util_avg at last task dequeue time
> + *  cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
> + *
> + *  Thus, the util_est.enqueued of a task represents the contribution on the
> + *  estimated utilization of the CPU where that task is currently enqueued.
> + *
> + * @ewma: the Exponential Weighted Moving Average (EWMA) utilization of a task
> + *  Only for tasks we track a moving average of the past instantaneous
> + *  estimated utilization. This allows to absorb sporadic drops in
> + *  utilization of an otherwise almost periodic task.
> + *
> + */

The above comment appears to have whitespace issues; the paragraph
starting with "Thus" looks indented by one character, for example.


Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-03-01 Thread Patrick Bellasi
This is missing the below #ifdef guards; adding them here as a note for
the next respin on the list.

On 22-Feb 17:01, Patrick Bellasi wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e1febd252a84..c8526687f107 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5205,6 +5205,23 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
> 

#ifdef CONFIG_SMP

> +static inline unsigned long task_util(struct task_struct *p);
> +static inline unsigned long _task_util_est(struct task_struct *p);
> +
> +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> + struct task_struct *p)
> +{
> + unsigned int enqueued;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /* Update root cfs_rq's estimated utilization */
> + enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + enqueued += _task_util_est(p);
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> +}
> +

#else
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
                                    struct task_struct *p)
{
}
#endif /* CONFIG_SMP */

>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -5257,9 +5274,86 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   if (!se)
>   add_nr_running(rq, 1);
> 
> + util_est_enqueue(&rq->cfs, p);
>   hrtick_update(rq);
>  }
> 
> +/*
> + * Check if a (signed) value is within a specified (unsigned) margin,
> + * based on the observation that:
> + * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
> + *
> + * NOTE: this only works when value + margin < INT_MAX.
> + */
> +static inline bool within_margin(int value, int margin)
> +{
> + return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
> +}
> +
> +static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> + struct task_struct *p,
> + bool task_sleep)
> +{

#ifdef CONFIG_SMP

> + long last_ewma_diff;
> + struct util_est ue;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /*
> +  * Update root cfs_rq's estimated utilization
> +  *
> +  * If *p is the last task then the root cfs_rq's estimated utilization
> +  * of a CPU is 0 by definition.
> +  */
> + ue.enqueued = 0;
> + if (cfs_rq->nr_running) {
> + ue.enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + ue.enqueued -= min_t(unsigned int, ue.enqueued,
> +  _task_util_est(p));
> + }
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
> +
> + /*
> +  * Skip update of task's estimated utilization when the task has not
> +  * yet completed an activation, e.g. being migrated.
> +  */
> + if (!task_sleep)
> + return;
> +
> + /*
> +  * Skip update of task's estimated utilization when its EWMA is
> +  * already ~1% close to its last activation value.
> +  */
> + ue = READ_ONCE(p->se.avg.util_est);
> + ue.enqueued = task_util(p);
> + last_ewma_diff = ue.enqueued - ue.ewma;
> + if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> + return;
> +
> + /*
> +  * Update Task's estimated utilization
> +  *
> +  * When *p completes an activation we can consolidate another sample
> +  * of the task size. This is done by storing the current PELT value
> +  * as ue.enqueued and by using this value to update the Exponential
> +  * Weighted Moving Average (EWMA):
> +  *
> +  *  ewma(t) = w *  task_util(p) + (1-w) * ewma(t-1)
> +  *  = w *  task_util(p) + ewma(t-1)  - w * ewma(t-1)
> +  *  = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> +  *  = w * (  last_ewma_diff) + ewma(t-1)
> +  *  = w * (last_ewma_diff  +  ewma(t-1) / w)
> +  *
> +  * Where 'w' is the weight of new samples, which is configured to be
> +  * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> +  */
> + ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
> + ue.ewma  += last_ewma_diff;
> + ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
> + WRITE_ONCE(p->se.avg.util_est, ue);

#endif /* CONFIG_SMP */

> +}
> +
>  static void set_next_buddy(struct sched_entity *se);
> 
>  /*
> @@ -5316,6 +5410,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   if (!se)
>   sub_nr_running(rq, 1);
> 
> + util_est_dequeue(&rq->cfs, p, task_sleep);
>   hrtick_update(rq);
>  }
> 
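
As a standalone illustration of the EWMA update in util_est_dequeue()
above (plain userspace C, not part of the patch), the shift-based form
converges to a new steady-state value within a handful of activations:

    #include <stdio.h>

    #define UTIL_EST_WEIGHT_SHIFT 2 /* w = 1/4 */

    /* Same arithmetic as the kernel code: ewma = w*sample + (1-w)*ewma */
    static unsigned int ewma_update(unsigned int ewma, unsigned int sample)
    {
        long last_ewma_diff = (long)sample - (long)ewma;

        ewma <<= UTIL_EST_WEIGHT_SHIFT; /* ewma / w          */
        ewma  += last_ewma_diff;        /* + (sample - ewma) */
        ewma >>= UTIL_EST_WEIGHT_SHIFT; /* * w               */
        return ewma;
    }

    int main(void)
    {
        unsigned int ewma = 0;
        int i;

        /* A task that suddenly starts running at u = 1024 (100%) */
        for (i = 0; i < 8; i++) {
            ewma = ewma_update(ewma, 1024);
            printf("activation %d: ewma = %u\n", i + 1, ewma);
        }
        return 0;
    }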

-- 
#include 

Patrick Bellasi


[PATCH v5 1/4] sched/fair: add util_est on top of PELT

2018-02-22 Thread Patrick Bellasi
The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued in a
CPU, this can result in a CPU running a big task as being temporarily
represented as being almost empty. This leads to a race condition where
other tasks can be potentially allocated on a CPU which just started to run
a big task which slept for a relatively long period.

Moreover, the PELT utilization of a task can be updated every [ms], thus
making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.

For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

util_est(task) = max(task::util_avg, f(task::util_avg@dequeue_times))

This allows us to remember, via the function f(task::util_avg@dequeue_times),
how big a task has been reported to be by PELT in its previous activations.

If a task changes its behavior and runs even longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).

The estimated utilization of a cfs_rq is defined only for root ones.
That's because the only sensible consumers of this signal are the
scheduler and schedutil, when looking at the overall CPU utilization due
to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable)

where:

cfs_rq::util_est_runnable = sum(util_est(task))
for each RUNNABLE task on that root cfs_rq
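
Putting those two definitions together, a minimal sketch of the
corresponding getters (for illustration only; the helper names and the
exact capping used by the patch may differ):

    static inline unsigned long task_util_est(struct task_struct *p)
    {
        struct util_est ue = READ_ONCE(p->se.avg.util_est);

        /* util_est(task) = max(task::util_avg, f(util_avg@dequeue)) */
        return max(task_util(p), (unsigned long)max(ue.ewma, ue.enqueued));
    }

    static inline unsigned long cpu_util_est(int cpu)
    {
        struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
        unsigned long util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);

        /* util_est(cfs_rq) = max(cfs_rq::util_avg, util_est_runnable) */
        util_est = max(util_est, READ_ONCE(cfs_rq->avg.util_avg));

        return min(util_est, capacity_orig_of(cpu));
    }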

It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:
 - Tasks: to better support task placement decisions
 - root cfs_rqs: to better support both task placement decisions as
   well as frequency selection

Signed-off-by: Patrick Bellasi 
Reviewed-by: Dietmar Eggemann 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Rafael J. Wysocki 
Cc: Viresh Kumar 
Cc: Paul Turner 
Cc: Vincent Guittot 
Cc: Morten Rasmussen 
Cc: Dietmar Eggemann 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@vger.kernel.org

---
Changes in v5:
 - add documentation for "struct util_est"
 - always use int instead of long whenever possible (Peter)
 - pass cfs_rq to util_est_{en,de}queue (Peter)
 - pass task_sleep to util_est_dequeue
 - use a single WRITE_ONCE at dequeue time
 - add some other missing {READ,WRITE}_ONCE
 - add task_util_est() for code consistency

Changes in v4:
 - rebased on today's tip/sched/core (commit 460e8c3340a2)
 - renamed util_est's "last" into "enqueued"
 - using util_est's "enqueued" for both se and cfs_rqs (Joel)
 - update margin check to use more ASM friendly code (Peter)
 - optimize EWMA updates (Peter)

Changes in v3:
 - rebased on today's tip/sched/core (commit 07881166a892)
 - moved util_est into sched_avg (Peter)
 - use {READ,WRITE}_ONCE() for EWMA updates (Peter)
 - using unsigned int to fit all sched_avg into a single 64B cache line

Changes in v2:
 - rebase on top of v4.15-rc2
 - tested that overhauled PELT code does not affect the util_est
---
 include/linux/sched.h   |  29 +
 kernel/sched/debug.c|   4 ++
 kernel/sched/fair.c | 109 +++-
 kernel/sched/features.h |   5 +++
 4 files changed, 146 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a902e..14b1a6ce6298 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -275,6 +275,34 @@ struct load_weight {
u32 inv_weight;
 };
 
+/**
+ * Estimation Utilization for FAIR tasks.
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. Sample's weight is
+ * chosen so that the EWMA will be relatively insensitive to transient changes
+ * to the task's workload.
+ *
+ * @enqueued: instantaneous estimated utilization of a task/cpu
+ *  task: the task's util_avg at last task dequeue time
+ *  cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
+ *
+ *  Thus, the util_est.enqueued of a task represents the contribution on the
+ *  estima