Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-06-04 Thread Vincent Guittot
On 4 June 2018 at 12:12, Juri Lelli  wrote:
> On 04/06/18 09:14, Vincent Guittot wrote:
>> On 4 June 2018 at 09:04, Juri Lelli  wrote:
>> > Hi Vincent,
>> >
>> > On 04/06/18 08:41, Vincent Guittot wrote:
>> >> On 1 June 2018 at 19:45, Joel Fernandes  wrote:
>> >> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>> >
>> > [...]
>> >
>> >> > IMO I feel it's overkill to account dl_avg when we already have
>> >> > DL's running bandwidth we can use. I understand it may be too
>> >> > instantaneous, but perhaps we
>> >>
>> >> We keep using dl bandwidth which is quite correct for dl needs but
>> >> doesn't reflect how it has disturbed other classes
>> >>
>> >> > can fix CFS's problems within CFS itself and not have to do this kind of
>> >> > extra external accounting ?
>> >
>> > I would also keep accounting for waiting time due to higher prio classes
>> > all inside CFS. My impression, when discussing it with you on IRC, was
>> > that we should be able to do that by not decaying cfs.util_avg when CFS
>> > is preempted (creating a new signal for it). Is not this enough?
>>
>> We don't just want to not decay a signal, but to increase it to
>> reflect the amount of preemption
>
> OK.
>
>> Then, we can't do that in a current signal. So you would like to add
>> another metric in cfs_rq ?
>
> Since it's CFS related, I'd say it should fit in CFS.

It's both dl and cfs, as the goal is to track cfs time preempted by dl.
Tracking it in cfs would mean creating a new struct, whereas some fields of
the avg_dl struct are unused, and it would duplicate some calls to
___update_load_sum, as we already track avg_dl in order to remove
sched_rt_avg_update. And update_dl/rt_rq_load_avg are already called in
fair.c for updating the blocked load.

>
>> The place doesn't really matter to be honest, in cfs_rq or in dl_rq, but
>> you will not avoid adding calls in the dl class to start/stop the
>> accounting of the preemption
>>
>> >
>> > I feel we should try to keep cross-class accounting/interaction at a
>> > minimum.
>>
>> accounting for cross class preemption can't be done without
>> cross-class accounting
>
> Mmm, can't we distinguish in, say, pick_next_task_fair() if prev was of
> higher prio class and act accordingly?

We would not be able to tell the difference between rt, dl and stop
preemption by using only pick_next_task_fair()

Thanks

>
> Thanks,
>
> - Juri


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-06-04 Thread Juri Lelli
On 04/06/18 09:14, Vincent Guittot wrote:
> On 4 June 2018 at 09:04, Juri Lelli  wrote:
> > Hi Vincent,
> >
> > On 04/06/18 08:41, Vincent Guittot wrote:
> >> On 1 June 2018 at 19:45, Joel Fernandes  wrote:
> >> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> >
> > [...]
> >
> >> > IMO I feel it's overkill to account dl_avg when we already have
> >> > DL's running bandwidth we can use. I understand it may be too
> >> > instantaneous, but perhaps we
> >>
> >> We keep using dl bandwidth which is quite correct for dl needs but
> >> doesn't reflect how it has disturbed other classes
> >>
> >> > can fix CFS's problems within CFS itself and not have to do this kind of
> >> > extra external accounting ?
> >
> > I would also keep accounting for waiting time due to higher prio classes
> > all inside CFS. My impression, when discussing it with you on IRC, was
> > that we should be able to do that by not decaying cfs.util_avg when CFS
> > is preempted (creating a new signal for it). Is not this enough?
> 
> We don't just want to not decay a signal, but to increase it to
> reflect the amount of preemption

OK.

> Then, we can't do that in a current signal. So you would like to add
> another metric in cfs_rq ?

Since it's CFS related, I'd say it should fit in CFS.

> The place doesn't really matter to be honest, in cfs_rq or in dl_rq, but
> you will not avoid adding calls in the dl class to start/stop the
> accounting of the preemption
> 
> >
> > I feel we should try to keep cross-class accounting/interaction at a
> > minimum.
> 
> accounting for cross class preemption can't be done without
> cross-class accounting

Mmm, can't we distinguish in, say, pick_next_task_fair() if prev was of
higher prio class and act accordingly?

Thanks,

- Juri


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-06-04 Thread Vincent Guittot
On 4 June 2018 at 09:04, Juri Lelli  wrote:
> Hi Vincent,
>
> On 04/06/18 08:41, Vincent Guittot wrote:
>> On 1 June 2018 at 19:45, Joel Fernandes  wrote:
>> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>
> [...]
>
>> > IMO I feel it's overkill to account dl_avg when we already have DL's
>> > running bandwidth we can use. I understand it may be too instantaneous,
>> > but perhaps we
>>
>> We keep using dl bandwidth which is quite correct for dl needs but
>> doesn't reflect how it has disturbed other classes
>>
>> > can fix CFS's problems within CFS itself and not have to do this kind of
>> > extra external accounting ?
>
> I would also keep accounting for waiting time due to higher prio classes
> all inside CFS. My impression, when discussing it with you on IRC, was
> that we should be able to do that by not decaying cfs.util_avg when CFS
> is preempted (creating a new signal for it). Is not this enough?

We don't just want to not decay a signal, but to increase it to reflect
the amount of preemption
Then, we can't do that in a current signal. So you would like to add
another metric in cfs_rq ?
The place doesn't really matter to be honest, in cfs_rq or in dl_rq, but
you will not avoid adding calls in the dl class to start/stop the
accounting of the preemption

>
> I feel we should try to keep cross-class accounting/interaction at a
> minimum.

accounting for cross class preemption can't be done without
cross-class accounting

Regards,
Vincent

>
> Thanks,
>
> - Juri


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-06-04 Thread Juri Lelli
Hi Vincent,

On 04/06/18 08:41, Vincent Guittot wrote:
> On 1 June 2018 at 19:45, Joel Fernandes  wrote:
> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:

[...]

> > IMO I feel it's overkill to account dl_avg when we already have DL's running
> > bandwidth we can use. I understand it may be too instantaneous, but perhaps we
> 
> We keep using dl bandwidth which is quite correct for dl needs but
> doesn't reflect how it has disturbed other classes
> 
> > can fix CFS's problems within CFS itself and not have to do this kind of
> > extra external accounting ?

I would also keep accounting for waiting time due to higher prio classes
all inside CFS. My impression, when discussing it with you on IRC, was
that we should be able to do that by not decaying cfs.util_avg when CFS
is preempted (creating a new signal for it). Is not this enough?

I feel we should try to keep cross-class accounting/interaction at a
minimum.

Thanks,

- Juri


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-06-04 Thread Vincent Guittot
On 1 June 2018 at 19:45, Joel Fernandes  wrote:
> On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>> > >> >> The example with a RT task described in the cover letter can be
>> > >> >> run with a DL task and will give similar results.
>> > >
>> > > In the cover letter you say:
>> > >
>> > >A rt-app use case which creates an always running cfs thread and a
>> > >rt threads that wakes up periodically with both threads pinned on
>> > >same CPU, show lot of frequency switches of the CPU whereas the CPU
>> > >never goes idles during the test.
>> > >
>> > > I would say that's a quite specific corner case where your always
>> > > running CFS task has never accumulated a util_est sample.
>> > >
>> > > Do we really have these cases in real systems?
>> >
>> > My example is deliberately an extreme one because it's easier to
>> > highlight the problem
>> >
>> > >
>> > > Otherwise, it seems to me that we are trying to solve quite specific
>> > > corner cases by adding a not negligible level of "complexity".
>> >
>> > By complexity, do you mean :
>> >
>> > Taking into account the number of running cfs tasks to choose between
>> > rq->dl.running_bw and avg_dl.util_avg
>> >
>> > I'm preparing a patchset that will provide the cfs waiting time in
>> > addition to dl/rt util_avg for almost no additional cost. I will try
>> > to send the proposal later today
>>
>>
>> The code below adds the tracking of the waiting level of cfs tasks because of
>> rt/dl preemption. This waiting time can then be used when selecting an OPP
>> instead of the dl util_avg which could become higher than dl bandwidth with
>> "long" runtime
>>
>> We need only one new call, for the 1st cfs task that is enqueued, to get
>> these additional metrics.
>> The call to arch_scale_cpu_capacity() can be removed once the latter is
>> taken into account when computing the load (which currently scales only
>> with frequency).
>>
>> For rt tasks, we must keep taking util_avg into account to have an idea
>> of the rt level on the cpu, which for dl is given by the bandwidth.
>>
>> ---
>>  kernel/sched/fair.c  | 27 +++
>>  kernel/sched/pelt.c  |  8 ++--
>>  kernel/sched/sched.h |  4 +++-
>>  3 files changed, 36 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eac1f9a..1682ea7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
>>  }
>>  #endif
>>
>> +static inline void update_cfs_wait_util_avg(struct rq *rq)
>> +{
>> +	/*
>> +	 * If cfs is already enqueued, we don't have anything to do because
>> +	 * we already updated the non waiting time
>> +	 */
>> +	if (rq->cfs.h_nr_running)
>> +		return;
>> +
>> +	/*
>> +	 * If rt is running, we update the non waiting time before increasing
>> +	 * cfs.h_nr_running
>> +	 */
>> +	if (rq->curr->sched_class == &rt_sched_class)
>> +		update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +
>> +	/*
>> +	 * If dl is running, we update the non waiting time before increasing
>> +	 * cfs.h_nr_running
>> +	 */
>> +	if (rq->curr->sched_class == &dl_sched_class)
>> +		update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +}
>> +
>> +
>
> Please correct me if I'm wrong but the CFS preemption-decay happens in
> set_next_entity -> update_load_avg when the CFS task is scheduled again after
> the preemption. Then can we not fix this issue by doing our UTIL_EST magic
> from set_next_entity? But yeah probably we need to be careful with overhead..

util_est is there to keep track of the last max. I'm not sure that
trying to add some magic to take preemption into account is the right
way to do it. Mixing several kinds of information in the same metric
just adds more fuzziness to its meaning.

>
> IMO I feel it's overkill to account dl_avg when we already have DL's running
> bandwidth we can use. I understand it may be too instantaneous, but perhaps we

We keep using the dl bandwidth, which is quite correct for dl's own needs,
but it doesn't reflect how dl has disturbed the other classes

> can fix CFS's problems within CFS itself and not have to do this kind of
> extra external accounting ?
>
> I also feel it's better if we don't have to call update_{rt,dl}_rq_load_avg
> from within the CFS class as being done above.
>
> thanks,
>
>  - Joel
>


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-06-01 Thread Joel Fernandes
On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> > >> >> The example with a RT task described in the cover letter can be
> > >> >> run with a DL task and will give similar results.
> > >
> > > In the cover letter you say:
> > >
> > >A rt-app use case which creates an always running cfs thread and a
> > >rt threads that wakes up periodically with both threads pinned on
> > >same CPU, show lot of frequency switches of the CPU whereas the CPU
> > >never goes idles during the test.
> > >
> > > I would say that's a quite specific corner case where your always
> > > running CFS task has never accumulated a util_est sample.
> > >
> > > Do we really have these cases in real systems?
> > 
> > My example is deliberately an extreme one because it's easier to
> > highlight the problem
> > 
> > >
> > > Otherwise, it seems to me that we are trying to solve quite specific
> > > corner cases by adding a not negligible level of "complexity".
> > 
> > By complexity, do you mean :
> > 
> > Taking into account the number of running cfs tasks to choose between
> > rq->dl.running_bw and avg_dl.util_avg
> > 
> > I'm preparing a patchset that will provide the cfs waiting time in
> > addition to dl/rt util_avg for almost no additional cost. I will try
> > to send the proposal later today
> 
> 
> The code below adds the tracking of the waiting level of cfs tasks because of
> rt/dl preemption. This waiting time can then be used when selecting an OPP
> instead of the dl util_avg which could become higher than dl bandwidth with
> "long" runtime
> 
> We need only one new call, for the 1st cfs task that is enqueued, to get
> these additional metrics.
> The call to arch_scale_cpu_capacity() can be removed once the latter is
> taken into account when computing the load (which currently scales only
> with frequency).
> 
> For rt tasks, we must keep taking util_avg into account to have an idea
> of the rt level on the cpu, which for dl is given by the bandwidth.
> 
> ---
>  kernel/sched/fair.c  | 27 +++
>  kernel/sched/pelt.c  |  8 ++--
>  kernel/sched/sched.h |  4 +++-
>  3 files changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eac1f9a..1682ea7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>  
> +static inline void update_cfs_wait_util_avg(struct rq *rq)
> +{
> +	/*
> +	 * If cfs is already enqueued, we don't have anything to do because
> +	 * we already updated the non waiting time
> +	 */
> +	if (rq->cfs.h_nr_running)
> +		return;
> +
> +	/*
> +	 * If rt is running, we update the non waiting time before increasing
> +	 * cfs.h_nr_running
> +	 */
> +	if (rq->curr->sched_class == &rt_sched_class)
> +		update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> +
> +	/*
> +	 * If dl is running, we update the non waiting time before increasing
> +	 * cfs.h_nr_running
> +	 */
> +	if (rq->curr->sched_class == &dl_sched_class)
> +		update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
> +}
> +
> +

Please correct me if I'm wrong but the CFS preemption-decay happens in
set_next_entity -> update_load_avg when the CFS task is scheduled again after
the preemption. Then can we not fix this issue by doing our UTIL_EST magic
from set_next_entity? But yeah probably we need to be careful with overhead..

IMO I feel it's overkill to account dl_avg when we already have DL's running
bandwidth we can use. I understand it may be too instantaneous, but perhaps we
can fix CFS's problems within CFS itself and not have to do this kind of
extra external accounting ?

I also feel it's better if we don't have to call update_{rt,dl}_rq_load_avg
from within the CFS class as being done above.

thanks,

 - Joel



Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-06-01 Thread Vincent Guittot
On Thursday 31 May 2018 at 15:02:04 (+0200), Vincent Guittot wrote:
> On 31 May 2018 at 12:27, Patrick Bellasi  wrote:
> >
> > Hi Vincent, Juri,
> >
> > On 28-May 18:34, Vincent Guittot wrote:
> >> On 28 May 2018 at 17:22, Juri Lelli  wrote:
> >> > On 28/05/18 16:57, Vincent Guittot wrote:
> >> >> Hi Juri,
> >> >>
> >> >> On 28 May 2018 at 12:12, Juri Lelli  wrote:
> >> >> > Hi Vincent,
> >> >> >
> >> >> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> >> >> Now that we have both the dl class bandwidth requirement and the
> >> >> >> dl class utilization, we can use the max of the 2 values when
> >> >> >> aggregating the utilization of the CPU.
> >> >> >>
> >> >> >> Signed-off-by: Vincent Guittot 
> >> >> >> ---
> >> >> >>  kernel/sched/sched.h | 6 +-
> >> >> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >> >> >>
> >> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> >> >> index 4526ba6..0eb07a8 100644
> >> >> >> --- a/kernel/sched/sched.h
> >> >> >> +++ b/kernel/sched/sched.h
> >> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >> >> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >> >> >>  static inline unsigned long cpu_util_dl(struct rq *rq)
> >> >> >>  {
> >> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> >
> >> >> > I'd be tempted to say the we actually want to cap to this one above
> >> >> > instead of using the max (as you are proposing below) or the
> >> >> > (theoretical) power reduction benefits of using DEADLINE for certain
> >> >> > tasks might vanish.
> >> >>
> >> >> The problem that I'm facing is that the sched_entity bandwidth is
> >> >> removed after the 0-lag time and the rq->dl.running_bw goes back to
> >> >> zero but if the DL task has preempted a CFS task, the utilization of
> >> >> the CFS task will be lower than reality and schedutil will set a lower
> >> >> OPP whereas the CPU is always running.
> >
> > With UTIL_EST enabled I don't expect an OPP reduction below the
> > expected utilization of a CFS task.
> 
> I'm not sure I fully catch what you mean, but all the tests that I ran
> were using util_est (which is enabled by default if I'm not wrong). So
> all OPP drops that have been observed were with util_est
> 
> >
> > IOW, when a periodic CFS task is preempted by a DL one, what we use
> > for OPP selection once the DL task is over is still the estimated
> > utilization for the CFS task itself. Thus, schedutil will eventually
> > (since we have quite conservative down scaling thresholds) go down to
> > the right OPP to serve that task.
> >
> >> >> The example with a RT task described in the cover letter can be
> >> >> run with a DL task and will give similar results.
> >
> > In the cover letter you say:
> >
> >An rt-app use case which creates an always-running cfs thread and an
> >rt thread that wakes up periodically, with both threads pinned on the
> >same CPU, shows a lot of frequency switches of the CPU whereas the CPU
> >never goes idle during the test.
> >
> > I would say that's a quite specific corner case where your always
> > running CFS task has never accumulated a util_est sample.
> >
> > Do we really have these cases in real systems?
> 
> My example is deliberately an extreme one because it makes the
> problem easier to highlight
> 
> >
> > Otherwise, it seems to me that we are trying to solve quite specific
> > corner cases by adding a non-negligible level of "complexity".
> 
> By complexity, do you mean:
> 
> Taking into account the number of running cfs tasks to choose between
> rq->dl.running_bw and avg_dl.util_avg?
> 
> I'm preparing a patchset that will provide the cfs waiting time in
> addition to the dl/rt util_avg for almost no additional cost. I will try
> to send the proposal later today


The code below adds tracking of the waiting level of cfs tasks due to
rt/dl preemption. This waiting time can then be used when selecting an OPP,
instead of the dl util_avg which could become higher than the dl bandwidth
with a "long" runtime.

We need only one new call, for the 1st cfs task that is enqueued, to get
these additional metrics.
The call to arch_scale_cpu_capacity() can be removed once the latter is
taken into account when computing the load (which currently scales only
with frequency).

For rt tasks, we must keep taking util_avg into account to have an idea of
the rt level on the cpu; for dl that level is given by the bandwidth.

---
 kernel/sched/fair.c  | 27 +++
 kernel/sched/pelt.c  |  8 ++--
 kernel/sched/sched.h |  4 +++-
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eac1f9a..1682ea7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static 

Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-31 Thread Vincent Guittot
On 31 May 2018 at 12:27, Patrick Bellasi  wrote:
>
> Hi Vincent, Juri,
>
> On 28-May 18:34, Vincent Guittot wrote:
>> On 28 May 2018 at 17:22, Juri Lelli  wrote:
>> > On 28/05/18 16:57, Vincent Guittot wrote:
>> >> Hi Juri,
>> >>
>> >> On 28 May 2018 at 12:12, Juri Lelli  wrote:
>> >> > Hi Vincent,
>> >> >
>> >> > On 25/05/18 15:12, Vincent Guittot wrote:
>> >> >> Now that we have both the dl class bandwidth requirement and the dl 
>> >> >> class
>> >> >> utilization, we can use the max of the 2 values when aggregating the
>> >> >> utilization of the CPU.
>> >> >>
>> >> >> Signed-off-by: Vincent Guittot 
>> >> >> ---
>> >> >>  kernel/sched/sched.h | 6 +-
>> >> >>  1 file changed, 5 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> >> index 4526ba6..0eb07a8 100644
>> >> >> --- a/kernel/sched/sched.h
>> >> >> +++ b/kernel/sched/sched.h
>> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct 
>> >> >> rq *rq, unsigned int flags) {}
>> >> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> >> >>  static inline unsigned long cpu_util_dl(struct rq *rq)
>> >> >>  {
>> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) 
>> >> >> >> BW_SHIFT;
>> >> >
>> >> > I'd be tempted to say that we actually want to cap to this one above
>> >> > instead of using the max (as you are proposing below) or the
>> >> > (theoretical) power reduction benefits of using DEADLINE for certain
>> >> > tasks might vanish.
>> >>
>> >> The problem that I'm facing is that the sched_entity bandwidth is
>> >> removed after the 0-lag time and the rq->dl.running_bw goes back to
>> >> zero but if the DL task has preempted a CFS task, the utilization of
>> >> the CFS task will be lower than reality and schedutil will set a lower
>> >> OPP whereas the CPU is always running.
>
> With UTIL_EST enabled I don't expect an OPP reduction below the
> expected utilization of a CFS task.

I'm not sure I fully catch what you mean, but all the tests that I ran
were using util_est (which is enabled by default if I'm not wrong). So
all OPP drops that have been observed were with util_est

>
> IOW, when a periodic CFS task is preempted by a DL one, what we use
> for OPP selection once the DL task is over is still the estimated
> utilization for the CFS task itself. Thus, schedutil will eventually
> (since we have quite conservative down scaling thresholds) go down to
> the right OPP to serve that task.
>
>> >> The example with a RT task described in the cover letter can be
>> >> run with a DL task and will give similar results.
>
> In the cover letter you say:
>
>An rt-app use case which creates an always-running cfs thread and an
>rt thread that wakes up periodically, with both threads pinned on the
>same CPU, shows a lot of frequency switches of the CPU whereas the CPU
>never goes idle during the test.
>
> I would say that's a quite specific corner case where your always
> running CFS task has never accumulated a util_est sample.
>
> Do we really have these cases in real systems?

My example is deliberately an extreme one because it makes the
problem easier to highlight

>
> Otherwise, it seems to me that we are trying to solve quite specific
> corner cases by adding a non-negligible level of "complexity".

By complexity, do you mean:

Taking into account the number of running cfs tasks to choose between
rq->dl.running_bw and avg_dl.util_avg?

I'm preparing a patchset that will provide the cfs waiting time in
addition to the dl/rt util_avg for almost no additional cost. I will try
to send the proposal later today

>
> Moreover, I also have the impression that we can fix these
> use-cases by:
>
>   - improving the way we accumulate samples in util_est
> i.e. by discarding preemption time
>
>   - maybe by improving the utilization aggregation in schedutil to
> better understand DL requirements
> i.e. a 10% utilization with a 100ms running time is way different
> than the same utilization with a 1ms running time
>
>
> --
> #include 
>
> Patrick Bellasi



Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-31 Thread Patrick Bellasi


Hi Vincent, Juri,

On 28-May 18:34, Vincent Guittot wrote:
> On 28 May 2018 at 17:22, Juri Lelli  wrote:
> > On 28/05/18 16:57, Vincent Guittot wrote:
> >> Hi Juri,
> >>
> >> On 28 May 2018 at 12:12, Juri Lelli  wrote:
> >> > Hi Vincent,
> >> >
> >> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> >> Now that we have both the dl class bandwidth requirement and the dl 
> >> >> class
> >> >> utilization, we can use the max of the 2 values when aggregating the
> >> >> utilization of the CPU.
> >> >>
> >> >> Signed-off-by: Vincent Guittot 
> >> >> ---
> >> >>  kernel/sched/sched.h | 6 +-
> >> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> >> index 4526ba6..0eb07a8 100644
> >> >> --- a/kernel/sched/sched.h
> >> >> +++ b/kernel/sched/sched.h
> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq 
> >> >> *rq, unsigned int flags) {}
> >> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >> >>  static inline unsigned long cpu_util_dl(struct rq *rq)
> >> >>  {
> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) 
> >> >> >> BW_SHIFT;
> >> >
> >> > I'd be tempted to say that we actually want to cap to this one above
> >> > instead of using the max (as you are proposing below) or the
> >> > (theoretical) power reduction benefits of using DEADLINE for certain
> >> > tasks might vanish.
> >>
> >> The problem that I'm facing is that the sched_entity bandwidth is
> >> removed after the 0-lag time and the rq->dl.running_bw goes back to
> >> zero but if the DL task has preempted a CFS task, the utilization of
> >> the CFS task will be lower than reality and schedutil will set a lower
> >> OPP whereas the CPU is always running.

With UTIL_EST enabled I don't expect an OPP reduction below the
expected utilization of a CFS task.

IOW, when a periodic CFS task is preempted by a DL one, what we use
for OPP selection once the DL task is over is still the estimated
utilization for the CFS task itself. Thus, schedutil will eventually
(since we have quite conservative down scaling thresholds) go down to
the right OPP to serve that task.

> >> The example with a RT task described in the cover letter can be
> >> run with a DL task and will give similar results.

In the cover letter you say:

   An rt-app use case which creates an always-running cfs thread and an
   rt thread that wakes up periodically, with both threads pinned on the
   same CPU, shows a lot of frequency switches of the CPU whereas the CPU
   never goes idle during the test.

I would say that's a quite specific corner case where your always
running CFS task has never accumulated a util_est sample.

Do we really have these cases in real systems?

Otherwise, it seems to me that we are trying to solve quite specific
corner cases by adding a non-negligible level of "complexity".

Moreover, I also have the impression that we can fix these
use-cases by:

  - improving the way we accumulate samples in util_est
i.e. by discarding preemption time

  - maybe by improving the utilization aggregation in schedutil to
better understand DL requirements
i.e. a 10% utilization with a 100ms running time is way different
than the same utilization with a 1ms running time


-- 
#include 

Patrick Bellasi



Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-30 Thread Juri Lelli
On 30/05/18 09:37, Quentin Perret wrote:
> On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:
> > On 29/05/18 09:40, Quentin Perret wrote:
> > > Hi Vincent,
> > > 
> > > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > > > Now that we have both the dl class bandwidth requirement and the dl 
> > > > class
> > > > utilization, we can use the max of the 2 values when aggregating the
> > > > utilization of the CPU.
> > > > 
> > > > Signed-off-by: Vincent Guittot 
> > > > ---
> > > >  kernel/sched/sched.h | 6 +-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > index 4526ba6..0eb07a8 100644
> > > > --- a/kernel/sched/sched.h
> > > > +++ b/kernel/sched/sched.h
> > > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq 
> > > > *rq, unsigned int flags) {}
> > > >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > >  static inline unsigned long cpu_util_dl(struct rq *rq)
> > > >  {
> > > > -   return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > > +   unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) 
> > > > >> BW_SHIFT;
> > > > +
> > > > +   util = max_t(unsigned long, util, 
> > > > READ_ONCE(rq->avg_dl.util_avg));
> > > 
> > > Would it make sense to use a UTIL_EST version of that signal here ? I
> > > don't think that would make sense for the RT class with your patch-set
> > > since you only really use the blocked part of the signal for RT IIUC,
> > > but would that work for DL ?
> > 
> > Well, UTIL_EST for DL looks pretty much what we already do by computing
> > utilization based on dl.running_bw. That's why I was thinking of using
> > that as a starting point for dl.util_avg decay phase.
> 
> Hmmm I see your point, but running_bw and the util_avg are fundamentally
> different ... I mean, the util_avg doesn't know about the period, which is
> an issue in itself I guess ...
> 
> If you have a long running DL task (say 100ms runtime) with a long period
> (say 1s), the running_bw should represent ~1/10 of the CPU capacity, but
> the util_avg can go quite high, which means that you might end up
> executing this task at max OPP. So if we really want to drive OPPs like
> that for deadline, a util_est-like version of this util_avg signal
> should help. Now, you can also argue that going to max OPP for a task
> that _we know_ uses 1/10 of the CPU capacity isn't right ...

Yep, that's my point. :)



Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-30 Thread Quentin Perret
On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:
> On 29/05/18 09:40, Quentin Perret wrote:
> > Hi Vincent,
> > 
> > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > > Now that we have both the dl class bandwidth requirement and the dl class
> > > utilization, we can use the max of the 2 values when aggregating the
> > > utilization of the CPU.
> > > 
> > > Signed-off-by: Vincent Guittot 
> > > ---
> > >  kernel/sched/sched.h | 6 +-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 4526ba6..0eb07a8 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq 
> > > *rq, unsigned int flags) {}
> > >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > >  static inline unsigned long cpu_util_dl(struct rq *rq)
> > >  {
> > > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
> > > BW_SHIFT;
> > > +
> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > 
> > Would it make sense to use a UTIL_EST version of that signal here ? I
> > don't think that would make sense for the RT class with your patch-set
> > since you only really use the blocked part of the signal for RT IIUC,
> > but would that work for DL ?
> 
> Well, UTIL_EST for DL looks pretty much what we already do by computing
> utilization based on dl.running_bw. That's why I was thinking of using
> that as a starting point for dl.util_avg decay phase.

Hmmm I see your point, but running_bw and the util_avg are fundamentally
different ... I mean, the util_avg doesn't know about the period, which is
an issue in itself I guess ...

If you have a long running DL task (say 100ms runtime) with a long period
(say 1s), the running_bw should represent ~1/10 of the CPU capacity, but
the util_avg can go quite high, which means that you might end up
executing this task at max OPP. So if we really want to drive OPPs like
that for deadline, a util_est-like version of this util_avg signal
should help. Now, you can also argue that going to max OPP for a task
that _we know_ uses 1/10 of the CPU capacity isn't right ...


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-30 Thread Quentin Perret
On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:
> On 29/05/18 09:40, Quentin Perret wrote:
> > Hi Vincent,
> > 
> > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > > Now that we have both the dl class bandwidth requirement and the dl class
> > > utilization, we can use the max of the 2 values when agregating the
> > > utilization of the CPU.
> > > 
> > > Signed-off-by: Vincent Guittot 
> > > ---
> > >  kernel/sched/sched.h | 6 +-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 4526ba6..0eb07a8 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq 
> > > *rq, unsigned int flags) {}
> > >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > >  static inline unsigned long cpu_util_dl(struct rq *rq)
> > >  {
> > > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
> > > BW_SHIFT;
> > > +
> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > 
> > Would it make sense to use a UTIL_EST version of that signal here ? I
> > don't think that would make sense for the RT class with your patch-set
> > since you only really use the blocked part of the signal for RT IIUC,
> > but would that work for DL ?
> 
> Well, UTIL_EST for DL looks pretty much what we already do by computing
> utilization based on dl.running_bw. That's why I was thinking of using
> that as a starting point for dl.util_avg decay phase.

Hmmm I see your point, but running_bw and the util_avg are fundamentally
different ... I mean, the util_avg doesn't know about the period, which is
an issue in itself I guess ...

If you have a long-running DL task (say 100ms runtime) with a long period
(say 1s), the running_bw should represent ~1/10 of the CPU capacity, but
the util_avg can go quite high, which means that you might end up
executing this task at max OPP. So if we really want to drive OPPs like
that for deadline, a util_est-like version of this util_avg signal
should help. Now, you can also argue that going to max OPP for a task
that _we know_ uses 1/10 of the CPU capacity isn't right ...


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-29 Thread Juri Lelli
On 29/05/18 09:40, Quentin Perret wrote:
> Hi Vincent,
> 
> On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > Now that we have both the dl class bandwidth requirement and the dl class
> > utilization, we can use the max of the 2 values when aggregating the
> > utilization of the CPU.
> > 
> > Signed-off-by: Vincent Guittot 
> > ---
> >  kernel/sched/sched.h | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 4526ba6..0eb07a8 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq 
> > *rq, unsigned int flags) {}
> >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >  static inline unsigned long cpu_util_dl(struct rq *rq)
> >  {
> > -   return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > +   unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
> > BW_SHIFT;
> > +
> > +   util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> 
> Would it make sense to use a UTIL_EST version of that signal here ? I
> don't think that would make sense for the RT class with your patch-set
> since you only really use the blocked part of the signal for RT IIUC,
> but would that work for DL ?

Well, UTIL_EST for DL looks pretty much what we already do by computing
utilization based on dl.running_bw. That's why I was thinking of using
that as a starting point for dl.util_avg decay phase.


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-29 Thread Juri Lelli
On 29/05/18 08:48, Vincent Guittot wrote:
> On 29 May 2018 at 08:31, Juri Lelli  wrote:
> > On 28/05/18 22:08, Joel Fernandes wrote:
> >> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
> >> [..]
> >> > > +
> >> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> >> > > +
> >> > > + return util;
> >> >
> >> > Anyway, just a quick thought. I guess we should experiment with this a
>> > bit. Now, I unfortunately don't have an Arm platform at hand for testing.
> >> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> >> > this change?
> >> >
> >> > Oh, adding Joel and Alessio as well that experimented with DEADLINE
> >> > lately.
> >>
> >> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
>> beyond what the running bandwidth is, or at least do that only if CFS tasks
> >> are running and being preempted as you/Vincent mentioned in one of the
> >> threads.
> >
> > It's however a bit awkward that we might be running at a higher OPP when
> > CFS tasks are running (even though they are of less priority). :/
> 
> Even if CFS tasks have lower priority, that doesn't mean we should
> not take their needs into account.
> In the same way, we run at max OPP as soon as an RT task is runnable

Sure. What I fear is a little CFS utilization generating spikes because
dl.util_avg became big when running DL tasks. Not sure that can happen
though because such DL tasks should be throttled anyway.


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-29 Thread Quentin Perret
Hi Vincent,

On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class
> utilization, we can use the max of the 2 values when aggregating the
> utilization of the CPU.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/sched.h | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4526ba6..0eb07a8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, 
> unsigned int flags) {}
>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>  static inline unsigned long cpu_util_dl(struct rq *rq)
>  {
> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
> BW_SHIFT;
> +
> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

Would it make sense to use a UTIL_EST version of that signal here ? I
don't think that would make sense for the RT class with your patch-set
since you only really use the blocked part of the signal for RT IIUC,
but would that work for DL ?
> +
> + return util;
>  }
>  
>  static inline unsigned long cpu_util_cfs(struct rq *rq)
> -- 
> 2.7.4
> 


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-29 Thread Vincent Guittot
On 29 May 2018 at 08:31, Juri Lelli  wrote:
> On 28/05/18 22:08, Joel Fernandes wrote:
>> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
>> [..]
>> > > +
>> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
>> > > +
>> > > + return util;
>> >
>> > Anyway, just a quick thought. I guess we should experiment with this a
>> > bit. Now, I unfortunately don't have an Arm platform at hand for testing.
>> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
>> > this change?
>> >
>> > Oh, adding Joel and Alessio as well that experimented with DEADLINE
>> > lately.
>>
>> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
>> beyond what the running bandwidth is, or at least do that only if CFS tasks
>> are running and being preempted as you/Vincent mentioned in one of the
>> threads.
>
> It's however a bit awkward that we might be running at a higher OPP when
> CFS tasks are running (even though they are of less priority). :/

Even if CFS tasks have lower priority, that doesn't mean we should
not take their needs into account.
In the same way, we run at max OPP as soon as an RT task is runnable

>
>> With our DL experiments, I didn't measure power but got it to a point where
>> the OPP is scaling correctly based on DL parameters. I think Alessio did
>> measure power at his setup but I can't recall now. Alessio?
>
> I see. Thanks.


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-29 Thread Juri Lelli
On 28/05/18 22:08, Joel Fernandes wrote:
> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
> [..] 
> > > +
> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > > +
> > > + return util;
> > 
> > Anyway, just a quick thought. I guess we should experiment with this a
> > bit. Now, I unfortunately don't have an Arm platform at hand for testing.
> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> > this change?
> > 
> > Oh, adding Joel and Alessio as well that experimented with DEADLINE
> > lately.
> 
> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
> beyond what the running bandwidth is, or at least do that only if CFS tasks
> are running and being preempted as you/Vincent mentioned in one of the
> threads.

It's however a bit awkward that we might be running at a higher OPP when
CFS tasks are running (even though they are of less priority). :/

> With our DL experiments, I didn't measure power but got it to a point where
> the OPP is scaling correctly based on DL parameters. I think Alessio did
> measure power at his setup but I can't recall now. Alessio?

I see. Thanks.


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-28 Thread Joel Fernandes
On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
[..] 
> > +
> > +   util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > +
> > +   return util;
> 
> Anyway, just a quick thought. I guess we should experiment with this a
> > bit. Now, I unfortunately don't have an Arm platform at hand for testing.
> Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> this change?
> 
> Oh, adding Joel and Alessio as well that experimented with DEADLINE
> lately.

I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
beyond what the running bandwidth is, or at least do that only if CFS tasks
are running and being preempted as you/Vincent mentioned in one of the
threads.

With our DL experiments, I didn't measure power but got it to a point where
the OPP is scaling correctly based on DL parameters. I think Alessio did
measure power at his setup but I can't recall now. Alessio?

thanks,

 - Joel



Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-28 Thread Vincent Guittot
On 28 May 2018 at 17:22, Juri Lelli  wrote:
> On 28/05/18 16:57, Vincent Guittot wrote:
>> Hi Juri,
>>
>> On 28 May 2018 at 12:12, Juri Lelli  wrote:
>> > Hi Vincent,
>> >
>> > On 25/05/18 15:12, Vincent Guittot wrote:
>> >> Now that we have both the dl class bandwidth requirement and the dl class
>> >> utilization, we can use the max of the 2 values when aggregating the
>> >> utilization of the CPU.
>> >>
>> >> Signed-off-by: Vincent Guittot 
>> >> ---
>> >>  kernel/sched/sched.h | 6 +-
>> >>  1 file changed, 5 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> index 4526ba6..0eb07a8 100644
>> >> --- a/kernel/sched/sched.h
>> >> +++ b/kernel/sched/sched.h
>> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq 
>> >> *rq, unsigned int flags) {}
>> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> >>  static inline unsigned long cpu_util_dl(struct rq *rq)
>> >>  {
>> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
>> >> BW_SHIFT;
>> >
>> > I'd be tempted to say that we actually want to cap to this one above
>> > instead of using the max (as you are proposing below) or the
>> > (theoretical) power reduction benefits of using DEADLINE for certain
>> > tasks might vanish.
>>
>> The problem that I'm facing is that the sched_entity bandwidth is
>> removed after the 0-lag time and the rq->dl.running_bw goes back to
>> zero but if the DL task has preempted a CFS task, the utilization of
>> the CFS task will be lower than reality and schedutil will set a lower
>> OPP whereas the CPU is always running. The example with a RT task
>> described in the cover letter can be run with a DL task and will give
>> similar results.
>> avg_dl.util_avg tracks the utilization of the rq seen by the scheduler
>> whereas rq->dl.running_bw gives the minimum to match DL requirement.
>
> Mmm, I see. Note that I'm only being cautious, what you propose might
> work OK, but it seems to me that we might lose some of the benefits of
> running tasks with DEADLINE if we start selecting frequency as you
> propose even when such tasks are running.

I see your point. Taking into account the number of running CFS tasks to
choose between rq->dl.running_bw and avg_dl.util_avg could help

>
> An idea might be to copy running_bw util into dl.util_avg when a DL task
> goes to sleep, and then decay the latter as for RT contribution. What
> you think?

Not sure that this will work, because you will overwrite the value each
time a DL task goes to sleep, and the decay will mainly happen on the
update when the last DL task goes to sleep, which might not reflect what
has been used by DL tasks but only the requirement of the last running
DL task. The other interest of PELT is to have a utilization
tracking which uses the same curve as CFS, so the increase of
cfs_rq->avg.util_avg and the decrease of avg_dl.util_avg will
compensate each other (or the opposite)


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-28 Thread Juri Lelli
Hi Vincent,

On 25/05/18 15:12, Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class
> utilization, we can use the max of the 2 values when aggregating the
> utilization of the CPU.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/sched.h | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4526ba6..0eb07a8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, 
> unsigned int flags) {}
>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>  static inline unsigned long cpu_util_dl(struct rq *rq)
>  {
> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
> BW_SHIFT;

I'd be tempted to say that we actually want to cap to this one above
instead of using the max (as you are proposing below) or the
(theoretical) power reduction benefits of using DEADLINE for certain
tasks might vanish.

> +
> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> +
> + return util;

Anyway, just a quick thought. I guess we should experiment with this a
bit. Now, I unfortunately don't have an Arm platform at hand for testing.
Claudio, Luca (now Cc-ed), would you be able to fire some tests with
this change?

Oh, adding Joel and Alessio as well that experimented with DEADLINE
lately.

Thanks,

- Juri


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-28 Thread Juri Lelli
On 28/05/18 16:57, Vincent Guittot wrote:
> Hi Juri,
> 
> On 28 May 2018 at 12:12, Juri Lelli  wrote:
> > Hi Vincent,
> >
> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> Now that we have both the dl class bandwidth requirement and the dl class
> >> utilization, we can use the max of the 2 values when aggregating the
> >> utilization of the CPU.
> >>
> >> Signed-off-by: Vincent Guittot 
> >> ---
> >>  kernel/sched/sched.h | 6 +-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 4526ba6..0eb07a8 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq 
> >> *rq, unsigned int flags) {}
> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >>  static inline unsigned long cpu_util_dl(struct rq *rq)
> >>  {
> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
> >> BW_SHIFT;
> >
> > I'd be tempted to say that we actually want to cap to this one above
> > instead of using the max (as you are proposing below) or the
> > (theoretical) power reduction benefits of using DEADLINE for certain
> > tasks might vanish.
> 
> The problem that I'm facing is that the sched_entity bandwidth is
> removed after the 0-lag time and the rq->dl.running_bw goes back to
> zero but if the DL task has preempted a CFS task, the utilization of
> the CFS task will be lower than reality and schedutil will set a lower
> OPP whereas the CPU is always running. The example with a RT task
> described in the cover letter can be run with a DL task and will give
> similar results.
> avg_dl.util_avg tracks the utilization of the rq seen by the scheduler
> whereas rq->dl.running_bw gives the minimum to match DL requirement.

Mmm, I see. Note that I'm only being cautious, what you propose might
work OK, but it seems to me that we might lose some of the benefits of
running tasks with DEADLINE if we start selecting frequency as you
propose even when such tasks are running.

An idea might be to copy running_bw util into dl.util_avg when a DL task
goes to sleep, and then decay the latter as for RT contribution. What
you think?


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-28 Thread Juri Lelli
On 28/05/18 16:57, Vincent Guittot wrote:
> Hi Juri,
> 
> On 28 May 2018 at 12:12, Juri Lelli  wrote:
> > Hi Vincent,
> >
> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> Now that we have both the dl class bandwidth requirement and the dl class
> >> utilization, we can use the max of the 2 values when agregating the
> >> utilization of the CPU.
> >>
> >> Signed-off-by: Vincent Guittot 
> >> ---
> >>  kernel/sched/sched.h | 6 +-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 4526ba6..0eb07a8 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq 
> >> *rq, unsigned int flags) {}
> >>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >>  static inline unsigned long cpu_util_dl(struct rq *rq)
> >>  {
> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
> >> BW_SHIFT;
> >
> > I'd be tempted to say the we actually want to cap to this one above
> > instead of using the max (as you are proposing below) or the
> > (theoretical) power reduction benefits of using DEADLINE for certain
> > tasks might vanish.
> 
> The problem that I'm facing is that the sched_entity bandwidth is
> removed after the 0-lag time and rq->dl.running_bw goes back to zero,
> but if the DL task has preempted a CFS task, the utilization of the
> CFS task will be lower than reality and schedutil will set a lower
> OPP even though the CPU is still busy. The example with an RT task
> described in the cover letter can be run with a DL task and will give
> similar results.
> avg_dl.util_avg tracks the utilization of the rq as seen by the
> scheduler, whereas rq->dl.running_bw gives the minimum needed to meet
> the DL requirement.

Mmm, I see. Note that I'm only being cautious, what you propose might
work OK, but it seems to me that we might lose some of the benefits of
running tasks with DEADLINE if we start selecting frequency as you
propose even when such tasks are running.

An idea might be to copy the running_bw utilization into dl.util_avg when a DL
task goes to sleep, and then decay the latter as is done for the RT
contribution. What do you think?


Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-28 Thread Vincent Guittot
Hi Juri,

On 28 May 2018 at 12:12, Juri Lelli  wrote:
> Hi Vincent,
>
> On 25/05/18 15:12, Vincent Guittot wrote:
>> Now that we have both the dl class bandwidth requirement and the dl class
>> utilization, we can use the max of the 2 values when aggregating the
>> utilization of the CPU.
>>
>> Signed-off-by: Vincent Guittot 
>> ---
>>  kernel/sched/sched.h | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 4526ba6..0eb07a8 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>>  static inline unsigned long cpu_util_dl(struct rq *rq)
>>  {
>> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
>> BW_SHIFT;
>
> I'd be tempted to say that we actually want to cap to this one above
> instead of using the max (as you are proposing below) or the
> (theoretical) power reduction benefits of using DEADLINE for certain
> tasks might vanish.

The problem that I'm facing is that the sched_entity bandwidth is
removed after the 0-lag time and rq->dl.running_bw goes back to zero,
but if the DL task has preempted a CFS task, the utilization of the
CFS task will be lower than reality and schedutil will set a lower
OPP even though the CPU is still busy. The example with an RT task
described in the cover letter can be run with a DL task and will give
similar results.
avg_dl.util_avg tracks the utilization of the rq as seen by the
scheduler, whereas rq->dl.running_bw gives the minimum needed to meet
the DL requirement.

>
>> +
>> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
>> +
>> + return util;
>
> Anyway, just a quick thought. I guess we should experiment with this a
> bit. Unfortunately, I don't have an Arm platform at hand for testing.
> Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> this change?
>
> Oh, adding Joel and Alessio as well, who experimented with DEADLINE
> lately.
>
> Thanks,
>
> - Juri


[PATCH v5 05/10] cpufreq/schedutil: get max utilization

2018-05-25 Thread Vincent Guittot
Now that we have both the dl class bandwidth requirement and the dl class
utilization, we can use the max of the 2 values when aggregating the
utilization of the CPU.

Signed-off-by: Vincent Guittot 
---
 kernel/sched/sched.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4526ba6..0eb07a8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
 static inline unsigned long cpu_util_dl(struct rq *rq)
 {
-   return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+   unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> 
BW_SHIFT;
+
+   util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
+
+   return util;
 }
 
 static inline unsigned long cpu_util_cfs(struct rq *rq)
-- 
2.7.4


